I’m eCo’s Chief Technology Officer.   One of the reasons I do the job that I do is that I’m passionately interested in technology.   I’ve always been particularly interested in parallel computing, since the days of the Transputer, so GPU programming and CUDA tick all the boxes for me.  The code is interesting to write and debug, and getting the best performance means understanding how the hardware behaves at a fairly low level.

In early versions of the CUDA toolkit there was an emulation mode that enabled CUDA code to be developed on machines that lacked an NVIDIA GPU. This was removed in 2010, so since then NVIDIA hardware (or a third-party simulator) has been required.

In a previous post I described how to create a CUDA development environment using AWS. In this post I describe an alternative approach: edit code locally (in my case, on a MacBook Air) and use a Jetson TK1 for build and test.

The Jetson TK1 is a single-board computer with a 192-core Kepler GPU.  It's designed for embedded applications, managing 325 GFLOPS in a 10W power budget, but it is its low cost that makes it a great development platform.

The CUDA toolkit comes with NSight, an Eclipse IDE customised for CUDA development.  Although the Jetson has a graphical desktop, NSight is not supported on it, so we can't run the IDE there.  There is, however, an alternative arrangement that works very well: run NSight locally on another machine and have the build done remotely on the Jetson.  The code can also be run or debugged remotely from the IDE.  This is known as 'Synchronised Project' mode.

In this post, I will describe how to set up a project using a MacBook Air to run NSight and the Jetson for build and debug.


We will need to run NSight locally, so the first step is to download and install the CUDA toolkit on the local machine.  If you do not intend to run code locally, there is no need to install the CUDA driver.  As my MacBook Air lacks the required NVIDIA hardware, there's no choice to be made in my case.

I've assumed that the Jetson has already been configured with the CUDA toolkit. If this is not the case, there are some excellent tutorials elsewhere.  Once installed, add the toolkit's bin directory to the PATH, and its lib directory to the LD_LIBRARY_PATH.
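Assuming the toolkit lives in /usr/local/cuda-6.0 (the location that appears in the build output later in this post), the additions to the Jetson's ~/.bashrc look something like this — adjust the paths to match your installation:

```shell
# Make the CUDA toolkit visible to login shells on the Jetson.
# Paths assume CUDA 6.0 in the default install location.
export PATH=/usr/local/cuda-6.0/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-6.0/lib:$LD_LIBRARY_PATH
```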

NSight will use Git to synchronise the project to the Jetson, so you will need to install the Git client there too (sudo apt-get install git).

With the toolkit installed both on the Jetson and on the local machine, we are ready to launch NSight.

Creating the Project

The next step is to create a new project. Select File | New | CUDA C/C++ Project to create the project.


I've called my project Lab5 – I intend to develop a solution for the Week 5 Lab exercise for the Coursera Heterogeneous Parallel Programming course.

Next we select the details of the hardware. The Jetson TK1 has an NVIDIA Kepler™ GPU, so we select SM30 PTX intermediate code and SM32 GPU binary code:



Note that the IDE is complaining that there are no CUDA-compatible devices available; we are running NSight on a MacBook Air, which has no NVIDIA GPU.

Next, we set up the target system. The default dialog looks like this:


Select ‘Manage…’ to set up the Jetson as a remote target.  You will need to enter the IP address of the Jetson:


Then, select this target from the drop-down to add the remote build target. Select the directory that contains the CUDA toolkit on the Jetson, and the directory where you wish to place the project, also on the Jetson:



Finally, remove the target for Local System – we can’t run locally, as we don’t have the hardware.


Click Finish to complete the creation of the project.

Adding the source code

In the first instance, I’m going to use a template with some CUDA code to make sure that the build is set up and working correctly. Select File | New | Source File and set up the details as shown below:


The editor will display the generated code:


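For reference, the template is a small bit-reversal example: the kernel reverses the bit order within each byte of every word, and main() checks the GPU result against the same computation on the host. The following is a condensed sketch of roughly what the generated file contains (the actual template differs in detail, but the bitreverse kernel referred to later in this post is there):

```cuda
#include <stdio.h>

static const int WORK_SIZE = 256;

/* Reverse the bit order within each byte of 'number'.
   Usable on both host and device, so the CPU can verify the GPU result. */
__host__ __device__ unsigned int bitreverse(unsigned int number) {
    number = ((0xf0f0f0f0 & number) >> 4) | ((0x0f0f0f0f & number) << 4);
    number = ((0xcccccccc & number) >> 2) | ((0x33333333 & number) << 2);
    number = ((0xaaaaaaaa & number) >> 1) | ((0x55555555 & number) << 1);
    return number;
}

/* One thread per element: each thread reverses one word in place. */
__global__ void bitreverse_kernel(unsigned int *data) {
    data[threadIdx.x] = bitreverse(data[threadIdx.x]);
}

int main(void) {
    unsigned int host_data[WORK_SIZE], results[WORK_SIZE];
    for (int i = 0; i < WORK_SIZE; i++)
        host_data[i] = i;

    unsigned int *dev_data;
    cudaMalloc((void **) &dev_data, sizeof(host_data));
    cudaMemcpy(dev_data, host_data, sizeof(host_data), cudaMemcpyHostToDevice);

    bitreverse_kernel<<<1, WORK_SIZE>>>(dev_data);

    cudaMemcpy(results, dev_data, sizeof(results), cudaMemcpyDeviceToHost);
    cudaFree(dev_data);

    /* Compare the GPU output with the host reference implementation. */
    for (int i = 0; i < WORK_SIZE; i++)
        printf("Input value: %u, device output: %u, host output: %u\n",
               host_data[i], results[i], bitreverse(host_data[i]));
    return 0;
}
```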
Now we're ready to build. When we click the build button, the project will be synchronised over to the Jetson using Git.  The commands required to build the project will be issued over SSH, using the connection details we provided when we set up the remote target. The output from the build is shown below:

22:57:27 **** Build of configuration Debug for project Lab5 ****
make all -C /users/wip/cuda/Lab5/Debug
make: Entering directory `/users/wip/cuda/Lab5/Debug'
Building file: ../main.cu
Invoking: NVCC Compiler
/usr/local/cuda-6.0/bin/nvcc -G -g -O0 -ccbin arm-linux-gnueabihf-g++-4.6 -gencode arch=compute_30,code=sm_30 -gencode arch=compute_32,code=sm_32 --target-cpu-architecture ARM -m32 -odir "" -M -o "main.d" "../main.cu"
/usr/local/cuda-6.0/bin/nvcc --compile -G -O0 -g -gencode arch=compute_30,code=compute_30 -gencode arch=compute_32,code=sm_32 --target-cpu-architecture ARM -m32 -ccbin arm-linux-gnueabihf-g++-4.6 -x cu -o "main.o" "../main.cu"
Finished building: ../main.cu
Building target: Lab5
Invoking: NVCC Linker
/usr/local/cuda-6.0/bin/nvcc --cudart static --target-cpu-architecture ARM -m32 -ccbin arm-linux-gnueabihf-g++-4.6 -link -o "Lab5" ./main.o
Finished building target: Lab5
make: Leaving directory `/users/wip/cuda/Lab5/Debug'
> Shell Completed (exit code = 0)

09:57:41 Build Finished (took 13s.426ms)

Note that the build command is /usr/local/cuda-6.0/bin/nvcc, which is the compiler on the Jetson, not on the MacBook.

Debugging the code

Next, we’ll check that we can debug the code.   Select ‘Debug configurations…’ from the debug drop-down to display the following dialog:


Add a new C/C++ Remote Application config. By default the remote connection will be ‘Local’ – select the Jetson configuration from the drop-down:


You will also need to tell the IDE where to find the Remote Executable, using the ‘Browse’ button. Don’t forget that Browse will bring up the remote file system, not the local one.


Click 'Debug' to start the program.  The debugging session starts by using SSH to launch cuda-gdbserver on the remote machine.  NSight can then connect to this and use it to debug the application:

Last login: Mon Feb 08 23:11:31 2015 from
echo $PWD'>'
/bin/sh -c "cd \"/users/wip/cuda/Lab5/Debug\";export LD_LIBRARY_PATH=\"/usr/local/cuda-6.0/lib\":\${LD_LIBRARY_PATH};\"/usr/local/cuda-6.0/bin/cuda-gdbserver\" :2345 \"/users/wip/cuda/Lab5/Debug/Lab5\"";exit
gavin@tegra-ubuntu:~$ echo $PWD'>'
gavin@tegra-ubuntu:~$ /bin/sh -c "cd \"/users/wip/cuda/Lab5/Debug\";export LD_LI BRARY_PATH=\"/usr/local/cuda-6.0/lib\":\${LD_LIBRARY_PATH};\"/usr/local/cuda-6.0 /bin/cuda-gdbserver\" :2345 \"/users/wip/cuda/Lab5/Debug/Lab5\"";exit
Process /users/wip/cuda/Lab5/Debug/Lab5 created; pid = 3309
Listening on port 2345
Remote debugging from host

By default the debugger will break in main() – this can be switched off in the debug configuration.

Set a breakpoint in the bitreverse kernel and press F8 to resume execution. The debugger will stop in the kernel:


Note that the debug window is indicating that we are in thread (0,0,0) of block (0,0,0).

Any output sent to the console will be redirected to the Console window of the IDE.

Remote debugging from host
Input value: 0, device output: 0, host output: 0
Input value: 1, device output: 128, host output: 128
Input value: 2, device output: 64, host output: 64
Input value: 3, device output: 192, host output: 192
Input value: 255, device output: 255, host output: 255
Child exited with status 0
GDBserver exiting

Now, we can simply replace main.cu with the outline code for the lab exercise, and we're ready to start coding.