Sometimes you realise that the hardware you currently have available is not enough to get the job done. To quote Chief Brody in Jaws, “You’re gonna need a bigger boat”.

As a result of a previous blog post, several people have told me they’re interested in using Amazon Web Services (AWS) to gain access to GPU hardware. If you’re interested in experimenting with GPUs but don’t have the right hardware, AWS offers a low-cost way of dipping your toe in the water. Amazon’s GPU instance type, g2.2xlarge (http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using_cluster_computing.html), is currently only $0.70 per hour and sports an NVIDIA GRID GPU with 1536 cores and 4GB of its own RAM.

The AWS environment is also a good choice for developers because the server is headless: the GPU is not being used to render a desktop, which can otherwise be a barrier to debugging (support for debugging on the same GPU that renders the desktop is an experimental feature, available on Linux only). There are many tutorials available to guide you through setting up a GPU instance for running CUDA-dependent code, but none that I’m aware of addresses the developer’s desire to build and debug GPU code in an IDE.

Step 1: Choose the operating system

The first decision we need to make is the choice of operating system.  I’ve elected to run Linux.

In order to avoid compatibility issues between the kernel and the NVIDIA drivers, I strongly recommend selecting an operating system that is listed as being supported by NVIDIA.  Also, the available Amazon Machine Images (AMIs) are stripped-down server builds and do not have a desktop environment in which to run Nsight, the CUDA IDE.  Thus we will want a version of Linux that is straightforward to retro-fit with a desktop environment (this rules out the Amazon Linux AMIs).

Finally, we will need to select an AMI that uses HVM (Hardware-assisted Virtual Machine) virtualisation, as that is a requirement for accessing the GRID GPU. For the purposes of this tutorial, I’ve chosen Ubuntu 12.04.

Step 2: Create the server instance

This step assumes you already have an AWS account and are logged in to the AWS console. It also assumes that you’ve selected a region that offers GPU instances (I am running mine in the EU region). Select ‘Launch Instance’ and choose the AMI (I’m using a community image that is tagged as using HVM):

1_aws_instance_creation_1

Next, we specify the virtual hardware.  Select the GPU instance type:

2_aws_instance_creation_2

The default instance details are fine, so accept those and move on to storage:

3_aws_instance_creation_3

The default image size is 8GB, but we’ll be adding to that significantly by installing the desktop environment and the CUDA toolkit, so we’ll need some elbow room. I’m setting this to 20GB. The next screen sets tags for the instance. Set the ‘Name’ tag to something recognisable and click through to set up the security group:

4_aws_instance_creation_4

We’ll set up two rules for in-bound connections. We’ll need to log in via SSH to set the machine up, and we’ll also want to make a VNC connection to the desktop, so ports 22 and 5901 must be open. I’ve restricted access so that these ports are only reachable from my own IP address. If you don’t have a fixed IP address, you’ll need to edit these rules when your address changes, although in practice that doesn’t happen often.

Finally, click ‘Review and launch’. You’ll be prompted to generate or provide a key pair with which to log in (Linux AMIs do not have passwords). I’ve saved mine locally as ‘workstation.pem’. The instance will start up, and AWS will automatically assign it a temporary public IP address; make a note of it and then connect via SSH:

$ ssh -i workstation.pem ubuntu@<your server’s public IP address>

Note that for my chosen AMI, the default user ID is “ubuntu”, not “ec2-user”.

Step 3: Install the prerequisites for CUDA

As installing the desktop environment takes a long time (about half an hour), I recommend getting CUDA working first. After all, if CUDA doesn’t work, we’re wasting our time.

Before installing the toolkit we need to install its prerequisites. The CUDA installer needs access to a compiler, and we’ll need that ourselves later to build our own code:

$ sudo -s
# apt-get update
# apt-get install build-essential gcc g++ make binutils linux-headers-`uname -r`

We’ll need a device driver to access the GPU hardware. The simplest way of achieving this is to let the CUDA toolkit installer install the driver, as that will ensure that the driver is compatible with the OS (as we selected a supported one) and that it is at a high enough version for the toolkit we’re installing.

Step 4: Install the CUDA toolkit

The CUDA toolkit v6.0 can be downloaded from https://developer.nvidia.com/cuda-downloads. It’s best to download it directly to the instance, as the installer is quite large:

# wget http://developer.download.nvidia.com/compute/cuda/6_0/rel/installers/cuda_6.0.37_linux_64.run

Make the file executable and run it:

# chmod u+x cuda_6.0.37_linux_64.run
# ./cuda_6.0.37_linux_64.run

Accept all the defaults, especially the CUDA samples – we’ll use one of those to verify our installation in a later step. If the installation fails for any reason, a log file will be generated containing diagnostic information. The most likely failure is a missing dependency; make sure you’ve installed the prerequisites correctly.

Step 5: Verify the CUDA installation

In order to verify the CUDA installation, we’ll build and run one of the utilities that is provided as a sample in the toolkit:

$ cd NVIDIA_CUDA-6.0_Samples/1_Utilities/deviceQuery
$ make
$ ./deviceQuery

./deviceQuery Starting...

CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "GRID K520"
CUDA Driver Version / Runtime Version          6.0 / 6.0
CUDA Capability Major/Minor version number:    3.0
Total amount of global memory:                 4096 MBytes (4294770688 bytes)
( 8) Multiprocessors, (192) CUDA Cores/MP:     1536 CUDA Cores
GPU Clock rate:                                797 MHz (0.80 GHz)
Memory Clock rate:                             2500 Mhz
Memory Bus Width:                              256-bit
L2 Cache Size:                                 524288 bytes
Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
Total amount of constant memory:               65536 bytes
Total amount of shared memory per block:       49152 bytes
Total number of registers available per block: 65536
Warp size:                                     32
Maximum number of threads per multiprocessor:  2048
Maximum number of threads per block:           1024
Max dimension size of a thread block (x,y,z):  (1024, 1024, 64)
Max dimension size of a grid size   (x,y,z):   (2147483647, 65535, 65535)
Maximum memory pitch:                          2147483647 bytes
Texture alignment:                             512 bytes
Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
Run time limit on kernels:                     No
Integrated GPU sharing Host Memory:            No
Support host page-locked memory mapping:       Yes
Alignment requirement for Surfaces:            Yes
Device has ECC support:                        Disabled
Device supports Unified Addressing (UVA):      Yes
Device PCI Bus ID / PCI location ID:           0 / 3
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
 
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 6.0, CUDA Runtime Version = 6.0, NumDevs = 1, Device0 = GRID K520
Result = PASS

We can see from this that our application is talking to the CUDA hardware, in this case the GRID K520 provided by the g2.2xlarge instance type.
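If you’d like to verify the installation from your own code, the same information is available through the CUDA runtime API. Here’s a minimal sketch that does a small fraction of what deviceQuery does (the file name device_check.cu is just my own choice):

// device_check.cu - a minimal sketch of the kind of query deviceQuery performs.
// Build and run with: nvcc device_check.cu -o device_check && ./device_check
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int driverVersion = 0, runtimeVersion = 0, deviceCount = 0;

    // First confirm the runtime can see at least one CUDA device.
    if (cudaGetDeviceCount(&deviceCount) != cudaSuccess || deviceCount == 0) {
        fprintf(stderr, "No CUDA-capable device found\n");
        return 1;
    }

    // Report the driver/runtime pairing, as deviceQuery does.
    cudaDriverGetVersion(&driverVersion);
    cudaRuntimeGetVersion(&runtimeVersion);
    printf("CUDA Driver %d.%d / Runtime %d.%d, %d device(s)\n",
           driverVersion / 1000, (driverVersion % 100) / 10,
           runtimeVersion / 1000, (runtimeVersion % 100) / 10, deviceCount);

    // Print the headline properties of each device (the GRID K520 on a g2.2xlarge).
    for (int dev = 0; dev < deviceCount; ++dev) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        printf("Device %d: \"%s\", compute capability %d.%d, %lu MB global memory\n",
               dev, prop.name, prop.major, prop.minor,
               (unsigned long)(prop.totalGlobalMem / (1024 * 1024)));
    }
    return 0;
}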

Now that the CUDA installation is working correctly, it’s worth investing the time in getting the IDE running.  First, we’ll need a desktop.

Step 6: Install the desktop

We currently have a headless server, so we’ll need to add the packages for running a desktop. We’ll be connecting over VNC, so I’ve opted for a lightweight desktop environment: XFCE.

There are apparently faster connection mechanisms than VNC, but I found the added complication of running on AWS made these tricky to configure, whereas VNC is straightforward.

Firstly, we’ll need to install the desktop packages. This may take a while to complete – about 20 minutes or so:

# apt-get install xubuntu-desktop xfce4 -y

Next, we need to get the VNC server package:

# apt-get install vnc4server -y

I’m going to create myself a user ID with which to connect:

# adduser gavin
# su gavin
$ cd ~

Running the VNC server for the first time will write some default configuration files, so we start and stop it once to generate them:

$ vncserver
$ vncserver -kill :1

Next, we edit the VNC configuration to start an XFCE desktop session:

$ vim ~/.vnc/xstartup

Replace the contents of the file with the following:

#!/bin/sh
unset SESSION_MANAGER
unset DBUS_SESSION_BUS_ADDRESS

# Load the user's X resources and set a plain background
[ -r $HOME/.Xresources ] && xrdb $HOME/.Xresources
xsetroot -solid grey
vncconfig -iconic &

# Start the XFCE desktop session
startxfce4 &
Now we’re ready to restart the VNC server and connect to the desktop. I’m going to set the geometry such that it’s the same size as my screen, which is convenient for running it in full-screen mode:

$ vncserver -geometry 1366x768

New 'ip-10-0-0-248:1 (gavin)' desktop is ip-10-0-0-248:1
Starting applications specified in /home/gavin/.vnc/xstartup
Log file is /home/gavin/.vnc/ip-10-0-0-248:1.log

NB: The IP address listed above is on an internal subnet in my setup – it’s not the IP address that we’ll use to connect to the VNC server. The IP address we need is the one listed in the AWS console as the ‘public IP’. You may prefer to assign an elastic IP address to the machine (and maybe assign it a DNS entry) but Amazon charges for elastic IPs that are assigned to stopped instances, and I plan to use my instance intermittently.

In order to connect to the desktop environment you’ll need a VNC client. I’m connecting from a Mac, so I’m using the built-in OS X client. I’ll launch it from the command line so that I can specify the port to use (remember, 5901 is the port for the session we set up):

$ open vnc://<your server's public IP address>:5901

The client will prompt for the session password,  which is the one you specified when you first ran the VNC server.

6_vnc_connect

If you are having difficulty connecting, first check that the VNC server is running, and use lsof or netstat to make sure it is listening on the right port (the port number uses  a base of 5900 by default, and adds 1 for each session, hence ours is on 5901).  If the server is listening on the correct port, check your security rules in the AWS console – the port must be opened in order to allow a connection.

Step 6a: (Optional) Start the VNC server at boot time

You may find it convenient to have the VNC server start at boot time rather than starting it manually. If so, there’s a good set of instructions, including a script to start and stop the server, here: http://www.namhuy.net/3106/install-vnc-server-ubuntu-14-04.html

Although these instructions are labeled for Ubuntu 14.04, they will work fine for 12.04.

Step 7: Running the IDE

Open a terminal window and start Nsight:

$ /usr/local/cuda/bin/nsight &

5_launch_ide

You’ll be prompted to create a workspace, and then the IDE will open.

 

7_ide_ready

Finally, the environment is set up and we’re ready to write a CUDA kernel.

Step 8:  Writing and debugging a kernel

From the File menu, select ‘New | CUDA C/C++ Project’:

8_create_project

Select ‘Empty project’, which will create a project with the default settings for a CUDA project. Accept the defaults and the project is created.

Next, we’ll add a new source file.  For the purposes of this tutorial we’ll need a kernel to use as a demonstration.  Helpfully, when creating a new source file the IDE will provide demo code; we’ll choose the CUDA C Bitreverse application:

9_create_source

Name the source file (I’ve called it hello.cu) and click Finish to generate the code. Then, click Build to generate a binary.
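To give you an idea of what you’ll be debugging, here is a hand-written sketch of a bit-reversal kernel in the same spirit as the generated demo (the code Nsight actually produces differs in its details, so treat this as illustrative only):

// A hand-written sketch of a bit-reversal kernel, similar in spirit to the
// demo code that Nsight generates (the generated code differs in detail).
#include <cstdio>
#include <cuda_runtime.h>

// Each thread reverses the 32 bits of one element of the array.
__global__ void bitreverse(unsigned int *data) {
    unsigned int v = data[threadIdx.x];
    unsigned int r = 0;
    for (int i = 0; i < 32; ++i) {   // a convenient line for a breakpoint
        r = (r << 1) | (v & 1);      // shift the lowest bit of v into r
        v >>= 1;
    }
    data[threadIdx.x] = r;
}

int main() {
    const int N = 8;
    unsigned int h[N];
    unsigned int *d = 0;
    for (int i = 0; i < N; ++i) h[i] = i;

    cudaMalloc((void **)&d, N * sizeof(unsigned int));
    cudaMemcpy(d, h, N * sizeof(unsigned int), cudaMemcpyHostToDevice);
    bitreverse<<<1, N>>>(d);   // launch a single block of N threads
    cudaMemcpy(h, d, N * sizeof(unsigned int), cudaMemcpyDeviceToHost);
    cudaFree(d);

    for (int i = 0; i < N; ++i)
        printf("%d -> 0x%08x\n", i, h[i]);
    return 0;
}

A breakpoint inside the kernel’s loop makes a good first test of the debugger: you can step through one thread’s iterations and watch r accumulate the reversed bits.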

In order to run or debug our kernel, we’ll need a launch configuration:

10_debug_config

Click ‘Debug’ to launch the application.

If this is the first time through, you’ll be asked about switching to the debug perspective; select Yes to set this as the default behaviour.

The debugger will stop at the first line of code. Set a breakpoint in the kernel and press F8 to run the application. The debugger will stop inside the kernel:

11_debug_stopped

Step 9: Terminating the instance

If you’re using AWS to try out CUDA, it’s likely that you’ll only be using the instance you’ve set up for a few hours per week. If that’s the case, remember to stop (but not terminate) the instance. A stopped instance incurs no charges apart from the EBS storage used for its disk image. Although a g2.2xlarge instance is only $0.70 per hour, that soon mounts up over the course of a month if you leave it running. A stopped instance is analogous to being powered off: all data on disk is preserved, but the instance must be booted up before you can use it again. Only terminate the instance if you never want to use it again; on termination the disk image is destroyed, and you would have to build a new instance from scratch next time you need one.

If you are worried about accidentally running up a big bill by leaving an instance running, set up billing alerts. These allow you to set a limit and receive warnings when your spend goes over it (although the bill is not capped: you will still be billed if you don’t stop the instance). Also, get into the habit of checking what you’ve got running; there’s an AWS iPhone/iPad app that makes this more convenient.