NVIDIA GPU support

NOTE: Support for NVIDIA GPUs via the CUDA backend is currently experimental and many features may be missing or incomplete.

The experimental CUDA backend provides support for CUDA-capable NVIDIA GPUs under Linux or macOS. The goal of this backend is to provide an open-source alternative to the proprietary NVIDIA OpenCL implementation. This makes use of the NVPTX backend in LLVM and the CUDA driver API.

Building PoCL with CUDA support

1) Install prerequisites

Aside from the usual PoCL dependencies, you will also need the CUDA toolkit. Currently the backend is only regularly tested against CUDA 11.6, but it should be possible to build against other versions.

If you experience build failures regarding missing CUDA headers or libraries, you may need to add the include directory containing cuda.h to your header search path, and/or the library directory containing libcuda.{so,dylib} to your library search path.

The CUDA backend requires LLVM built with the NVPTX backend enabled.

2) Build PoCL

To enable the CUDA backend, add -DENABLE_CUDA=ON to your CMake configuration command line.

Otherwise, build and install PoCL as normal.

3) Run tests

After building PoCL, you can smoke test the CUDA backend by executing the subset of PoCL’s tests that are known to pass on NVIDIA GPUs:

../tools/scripts/run_cuda_tests

4) Configuration

Use POCL_DEVICES=CUDA to select only CUDA devices. If the system has more than one GPU, specify the CUDA device multiple times (e.g. POCL_DEVICES=CUDA,CUDA for two GPUs).

The CUDA backend currently has a runtime dependency on the CUDA toolkit. If you receive errors regarding a failure to load libdevice, you may need to set the POCL_CUDA_TOOLKIT_PATH environment variable to tell PoCL where the CUDA toolkit is installed. Set this variable to the root of the toolkit installation (the directory containing the nvvm directory).

The POCL_CUDA_GPU_ARCH environment variable can be set to override the target GPU architecture (e.g. POCL_CUDA_GPU_ARCH=sm_35), which may be necessary in cases where LLVM doesn’t yet support the architecture.

The POCL_CUDA_DUMP_NVVM environment variable can be set to 1 to dump the LLVM IR that is fed into the NVPTX backend for debugging purposes (requires POCL_DEBUG=1).

The POCL_CUDA_DISABLE_QUEUE_THREADS environment variable indicates usage of background threads for handling command submission. Defaults to 1, set to 0 to enable background threads. Disabled threads can potentially reduce command launch latency, but can cause problems if using user events or sharing a context with a non-CUDA device.

CUDA backend status

(last updated: 2017-06-02)

The CUDA backend currently passes 73 tests from PoCL’s internal testsuite, and is capable of running various real OpenCL codes. Unlike NVIDIA’s proprietary OpenCL implementation, PoCL supports SPIR-V consumption, and so this backend has also been able to run (for example) SYCL codes using Codeplay’s ComputeCpp implementation on NVIDIA GPUs. Since it uses CUDA under-the-hood, this backend also works with all of the NVIDIA CUDA profiling and debugging tools, many of which don’t work with NVIDIA’s own OpenCL implementation.

Conformance status

The Khronos OpenCL 1.2 conformance tests are available here. The following test categories are known to pass on at least one NVIDIA GPU using PoCL’s CUDA backend:

  • allocations

  • api

  • atomics

  • basic

  • commonfns

  • computeinfo

  • contractions

  • events

  • profiling

  • relationals

  • thread_dimensions

  • vec_step

Tested platforms

The CUDA backend has been tested on Linux (CentOS 7.3) with SM_35, SM_52, SM_60, and SM_61 capable NVIDIA GPUs.

The backend is also functional on macOS, with just one additional test failure compared to Linux (test_event_cycle).

Known issues

The following is a non-comprehensive list of known issues in the CUDA backend:

  • image types and samplers are unimplemented

  • printf format support is incomplete

Additionally, there has been little effort to optimize the performance of this backend so far - the current effort is on implementing remaining functionality. Once the core functionality is completed, optimization of the code generation and runtime can begin.

Support

For bug reports and questions, please use PoCL’s GitHub issue tracker. Pull requests and other contributions are also very welcome.

This work has primarily been done by James Price from the University of Bristol’s High Performance Computing Group.