NVIDIA CUDA Samples - Technical Overview

High-Level Architecture

[Diagrams omitted: how CUDA programming works; the CUDA thread hierarchy]

Key Concepts

CUDA Programming Model

| Concept | Description |
| --- | --- |
| Kernel | Function that runs on the GPU, launched with <<<grid, block>>> syntax. Marked with the __global__ qualifier. |
| Thread | Smallest execution unit. Each thread has unique IDs (threadIdx.x/y/z) and private registers. |
| Thread Block | Group of threads that can synchronize and share memory. Max 1024 threads per block (Compute Capability 2.0+). |
| Grid | Collection of thread blocks executing the same kernel. Blocks execute independently. |
| Warp | 32 threads executing in SIMT (Single Instruction, Multiple Threads) fashion. The GPU's scheduling unit. |
| Streaming Multiprocessor (SM) | GPU processor that executes thread blocks. Contains CUDA cores, shared memory, and registers. |
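
A minimal sketch tying these concepts together, in the spirit of the vectorAdd sample (illustrative code, not the repository's exact source; it uses cudaMallocManaged to keep the example short):

cuda
#include <cuda_runtime.h>
#include <cstdio>

// Kernel: runs on the GPU, one thread per element.
__global__ void vecAdd(const float *a, const float *b, float *c, int n)
{
    // Global index built from block and thread IDs.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                      // guard: the grid may overshoot n
        c[i] = a[i] + b[i];
}

int main()
{
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *a, *b, *c;

    // Unified memory keeps the sketch short; the real sample uses
    // cudaMalloc plus explicit cudaMemcpy.
    cudaMallocManaged(&a, bytes);
    cudaMallocManaged(&b, bytes);
    cudaMallocManaged(&c, bytes);
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    // Launch: a grid of blocks, 256 threads per block
    // (a multiple of the 32-thread warp).
    int block = 256;
    int grid  = (n + block - 1) / block;
    vecAdd<<<grid, block>>>(a, b, c, n);
    cudaDeviceSynchronize();

    printf("c[0] = %f\n", c[0]);    // expect 3.0
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}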

Memory Hierarchy

Memory Access Performance

| Memory Type | Location | Scope | Latency (approx.) | Bandwidth (approx.) |
| --- | --- | --- | --- | --- |
| Registers | On-chip | Thread | ~1 cycle | Highest |
| Shared Memory | On-chip | Block | ~5 cycles | ~1.5 TB/s |
| L1 Cache | On-chip | SM | ~30 cycles | High |
| L2 Cache | On-chip | Device | ~200 cycles | ~2-3 TB/s |
| Global Memory | Off-chip | Device | ~400-800 cycles | ~400-900 GB/s |
| Constant Memory | Off-chip (cached on-chip) | Device | ~1 cycle on cache hit | Broadcast to a warp |

Figures are rough orders of magnitude and vary considerably by GPU architecture.
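
To see why the on-chip tiers matter, here is a hedged sketch of a block-level sum reduction that stages data in shared memory so each input element is read from global memory only once; it simplifies the pattern the reduction sample demonstrates:

cuda
#include <cuda_runtime.h>

// Each block sums 256 elements in fast on-chip shared memory,
// touching slow off-chip global memory once per element.
// Launch with blockDim.x == 256 to match the tile size.
__global__ void blockSum(const float *in, float *out, int n)
{
    __shared__ float tile[256];          // shared memory: block scope
    int tid = threadIdx.x;
    int i   = blockIdx.x * blockDim.x + tid;

    tile[tid] = (i < n) ? in[i] : 0.0f;  // one coalesced global read
    __syncthreads();                     // whole block waits here

    // Tree reduction entirely in shared memory.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride)
            tile[tid] += tile[tid + stride];
        __syncthreads();
    }
    if (tid == 0)
        out[blockIdx.x] = tile[0];       // one global write per block
}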

CUDA Sample Categories

Sample Category Details

| Category | Purpose | Key Samples |
| --- | --- | --- |
| 0_Introduction | Learn basic CUDA concepts and runtime APIs | vectorAdd, matrixMul, asyncAPI, cudaOpenMP |
| 1_Utilities | Query device capabilities and benchmark | deviceQuery, bandwidthTest, topologyQuery |
| 2_Concepts_and_Techniques | Common parallel programming patterns | reduction, scan, histogram, sorting |
| 3_CUDA_Features | Advanced CUDA capabilities | cooperativeGroups, cudaGraphs, dynamicParallelism |
| 4_CUDA_Libraries | Using GPU-accelerated libraries | cuBLAS, cuFFT, cuSPARSE, NPP, cuRAND |
| 5_Domain_Specific | Real-world application examples | nbody, fluidsGL, MonteCarloGPU, imageProcessing |
| 6_libNVVM | NVVM IR compilation and JIT | simple, ptxgen |
| 7_Platform_Specific | Platform-specific features | Tegra, cuDLA, NvMedia, NvSci |

Technical Details

Building CUDA Samples

bash
# Clone the repository
git clone https://github.com/NVIDIA/cuda-samples.git
cd cuda-samples

# Linux build
mkdir build && cd build
cmake ..
make -j$(nproc)

# Windows build (Visual Studio 2019+)
cmake .. -G "Visual Studio 16 2019" -A x64
cmake --build . --config Release

# Optional: Enable GPU debugging
cmake .. -DENABLE_CUDA_DEBUG=True

# Optional: Build Tegra samples
cmake .. -DBUILD_TEGRA=True

Requirements

  • CUDA Toolkit: 11.0+ (latest: 13.1)
  • CMake: 3.20 or later
  • Compiler: GCC 7+, Clang 8+, or MSVC 2019+
  • GPU: NVIDIA GPU with Compute Capability 5.0+
  • Driver: Compatible with CUDA Toolkit version

Compute Capability Matrix

| Architecture | Compute Capability | Key Features |
| --- | --- | --- |
| Maxwell | 5.0-5.3 | Dynamic Parallelism (feature-complete in CUDA 13) |
| Pascal | 6.0-6.2 | Unified Memory, NVLink (feature-complete in CUDA 13) |
| Volta | 7.0 | Tensor Cores, Independent Thread Scheduling (feature-complete in CUDA 13) |
| Turing | 7.5 | RT Cores, INT8 Tensor Cores |
| Ampere | 8.0-8.6 | 3rd Gen Tensor Cores, Sparsity |
| Ada Lovelace | 8.9 | 4th Gen Tensor Cores, FP8 |
| Hopper | 9.0 | Transformer Engine, Thread Block Clusters |
| Blackwell | 10.0 | 5th Gen Tensor Cores, FP4 |
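
A minimal sketch of how deviceQuery-style code maps an installed GPU onto this matrix, using the runtime API:

cuda
#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int d = 0; d < count; ++d) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, d);
        // prop.major/prop.minor give the compute capability,
        // e.g. 8.9 for Ada Lovelace.
        printf("GPU %d: %s, compute capability %d.%d, %d SMs\n",
               d, prop.name, prop.major, prop.minor,
               prop.multiProcessorCount);
    }
    return 0;
}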

CUDA Ecosystem

Key Facts (2025)

  • Repository Stats: 8.7k+ stars, 2.2k+ forks on GitHub
  • Current Version: CUDA Toolkit 13.1 (samples updated accordingly)
  • Sample Count: 100+ samples across 8 categories
  • Build System: CMake 3.20+ (migrated from Makefiles)
  • Platform Support: Linux, Windows, Tegra, QNX, DriveOS
  • Architecture Deprecation: Maxwell, Pascal, Volta are feature-complete in CUDA 13 (no new features)
  • Multi-Device Changes: Multi-device cooperative groups removed in CUDA 13
  • Performance: GPU-accelerated applications can run orders of magnitude faster than CPU-only implementations on highly parallel workloads (commonly cited speedups range from 50x to 400x, depending on the workload)
  • Market Share: CUDA is the dominant platform for GPU computing, with an estimated ~95% share of AI/ML workloads

Use Cases

Learning & Education

  • Beginner tutorials: vectorAdd, matrixMul introduce parallel thinking
  • Performance optimization: bandwidthTest, reduction teach optimization strategies
  • Memory management: Understanding global, shared, and constant memory

Development & Testing

  • GPU validation: deviceQuery confirms driver and hardware setup
  • Performance benchmarking: bandwidthTest measures actual vs. theoretical bandwidth (see the timing sketch after this list)
  • Feature exploration: Test new CUDA features before production use
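
A hedged sketch of the event-timing pattern behind such a bandwidth measurement (simplified relative to the real bandwidthTest, which also covers pinned host memory and device-to-device copies):

cuda
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

int main()
{
    const size_t bytes = 256 << 20;            // 256 MiB transfer
    void *h = malloc(bytes), *d = nullptr;
    cudaMalloc(&d, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Bracket a host-to-device copy with GPU timestamp events.
    cudaEventRecord(start);
    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);    // milliseconds
    // Pageable host memory; pinned memory would measure higher.
    printf("H2D: %.2f GB/s\n", (bytes / 1e9) / (ms / 1e3));

    cudaEventDestroy(start); cudaEventDestroy(stop);
    cudaFree(d); free(h);
    return 0;
}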

Application Development

| Domain | Sample | Description |
| --- | --- | --- |
| Deep Learning | cudaTensorCoreGemm | Matrix multiplication using Tensor Cores |
| Physics | nbody | N-body gravitational simulation |
| Fluid Dynamics | fluidsGL | Real-time fluid simulation with OpenGL |
| Finance | MonteCarloGPU | Option pricing with Monte Carlo methods |
| Image Processing | imageProcessingNPP | GPU-accelerated image filters |
| Signal Processing | convolutionFFT2D | 2D convolution using cuFFT |
| Graphics | simpleVulkan | Vulkan-CUDA interoperability |

Production Patterns

  • Cooperative Groups: Flexible thread synchronization patterns
  • CUDA Graphs: Reduce launch overhead for repetitive workflows (see the capture sketch after this list)
  • Unified Memory: Simplified memory management across CPU/GPU
  • Dynamic Parallelism: GPU-side kernel launches for adaptive algorithms
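
As an illustration of the CUDA Graphs point above, a minimal stream-capture sketch; someKernel and runWithGraph are illustrative names, not code from the samples:

cuda
#include <cuda_runtime.h>

__global__ void someKernel(float *x, int n)   // stand-in workload
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += 1.0f;
}

void runWithGraph(float *d_x, int n, int iterations)
{
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Capture the launches into a graph instead of executing them.
    cudaGraph_t graph;
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    for (int k = 0; k < 10; ++k)
        someKernel<<<(n + 255) / 256, 256, 0, stream>>>(d_x, n);
    cudaStreamEndCapture(stream, &graph);

    // Instantiate once, then replay cheaply many times
    // (CUDA 12+ signature).
    cudaGraphExec_t exec;
    cudaGraphInstantiate(&exec, graph, 0);
    for (int it = 0; it < iterations; ++it)
        cudaGraphLaunch(exec, stream);
    cudaStreamSynchronize(stream);

    cudaGraphExecDestroy(exec);
    cudaGraphDestroy(graph);
    cudaStreamDestroy(stream);
}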

Security & Practical Considerations

Hardware Requirements

  • NVIDIA GPU required (no AMD/Intel support)
  • Driver compatibility with CUDA Toolkit version is critical
  • Older architectures may lose support in future toolkit versions

Development Considerations

  • Memory management: Manual allocation/deallocation can lead to leaks
  • Race conditions: Improper synchronization causes data corruption
  • Bank conflicts: Shared memory access patterns affect performance
  • Occupancy: Thread block configuration impacts GPU utilization
  • Warp divergence: Conditional branches within warps reduce efficiency

Best Practices

  • Use Compute Sanitizer (compute-sanitizer, the replacement for the deprecated cuda-memcheck) for memory and race debugging
  • Profile with NVIDIA Nsight Systems and Nsight Compute
  • Test on multiple GPU architectures for compatibility
  • Monitor GPU memory usage to prevent out-of-memory errors
  • Check the return code of every CUDA API call and use cudaGetLastError() after kernel launches in production (see the macro sketch below)
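
A conventional sketch of such error checking; CUDA_CHECK is a common idiom, not a toolkit-provided macro:

cuda
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

// Wrap every runtime API call; abort with file/line on failure.
#define CUDA_CHECK(call)                                            \
    do {                                                            \
        cudaError_t err = (call);                                   \
        if (err != cudaSuccess) {                                   \
            fprintf(stderr, "CUDA error %s at %s:%d\n",             \
                    cudaGetErrorString(err), __FILE__, __LINE__);   \
            exit(EXIT_FAILURE);                                     \
        }                                                           \
    } while (0)

// Usage:
//   CUDA_CHECK(cudaMalloc(&ptr, bytes));
//   myKernel<<<grid, block>>>(ptr);
//   CUDA_CHECK(cudaGetLastError());        // catches launch errors
//   CUDA_CHECK(cudaDeviceSynchronize());   // catches async errors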
