As the laws of physics force chip makers to turn away from increasing clock frequency and focus instead on increasing core count, the cheap and simple performance gains that followed from Moore's law alone are no longer possible. The resulting rapid and broad hardware development has produced a large set of diverse parallel architectures, such as many-core CPUs, the Cell Broadband Engine, Tilera TILE64, Larrabee, the Sun UltraSPARC Niagara [1] and, more recently, GPUs. Despite this rapid development in parallel computing architectures, software that leverages their computing power remains largely missing, owing to the great and costly difficulties of writing parallel code and applications…
Contents
1 Terminology
2 Introduction
3 The potential of the GPU as a heavy duty co-processor
4 GPGPU: A concise history
5 CUDA: History and future
6 An introduction to CUDA enabled hardware
7 Nvidia GT200 family GPU hardware
7.1 A brief introduction to the Warp
7.2 The SM's computational units
7.3 CUDA memory architecture
7.4 Implementation details
7.5 Precision and lack of IEEE-compliance
8 Nvidia Quadro FX 4800 graphics board
9 CUDA Programming
9.1 A programmer's perspective
9.2 Programming environment
9.3 CUDA-suitable computations
9.4 On Arithmetic Intensity: AI
9.5 Grids, blocks and threads; the coarse- and fine-grained data-parallel structural elements of CUDA
9.6 Hardware limitations
9.7 Synchronization and execution order
9.8 CUDA portability between the GT200 cards
10 CUDA Optimizing
10.1 High priority optimizations
10.2 Medium priority optimizations
10.3 Low priority optimizations
10.4 Advanced optimization strategies
10.5 Optimization conclusions
11 OpenCL
12 Experience of programming with CUDA and OpenCL
12.1 CUDA
12.2 OpenCL
12.3 Portability between CUDA and OpenCL
13 Fermi and the future of CUDA
14 Overview of benchmarks
14.1 High Performance Embedded Computing: HPEC
14.2 Other radar related benchmarks
14.3 Time-domain Finite Impulse Response: TDFIR
14.4 Frequency-domain Finite Impulse Response: FDFIR
14.5 QR-decomposition
14.6 Singular Value Decomposition: SVD
14.7 Constant False-Alarm Rate: CFAR
14.8 Corner Turn: CT
14.9 INT-Bi-C: Bi-cubic interpolation
14.10 INT-C: Cubic interpolation through Neville's algorithm
14.11 Synthetic Aperture Radar (SAR) inspired tilted matrix additions
14.12 Space-Time Adaptive Processing: STAP
14.13 FFT
14.14 Picture Correlation
14.15 Hardware and software used for the benchmarks
14.16 Benchmark timing definitions
15 Benchmark results
15.1 TDFIR
15.2 TDFIR with OpenCL
15.3 FDFIR
15.4 QR-decomposition
15.5 Singular Value Decomposition: SVD
15.6 Constant False-Alarm Rate: CFAR
15.7 Corner turn
15.8 INT-Bi-C: Bi-cubic interpolation
15.9 INT-C: Neville's algorithm
15.10 INT-C: Neville's algorithm with OpenCL
15.11 Synthetic Aperture Radar (SAR) inspired tilted matrix additions
15.12 Space-Time Adaptive Processing: STAP
15.13 FFT
15.14 Picture correlation
15.15 Benchmark conclusions and summary
16 Feasibility for radar signal processing
17 Conclusion
17.1 Performance with the FX 4800 graphics board
17.2 Programming in CUDA and OpenCL
17.3 Feasibility for radar signal processing
18 References
Author: Persson, Staffan
Source: Uppsala University Library