High-Priority Recommendations
- To get the maximum benefit from CUDA, focus first on finding ways to parallelize sequential code. (Preface)
- Use the effective bandwidth of your computation as a metric when measuring performance and optimization benefits. (Bandwidth)
- Minimize data transfer between the host and the device, even if it means running some kernels on the device that do not show performance gains when compared with running them on the host CPU. (Data Transfer Between Host and Device)
- Ensure global memory accesses are coalesced whenever possible. (Coalesced Access to Global Memory)
- Minimize the use of global memory. Prefer shared memory access where possible. (Memory Instructions)
- Avoid different execution paths within the same warp. (Branching and Divergence)
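
Two of the recommendations above, coalesced global memory access and effective bandwidth as a metric, can be illustrated together. The following is a minimal sketch, not a definitive benchmark. The names `effective_bandwidth_gbs`, `copy_coalesced`, and `measure_copy_bandwidth` are hypothetical and chosen for illustration only. It assumes effective bandwidth is computed as (bytes read + bytes written) / 10^9, divided by elapsed seconds, with timing taken from CUDA events. The buffers stay on the device for the whole measurement, so no host-device transfer is included in the timed region.

```cuda
#include <cstddef>
#include <cuda_runtime.h>

// Effective bandwidth in GB/s: ((bytes read + bytes written) / 1e9) / seconds.
// elapsed_ms is in milliseconds, as reported by cudaEventElapsedTime.
double effective_bandwidth_gbs(size_t bytes_read, size_t bytes_written,
                               float elapsed_ms) {
    double total_bytes = static_cast<double>(bytes_read) +
                         static_cast<double>(bytes_written);
    return (total_bytes / 1e9) / (elapsed_ms / 1e3);
}

// Thread i touches element i, so consecutive threads in a warp access
// consecutive addresses (a coalesced pattern). The bounds check can only
// diverge in the final, partially filled warp.
__global__ void copy_coalesced(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = in[i];
    }
}

// Hypothetical helper: times one device-to-device copy of n floats and
// returns the effective bandwidth in GB/s, or -1.0 on failure.
double measure_copy_bandwidth(int n) {
    float* d_in = nullptr;
    float* d_out = nullptr;
    size_t bytes = static_cast<size_t>(n) * sizeof(float);

    if (cudaMalloc(&d_in, bytes) != cudaSuccess) return -1.0;
    if (cudaMalloc(&d_out, bytes) != cudaSuccess) {
        cudaFree(d_in);
        return -1.0;
    }
    cudaMemset(d_in, 0, bytes);  // give the input defined contents

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    // Warm-up launch so one-time startup costs are not timed.
    copy_coalesced<<<blocks, threads>>>(d_in, d_out, n);

    cudaEventRecord(start);
    copy_coalesced<<<blocks, threads>>>(d_in, d_out, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);  // wait until the timed kernel has finished

    float ms = 0.0f;
    cudaError_t status = cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_in);
    cudaFree(d_out);

    if (status != cudaSuccess || ms <= 0.0f) return -1.0;
    // The kernel reads n floats and writes n floats.
    return effective_bandwidth_gbs(bytes, bytes, ms);
}
```

Calling `measure_copy_bandwidth(1 << 24)` from a host `main` and comparing the result against the device's theoretical peak bandwidth gives a rough sense of how much headroom remains. A single timed launch is noisy, so averaging several runs would be a reasonable refinement.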