It is easy and informative to explore the ramifications of misaligned accesses using a simple copy kernel, such as the one in A copy kernel that illustrates misaligned accesses.
__global__ void offsetCopy(float *odata, float* idata, int offset) { int xid = blockIdx.x * blockDim.x + threadIdx.x + offset; odata[xid] = idata[xid]; }
In A copy kernel that illustrates misaligned accesses, data is copied from the input array idata to the output array, both of which exist in global memory. The kernel is executed within a loop in host code that varies the parameter offset from 1 to 32. (Figure 1 and Figure 2 correspond to offsets of 1 and 17, respectively.) The effective bandwidth for the copy with various offsets on an NVIDIA GeForce GTX 280 (with compute capability 1.3) and an NVIDIA GeForce GTX 8800 (compute capability 1.0) are shown in Figure 1.
For the NVIDIA GeForce GTX 8800 device, global memory accesses with no offset or with offsets that are multiples of 16 result in a single transaction per half warp and an effective bandwidth of approximately 74 GBps. Otherwise, 16 transactions are issued per half warp resulting in an effective bandwidth of approximately 7 GBps. This roughly 8x performance degradation is due to the fact that 32 bytes, the minimum transaction size, are fetched for each thread. However, only 4 bytes of data are used for each 32 bytes fetched—resulting in the 4/32=1/8 performance relative to the fully coalesced case. The two numbers also reflect the different data represented by effective bandwidth (4 bytes) versus actual bandwidth (32 bytes).
Because of this possible performance degradation, memory coalescing is the most critical aspect of performance optimization of device memory. For devices of compute capability 1.2 and 1.3, the situation is less dire for misaligned accesses because, in all cases, access by a half warp of threads in this kernel results in either one or two transactions.
On the NVIDIA GeForce GTX 280 device, this results in an effective bandwidth of between 120 GBps for a single transaction and 70 GBps for two transactions per half warp. The number of transactions issued for a half warp of threads depends on the offset and whether the warp is even- or odd-numbered. For offsets of 0 or 16, each half warp results in a single 64-byte transaction (Figure 1). For offsets of 1 through 7 or 9 through 15, even-numbered warps result in a single 128-byte transaction (Figure 1) and odd-numbered warps result in two transactions: one 64-byte and one 32-byte (Figure 2). For offsets of 8, even-numbered warps result in one 128-byte transaction and odd-numbered warps result in two 32-byte transactions. The two 32-byte transactions, rather than a 64- and a 32-byte transaction, are responsible for the blip at the offset of 8 in Figure 1.