ADM-XRC SDK 2.8.1 User Guide (Linux)
© Copyright 2001-2009 Alpha Data


What happens during a DMA transfer?

Every DMA transfer must be set up by the CPU and, when it has finished, must also be torn down by the CPU. Most operating systems attempt to hide the details of this process from the user (and even from drivers), but the setup and tear-down of a DMA transfer can be fairly involved on some platforms. The steps taken by the CPU for a DMA transfer in an idealised operating system are as follows (a code sketch of the sequence follows the list):

  1. Make sure all virtual memory pages of the user-space buffer are memory-resident and locked down (i.e. cannot be swapped out to disk). This is important to ensure that the user-space buffer doesn't "disappear" in the middle of the DMA transfer. In operating systems which do not use virtual memory, this step is a no-op.
  2. Make sure that the now memory-resident and locked-down pages can actually be "seen" by the PCI device. On many platforms, this step is a no-op. However, with 64-bit platforms becoming more common and allowing more than 4GB of physical memory, not all of the memory in a system can be accessed by a PCI device whose addresses are 32 bits long. In such cases, the operating system maintains a pool of "bounce buffers" in a region of memory that is guaranteed to be visible to PCI devices. If a page of memory can't be seen by a PCI device, the operating system uses a bounce buffer for that page of the DMA transfer. If the direction of the DMA transfer is memory-to-PCI, the OS copies the user-space data into bounce buffers at this point.
  3. Some platforms do not automatically maintain cache coherence during DMA transfers*. Data caches are typically flushed at this point, either entirely or selectively for the specific pages of physical memory used in the DMA transfer.
  4. At last, the CPU can program the PCI device with the DMA transfer parameters and kick off the DMA transfer. The thread of execution that kicked off the DMA transfer typically moves on to some other task or goes to sleep.
  5. When the PCI device interrupts the CPU, the CPU may need to make its data caches coherent with memory again. This step is not required on platforms that automatically maintain cache coherency during DMA transfers*.
  6. On platforms that use bounce buffers, the system may need to copy data out of bounce buffers into the user-space buffer, if the direction of the DMA transfer was PCI-to-memory.
  7. The system now unlocks the pages of the user-space buffer, so that they become swappable again. In operating systems which do not use virtual memory, this step is a no-op.

* Cache-coherent DMA can be implemented by having the chipset invalidate the cache lines involved in a DMA transfer as the transfer takes place, via signals that are brought out on the CPU.
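
To make the ordering concrete, the skeleton below writes steps 1 to 7 out in C. Every helper name in it is hypothetical, invented purely for this illustration; none of them corresponds to a real kernel interface or to any part of the ADM-XRC API, and the stubs deliberately do nothing.

    #include <stddef.h>

    /* Hypothetical helpers, stubbed out so that the skeleton compiles.       */
    /* None of these names exists in a real kernel or in the ADM-XRC API.     */
    static void pin_and_lock_pages(void *buf, size_t len)     { (void)buf; (void)len; }
    static void map_or_bounce_pages(void *buf, size_t len, int to_device)
                                          { (void)buf; (void)len; (void)to_device; }
    static void flush_data_caches(void *buf, size_t len)      { (void)buf; (void)len; }
    static void start_dma_and_wait_for_irq(int to_device)     { (void)to_device; }
    static void invalidate_data_caches(void *buf, size_t len) { (void)buf; (void)len; }
    static void copy_back_from_bounce_buffers(void *buf, size_t len)
                                          { (void)buf; (void)len; }
    static void unlock_pages(void *buf, size_t len)           { (void)buf; (void)len; }

    void idealised_dma(void *user_buf, size_t len, int to_device)
    {
        pin_and_lock_pages(user_buf, len);          /* 1: no-op without virtual memory      */
        map_or_bounce_pages(user_buf, len,          /* 2: for memory-to-PCI, data is copied */
                            to_device);             /*    into bounce buffers here          */
        flush_data_caches(user_buf, len);           /* 3: skipped on cache-coherent h/w     */
        start_dma_and_wait_for_irq(to_device);      /* 4: program the device; the thread    */
                                                    /*    sleeps until the interrupt        */
        invalidate_data_caches(user_buf, len);      /* 5: again only on non-coherent h/w    */
        if (!to_device)                             /* 6: PCI-to-memory only                */
            copy_back_from_bounce_buffers(user_buf, len);
        unlock_pages(user_buf, len);                /* 7: pages become swappable again      */
    }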

Note that steps 1 and 7 are not performed by the Alpha Data ADM-XRC driver when the ADMXRC2_DoDMA API function is used. This is because applications typically call ADMXRC2_SetupDMA during initialization, which effectively performs step 1. Similarly, applications typically call ADMXRC2_UnsetupDMA as they wind down, which effectively performs step 7. If you know you will reuse a buffer for several DMA transfers, use of ADMXRC2_DoDMA can remove the nondeterminism and latency associated with steps 1 and 7.
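
The call pattern this implies is sketched below. The ADMXRC2 argument lists are left as "..." inside comments because the exact prototypes are not given in this section; consult the ADMXRC2 API reference for them. Only the shape of the code is intended to be meaningful.

    void example_buffer_lifetime(void)
    {
        /* Initialisation: lock the buffer down once (step 1).               */
        /*     ADMXRC2_SetupDMA(card, buffer, ...);                          */

        /* Steady state: each transfer reuses the locked-down buffer, so     */
        /* only steps 2 to 6 are paid per transfer.                          */
        for (int i = 0; i < 1000; i++) {
            /* ADMXRC2_DoDMA(card, dmadesc, ...);                            */
        }

        /* Shutdown: unlock the buffer once (step 7).                        */
        /*     ADMXRC2_UnsetupDMA(card, ...);                                */
    }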

Even with these potential overheads, DMA transfers are still a far better choice than Direct Slave transfers for bulk data transfer in almost all situations. The following figure illustrates a DMA transfer from host memory to a PCI device, on a fictitious platform with 8GB of memory, requiring the use of bounce buffers:

On this fictitious platform, the first 3GB of memory are accessible to PCI devices. In the figure above, one of the pages of the user buffer falls within the first 3GB of memory, so that page need not be copied before the DMA transfer is kicked off on the PCI device. The other three pages, however, lie above the 3GB boundary and are therefore copied to bounce buffers, which themselves lie below the 3GB boundary. It should be noted that on many platforms, a driver is presented with an abstract kernel-level DMA programming interface and thus has little choice about whether or not bounce buffers are used.
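
The per-page decision made by the operating system in the figure can be illustrated with a short, self-contained program. The page addresses below are invented to match the figure (one page below the 3GB boundary, three above it); nothing else about the program corresponds to real driver code.

    #include <stdint.h>
    #include <stdio.h>

    #define PCI_VISIBLE_LIMIT (3ull * 1024 * 1024 * 1024)   /* 3GB boundary */

    int main(void)
    {
        /* Invented physical addresses for the four pages of the user buffer. */
        uint64_t page_phys[4] = {
            0x00000000A0000000ull,   /* 2.5GB - visible to the device         */
            0x00000000E0000000ull,   /* 3.5GB - needs a bounce buffer         */
            0x0000000120000000ull,   /* 4.5GB - needs a bounce buffer         */
            0x0000000160000000ull    /* 5.5GB - needs a bounce buffer         */
        };

        for (int i = 0; i < 4; i++) {
            if (page_phys[i] < PCI_VISIBLE_LIMIT) {
                printf("page %d at 0x%012llx: DMA directly from this page\n",
                       i, (unsigned long long)page_phys[i]);
            } else {
                /* For a memory-to-PCI transfer the OS would memcpy() this    */
                /* page into a bounce buffer below 3GB before starting DMA.   */
                printf("page %d at 0x%012llx: copy to a bounce buffer first\n",
                       i, (unsigned long long)page_phys[i]);
            }
        }
        return 0;
    }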

A DMA transfer that is large from the point of view of the user application might not be performed as a single transfer; in fact, it may be performed in several chunks by the Alpha Data ADM-XRC driver. The operating system's resources for creating bounce buffers, scatter-gather tables and so on are finite, and thus there is a limit on the size of a "chunk" of DMA transfer. On all supported platforms, the Alpha Data ADM-XRC driver attempts to make this chunk limit at least ~64kB. The driver splits large DMA transfers into chunks and performs each chunk sequentially, which means that there may be a short gap in the data transfer between chunks while the driver sets up the next chunk:

* Steps 1 and 7 are not performed if ADMXRC2_DoDMA is used.

Because of this, applications must not rely on DMA transfers being continuous from start to finish. In any case, there are other latencies besides the inter-chunk gap that can affect DMA transfers, and these arise from both the hardware and the operating system. The inter-chunk gap is merely one of the larger latencies; even if it were not present, the other latencies would remain and thus an application could still fail should it rely upon DMA transfers being continuous.
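
As a rough illustration of the splitting described above, the loop below breaks a large transfer into chunks of at most 64kB. The do_one_chunk function is hypothetical and stands in for whatever per-chunk setup, transfer and completion handling the real driver performs; 64kB is simply the minimum chunk limit quoted above.

    #include <stddef.h>
    #include <stdint.h>

    #define CHUNK_LIMIT (64u * 1024u)   /* minimum chunk limit quoted above */

    /* Hypothetical: set up, perform and complete one chunk of the transfer. */
    static void do_one_chunk(uint64_t offset, size_t length)
    {
        (void)offset;
        (void)length;
        /* A real driver would build the scatter-gather table and/or bounce  */
        /* buffers for this chunk, start the DMA engine, and then wait for   */
        /* its completion interrupt.                                         */
    }

    void do_large_transfer(uint64_t offset, size_t total)
    {
        while (total > 0) {
            size_t chunk = (total > CHUNK_LIMIT) ? CHUNK_LIMIT : total;

            do_one_chunk(offset, chunk);

            /* The next chunk is set up only after this one has completed;   */
            /* the time spent doing so is the inter-chunk gap in the data.    */
            offset += chunk;
            total  -= chunk;
        }
    }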

 

