
Peer-to-peer DMA

By Jake Edge
May 16, 2023

LSFMM+BPF

In a plenary session on the first day of the 2023 Linux Storage, Filesystem, Memory-Management and BPF Summit, Stephen Bates led a discussion about peer-to-peer DMA (P2PDMA). The idea is to remove the host system's participation in a transfer of data from one PCIe-connected device to another. The feature was originally aimed at NVMe SSDs so that data could simply be copied directly to and from the storage device without needing to move it to system memory and then from there to somewhere else.

Background

The idea goes back to 2012 or so, Bates said, when he and Logan Gunthorpe (who did "most of the real work") were working on NVMe SSDs, RDMA, and NVMe over fabrics (before it was a standard, he thought). Some customers suggested that being able to DMA directly between devices would be useful. With devices that exposed some memory (which would be called a "controller memory buffer" or CMB today), they got the precursor to P2PDMA working. There are some user-space implementations of the feature, including in SPDK and NVIDIA's GPUDirect Storage, which allows copies directly between NVMe namespaces and GPUs.

[Stephen Bates]

Traditional DMA has some downsides when moving data between two PCIe devices, such as an NVMe SSD and an RDMA network card. The data is first DMAed into system memory from one of the devices, then has to be copied back out of system memory to the other device, which doubles the amount of memory-channel bandwidth required. If user-space applications are also trying to access the RAM on the same physical DIMM as the DMA operation, there can be various quality-of-service problems as well.

P2PDMA avoids those problems, but comes with a number of challenges, he said. The original P2PDMA implementation for Linux was in-kernel-only; there were some hacks that allowed access from user space, but they were never merged into the mainline. More recently, though, the 6.2 kernel has support for user-space access to P2PDMA, at least in some circumstances. P2PDMA is available in the NVMe driver, but only devices that have a CMB can be a DMA source or destination; NVMe devices are also the only ones currently supported as DMA masters.
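
As a rough illustration of what that user-space path looks like, the sketch below mmap()s a p2pmem "allocate" sysfs attribute to get a buffer backed by one NVMe device's CMB, then passes that buffer to an O_DIRECT read on a second NVMe device. The PCI address, device node, and exact sysfs layout are assumptions rather than something Bates showed, so treat it as a sketch of the mechanism, not a recipe.

    /* A minimal sketch (not from the talk): allocate P2P memory from an NVMe
     * device's CMB via the p2pmem sysfs interface and use it as an O_DIRECT
     * read buffer for a second NVMe device.  The PCI address, device node,
     * and sysfs path are assumptions; check your kernel and hardware. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define P2P_ALLOC "/sys/bus/pci/devices/0000:03:00.0/p2pmem/allocate"
    #define NVME_DEV  "/dev/nvme1n1"
    #define LEN       (2UL * 1024 * 1024)

    int main(void)
    {
        int pfd = open(P2P_ALLOC, O_RDWR);
        if (pfd < 0) { perror("open p2pmem"); return 1; }

        /* Mapping the "allocate" attribute is expected to hand back CMB
         * memory on the source device rather than ordinary system RAM. */
        void *buf = mmap(NULL, LEN, PROT_READ | PROT_WRITE, MAP_SHARED, pfd, 0);
        if (buf == MAP_FAILED) { perror("mmap"); return 1; }

        int dfd = open(NVME_DEV, O_RDONLY | O_DIRECT);
        if (dfd < 0) { perror("open nvme"); return 1; }

        /* With O_DIRECT, the block layer can recognize the buffer as P2P
         * memory and set up a device-to-device transfer instead of bouncing
         * the data through system memory. */
        ssize_t n = pread(dfd, buf, LEN, 0);
        printf("read %zd bytes into the CMB-backed buffer\n", n);

        close(dfd);
        munmap(buf, LEN);
        close(pfd);
        return 0;
    }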

Bates is unsure whether Arm64 is a fully supported architecture currently, as there is "some weird problem" that Gunthorpe is working through, but x86 is fully supported. An IOMMU plays a big role for P2PDMA because it needs to translate physical and virtual addresses of various sorts between the different systems; "believe me, DMAing to the wrong place is never a good thing". The IOMMU can also play a safeguard role to ensure that errant DMA operations are not actually performed.

Currently, there is work on allowlists and blocklists for devices that do and do not work correctly, but the situation is generally improving. Perhaps because of the GPUDirect efforts, support for P2PDMA in CPUs and PCIe devices seems to be getting better. He pointed to his p2pmem-test repository for the user-space component that can be used to test the feature in a virtual machine (VM). As far as he knows, no other PCIe drivers beyond the NVMe driver implement P2PDMA, at least so far.

Future

Most NVMe drives are block devices that are accessed via logical block addresses (LBAs), but there are devices with object-storage capabilities as well. There is also a computational storage interface coming soon (which was the topic of the next session) for doing computation (e.g. compression) on data that is present on a device. NVMe namespaces for byte-addressable storage are coming as well; those are not "load-store interfaces", which would be accessible from the CPU via load and store instructions as with RAM, but are instead storage interfaces available at byte granularity. Supporting P2PDMA for the NVMe persistent memory region (PMR), which is load-store accessible and backed by some kind of persistent data (e.g. battery-backed RAM), is a possibility on the horizon, though he has not heard of any NVMe PMR drives in development. PMR devices could perhaps overlap the use cases of CXL, he said.

Better VM and IOMMU support is in the works. PCIe has various mechanisms for handling and caching memory-address translations, which could be used to improve P2PDMA. Adding more features to QEMU (e.g. SR-IOV) is important because it is difficult to debug problems using real hardware. Architecture support is also important; there may still be problems with Arm64 support, but there are other important architectures, like RISC-V, that need to have P2PDMA support added.

CXL had been prominently featured in the previous session, so Bates said he wanted to dig into it a bit. P2PDMA came about in a world where CXL did not exist, but now that it does, he thinks there is an interesting set of use cases for P2PDMA in a CXL world. Electrically and physically, CXL is the same as PCIe, which means that both types of devices can plug into the same bus slots. They are different at the data-link layer, but work has been done on CXL.io, which translates PCIe to CXL.

That means that an NVMe drive that has support for CXL flow-control units (flits) can be plugged into a CXL port and can then be used as a storage device via the NVMe driver on the host. He and a colleague had modeled that using QEMU the previous week, which may be the first time it had ever been done. He believes it worked but more testing is needed.

Prior to CXL 3.0, doing P2PDMA directly between CXL memory and an NVMe SSD was not really possible because of cache-coherency issues. CXL 3.0 added a way for CXL to tell the CPUs that it was about to do DMA for a particular region of physical memory and ask the CPUs to update the CXL memory from their caches. The unordered I/O (UIO) feature added that ability, which can be used to move large chunks of data from or to storage devices at hardware speeds without affecting the CPU or its memory interconnects. Instead of a storage device, an NVMe network device could be used to move data directly out of CXL memory to the network.

Bates said that peer-to-peer transfers of this sort are becoming more and more popular, though many people are not using P2PDMA to accomplish them. That popularity will likely translate to more users of P2PDMA over time, however. At that point, LSFMM+BPF organizer Josef Bacik pointed out that time had expired on the slot, so the memory-management folks needed to head off to their next session, while the storage and filesystem developers continued the discussion.

David Howells asked if Bates had spoken with graphics developers about P2PDMA since it seems like they might be interested in using it to move, say, textures from storage to a GPU. Bates said that he had been focusing on cloud and enterprise kinds of use cases, so he had not contacted graphics developers. The large AI clusters are using peer-to-peer transfers to GPUs, though typically via the GPUDirect mechanism.

The NVMe community has been defining new types of namespaces lately, Bates said. The LBA namespace is currently used 99% of the time, but there are others coming as he had noted earlier. All of those namespace types and command sets can be used over both PCIe and CXL, but they can also be used over fabrics with RDMA or TCP/IP. Something that is not yet in the standard, but he hopes is coming, is providing a way to present an NVMe namespace (or a sub-region of it) as a byte-addressable, load-store region that P2PDMA can then take advantage of.

There was a digression on what was meant by load-store versus DMA for these kinds of operations. Bates said that for accessing data on a device, DMA means that some kind of descriptor is sent to a data-mover that would simply move the data as specified, whereas load-store means that a CPU is involved in doing a series of load and store operations. So there would be a new NVMe command requesting that a region be exposed as a CMB, a PMR, or "something new that we haven't invented yet"; the CPU (or some other device, such as a DMA data-mover) can then do load-store accesses on the region.

One use case that he described would be having an extremely hot (i.e. frequently accessed) huge file on an NVMe drive, but wanting to be able to access it directly with loads and stores. A few simple NVMe commands could prepare this data to be byte-accessible, which could then be mapped into the application's address space using mmap(); it would be like having the file in memory without the possibility of page faults when accessing it.
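
The application side of that idea would look something like the hypothetical sketch below; the device node and the byte-addressable region are invented for illustration, since the NVMe commands to set them up do not exist yet, but the access pattern is just mmap() followed by ordinary loads and stores.

    /* Hypothetical sketch of the access pattern described above; the device
     * node below does not exist and stands in for whatever a future NVMe
     * byte-addressable namespace region might be exposed as. */
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        size_t len = 1UL << 30;        /* a 1GiB "hot" region of the file */
        int fd = open("/dev/nvme0-byte-region", O_RDWR);  /* hypothetical */
        if (fd < 0) { perror("open"); return 1; }

        uint8_t *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }

        /* Plain load-store access: no read()/write() calls, no page cache,
         * and no page faults once the mapping has been established. */
        uint64_t sum = 0;
        for (size_t i = 0; i < len; i += 4096)
            sum += p[i];
        printf("touched %zu bytes, sum of sampled bytes: %llu\n",
               len, (unsigned long long)sum);

        munmap(p, len);
        close(fd);
        return 0;
    }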


Index entries for this article
Kernel: Compute Express Link (CXL)
Kernel: NVMe
Conference: Storage, Filesystem, Memory-Management and BPF Summit/2023



Peer-to-peer DMA

Posted May 16, 2023 15:38 UTC (Tue) by sbates (subscriber, #106518) [Link] (6 responses)

Thanks for the write up! I just wanted to clarify one important point in the first paragraph. "The idea is to remove the host system's participation in a transfer of data" is not technically correct. The host driver is still issuing the DMA requests to the PCIe device(s) and so it is still participating. The main difference now is that this DMA traffic ideally goes directly from one PCIe device to the other without needing to be "bounced" through system memory.

Stephen

Peer-to-peer DMA

Posted May 16, 2023 19:06 UTC (Tue) by MattBBaker (subscriber, #28651) [Link] (5 responses)

Yep, someone has to drive the queues. I keep hoping that we see a return of something like the Cell CPU where there was a high clock core that could be the designated DMA driver.

The biggest problem the larger community will suffer making these neat toys work is going to be that most programming models assume a 'sender/receiver' model where the local worker is either the sender or the receiver, with no provision for the running thread being 'neither'.

Peer-to-peer DMA

Posted May 17, 2023 10:54 UTC (Wed) by farnz (subscriber, #17727) [Link] (4 responses)

Why would you need a high clock core to drive DMA? My understanding is that you'd be getting the NVMe device (or the NIC, or the GPU) to do the actual DMA transfers, and the host's involvement is limited to sending DMA descriptors to the device doing the transfer.

For NVMe to GPU transfers on a modern system with large BARs, the host is effectively putting one descriptor into the queue for each transfer - the GPU exposes all its VRAM to the NVMe device, and an NVMe scatter-gather list can transfer an entire GPU's worth of data as a result of a single SGL programmed into the NVMe queue by the host. One interrupt per several gigabytes of data transfer doesn't need a high clock rate core dedicated to it.

Peer-to-peer DMA

Posted May 17, 2023 20:21 UTC (Wed) by MattBBaker (subscriber, #28651) [Link] (3 responses)

That works fine if your system is just one card that needs to be fed from one source. When you're pushing 6 cards on one system with a bunch of NVMe drives and multiple high-performance NICs, suddenly the speed at which queues can be serviced, both to issue new commands and to drive completions, matters a lot.

Peer-to-peer DMA

Posted May 17, 2023 20:56 UTC (Wed) by farnz (subscriber, #17727) [Link] (2 responses)

NVMe (and most DMA-capable devices are similar in this respect) already has command queues for this sort of purpose - I can put together a long queue (65,536 commands in the case of NVMe, each of which has its own SGL) of commands that do all the data transfers I want to do with that drive, and let it get on with the transfers. And NVMe allows for 65,535 queues per device, so I can have a lot of queued transfers.

Even with a few hundred drives to push, the limiting factor is the processing that has to be done after the device completes the DMA, and the Cell model of a high clock speed, low performance, low latency DMA driving CPU isn't helpful for that.

Peer-to-peer DMA

Posted May 24, 2023 20:54 UTC (Wed) by flussence (guest, #85566) [Link] (1 responses)

While those numbers are impressive-looking (compared to SATA), those are only theoretical upper limits in the protocol. No NVMe drive is going to have 16GB of onboard DRAM just for command buffers.

I'm not entirely sure the correct way to look up actual limits but I can make an educated guess: `nvme id-ctrl /dev/nvme0` gives me "sqes: 0x66, cqes: 0x44" which sounds more realistic.

Peer-to-peer DMA

Posted May 25, 2023 11:10 UTC (Thu) by farnz (subscriber, #17727) [Link]

You're looking at the wrong number there - SQES tells you the smallest and largest sizes permitted for an SQE, and is two 4-bit fields for maximum and minimum, as log2 of the entry size. Thus 0x66 is an entry of exactly 64 bytes. You'd need to look at the MQES field in the CAP register to find out how many queue entries the device supports, and you can split those between as many queues as you need. Practically, one queue per CPU is common.
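
A quick sketch of that decoding, assuming the standard log2 encoding of the SQES and CQES bytes:

    /* Decode the SQES/CQES bytes from Identify Controller: bits 3:0 are the
     * minimum entry size and bits 7:4 the maximum, each as log2 of the size
     * in bytes. */
    #include <stdio.h>

    static void decode(const char *name, unsigned char v)
    {
        printf("%s: min %u bytes, max %u bytes\n",
               name, 1u << (v & 0xf), 1u << (v >> 4));
    }

    int main(void)
    {
        decode("sqes", 0x66);   /* 64-byte submission queue entries */
        decode("cqes", 0x44);   /* 16-byte completion queue entries */
        return 0;
    }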

And if you don't like reasoning from the upper limits, there's a different direction; a high-end NVMe device currently does between 1 and 2 million IOPS. To trigger a single IOP, you need to write a 64-byte queue entry, followed by a write to a doorbell to tell the NVMe device that there's a new entry to look at; note that queues do not have to be on the drive, they can be in host memory. A 64-bit CPU should be able to write at least 8 bytes per clock cycle, and read 8 bytes in the same clock cycle, unless limited by I/O or memory speeds, so 9 clock cycles is enough to trigger a fresh I/O - and because a completion entry is 16 bytes, you can detect completion overlapped with writing the next I/O entry. So, given an NVMe device that's more than an order of magnitude better than today's best, offering 100M IOPS, or more realistically, 50 of today's best devices, you need a 900 MHz CPU to keep the NVMe device saturated regardless of I/O size.
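
A quick check of that arithmetic, using the hypothetical 100M-IOPS device:

    /* Cycles per submission (a 64-byte SQE at 8 bytes per cycle, plus one
     * doorbell write) times the target IOPS rate gives the clock rate needed
     * just to keep the queues fed. */
    #include <stdio.h>

    int main(void)
    {
        double cycles_per_io = 64.0 / 8.0 + 1.0;   /* SQE writes + doorbell */
        double iops = 100e6;                       /* hypothetical device */
        printf("%.0f cycles per I/O -> %.0f MHz to sustain %.0fM IOPS\n",
               cycles_per_io, cycles_per_io * iops / 1e6, iops / 1e6);
        return 0;
    }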

In practice, you're going to be doing larger I/Os if you can, which reduces the needed clock speed further. On the other hand, you're not simply submitting I/O all the time - you're also doing other work, which increases the needed performance (but notably not the clock speed - a 1 GHz CPU doing 5 IPC is the same as a 5 GHz CPU doing 1 IPC for this analysis).

Peer-to-peer DMA

Posted May 17, 2023 13:28 UTC (Wed) by bgoglin (subscriber, #7800) [Link]

The idea goes back to way before 2012. This paper from 2001 implemented the idea between SCSI disks and Myrinet NICs https://ieeexplore.ieee.org/document/923202

Peer-to-peer DMA synchronization

Posted May 18, 2023 3:00 UTC (Thu) by DemiMarie (subscriber, #164188) [Link]

How is synchronization and flow control handled? What happens if e.g. a NIC asks an NVMe device for data that isn’t ready yet?

Peer-to-peer DMA

Posted Jun 13, 2023 7:09 UTC (Tue) by daenzer (subscriber, #7050) [Link]

The amdgpu/amdkfd driver has supported PCIe P2PDMA between AMD GPUs since 6.0.


Copyright © 2023, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds