Device-to-device memory-transfer offload with P2PDMA
PCI peer-to-peer memory concepts
PCI devices expose memory to the host system in the form of memory regions defined by base address registers (BARs); those regions are mapped into the host's physical address space. Since all regions live in the same address space, PCI DMA operations can use those addresses directly. It is thus possible for a driver to configure a PCI DMA operation to transfer data between the memory zones of two devices while bypassing system memory completely. The advantages are lower load on the system CPU, decreased memory usage, and possibly lower PCI bandwidth usage. In the specific case of the NVMe fabrics target [PDF], the data is transferred from a remote direct memory access (RDMA) network interface to a special memory region, then from there directly to the NVMe drive.
The difficulty is in obtaining the addresses and communicating them to the devices. This has been solved by introducing a new interface, called "p2pmem", that allows drivers to register suitable memory zones, discover zones that are available, allocate from them, and map them to the devices. Conceptually, drivers using P2P memory can play one or more of three roles: provider, client, and orchestrator:
- Providers publish P2P resources (memory regions) to other drivers. In the NVMe fabrics implementation, this is done by the NVMe PCI driver, which exports memory zones of the NVMe devices.
- Clients make use of the resources, setting up DMA transfers from and to them. In the NVMe fabrics implementation there are two clients: the NVMe PCI driver accepts buffers in P2P memory, and the RDMA driver uses it for DMA operations.
- Finally, orchestrators manage flows between providers and clients; in particular, they collect the list of available memory regions and choose the one to use. In this implementation there are also two orchestrators: NVMe PCI again, and the NVMe target that sets up the connection between the RDMA driver and the NVMe PCI device.
Other scenarios are possible with the proposed interface; in particular, the memory region may be located on a third device. In that case two transfers are still required, but system memory is still bypassed.
Driver interfaces
For the provider role, registering device memory as being available for P2P transfers takes place using:
int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size, u64 offset);
The driver specifies the BAR containing the memory and, via the size and offset arguments, which part of it to register. The region will be represented by ZONE_DEVICE page structures associated with the device. When all resources are registered, the driver may publish them, making them available to orchestrators, with:
void pci_p2pmem_publish(struct pci_dev *pdev, bool publish);
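As a rough sketch of the provider role (the probe function, the choice of BAR 4, and the use of the whole BAR are made-up details for illustration; error handling is minimal), a PCI driver might register and publish its memory like this:

#include <linux/pci.h>
#include <linux/pci-p2pdma.h>	/* P2PDMA API from this patch set */

static int example_probe(struct pci_dev *pdev, const struct pci_device_id *id)
{
	int ret;

	ret = pcim_enable_device(pdev);
	if (ret)
		return ret;

	/* Register all of BAR 4 as P2P memory; the region will be backed
	 * by ZONE_DEVICE page structures, as described above. */
	ret = pci_p2pdma_add_resource(pdev, 4, pci_resource_len(pdev, 4), 0);
	if (ret)
		return ret;

	/* Make the registered memory visible to orchestrators. */
	pci_p2pmem_publish(pdev, true);

	return 0;
}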
In the orchestrator role, the driver must build a list of all clients participating in a specific transaction so that a suitable range of P2P memory can be found for all of them. Clients are added to that list with:
int pci_p2pdma_add_client(struct list_head *head, struct device *dev);
The orchestrator can also remove clients with pci_p2pdma_remove_client() and free the list completely with pci_p2pdma_client_list_free():
void pci_p2pdma_remove_client(struct list_head *head, struct device *dev);
void pci_p2pdma_client_list_free(struct list_head *head);
When the list is finished, the orchestrator can locate a suitable memory region available for all client devices with:
struct pci_dev *pci_p2pmem_find(struct list_head *clients);
The choice of provider is determined by its "distance", defined as the number of hops in the PCI tree between it and a client. The distance is zero if the two devices are the same, and four if they are behind the same switch (one hop up to the switch's downstream port, one up to the common upstream port, then one down to the other downstream port and a final hop to the device). The provider closest to all clients will be chosen; if more than one is at the same distance, one of them is chosen at random (to avoid always using the same provider). Adding new clients to the list after locating the provider is possible if they are compatible with it; adding incompatible clients will fail.
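A minimal orchestrator sketch (the function name, the two struct device pointers, and the error handling are illustrative assumptions, not code from the patch set) could build the client list and locate a provider as follows:

#include <linux/list.h>
#include <linux/pci.h>
#include <linux/pci-p2pdma.h>

/* Find a P2P memory provider usable by both of the given client devices.
 * The caller owns the (already initialized) client list and releases it
 * with pci_p2pdma_client_list_free() once the transaction is torn down. */
static struct pci_dev *example_pick_provider(struct list_head *clients,
					     struct device *rdma_dev,
					     struct device *nvme_dev)
{
	if (pci_p2pdma_add_client(clients, rdma_dev) ||
	    pci_p2pdma_add_client(clients, nvme_dev))
		return NULL;

	/* Returns the closest suitable provider (or NULL); the choice is
	 * remembered for the list, so later pci_p2pdma_add_client() calls
	 * are checked against it. */
	return pci_p2pmem_find(clients);
}

The list itself would typically be declared with LIST_HEAD() in the orchestrator's per-transaction state.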
There is a different path for orchestrators that already know which provider to use, or that want to apply different selection criteria. In such a case, the driver should verify that the provider has P2P memory available with:
bool pci_has_p2pmem(struct pci_dev *pdev);
Then it can calculate the cumulative distance from its clients to the memory with:
int pci_p2pdma_distance(struct pci_dev *provider, struct list_head *clients, bool verbose);
When the orchestrator has found the desired provider, it can assign that provider to the client list using:
bool pci_p2pdma_assign_provider(struct pci_dev *provider, struct list_head *clients);
This call returns false if any of the clients are unsupported. After the provider has been selected, the driver can allocate and free memory for DMA transactions from that device using:
void *pci_alloc_p2pmem(struct pci_dev *pdev, size_t size);
void pci_free_p2pmem(struct pci_dev *pdev, void *addr, size_t size);
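Putting the explicit-provider path together with the allocator, a hypothetical helper could look roughly as follows; note that treating a negative return from pci_p2pdma_distance() as "unusable" is my reading of the description above, not something spelled out here:

static void *example_alloc_from(struct pci_dev *provider,
				struct list_head *clients)
{
	if (!pci_has_p2pmem(provider))
		return NULL;

	/* Assumption: a negative cumulative distance means the provider
	 * cannot be used with one of the clients. */
	if (pci_p2pdma_distance(provider, clients, true) < 0)
		return NULL;

	if (!pci_p2pdma_assign_provider(provider, clients))
		return NULL;

	/* One page of P2P memory for the transaction; released later
	 * with pci_free_p2pmem(provider, buf, PAGE_SIZE). */
	return pci_alloc_p2pmem(provider, PAGE_SIZE);
}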
Additional helpers exist for translating addresses and for allocating scatter-gather lists backed by P2P memory:
pci_bus_addr_t pci_p2pmem_virt_to_bus(struct pci_dev *pdev, void *addr);
struct scatterlist *pci_p2pmem_alloc_sgl(struct pci_dev *pdev, unsigned int *nents, u32 length);
void pci_p2pmem_free_sgl(struct pci_dev *pdev, struct scatterlist *sgl);
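For instance (a simplified sketch with an arbitrary 64KB length; the transfer itself is elided), a client could obtain and release a P2P-backed scatterlist this way:

static int example_sgl_transfer(struct pci_dev *provider)
{
	struct scatterlist *sgl;
	unsigned int nents;

	/* Allocate a scatterlist whose segments live in the provider's
	 * published P2P memory. */
	sgl = pci_p2pmem_alloc_sgl(provider, &nents, 65536);
	if (!sgl)
		return -ENOMEM;

	/* ... map the list (see below) and run the transfer ... */

	pci_p2pmem_free_sgl(provider, sgl);
	return 0;
}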
When P2P memory is handed to a device for DMA, the addresses used must be PCI bus addresses. Clients of this memory also need to switch to a dedicated DMA-mapping routine:
int pci_p2pdma_map_sg(struct device *dev, struct scatterlist *sg, int nents, enum dma_data_direction dir);

A driver using P2P memory calls pci_p2pdma_map_sg() instead of dma_map_sg(). This routine is lighter weight: since P2P memory is accessed through bus addresses, it only has to apply the bus offset. To determine which mapping function to use for a given page, drivers can rely on this helper:
bool is_pci_p2pdma_page(const struct page *page);
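A simplified client-side sketch (it assumes the whole scatterlist is either P2P memory or regular memory, and it ignores unmapping and error handling) could dispatch between the two mapping routines like this:

#include <linux/dma-mapping.h>
#include <linux/scatterlist.h>
#include <linux/pci-p2pdma.h>

static int example_map(struct device *dma_dev, struct scatterlist *sgl,
		       int nents, enum dma_data_direction dir)
{
	/* P2P pages are mapped by adjusting the bus offset only;
	 * everything else goes through the regular DMA API. */
	if (is_pci_p2pdma_page(sg_page(sgl)))
		return pci_p2pdma_map_sg(dma_dev, sgl, nents, dir);

	return dma_map_sg(dma_dev, sgl, nents, dir);
}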
Special properties
One of the most important questions the authors faced was determining which hardware configurations can be expected to work for P2P DMA operations. In PCI, each root complex defines its own hierarchy; some root complexes do not support peer-to-peer transfers between different hierarchies, and there is no reliable way to find out whether they do (see the PCI Express specification r4.0, section 1.3.1). The authors have decided to allow the P2P functionality only if all devices involved are behind the same PCI host bridge; otherwise users would be required to understand their PCI topology and all of the devices in their systems. This restriction may be lifted over time.
Even so, the configuration requires user intervention: it is necessary to pass the disable_acs_redir kernel parameter that was introduced in 4.19. It disables certain parts of the PCI access control services (ACS) functionality that might redirect P2P requests; the low-level details were discussed at length earlier in the development of this patch set.
P2P memory has special properties: it is I/O memory without side effects (it does not contain device-control registers), and it is not cache-coherent. Code handling this memory must be prepared for those properties and must avoid passing the memory to code that is not. The iowrite*() and ioread*() accessors are not necessary, since there are no side effects, but if a driver uses a spinlock to protect its accesses, it should call mmiowb() before unlocking. There are currently no checks in the kernel to ensure the correct usage of this memory.
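For example (a hedged sketch based on the description above; the lock, buffer, and command structure are hypothetical), a driver filling a P2P buffer under a spinlock would look roughly like:

	spin_lock(&priv->p2p_lock);

	/* No read or write side effects, so plain accesses are fine and
	 * iowrite*()/ioread*() are not needed. */
	memcpy(priv->p2p_buf, cmd, sizeof(*cmd));

	/* Order the writes to I/O memory before releasing the lock. */
	mmiowb();

	spin_unlock(&priv->p2p_lock);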
Other subsystem changes
Using P2P transfers in the NVMe subsystem required changes in other subsystems, too. The block layer gained an additional flag, QUEUE_FLAG_PCI_P2P, to indicate that a given queue can target P2P memory; a driver submitting a request that uses P2P memory should make sure this flag is set on the target queue. There was some discussion about whether an additional check should be added, but the developers decided against it.
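A driver advertising this capability would set the flag when it configures its queue; a one-line sketch (using the flag name as given above and the stock blk_queue_flag_set() helper, where q is the driver's request queue) might be:

	blk_queue_flag_set(QUEUE_FLAG_PCI_P2P, q);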
The NVMe driver was modified to use the new infrastructure; it also serves as an example of the implementation. The NVMe controller memory buffer (CMB) functionality, which is memory in the NVMe device that can be used to store commands or data, has been changed to use P2P memory. This means that, if P2P memory is not supported, the NVMe CMB functionality won't be available. The authors find that reasonable, since CMB is designed for P2P operations in the first place. Another change is that the request queues can benefit from P2P memory too.
RDMA, which is used by NVMe fabrics, now uses flags to indicate whether it should perform P2P or regular memory allocations. The NVMe fabrics target itself allows the system administrator to enable the use of P2P memory and to specify the memory device via a configuration attribute, which can be either a boolean or a PCI device name. In the first case, any suitable P2P memory will be used; in the second, memory will be allocated only from the specified device.
Current state
The patch set has been under review for months (see this presentation [PDF]), and the authors provide a long list of hardware it has been tested with. The patch set is evolving quickly (it is up to version 8 as of this writing); it seems that it might be merged in the near future.
The patch set enables use cases that were not possible with the mainline kernel before and opens the door to others; P2P could be used with graphics cards, for example. At this stage the support is basic, and numerous modifications and extensions can be expected in the future. One direction will be extending the range of supported configurations; others would be hiding the API behind the generic DMA-mapping operations and using the same optimization with other types of devices.
Index entries for this article
Kernel: Device drivers/Support APIs
Kernel: PCI
GuestArticles: Rybczynska, Marta
Device-to-device memory-transfer offload with P2PDMA
Posted Oct 2, 2018 21:51 UTC (Tue) by sbates (subscriber, #106518) [Link]
Thanks Marta for the excellent article summing up where we are with P2PDMA. I also gave a summary talk of P2PDMA at SNIA's Storage Developer Conference in September. The slides for that talk should be available at this link https://tinyurl.com/y8sazb79 and you might want to update the article to point to these slides as well as the older ones you mention.
Stephen
P2PDMA vs dmabuf?
Posted Oct 3, 2018 8:16 UTC (Wed) by shalem (subscriber, #4062) [Link] (1 responses)
I guess dmabuf is tied to dmaing from/to main memory? So does P2PDMA allow (through e.g. some simple helpers) to use a dmabuf as source/dest of the P2P transfer?
P2PDMA vs dmabuf?
Posted Oct 5, 2018 1:42 UTC (Fri) by sbates (subscriber, #106518) [Link]
As I understand it dmabuf is all about exposing these buffers to userspace. P2PDMA is not quite ready to go that far but as we start looking at userspace interfaces we will definitely look at dmabuf.
Oh and if you want to look at extending P2PDMA to tie into dmabuf we'd be more than happy to review that work!
Cheers
Stephen
PCI devices
Posted Oct 3, 2018 10:17 UTC (Wed) by epa (subscriber, #39769) [Link] (1 responses)
PCI devices
Posted Oct 3, 2018 16:22 UTC (Wed) by mrybczyn (subscriber, #81776) [Link]
Device-to-device memory-transfer offload with P2PDMA
Posted Oct 3, 2018 11:38 UTC (Wed) by dullfire (guest, #111432) [Link] (3 responses)
While I have not read the patch set, it would not make sense to require "all devices involved are behind the same PCI bridge".
I suspect the term "PCI host bridge" was intended (because that would have the effect that the paragraph describes). Furthermore, since in PCIe all devices have their own PCI bridge (devices, not functions; also, as a quick overview, a PCIe switch is made up of a set of PCIe bridges, one for the upstream port... and one for each downstream port), it would effectively be impossible for two PCIe devices to ever use this functionality. Which would render it moot.
Device-to-device memory-transfer offload with P2PDMA
Posted Oct 3, 2018 16:25 UTC (Wed) by mrybczyn (subscriber, #81776) [Link] (2 responses)
Device-to-device memory-transfer offload with P2PDMA
Posted Oct 4, 2018 21:13 UTC (Thu) by jgg (subscriber, #55211) [Link] (1 responses)
Behind the same root port (for PCI-E) is not quite the same thing, ie two functions on the same device cannot do P2P DMA with this patch series if they are plugged directly into the root port.
All that aside, this series does have the requirement that the devices be behind a switch. You can't use it on a GPU and a NVMe drive plugged directly into root ports on your CPU, for instance. This greatly limits the utility, and hopefully will go away eventually when people can figure out how to white list root complexes and BIOSs that support this functionality.
Device-to-device memory-transfer offload with P2PDMA
Posted Dec 7, 2024 11:51 UTC (Sat) by sammythesnake (guest, #17693) [Link]
That seems like a fairly likely use case - passing off some data from one stage of processing to another, so hopefully this restriction is lifted soon. I imagine that's a direction in the developers' sights, though - I'm happy to assume that my negligible level of domain knowledge is outdone by theirs ;-)
A couple of possible factors that might make it less of an urgent need occur to me, though, how likely are these, I wonder...?
1. How common would it be for these related functions to be plugged into the root, rather than sharing a (device internal?) bridge?
2. I imagine such devices might simply share the memory between the stages and not need DMA at all for this kind of stage-to-stage handover...?
Device-to-device memory-transfer offload with P2PDMA
Posted Oct 4, 2018 15:57 UTC (Thu) by willy (subscriber, #9762) [Link] (1 responses)
Device-to-device memory-transfer offload with P2PDMA
Posted Oct 5, 2018 1:40 UTC (Fri) by sbates (subscriber, #106518) [Link]
Logan just submitted v9 today. Perhaps comment on that with your size_t concerns. All input gratefully received ;-).
Stephen
Device-to-device memory-transfer offload with P2PDMA
Posted Oct 5, 2018 1:48 UTC (Fri) by sbates (subscriber, #106518) [Link] (1 responses)
One thing the article did not comment on is the ARCH specific nature of P2PDMA. While the framework is ARCH agnostic we do rely on devm_memremap_pages() which relies on ZONE_DEVICE which *is* ARCH specific (and in turn relies on MEMORY_HOTPLUG). Right now this includes x86_64 but not (for example) aarch64. Interestingly for some, we are looking at adding ARCH_HAS_ZONE_DEVICE for riscv because we see that ARCH as an interesting candidate for P2PDMA.
Of course patches that add support for ZONE_DEVICE to other ARCH would be very cool.
Cheers
Stephen
Device-to-device memory-transfer offload with P2PDMA
Posted Oct 6, 2018 15:23 UTC (Sat) by mrybczyn (subscriber, #81776) [Link]
You're right, there is the dependency on ZONE_DEVICE that I didn't mention as I think it's not going to matter for most potential users. The addition of support for other architectures and future integration with other subsystems (enabling usage with GPUs...) may be a subject for a follow-up.
Cheers
Marta
size requirement for pci_p2pdma_add_resource()?
Posted Oct 1, 2024 3:50 UTC (Tue) by KCLWN (guest, #173781) [Link]
[ 472.762396] ------------[ cut here ]------------
[ 472.762400] Misaligned __add_pages start: 0x600da000 end: 0x600dbeff
[ 472.762409] WARNING: CPU: 30 PID: 199 at mm/memory_hotplug.c:395 __add_pages+0x121/0x140
[ 472.762420] Modules linked in: dre_drv(OE+) qrtr cfg80211 intel_rapl_msr intel_rapl_common amd64_edac edac_mce_amd kvm_amd kvm binfmt_misc irqbypass dax_hmem cxl_acpi rapl cxl_core nls_iso8859_1 ipmi_ssif ast i2c_algo_bit acpi_ipmi i2c_piix4 ccp k10temp ipmi_si ipmi_devintf ipmi_msghandler joydev input_leds mac_hid dm_multipath msr efi_pstore nfnetlink dmi_sysfs ip_tables x_tables autofs4 btrfs blake2b_generic raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 hid_generic rndis_host usbhid cdc_ether usbnet hid mii crct10dif_pclmul crc32_pclmul polyval_clmulni polyval_generic ghash_clmulni_intel sha256_ssse3 sha1_ssse3 i40e nvme nvme_core ahci nvme_auth xhci_pci libahci xhci_pci_renesas aesni_intel crypto_simd cryptd [last unloaded: dre_drv(OE)]
[ 472.762567] CPU: 30 PID: 199 Comm: kworker/30:0 Tainted: G W OE 6.8.0-45-generic #45-Ubuntu
[ 472.762573] Hardware name: Supermicro AS -2025HS-TNR/H13DSH, BIOS 1.6a 03/28/2024
[ 472.762576] Workqueue: events work_for_cpu_fn
[ 472.762584] RIP: 0010:__add_pages+0x121/0x140
[ 472.762591] Code: bc c6 05 aa 6b 5c 01 01 e8 2c e4 f7 fe eb d3 49 8d 4c 24 ff 4c 89 fa 48 c7 c6 70 57 84 bc 48 c7 c7 50 42 e8 bc e8 ef ed ec fe <0f> 0b eb b4 0f b6 f3 48 c7 c7 50 02 84 bd e8 0c e8 6f ff eb b6 66
[ 472.762595] RSP: 0018:ff6b7dd90cedfbc0 EFLAGS: 00010246
[ 472.762600] RAX: 0000000000000000 RBX: 00000000600da000 RCX: 0000000000000000
[ 472.762604] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[ 472.762607] RBP: ff6b7dd90cedfbf0 R08: 0000000000000000 R09: 0000000000000000
[ 472.762609] R10: 0000000000000000 R11: 0000000000000000 R12: 00000000600dbf00
[ 472.762612] R13: ff6b7dd90cedfca0 R14: 0000000000000000 R15: 00000000600da000
[ 472.762615] FS: 0000000000000000(0000) GS:ff3edc4137a00000(0000) knlGS:0000000000000000
[ 472.762619] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 472.762623] CR2: 00007ffcfc44b9c0 CR3: 0000005453a26001 CR4: 0000000000f71ef0
[ 472.762626] PKRU: 55555554
[ 472.762629] Call Trace:
[ 472.762632] <TASK>
[ 472.762638] ? show_regs+0x6d/0x80
[ 472.762645] ? __warn+0x89/0x160
[ 472.762653] ? __add_pages+0x121/0x140
[ 472.762659] ? report_bug+0x17e/0x1b0
[ 472.762668] ? handle_bug+0x51/0xa0
[ 472.762673] ? exc_invalid_op+0x18/0x80
[ 472.762678] ? asm_exc_invalid_op+0x1b/0x20
[ 472.762688] ? __add_pages+0x121/0x140
[ 472.762696] add_pages+0x17/0x70
[ 472.762702] arch_add_memory+0x45/0x60
[ 472.762708] pagemap_range+0x232/0x420
[ 472.762717] memremap_pages+0x10e/0x2a0
[ 472.762722] ? srso_alias_return_thunk+0x5/0xfbef5
[ 472.762730] devm_memremap_pages+0x22/0x70
[ 472.762736] pci_p2pdma_add_resource+0x1c7/0x560
[ 472.762744] ? srso_alias_return_thunk+0x5/0xfbef5
[ 472.762750] ? DRE_dmDevMemAlloc+0x44a/0x580 [dre_drv]
[ 472.762811] DRE_drvProbe+0xc07/0xf30 [dre_drv]
[ 472.762852] local_pci_probe+0x44/0xb0
[ 472.762859] work_for_cpu_fn+0x17/0x30
[ 472.762864] process_one_work+0x16c/0x350
[ 472.762872] worker_thread+0x306/0x440
[ 472.762881] ? __pfx_worker_thread+0x10/0x10
[ 472.762887] kthread+0xef/0x120
[ 472.762893] ? __pfx_kthread+0x10/0x10
[ 472.762899] ret_from_fork+0x44/0x70
[ 472.762904] ? __pfx_kthread+0x10/0x10
[ 472.762910] ret_from_fork_asm+0x1b/0x30
[ 472.762921] </TASK>
[ 472.762924] ---[ end trace 0000000000000000 ]---
[ 472.774397] ------------[ cut here ]------------