
LWN.net Weekly Edition for May 29, 2025

Welcome to the LWN.net Weekly Edition for May 29, 2025

This edition contains the following feature content:

  • Glibc project revisits infrastructure security: a draft secure-development policy reopens the debate over the project's hosting and security practices.
  • Cory Doctorow on how we lost the internet: a PyCon US keynote on "enshittification" and how to reverse it.
  • System-wide encrypted DNS: integrating DNS-over-TLS into Linux installation, boot, and identity management.
  • Development statistics for the 6.15 kernel: where the changes in this busy release came from.
  • Long-duration stress-testing for filesystems: an LSFMM+BPF session on catching filesystem bugs before they reach production.

This week's edition also includes these inner pages:

  • Brief items: Brief news items from throughout the community.
  • Announcements: Newsletters, conferences, security updates, patches, and more.

Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.

Comments (none posted)

Glibc project revisits infrastructure security

By Joe Brockmeier
May 28, 2025

The GNU C Library (glibc) is the core C library for most Linux distributions, so it is a crucial part of the open-source ecosystem—and an attractive target for any attackers looking to carry out supply-chain attacks. With that being the case, securing the project's infrastructure using industry best practices and improving the security of its development practices are a frequent topic among glibc developers. A recent discussion suggests that improvements are not happening as quickly as some would like.

On May 9, glibc maintainer Carlos O'Donell wrote to the libc-alpha mailing list to ask other glibc developers to review a secure software development life-cycle process document that he had drafted for glibc. He also provided a similar top-level document for the GNU toolchain that includes GNU Binutils, GCC, glibc, and the GNU Project Debugger (GDB). The goal is to define "what we expect from the infrastructure, developer end points, and our process" in order to figure out what services are needed to create a more secure development process.

The glibc project is hosted on Sourceware, which provides project hosting for free-software toolchain and developer tools, including those that comprise the GNU toolchain. O'Donell noted that some of the items in his document were taken from the Sourceware Cyber Security FAQ, specifically its section on "suggested secure development policies for projects", but had been rearranged into a structure that matched the NIST Secure Software Development Framework, which is the standard he recommended as "the simplest and least prescriptive".

In a nutshell, the document suggests top-level practices to be adopted "in order to develop a secure and robust GNU C Library". This includes treating infrastructure as a zero-trust environment in which it is assumed that any of the services, developers, administrators, or systems have been compromised and attempting to limit the consequences of such a compromise. It carries a host of recommendations such as defining security requirements for developers, implementing security tooling, and separating services into distinct systems or VMs.

Hosting

O'Donell emphasized that he was not talking about where to host the project's infrastructure, though the document does discuss hosting. This was worth noting, as the topic of toolchain infrastructure and hosting has come up a number of times, almost as an annual ritual at this point. It was, for example, raised in 2023 by O'Donell, and again in 2024, with O'Donell unsuccessfully trying to drive a core-toolchain-infrastructure (CTI) project that would have moved some of glibc's core collaboration services to infrastructure managed by the Linux Foundation. The statement of work for CTI proposed moving glibc infrastructure to a cloud vendor, adding multiple points of redundancy, as well as 24/7 monitoring and engineering support to handle service outages or "high-risk security events". The annual running cost for CTI was estimated at $276,000.

Sourceware became a member of the non-profit Software Freedom Conservancy (SFC) in 2023, in part as a response to a push by O'Donell and others to move GNU Toolchain services to the Linux Foundation in 2022. The current costs of glibc's infrastructure are somewhat nebulous, as much of Sourceware's infrastructure is donated rather than billed by a provider.

The infrastructure and services for Sourceware are managed by volunteers, with hardware, bandwidth, hosting, and other services donated by a number of individuals, companies, and other organizations. Red Hat provides the main server (singular) for several services, as well as a backup server should that one fail. The Oregon State University Open Source Lab, which recently had a funding scare, hosts the server that provides automated source and documentation snapshots. There are a number of machines provided and administered by other organizations and individuals for building software on various architectures, such as arm64, RISC-V, and s390x.

Mark Wielaard, who serves on the Sourceware project leadership committee, posted a report on Sourceware's second year with the SFC on May 27. According to that report, Sourceware's total income over the last year was about $3,000 from personal donations, and it has spent about $240 on PayPal fees and spare disks for its servers. In total, it has a little more than $10,000 in the bank.

According to Sourceware's infrastructure security page, the site hosts more than 25 projects that have more than 350 active developers and 1,000 contributors. The page has a plans section at the bottom with a list of high-level goals to improve the security of Sourceware's processes and services. This includes isolating services, modernizing account-management processes, improving the release-upload process, and hiring a part-time junior system administrator. The list of plans is unchanged since the page was first captured by the Internet Archive on May 28, 2024. Wielaard noted in his report that Sourceware is looking for sponsors to help "accelerate" its security plans.

CTI

The CTI discussion in 2024 was contentious, with glibc maintainers objecting to both the way the proposal was developed and the choice of Linux Foundation services. Zoë Kooyman weighed in on behalf of the Free Software Foundation (FSF) to say that it opposed the effort to move glibc to CTI. She noted that the proposal would mean that only Linux Foundation IT staff would have administrative access to the servers for CTI, thus no one outside the foundation would be able to improve, maintain, or audit the infrastructure. Sourceware, on the other hand, "accepts technical contributions, and LF IT could be making them right now".

Andrew Pinski asked why the proposal was not developed on the glibc development list, and said that it "gives the vibes of being too closed and being done in a rush" without thinking the proposal through. Alfred M. Szmidt complained that it smelled like a corporate push and was not something the community wanted. Wielaard questioned why O'Donell was "pushing for something that was already highly controversial" and received negatively when it had been proposed before:

I thought we had consensus that the community wasn't really helped by setting up a corporate controlled directed fund or by having a highly disruptive change of infrastructure providers. [...]

Personally, as a glibc developer, I don't want a messy migration of some of the services separating glibc from the rest of the core toolchain and developer tool projects hosted at Sourceware. And looking at some of the other replies I think there is sustained opposition to this idea.

That opposition has not abated. Pinski said of the new proposal that the glibc document was less about security and more about pushing glibc toward the CTI project. He said that it would be better to step back and discuss glibc's model of submitting patches and approvals. Wielaard thought that the proposed policy would be better and clearer if it concentrated solely on the secure-development process. "We have better/separate documents for the hosting infrastructure security parts."

Isolation

Joseph Myers, however, worried about Sourceware running many services that were not isolated to separate virtual machines or containers. That may have been fine 25 years ago, but the project should assume now that it is "at risk of targeted attacks from state-level actors". Its practices were outdated ten years ago, he said, and certainly outdated when he raised similar concerns during the GNU Tools Cauldron in 2022. That was likely a reference to a Birds-of-a-Feather session on Sourceware's toolchain infrastructure that included a presentation by O'Donell and David Edelsohn about using managed services from the Linux Foundation. LWN covered this session, which was "loud, contentious, and bordered on physical violence at one point".

In 2022, Myers floated the idea that Sourceware administrators could move to "a modern high-security setup with isolated services", so that compromises of one project or service would not impact other projects or services. He said he had not seen much progress on isolation since 2022, though there had been a few security improvements, such as disabling inactive user accounts:

If Sourceware doesn't do such a migration to more secure, isolated hosting of services (within a reasonable time starting from 2022), that also serves as evidence as far as I'm concerned of the advantages of different hosting arrangements. If in fact lots of such migrations have happened since the 2022 Cauldron and are continuing to happen for the last few unisolated services, that serves as evidence that Sourceware provides suitable hosting arrangements but needs to work on improving how configuration changes and administrative actions get reported to the hosted projects.

Wielaard said that the Sourceware organization was working on it, though progress might not be as fast as Myers might like. Sourceware had started isolating processes using systemd services and resource controls, and there would be an opportunity to move to separate containers or VMs in Q3 when Red Hat's "community cage" servers move to a datacenter in Raleigh, NC. (An update on this move, specific to Fedora's services, was posted in April by Kevin Fenzi on the Fedora Community Blog.)
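
For reference, systemd's sandboxing and resource-control directives can provide much of that per-service isolation without a move to containers or VMs; the following generic unit-file fragment is a hypothetical sketch, not Sourceware's actual configuration:

    [Service]
    # Run the service as an ephemeral, unprivileged user
    DynamicUser=yes
    # Private /tmp and a mostly read-only view of the filesystem
    PrivateTmp=yes
    ProtectSystem=strict
    ProtectHome=yes
    NoNewPrivileges=yes
    # Cap the resources a runaway or compromised service can consume
    MemoryMax=2G
    TasksMax=256
    CPUQuota=200%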

Security check list

While much of the conversation focused on the project's hosting infrastructure, there was some discussion of the other elements of O'Donell's document. Wielaard questioned whether the NIST format was the right one. It contains useful elements, he said, but "in general it isn't really a good way for a community project to document its cyber security practices". He added that the topic of a "secure development policy champion" had come up during the Sourceware office hours the day before.

O'Donell replied and volunteered to be glibc's secure-development policy champion. He disagreed that the NIST framework was not suitable for glibc, and pointed to a document he had created that compared Sourceware's cybersecurity policy to NIST's framework. His analysis concludes that items in Sourceware's checklist "do not clearly flow from any top-level requirements for security e.g. why would I do this particular step?", and recommends that the checklist should be rewritten to match NIST's framework.

Wielaard said he appreciated that there were interesting points from NIST, but a free-software project is unlike the organizations described in its document. "Pretending it is distracts from the strengths of collaboratively working together on Free Software." He added that Sourceware had been mostly looking at the European Union Cyber Resilience Act (CRA), and the checklist aimed to help create a documented, verifiable project security policy to prepare for the CRA becoming law. He said that it was great that O'Donell was volunteering: the best way forward would be to go over the checklist to document things that are already implemented or how to adopt any item that glibc is not already doing:

At the meeting several people said that we shouldn't mandate any specific policy item, but that we should look at making it attractive for contributors to follow policies because they agree it is good for the project as a whole. At the moment only the retiring of inactive accounts is mandated.

One suggestion was to use some kind of gamification between projects to see who did most. e.g. each quarter we publish a "signed-commit census report". We could turn that into a kind of leaderboard by sorting the projects by number of signed commits or number of people pushing signed commits. Last quarter glibc had just 8% of signed commits, that percentage could certainly be higher for Q2!
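
For projects that want to run such a census themselves, a rough version can be produced with git alone; this sketch (not Sourceware's actual tooling) tallies the commits in a quarter by signature status:

    # %G? prints the signature status of each commit:
    # G = good signature, N = no signature, B/E/etc. = other states
    git log --since=2025-04-01 --until=2025-07-01 --pretty='%G?' | sort | uniq -c

Sorting per-project output of this kind by the count of "G" lines would give something like the leaderboard Wielaard describes.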

With that, the conversation seems to have sputtered out for now. The matter of glibc process security will, no doubt, come up again in the future. The project, and Sourceware, do seem to be inching toward better security practices and more secure infrastructure. However, the current status is less than comforting given the importance of glibc and the overall GNU Toolchain. Given the history of attacks on free-software projects (like last year's XZ backdoor) and infrastructure, one might expect a little more urgency (and industry support) in seeing to those improvements.

Comments (9 posted)

Cory Doctorow on how we lost the internet

By Jake Edge
May 27, 2025

PyCon US

Cory Doctorow wears many hats: digital activist, science-fiction author, journalist, and more. He has also written many books, both fiction and non-fiction, runs the Pluralistic blog, is a visiting professor, and is an advisor to the Electronic Frontier Foundation (EFF); his Chokepoint Capitalism co-author, Rebecca Giblin, gave a 2023 keynote in Australia that we covered. Doctorow gave a rousing keynote on the state of the "enshitternet"—today's internet—to kick off the recently held PyCon US 2025 in Pittsburgh, Pennsylvania.

He began by noting that he is known for coining the term "enshittification" about the decay of tech platforms, so attendees were probably expecting to hear about that; instead, he wanted to start by talking about nursing. A recent study described how nurses are increasingly getting work through one of three main apps that "bill themselves out as 'Uber for nursing'". The nurses never know what they will be paid per hour prior to accepting a shift and the three companies act as a cartel in order to "play all kinds of games with the way that labor is priced".

In particular, the companies purchase financial information from a data broker before offering a nurse a shift; if the nurse is carrying a lot of credit-card debt, especially if some of that is delinquent, the amount offered is reduced. "Because, the more desperate you are, the less you'll accept to come into work and do that grunt work of caring for the sick, the elderly, and the dying." That is horrific on many levels, he said, but "it is emblematic of 'enshittification'", which is one of the reasons he highlighted it.

Platform decay

Enshittification is a three-stage process; he used Google to illustrate the idea. At first, Google minimized ads and maximized spending on engineering to produce a great search engine; while it was doing that, however, it was buying its way to dominance. "They bribed every service, every product that had a search box to make sure that that was a Google search box." No matter which browser, phone carrier, or operating system you were using, Google ensured that you were using its search by default; by the early 2020s, it was spending the equivalent of buying a Twitter every 18 months to do so, he said. That is the first stage of the process: when the provider is being good to its users, but is finding ways to lock them in.

[Cory Doctorow]

The second phase occurs once the company recognizes that it has users locked in, so it will be difficult for them to switch away, and it shifts to making things worse for its users in order to enrich its business customers. For Google, those are the publishers and advertisers. A growing portion of the search results page is shifted over to ads "marked off with ever-subtler, ever-smaller, ever-grayer labels distinguishing them from the organic search results". While the platform is getting better for business customers—at the expense of the users—those customers are also getting locked in.

Phase three of enshittification is when the value of the platform is clawed back until all that is left is kind of a "homeopathic residue—the least value needed to keep both business customers and end users locked to the platform". We have gained a view into this process from the three monopoly cases that Google has lost over the last 18 months. In 2019, the company had 90% of the world's search traffic and its users were loyal; "everyone who searched on Google, searched everything on Google".

But that meant that Google's search growth had plateaued, so how was the company going to be able to grow? It could "raise a billion humans to adulthood and make them Google customers, which is Google Classroom, but that's a slow process". From the internal memos that came to light from the court cases, we can see what the company chose to do, he said: "they made search worse".

The accuracy of the search results was reduced, which meant that users needed to do two or three queries to get the results they would have seen on the first page. That increased the number of ads that could be shown, which is obviously bad for searchers, but the company was also attacking its business customers at the same time. For example, "Google entered into an illegal, collusive arrangement with Meta, called Jedi Blue" that "gamed the advertising market" so that publishers got paid less and advertisers had to pay more, he said.

So that's how we have ended up at the Google of today, where the top of the search results page is "a mountain of AI slop", followed by five paid results "marked with the word 'Ad' in eight point, 90% gray-on-white type", ending with "ten spammy SEO [search-engine optimization] links from someone else who's figured out how to game Google". The amazing thing is "that we are still using Google because we're locked into it". It is a perfect example of the result of the "tragedy in three acts" that is enshittification.

Twiddling

The underlying technical means that allows this enshittification is something he calls "twiddling". Because the companies run their apps on computers, they can change a nearly infinite number of knobs to potentially alter "the prices, the cost, the search rankings, the recommendations" each time the platform is visited. Going back to the nursing example, "that's just twiddling, it's something you can only do with computers".

Legal scholar Veena Dubal coined the term "algorithmic wage discrimination" to describe this kind of twiddling for the "gig economy", which is "a major locus for enshittification"; the nursing apps, Uber, and others are examples of that economy. "Gig work is that place where your shitty boss is a shitty app and you're not allowed to call yourself an employee."

Uber invented a particular form of algorithmic wage discrimination; if its drivers are picky about which rides they accept, Uber will slowly raise the rates to entice those drivers—until they start accepting rides. Once a driver does accept a ride, "the wage starts to push down and down at random intervals in increments that are too small for human beings to readily notice". It is not really "boiling the frog", Doctorow said, so much as it is "slowly poaching it".

As anyone with a technical background knows, "any task that is simple, but time-consuming is a prime candidate for automation". This kind of "wage theft" would be tedious and expensive to do by hand, but it is trivial to play these games using computers. This kind of thing is not just bad for nurses, he said, it's bad for those who are using their services.

Do you really think that paying nurses based on how desperate they are, at a rate calculated to increase their desperation so that they'll accept ever-lower wages, is going to result in us getting the best care when we see a nurse? Do you really want your catheter inserted by a nurse on food stamps who drove an Uber until midnight the night before and skipped breakfast this morning so that they could pay the rent?

Paying and products

It is misguided to say "if you're not paying for the product, you're the product", because it makes it seem like we are complicit in sustaining surveillance capitalism—and we are not. The thinking goes that if we were only willing to start paying for things, "we could restore capitalism to its functional non-surveillance state and companies would treat us better because we'd be customers and not products". That thinking elevates companies like Apple as "virtuous alternatives" because the company charges money and not attention, so it can focus on improving the experience for its customers.

There is a small sliver of truth there, he said; Apple rolled out a feature on its phones that allowed users to opt out of third-party surveillance—notably Facebook tracking. 96% of users opted out, he said; the other 4% "were either drunk or Facebook employees or drunk Facebook employees".

So that makes it seem like Apple will not treat its customers as products, but at the same time as it added the opt-out, the company secretly started gathering exactly the same information for its "own surveillance advertising network". There was no notice given to users and no way to opt out of that surveillance; when journalists discovered it and published their findings, Apple "lied about it". The "$1000 Apple distraction rectangle in your pocket is something you paid for", but that does not stop Apple from "treating you like the product".

It is not just end users that Apple treats like products; the app vendors are also treated that way with 30% fees for payment processing in the App Store. That's what is happening with gig-app nurses: "the nurses are the product, the patients are the product, the hospitals are the product—in enshittification, the product is anyone you can productize".

While it is tempting to blame tech, Doctorow said, these companies did not start out enshittified. He recounted the "magic" when Google debuted; "you could ask Jeeves questions for a thousand years and still not get an answer as crisp, as useful, as helpful as the answer you would get by typing a few vague keywords" into Google. Those companies spent decades producing great products, which is why people switched to Google, bought iPhones, and joined their friends on Facebook. They were all born digital, thus could have enshittified at any time, "but they didn't, until they did, and then they did it all at once".

He believes that changes to the policy environment, not changes in technology, are what have led to enshittification. These changes to the rules of the game were "undertaken in living memory by named parties who were warned at the time of the likely outcomes"—and did it anyway. Those people are now extremely rich and respected; they have "faced no consequences, no accountability for their role in ushering in the Enshittocene". We have created a perfect breeding ground for the worst practices in our society, which allowed them to thrive and dominate decision-making for companies and governments "leading to a vast enshittening of everything".

That is a dismal outlook, he said, but there is a bit of good news hidden in there. This change did not come about because of a new kind of evil person or the weight of history, but rather because of specific policy choices that were made—and can be unmade. We can consign the enshitternet to the scrap heap as simply "a transitional state from the old good internet that we used to have and the new good internet that we could have".

All companies want to maximize profits and the equation to do so is simple: charge as much as you can, pay suppliers and workers as little as you can, and spend the smallest amount possible on quality and safety. The theoretically "perfect" company that charges infinity and spends nothing fails because no one wants to work for it—or buy anything from it. That shows that there are external constraints that tend to tamp down the "impulse to charge infinity and deliver nothing".

Four constraints

In technology, there are four constraints that help make companies better; they help push back against the impulse to enshittify. The first is markets; businesses that charge more and deliver less lose customers, all else being equal. This is the bedrock idea behind capitalism and it is also the basis of antitrust law, but the rules on antitrust have changed since the Sherman Antitrust Act was enacted in 1890. More than forty years ago, during the Reagan administration in the US, the interpretation of what it means to be a monopoly was changed, not just in the US, but also with its major trading partners in the UK, EU, and Asia.

Under this interpretation, monopolies are assumed to be efficient; if Google has 90% of the market, it means that it deserves to be there because no one can possibly do search any better. No competitor has arisen because there is no room to improve on what Google is doing. This pro-monopoly stance did exactly what might be expected, he said, it gave us more monopolies: "in pharma, in beer, in glass bottles, vitamin C, athletic shoes, microchips, cars, mattresses, eyeglasses, and, of course, professional wrestling", he said to laughter.

Markets do not constrain technology firms because those firms do not compete with their rivals—they simply buy them instead. That is confirmed by a memo from Mark Zuckerberg—"a man who puts all of his dumbest ideas in writing"—who wrote: "It is better to buy than to compete". Facebook bought Instagram to ensure that users switching away from Facebook would still be on a Facebook-owned platform; even though that anti-competitive intent came to light before the acquisition, the Obama administration permitted the sale. Every government over the past 40 years, of all political stripes, has treated monopolies as efficient, Doctorow said.

Regulation is also a constraint, unless the regulators have already been captured by the industry they are supposed to oversee. There are several examples of regulatory capture in the nursing saga, but the most egregious is that anyone in the US can obtain financial information on anyone else in the country, simply by contacting a data broker. "This is because the US congress has not passed a new consumer privacy law since 1988." The Video Privacy Protection Act was aimed at stopping video-store clerks from telling newspapers what VHS video titles were purchased or rented, but no protections have been added since then.

The reason congress has not addressed privacy legislation "since Die Hard was in its first run in theaters" is neither a coincidence nor an oversight, he said. It is "expensively purchased inaction" by an industry that has "monetized the abuse of human rights at unimaginable scale". The coalition in favor of freezing privacy law keeps growing because there are so many ways to "transmute the systematic invasion of our privacy into cash".

Tech companies are not being constrained by either markets or governments, but there are two other factors that could serve to tamp down "the reproduction of sociopathic, enshittifying monsters" within these companies. The first is interoperability; in the non-digital world, it is a lot of work to, say, ensure that any light bulb can be used with any light socket. In the digital world, all of our programs run on the same "Turing-complete, universal Von Neumann machine", so a program that breaks interoperability can be undone with a program that restores it. Every ten-foot fence can be surmounted with an 11-foot ladder; if HP writes a program to ensure that third-party ink cannot be used with its printers, someone can write a program to undo that restriction.

DoorDash workers generally make their money on tips, but the app hides the amount of the tip until the driver commits to taking the delivery. A company called Para wrote a program that looked inside the JSON that was exchanged to find the tip, which it then displayed before the driver had to commit. DoorDash shut down the Para app, "because in America, apps like Para are illegal". The 1998 Digital Millennium Copyright Act (DMCA) signed by Bill Clinton "makes it a felony to 'bypass an access control for a copyrighted work'". So even just reverse-engineering the DoorDash app is a potential felony, which is why companies are so desperate to move their users to apps instead of web sites. "An app is just a web site that we have wrapped in a correct DRM [digital rights management] to make it a felony to protect your privacy while you use it", he said to widespread applause.

At the behest of the US trade representative, Europe and Canada have also enacted DMCA-like laws. This happened despite experts warning the leaders of those countries that "laws that banned tampering with digital locks would let American tech giants corner digital markets in their countries". The laws were a gift to monopolists and allowed companies like HP to continually raise the price of ink until it "has become the most expensive substance you, as a civilian, can buy without a permit"; printing a shopping list uses "colored water that costs more than the semen of a Kentucky-Derby-winning stallion".

The final constraint, which did hold back platform decay for quite some time, is labor. Tech workers have historically been respected and well-paid, without unions. The power of tech workers did not come from solidarity, but from scarcity, Doctorow said. The minute bosses ordered tech workers to enshittify the product they were loyally working on, perhaps missing various important social and family events to ship it on time, those workers could say no—perhaps in a much more coarse way. Tech workers could simply walk across the street "and have a new job by the end of the day" if the boss persisted.

So labor held off enshittification after competition, regulation, and interoperability were all systematically undermined, and did so for quite some time—until the mass tech layoffs. There have been half a million tech workers laid off since 2023; more layoffs are announced regularly, sometimes in conjunction with raises for executive salaries and bonuses. Now, workers cannot turn their bosses down because there are ten others out there just waiting to take their job.

Reversing course

Until we fix the environment we find ourselves in, the contagion will spread to other companies, he said. The good news is that after 40 years of antitrust decline, there has been a lot of worldwide antitrust activity and it is coming from all over the political spectrum. The EU, UK, Australia, Germany, France, Japan, South Korea, "and China, yes, China" have passed new antitrust laws and launched enforcement actions. The countries often collaborate, so a UK study on Apple's 30% payment-processing fee was used by the EU to fine the company for billions of euros and ban Apple's payment monopoly; those cases then found their way to Japan and South Korea where Apple was further punished.

"There are no billionaires funding the project to make billionaires obsolete", Doctorow said, so the antitrust work has come from and been funded by grassroots efforts.

Europe and Canada have passed strong right-to-repair legislation, but those efforts "have been hamstrung by the anti-circumvention laws" (like the DMCA). Right-to-repair laws can only be exercised if there are no digital locks to get around, but the manufacturers ensure that every car, tractor, appliance, medical implant, and hospital medical device has locks to prevent repair. That raises the question of why these countries don't repeal their versions of the DMCA.

The answer is tariffs, it seems. The US trade representative has long threatened countries with tariffs if they did not have such a law on their books. "Happy 'Liberation Day' everyone", he said with a smile, which resulted in laughter, cheering, and applause. The response of most countries when faced with the US tariffs (or threats thereof) has been to impose retaliatory tariffs, making US products more expensive for their citizens, which is a weird way to punish Americans. "It's like punching yourself in the face really hard and hoping someone else says 'ouch'."

What would be better is for the countries to break the monopolies of the US tech giants by making it legal to reverse-engineer, jailbreak, and modify American products and services. Let companies jailbreak Teslas and deliver all of the features that ship in the cars, but are disabled by software, for one price; that is a much better way to hurt Elon Musk, rather than by expressing outrage at his Nazi salutes, since he loves the attention. "Kick him in the dongle."

Or, let a Canadian company set up an App Store that only charges 3% for payment processing, which will give any content producer an immediate 25% raise, so publishers will flock to it. The same could be done for car and tractor diagnostic devices and more. "Any country in the world has it right now in their power to become a tech-export powerhouse." Doing so would directly attack the tech giants in their most profitable lines of business: "it takes the revenues from those rip-off scams globally from hundreds of billions of dollars to zero overnight". And "that is how you win a trade war", he said to more applause.

He finished with a veritable laundry list of all of the ills facing the world today (the "omni-shambolic poly-crisis"), both on and off the internet, and noted that the tech giants would willingly "trade a habitable planet and human rights for a 3% tax cut". But it did not have to be this way, "the enshitternet was not inevitable" and was, in fact, the product of policy choices made by known people in the last few decades. "They chose enshittification; we warned them what would come of it and we don't have to be eternal prisoners of the catastrophic policy blunders of clueless lawmakers of old."

There once was an "old good internet", Doctorow said, but it was too difficult for non-technical people to connect up to; web 2.0 changed that, making it easy for everyone to get online, but that led directly into hard-to-escape walled gardens. A new good internet is possible and needed; "we can build it with all of the technological self-determination of the old good internet and the ease of web 2.0". It can be a place to come together and organize in order to "resist and survive climate collapse, fascism, genocide, and authoritarianism". He concluded: "we can build it and we must".

His speech was well-received and was met with a standing ovation. Some of his harshest rhetoric (much of which was toned down here) may not have been popular with everyone, perhaps especially the PyCon sponsors who were named and shamed in the keynote, but it did seem to resonate within the crowd of attendees. Doctorow's perspective is always interesting—and he certainly pulls no punches.

A YouTube video of the talk is available.

[I would like to thank LWN's travel sponsor, the Linux Foundation, for supporting my travel to Pittsburgh for PyCon.]

Comments (75 posted)

System-wide encrypted DNS

May 28, 2025

Pavel Březina and Francisco Triviño García

The increasing sophistication of attackers has organizations realizing that perimeter-based security models are inadequate. Many are planning to transition their internal networks to a zero-trust architecture. This requires every communication on the network to be encrypted, authenticated, and authorized. This can be achieved in applications and services by using modern communication protocols. However, the world still depends on Domain Name System (DNS) services where encryption, while possible, is far from being the industry standard. To address this we, as part of a working group at Red Hat, worked on fully integrating encrypted DNS for Linux systems—not only while the system is running but also during the installation and boot process, including support for a custom certificate chain in the initial ramdisk. This integration is now available in CentOS Stream 9, 10, and the upcoming Fedora 43 release.

Zero-trust architecture

A common perimeter-based approach separates the network into two sectors—internal and external. While the external network is usually not trusted, there is an implicit trust in the internal network. Even though it is quite common to authenticate users to the services, it is expected that any host and communication inside the internal network is trustworthy; therefore there is no mutual authentication and no perceived need for data encryption.

There is an increased risk of cyberattacks every year, and the designation of "internal" and "external" for network-connected devices is much less useful today. Companies are moving resources from internal networks into public clouds, and employees are working remotely or on devices not owned by the enterprise thanks to "bring your own device" policies. Implicit trust in "internal" networks is no longer acceptable, if it ever was.

Over the years, new extensions have been added to the DNS protocol to enhance its security. Domain Name System Security Extensions (DNSSEC) adds verification and data integrity. DNS over TLS (DoT) talks to the server over an encrypted channel. DNS over HTTPS (DoH) allows tunneling of DNS queries over HTTPS, and DNS over QUIC (DoQ) implements encryption on top of UDP.

Even though technology to implement DNS in zero-trust networks exists, it has not been widely adopted. And while it is possible for Linux users to manually configure encrypted DNS on their machine, there is no integration into the system. Multiple DNS lookups are usually performed while doing a Linux installation, and it is possible to boot from remote sources, which requires working DNS as well. This poses the question: how do you install and boot the operating system in a zero-trust environment? The answer is to integrate encrypted DNS into the system.

System-wide encrypted DNS

There are two methods that applications typically use to talk to a DNS server: going through the system resolver using the POSIX API (getaddrinfo()), or talking to a DNS server directly through a resolver library. When using the POSIX API, the system resolver talks to the server that is configured in /etc/resolv.conf; however, it does not support fully encrypted DNS. If an application implements its own resolver, it is often possible to configure a custom address for the DNS server (it typically defaults to the contents of /etc/resolv.conf as well). Whether encryption is supported then depends on the application—more often than not, it is not.

To avoid implementing encryption in all existing applications, it is possible to implement a local caching DNS resolver that can serve all local queries by forwarding them to the upstream DNS servers. This allows applications to talk to the local resolver using standard unencrypted UDP port 53, while the local resolver establishes an encrypted connection with the upstream DNS server and forwards all external communication over an encrypted channel. This local DNS resolver can be put into /etc/resolv.conf to let it be used automatically. This is demonstrated by the following figure:

[Diagram showing encrypting DNS queries with Unbound]
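
For readers who want to experiment before the integration described below reaches their distribution, a minimal hand-configured version of this setup might look like the following, using Unbound (the resolver the authors settled on, as described in the next section); the upstream address and authentication name are placeholders for a real DoT-capable server:

    # /etc/unbound/unbound.conf (sketch)
    server:
        interface: 127.0.0.1
        # CA bundle used to verify the upstream server's certificate
        tls-cert-bundle: /etc/pki/tls/certs/ca-bundle.crt

    forward-zone:
        name: "."
        # forward everything over TLS to the upstream resolver
        forward-tls-upstream: yes
        forward-addr: 192.0.2.53@853#dns.example.com

With /etc/resolv.conf pointing at 127.0.0.1, applications keep using plain port-53 queries locally while everything that leaves the host goes out over port 853.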

Technology choices

Multiple components were considered to play the role of the local DNS caching resolver. The most promising were systemd-resolved and Unbound. The clear benefits of systemd-resolved were its existing integration within Fedora and NetworkManager. However, at the time of the decision, systemd-resolved had multiple longstanding issues that the upstream was not planning to fix, especially in the DNSSEC area, as shown in the systemd GitHub issues 24827, 23622, and 19227. Systemd-resolved also remains in technology preview in Red Hat Enterprise Linux (RHEL), which was our target distribution. After consulting with Red Hat's systemd and DNS developers, we chose Unbound as a small and reliable DNS caching resolver with good support for DNSSEC. Please note that some of the systemd-resolved issues were eventually fixed.

The choice of the communication protocol was more straightforward. DoT was selected for forwarding queries to the upstream DNS server, in favor of DoH or DoQ. Although the preferred solution would support all three protocols and let the user choose, the reality is that, while there is substantial support for downstream (receiving queries) DoT and DoH in DNS servers, only DoT is usually supported for upstream (forwarding) queries. Support for DoQ is not yet widely available on either side.
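
Whether a given upstream server actually offers DoT can be checked directly from the command line, for example with kdig from the Knot DNS utilities (the address below is again a placeholder):

    # Send a single query over TLS to port 853
    kdig @192.0.2.53 +tls example.com A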

Integration

All users can manually configure the local caching DNS resolver as described above. However, changes to multiple components were required to fully integrate Unbound into the system and to simplify the configuration and enforcement of encrypted DNS in the boot and installation processes, as well as in the running system.

The integration is centered around NetworkManager, which is the network-configuration daemon used by many distributions. From a user perspective, the main use case of NetworkManager is to obtain network information from DHCP or its configuration files and to set up the system networking properly.

Beniamino Galvani added (merge requests 2090 and 2123) new configuration options and kernel command-line options to set a static DNS server, with DoT support, to be used exclusively for all connections. NetworkManager already had built-in support for dnsmasq and systemd-resolved, but it did not have support for Unbound. A new plugin, dnsconfd, was added by Tomáš Korbař to handle the configuration of Unbound.

Dnsconfd is a new project that was created to sit between NetworkManager and the DNS caching resolver. It allows NetworkManager to focus on obtaining the list of upstream DNS servers instead of dealing with the configuration peculiarities of various local DNS services. It provides a generic D-Bus interface and translates calls to the interface into configuration of specific DNS resolvers. At this time, only Unbound is supported, but there is a plan to extend it for other resolvers as well.

To properly integrate encrypted DNS in the boot process, NetworkManager, dnsconfd, and OpenSSL must be included in and started from the initramfs image. Various distributions use different tools to create the image. We focused on dracut, which is used to build the initramfs image in Fedora and other distributions in the Red Hat family. Dracut has a modular architecture, where each module specifies which files are pulled into the image. NetworkManager already has its own dracut module that executes nm-initrd-generator to generate the network configuration, but it now supports the new NetworkManager options to enable the encrypted DNS. Further, Korbař and Pavel Valena implemented new dracut modules for OpenSSL and dnsconfd.

The last piece of the puzzle is to enable encrypted DNS during system installation (and of course in the installed system). Fedora and related distributions use the Anaconda installer, so we focused on this project. Since many DNS servers require a custom certificate chain to verify their TLS certificate, it is important to include this CA bundle in the installation process. For this, the Anaconda team implemented a new %certificate kickstart section that copies custom certificates during the installation.

Anaconda runs inside its own initramfs image, where dracut modules generate the necessary configuration to enable DoT during the installation process. The installer then makes sure that all required services are started and copies all required configuration into the installed system. This makes sure that DoT can be used during installation, during boot, and in the freshly installed system, and that no unencrypted DNS query leaves the host—ever.
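
A simple way to spot-check that guarantee on a running system is to watch the external interfaces for cleartext DNS; with the local resolver in place, this should stay quiet (the interface name is an assumption):

    # Unencrypted DNS uses port 53; the DoT traffic to the upstream uses 853
    tcpdump -ni eth0 'port 53'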

Encrypted DNS in identity management

FreeIPA ("identity, policy, audit") is an identity management solution—used widely with RHEL-type systems—that provides centralized authentication, authorization, and account information. It has introduced support for encrypted DNS via DoT in its integrated DNS service.

A typical FreeIPA deployment consists of one or more servers, optional replicas, and multiple clients. Servers act as the authoritative source of identity data and policies, while replicas provide scalability and redundancy. Both servers and replicas may optionally deploy the integrated DNS service, which allows them to manage DNS zones used in the identity infrastructure. Clients join the domain and interact with the servers for authentication, host enrollment, and service discovery.

In this topology, the golden rule for DNS security is clear: all DNS traffic leaving the host must be encrypted. This means clients must communicate with the DNS server over an encrypted channel (via DoT). Within the host, DNS queries may remain unencrypted as long as they occur over the loopback interface.

When a FreeIPA replica includes the integrated DNS service, it is treated similarly to a server, handling both incoming unencrypted queries from localhost and external encrypted queries. Replicas without the DNS service follow the client pattern: using Unbound as a local DoT resolver forwarding to an upstream encrypted DNS source. This distinction ensures consistent policy enforcement while accommodating different deployment needs as Triviño wrote in the design page.

The integration of DoT into FreeIPA is deliberately minimal in its first iteration, targeting new deployments and isolating the encrypted DNS logic into dedicated subpackages: freeipa-client-encrypted-dns and freeipa-server-encrypted-dns. This modular design ensures that existing installations remain unaffected when FreeIPA is upgraded, unless the user explicitly installs the new packages as implemented by Antonio Torres.

To implement DoT support, FreeIPA relies on Unbound as a local resolver and forwarder, sitting alongside the existing BIND 9.18-based DNS service. This architectural decision stems from current limitations in BIND's DoT forwarding capabilities, which are only addressed in the BIND 9.20 LTS release. The 9.20 release is not yet supported by FreeIPA due to the large number of architectural changes in BIND that requires a significant rewrite of bind-dyndb-ldap (a FreeIPA plugin that reads DNS zones from LDAP).

The integration of Unbound ensures encrypted external DNS queries while allowing BIND to continue handling internal DNS zone management and resolution. On the client side, Unbound is deployed as a local caching resolver. For servers and replicas, BIND handles internal and incoming DNS queries, both encrypted and unencrypted, while forwarding external requests through Unbound using TLS. See the image below for an illustration of this.

[Diagram showing encrypted DNS with FreeIPA]

Certificate management is handled through FreeIPA's existing public-key infrastructure. Administrators can either provide their own TLS certificates or allow FreeIPA to issue and manage them via its Custodia subsystem. This flexibility enables integration into both enterprise-managed and automated deployments.

We have provided instructions for enabling system-wide encrypted DNS and FreeIPA's encrypted DNS feature on Fedora/RHEL-like systems as a separate guide.

Upstream and downstream availability

All of the work has already been upstreamed in Anaconda, dnsconfd, dracut, FreeIPA, NetworkManager, and the System Security Services Daemon (SSSD). It has been released as part of the latest versions of all of the affected components. Fedora users may already start experimenting with system-wide encrypted DNS in Fedora 42 (run time and boot time) and Fedora 43 (current rawhide, including encrypted DNS during installation). RHEL users will see the feature as part of 9.6 and 10.0 when they are released, or it can be used now in CentOS Stream 9 and 10.

Our working group continues to expand the encrypted DNS feature. The work that has been done so far was focused on enabling encrypted DNS for zero-trust and enterprise requirements. One of the things on the road map is to implement support for DoH forwarding, as well as RFC 9463 ("DHCP and Router Advertisement Options for the Discovery of Network-designated Resolvers"), which allows the discovery of DoT or DoH servers from DHCP. We are also working on the bind-dyndb-ldap rewrite to make it compatible with BIND 9.20, so that it is possible to use BIND directly as the DoT forwarder and avoid running Unbound on the IPA server.

Comments (29 posted)

Development statistics for the 6.15 kernel

By Jonathan Corbet
May 26, 2025
The 6.14 kernel development cycle only brought in 11,003 non-merge changesets, making it the slowest cycle since 4.0, which was released in 2015. The 6.15 kernel, instead, brought in 14,612 changesets, making it the busiest release since 6.7, released at the beginning of 2024. The kernel development process, in other words, is back up to full speed. The 6.15 release happened on May 25, so the time has come for the obligatory look at where the changes in this release came from.

As a reminder, LWN subscribers can find this information and more, at any time, for any kernel version since 2005, in the LWN Kernel Source Database.
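
For readers who want to check the headline numbers themselves, rough equivalents can be computed straight from the kernel's git repository (LWN's database does additional identity cleanup, so the counts will differ slightly):

    # Non-merge changesets that went into 6.15
    git log --no-merges --oneline v6.14..v6.15 | wc -l

    # Approximate count of contributing developers
    git log --no-merges --format='%an <%ae>' v6.14..v6.15 | sort -u | wc -l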

The work in 6.15 was contributed by 2,068 developers — a relatively high number, though it falls short of the record 2,090 seen in the 6.2 development cycle. There were 262 developers who made their first kernel contribution in 6.15. The most active contributors this time around were:

Most active 6.15 developers

  By changesets
    Kent Overstreet           266    1.8%
    Kuninori Morimoto         191    1.3%
    Ville Syrjälä             144    1.0%
    Andy Shevchenko           137    0.9%
    Alex Deucher              123    0.8%
    Nam Cao                   123    0.8%
    Sean Christopherson       117    0.8%
    Krzysztof Kozlowski       115    0.8%
    Takashi Iwai              114    0.8%
    Dr. David Alan Gilbert    111    0.8%
    Thomas Weißschuh          108    0.7%
    Jani Nikula               106    0.7%
    Pavel Begunkov            102    0.7%
    Jakub Kicinski             94    0.6%
    Eric Biggers               93    0.6%
    Christoph Hellwig          92    0.6%
    Arnd Bergmann              91    0.6%
    Matthew Wilcox             89    0.6%
    Ian Rogers                 89    0.6%
    Mario Limonciello          87    0.6%

  By changed lines
    Wayne Lin               80287    9.5%
    Ian Rogers              33886    4.0%
    Miri Korenblit          29176    3.4%
    Bitterblue Smith        26801    3.2%
    Andrew Donnellan        25819    3.0%
    Edward Cree             12941    1.5%
    Austin Zheng            12889    1.5%
    Michael Ellerman        12629    1.5%
    Dikshita Agarwal         8901    1.1%
    Nick Chan                8802    1.0%
    Nick Terrell             8749    1.0%
    Kent Overstreet          8296    1.0%
    Christoph Hellwig        7202    0.8%
    Eric Biggers             7012    0.8%
    Dr. David Alan Gilbert   6844    0.8%
    Nuno Das Neves           6419    0.8%
    Ivaylo Ivanov            5938    0.7%
    David Howells            5909    0.7%
    Alex Deucher             5398    0.6%
    Matthew Brost            5312    0.6%

Once again, the developer with the most changesets was Kent Overstreet, who continues to work on stabilizing the bcachefs filesystem. Kuninori Morimoto contributed a large set of cleanups to the sound subsystem. Ville Syrjälä worked exclusively on the Intel i915 graphics driver. Andy Shevchenko contributed small improvements throughout the driver subsystem, and Alex Deucher worked, as always, on the AMD graphics driver subsystem.

Returning to a pattern often seen in recent years, the "lines changed" column is led by Wayne Lin, who contributed yet another set of AMD GPU header files. Ian Rogers made a number of contributions to the perf subsystem, including updating the large Intel vendor-events files. Miri Korenblit added the new "iwlmld" driver for newer Intel WiFi adapters. Bitterblue Smith added a number of RealTek WiFi driver variants, and Andrew Donnellan removed a couple of unused CXL drivers.

The top testers and reviewers this time around were:

Test and review credits in 6.15

  Tested-by
    Daniel Wheeler           163    9.2%
    Neil Armstrong            64    3.6%
    Thomas Falcon             35    2.0%
    Babu Moger                30    1.7%
    Shaopeng Tan              30    1.7%
    Peter Newman              30    1.7%
    Amit Singh Tomar          30    1.7%
    Shanker Donthineni        30    1.7%
    Stefan Schmidt            28    1.6%
    Nicolin Chen              25    1.4%
    Xiaochun Lee              25    1.4%
    Venkat Rao Bagalkote      24    1.4%
    Andreas Hindborg          21    1.2%
    Alison Schofield          21    1.2%
    Carl Worth                21    1.2%

  Reviewed-by
    Simon Horman             271    2.7%
    Krzysztof Kozlowski      161    1.6%
    Dmitry Baryshkov         147    1.5%
    Geert Uytterhoeven       112    1.1%
    Andrew Lunn              109    1.1%
    Ilpo Järvinen            105    1.1%
    Darrick J. Wong          105    1.1%
    David Sterba             102    1.0%
    Rob Herring (Arm)        100    1.0%
    Jonathan Cameron          97    1.0%
    Linus Walleij             96    1.0%
    Charles Keepax            93    0.9%
    Jan Kara                  88    0.9%
    Christoph Hellwig         82    0.8%
    Jacob Keller              81    0.8%

Daniel Wheeler retains his permanent spot as the top-credited tester; nobody else even comes close. The top reviewers are a bit different this time around, with Simon Horman reviewing just over four networking patches for every day of this development cycle.

There were Tested-by tags in 1,411 6.15 commits (9.7% of the total), while 7,332 (50.2%) of the commits had Reviewed-by tags.
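
Tag counts like these can be approximated by matching trailers in the commit messages, for example:

    # Commits in 6.15 carrying Tested-by and Reviewed-by tags, respectively
    git log --no-merges --oneline --grep='^Tested-by:' v6.14..v6.15 | wc -l
    git log --no-merges --oneline --grep='^Reviewed-by:' v6.14..v6.15 | wc -l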

Work on 6.15 was supported by (at least) 195 employers, a slightly smaller number than usual. The most active employers were:

Most active 6.15 employers

  By changesets
    Intel                   1755   12.0%
    (Unknown)               1302    8.9%
    Google                   983    6.7%
    (None)                   930    6.4%
    Red Hat                  889    6.1%
    AMD                      881    6.0%
    Linaro                   645    4.4%
    SUSE                     549    3.8%
    Meta                     493    3.4%
    NVIDIA                   370    2.5%
    Huawei Technologies      370    2.5%
    Renesas Electronics      367    2.5%
    Qualcomm                 319    2.2%
    Arm                      301    2.1%
    Linutronix               296    2.0%
    Oracle                   286    2.0%
    IBM                      282    1.9%
    Microsoft                259    1.8%
    (Consultant)             180    1.2%
    NXP Semiconductors       179    1.2%

  By lines changed
    AMD                   125923   14.9%
    (Unknown)              97908   11.5%
    Intel                  94150   11.1%
    Google                 67461    8.0%
    IBM                    48682    5.7%
    (None)                 45049    5.3%
    Red Hat                43981    5.2%
    Qualcomm               34014    4.0%
    Meta                   26182    3.1%
    Microsoft              19431    2.3%
    Linaro                 16389    1.9%
    NVIDIA                 16191    1.9%
    SUSE                   15175    1.8%
    Huawei Technologies    14136    1.7%
    Xilinx                 12961    1.5%
    Collabora              11640    1.4%
    Arm                     9357    1.1%
    NXP Semiconductors      8857    1.0%
    Rockchip                8085    1.0%
    BayLibre                8037    0.9%

This is mostly the usual list of companies that consistently support kernel work from one year to the next. Linutronix has moved up the list this time around, mostly as the result of a lot of work on the kernel's timer subsystem. IBM, once one of the top contributors to the kernel, continues to move downward.

A different view of how the process works can be had by looking at the Signed-off-by tags applied to patches, specifically those applied by developers other than the author. Those additional signoffs are the traces left when developers forward a patch or apply it to a Git repository on its way toward the mainline; they thus give a clue as to who is doing the work of herding patches upstream. For 6.15, the signoff statistics look like this:
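
The same kind of tally can be approximated from the repository by listing Signed-off-by trailers and discarding those that match the commit's author; LWN's database does more careful identity matching than this slow-but-simple sketch does:

    # For each non-merge commit in the range, print Signed-off-by values
    # that do not match the commit author's name, then tally them.
    git log --no-merges --format='%H' v6.14..v6.15 | while read -r commit; do
        author=$(git show -s --format='%an' "$commit")
        git show -s --format='%(trailers:key=Signed-off-by,valueonly=true)' "$commit" |
            grep -v -F "$author" | grep -v '^$'
    done | sort | uniq -c | sort -rn | head -n 20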

Non-author Signed-off-by tags in 6.15

  Developers
    Jakub Kicinski           955    7.0%
    Mark Brown               774    5.7%
    Andrew Morton            649    4.8%
    Alex Deucher             571    4.2%
    Ingo Molnar              400    2.9%
    Greg Kroah-Hartman       389    2.9%
    Jens Axboe               325    2.4%
    Paolo Abeni              314    2.3%
    Hans Verkuil             257    1.9%
    Thomas Gleixner          235    1.7%
    Christian Brauner        218    1.6%
    Namhyung Kim             194    1.4%
    Jonathan Cameron         186    1.4%
    Alexei Starovoitov       183    1.3%
    Johannes Berg            160    1.2%
    Heiko Stuebner           148    1.1%
    Martin K. Petersen       137    1.0%
    Vinod Koul               137    1.0%
    David Sterba             131    1.0%
    Shawn Guo                130    1.0%

  Employers
    Meta                    1702   12.5%
    Google                  1405   10.3%
    Intel                   1310    9.6%
    Red Hat                 1151    8.5%
    Arm                      955    7.0%
    AMD                      908    6.7%
    Linaro                   768    5.7%
    Microsoft                427    3.1%
    Linux Foundation         418    3.1%
    SUSE                     404    3.0%
    (Unknown)                376    2.8%
    (None)                   331    2.4%
    Qualcomm                 307    2.3%
    NVIDIA                   304    2.2%
    Huawei Technologies      289    2.1%
    Linutronix               283    2.1%
    Cisco                    281    2.1%
    Oracle                   202    1.5%
    LG Electronics           194    1.4%
    IBM                      173    1.3%

One patch out of every eight going into the kernel now passes through the hands of a maintainer at Meta, and nearly as many are handled by Google developers.

As of this writing, there are well over 12,000 commits in linux-next, almost all of which can be expected to find their way into the kernel during the 6.16 merge window. That suggests that the next development cycle will be as busy as this one was. As always, keep an eye on LWN to keep up with the next kernel as it is assembled and stabilized.

Comments (none posted)

Long-duration stress-testing for filesystems

By Jake Edge
May 22, 2025

LSFMM+BPF

Testing filesystems is a frequent topic at the Linux Storage, Filesystem, Memory Management, and BPF Summit (LSFMM+BPF); the 2025 edition was no exception. Boris Burkov led a filesystem-track session to discuss stress-testing filesystems—and running those tests for lengthy periods. He reviewed what he has been doing when testing filesystems and wanted to gather ideas for what could be done to catch more bugs before the filesystems hit production.

He began by noting that he works for Meta on Btrfs, which means that he spends a lot of time tracking down "weird bugs that you only see on millions of computers". Production use stresses filesystems, so it makes sense for filesystem developers to do that stressing ahead of time to try to catch bugs before they reach production. To get an idea of what kinds of bugs made it into production, he surveyed ones that Meta had encountered; "it was a quick biased sample" that may miss some types of bugs or overemphasize others. There were two data-corruption bugs, a metadata corruption that "took a few months of Josef [Bacik] and I debugging it to find", a noisy-neighbor problem where misbehaving containers could cause a problem in other containers due to CPU and global-lock contention, and corruption when trying to enable large folios on XFS.

[Boris Burkov]

Burkov tried to extract some patterns from those problems and their investigations. One important thing emerged: all of these kernels were tested with fstests using the "auto" group (i.e. -g auto). That group is a set of tests meant for regression testing. It is something that Meta runs daily and on every commit; he thought that many other developers and companies were probably doing something similar.

The way these problems were reproduced for debugging was with custom scripts that were somewhat similar to the buggy workload; they would often require hours or days to reproduce the problem. Those runs were done with "a high degree of parallelism and perhaps with other stressing conditions" until the problem would occur. Frequently, getting the bug to reproduce relied on memory pressure or increased concurrency.

Data integrity seems to be something of a blind spot, he said. Roughly half of the bugs boiled down to some kind of data corruption, and "far from half of fstests are about data corruption".

The obvious next step is to look at what others have done for stress-testing, he said. There is a "soak" group for fstests, which is aimed at longer-duration tests, and fsstress and fsx from the Linux Test Project (LTP) are also available. With the default settings (though he noted that his defaults may not be universal), those tests generally do not run for more than about ten minutes in practice. The SOAK_DURATION parameter can be used to run that group for as long as desired, which Darrick Wong described in his response to Burkov's session proposal post. Most of the stress tests in fstests run fsstress or fsx in combination with some other "nasty thing" like CPU hotplug, Btrfs balance operations, or XFS scrub operations.

The operations that fsstress uses are extensive, and include some filesystem-specific operations, all of which is great, Burkov said. But what is lacking are some stressors, the biggest of which is memory pressure. Another is more parallelism for the operations, which is something that Dave Chinner mentioned in conjunction with the check-parallel test script in his reply to the topic proposal. Chris Mason had suggested adding in random filesystem-sync and cache-clearing operations into the mix as well, Burkov said.

As part of his research into filesystem stress-testing, he came across a paper about the NFSv4 test project, which "had a passage that struck me":

One year after we started using FSSTRESS (in April 2005) Linux NFSv4 was able to sustain the concurrent load of 10 processes during 24 hours, without any problem. Three months later, NFSv4 reached 72 hours of stress under FSSTRESS, without any bugs. From this date, NFSv4 filesystem tree manipulation is considered to be stable.

That is how the NFSv4 developers decided the filesystem was stable. The 72 hours of fsstress is not really part of his testing, though he thinks Btrfs would pass that bar. He does not really think of Btrfs as "stable", however. It is 20 years since that statement was made for NFSv4, so Burkov wondered what the modern equivalent for today's filesystems should be. His proposal was: "Run dozens of relevant complex operations including fsstress and fsx in parallel under memory pressure for 72 hours", where the dozens include various, sometimes filesystem-specific, operations such as sync, reflink, balance, dropping caches, memory compaction, and CPU hotplug.

Another option might be to modify fsstress itself, perhaps by using its -x option to run commands to, say, drop the caches, or by running it in a control group to add memory pressure. In addition, data-integrity checks and more filesystem-specific stressors could be added. The check-parallel script does not really fulfill the goals that he sees as needed, but "there's definitely room for it to grow into that space". He was open to suggestions if none of those really appealed; "what do people think we should do as a gold standard for stress testing?"

Ted Ts'o thought that check-parallel may not be great for finding problems, as Chinner had suggested, though Ts'o said that it is good for triggering these kinds of problems. There is, however, a need to find "ways of running these soak tests which are reproducible enough" to "reliably trigger the failure multiple times" in the shortest time possible. Tests that take 72 hours and fail 50% of the time, for example, will be difficult to use to track down bugs and to verify that they have been fixed, so quick reproducibility is important. He is concerned that relying on check-parallel will make that more difficult because it is so timing-dependent.

In his testing, Ts'o has found that using a variety of storage devices is important; "some things only trigger if you're using a fast SSD, other things only trigger if you are using spinning-rust platters". If a problem happens once on a hard disk, for example, try to reproduce it on a fast SSD or ramdisk, he said. "It's not enough just to say 'great we were able to trigger a failure', it's 'can we trigger a failure reliably?'" Burkov said that he could not agree more, as fast reproducibility is what he is constantly working toward with his testing and tools.

Chuck Lever said that he had a suggestion for a test to use, but feared it would fall into the "intermittently reproducible" bucket: the Git regression test suite. It runs nearly 1000 different tests and checks the contents of the files that are manipulated using Git operations. He turns that functionality test into a stress test by running it in multiple threads using a command like "make -j 16". Often the single-threaded test will run reliably many times, but the multi-threaded test will fail, generally pretty quickly. But it is sometimes hard to track down what went wrong, he said, because it is testing Git, not filesystems.

Zach Brown said that the parameter space for filesystems was so large that it was not really productive to try to claim that it has been exhaustively explored. But, as Burkov has seen, there are parts of that parameter space that are frequently seen in production, but have not been tested much; a good example is memory pressure, Brown said, which is a problem area that he has also observed. He wondered if it made more sense to try to somehow fingerprint production deployments to determine where in the parameter space they are running, which could point to areas that are not being tested.

Bacik and Ts'o both thought that adding more data verification into fsx and fsstress would be useful; the belief is that neither does much or any of that right now. Mason said that another stressor that should be added into the mix is memory compaction; there are various ways that the memory-management subsystem moves pages around underneath the filesystems, which may help shake out bugs. Luis Chamberlain suggested running tests in virtual machines with different types of filesystems in the guest and host.

Ts'o said that it might make sense to collaborate on the "antagonists" (stressors) so that they can be run in all of the different test harnesses that are in use. Once that is done, "a bunch of standard antagonist packages" could be added to fstests; if the antagonists are defined at that level, more filesystems and testers will be able to use them. As time ran out, Chamberlain noted that the Rust developers require adding tests for new APIs in order for them to be merged, but that is not something that is required for filesystem APIs, which should change, he said.

Comments (10 posted)

Formally verifying the BPF verifier

By Daroc Alden
May 23, 2025

LSFMM+BPF

The BPF verifier is an increasingly complex and security-critical piece of code. When the kinds of people who are apt to work on BPF see a situation like that, they naturally question whether it's possible to use formal verification to ensure that the implementation of the code in question is correct. Santosh Nagarakatte led the first of two extra-long sessions in the BPF track of the 2025 Linux Storage, Filesystem, Memory Management, and BPF Summit about his team's work formally verifying the BPF verifier with a custom tool called Agni.

Agni's history

[Santosh Nagarakatte]

Work on Agni began about six years ago, Nagarakatte said, when he got interested in the PREVAIL BPF verifier, and met other people excited to study it. Since then, Harishankar Vishwanathan, Matan Shachnai, Srinivas Narayana, and Nagarakatte have been working at the Rutgers Architecture and Programming Languages Research Group to develop the tool.

The Linux kernel's BPF verifier is probably the first real instance of formal verification "in production", Nagarakatte said. Other projects that use formal verification tend to do so "on the side", not as part of the running, deployed system. That makes it interesting because writing correct formal verifiers is hard, and the BPF verifier will often be running in a context where it's hard for the original developer to spot errors.

So, he asked, can we understand the algorithms that the BPF verifier uses, and guarantee that they're correct? The BPF verifier has a lot of different components, so Nagarakatte and his team decided to start by tackling value tracking: the part of the verifier that determines what values a variable can have at different points in the program. Narayana's later session, which will be the subject of a separate article, covered their subsequent work on checking whether the verifier's path-pruning algorithm is correct.

Their first stab at the problem was to manually encode and check some proofs about the BPF verifier's abstract-value-tracking implementation. That worked fine for addition, but they couldn't make it work for the verifier's checking of multiplications. As a result of that experience, they ended up writing a new algorithm for multiplying abstract values that was amenable to verification, and got that accepted into the mainline kernel. So, from that work, they were confident that addition and multiplication were correct, which is already useful.

The BPF verifier changes all of the time, however, and manually keeping their proofs up to date was clearly not going to be feasible. That's where Agni steps in: it takes the C source code of the BPF verifier and converts it into a satisfiability modulo theory (SMT) problem that can be automatically proved or disproved by an SMT-LIB implementation such as Z3. If the solver can prove that the verifier is correct, that's excellent.

If it finds a counterexample, however, the raw output is not particularly useful. Ideally, Nagarakatte's team wants the BPF developers to be able to use Agni as an extra check during development — something that can be used to test changes before they actually make it into the kernel. In pursuit of that goal, they added a program-synthesis component. If the SMT solver finds that the verifier is not correct, Agni will take the output of the SMT solver and use it to construct a proof-of-concept BPF program that triggers the bug in the verifier. That can be fed back to the developer to illustrate where the failure comes from.

Verifying arithmetic

With that high-level history of the project out of the way, Nagarakatte went on to explain how Agni actually does this. First, it takes the C source code and compiles it to LLVM's intermediate representation (IR). Agni doesn't need to handle every corner-case of the IR because it turns out that the verifier's code is not "as bad as other real world C" — it uses a fairly limited subset of the language. Once Agni has the IR, it uses LLVM's dead-code elimination to focus on a single operator at a time by discarding all of the parts of the verifier that aren't relevant to that operator.

Those operators are used to combine the verifier's abstract representations of what a variable could be. So it's not as simple as adding two concrete numbers — instead, the verifier has to be able to answer questions like "if register 1 has a number between 0 and 100, and register 2 has a number between 3 and 5, is their sum less than the length of this array?". This information is used throughout the verifier to ensure that accesses are within-bounds and aligned.

In particular, the verifier tracks which bits of a value are known exactly, as well as what its range of possible values is as a signed or unsigned number. Shung-Hsi Yu led a session at the 2024 summit about his work simplifying the representation of these abstract values.
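
As a rough illustration (a simplified sketch in C, not the kernel's actual data structure, whose field names and layout differ), the per-register state can be thought of as a set of known bits plus signed and unsigned ranges:

    /* Simplified model of the verifier's per-register tracking; the real
     * structure in the kernel has more fields and different names. */
    #include <stdint.h>

    struct abstract_reg {
        uint64_t value;      /* value of the known bits (zero in unknown positions) */
        uint64_t mask;       /* 1 = bit is unknown, 0 = bit is known */
        uint64_t umin, umax; /* unsigned range */
        int64_t  smin, smax; /* signed range */
    };

    /* A concrete value x is modeled by reg if it agrees with the known
     * bits and lies inside both ranges. */
    static int models(const struct abstract_reg *reg, uint64_t x)
    {
        return (x & ~reg->mask) == reg->value &&
               x >= reg->umin && x <= reg->umax &&
               (int64_t)x >= reg->smin && (int64_t)x <= reg->smax;
    }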

For each mathematical and bitwise operator, Agni takes the LLVM IR and translates it into a machine-checkable specification that the operator is implemented correctly. This transformation ends up using type information from the LLVM IR, which poses a problem because some of that type information is not available in LLVM version 15 or higher. Eventually, when the kernel updates to require LLVM 15, Agni will break and the BPF developers will need to find an alternate approach. That was a problem Nagarakatte wanted to discuss with the assembled developers in more depth.

What it means for an abstract operator of this type to be correct ("sound") is remarkably straightforward, as complicated mathematical definitions go. Suppose that there are two abstract values (considered as sets of possible values, even though this isn't how the verifier represents them in memory), P and Q, and two specific numbers, x and y, which are members of P and Q respectively. The verifier's implementation of "+" is sound if the abstract representation that comes out of calculating the operation of "+" on two registers containing P and Q always contains the number "x + y". That is to say, given some specific numbers that are correctly modeled by two abstract register states, adding the two numbers should produce something that is correctly modeled by the addition of the two abstract states.
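
Written out in notation (a paraphrase of the definition above, not taken from the talk, treating abstract values as sets the way the article does, with the hat marking the verifier's abstract addition):

    \[
      \forall x \in P,\ \forall y \in Q:\quad
      x + y \;\in\; P \mathbin{\widehat{+}} Q
    \]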

Complications

At first, they had planned to verify each way that the verifier tracks values (as known bits, and signed and unsigned ranges) independently. That turns out not to work, however, because the verifier actually shares information between these representations. For example, if it knows that all of the bits other than the least significant two are zero, it also knows that the signed and unsigned ranges are 0-3. In the absence of this sharing of information, the BPF verifier's implementation would be unsound. The academic term for this sort of thing is a "shared refinement operator"; a refinement operator being something that slims down an abstract value by ruling out impossible values.
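
Continuing the simplified sketch from above (again purely illustrative, not the kernel's code), one direction of such a refinement step uses the known bits to tighten the unsigned range:

    /* Illustrative refinement: the known bits bound the unsigned range.
     * If everything above the low two bits is known to be zero, the
     * range collapses to [0, 3], as in the example above. */
    static void refine_from_known_bits(struct abstract_reg *reg)
    {
        uint64_t lo = reg->value;             /* unknown bits all zero */
        uint64_t hi = reg->value | reg->mask; /* unknown bits all one  */

        if (lo > reg->umin)
            reg->umin = lo;
        if (hi < reg->umax)
            reg->umax = hi;
    }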

Once they were able to successfully model the shared refinement operator, they finally got confirmation that modern kernels are sound. Specifically, they were able to show that kernels from version 5.13 onward were sound. The oldest kernel version they tested was 4.14, so that left the problem of how to demonstrate an actual problem in the kernels between those versions — or, if they could not, to discover another deficiency in Agni.

This is where the idea of synthesizing BPF programs came in. If Agni can prove that the verifier's implementation of an operator is not correct, that essentially means that it has figured out a way to add two registers that outputs a concrete value the verifier is not expecting. Then the problem becomes: how to create a BPF program that puts the verifier into those specific abstract states, and ends up calculating the bad final value.

They saw that the real-world failures from earlier kernel versions were generally caused by fairly simple conditions, and so ultimately selected a brute-force approach. Agni will consider every BPF program that uses a series of arithmetic instructions ending in the flawed one in increasing order of program length, and return the smallest that triggers the bug.

This approach worked to generate several proof-of-concept BPF programs for older kernels. Unfortunately, SMT-solving is NP-complete, and, as the verifier has become more complicated, the time it takes Agni to verify that its implementation is correct has grown. Agni ran against kernel version 4.14 for 2.5 hours, against version 5.13 for ten hours, and against version 6.4 for several weeks. Then, Andrii Nakryiko posted a patch that improves the accuracy of the verifier's shared refinement operator, which significantly slows Agni's analysis, leading to timeouts.

Going faster

At this point, the team working on Agni was in a rough place: they had a working tool that could turn up real bugs in the BPF verifier, but it wasn't going to be able to keep up with new kernels because of scaling problems. So they decided to try to break the problem down into subproblems that could be solved independently.

Each abstract operator that Agni extracted from the verifier came to about 5,000 lines of SMT-LIB code. Of those, about 700 lines are the actual operator itself, and the rest is the code for the shared refinement operator. They decided to see if they could verify the shared refinement operator once, and share that proof between all of the operators.

That approach didn't work, because it turns out that the shared refinement operator was also masking latent unsoundness in some of the bitwise operations. These didn't represent real bugs, because in the actual verifier the shared refinement operator was always used. But they did represent a barrier to Agni, because it seemingly made it impossible to verify the shared refinement operator independently of the operations that used it.

The solution ended up being to submit a small fix for the bitwise operators. Once those patches were accepted, the divide-and-conquer approach became feasible, and Agni's run time dropped to less than 30 minutes for kernel version 6.8.

Future work

John Fastabend asked whether modeling the shared refinement operator separately allowed them to say whether the fixed versions of the bitwise operators were more or less precise (in the sense of more closely approximating the minimal set of possible values of the output). Nagarakatte said that it was, in fact, exactly as precise. Daniel Borkmann asked whether they had looked into whether the shared refinement operator could be made more precise. Nagarakatte said that they were experimenting with that internally, and once they have a better refinement operator that they're confident won't break anything, they'll submit a patch set.

Fastabend asked whether they would be able to use the tool to find redundancy in the C code — that is, conditions that the verifier checks even though a check is not needed. Nagarakatte responded that one of his students was working on a project to synthesize abstract operators from scratch, which "should be as good or better than what the kernel does". They've already come up with a more concise representation for abstract values, although the data structure the kernel uses has already been proved to be maximally precise.

Recently, Nagarakatte's student shared a patch that improves the precision of the multiply instruction to work better with negative values. He wants to work with them to put together a paper on the technique once they can explain it, at which point it may be applicable to other parts of the verifier.

With Agni fully described, he then wanted to turn to the topic of how to move forward. The main upcoming problem Nagarakatte foresees is the kernel moving to LLVM 15. His preferred resolution would be for the BPF developers to rewrite the verifier in some abstract specification language, which could be used as an input to Agni and as a source of generated C code. He was optimistic that writing the verifier in a higher-level language would make improving the verifier and reviewing it easier for everyone.

Borkmann mentioned that Nagarakatte had proposed the idea of embedding some kind of domain-specific language (DSL) for the verifier in the comments of the C source code; he asked whether that invites the problem of ensuring that the DSL actually corresponds to the C code. Nagarakatte agreed that was a problem, but it's a much easier problem than parsing C source code correctly without LLVM.

Another audience member pointed out that any DSL for verifier code would be yet another language to learn — "how do we make that easy?" Nagarakatte explained that when he said it would be nice to use a DSL, he didn't mean anything too complicated. One of the problems that they're dealing with in Agni is handling arguments that are passed in pointers; right now, they're relying on LLVM's analysis to remove memory accesses from the code to make modeling it easier. If the developers could specify argument types with a DSL, it could potentially simplify things.

One person asked whether this kind of approach could be extended to other parts of the kernel. Nagarakatte said that there are other static-analysis-based approaches that could be applied to other parts of the kernel. The seL4 microkernel, for example, has a formal proof of correctness. He hasn't been working on that, though; he has been focusing on Agni. Ultimately, as with so many things in open source, it just needs someone to take the time to make it happen.

Amery Hung wanted to know whether there were other parts of the verifier that could be formally verified, beyond arithmetic operations. Nagarakatte said that he was excited about looking at Spectre mitigations, which he thinks may be provably unnecessary in some places. The group is also planning to look at improving precision, and verifying the correctness of the verifier's path-pruning algorithm, which was the subject of Narayana's talk. The path-pruning logic is "leaving a lot on the table", he said, because the logic is widely dispersed throughout the code, which makes it hard to simplify. There were a few more minutes of clarification about the exact claims that Agni proves, and why newer LLVM versions were problematic, but eventually the session came to a close.

Comments (3 posted)

Verifying the BPF verifier's path-exploration logic

By Daroc Alden
May 27, 2025

LSFMM+BPF

Srinivas Narayana led a remote session about extending Agni to prove the correctness of the BPF verifier's handling of different execution paths as part of the Linux Storage, Filesystem, Memory Management, and BPF Summit. The problem of ensuring the correctness of path exploration is much more difficult than the problem of ensuring the correctness of arithmetic operations (which was the subject of the previous session), however. Narayana's plan to tackle the problem makes use of a mixture of specialized techniques — and may need some assistance from the BPF developers to make it feasible at all.

Path exploration is a key component of the BPF verifier, Narayana said. It's what makes it practical for the verifier to infer precise bounds for registers even in the presence of conditionals and loops. The brute-force approach to path exploration would be to consider every possible path through the program. That means considering a number of paths exponential in the number of conditionals, which would be slow.

Instead, the verifier is somewhat selective about exploring paths: it attempts to explore only paths that are essentially different from other paths that have already been considered. In other words, if the register values along a path are a subset of what has already been checked, the verifier can avoid exploring that path because it knows the BPF program has already been verified under more general preconditions.

This optimization substantially speeds up the verification of programs with complex control flow; it's also quite complicated to implement correctly, and has already resulted in at least one security problem. Narayana wants to use Agni to show that the current path-pruning logic is implemented correctly.

Unlike with arithmetic operators, however, specifying what a correct implementation of path pruning looks like is difficult. The core requirement is that pruned paths must only exhibit a subset of the previously explored safe behaviors of the program, but the path-pruning logic depends on several other parts of the verifier to make that determination. For example, the verifier tracks whether each register is used in a subsequent computation (whether it is "alive") in order to decide whether a register can be relevant to a path. So the correctness of path pruning depends on the soundness of this tracking.

There is a lot of existing academic research on how to make sure tracking the future use of a register is correct; the problem is how to apply that research to the verifier. Narayana's proposal is to use the existing research to produce a set of exhaustive tests covering every possible scenario. Testing is not normally thought of as a formal-verification technique, but exhaustive testing is essentially a direct proof of correctness. The difficulty is in showing that the set of tests is actually exhaustive. A similar approach can be taken for other parts of the verifier that deal with tracking the use of registers.

Narayana listed eight total conditions that must be fulfilled for path pruning to be correct. Four of these are basic assumptions about how the verifier is called and the safety properties of BPF programs that must be manually audited by a human. One is already covered by Agni: the correctness of arithmetic operations. Another is the requirement that dataflow algorithms (such as tracking whether registers are alive) are correct, which he intends to ensure through testing. The final two are specific to path pruning: "state containment" and "sound generalization".

State containment

State containment is the simpler property to explain, but it still benefits from the use of an example. In Narayana's slides, he used an image of a control-flow graph to illustrate his point, but for readers without a background in compiler design, this program may be clearer:

    ...
    int r2;
    int r4;
    if (r1 == 10) {
        r4 = 15;
        r2 = r4;
    } else {
        r2 = rand(0, 20);
    }
    int r3 = r1 + r2;
    ...

Suppose the verifier has been verifying a version of this program that has been compiled to BPF, with the integer variables being stored in the BPF registers with the same names. The verifier will reach the assignment to r3 by two different paths: one where r2 is 15, and one where r2 is some number between 0 and 20. The question of state containment is: is the abstract state of the program in the first case a subset of the abstract state of the program in the second case? It's easy to see that if r2 can be anything between 0 and 20, it can also be 15. In fact, Agni already has a correctness proof for the function that calculates these kinds of comparisons in the verifier as part of its existing scope.

What about r4? In the first state, it is also 15. In the second state, it hasn't been assigned to, and therefore reading from it would be forbidden. Logically, if the code were to read from r4 at some point in the future, then the whole program would be rejected. Therefore, it's valid for the verifier to consider the first state as "contained" in the second state: if the second state eventually leads to the program being verified correctly, then the first state would have done the same, so the first state doesn't actually need to be explored further.

In the general case, when the two states being compared have come to the same point in the program via arbitrarily complex paths, the question of state containment breaks down into the same two parts: whether the possible values of registers in one state are a subset of possible values in the other state, and whether the legal dataflow from registers in one state is more restrictive than the legal dataflow from registers in the other state. Narayana wants to research how to formalize the rules for answering the second part correctly. Trying to write down the rules formally will help us prove it, he said.

Sound generalization

The final piece of the puzzle for proving the path-exploration logic correct is sound generalization. Consider this slightly modified example:

    ...
    int r2;
    if (r1 == 10) {
        r2 = 47;
    } else {
        r2 = rand(0, 20);
    }
    int r3 = 5 + r2;
    ...

In this case, the path where r1 is 10 results in a state that is clearly not a subset of the path where r1 is something else. These two states have different possible values for r3. The verifier, however, will sometimes unify these states anyway. Suppose that from this point in the program onward, it doesn't matter whether r3 is from 5 to 25 or 52. Suppose that the program would be correct as long as r3 is less than the length of an Ethernet frame, which is at least 64 bytes. If that were the case, then the verifier could combine the states even though one is not contained in the other.

In general, this kind of pruning (called generalization) is correct as long as the combined state that the verifier creates (such as "r3 is between 5 and 52") is stronger than the weakest precondition required to ensure that the program from this point onward is still safe. This is easy to check for a single program, Narayana said, but figuring out how to prove it for all programs is somewhat tricky.
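
One way to write that condition down (my notation, not the speaker's) is to treat both the combined abstract state and the weakest precondition as sets of concrete states; generalization is then sound when:

    \[
      S_{\mathrm{combined}} \;\subseteq\;
      \mathrm{wp}(\mathit{rest\ of\ program},\ \mathit{safe})
    \]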

His current plan is to take an existing algorithm for finding weakest preconditions that has a proof of correctness, and generate a set of exhaustive tests showing that in every case, the preconditions computed by the verifier are at least as strict as the preconditions computed by the formally verified algorithm. In this way, the proof of correctness for a well-known, high-level algorithm can essentially be automatically extended to cover the verifier's implementation.

The idea of testing the verifier in that way raises an obvious question, however: why not simply use an existing algorithm for finding weakest preconditions directly? Narayana looked at the path-pruning code in the 6.12 kernel, and found that it was not generalizing states in all possible cases, resulting in wasted work spent verifying paths that don't need it. If the verifier were changed to compute the weakest precondition in a systematic way, it would be both more efficient and easier to prove correct (by proving that the C implementation of the weakest-precondition-finding code matches the known-correct high-level algorithm).

Going forward

Path exploration is critical to both the correctness of the verifier, and to its performance, Narayana said. It's a challenging problem, with a lot of opportunity for error. Extending Agni to show that the verifier's path exploration is correct is going to require substantial work. While he and his colleagues intend to keep working on it, there are a few things that the BPF developers can do to help. For one, he would like to be involved in the discussion of any new features that might impact the path-pruning logic.

He reiterated Santosh Nagarakatte's call for the BPF verifier to start including structured comments in a domain-specific language (DSL), to make writing proofs about it easier. In response to a question from the audience, he clarified that he does not have a specific DSL in mind, but introducing any higher-level abstraction over C will make it easier to prove that the verifier implements a particular algorithm that corresponds with existing research.

The assembled BPF developers were generally supportive of his work, although they recognized it as an ambitious project. Agni has already helped eliminate bugs in the simplest parts of the verifier; hopefully, Narayana and his colleagues will be able to bring similar guarantees to the parts of the BPF verifier most in need of them.

Comments (none posted)

Allowing BPF programs more access to the network

By Daroc Alden
May 28, 2025

LSFMM+BPF

Mahé Tardy led two sessions about some of the challenges that he, Kornilios Kourtis, and John Fastabend have run into in their work on Tetragon (Apache-licensed BPF-based security monitoring software) at the Linux Storage, Filesystem, Memory Management, and BPF Summit. The session prompted discussion about the feasibility of letting BPF programs send data over the network, as well as potential new kfuncs to let BPF firewalls send TCP reset packets. Tardy presented several possible ways that these could be accomplished.

Sending data

Tetragon has two general jobs: enforcing security policies and collecting statistics and other information for observability. The way that the latter currently works, Tardy explained, introduces unnecessary copies. BPF programs will create records of events and place them into a ring buffer. Then Tetragon's user-space component reads the events and eventually writes them to a file, a pipe, or a network socket in order to centralize and store them.
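
A minimal sketch of that ring-buffer pattern on the BPF side (not Tetragon's actual code; the event layout and attach point are made up) might look like:

    /* Minimal sketch of the ring-buffer reporting pattern described
     * above; not Tetragon's code. Event layout and hook are made up. */
    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    struct event {
        __u32 pid;
        __u32 type;
    };

    struct {
        __uint(type, BPF_MAP_TYPE_RINGBUF);
        __uint(max_entries, 256 * 1024);
    } events SEC(".maps");

    SEC("kprobe/some_hook")            /* hypothetical attach point */
    int report_event(void *ctx)
    {
        struct event *e = bpf_ringbuf_reserve(&events, sizeof(*e), 0);

        if (!e)
            return 0;
        e->pid = bpf_get_current_pid_tgid() >> 32;
        e->type = 1;
        bpf_ringbuf_submit(e, 0);      /* user space reads and forwards */
        return 0;
    }

    char LICENSE[] SEC("license") = "GPL";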

[Mahé Tardy]

That requires a minimum of two copies between the kernel and user space. While exploring alternatives, Tardy realized that this situation could be avoided if BPF programs were allowed to call vmsplice(). The user-space agent could give the BPF program a file descriptor, and let BPF call vmsplice() to forward the information. Eventually, it might be possible to remove the user-space agent altogether.

An alternative to vmsplice() would be to use io_uring to perform the same operations. Tardy clarified that for his use case, he really mostly cares about being able to send data over the network. Generally, Tetragon sends two types of data: alerts and periodic reports. The periodic reports are created in a timer callback, which may cause additional complications since he isn't sure whether those are called in a sleepable context.

Andrii Nakryiko thought that a synchronous send operation — which could block for a long time — would be a bad fit for BPF. Tardy agreed, saying that an asynchronous send operation would be fine. Nakryiko thought this was a lot of effort to avoid a small number of copies. Alexei Starovoitov pointed out that there is such a thing as a kernel TCP socket, so this is technically possible. Also, workqueues call their tasks in a sleepable context, so the operation could be run as a workqueue item and that would work. He agreed that it seemed like a lot of effort to avoid user-space copies, though.

Tardy explained that forwarding these reports is "almost the last thing the agent is doing". If it could be done in BPF, Tetragon would be close to being implemented in pure BPF. Although he didn't speak to why this would be desirable, an earlier session had raised the idea of making security software harder to tamper with by avoiding user-space components, so that may have been what he had in mind.

Starovoitov pointed out that there is an ongoing effort to use netconsole to send kernel log messages over TCP. So perhaps Tetragon's BPF programs could be made to print to the console, which is then sent over TCP. Daniel Borkmann asked whether netconsole could send arbitrary data; Starovoitov said that it could. Tardy suggested that they could start by prototyping something using netconsole's existing UDP-based messages. The session ended without coming to a firm conclusion, but Tardy left with a number of new directions to explore.

TCP reset

Currently, it is possible for BPF firewalls to drop packets, and therefore de facto terminate a TCP connection. It would be friendlier, Tardy said in his second session, to send a TCP reset to immediately terminate the connection. This is already what other firewalls, like netfilter, do; Tardy wants to add a kfunc to let BPF programs do the same thing.

One possible way to add that would be to extend the bpf_sock_destroy() function that Aditi Ghag added in 2023. That function lets BPF programs close sockets in specific circumstances: while inside an iterator and holding the socket lock. The fact that it sends a TCP reset is really a side effect of its main operation, but it is somewhat related.
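
A rough sketch of that existing mechanism follows; the kfunc signature and context-field names here are assumptions based on how the function was described when it was merged, not verified against a specific kernel version:

    /* Rough sketch only: destroy a socket from a TCP socket iterator.
     * Destroying an established TCP socket also causes a reset to be
     * sent to the peer. Signatures here are assumptions. */
    #include "vmlinux.h"
    #include <bpf/bpf_helpers.h>

    extern int bpf_sock_destroy(struct sock_common *sk) __ksym;

    SEC("iter/tcp")
    int destroy_matching(struct bpf_iter__tcp *ctx)
    {
        struct sock_common *sk = ctx->sk_common;

        if (!sk)
            return 0;
        /* a real program would apply some policy decision here */
        bpf_sock_destroy(sk);
        return 0;
    }

    char LICENSE[] SEC("license") = "GPL";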

Borkmann pointed out that using bpf_sock_destroy() would only work if the socket existed on the machine in question; a firewall sitting between a client and a server would need a different way to send a reset. Another member of the audience suggested setting up an unroutable route, forwarding a packet from the TCP connection to that, and letting the existing networking stack handle the rest.

There is already a kernel function that allows BPF programs to send TCP acknowledgment messages; in light of that, adding one for sending reset messages struck some people as not a big deal. Ultimately, this discussion didn't reach a conclusion either, but there was no real opposition to the idea of allowing BPF programs to cleanly terminate TCP connections.

Comments (5 posted)

Reports from OSPM 2025, day two

By Jonathan Corbet
May 23, 2025

OSPM
The seventh edition of the Power Management and Scheduling in the Linux Kernel Summit (known as "OSPM") took place on March 18-20, 2025. Topics discussed on the second day include improvements to device suspend and resume, the status and future of sched_ext, the scx_lavd scheduler, improving the efficiency of load balancing, and hierarchical constant bandwidth server scheduling.

As with the coverage from the first day, each report has been written by the named speaker.

Device suspend/resume improvements

Speaker: Rafael J. Wysocki (video)

Possible improvements to device suspend and resume during system-wide power-management (PM) transitions were discussed. To start with, Wysocki said that this topic was not particularly aligned with the general profile of the conference, which focused on scheduling and related problem spaces, but he thought that spending some time on it might be useful anyway. It would be relatively high-level, though, so that non-experts could follow it.

He provided an introductory part describing the design of the Linux kernel's code that handles transitions to system sleep states and back to the working state, and the concepts behind it.

A system is in the working state, he said, when user-space processes can run. There are also system states, referred to as system sleep states, in which user space is frozen and doesn't do any work; these include system suspend and hibernation. The system enters sleep states to save energy, but when user work needs to be done, it goes back to the working state. Those transitions, referred to as system suspend and resume, respectively, affect the system as a whole and, if the kernel is configured to support system sleep states, every system component needs to play its part in handling them. In other words, support for system suspend and resume (and hibernation, if the kernel is configured to support it) is mandatory.

As a rule, transitions from the working state into one of the sleep states are initiated by user space, but transitions from a sleep state back into the working state are started in response to a signal from a device; this signal is referred to as a system wakeup event. Devices allowed to trigger system wakeup events are referred to as wakeup devices.

When a transition into a system sleep state is started, all devices need to be suspended. All activity must be stopped, hardware needs to go into low-power states, and wakeup devices need to be configured to trigger wakeup events. During a transition back into the working state, the reverse needs to happen, except that, in some cases, it is possible (or even desirable) to leave a device in suspend after a system resume and let it be handled by run-time power management. All of that should be as fast as reasonably possible because some systems, like phones, suspend and resume often.

In the working state, individual components of the system are subject to power management (PM) through frameworks like run-time PM, device-frequency scaling (devfreq), CPU-frequency scaling (cpufreq), CPU idling (cpuidle), energy-aware scheduling (EAS), power capping, and thermal control. Obviously, this needs to be taken into account when the system goes into a sleep state. Some devices may need to be reconfigured, which may require accessing their registers, and they may need to be resumed to satisfy dependencies. On the way back to the working state, care must be taken to maintain consistency with working-state PM.

Dependencies between devices must be taken into account during transitions between the working state and sleep states. Obviously, children depend on their parents, but there are also dependencies between suppliers and consumers, represented in the kernel by device links. Dependent devices cannot be suspended after the devices they depend on and they cannot be resumed before those devices.

Three layers of code are involved in transitions between the working state and sleep states of the system. The PM core is responsible for the high-level flow control, the middle-layer code (bus types, classes, device types, PM domains) takes care of commonalities (to avoid duplication of code, among other things), and device drivers do device-specific handling. As a rule, the PM core invokes the middle-layer code that, in turn, invokes device drivers, but in the absence of the middle-layer code, the PM core can invoke device drivers directly.

There are four phases to both the suspend and resume processes. In the "prepare" phase of suspend, new children are prevented from being added under a given device and some general preparations take place, but hardware settings should not be adjusted at that point. As a general rule, device activity is expected to be stopped in the "suspend" phase; the "late suspend" and "suspend noirq" phases are expected to put hardware into low-power states.

Analogously, the "resume noirq" and "early resume" phases are generally expected to power-up hardware. If necessary, the "resume" phase is expected to restart device activity, and the "complete" phase reverses the actions carried out during the "prepare" phase. However, what exactly happens to a given device during all of those phases depends on the specific combination of the middle-layer code and the device driver handling it.

The "noirq" phases are so-called because interrupt handlers supplied by device drivers are not invoked during these phases. Interrupts are handled during that time in a special way such that interrupts involved in triggering wakeup events will cause the system to go back to the working state (resume). Run-time PM of devices is disabled during the "late suspend" phase and it is re-enabled during the "early resume" phase, so those phases can be referred to as "norpm" (no-run-time-PM) phases.

The handling of devices during transitions between the working state and sleep states of the system is coordinated with device run-time PM to some extent. The PM core freezes the run-time PM workqueue before the "prepare" phase and unfreezes it after the "complete" phase. It also increments the run-time PM usage counter of every device in the "prepare" phase and decrements that counter in the "complete" phase, so devices cannot run-time suspend during system-wide transitions, although they can run-time resume during the "prepare", "suspend", "resume", and "complete" phases.

Moreover, the PM core takes care of disabling and re-enabling run-time PM for every device during the "late suspend" and "early resume" phases, respectively. In turn, the middle-layer code and device drivers are expected to resume devices that cannot stay in run-time suspend during system transitions; they must also prevent devices that are not allowed to wake up the system from doing so.

All of this looks kind of impressive, Wysocki said, but there are issues with it. At this point, he showed a photo of the Leaning Tower of Pisa, to the visible amusement of the audience. Fortunately, he said, the Linux kernel's suspend and resume code is safely far from collapsing.

One of the issues that is currently being tackled is related to asynchronous suspend and resume of devices during system transitions between the working state and sleep states.

Generally speaking, there are devices that can be handled out of order with respect to any other devices so long as all of their known dependencies are met; they are referred to as "async" devices. The other devices, referred to as "sync" devices, must be handled in a specific order that is assumed to cover all of the dependencies, the known ones as well as the unknown ones, if any. Of course, the known dependencies between the async and sync devices, represented through parent-child relationships or by device links, must be taken into account as well.

Each of the suspend and resume phases walks through all of the devices in the system, including both the async and sync devices, and the problem is how to arrange that walk. For instance, the handling of all async devices may be started at the beginning of each phase (this is the way device resume code works in the mainline kernel), but then the threads handling them may need to wait for the known dependencies to be met, and starting all of those threads at the same time may stress the system. The processing of async devices may also be started after handling all of the preceding sync devices (this is the way device suspend code works in the mainline kernel), but, in that case, starting the handling of some async devices earlier may speed up the transition. That will happen if there are async devices without any known dependencies, for example.

There are other possibilities, and the working consensus appears to be that the handling of an async device should be started when some known dependencies are met for it (or it has no known dependencies at all). The question that remains is whether or not to wait until all known dependencies are met for an async device before starting the handling of it.

Regardless of the way the ordering issue is resolved, the handling of the slowest async device tends to take the majority of the time spent in each suspend and resume phase. Consequently, if there are three devices, each of which happens to be the slowest one in a different suspend phase, combining all of the phases into one would reduce the total suspend time. Along these lines of reasoning, reducing the number of suspend and resume phases overall, or moving "slow" device handling to the phases where there is other slow work already, may cause suspend and resume to become faster.

Another area of possible improvement is the integration of system transitions between the working state and sleep states with the run-time PM of devices. This integration is needed because leaving run-time suspended devices in suspend during system transitions may both save energy and reduce the system suspend and resume duration. However, it is not always viable, and drivers need to be prepared for this optimization; if they want devices to be left in suspend, they need to opt in.

Currently, there are three ways to do so:

  • Participate in the so-called "direct-complete" optimization, causing the handling during a system suspend and resume cycle to be skipped for a device if it is run-time-suspended to start with. Hence the name; all suspend and resume phases except for "prepare" and "complete" are skipped for those devices, so effectively they go directly from the "prepare" to the "complete" phase.
  • Set the DPM_FLAG_SMART_SUSPEND driver flag.
  • Use pm_runtime_force_suspend() as a system suspend callback.

Unfortunately, the first option is used rarely, and the other two are not compatible with each other (drivers generally cannot do both of them at the same time). Moreover, some middle-layer code only works with one of them.

Even if the driver opts in to leave the device in suspend, the device may still have to be resumed because of the wakeup configuration. Namely, run-time PM enables wakeup signaling for all devices that support it, so that run-time suspended devices can signal a need to take care of some event coming from the outside of the system. The power-management subsystem wants to be transparent and it doesn't want to miss any signal that may require the user's attention.

On the other hand, only some of the wakeup-capable devices are allowed to wake up the whole system from sleep states, because there are cases in which the system needs to stay in a sleep state until the user specifically wants it to resume (for example, a laptop with a closed lid in a bag). For this reason, if a wakeup-capable device is run-time suspended prior to a system transition into a sleep state, and it is not allowed to wake up the system from sleep, it may need to be resumed and reconfigured during that transition. For some devices, the wakeup setting may be adjusted without resuming them, but that is not a general rule.

Apart from the above, there are dependencies on the platform firmware and on other devices that may require a given device to be resumed during a system transition into a sleep state. Usually, middle-layer code knows about those dependencies and it will act accordingly, but this means that drivers generally cannot decide by themselves what to do with the devices during those transitions and some cooperation between different parts of the code is required.

Leaving devices in suspend during a transition from a sleep state to the working state of the system may also be beneficial, but it is subject to analogous limitations.

Drivers that don't opt in for the direct-complete optimization may need to specifically opt in for leaving devices in suspend during system resume. If they use pm_runtime_force_suspend() as a suspend callback, they also need to use pm_runtime_force_resume() as a resume callback; this means that the device will be left in suspend unless it was in use prior to the preceding system suspend (that is, its run-time PM usage counter is nonzero or some of its children have been active at that time). If drivers set DPM_FLAG_SMART_SUSPEND, they also need to set DPM_FLAG_MAY_SKIP_RESUME to allow devices to be left in suspend.
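
As a hedged sketch of what those opt-ins look like in driver code (with "foo" as a hypothetical driver), the first pairing reuses the run-time PM callbacks for system sleep, while the second is expressed with driver flags, typically set at probe time:

    /* Hypothetical driver "foo": reuse the run-time PM callbacks for
     * system sleep via pm_runtime_force_suspend()/_resume(). */
    #include <linux/pm.h>
    #include <linux/pm_runtime.h>

    static int foo_runtime_suspend(struct device *dev)
    {
        /* put the hardware into a low-power state */
        return 0;
    }

    static int foo_runtime_resume(struct device *dev)
    {
        /* power the hardware back up */
        return 0;
    }

    static const struct dev_pm_ops foo_pm_ops = {
        SET_RUNTIME_PM_OPS(foo_runtime_suspend, foo_runtime_resume, NULL)
        SET_SYSTEM_SLEEP_PM_OPS(pm_runtime_force_suspend,
                                pm_runtime_force_resume)
    };

    /* The alternative opt-in, set from the driver's probe() function:
     *
     *   dev_pm_set_driver_flags(dev, DPM_FLAG_SMART_SUSPEND |
     *                                DPM_FLAG_MAY_SKIP_RESUME);
     */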

However, if a given device is not allowed to wake up the system from sleep, and it cannot be reconfigured without resuming, leaving it in suspend is not an option. Also, if the platform firmware powers up devices during system resume before passing control to the kernel, it is more useful to resume all of them and leave the subsequent PM handling to run-time PM.

All of this needs to be carefully put in order. Different driver opt-in variants need to be made to work with each other and with all middle-layer code. Clear criteria for resuming run-time suspended devices during system transitions between the working state and sleep states need to be agreed on and documented, and all middle-layer code needs to adhere to them. In particular, device_may_wakeup() needs to be taken into account by all middle-layer code and in the absence of it, by device drivers and the PM core.

In addition to the above, it can be observed that for all devices with run-time PM enabled, run-time PM callbacks should always be suitable for resuming them during transitions from system suspend into the working state unless they are left in suspend. In principle, some significant simplifications of device handling during system resume may result from this observation, but again this will require quite a bit of work.

Sched_ext: current status, future plans, and what's missing

Speakers: Andrea Righi (video) and Joel Fernandes (video)

This talk covered the status of sched_ext: a technology that allows schedulers to be implemented as BPF programs that are loaded at run time. The core functionality of sched_ext is now maintained in the kernel (after the merge that happened in 6.12) and it's following the regular development workflow like any other subsystem.

Individual schedulers, libraries, and tooling are maintained in a separate repository. This structure was intentionally chosen to encourage fast experimentation within each scheduler. While changes still go through a review process, this separation allows a quicker development process. There is also a significant portion of this shared code base that is written in Rust, mostly topology abstractions and architectural properties that are accessible from user space and can be shared with the BPF code using BPF maps.

The community of users and developers keeps growing, and the major Linux distributions have almost caught up with the kernel and provide packages for the main sched_ext schedulers.

An important question, raised by Juri Lelli, centered around the relationship with the kernel's completely fair scheduler (referred to here as "fair.c") and whether it's worthwhile to reuse some of its functionality to avoid code duplication. In fact, sched_ext, being implemented as a new scheduling class, includes its own implementation of a default scheduling policy. BPF-based schedulers can then override this default behavior by implementing specific callbacks. The default implementation in sched_ext could just reuse parts of fair.c where appropriate to minimize code duplication and allow users to build on a base that closely mirrors the kernel's default behavior.

However, reusing fair.c code is challenging due to its deep integration with various parts of the kernel scheduler. Features like energy and capacity awareness (EAS and CAS), which are not completely supported in sched_ext, complicate code reuse; introducing dependencies from sched_ext back into fair.c should also be avoided.

Given these challenges, the consensus for now is to keep sched_ext independent by reimplementing similar functionality within its core. In doing so, the goal is to remain as consistent as possible with fair.c, with the possibility of converging toward a shared code base in the future. This approach also presents an opportunity to revisit and possibly eliminate some legacy heuristics embedded in fair.c, making it a potentially beneficial process for everyone.

Another topic that was discussed is how to prevent starvation of SCHED_EXT tasks when a task running at a higher scheduling class is monopolizing a CPU. The proposed solution is to implement a deadline server, similar to the approach used to prevent starvation of SCHED_NORMAL tasks. This work is currently being handled by Joel Fernandes.

One of the sched_ext key features highlighted in the talk is its exit dump-trace functionality: when a scheduler encounters a critical error, the sched_ext core automatically unloads it, reverting to the default scheduler, and triggering the user-space scheduler program to emit a detailed trace containing diagnostic information. This mechanism also activates if a task is enqueued to a dispatch queue (a sched_ext run queue), but is not scheduled within a certain timeout, making it especially useful for detecting starvation scenarios.

Currently, there's no equivalent mechanism in fair.c to capture such traces. Thomas Gleixner suggested that we could achieve similar insights using tracepoints. Lelli added that, before the deadline server existed, the stalld daemon served a similar purpose: it monitored threads stuck in a run queue for too long without being scheduled, then temporarily boosted them using the SCHED_DEADLINE policy to grant them a small run-time slice. While the deadline server now can handle this in-kernel, stalld could still be used for its monitoring capabilities.

A potential integration with cpuidle was also discussed; Vincent Guittot pointed out that the cpuidle quality-of-service latency interface can simply be used from user space, which is probably a reasonable solution, as it just involves some communication between BPF and user space, and there is really no need to add a new sched_ext-specific API for that.
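
One such user-space interface (whether it is exactly the one Guittot had in mind is not certain) is the PM QoS CPU-latency file; a small sketch, with an illustrative 20µs limit, would be:

    /* Sketch of the user-space cpuidle QoS interface: writing a CPU
     * latency limit (a 32-bit value in microseconds) to
     * /dev/cpu_dma_latency constrains how deep idle states may go for
     * as long as the file descriptor stays open. The value here is
     * only illustrative. */
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        int32_t latency_us = 20;
        int fd = open("/dev/cpu_dma_latency", O_WRONLY);

        if (fd < 0) {
            perror("open");
            return 1;
        }
        if (write(fd, &latency_us, sizeof(latency_us)) != sizeof(latency_us)) {
            perror("write");
            close(fd);
            return 1;
        }
        pause();        /* the constraint holds while fd remains open */
        close(fd);
        return 0;
    }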

The talk also briefly touched on the concept of tickless scheduling using sched_ext. A prototype scheduler (scx_tickless) exists; it routes all scheduling events to a designated subset of CPUs, while isolating the remaining CPUs. These isolated CPUs are managed to run a single task at a time with an effectively infinite time slice. If a context switch is needed, it is triggered via a BPF timer and handled by the manager CPUs using an inter-processor interrupt (allowing the scheduler to determine an arbitrary tick frequency, managed by the BPF timer). When combined with the nohz_full boot parameter, this approach enables the running of tasks on isolated CPUs with minimal noise from the kernel, which can be an appealing property for virtualization and high-performance workloads, where even small interruptions can impact performance.

That said, the general consensus from the audience was that the periodic tick typically introduces an overhead that is barely noticeable, so further testing and benchmarking will be necessary to validate the benefits of this approach.

Other upcoming features in sched_ext include the addition of richer topology abstractions within the core sched_ext subsystem and support for loading multiple sched_ext schedulers simultaneously in a hierarchical setup, integrated with cgroups.

What can EEVDF learn from a special-purpose scheduler? The case of scx_lavd

Speaker: Changwoo Min (video)

Min gave a talk on a gaming-focused, sched_ext-based scheduler, scx_lavd (which was also covered here in September 2024). The talk started with a quick overview of the scx_lavd scheduler and its goals. Scx_lavd is a virtual-deadline-based scheduler (like EEVDF) specialized for gaming workloads. This approach was chosen because a virtual deadline is a nice framework to express fairness and latency in a unified manner. Moreover, by sharing a common foundation, there could be opportunities for the two schedulers to share lessons learned and exchange ideas.

The technical goals of scx_lavd are achieving low tail latency (and thus high frame rates in gaming), lower power consumption, and smarter use of heterogeneous processors (like ARM big.LITTLE). He added that if scx_lavd achieves all three, it will be a better desktop scheduler, which is his stretch goal.

He clarified that the main target applications are unmodified Windows games running on the Proton/Wine layer, so it is hard to expect additional latency hints from the application. An audience member asked if Windows provides an interface for specifying latency requirements. Min answered that it does, and that if a game or a game engine provides latency hints, the information can be handed down to scx_lavd through the Proton/Wine layer.

Games are communication-intensive; 10-20 tasks are easily involved in finishing a single job (such as updating the display after a button press), and they communicate through primitives such as futexes, epoll, and NTSync. A scheduling delay in any one of those tasks can cause cascading delays and latency (frame-time) spikes.

The key question is how to determine which tasks are latency-critical. Min explained that a task in the middle of a task chain is latency-critical, so scx_lavd gives a shorter deadline to such a task, causing it to execute sooner. To decide whether a task is in the middle of a task chain, scx_lavd measures how frequently a task is blocked waiting for an event (blocking frequency) and how often a task wakes up another task (wakeup frequency). High blocking frequency means that the task usually serves as a consumer in a task chain, and high wakeup frequency indicates that the task frequently serves as a producer. Tasks with both high blocking and wakeup frequencies are in the middle of the chain somewhere.
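As a purely illustrative sketch of that heuristic (the names, weighting, and clamping below are invented for the example and are not taken from scx_lavd), the two frequencies can be combined into a criticality score that shrinks a task's virtual deadline:

    /* Illustrative only: combine the blocking and wakeup frequencies into a
     * latency-criticality score and shorten the virtual deadline accordingly.
     * The formula and constants are made up for this sketch. */
    #include <stdint.h>

    struct task_stats {
        uint64_t block_freq;    /* how often the task blocks waiting (consumer side) */
        uint64_t wake_freq;     /* how often the task wakes others (producer side) */
    };

    static uint64_t scale_deadline(uint64_t base_deadline_ns,
                                   const struct task_stats *ts)
    {
        /* Only tasks that both block often and wake others often, i.e. sit in
         * the middle of a chain, get a large score and thus a short deadline. */
        uint64_t lat_cri = ts->block_freq * ts->wake_freq;
        uint64_t dl = base_deadline_ns / (1 + lat_cri);
        uint64_t floor = base_deadline_ns / 8;  /* never shrink below 1/8th */

        return dl < floor ? floor : dl;
    }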

Participants asked about memory consumption (potentially proportional to the square of the number of tasks), the time to reach the steady state, how to decay those frequencies, and the relationship to proxy execution. Min answered that scx_lavd simply measures the frequencies without distinguishing individual wakers and wakees, so it is quite cheap. The frequencies are decayed using the standard exponential weighted moving average (EWMA) technique, which converges quickly (within a few hundred milliseconds) in practice. Also, compared to proxy execution, which strictly tracks a lock holder and its waiters, scx_lavd's approach to tracking task dependencies is much looser.
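For reference, the EWMA update in question has the usual form; the fixed-point version below (with an arbitrary 1/4 weight per sample, not scx_lavd's actual constant) is how such a decay is typically written without floating point:

    /* Standard EWMA decay in fixed-point style: avg <- 3/4 * avg + 1/4 * sample.
     * The 1/4 weight (a shift by 2) is an arbitrary choice for this sketch;
     * old samples decay geometrically, so the average converges after a
     * handful of update periods. */
    #include <stdint.h>

    static inline int64_t ewma_update(int64_t avg, int64_t sample)
    {
        return avg + ((sample - avg) >> 2);
    }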

After explaining how scx_lavd identifies and boosts latency-critical tasks, Min showed a video demo of a game that achieves high, stable frame rates while running alongside a background job. That led to further discussion of scx_lavd's findings. Peter Zijlstra mentioned that the determination of latency-critical tasks is something that could be considered for the mainline scheduler, but breaking fairness is not.

Min moved on to how scx_lavd reduces power consumption. He is particularly interested in the case where the system is under-utilized (say, 20-30% CPU utilization), such as when running an old, casual game. He explained the idea of core compaction, which limits the number of actively used CPUs according to the system load, allowing the inactive CPUs to stay longer in a deeper idle state and saving power. The relevance of energy-aware scheduling (EAS) was discussed; it was also suggested that core compaction should consult the energy model to make more accurate decisions on a broader variety of processors.
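A minimal sketch of the core-compaction idea (the headroom factor and rounding below are invented for the example; scx_lavd's actual policy is more elaborate) sizes the active-CPU set from the aggregate utilization:

    /* Illustrative core compaction: derive the number of CPUs to keep active
     * from the summed utilization, with some headroom; the rest can sit in
     * deep idle states. */
    #include <stdint.h>

    static uint32_t nr_active_cpus(uint32_t nr_cpus, uint32_t util_pct_sum)
    {
        /* util_pct_sum is the total utilization in percent across all CPUs,
         * e.g. 8 CPUs at 25% each -> 200. Add 25% headroom (arbitrary) and
         * round up. */
        uint32_t want = (util_pct_sum * 5 / 4 + 99) / 100;

        if (want < 1)
            want = 1;
        if (want > nr_cpus)
            want = nr_cpus;
        return want;
    }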

Reduce, reuse, recycle: propagating load-balancer statistics up the hierarchy

Speaker: Prateek Nayak Kumbla (video)

With growing core counts, the overhead of newidle balancing (load balancing performed when a CPU is about to enter the idle state) has become a scalability concern on large deployments. The past couple of years saw strategies such as ILB_UTIL and SHARED_RUNQ being proposed in the community to reduce the cost of idle balancing and to make it more efficient. This talk covered a new approach to optimize load balancing by reducing the cycles in its hottest function — update_sd_lb_stats().

The talk started by showing the benefit of newidle balancing by simply bypassing it; doing so made almost all of the workloads tested unhappy. The frequency and the opportunistic nature of newidle balancing ensure that imbalances are checked for often; as a result, load is balanced opportunistically before the periodic balancer kicks in.

update_sd_lb_stats(), which is called at the beginning of every load-balancing attempt, iterates over all of the groups of a scheduling domain, calling update_sg_lb_stats(), which, in turn, iterates over all of the CPUs in the group and aggregates the load-balancing statistics. When iterating over multiple domains, which is regularly the case with newidle balancing, the statistics computed at a lower domain are never reused; they are always computed again from scratch, even though the domains are processed back-to-back with no delay between them.

The new approach being proposed enables statistics reuse by propagating statistics aggregated at a lower domain when load balancing at a higher domain. This approach was originally designed to reduce the overheads of busy periodic balancing; Kumbla presented the pitfalls of using it for newidle balancing.
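The idea can be illustrated with a simplified cache of per-group aggregates; this sketch is not the posted patches, and the single cached field and fixed validity window stand in for the real invalidation logic discussed below:

    /* Simplified illustration of statistics reuse, not the posted patches:
     * cache the aggregate that a scheduling group computed while balancing a
     * lower domain, and reuse it at the next-higher domain if it is recent
     * enough, instead of walking every CPU in the group again. */
    #include <stdbool.h>
    #include <stdint.h>

    struct sg_stats_cache {
        uint64_t stamp_ns;      /* when the aggregate was computed */
        uint64_t group_load;    /* cached aggregate (load, utilization, ...) */
        bool valid;
    };

    #define STATS_REUSE_WINDOW_NS (500 * 1000ULL)   /* arbitrary validity window */

    static bool reuse_group_stats(const struct sg_stats_cache *c, uint64_t now_ns,
                                  uint64_t *load_out)
    {
        if (!c->valid || now_ns - c->stamp_ns > STATS_REUSE_WINDOW_NS)
            return false;       /* stale: recompute by iterating the CPUs */
        *load_out = c->group_load;
        return true;
    }

    static void store_group_stats(struct sg_stats_cache *c, uint64_t now_ns,
                                  uint64_t load)
    {
        c->stamp_ns = now_ns;
        c->group_load = load;
        c->valid = true;
    }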

Using data from perf sched stats with the sched-messaging benchmark as the workload, it was noted that aggressively reusing statistics without any invalidation can lead to newidle balancing converging on groups that are no longer busy. The data also showed a dramatic reduction in the cost of newidle balancing, which was promising. Even with a naïve invalidation strategy, though, the regression in several workloads remained, which prompted further investigation. It was noted that the idle_cpu() check in the scheduler first checks whether the currently running task is the swapper task. Newidle balancing is done prior to a context switch, and a long time spent there can confuse the wakeup path by making the CPU appear busy. Kumbla noted that perhaps the ttwu_pending bit could be reused to signal all types of wakeups, allowing the check for the swapper task to be removed from the idle_cpu() function.

Zijlstra noted that perhaps Guittot's push task mechanism can be used to redesign the idle and newidle balancing, and the statistics propagation can help reduce the overheads of busy-load balancing. Guittot mentioned an example implementation that uses a CPU mask to keep track of all the busy CPUs to pull from and idle CPUs to push tasks to. A prototype push approach was posted soon after OSPM as an RFC to flesh out the implementation details.

Zijlstra also noted that, during busy balancing, it is always the first CPU of the group that does the work for the domain, but perhaps that burden can be rotated among all the CPUs of the domain. There were some discussions on load-balancing intervals and how the statistics propagation would require aligning them for better efficiency. Kumbla noted that the prototype already contains a few tricks to align the intervals, but it could be further improved.

Fernandes questioned whether the statistics can still be considered valid if tasks were moved at a lower domain. It was noted that reusing statistics should be safe for busy-load balancing, since only load or utilization is migrated, and the aggregates of those statistics remain the same even if tasks are moved around at lower domains.

Julia Lawall asked if there have been any pathological cases where statistics propagation has backfired, to which Kumbla replied that busy balancing is so infrequent compared to newidle balancing that it is very unlikely that a single wrong decision will have any impact. Kumbla also asked for more testing to ensure that there are no loopholes in the logic.

The talk went on to discuss yet another strategy to optimize newidle balancing: a fast path based on tracking the busiest CPU in the last-level-cache (LLC) domain and trying to pull load from that CPU first. It was noted that, despite yielding some benefit at lower utilization, the fast path completely falls apart when multiple newidle-balance operations run concurrently and lock contention on the busiest CPU leads to diminishing returns.

The talk finished by discussing SIS_NODE, which expands the wakeup search space beyond the LLC domain to the entire NUMA node. It was noted that, despite looking promising at lower utilization, SIS_NODE quickly falls apart at higher utilization, where the overhead of the larger search space becomes evident when it fails to find an idle CPU. A guard like SIS_UTIL is required as a prerequisite to make it viable, but its implementation remains a challenge, especially in the face of bursty workloads and the ever-growing size of the node domain.

Hierarchical CBS with deadline servers

Speakers: Luca Abeni, Yuri Andriaccio (video)

This talk presented a new implementation of the hierarchical constant bandwidth server (HCBS), an extension of the constant bandwidth server that allows scheduling multiple independent, realtime applications through control groups, providing temporal isolation guarantees. HCBS will allow realtime applications inside control groups to be scheduled using the SCHED_FIFO and SCHED_RR scheduling policies.

In HCBS, control groups are scheduled through SCHED_DEADLINE, using the deadline-server mechanism. Each group is associated with a bandwidth reservation (over a specified period), which is distributed among all CPUs. Whenever a control group is deemed runnable, the scheduler is recursively invoked to pick the realtime task to schedule.

The proposed mechanism can be used for various purposes, such as having multiple independent realtime applications on the same machine, guaranteeing that they cannot interfere with each other, and providing access to realtime scheduling policies inside control groups, enforcing bandwidth reservation and control for those policies.

The proposed scheduler aims to replace and improve upon the already-implemented RT_GROUP_SCHED scheduler, reducing its invasiveness in the scheduler's code and addressing a number of problems:

  • HCBS uses SCHED_DEADLINE and the deadline-server mechanism to enforce bandwidth allocations, thus removing all the custom code RT_GROUP_SCHED uses. The deferred behavior of the deadline server must not be used in HCBS, which is different from how deadline servers are used to enforce run time for SCHED_OTHER tasks.
  • HCBS reuses the non-control-group code of the realtime scheduling classes to implement the local scheduler, with a few additional checks, to be as non-invasive as possible.
  • The use of deadline servers solves the "deferrable server" issue of the RT_GROUP_SCHED scheduler.
  • HCBS removes RT_GROUP_SCHED's run-time-migration mechanism and migrates only tasks instead, moving them from CPUs that have exhausted their run time to CPUs that still have time available; this allows the allocated bandwidth to be fully exploited.
  • The HCBS scheduler has strong theoretical foundations. If users allocate an appropriate budget (computed by using realtime analysis), then it will be possible to guarantee respect for the application's temporal constraints.
  • It also performs admission control to guarantee that it can effectively provide the requested bandwidth.

The current patchset is based on kernel version 6.13, but it is not complete yet. It passes most of the Linux Test Project tests and other custom-tailored stress tests. Tests with rt-app are consistent with realtime theory.

A number of somewhat arbitrary implementation decisions were discussed with the OSPM audience:

  • The HCBS scheduler should only be available for the version-2 control group hierarchy.
  • The bandwidth enforcement should not affect the root control group, to keep the current implementation of realtime policies.
  • Tasks should only be allowed to run in leaf groups. Non-leaf control groups are only used to enforce partitioning of CPU time.
  • Multi-CPU run-time allocation should follow the allowed CPU mask of the control group (cpuset.cpus file); disabled CPUs should not have run time allocated.
  • The assignment of different run times for a given set of CPUs is currently done through the rt_multi_runtime_us knob, but reusing the standard rt_runtime_us knob has been suggested.
  • RT_GROUP_SCHED's run-time migration has been removed to prevent over-commitment and CPU starvation. It was suggested to look into ways of performing such migration, whenever it is safe, to prevent unnecessary context switches.

As pointed out in the discussion, the scheduling mechanism may behave counter-intuitively when over-committing. Suppose a control group is allocated on two CPUs, with a bandwidth reservation of 0.5 on each, and two FIFO tasks are run: the first with priority 99 and a usage of 0.8, the second with priority 50 and a usage of 0.5, for a total usage of 1.3 that over-commits the allocated bandwidth of 1.0. If the CPUs activate in parallel, both tasks will activate and consume all of the available bandwidth. The priority-50 task will get its requested bandwidth, while the priority-99 task, even though it has higher priority, will consume only 0.5 of its 0.8 usage. The result may also vary with a different distribution of the bandwidth across the same number of CPUs.

The expected behavior, instead, would be for higher-priority tasks to have first claim on the total CPU bandwidth; in this case, the priority-99 task should always get its requested bandwidth. Since these situations only arise when over-committing, and thus fall outside of the theoretical analysis, they should not pose a problem.
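The arithmetic of the example can be spelled out in a few lines of C; this is purely a restatement of the numbers above, under the assumption that each task ends up confined to one per-CPU server:

    /* Restating the over-commit example: with one 0.5-bandwidth server per CPU
     * and each task confined to one server, each task simply receives
     * min(demand, server bandwidth), regardless of its priority. */
    #include <stdio.h>

    static double served(double demand, double server_bw)
    {
        return demand < server_bw ? demand : server_bw;
    }

    int main(void)
    {
        double bw = 0.5;                 /* per-CPU reservation */
        double prio99 = served(0.8, bw); /* gets 0.5 of the requested 0.8 */
        double prio50 = served(0.5, bw); /* gets its full 0.5 */

        printf("prio 99 gets %.1f of 0.8, prio 50 gets %.1f of 0.5\n",
               prio99, prio50);
        return 0;
    }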

Comments (none posted)

Page editor: Jonathan Corbet

Brief items

Kernel development

Kernel release status

The 6.15 kernel is out, having been released on May 25. Linus noted:

So this was delayed by a couple of hours because of a last-minute bug report resulting in one new feature being disabled at the eleventh hour, but 6.15 is out there now.

Significant changes in 6.15 include smarter timer-ID assignment to make checkpoint/restore operations more reliable, the ability to read status information from a pidfd after the process in question has been reaped, the PIDFD_SELF special pidfd value, nested ID-mapped mounts, zero-copy network-data reception via io_uring, the ability to read epoll events via io_uring, resilient queued spinlocks for BPF programs, guard-page enhancements allowing them to be placed in file-backed memory areas and for user space to detect their presence, the once-controversial fwctl subsystem, the optional sealing of some system mappings, and much more.

See the LWN merge-window summaries (part 1, part 2) and the in-progress KernelNewbies 6.15 page for more information.

Stable updates: 6.14.8, 6.12.30, 6.6.92, 6.1.140, and 5.15.184 were released on May 22.

The 6.14.9 and 6.12.31 updates are in the review process; they are due on May 29.

Comments (none posted)

Quote of the week

Nova Core is in the infamous position of being the first driver to have been merged with the upstream kernel Linux that is written in Rust and that loads blobs.

We set out to clean it up, and we did, but... we don't speak Rust, so we've broken it in the process. Now, that's not so unconventional, is it? :-)

"Freedo" releases Linux-libre 6.15-gnu

Comments (none posted)

Distributions

AlmaLinux OS 10.0 released

Version 10 of the AlmaLinux OS distribution has been released.

The goal of AlmaLinux OS is to support our community, and AlmaLinux OS 10 is the best example of that yet. With an unwavering eye on maintaining compatibility with Red Hat Enterprise Linux (RHEL), we have made small improvements to AlmaLinux OS 10 that target specific sections of our userbase.

See the release notes for details.

Comments (6 posted)

Fedora Council overturns FESCo provenpackager decision

The Fedora Council has ruled on the Fedora Engineering Steering Council's (FESCo) decision last year to revoke Peter Robinson's provenpackager status. In a statement published to the fedora-devel-announce mailing list, the council has announced that it has overturned FESCo's decision:

FESCo didn't have a specific policy for dealing with a request to remove Proven Packager rights. In addition, the FESCo process was handled entirely in private. The contributor didn't receive a formal notification or warning from FESCo, and felt blindsided by the official decision when and how it was announced. The Fedora Council would like to extend our sincerest apology on behalf of the Fedora Project to them.

LWN covered the story in December 2024.

Comments (1 posted)

Launchpad mailing lists going away

Canonical's Launchpad software-collaboration platform, which is used for Ubuntu development, will be shutting down its hosted mailing lists at the end of October. The announcement recommends Discourse or Launchpad Answers as alternatives. Ubuntu's mailing lists are unaffected by the change.

Comments (8 posted)

NixOS 25.05 released

Version 25.05 of the NixOS distribution has been released. Changes include support for the COSMIC desktop environment (reviewed here in August), GNOME 48, a 6.12 kernel, and many new modules; see the release notes for details. (Thanks to Pavel Roskin).

Comments (none posted)

Development

Home Assistant deprecates the "core" and "supervised" installation modes

Our recent article on Home Assistant observed that the project emphasizes installations using its own Linux distribution or within containers. The project has now made that emphasis rather stronger with this announcement of the deprecation of the "core" and "supervised" installation modes, which allowed Home Assistant to be installed as an ordinary application on a Linux system.

These are advanced installation methods, with only a small percentage of the community opting to use them. If you are using these methods, you can continue to do so (you can even continue to update your system), but in six months time, you will no longer be supported, which I'll explain the impacts of in the next section. References to these installation methods will be removed from our documentation after our next release (2025.6).

Support for 32-bit Arm and x86 architectures has also been deprecated.

Comments (46 posted)

Mozilla is shutting down Pocket

Mozilla has announced that it is shutting down Pocket, a bookmarking service acquired by Mozilla in 2017, this coming July. "Pocket has helped millions save articles and discover stories worth reading. But the way people use the web has evolved, so we're channeling our resources into projects that better match their browsing habits and online needs."

Comments (12 posted)

Development quotes of the week

To link this back to actual Unix history (or something much nearer that), I realized that `bullshit generator' was a reasonable summary of what LLMs do after also realizing that an LLM is pretty much just a much-fancier and better-automated descendant of Mark V Shaney: https://en.wikipedia.org/wiki/Mark_V._Shaney
Norman Wilson
My name is Rob Pike and I approve this message.
Rob Pike in reply to Wilson.

Comments (none posted)

Page editor: Daroc Alden

Announcements

Newsletters

Distributions and system administration

Development

Meeting minutes

Calls for Presentations

CFP Deadlines: May 29, 2025 to July 28, 2025

The following listing of CFP deadlines is taken from the LWN.net CFP Calendar.

Deadline   Event Dates                Event                                      Location
June 11    August 16 - August 17      Free and Open Source Software Conference   Sankt Augustin, Germany
June 13    September 30 - October 1   All Systems Go! 2025                       Berlin, Germany
June 13    October 17 - October 19    OpenInfra Summit Europe 2025               Paris-Saclay, France
June 15    July 14 - July 20          DebConf 2025                               Brest, France
June 15    November 7 - November 8    Seattle GNU/Linux Conference               Seattle, US
June 20    August 29 - August 31      openSUSE.Asia Summit                       Faridabad, India
June 30    November 7 - November 8    South Tyrol Free Software Conference       Bolzano, Italy

If the CFP deadline for your event does not appear here, please tell us about it.

Upcoming Events

Events: May 29, 2025 to July 28, 2025

The following event listing is taken from the LWN.net Calendar.

Date(s)               Event                                 Location
June 5 - June 8       Flock to Fedora 2025                  Prague, Czech Republic
June 12 - June 14     DevConf.CZ                            Brno, Czech Republic
June 13 - June 15     SouthEast LinuxFest                   Charlotte, NC, US
June 15 - June 17     Berlin Buzzwords                      Berlin, Germany
June 23 - June 25     Open Source Summit North America      Denver, CO, US
June 26 - June 28     Linux Audio Conference                Lyon, France
June 26 - June 27     Linux Security Summit North America   Denver, CO, US
June 26 - June 28     openSUSE Conference                   Nuremberg, Germany
July 1 - July 3       Pass the SALT Conference              Lille, France
July 14 - July 20     DebConf 2025                          Brest, France
July 16 - July 18     EuroPython                            Prague, Czech Republic
July 24 - July 29     GUADEC 2025                           Brescia, Italy

If your event does not appear here, please tell us about it.

Security updates

Alert summary May 22, 2025 to May 28, 2025

Dist. ID Release Package Date
AlmaLinux ALSA-2025:7395 9 389-ds-base 2025-05-26
AlmaLinux ALSA-2025:7422 9 ghostscript 2025-05-26
AlmaLinux ALSA-2025:7893 9 grafana 2025-05-26
AlmaLinux ALSA-2025:8201 8 gstreamer1-plugins-bad-free 2025-05-27
AlmaLinux ALSA-2025:8183 9 gstreamer1-plugins-bad-free 2025-05-27
AlmaLinux ALSA-2025:8056 8 kernel 2025-05-21
AlmaLinux ALSA-2025:8246 8 kernel 2025-05-28
AlmaLinux ALSA-2025:7423 9 kernel 2025-05-26
AlmaLinux ALSA-2025:7903 9 kernel 2025-05-26
AlmaLinux ALSA-2025:8057 8 kernel-rt 2025-05-21
AlmaLinux ALSA-2025:8132 8 libsoup 2025-05-26
AlmaLinux ALSA-2025:8126 9 libsoup 2025-05-26
AlmaLinux ALSA-2025:7425 9 osbuild-composer 2025-05-26
AlmaLinux ALSA-2025:8136 9 python-tornado 2025-05-27
AlmaLinux ALSA-2025:8046 8 webkit2gtk3 2025-05-21
Arch Linux ASA-202505-14 bind 2025-05-27
Arch Linux ASA-202505-13 varnish 2025-05-27
Debian DLA-4181-1 LTS glibc 2025-05-27
Debian DSA-5924-1 stable intel-microcode 2025-05-23
Debian DLA-4178-1 LTS kernel 2025-05-25
Debian DSA-5925-1 stable kernel 2025-05-24
Debian DLA-4179-1 LTS libavif 2025-05-26
Debian DLA-4177-1 LTS libphp-adodb 2025-05-24
Debian DLA-4176-1 LTS openssl 2025-05-24
Debian DLA-4180-1 LTS pgbouncer 2025-05-27
Debian DLA-4182-1 LTS syslog-ng 2025-05-28
Fedora FEDORA-2025-d62bbb5261 F41 dotnet8.0 2025-05-25
Fedora FEDORA-2025-3f807ca531 F42 dotnet8.0 2025-05-25
Fedora FEDORA-2025-75bda8d944 F41 dotnet9.0 2025-05-23
Fedora FEDORA-2025-a54ca28d07 F42 dotnet9.0 2025-05-23
Fedora FEDORA-2025-86022c9c44 F42 dropbear 2025-05-23
Fedora FEDORA-2025-d5e2376a90 F41 ghostscript 2025-05-24
Fedora FEDORA-2025-db5caba0cc F42 ghostscript 2025-05-23
Fedora FEDORA-2025-7e1b66f54e F41 iputils 2025-05-24
Fedora FEDORA-2025-abf317121e F42 microcode_ctl 2025-05-28
Fedora FEDORA-2025-b0f2570b61 F41 mozilla-ublock-origin 2025-05-28
Fedora FEDORA-2025-01794be9b3 F42 mozilla-ublock-origin 2025-05-22
Fedora FEDORA-2025-bc02ec32fb F41 nbdkit 2025-05-26
Fedora FEDORA-2025-8a2d82f65a F42 nbdkit 2025-05-23
Fedora FEDORA-2025-0c2b7a8f32 F41 nodejs20 2025-05-28
Fedora FEDORA-2025-2936dece0e F42 nodejs20 2025-05-28
Fedora FEDORA-2025-61ad6e65b3 F41 nodejs22 2025-05-28
Fedora FEDORA-2025-f4cee58e97 F42 nodejs22 2025-05-28
Fedora FEDORA-2025-a6305306dd F41 open-vm-tools 2025-05-25
Fedora FEDORA-2025-8896dcbcd0 F41 openssh 2025-05-23
Fedora FEDORA-2025-e5d435516f F41 python-watchfiles 2025-05-23
Fedora FEDORA-2025-e6c12e820e F42 python-watchfiles 2025-05-23
Fedora FEDORA-2025-f566d6a4ad F41 rpm-ostree 2025-05-23
Fedora FEDORA-2025-6a67917349 F41 sudo-rs 2025-05-22
Fedora FEDORA-2025-c62d1a4879 F42 sudo-rs 2025-05-22
Fedora FEDORA-2025-ee55907675 F41 thunderbird 2025-05-24
Fedora FEDORA-2025-32d6feec91 F42 thunderbird 2025-05-25
Fedora FEDORA-2025-510a78f439 F41 vyper 2025-05-25
Fedora FEDORA-2025-4acdb9a1bd F42 vyper 2025-05-25
Fedora FEDORA-2025-72469000ed F41 yelp 2025-05-23
Fedora FEDORA-2025-72469000ed F41 yelp-xsl 2025-05-23
Fedora FEDORA-2025-8365ba2261 F41 zsync 2025-05-23
Fedora FEDORA-2025-6f6043cb99 F42 zsync 2025-05-23
Mageia MGASA-2025-0159 9 chromium-browser-stable 2025-05-23
Mageia MGASA-2025-0165 9 firefox, nss, rootcerts 2025-05-27
Mageia MGASA-2025-0164 9 glibc 2025-05-25
Mageia MGASA-2025-0163 9 iputils 2025-05-25
Mageia MGASA-2025-0160 9 microcode 2025-05-23
Mageia MGASA-2025-0161 9 nodejs 2025-05-25
Mageia MGASA-2025-0166 9 open-vm-tools 2025-05-27
Mageia MGASA-2025-0167 9 sqlite3 2025-05-27
Mageia MGASA-2025-0168 9 thunderbird 2025-05-27
Mageia MGASA-2025-0162 9 zsync 2025-05-25
Oracle ELSA-2025-7589 OL8 .NET 8.0 2025-05-21
Oracle ELSA-2025-7598 OL9 .NET 8.0 2025-05-23
Oracle ELSA-2025-7600 OL9 .NET 9.0 2025-05-23
Oracle ELSA-2025-7395 OL9 389-ds-base 2025-05-23
Oracle ELSA-2025-7437 OL9 avahi 2025-05-23
Oracle ELSA-2025-7389 OL9 buildah 2025-05-23
Oracle ELSA-2025-7895 OL8 compat-openssl10 2025-05-21
Oracle ELSA-2025-7937 OL9 compat-openssl11 2025-05-23
Oracle ELSA-2025-7444 OL9 expat 2025-05-23
Oracle ELSA-2025-8060 OL8 firefox 2025-05-22
Oracle ELSA-2025-7428 OL9 firefox 2025-05-23
Oracle ELSA-2025-8049 OL9 firefox 2025-05-23
Oracle ELSA-2025-7422 OL9 ghostscript 2025-05-23
Oracle ELSA-2025-7586 OL9 ghostscript 2025-05-23
Oracle ELSA-2025-7417 OL9 gimp 2025-05-23
Oracle ELSA-2025-7409 OL9 git 2025-05-23
Oracle ELSA-2025-7894 OL8 grafana 2025-05-21
Oracle ELSA-2025-7404 OL9 grafana 2025-05-23
Oracle ELSA-2025-7893 OL9 grafana 2025-05-23
Oracle ELSA-2025-8183 OL9 gstreamer1-plugins-bad-free 2025-05-27
Oracle ELSA-2025-7416 OL9 gvisor-tap-vsock 2025-05-23
Oracle ELSA-2025-8056 OL8 kernel 2025-05-22
Oracle ELSA-2025-7423 OL9 kernel 2025-05-27
Oracle ELSA-2025-7903 OL9 kernel 2025-05-27
Oracle ELSA-2025-7436 OL9 libsoup 2025-05-23
Oracle ELSA-2025-8126 OL9 libsoup 2025-05-27
Oracle ELSA-2025-7410 OL9 libxslt 2025-05-23
Oracle ELSA-2025-7419 OL9 mod_auth_openidc 2025-05-23
Oracle ELSA-2025-7402 OL9 nginx 2025-05-23
Oracle ELSA-2025-7426 OL9 nodejs:20 2025-05-23
Oracle ELSA-2025-7433 OL9 nodejs:22 2025-05-27
Oracle ELSA-2025-7967 OL8 osbuild-composer 2025-05-21
Oracle ELSA-2025-7425 OL9 osbuild-composer 2025-05-23
Oracle ELSA-2025-7431 OL9 php 2025-05-27
Oracle ELSA-2025-7432 OL9 php:8.2 2025-05-27
Oracle ELSA-2025-7418 OL9 php:8.3 2025-05-27
Oracle ELSA-2025-7391 OL9 podman 2025-05-23
Oracle ELSA-2025-8136 OL9 python-tornado 2025-05-27
Oracle ELSA-2025-7438 OL9 redis 2025-05-27
Oracle ELSA-2025-7686 OL8 redis:6 2025-05-21
Oracle ELSA-2025-7429 OL9 redis:7 2025-05-27
Oracle ELSA-2025-7539 OL8 ruby:2.5 2025-05-21
Oracle ELSA-2025-7397 OL9 skopeo 2025-05-23
Oracle ELSA-2025-7435 OL9 thunderbird 2025-05-23
Oracle ELSA-2025-7440 OL9 vim 2025-05-23
Oracle ELSA-2025-8046 OL8 webkit2gtk3 2025-05-21
Oracle ELSA-2025-7387 OL9 webkit2gtk3 2025-05-23
Oracle ELSA-2025-7995 OL9 webkit2gtk3 2025-05-23
Oracle ELSA-2025-7672 OL9 xdg-utils 2025-05-23
Oracle ELSA-2025-7427 OL9 xterm 2025-05-23
Oracle ELSA-2025-7430 OL9 yelp 2025-05-23
Red Hat RHSA-2025:8184-01 EL10 gstreamer1-plugins-bad-free 2025-05-27
Red Hat RHSA-2025:8201-01 EL8 gstreamer1-plugins-bad-free 2025-05-27
Red Hat RHSA-2025:8183-01 EL9 gstreamer1-plugins-bad-free 2025-05-27
Red Hat RHSA-2025:8137-01 EL10 kernel 2025-05-26
Red Hat RHSA-2025:8056-01 EL8 kernel 2025-05-26
Red Hat RHSA-2025:7901-01 EL8.4 kernel 2025-05-26
Red Hat RHSA-2025:7903-01 EL9 kernel 2025-05-26
Red Hat RHSA-2025:8142-01 EL9 kernel 2025-05-26
Red Hat RHSA-2025:7897-01 EL9.0 kernel 2025-05-26
Red Hat RHSA-2025:8133-01 EL9.2 kernel 2025-05-26
Red Hat RHSA-2025:8057-01 EL8 kernel-rt 2025-05-26
Red Hat RHSA-2025:7902-01 EL8.4 kernel-rt 2025-05-26
Red Hat RHSA-2025:7896-01 EL9.0 kernel-rt 2025-05-26
Red Hat RHSA-2025:7676-01 EL9.2 kernel-rt 2025-05-26
Red Hat RHSA-2025:8134-01 EL9.2 kernel-rt 2025-05-26
Red Hat RHSA-2025:8132-01 EL8 libsoup 2025-05-26
Red Hat RHSA-2025:8252-01 EL8.8 libsoup 2025-05-28
Red Hat RHSA-2025:8126-01 EL9 libsoup 2025-05-26
Red Hat RHSA-2025:8140-01 EL9.2 libsoup 2025-05-26
Red Hat RHSA-2025:8139-01 EL9.4 libsoup 2025-05-26
Red Hat RHSA-2025:8128-01 EL10 libsoup3 2025-05-26
Red Hat RHSA-2025:8195-01 EL8.8 mingw-freetype and spice-client-win 2025-05-27
Red Hat RHSA-2025:7967-01 EL8 osbuild-composer 2025-05-23
Red Hat RHSA-2025:8075-01 EL8.8 osbuild-composer 2025-05-23
Red Hat RHSA-2025:8254-01 EL8 pcs 2025-05-28
Red Hat RHSA-2025:8256-01 EL9 pcs 2025-05-28
Red Hat RHSA-2025:8135-01 EL10 python-tornado 2025-05-26
Red Hat RHSA-2025:8136-01 EL9 python-tornado 2025-05-26
Red Hat RHSA-2025:8226-01 EL9.2 python-tornado 2025-05-28
Red Hat RHSA-2025:8223-01 EL9.4 python-tornado 2025-05-28
Red Hat RHSA-2025:8131-01 EL10 ruby 2025-05-26
Red Hat RHSA-2025:8046-01 EL8 webkit2gtk3 2025-05-27
Red Hat RHSA-2025:7995-01 EL9 webkit2gtk3 2025-05-27
Slackware SSA:2025-140-01 aaa_glibc 2025-05-20
Slackware SSA:2025-143-01 ffmpeg 2025-05-24
Slackware SSA:2025-140-02 mozilla 2025-05-20
Slackware SSA:2025-147-01 mozilla 2025-05-27
SUSE openSUSE-SU-2025:15150-1 TW audiofile 2025-05-24
SUSE openSUSE-SU-2025:15156-1 TW bind 2025-05-27
SUSE openSUSE-SU-2025:15143-1 TW chromedriver 2025-05-22
SUSE openSUSE-SU-2025:15132-1 TW dante 2025-05-21
SUSE openSUSE-SU-2025:15157-1 TW dnsdist 2025-05-27
SUSE SUSE-SU-2025:20328-1 elemental-operator 2025-05-28
SUSE SUSE-SU-2025:01710-1 SLE12 firefox 2025-05-26
SUSE SUSE-SU-2025:01701-1 SLE15 SES7.1 oS15.6 firefox 2025-05-26
SUSE openSUSE-SU-2025:15133-1 TW firefox-esr 2025-05-21
SUSE SUSE-SU-2025:01702-1 SLE15 oS15.6 glibc 2025-05-26
SUSE openSUSE-SU-2025:15134-1 TW gnuplot 2025-05-21
SUSE SUSE-SU-2025:01653-1 SLE15 oS15.6 govulncheck-vulndb 2025-05-22
SUSE SUSE-SU-2025:01713-1 SLE15 oS15.6 govulncheck-vulndb 2025-05-27
SUSE openSUSE-SU-2025:15135-1 TW govulncheck-vulndb 2025-05-21
SUSE openSUSE-SU-2025:15144-1 TW govulncheck-vulndb 2025-05-23
SUSE openSUSE-SU-2025:15159-1 TW govulncheck-vulndb 2025-05-27
SUSE openSUSE-SU-2025:15145-1 TW grafana 2025-05-23
SUSE openSUSE-SU-2025:15136-1 TW grype 2025-05-21
SUSE SUSE-SU-2025:01718-1 SLE15 SES7.1 oS15.3 gstreamer-plugins-bad 2025-05-28
SUSE SUSE-SU-2025:01717-1 SLE15 oS15.5 gstreamer-plugins-bad 2025-05-28
SUSE openSUSE-SU-2025:15160-1 TW jetty-annotations 2025-05-27
SUSE openSUSE-SU-2025:15161-1 TW jq 2025-05-27
SUSE SUSE-SU-2025:01707-1 SLE15 oS15.6 kernel 2025-05-26
SUSE openSUSE-SU-2025:15146-1 TW kind 2025-05-23
SUSE openSUSE-SU-2025:15147-1 TW kubo 2025-05-23
SUSE openSUSE-SU-2025:15151-1 TW libecpg6 2025-05-24
SUSE openSUSE-SU-2025:15165-1 TW libnss_slurm2 2025-05-27
SUSE openSUSE-SU-2025:15167-1 TW libyelp0 2025-05-27
SUSE SUSE-SU-2025:01716-1 SLE15 oS15.6 mariadb 2025-05-28
SUSE SUSE-SU-2025:20327-1 nvidia-open-driver-G06-signed 2025-05-28
SUSE SUSE-SU-2025:20319-1 nvidia-open-driver-G06-signed 2025-05-28
SUSE SUSE-SU-2025:01658-1 SLE-m5.1 SLE-m5.2 SLE-m5.3 SLE-m5.4 SLE-m5.5 oS15.3 open-vm-tools 2025-05-22
SUSE SUSE-SU-2025:01705-1 SLE15 SES7.1 postgresql13 2025-05-26
SUSE openSUSE-SU-2025:15137-1 TW postgresql13 2025-05-21
SUSE SUSE-SU-2025:01654-1 oS15.6 postgresql13 2025-05-22
SUSE SUSE-SU-2025:01661-2 SLE15 postgresql14 2025-05-26
SUSE SUSE-SU-2025:01661-1 SLE15 oS15.6 postgresql14 2025-05-22
SUSE openSUSE-SU-2025:15138-1 TW postgresql14 2025-05-21
SUSE openSUSE-SU-2025:15139-1 TW postgresql15 2025-05-21
SUSE openSUSE-SU-2025:15140-1 TW postgresql16 2025-05-21
SUSE SUSE-SU-2025:01644-1 SLE15 oS15.6 postgresql17 2025-05-21
SUSE openSUSE-SU-2025:15162-1 TW prometheus-blackbox_exporter 2025-05-27
SUSE SUSE-SU-2025:01523-1 SLE15 python-Django 2025-05-26
SUSE SUSE-SU-2025:01662-1 SLE15 oS15.6 python-cryptography 2025-05-22
SUSE SUSE-SU-2025:20330-1 python-h11, python-httpcore 2025-05-28
SUSE SUSE-SU-2025:01704-1 MP4.3 SLE15 oS15.4 oS15.6 python-setuptools 2025-05-26
SUSE SUSE-SU-2025:01695-1 SLE12 python-setuptools 2025-05-23
SUSE SUSE-SU-2025:01715-1 SLE15 SLE-m5.1 SLE-m5.2 SES7.1 python-setuptools 2025-05-28
SUSE SUSE-SU-2025:01649-2 SLE15 python-tornado6 2025-05-23
SUSE SUSE-SU-2025:01649-1 SLE15 oS15.4 oS15.6 python-tornado6 2025-05-22
SUSE SUSE-SU-2025:01709-1 SLE15 oS15.4 oS15.6 python310-setuptools 2025-05-26
SUSE openSUSE-SU-2025:15152-1 TW python311-Flask 2025-05-24
SUSE openSUSE-SU-2025:15153-1 TW python311-tornado6 2025-05-24
SUSE openSUSE-SU-2025:15163-1 TW python312 2025-05-27
SUSE openSUSE-SU-2025:15154-1 TW python313 2025-05-24
SUSE openSUSE-SU-2025:15141-1 TW python314 2025-05-21
SUSE SUSE-SU-2025:01693-1 SLE12 python36-setuptools 2025-05-23
SUSE SUSE-SU-2025:01723-1 SLE15 SES7.1 oS15.3 oS15.6 python39-setuptools 2025-05-28
SUSE openSUSE-SU-2025:15164-1 TW screen 2025-05-27
SUSE SUSE-SU-2025:20323-1 sqlite3 2025-05-28
SUSE SUSE-SU-2025:01660-1 SLE15 oS15.6 thunderbird 2025-05-22
SUSE openSUSE-SU-2025:15131-1 TW thunderbird 2025-05-21
SUSE openSUSE-SU-2025:15149-1 TW thunderbird 2025-05-24
SUSE openSUSE-SU-2025:15155-1 TW transfig 2025-05-24
SUSE SUSE-SU-2025:01651-1 MP4.3 SLE15 SLE-m5.1 SLE-m5.2 SLE-m5.3 SLE-m5.4 SLE-m5.5 SES7.1 oS15.6 ucode-intel 2025-05-22
SUSE SUSE-SU-2025:01650-1 SLE12 ucode-intel 2025-05-22
SUSE openSUSE-SU-2025:15166-1 TW umoci 2025-05-27
SUSE SUSE-SU-2025:01724-1 MP4.3 SLE15 oS15.4 webkit2gtk3 2025-05-28
SUSE SUSE-SU-2025:01720-1 SLE12 webkit2gtk3 2025-05-28
SUSE SUSE-SU-2025:01703-1 SLE15 oS15.6 xen 2025-05-26
SUSE openSUSE-SU-2025:15142-1 TW xen 2025-05-21
Ubuntu USN-7525-1 18.04 20.04 22.04 24.04 Tomcat 2025-05-21
Ubuntu USN-7525-2 24.04 24.10 25.04 Tomcat 2025-05-27
Ubuntu USN-7526-1 24.10 25.04 bind9 2025-05-21
Ubuntu USN-7536-1 20.04 22.04 24.04 24.10 cifs-utils 2025-05-27
Ubuntu USN-7534-1 25.04 flask 2025-05-26
Ubuntu USN-7532-1 20.04 22.04 24.04 24.10 25.04 glib2.0 2025-05-26
Ubuntu USN-7541-1 18.04 20.04 22.04 glibc 2025-05-28
Ubuntu USN-7535-1 16.04 18.04 20.04 22.04 24.04 24.10 25.04 intel-microcode 2025-05-27
Ubuntu USN-7527-1 16.04 18.04 20.04 libfcgi-perl 2025-05-22
Ubuntu USN-7510-7 20.04 22.04 linux-aws, linux-intel-iotg-5.15, linux-nvidia-tegra-igx, linux-raspi 2025-05-28
Ubuntu USN-7521-2 24.10 linux-aws 2025-05-22
Ubuntu USN-7510-6 22.04 linux-aws-fips 2025-05-27
Ubuntu USN-7517-3 20.04 linux-bluefield 2025-05-26
Ubuntu USN-7516-5 18.04 linux-hwe-5.4 2025-05-23
Ubuntu USN-7513-4 22.04 linux-hwe-6.8 2025-05-28
Ubuntu USN-7516-6 20.04 linux-ibm 2025-05-26
Ubuntu USN-7517-2 18.04 linux-ibm-5.4 2025-05-21
Ubuntu USN-7521-3 24.04 24.10 linux-lowlatency, linux-lowlatency-hwe-6.11, linux-oracle 2025-05-28
Ubuntu USN-7516-4 18.04 linux-oracle-5.4 2025-05-21
Ubuntu USN-7539-1 20.04 linux-raspi 2025-05-28
Ubuntu USN-7524-1 24.04 linux-raspi 2025-05-26
Ubuntu USN-7540-1 18.04 linux-raspi-5.4 2025-05-28
Ubuntu USN-7537-1 20.04 22.04 24.04 24.10 25.04 net-tools 2025-05-27
Ubuntu USN-7533-1 24.10 25.04 openjdk-17-crac 2025-05-26
Ubuntu USN-7531-1 24.10 25.04 openjdk-21-crac 2025-05-26
Ubuntu USN-7520-2 25.04 postgresql-17 2025-05-21
Ubuntu USN-7280-2 14.04 16.04 18.04 20.04 22.04 24.10 python 2025-05-22
Ubuntu USN-7528-1 20.04 22.04 24.04 24.10 25.04 sqlite3 2025-05-22
Ubuntu USN-7529-1 20.04 22.04 tika 2025-05-26
Full Story (comments: none)

Kernel patches of interest

Kernel releases

Linus Torvalds Linux 6.15 May 25
Freedo GNU Linux-libre 6.15-gnu May 26
Greg Kroah-Hartman Linux 6.14.8 May 22
Greg Kroah-Hartman Linux 6.12.30 May 22
Greg Kroah-Hartman Linux 6.6.92 May 22
Greg Kroah-Hartman Linux 6.1.140 May 22
Greg Kroah-Hartman Linux 5.15.184 May 22

Architecture-specific

Build system

Core kernel

Development tools

Device drivers

André Apitzsch via B4 Relay media: i2c: imx214: Add support for more clock frequencies May 21
Elaine Zhang rockchip: add can for RK3576 Soc May 22
Mathieu Dubois-Briand Add support for MAX7360 May 22
Angelo Dureghello iio: adc: add ad7606 calibration support May 22
Arkadiusz Kubalewski dpll: add Reference SYNC feature May 22
Bjorn Helgaas Rate limit AER logs May 22
Harshitha Ramamurthy gve: Add Rx HW timestamping support May 22
Clément Le Goffic Introduce HDP support for STM32MP platforms May 23
Ming Lei ublk: add UBLK_F_QUIESCE May 23
David Arinzon PHC support in ENA driver May 22
Arkadiusz Kubalewski dpll: add all inputs phase offset monitor May 23
Anup Patel Linux SBI MPXY and RPMI drivers May 25
Aradhya Bhatia drm/tidss: Add OLDI bridge support May 25
George Moussalem via B4 Relay Add support for the IPQ5018 Internal GE PHY May 25
Vincent Knecht via B4 Relay CAMSS support for MSM8939 May 25
Matthias Fend media: add Himax HM1246 image sensor May 26
Michał Winiarski PCI: VF resizable BAR May 26
Xianwei Zhao via B4 Relay Add support for Amlogic S7/S7D/S6 pinctrl May 27
Christian Marangi pinctrl: Add Airoha AN7583 support May 28
Samuel Kayode via B4 Relay add support for pf1550 PMIC MFD-based drivers May 27
Christian Marangi clk: add support for Airoha AN7583 clock May 28

Device-driver infrastructure

Documentation

Filesystems and block layer

Memory management

Networking

Security-related

Virtualization and containers

Miscellaneous

Page editor: Joe Brockmeier


Copyright © 2025, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds