[rfc] LLM policy?
We've seen a slight uptick in pull requests and bug reports which appear to be LLM-generated, so it's probably about time we decide what we should and should not accept, and document it somewhere (presumably in CONTRIBUTING.md).
My personal opinion is that we shouldn't accept anything LLM-generated, but this is probably not the position of most @opencontainers/runc-maintainers, so we should probably consider LLM-generated code and issues separately.
IMHO, we should close all LLM-generated issues as spam, because even if they are describing real issues, the description contains so much unneeded (and probably incorrect) information that it'd be better if they just provided their LLM prompt as an issue instead. More importantly, when triaging bugs we have to assume that what the user has written actually happened to them, but with LLM-generated issues -- who knows whether the description is describing something real? (See https://github.com/opencontainers/runc/issues/4982 and https://github.com/opencontainers/runc/issues/4972 as possible examples of LLM-generated bug reports.)
For LLM-generated code, I think the minimum bar should be that the submitter needs to be able to respond to review requests in their own words (i.e., they understand what their patch does and could have written the code themselves). (https://github.com/opencontainers/runc/pull/4940 and https://github.com/opencontainers/runc/pull/4939 are the most recent examples of this I can think of, and I'm not convinced the submitter would've cleared this bar.)
(FWIW, my view is that LLM-generated code cannot fulfil the requirements of the DCO -- not to mention the copyright situation of such code is still very much unclear -- and so we shouldn't accept it for the legal reasons alone, but I appreciate this is a minority view.)
For reference, Incus added a note to their CONTRIBUTING.md earlier this year, banning all LLM usage.
Using LLMs should be totally fine for:
- translating non-English to English
- finding and fixing grammatical errors and typos
- tab-completing trivial text and code (e.g. `if err != nil { return nil, err }`; see the sketch below)
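To make the third bullet concrete, here is a hypothetical Go sketch (the function and type are made up for illustration); the only "LLM" contribution in this kind of workflow is tab-completing the boilerplate error-handling lines, not designing or implementing any real logic:

```go
package config

import (
	"encoding/json"
	"os"
)

// Config is a stand-in type for this illustration only.
type Config struct {
	Root string `json:"root"`
}

// loadConfig reads and parses a JSON config file. The structure is the
// author's; the `if err != nil { return nil, err }` lines are the kind of
// trivial completion meant in the list above.
func loadConfig(path string) (*Config, error) {
	data, err := os.ReadFile(path)
	if err != nil {
		return nil, err
	}
	var cfg Config
	if err := json.Unmarshal(data, &cfg); err != nil {
		return nil, err
	}
	return &cfg, nil
}
```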
I completely agree on the first two, though I suspect most such cases wouldn't scream "LLM" when you read them anyway (well, aside from being a little too stilted and formulaic).
As for the third, I suspect some heavy LLM users consider the difference between fancy tab-completion and using something like Claude to implement a design they've written (a step below "vibe-coding") to be a matter of degree, not two completely different activities. I wonder if there is a simple line we can draw to make expectations clear...
I'd like to point out that the PR and issues you're having trouble with are "bad issues" rather than LLM issues.
If someone uses an LLM to generate code and comments that are good, they will go unnoticed and you'll probably think a human wrote them.
I can certainly understand that there is an inflation of LLM slop getting sent in, especially false issues. But it's probably easier to decide what content to allow based on quality rather than source.
Is there proof that the ratio of good LLM PRs to bad LLM PRs is worse than human PRs?
> Is there proof that the ratio of good LLM PRs to bad LLM PRs is worse than human PRs?
The ratio doesn't matter yet, because it's a lot easier to submit an LLM-generated issue, so it's a question of volume.
Maybe at some future date, when neural nets are more capable than what LLMs can achieve today, this will be relevant. But at that point the OP's issue will be irrelevant, because nobody will be able to tell what is machine-generated anyway.
I was wondering why people with no relationship with runc were commenting -- this issue is the top story on Hacker News. 😅
To folks coming from HN: Let's try to avoid spam so I don't have to lock the issue.
I think this part of Incus's LLM policy is particularly germane:
> We expect all contributors to be able to reason about the code that they contribute and explain why they're taking a particular approach.
>
> LLMs and similar predictive tools have the annoying tendency of producing large amount of low quality code with subtle issues which end up taking the maintainers more time to debug than it would have taken to write the code by hand in the first place.
The problem with LLM issues and PRs is that LLMs allow people to produce reasonable-looking spam which takes more time to review, only for you to discover that the person submitting the code doesn't know what the code does or whether the description in the issue is actually true. If you get them to fix one issue, they (the submitter or the LLM) won't remember this preference for the rest of their patch (or future patches).
Yeah a "good guy with an LLM" can almost certainly produce reasonable-looking code, but LLMs don't require an "LLM license" for you to show that you have some level of proficiency to use it. Also, a lot of these arguments seem like a no-true-Scotsman to me -- yes, in theory someone could produce LLM code that is perfectly indistinguishable from code a human wrote, and they could test it extensively before submitting the patch, and they could study the code so that they understood it as well as someone who had written it. But how many actually do that?
For what it's worth, if every LLM-generated PR and issue was tagged as such, I might have a different outlook -- but my general experience is that I only figure out a PR is LLM-generated after I've done an in-depth review like I would with a human PR. For one, it feels deceptive, but it also feels somewhat disrespectful -- as though I've wasted my time reviewing something in depth when the submitter probably won't bother to even read my comments.
For folks arguing that we should just treat "bad LLM PRs" as if they are just "bad code PRs" -- I tend to mentor the author when dealing with a "bad code" PR so their patch can get merged and they may be able to contribute better patches in the future. This approach makes no sense with LLMs and would honestly just be a waste of time. I don't like closing PRs made by humans without a long discussion first, because someone spent their (human!) time to try to improve our project -- even if the patch isn't good, I would like to understand what they wanted to do and whether there is an alternative solution for them. Again, this approach makes no sense with LLMs.
We could develop a censorship tool with federated communication (so that each user could maintain their own blacklists or subscribe to existing ones). It could be a browser extension or a direct GitHub feature.
I used to be in favour of AI-generated code, but it has become evident that it can quite easily produce code with a semblance of quality that turns out to be rather poor once inspected. I thus believe that in the best of cases, AI-generated PRs waste the reviewer's time, while in others they can potentially introduce dangerous mistakes that slip through code review.
That being said, I believe AI can improve productivity if used primarily as an enhanced search engine.
I am using LLMs (mostly Claude Code) occasionally, with mixed results. To me it's OK to use an LLM:
- for code analysis;
- for initial quick POC code snippets to play with;
- for trivial and mundane stuff like unit tests (not always);
- if LLM attribution is clearly stated.
I am also using PR reviews from LLMs, again with mixed results. In a few cases LLMs spotted real issues, yet most comments are not helpful and need to be discarded (it helps a lot that you don't have to reason with an LLM-generated review comment).
Overall, I don't think we should outright ban LLMs. Having said that, stuff like #4982 upsets my stomach.