
Develop policy around LLM-generated code in our package submissions

Open lwasser opened this issue 4 months ago • 24 comments

We have our first submission that acknowledges the use of LLMs in developing its codebase.

I appreciate that it's acknowledged in the codebase. I assume that many other submissions use or have used LLMs too, but aren't acknowledging it. JOSS discussed this topic but didn't land on a policy (that I can see):

https://github.com/openjournals/joss/issues/1297

JOSS has also faced challenges with LLM-driven submissions: LLMs make it easier for users to generate more code more quickly, which has in part stressed the review pipeline (humans trying to review rapidly developed packages).

I see several potential challenges to consider here:

  1. Licenses: If maintainers use MIT / BSD-3 but the LLM that generated the code was trained on code under less permissive licenses (e.g., copyleft), how does that work? Is it even legal?
  2. Human impact: We have humans taking time to review code that is non-human-generated (in part or in full).
  3. Ethics: There are significant ethical issues with LLM-generated code. However, we also know that MANY people use LLMs in their daily workflows and find them helpful. We also know that these tools have the potential to be dangerous for groups that are underrepresented in open source.

We should develop a policy, or at least some language, around this topic. First, though, I want to better understand people's thoughts.

Other relevant links:

lwasser avatar Aug 20 '25 22:08 lwasser

I would note that even if the LLM is exclusively trained on MIT-licensed projects (which is true of none of the commercial services), creating a derivative work without attribution is a license violation. GitHub's terms of service are intended to cover inbound=outbound, which is more formally enshrined in the Developer Certificate of Origin (DCO). I've written here about what it would mean to affirm the DCO for LLM-generated code. https://hachyderm.io/@jedbrown/114928816429127795

jedbrown avatar Aug 20 '25 23:08 jedbrown

I personally don't mind reviewing LLM code. I've reviewed automatically generated tests and presumably many lines which were autocompleted by some classic algorithm. LLMs are just a souped-up version of that, IMO.

Legally Speaking (re: Licenses)

I'm not a legal scholar, so when it comes to things like this I look to what others are doing.

Two US federal court rulings have recently been handed down which, in short, say that training LLMs on copyrighted material (specifically books, in those cases) is not copyright infringement. Here's a write-up I found helpful. An analogy: a person who reads a book about something and then authors their own book on a similar topic is clearly not committing copyright infringement, so neither is this (i.e., if I learn to code by reading other people's code and then write my own, that's OK). In my opinion this ignores the fact that AI models are capable of exactly reproducing code snippets without proper attribution, but theoretically any developer might do the same thing, so this is not a problem (?).

On the other hand, another judge has ruled that material generated by AI (specifically images) is not copyrightable, as it is not generated by a human (another write-up). Does this extend to code?

My impression is that LLM-generated code on its own is not a copyright violation, but it is also not copyrightable. I suspect that LLM-generated code reviewed* by a human is just like any other code.

* a loaded term - this could mean reading it line by line, or just giving it a quick scan like you would for a trusted junior dev

Practically Speaking

From a more pragmatic point of view, LLMs are useful for generating code and people will continue to use them. I think setting a blanket rule that any LLM-generated code is disqualifying will force people away and discourage community contributions to pyOpenSci packages. I like the linked JOSS policy.

I won't comment on the ethical aspects of this, since I don't know enough about it.

JacksonBurns avatar Aug 21 '25 00:08 JacksonBurns

first main thing is a disclosure requirement: we should have a place to check a few ways LLMs may have been used

e.g.

  • code generation
  • documentation
  • tests
  • ...other?

with a textbox to explain further, prompting the author to describe how much and which parts of the code were generated if any of those boxes were checked.

this package notes in its readme:

This project was written and maintained by hand, while making occasional use of Perplexity. For transparency, this is acknowledged, but the project should not be considered AI‑generated.

as far as i know perplexity is not a code generation llm that can hook into an IDE like cursor, so this probably means they asked for some snippets of code to be generated in some places? so that's a good example of asking "where is the llm generated code" so reviewers can assess

  1. the scope of llm generated code - e.g. i definitely will copy/paste functions from stackoverflow and put a link back to the post in a docstring (see the sketch after this list), and sporadic use for utility operations is different to me than widespread code generation in core routines, and matters for a reviewer deciding whether they want to review something, and
  2. location of llm generated code - if reviewers do choose to review a package with llm code, they should probably know which parts it is so they can pay extra attention to it/know if something is just boilerplate they can skim over
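
to illustrate the attribution practice in point 1, here is a minimal sketch (the function is a common Stack Overflow-style snippet and the link is a placeholder; it is only an example of the kind of docstring attribution i mean, not code from any submission):

```python
import re


def natural_sort_key(s: str):
    """Split a string into text and integer chunks for human-friendly sorting.

    Adapted from a Stack Overflow answer; the link stays in the docstring so
    reviewers can check the original (placeholder URL, for illustration only):
    https://stackoverflow.com/a/XXXXXXX
    """
    return [int(chunk) if chunk.isdigit() else chunk.lower()
            for chunk in re.split(r"(\d+)", s)]


# usage: sorted(["file10", "file2", "file1"], key=natural_sort_key)
# -> ["file1", "file2", "file10"]
```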

we should also have a note in the disclosure section like "if you do not disclose LLM usage and reviewers have reason to believe that code is generated, they may choose to discontinue the review and pyos may decline to review future submissions" or something. failure to disclose would be an ethical violation and in a traditional journal that would be like failing to disclose critical details in a methods section, and you'd expect an editor to drop a paper like that.

imo we can leave it up to reviewers and editors if they want to take on llm generated packages. i wouldn't, and i don't think we should encourage it, but the scenario that feels bad to me is some newbie is only exposed to "everyone uses LLMs now," they make their first package and are excited for external review, and then we decline to review it without them having a chance to know that LLM use could be problematic or controversial.

sneakers-the-rat avatar Aug 21 '25 01:08 sneakers-the-rat

To me, the important question is the scale: "using an LLM" can range from "I typed a few prompts and that generated all the code in this project" to essentially a fancy version of autocomplete. We've had editors that autocomplete known variable names or language commands, or insert templates for e.g. a for loop, for literally decades, and nobody thought twice about that. In the latter case, I would recommend treating the code just like hand-written code.

A problem in practice is that only the authors know where on the scale from 0 to 100% LLM-generated their code falls, and there is no way to check or evaluate that [*]. I personally dislike policies that disadvantage honest submitters (by e.g. putting up higher hurdles for submissions with LLM-generated code in them) with no chance of ever even noticing bad actors. I don't think it is fair to put up significant barriers for honest submitters based on ethical concerns, because that strongly incentivizes submitters to lie to us - and I don't think it's ethical to set up a structure that actively rewards people who behave unethically (by simply not declaring that their code is LLM generated).

So, I suggest that we:

  • suggest that projects have a statement about their LLM policy, and
  • add a check box, "All LLM-generated code in the codebase (if any) has been reviewed by a human", and require people to check that box before we begin a review. That's unenforceable (we don't know how carefully they read the code), but a reasonable nudge, I think. At least the first review of LLM-generated code should be on the maintainers, not the reviewers. (But for large projects, our reviewers won't read every line of code anyway.)

Beyond that, I don't think, from a practical point of view, that we can treat LLM-generated code differently, simply because we can't reliably recognize it as being different. That's independent of what your ethical stance on LLMs is - if we can't tell whether code is LLM-generated, then we just can't treat it differently.

*: While some LLMs may have a certain coding style that one could recognize, automatic detection of LLM writing has been shown many times to be inaccurate and the landscape is evolving so fast that it's impossible to keep up with that.

hamogu avatar Aug 21 '25 16:08 hamogu

I think, practically speaking, that anything a human submits must be attributed to that human, and they are responsible for it. If it's AI-generated code but it's good, then there is nothing standing in its way.

I therefore strongly support a checkbox.

Furthermore, I'd recommend being rather strict if there is straight-out BS code generated by an LLM that was obviously not manually reviewed, that is, sloppy AI coding. This strongly damages trust, and therefore the review process, and maybe altogether prevents the package from obtaining verification (if it happened now, it could happen again in the future). We want to guarantee carefully coded packages.

jonas-eschle avatar Aug 22 '25 16:08 jonas-eschle

I don’t think we can (or should) try to stop the progress, code is now being generated faster than ever before. Some of it is sloppy, but some of it, when paired with strong testing, can actually prove to be efficient and production-ready.

As technology advances, maybe we also need to rethink the peer review process itself. Beyond the traditional human review, it could be worth exploring how LLMs might assist reviewers, catching obvious issues, suggesting tests, or checking style consistency, so that humans can focus on the higher-level aspects like design, ethics, and overall quality.

Of course, responsibility and accountability must remain with humans, but I see real potential in a hybrid approach: LLMs as supportive tools in both writing code and reviewing it, while humans remain the final gatekeepers of quality.

EdisonAltamirano avatar Aug 22 '25 16:08 EdisonAltamirano

As technology advances, maybe we also need to rethink the peer review process itself.

Fully agree on this one! Technology progresses, LLMs will generate more code, and that's a good thing overall. And we should definitely investigate the use of LLMs for ourselves - there should be quite a few checks that could be automated.

Of course, responsibility and accountability must remain with humans, but I see real potential in a hybrid approach: LLMs as supportive tools in both writing code and reviewing it, while humans remain the final gatekeepers of quality.

Yes, this.

jonas-eschle avatar Aug 22 '25 16:08 jonas-eschle

Suppose a human types // sparse matrix transpose, presses tab, and gets a page of code that still has namespaces from an existing library with copyright/attribution stripped. What does it mean that "accountability remains with humans"? What if the algorithmic system obfuscates just enough that it isn't obvious to the human that it's plagiarizing an existing library (violating that project's license)? Just because many people use a product doesn't mean it isn't manufacturing plausible deniability. If the claim is merely "I don't think copyright holders will successfully sue me for infringing", that is a statement about power rather than justice or consent.

The Does v. GitHub litigation is ongoing (summary and status). It is plagiarism (and copyright infringement/license violation) for a human to take copyrighted code and obfuscate it (e.g., by changing variable names and applying control-flow isomorphisms) while removing attribution. Given that LLM-based systems are not even capable of clean-room design, I think it's a poor argument to claim it's like a person producing original works. (Even if a future court rules otherwise, that doesn't make it right or respectful of consent, and laws may change.)

LLMs are systems that emit derivative works of unknown provenance. It doesn't matter whether it's "good enough quality" or has been reviewed by a human. If a human was found to be creating derivative works without attribution, it would be an ethical transgression and may have legal consequences. The same standard should apply to LLM-generated products.

LLMs will generate more code, and that's a good thing overall.

I disagree. It is creating technical and social debt.

jedbrown avatar Aug 22 '25 19:08 jedbrown

Ok y'all. I see some great comments above. Based on those, and to further direct this conversation, here is a list of goals that we can consider when drafting both this policy/approach and the associated thought pieces on LLMs in scientific open source.

I am going to try to align some of the comments above with the goals I list below:


GOAL 1: Create a thoughtful disclosure policy around authors acknowledging the use of LLMs in submissions and how they were used in our submission process.

  • Provide guidance on how and where to acknowledge the extent to which LLMs were used to develop/maintain the package.

@sneakers-the-rat: the scope and location of the LLM-generated code - i.e., how much of it was LLM-generated - and linking back to Stack Overflow in functions
@hamogu: "I don't think it is fair to put up significant barriers for honest submitters based on ethical concerns, because that strongly incentivizes submitters to lie to us - and I don't think it's ethical to set up a structure that actively rewards people who behave unethically (by simply not declaring that their code is LLM generated)."

GOAL 2: Protect Peer Review Efficiency: Minimize Burden

  • Minimize review burden from low-quality submissions while avoiding false accusations about code origin.
    • e.g., Reviewing "junk" code that was created by an LLM without critical human review or testing burdens reviewers.
  • Misidentifying human code as LLM code will potentially have a disproportionate impact on newcomers (who have lower confidence and sense of belonging) and will further divide the community.

@sneakers-the-rat: a statement that allows reviewers to determine if it is "junk code" - not reviewed by a human

GOAL 3: Create a supportive (learning) environment for newcomers in peer review:

  • Make it easy and supportive for those who disclose their use. Anti-goal: don't incentivize people not to disclose (i.e., make disclosure easy and acceptable; see @hamogu's comment above regarding bad actors)
  • Provide support for authors to learn more about better ways to use these tools, and
  • Address potential learning problems with using these tools (@sneakers-the-rat has brought this up in Slack as well)
  • Provide a learning opportunity for authors to better understand ethical challenges around the use of LLMs in open science & open source (I think many of us here ca
  • Raise awareness of licensing and other issues associated with LLM-generated code

@sneakers-the-rat: "the scenario that feels bad to me is some newbie is only exposed to 'everyone uses LLMs now,' they make their first package and are excited for external review, and then we decline to review it without them having a chance to know that LLM use could be problematic or controversial"
@JacksonBurns: "I think setting a blanket rule that any LLM-generated code is disqualifying will force people away and discourage community contributions to pyOpenSci packages."

Still not addressed: Licensing issues (@jedbrown ) I'm honestly not sure how to address this one, given the complexity and given our shared goal of not excluding people who are using these tools with good intentions.

Some implementation suggestions to consider:

@jonas-eschle: "Furthermore, I'd recommend being rather strict if there is straight-out BS code generated by an LLM that was obviously not manually reviewed, that is, sloppy AI coding. This strongly damages trust, and therefore the review process, and maybe altogether prevents the package from obtaining verification (if it happened now, it could happen again in the future). We want to guarantee carefully coded packages."
@hamogu (@sneakers-the-rat referred to this too): suggest that projects have a statement about their LLM policy
@jonas-eschle: add a check box to the template about "All LLM-generated code in the codebase being reviewed and tested by a human prior to review" ...

@EdisonAltamirano I love your suggestions around reimagining peer review!! For this specific thread, let's focus on policy and education first. And we can revisit these tools for use by reviewers later.

Please keep feedback coming - I just tried to distill what I see above into a set of goals that can drive our conversation and become pillars for decision making. ps - i'll be mostly offline next week so keep the convo going and i'll chime in again and summarize further when i'm back.

lwasser avatar Aug 22 '25 21:08 lwasser

I think, @jedbrown, that this is in the end an issue which needs to be decided by others, not pyOpenSci. We just follow what is generally done with open source packages; there is no need to come up with custom solutions. We're not here to check license infringement on that level.

Even if a future court rules otherwise, that doesn't make it right or respectful of consent, and laws may change.

I think this is what makes me suspect that there is more personal resentment in your answer than actual fact-based argument - could this be? :) Because, of course, we follow what the courts think is right. If they find it's fine, we don't try to overrule them. If the laws change, they change, and everybody needs to adapt. Which, again, isn't really on us to enforce.

We care about code quality, unit tests, maintainability, and more, including the absence of obvious copyright infringement. That's it. As long as the courts deem it fine, we should deem it fine too, no question.

jonas-eschle avatar Aug 25 '25 09:08 jonas-eschle

Hello. I am just passing by. In the end, whatever process you adopt, things will play out however the masses want them to. If this is volunteer-driven and a flood of AI submissions comes in, human volunteers will get burned out and disappear - then what will happen to the project itself? What courts are we talking about, and can they be trusted to make fair judgments?

pllim avatar Aug 25 '25 16:08 pllim

This is a really great thread, and couldn't be more timely for me!

Just some quick thoughts on the goals outlined by @lwasser here:

  • Goal 1:
    • I agree with @hamogu that it's important to allow users to disclose; otherwise people will submit anyway and not disclose. I wonder if, in addition to certifying that any AI code has been checked by a human, it would also be useful to require people to certify (just by a checkbox) that they have fully disclosed any AI use, since even when allowed some people hide their usage.
    • I do really like the idea of asking submitters to identify which lines of their code are AI-generated, like @sneakers-the-rat does with Stack Overflow, and I wonder if we could request that those lines include adjacent explanatory comments made by a human (see the sketch after this list), just to demonstrate that the submitter understands what the generated code is doing.
  • Goal 3: I'm very excited about this goal because overall guidelines on LLM assistance are really needed all throughout the research pipeline, and I've been thinking a lot about this in the context of teaching dev to students.
    • Would tiered guidelines based on a submitter's familiarity/expertise be helpful? For example, I find expert coders are much more likely to catch where an LLM is being overly verbose or ineffective, and to use snippets rather than entire AI blocks, while a coder who is just starting out is more likely to use larger blocks and not catch when an AI uses many lines to do something that could be done in one line instead. I've seen that LLM assistance often hinders novice projects even when it appears to be helpful to the novice.
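
As a purely hypothetical illustration of the adjacent-comment idea above (the function and its provenance note are invented for this sketch, not taken from any submission):

```python
def rolling_mean(values: list[float], window: int) -> list[float]:
    # AI-assisted (accepted from an editor suggestion), reviewed by the author:
    # computes the arithmetic mean over each contiguous window of `window`
    # elements; returns an empty list when the input is shorter than the window.
    if window <= 0:
        raise ValueError("window must be positive")
    return [
        sum(values[i : i + window]) / window
        for i in range(len(values) - window + 1)
    ]
```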

elliesch avatar Sep 09 '25 22:09 elliesch

I do really like the idea of asking submitters to identify which lines of their code are AI-generated, like @sneakers-the-rat does with Stack Overflow, and I wonder if we could request that those lines include adjacent explanatory comments made by a human

I don't think that's practically possible. If I use Copilot in VS Code, then most of my lines will be autogenerated to some degree; in some cases it might just be to autocomplete a single word, in others it might give me the skeleton of a for-loop to fill in manually, or it might generate a 5-line function. Even if I reject a suggestion, it will continue to offer things until it gets close enough to what I want and I eventually accept one. Do we really want people to add comments to a large fraction of their code lines saying "three words on this line were autogenerated but I checked them" and "AI made this function, but then I changed the variable names and made the comparison greater than instead of less than because that's what's needed here"?

An "explanatory comment adjacent to each of those lines" essentially means "if you use Copilot, you need to comment almost every line of code by hand after you write it" - which means spending far more time commenting than the coding would take in the first place, and thus renders any AI assistance useless. If that's what we want, it would be fairer to say "We do not accept code written with the assistance of Copilot or similar LLMs".

If you generate a whole module from a prompt and then insert that into your pre-existing code, that's a different situation, but that's not how I (or most people who work around me) use LLMs today.

hamogu avatar Sep 10 '25 00:09 hamogu

I am going to make an epistemic argument regarding copyright infringement. First, note that LLMs prompted with the first line of a book can generate the entire book verbatim. The argument that LLMs manipulate ideas (which are not copyrightable) and just incidentally generate entire books or entire functions does not hold water. LLMs manipulate expression without the ideas.

If a human were to manipulate expression (such as paraphrasing via thesaurus/grammar substitution), it would be copyright infringement. The application could be deemed fair use, e.g., if it's a parody or quote/paraphrase with citation, but that doesn't apply to competing uses (as is typical for code). Legal tests for "substantial similarity" are circumstantial evidence for the mechanism by which a human creates an allegedly-infringing work because the court does not normally have access to direct evidence of that process. If one has a confession of mechanism, there is no need for substantial similarity tests. With an LLM, we know the mechanism is manipulation of expression.

Regarding this:

we follow what the courts think is right. If they find it's fine, we don't try to overrule them.

Journals constantly make editorial choices about scholarly practices, including banning practices that are 100% legal. For example, COPE forbids ghost authorship even though ghost authorship is entirely consensual (the ghost author gets paid). LLM output is non-consensual ghost authorship in a blender: you are paying an LLM provider to manipulate expression of undisclosed authors who have not consented to have their work used in that way. Ghost authorship is unacceptable in scholarly journals because it is a lie to the second victim of plagiarism: the audience. LLM output is both plagiarism (both victims) and copyright infringement.

jedbrown avatar Sep 10 '25 04:09 jedbrown

Journals constantly make editorial choices

We're not a journal and people aren't submitting papers but code. A tool.

We review code; coding is engineering, and all that we are concerned about is whether the package is in a good enough state and doesn't raise red flags. That's something quite different from a paper review.

[...] regarding copyright infringement.

Let's cut this discussion short. Copyright is a legal term. It is therefore up to the legal system to decide what is copyright infringement and what isn't. We're not the legal system, nor do we have any expert knowledge. We follow what the experts do.

I understand that you have a strong opinion. I get that. And maybe some resentment against AI. I get it; I share similar feelings and thoughts. So please feel free to express all of this in the appropriate places, including within the legal system. Regarding pyOpenSci, this simply isn't the place to discuss these fundamental legal questions. We're not even in a position to do so; being "pseudo-judges" without any knowledge of the topic will only harm us.

Judges don't determine scientific coding standards, we do; we don't determine the legal concept of copyright and how it applies, judges do.

jonas-eschle avatar Sep 10 '25 11:09 jonas-eschle

Hi everyone!! Me again, popping in to suggest a path forward. ✨

What I appreciate the most about this convo is the thoughtful disagreement about how we should proceed. Thoughtful disagreement just shows that we all care, and at the same time, it highlights and acknowledges the real and pressing challenges associated with policies around these tools.

This issue has been open for 3 weeks. I am starting a Google document that contains 2 parts:

  1. An intro blog post about gen AI, which we will publish on pyOpenSci's blog. The blog post will introduce all of the challenges that we see without bias (if we can) and without too much opinion (if we can achieve that). And all of you can (and should!!) contribute to it!!
  2. The start of a "policy" around AI with a focus on disclosure and human/maintainer review (before entering peer review). I believe that the most critical aspect for our organization is protecting our volunteer-driven process.

I think it would be fantastic to accompany this blog and policy release with thought pieces, as we have discussed. These thought pieces can share different ideas, thoughts, and resources about Gen AI, including licensing infringement, etc. And they can contradict each other (in the sense that I think it's ok, good, and healthy to have different opinions if we present them in ways that are sensitive to not shaming others but informing others). And they don't have to represent pyOpenSci as a whole - we will clearly label them as opinion pieces.

If anyone wants to write such a piece, please let me know. Licensing seems to be a topic that is popping up here that could make for great opinion pieces.

In the spirit of full transparency, the very early draft version of the Google doc is here.

Note that I'm still adding some of the more recent comments to the doc so it's in early draft stages.

I will turn this document into two PRs (one for our blog and another for the peer review guide) once it's crafted well enough for that step.

I welcome input on this process as well here or on Slack ... You know where to find me 😸

lwasser avatar Sep 10 '25 16:09 lwasser

Goal 1) Disclosure Statement

I am not seeing much disagreement about disclosure except this one point:

I personally dislike policies that disadvantage honest submitters (by e.g. putting up higher hurdles for submissions with LLM-generated code in them) with no chance of ever even noticing bad actors.

So what is the purpose of a disclosure requirement if people can just lie and say they used no AI when they did? first i would dispute that LLM-written code is not recognizable, but that's not really important to what i think the purpose is. The purpose is to give reviewers a sense of what they are reviewing so they can decide whether and how they want to spend their time reviewing something. Cooperative, open review is ideally a repeated back and forth where reviewers and authors work together to help improve a work, and a disclosure allows the author to be matched with reviewers that would facilitate that - if your code was 100% LLM generated, you wouldn't want someone who is ethically opposed to LLM usage to review your code. There are a range of objections, from absolute ethical rejection to not wanting to spend the time to clean up code that the author didn't even take the time to write, and reviewers and authors should be able to know what they're getting into before they commit time.

People of course can still lie, but having the disclosure requirement gives reviewers something to point to - "hey, you indicated here that you didn't use any generative AI in the code, and yet here is an import of a package that doesn't exist and the same function written nearly-identically four times. That looks like LLM code to me, do you have any explanation for that? If that's the case, i respectfully decline to continue the review, since trust is a precondition of the review process." If someone indicates they didn't use AI when they actually did and the reviewer finds no problems, then there is no problem - the goal of disclosure is not to ensure a 100% accurate database of which packages used AI, but to facilitate a healthy review process.


So then I think the remaining things to nail down include...

  • granularity - i think it would probably be unworkable to identify LLM-generated code down to the line, and over time it might be unclear even to the author what is theirs and what is the LLM's. I think a disclosure statement that is relatively open ended, but prompts to maybe expand on a few broad categories of possible use/domain, might be good: "generated the docs and initial test cases..." "used line autocompletions..." "full on agentic multi-model mcp server vibe coded this whole thing..."
  • endorsement of author review - a few ppl mentioned this and i think this is a good idea, the author should have to affirmatively say that they have at least reviewed and read their own code, since imo it is very disrespectful to ask someone else to review your code if you haven't even proofread it! again this is for the purpose of giving the reviewer something to point to like "hey what is this? did you read this?"
  • statement explaining disclosure - the main text to nail down is an explanation about what to disclose and why, i will check out the google doc, but just listing this for completeness. I think this should be like "be honest with the reviewers, they are your peers and they are here to help. this is for ensuring fairness and equitable use of time. complete disclosures help reviewers know what parts of the code might need extra attention, so please be as complete as you can... etc." to encourage people to disclose and make it clear that it's not a purity test.

Licensing/plagiarism

I am not sure how the discussion got to waiting for the courts to decide what is good here, i agree it would probably be good to keep this discussion focused on what to do as an org here, since we are not trying to adjudicate what is Morally Correct for everyone, but we are trying to run a volunteer organization, and hopefully we are trying to encourage ethical practices in software development.

I think this sort of touches on one of the core cultural distinctions in FOSS spaces that's much older than LLMs:

We review code; coding is engineering, and all that we are concerned about is whether the package is in a good enough state and doesn't raise red flags. That's something quite different from a paper review.

Speaking for myself, I don't think "the code is just the code" - software, like all tools, embeds and embodies values both in what it does (would we be neutral in a review of an open-source weapons guidance system?), and how it was made (would we be neutral in a review of a package proudly produced by forced child labor?). What is the purpose of reviewing code? If the purpose is just to make sure that things run, then surely the LLM-oriented could just ask an LLM to review it and be on their way. I think the purpose of open collaborative review is to share expertise, to bring up a new generation of programmers, to get external perspective from other people, and, yes, curate a set of packages that hopefully have a reputation of ethical behavior and quality construction that people can feel good about trusting.

Take LLMs out of the question for one second: what if the packages PyOS reviewed were simple copy/pastes of other existing packages with a new author and no attribution of prior work? Would we feel good about promoting and putting our name behind something that exploits and undervalues someone who did the actual work? That's the extreme case, but LLMs do pose a real ethical problem here, maybe somewhere shy of full reproduction in most cases, but certainly not zero - if there was zero reproduction of the distribution of the training set, then the models wouldn't work. Most features of open source licenses have not been tested in court at all, and copy/pasting a full MIT-licensed package is actually fully within the bounds of the license (as long as the license is maintained, and most packages do not modify the license to include attribution). So i disagree that it's even primarily a legal question, in a lot of cases it wouldn't actually be illegal at all, but in my opinion it would be unethical for us to play a role as a review organization in facilitating people taking credit for other people's work.

I am somewhat embarrassed to admit that i spend a lot of time reading "vibe coded" projects because i find the culture fascinating and the code... maybe a more grim kind of fascinating. LLM code does create a lot of identical boilerplate, which is not altogether problematic to me, but it is also true that they frequently reproduce large sections of other packages, maybe with minor modifications in names, but unquestionably reproductions. This is both an ethical problem and an engineering problem - despite what the "AI" maximalists claim about "every tool created on demand" being some glorious future, hopefully those of us that have spent time in FOSS understand that stable maintenance of shared tools that are well tested and mature is vitally important to a functional tool ecosystem. Endorsing a series of projects that statistically reproduce their own yaml parser, jwt implementation, and so on would be counter to one of the core goals of review, curating a collection of high-quality packages that can be built into something larger than themselves.

Inevitably packages will duplicate some code even without LLMs. I am recalling an old discussion on "how do we deal with packages with overlapping functionality" (that i unfortunately can't find now) - we currently ask authors to write something about the state of the field of neighboring packages and how theirs differs from other things that might do the same thing, both to link together related things as well as to clarify why a package was created when something else may have already existed. I was, and still am, on the side of "multiple implementations can be good if there is some reason for having them!" e.g. if there is some social problem like an ornery dictator maintaining something, or if there is some technological limitation, interface refinement, and so on. I think we may do something similar here: if authors have generated some substantial part of their package that could conceivably have been "inspired by"/copied from another existing package, ask if they have searched for related implementations and write something short about why not use that, and if the code does appear to overlap substantially, add some attribution.

There are currently not great tools for this, since i have more often seen near-exact reproductions with jittered strings or names than 100% perfect reproductions, but in my experience i can usually find the original implementation in a few searches using characteristic symbol names/reading the implementation in the most popular prior packages. Again the purpose of this is to facilitate an ethical, efficient review process rather than being a court of law. If an author is autogenerating whole modules that are near-copies of existing code, I think it's fair that they are asked to do the homework of checking if it is a reproduction so that reviewers can calibrate their review (since reviewing the code will surely take more time than searching for a few strings in it), and we can ensure that we are a good player in the community encouraging proper attribution. this would be good for authors as well! if they can drop some module and depend on something mature instead, that's part of the process of review serving to improve the state of the work.
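
as a rough sketch of what that homework could look like (not an existing pyOpenSci tool; the symbol name is made up, and a GITHUB_TOKEN environment variable is assumed since GitHub's code search endpoint requires authentication):

```python
import os

import requests


def search_symbol(symbol: str, language: str = "python", per_page: int = 5):
    """Return (repository, path) pairs whose indexed code contains `symbol`."""
    resp = requests.get(
        "https://api.github.com/search/code",
        params={"q": f'"{symbol}" language:{language}', "per_page": per_page},
        headers={
            "Accept": "application/vnd.github+json",
            # code search requires an authenticated request
            "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
        },
        timeout=30,
    )
    resp.raise_for_status()
    return [(item["repository"]["full_name"], item["path"])
            for item in resp.json().get("items", [])]


if __name__ == "__main__":
    # e.g. a characteristic helper name lifted from a suspiciously polished module
    for repo, path in search_symbol("_levenshtein_matrix_fill"):
        print(repo, path)
```

a few hits on the same distinctive names in an older, popular package is usually enough of a signal to ask the author about provenance.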

sneakers-the-rat avatar Sep 11 '25 02:09 sneakers-the-rat

We're not a journal and people aren't submitting papers but code.

Scholarly legitimacy is necessary, and even more important for the JOSS partnership, which we want universities to "count" as having the same integrity as traditional journals.

Software peer review, similar to the review of scientific papers

JOSS submission: If the maintainer wishes, and their package is within JOSS’ scope, they can now be fast tracked through the JOSS review process (another review is not required for this step).

If you say that ghost authorship, plagiarism, or license violation is not a concern or not worthy of due diligence to prevent, universities will view JOSS as a fake journal and the entire community will be set back decades. Excellent researchers who built their careers on trust in pyOS and JOSS will not be hired or will be denied tenure.

jedbrown avatar Sep 11 '25 04:09 jedbrown

Everyone, I just opened this PR that begins to create the policy for peer review.

I think that we need to focus our policy on things we can control: human review of any generated content & disclosure of the use of tools. I do want us to have blog posts that broadly discuss the other issues that are harder to "control" in a review, like licensing, ethical concerns, environmental impacts, etc. The goal of these posts will be to raise awareness. I don't think we can be successful in changing how people work via peer review, but we can raise awareness and help people think about these tools critically as they adopt them.

I also think a thought piece on the issues around AI is worthwhile - I'm working on a draft and will share that soon for review as well. We can link to the blog and any thought pieces in the future on these pages once we are happy with it.

Please review the pull request. Let's plan to merge it by 25 September 2025 (so submit your review by the 24th please). Thank you all!

lwasser avatar Sep 11 '25 17:09 lwasser

Friends - I'm sorry, I also just saw @sneakers-the-rat's and @jedbrown's follow-up thoughts. I made some decisions in generating the text in the open PR based on what I think we can successfully achieve vs. what is too complex to manage. I really love Jonny's point about the disclosure not only serving pyOpenSci but also a potential reviewer who has strong feelings about AI-generated code. I think focusing on disclosure here is a strong first step. And then let's get the blog post going next - I have an ugly draft that I will put into a PR when it's closer to being readable, and we can work on it there.

And we can link to it via our policy too! From there, I see thought pieces where you all can have the opportunity to talk about AI challenges and issues, and even post conflicting positions. I think that is fine and it's a critical discussion to have. In my mind, there is a systemic root challenge here:

  • There are some valuable uses of gen ai for SOME people. AND
  • There are huge ethical and societal impacts that these tools have that make them dangerous for many different reasons.

I believe that the tension that we are navigating here is that both statements above are true.

Feel free to reach out to me directly, y'all, if you aren't comfortable with how we are moving forward AND / OR if you have any thoughts on all of this! This is hard!

lwasser avatar Sep 11 '25 18:09 lwasser

Scholarly legitimacy is necessary, and even more important for the JOSS partnership, which we want universities to "count" as having the same integrity as traditional journals.

Agree. But we don't review a paper, we review code. And while for papers, generation by LLMs is generally (please don't pick on this one, you know what is meant) not considered to have integrity, generating code is, as long as a human stands behind the code (at least this seems to me to be the current consensus in the legal and academic systems, and the de facto practice).

If you say that ghost authorship, plagiarism, or license violation is not a concern or not worthy of due diligence to prevent, universities will view JOSS as a fake journal and the entire community will be set back decades. Excellent researchers who built their careers on trust in pyOS and JOSS will not be hired or will be denied tenure.

Using an LLM to code and disclosing it is not ghost authorship or plagiarism AFAIK, but I may be wrong. But that's maybe where we disagree, so let's settle this one point. Do you have sources for this claim $^1$? I think we can agree: if it is, then I am all with you. If it isn't, it's rather an argument to allow LLM contributions. My main point is: determining whether it is plagiarism etc. is not our job; developing policies that enforce proper behavior is.

@sneakers-the-rat

what if the packages PyOS reviewed were simple copy/pastes of other existing packages with a new author and no attribution of prior work?

I fully agree, and would like to think that argument through. That's clearly something that we should stop - one of the "red flags". But where does plagiarism start? Copying a line? Remembering a line? That's the hard part. The question isn't whether we should stop plagiarism if we detect it (we should, yes, that's our job!) but what constitutes plagiarism. And as in the case of copy-paste -- or all those cases in between -- this is the job of courts and has never been up to us. We catch it, but we don't define what plagiarism is. So why should we act differently with LLMs - which are a grey area, I agree - and not treat them the same way we have treated any other kind of plagiarism: taking whatever the law and courts have decided?

@lwasser fully agree, especially with

I think focusing on disclosure here is a strong first step.

$^1$ Please, no long monologues from some corner of the internet. If this is a problem, with so much code written with LLMs, there should be many prominent cases. The question for us is: what is the overwhelming view on the issue (and you may very well personally disagree with that!).

jonas-eschle avatar Sep 15 '25 12:09 jonas-eschle

Hi everyone. I REALLY appreciate the conversation that we have had here. I have taken the liberty, supported by Mandy, of creating a blog post on our AI policy.

https://github.com/pyOpenSci/pyopensci.github.io/pull/734

I'd love feedback from everyone here on the post, and I'd love it even more if you want to add your name as a co-author (your choice, but the more the merrier from my perspective).

This blog post mentions challenges in this space but doesn't dive deep into the political and social implications of this technology. I'd love for some of you to consider writing thought pieces about these subtopics, which are huge topics, but ones that we can't really control right now with policy. We can protect our volunteers (like you all!) and provide some guidelines for maintainers to consider in a way that doesn't shame anyone for using the tools but instead raises awareness of the issues.

Let's shoot to merge this by the end of this month (October 1). Please provide feedback before that time and add your name as an author if you wish! I'd love to add everyone here as a minimum, IF you are all open to it.

lwasser avatar Sep 16 '25 19:09 lwasser

If it is a problem, with so much code written with LLMs, there should be many, prominent cases.

This prominent class-action case about code was filed in 2022 and has yet to make it to court. The case includes examples in which entire pages of copyrighted code are produced nearly verbatim from a one-line comment. https://githubcopilotlitigation.com/case-updates.html

This is not a lawsuit (yet), but it shows that an entire Harry Potter book can be emitted by Llama 3.1 70B from a prompt consisting of only the first line. https://arxiv.org/abs/2505.12546

Those cases are about copyright infringement (a legal concept), not plagiarism. Plagiarism is not illegal, but is a violation of publication ethics. While they often come together, there is copyright infringement that is not plagiarism and plagiarism that is not copyright infringement. Publishers are responsible for upholding publication ethics and good scholarly practices.

From Lemley and Ouellette (2025):

[Image: excerpt from Lemley and Ouellette (2025)]

jedbrown avatar Sep 17 '25 06:09 jedbrown

Plagiarism is not illegal, but is a violation of publication ethics

Then we can agree I think:

  1. copyright is up to the courts and we follow whatever they decide
  2. plagiarism is something that concerns mainly publishing. pyOpenSci vets packages - to quote, "We review Python packages and software with the goal of helping scientists build better, discoverable and usable software." - so it is not a primary concern to us. Even a plagiarised package can be "better, discoverable and usable", although we will of course try to catch any kind of blatant misconduct.
  3. In connection with JOSS, which publishes packages with papers, it indirectly becomes a concern to us in the sense that JOSS (not pyOpenSci) may require standards for plagiarism. It would, however, be up to JOSS to set any stronger guidelines.

Do you agree?

Maybe the stronger point to me is the realistic one: we want to vet code, primarily, and increase code quality and discoverability - that's the main goal. Code has been written less and less by a single human: IDEs with refactoring and autocompletion helped a lot, Stack Overflow and search engines with publicly available code make us copy-paste things we maybe shouldn't, and now LLMs top it off. But these are all amazing tools. People will use them. If we put any serious restrictions on them (at the pyOpenSci level; JOSS could be another thing), people simply won't get their packages vetted. Sure, we should prevent obvious cases, no question, and give good practical guidelines on how to use LLMs (!), as we do for other things, but not be prohibitive, or we're shooting ourselves in the foot.

We say that "vetted code is sustainable and follows good software practice"; we don't say that "everything is 100% not plagiarized and academically perfect". JOSS may want this, and then we require whatever JOSS wants. Do you agree with these goals, or do we perhaps differ?

jonas-eschle avatar Sep 17 '25 07:09 jonas-eschle