Proposal: using codemetapy and codemeta.json to make software easier to cite for researchers and software maintainers
Hey Maarten,
I was hoping we might get a chance to chat sometime. I am currently an eScience Center Fellow and a research software engineer at WUR. For my fellowship, I aim to help build tools that let researchers credit the people who create software and that prevent accidental or intentional plagiarism of software. As part of this work, I began writing a Python package called cff2toml (it probably should have been called toml2cff), which I had hoped would help automate the creation of CFF files from pyproject.toml files. You can find it here: https://pypi.org/project/cff2toml/
I was hoping to use some of these tools to create a knowledge graph of citation metadata across PyPI packages and to try to estimate how much potential plagiarism is occurring among research software engineers.
Then I discovered the CodeMeta project, and from there I discovered your tool.
As part of my fellowship, I was hoping to help develop some tools, like yours, and would love to help you if you are open to it.
Here's a problem I have been thinking about. The current strategy for improving citation metadata for software seems to be to help automate the creation of citation metafiles (like CFF, .zenodo, .tributors, codemeta.json, etc.). However, this strategy assumes that the maintainers of these packages will use these tools. This is a real problem from a research ethics perspective, since researchers still need to cite software even if its creators did not make it easy to cite. Indeed, some software projects no longer have maintainers. Some projects have maintainers who do not want to create these files, perhaps because it takes too much time. Other maintainers might not create these files in time for the researcher to cite them. Still other maintainers have no idea that there is demand for these files. Even those who create these files often do not keep them updated, so the files may be inaccurate or incomplete. And finally, even a maintainer who is willing to keep accurate citation metadata may not be technically able to do so for versions that have already been released: they cannot change the source behind an existing tag (so there is no way to add a codemeta.json file), and they have no clear way to supply missing citation metadata or correct inaccurate metadata.
Here's my proposal for addressing these issues for citation metadata with respect to Python modules using your tool, codemetapy.
Enhance codemetapy so that:
- From command line, it can create a default codemeta.json file for any version of any Python package on PyPI.
Currently the tool can do this only if the package is already installed locally. Perhaps we could extend the command line so that users can write something like "codemetapy {packagename} {version}" and the tool generates the file. It could still generate the file from a locally installed version, but in that case it would be nice if, for users who do not have the package installed, it created a temporary environment, installed the package, and then reverted the installation or destroyed the temporary environment. Alternatively, perhaps there is a way to do this without requiring any local installation. This might involve looking the package up on some central server that does the installation remotely, stores the file in a database, and serves it back.
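To make the no-install option a bit more concrete, here is a minimal sketch that builds a default codemeta document from PyPI's JSON endpoint (https://pypi.org/pypi/{package}/{version}/json) rather than from a local installation. The function names are hypothetical and not part of codemetapy, and the field mapping is a deliberately small assumption of mine; codemetapy's real conversion covers many more properties.

```python
import json
import urllib.request

# Context URL used by CodeMeta 2.0 documents.
CODEMETA_CONTEXT = "https://doi.org/10.5063/schema/codemeta-2.0"


def pypi_info_to_codemeta(info):
    """Map the "info" object of a PyPI JSON response to a minimal codemeta dict.

    Hypothetical helper: only a handful of fields are mapped for illustration.
    """
    doc = {
        "@context": CODEMETA_CONTEXT,
        "@type": "SoftwareSourceCode",
        "name": info["name"],
        "version": info["version"],
    }
    if info.get("summary"):
        doc["description"] = info["summary"]
    # Some packages leave "author" empty and only fill "author_email";
    # this sketch simply skips the author in that case.
    if info.get("author"):
        author = {"@type": "Person", "name": info["author"]}
        if info.get("author_email"):
            author["email"] = info["author_email"]
        doc["author"] = [author]
    return doc


def fetch_pypi_info(package, version):
    """Download the metadata for one released version from PyPI (needs network)."""
    url = f"https://pypi.org/pypi/{package}/{version}/json"
    with urllib.request.urlopen(url) as response:
        return json.load(response)["info"]
```

With network access, json.dumps(pypi_info_to_codemeta(fetch_pypi_info(package, version)), indent=2) would print the default file for any released version. Keeping the mapping separate from the download means the same mapping could also be fed from a temporary environment or a central server.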
- Permit the code maintainer to override the default codemeta.json for any version of any Python package on PyPI.
I thought of this approach. First, define and pilot a new standard protocol that tells metadata harvesters and tools (like this one) where to look for codemeta files on the main/master branch. For example, tell maintainers to add a codemeta_{version}.json file to their main/master branch, and tell metadata tool makers (like codemetapy) to check for and use this file before trying to generate a codemeta file by other means. For codemetapy, this could work as follows: use the GitLab API to check the main/master branch for codemeta_{version}.json and use it if present; otherwise, check the tag associated with the version for a codemeta_{version}.json file; otherwise, check that tag for codemeta.json; and if all of that fails, generate the codemeta.json output by other means, such as parsing project files. We could update the command line options of this tool to let users specify a tag prefix (like "v" or "version") and the name of the main/master branch.
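The lookup order in this approach could be sketched roughly as follows. This is only an illustration: file_exists is a hypothetical callable standing in for whatever hosting API request checks whether a file exists at a given branch or tag, and the defaults "main" and "v" are placeholders for the proposed command line options.

```python
def candidate_locations(version, main_branch="main", tag_prefix="v"):
    """Return (git ref, file path) pairs in the order they should be tried."""
    versioned = f"codemeta_{version}.json"
    tag = f"{tag_prefix}{version}"
    return [
        (main_branch, versioned),   # 1. maintainer override on main/master
        (tag, versioned),           # 2. versioned file on the release tag
        (tag, "codemeta.json"),     # 3. plain codemeta.json on the release tag
    ]


def resolve_codemeta(version, file_exists, main_branch="main", tag_prefix="v"):
    """Return the first (ref, path) for which file_exists(ref, path) is true.

    Returns None when nothing is found, in which case the caller would fall
    back to generating codemeta.json by other means.
    """
    for ref, path in candidate_locations(version, main_branch, tag_prefix):
        if file_exists(ref, path):
            return ref, path
    return None
```

Because the override on the main/master branch is tried first, a maintainer could correct the metadata for an already released version without touching its tag.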
- Make it easy for maintainers to learn about and include codemeta.json files.
Here are some feature ideas:
- Create a command line option that saves the file under the new name that includes the version (e.g., codemeta_{version}.json).
- For projects that have a maintainer or author email, send an email with a copy of the codemeta_{version}.json file.
- For GitHub projects, add a command line option that creates an issue and copies or attaches this file to the issue.
- For GitHub projects, add a command line option that forks the project, adds the file to the fork's main/master branch, and then creates a pull request against the original project's main/master branch.
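As a small illustration of the first of these ideas, saving under the versioned name might look something like this. The function name is hypothetical, and the sketch assumes the generated codemeta document is already available as a Python dict with a "version" key.

```python
import json
from pathlib import Path


def save_versioned_codemeta(doc, directory="."):
    """Write doc to codemeta_{version}.json in directory and return the path."""
    path = Path(directory) / f"codemeta_{doc['version']}.json"
    path.write_text(json.dumps(doc, indent=2) + "\n", encoding="utf-8")
    return path
```

The same file could then be attached to an email, an issue, or a pull request.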
What do you think?