
CiteSoft -- exporting citations for software used, propose to merge into citation-file-format project.

Open AdityaSavara opened this issue 3 years ago • 3 comments

The tool, CiteSoft, already exists. The tool just needs to be merged into this project and needs some minor adjustments. This would be a tool for CFF creation, which would occur in the course of running software. I am attaching an example and "roadmap for merging". It makes most sense to read everything below first, so I am putting most of the attachments at the end.

CiteSoftWithExample.zip

Who needs the new tool?

There are two types of users who need the tool:

  • Dev-users (people who write or contribute to software and want to get credit when their software/functions/codes are used)
  • End-users (people who run software, and who need to know which software to cite based on their user choices -- because a person's choices may use different parts of the backend).

What should the new tool enable users to do?

  • Dev-users will be able to embed lines of code in codes they contribute to which will spit out citations when their code is called.
  • End-users will be able to receive citations for whatever choices they have made / code they have run, including any dependencies that were called in a way that is worth citing.
  • This is related to (but not identical to, maybe can be merged with) the "CiteMe" repository of citation-file-format https://github.com/citation-file-format/citeme but see below

What benefit would the new tool add?

  • As far as I know, this is really different from the CiteMe repository or anything else that the citation-file-format project currently has. CiteSoft includes not only some software, but also a protocol so that people can consolidate citations within projects that span different programming languages. For example, end-users will be able to run a scientific simulation (such as molecular dynamics, or disease spread modeling) and would receive all of the relevant citations for their specific run. They could then use the metadata from CITATIONS.cff to cite software in papers they write. Dev-users would get credit.

Implementation suggestions

CiteSoft already exists in Python. It also has a YAML format that is very similar to citation-file-format, so CiteSoft only needs to be slightly tweaked to match the current best practices of citation-file-format. Please see the attached files for an example and a roadmap for merging.

The citation-file-format schema would probably also need a unique-id top level field added (as noted in another issue, https://github.com/citation-file-format/citation-file-format/issues/344).

CiteSoft also has a partial C implementation so far, because C++ is widely used in scientific software. However, CiteSoft and its protocol have intentionally been designed to ultimately be implemented in all programming languages.

CiteSoft's source code is currently here: https://github.com/AdityaSavara/CiteSoft_Py . I also have an "organization" account here: https://github.com/CiteSoft , but it is not being used yet since CiteSoft is not yet in wide release.

Can you help?

Yes. The python implementation is pretty much done. I am attaching an example and a roadmap for merging. I do wish to help, but I also hope that this community can ultimately take over CiteSoft, with me as one of the maintainers of the CiteSoft repository. It probably makes sense for the CiteSoft repository to be moved here. The typical installation of the python version of CiteSoft is by pip, so it does not matter much where it is hosted on github, as long as I keep updating the pip package as needed.

Explanation of files attached.

  1. A python package for CiteSoft, with an example you can run called runExample.py. You can dig around in the CITATIONS directory after running the example. You’ll find there is a “combined” .cff file there called CITATIONS.cff (see https://github.com/citation-file-format/citation-file-format/issues/344 )
  2. A package written in C for CiteSoft that is only partly written. I recommend ignoring this C package for now.
  3. A merging to-do list / roadmap which you should probably read after running the example.
  4. The word document (CiteSoftStandard3.7.docx) that defines the current CiteSoft protocol and YAML standard, which would be changed before merging into the CFF project.

CiteSoftWithExample.zip

210905MergingPlanAndRoadmap.docx

CiteSoftStandard3.7.docx

210905MessageToBerendWeelTruncated.docx

CiteSoft_C-master.zip

AdityaSavara, Sep 16 '21 20:09

For easier readability, I converted @AdityaSavara's roadmap docx to markdown below:

MERGING PLAN & ROADMAP

For incorporating CiteSoft into the Citation File Format Project

CiteSoft side:

  1. CiteSoft: Needs to have author fields changed to match CFF. This needs to be changed in both the examples and the Schemas.
    • The CiteSoft format needs to have family name, given name, and orcid added as subfields for author names. CiteSoft intentionally had a simpler format to ease adoption, but CFF is already sufficiently adopted that such considerations are no longer necessary.
  2. My CiteSoft.py module needs to have its directory syntax fixed. I have currently hard coded the CITATIONS directory to work for the example.
    • CiteSoft protocol should be updated to specify that programs should (but are not required to) provide the option of an input variable or global variable called citations_path (can be in input files), that way when a program from one language calls another language, it can provide the citations_path for CiteSoft to use.
      • Note that if a program does not do so, its citations will end up in the wrong place. CiteSoft will not break. CiteSoft has a philosophy of “options” to reduce the entry barrier for dev-users while not harming other dev-users.
    • Note that individual softwares will have the option of making citations_path a required argument, if they wish to, for their software to run.
  3. File encoding needs to be specified in a way that works across Linux and Windows. I have found an issue that if CiteSoft files are made on windows and then need to be read and parsed for entries to be added on linux, sometimes it doesn’t work due to some encoding issue. I think this is some kind of Unicode issue (https://superuser.com/questions/294219/what-are-the-differences-between-linux-and-windows-txt-files-unicode-encoding , maybe related to this “BOM” thing https://stackoverflow.com/questions/2223882/whats-the-difference-between-utf-8-and-utf-8-without-bom)
    • I don’t know what the answer is to this, but apparently something needs to be specified for consistency. Initially my plan was to make an “encoding” field in the individual files (like in the CFF files), but that plan didn’t work because my program crashed before it ever reached the encoding line. Unfortunately, I did not save any examples of this, but I could try to make a demonstrative example if needed.
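
The citations_path option from item 2 could be resolved roughly as follows. This is a sketch, not CiteSoft's current code: the function name `resolve_citations_dir`, the `CITATIONS_PATH` environment variable, and the `./CITATIONS` fallback are assumptions made for illustration (the roadmap only says "input variable or global variable").

```python
import os
from pathlib import Path


def resolve_citations_dir(citations_path=None, env_var="CITATIONS_PATH"):
    """Pick the directory where citation files are written.

    Order of preference: explicit argument, then an environment variable
    (hypothetical name), then a CITATIONS directory in the working directory.
    """
    chosen = citations_path or os.environ.get(env_var) or "CITATIONS"
    directory = Path(chosen)
    # Create the directory if needed so callers can write to it right away.
    directory.mkdir(parents=True, exist_ok=True)
    return directory
```

A program in one language that launches a program in another language could pass the same path along (as an argument, an input-file entry, or an environment variable), so that both write into the same CITATIONS directory. A program that ignores the option still runs; its citations just land in the fallback location.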

CFF side:

  1. CFF standard should note that a single file can hold multiple CFF citations, if named CITATIONS.cff . This will have --- separators for multiple entries. The string --- is used like a page break between YAML document objects, so more than one "cff" can be put in a single file.
    • This concept already exists in the CiteSoft standard, and I have made my example export a CITATIONS.cff file
  2. CFF should have optional field of unique-id added: this can be a doi, or a url, or anything else that is uniquely associated with this citation/software. This is what CiteSoft uses to distinguish between duplicate citations to the same software.
    • unique-id should be a top level field of CFF, just like it has top level fields of doi and url.
  3. CFF should have an optional field of date-and-time-used added. This will be equivalent to the CiteSoft timestamp and will use ISO 8601 format (YYYY-MM-DDThh:mm:ss).
    • This should be a top level field in CFF, similar to how currently CFF has date-released
  4. CFF should probably add a top level optional field of URI (examples of URI are below).
    • https://datatracker.ietf.org/doc/html/rfc3986#section-1.1.2
  5. (optional): add an optional encoding field at the top level.
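
The combined-file idea in item 1 can be sketched with plain string handling. The helper names `append_entry` and `split_entries` are hypothetical, and the example entries are made up. This is not a YAML parser: it only treats a line consisting of `---` at the start of the line as a document separator, which is an assumption about how the combined file is written.

```python
def append_entry(log_text, entry_text):
    """Append one CFF document to the combined log, separated by '---'."""
    entry = entry_text.strip() + "\n"
    if not log_text.strip():
        return entry
    return log_text.rstrip("\n") + "\n---\n" + entry


def split_entries(log_text):
    """Split a combined log back into individual CFF documents."""
    entries, current = [], []
    for line in log_text.splitlines():
        if line.rstrip() == "---":  # YAML document separator at column 0
            if current:
                entries.append("\n".join(current) + "\n")
            current = []
        else:
            current.append(line)
    if current:
        entries.append("\n".join(current) + "\n")
    return entries


first = (
    "cff-version: 1.2.0\n"
    'message: "If you use this software, please cite it."\n'
    "title: Example Solver\n"
    "authors:\n"
    "  - name: Example Solver Team\n"
)
second = (
    "cff-version: 1.2.0\n"
    'message: "If you use this software, please cite it."\n'
    "title: Example Plotter\n"
    "authors:\n"
    "  - name: Example Plotter Team\n"
)

log = append_entry("", first)
log = append_entry(log, second)
entries = split_entries(log)
```

Appending never requires reading or re-nesting earlier entries, and splitting recovers each entry unchanged, so each piece can then be handed to an ordinary single-document CFF reader.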

Roadmap (Longer Term):

  1. CiteSoft should have a function added that will allow simply reading in an existing CFF and adding that CFF to the log. That way programmers don’t need to write all the information directly in python; they can just call CiteSoft.function_call_cite(cff_filename="./MYREPOSITORY/Citation.cff")
    • There is no need to do anything except validate the CFF before writing it to the log, and even that validation step should be something the user can turn off.
    • A person should even be able to give a web address to a CFF file if desired.
  2. (optional) CiteSoft currently has a line that replaces characters from unique_id with “_” when making some extra files. While this is technically an extra and unnecessary step, it’s still not good that the translation is not 1:1. This should be changed to use an encoding/decoding. However, I was reluctant to choose which encoding/decoding since I am not knowledgeable about this issue, and we want to make sure that whatever encoding/decoding of filenames is used will work across different computer languages.
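
One candidate for the reversible filename translation in item 2 is percent-encoding (RFC 3986), which most languages ship in their standard library. This is only a sketch of that option, not a decision made in the roadmap; the function names and the example id are hypothetical.

```python
from urllib.parse import quote, unquote


def unique_id_to_filename(unique_id):
    # safe="" also escapes "/" so the result never contains a path separator.
    return quote(unique_id, safe="")


def filename_to_unique_id(filename):
    # Exact inverse of unique_id_to_filename.
    return unquote(filename)


uid = "https://doi.org/10.1000/example:1"
name = unique_id_to_filename(uid)
# name == "https%3A%2F%2Fdoi.org%2F10.1000%2Fexample%3A1"
restored = filename_to_unique_id(name)
```

Unlike replacing characters with "_", two different ids can never collapse into the same encoded name. Remaining caveats would still need a rule in the protocol, for example ids that differ only by letter case on case-insensitive filesystems, or very long ids that exceed filename length limits.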

jspaaks, Sep 20 '21 10:09

Please find below my initial responses to items 1-5 (CFF Side) from @AdityaSavara's suggestion for a roadmap:

  1. For the use case where you have a big piece of software that depends on many papers, other pieces of software etc: this is what references was meant for. See

    • https://github.com/citation-file-format/citation-file-format/blob/1.2.0/schema-guide.md#referencing-other-work
    • https://github.com/citation-file-format/citation-file-format/blob/1.2.0/schema-guide.md#references.

    For the use case where "end-users" don't want to include all citable items from references, tooling may need to be developed that determines the appropriate subset. It seems to me that both citeme and CiteSoft_Py aim to do this, both through having developers annotate functions using decorators.

  2. I don't think this key should be added to the CFF schema, see my explanation here https://github.com/citation-file-format/citation-file-format/issues/344#issuecomment-922801233

  3. What would be the purpose of this new key?

  4. I think this item is predicated on (2), and since I disagree with adding (2), this is also not needed

  5. I'm not sure at this point if this should be included as part of the specification. Currently the document (https://github.com/citation-file-format/citation-file-format/blob/1.2.0/schema-guide.md) says CITATION.cff files are YAML 1.2, but perhaps this needs to be specified more tightly. If so, I guess it is only a choice between utf-8, utf-16 or utf-32. Either way I don't think adding a new key encoding is warranted. More information here https://yaml.org/spec/1.2/spec.html#id2771184

jspaaks, Sep 20 '21 11:09

I will respond with the same numbering (CFF Side):

  1. CITATIONS.cff suggestion: I don't think you are correct that references fulfills the purpose I am describing. The references feature is useful, but it is almost like a dependency tree, whereas CITATIONS.cff is more like a library or concatenatable bibliography list that gets spit out. It's not really correct that softwares always have a nested dependency structure: sometimes it's more correct to say they called each other. Also, nested references would make parsing the references a real pain for bibliographic softwares. As I mentioned on another thread, it becomes extra painful in languages other than python, and we wouldn't want CFF to be a python-confined format for practical usage. When someone wants to make a list of bibliographic references (not show the dependency tree), a format that allows simple concatenation will be very easy to handle in any language. I hope the point is already made. If not, imagine the dilemma faced by myself or whoever is developing CiteSoft. It's not trivial even in python: I would have to take third-party dependencies' citations and parse them in a way that fits into the CFF references field, which could produce a very long and basically nested CFF file. Then, after that, some bibliography software has to parse these references fields into citable items. In contrast, what I'm describing simply appends to a structure that can be parsed later. The references field is different from this 'library' of citations. Consider Zenodo and Github: they're not going to want to parse some nested thing either. They'll be much happier if somebody can just drag in a single file, CITATIONS.cff, that gets split into a bunch of references (as easily as splitting a string into a list based on a delimiter -- in fact, it is splitting a string into a list based on a delimiter). This may be complementary to, but is not fulfilled by, the references feature.
[Sorry for writing so long, I guess if what I wrote here did not convince you, then I probably will not be able to!]

  2. unique-id suggestion: Ok, I yield, as it is actually not needed for getting CFFs from CiteSoft. I do think unique-id is useful for bibliographic softwares in general, as explained on the other thread ( https://github.com/citation-file-format/citation-file-format/issues/345 ), but CiteSoft does not need CFF to carry this field. CiteSoft does an internal hashing by unique-id already.

  3. date-used field suggestion: Ok, I yield as it is not needed for getting CFFs from CiteSoft. Though I do think date-used is useful for citations in general (for example, it could be used to populate date-accessed for webpages, and I think it is a good practice for records like CFF to include when software was used, not just released).

  4. URI suggestion: This one is not needed for CiteSoft, it's just a suggestion. Currently you have top level fields of "doi" and "url", and these are just subsets of URIs. If you make a URI top level field, then people would have more flexibility in the future. They could even put "uri: ISBN: 123555" . This way you don't have to predict what types of identifiers will become dominant in the future. [other than assuming they will be URI compatible strings...]

  5. Encoding suggestion (doesn't have to be decided immediately, and maybe should be moved to a separate issue thread): I think it'd be a good idea to specify UTF-8 with BOM at the beginning of the file (or without?). The BOM seems to be where I had run into trouble with CiteSoft. Some software understood the BOM, and some did not. I assume that in most languages it is easier to write to file without the BOM, but evidently Notepad and other softwares will have a hard time without the BOM: (example where notepad displayed wrong without BOM provided) . I'm not very knowledgeable about this issue, so I don't know what the right decision is. It might also be worth explaining in the guide for windows users that MS Word can be used to convert files to UTF-8 and LF only (this means "\n" without "\r" ). Many software readers and writers will be tolerant of either option ("\r" and BOM), but it will be good to have a specification so that people making CFF compatible software know what is expected / best practice with CFF.
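
Whatever the specification ends up saying, a reader can be made tolerant of both variants. The sketch below uses Python's standard `utf-8-sig` codec, which strips a leading BOM when one is present and otherwise decodes as plain UTF-8. The function names are hypothetical, and writing without a BOM and with LF line endings is just one possible convention, not something decided in this thread.

```python
import codecs
import tempfile
from pathlib import Path


def read_cff_text(path):
    # "utf-8-sig" drops a leading BOM if present; files without one read normally.
    # The default newline handling translates "\r\n" and "\r" to "\n" on read.
    with open(path, "r", encoding="utf-8-sig") as handle:
        return handle.read()


def write_cff_text(path, text):
    # Plain "utf-8" writes no BOM; newline="\n" stops Windows from writing "\r\n".
    with open(path, "w", encoding="utf-8", newline="\n") as handle:
        handle.write(text)


with tempfile.TemporaryDirectory() as tmp:
    # Simulate a file saved on Windows with a BOM and CRLF line endings.
    windows_style = Path(tmp) / "with_bom.cff"
    windows_style.write_bytes(codecs.BOM_UTF8 + b"title: Example\r\n")
    text_from_windows = read_cff_text(windows_style)

    # Write and re-read a file using the BOM-free, LF-only convention.
    unix_style = Path(tmp) / "no_bom.cff"
    write_cff_text(unix_style, "title: Example\n")
    text_from_unix = read_cff_text(unix_style)
    raw_unix_bytes = unix_style.read_bytes()
```

Both files decode to the same text, so entries created on Windows and on Linux can be appended to the same log without the parser choking on a stray BOM.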

AdityaSavara, Sep 21 '21 00:09