Allow custom 'citation', 'name', and other fields for a SearchIndex's Docs in manifest.csv

Open plbremer opened this issue 8 months ago • 0 comments

Hello,

I am hoping that someone can help me build an index in which I can specify attributes such as a Doc's name, citation, etc.

I can naively define a manifest.csv with something like

file_location,doi,docname,title,citation
../resources/benchling_pdfs_small/EXP22000050.pdf,,EXP22000050.pdf,my_title,(`EXP22000050 ~ my_title` by ['my_author'])
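
For reference, here is a minimal sketch of generating that manifest with Python's standard csv module, so that a citation containing commas or double quotes would still land in a single column. The `render_manifest` helper and the `FIELDS` list are hypothetical names for illustration. This does not call paper-qa at all; it only confirms that the file itself round-trips and is not the source of the problem.

```python
import csv
import io

# Column names copied from the manifest header shown above.
FIELDS = ["file_location", "doi", "docname", "title", "citation"]

rows = [
    {
        "file_location": "../resources/benchling_pdfs_small/EXP22000050.pdf",
        "doi": "",  # internal document, so no DOI
        "docname": "EXP22000050.pdf",
        "title": "my_title",
        "citation": "(`EXP22000050 ~ my_title` by ['my_author'])",
    },
]


def render_manifest(rows):
    """Hypothetical helper: serialize the rows to CSV text with a header row."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)  # csv quotes any field that contains a comma or '"'
    return buf.getvalue()


manifest_text = render_manifest(rows)

# Read the text back to confirm each value stays in its own column.
parsed = list(csv.DictReader(io.StringIO(manifest_text)))
```

Writing `manifest_text` to manifest.csv reproduces the two lines above, and `parsed[0]["citation"]` comes back as exactly the string I supplied.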

Unfortunately, the Doc that is used in the SearchIndex looks something like:

Doc(
    docname='my_author2200',
    dockey='b55bd364d1332cc88ec5f8a87b59f495',
    citation='my_author. *my_title*. lots_of_other_text_that_is_not_quite_appropriate',
    fields_to_overwrite_from_metadata={'key', 'doc_id', 'citation', 'dockey', 'docname'},
    ...
)

What does not work:

  1. Setting the parsing parameter use_doc_details to False. This prevents the creation of a DocDetails, which is good because I then avoid Crossref calls. However, my problem occurs earlier, during the creation of the Doc.
  2. Adding a "fields_to_overwrite_from_metadata" column to the manifest file. Again, no DocDetails is created, so the column is not used.

What might work but I want to avoid:

  1. Providing a custom citation_prompt in the parsing settings. I still want to be able to specify other things like the docname, and ultimately this is a roundabout and less precise way of doing what I could do by simply providing the value that I want.
  2. Post-hoc modification of the Docs in the SearchIndex. That is, after I build the search index, modifying its values and then saving it again. This seems like a dangerous and wonky approach.

In the short term, and somewhat superficially, what I want is for outputs to have in-line citations that don't look surprising. Instead of "my_author obtained a XYZ cell line from (my_author pages 1-4)", I want the in-line citation text that I specify. Future work will probably involve more programmatic access to the various attributes that I hope to set in a custom manner.

Other ideas:

  • Sometimes I noticed that the LLM-generated citations were completely wrong (i.e., they included journals, etc., despite these being internal documents).
  • I am also wondering how problematic all of this is in light of paperqa's expectations about traversable citations, etc.

plbremer, Mar 27 '25 19:03