paper-qa
Allow custom 'citation', 'name', and other fields for a SearchIndex's Docs in manifest.csv
Hello,
I am hoping that someone can help me build an index in which I can specify attributes such as a Doc's name, citation, etc.
I can naively define a manifest.csv with something like:
file_location,doi,docname,title,citation
../resources/benchling_pdfs_small/EXP22000050.pdf,,EXP22000050.pdf,my_title,(`EXP22000050 ~ my_title` by ['my_author'])
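For reference, a manifest like this can be generated with Python's standard csv module rather than by hand, so that a citation containing commas or quotes does not break the column layout. This is only a sketch: the column names are the ones from my header, and write_manifest and read_manifest are hypothetical helper names, not part of paper-qa.

```python
import csv

# Column names copied from the manifest header; the helpers are illustrative only.
FIELDS = ["file_location", "doi", "docname", "title", "citation"]


def write_manifest(path, rows):
    """Write manifest rows; csv quotes any value containing a comma or a quote."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)


def read_manifest(path):
    """Read the manifest back as a list of dicts keyed by column name."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))
```

Reading the file back with read_manifest is a quick way to confirm that each citation survives as a single field.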
Unfortunately, the Doc that is used in the SearchIndex looks something like:
Doc(
docname='my_author2200',
dockey='b55bd364d1332cc88ec5f8a87b59f495',
citation='my_author. *my_title*. lots_of_other_text_that_is_not_quite_appropriate',
fields_to_overwrite_from_metadata={'key', 'doc_id', 'citation', 'dockey', 'docname'},
...
)
What does not work:
- Setting the parsing parameter `use_doc_details` to `False`. This prevents the creation of a DocDetails, which is good because it avoids Crossref calls; however, my problem occurs earlier, during the creation of the Doc.
- Adding a "fields_to_overwrite_from_metadata" column to the manifest file. Again, no DocDetails is created, so this column is not used.
What might work but I want to avoid:
- Providing a custom `citation_prompt` in the parsing settings. I still want to be able to specify other things like the docname, and, ultimately, this is just a roundabout and less precise way of simply providing the value that I want.
- Post-hoc modification of the Docs in the SearchIndex. That is, after I build the search index, modifying its values and then saving it again. This just seems like a dangerous and wonky approach.
Ultimately, superficially, and in the short term, what I am interested in is making it so that outputs have in-line citations that don't look surprising. Instead of "my_author obtained a XYZ cell line from (my_author pages 1-4)", I want the in-line citation to use the values that I provide.
Future work will probably involve more programmatic access to the various attributes that I hope to set in a custom manner.
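To make the request concrete, here is a minimal sketch of the behaviour I am after: any non-empty manifest value takes precedence over whatever was generated for the Doc. Nothing here is paper-qa API. OVERRIDABLE, load_overrides, and apply_overrides are hypothetical names, and a plain dict stands in for the Doc's fields.

```python
import csv

# Manifest columns that should win over generated values (assumed names,
# taken from the manifest header shown earlier).
OVERRIDABLE = ("docname", "title", "citation", "doi")


def load_overrides(manifest_path):
    """Map each file_location to its non-empty manifest values."""
    overrides = {}
    with open(manifest_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            overrides[row["file_location"]] = {
                key: row[key] for key in OVERRIDABLE if row.get(key)
            }
    return overrides


def apply_overrides(doc_fields, manifest_values):
    """Return a copy of doc_fields in which manifest values take precedence."""
    return {**doc_fields, **manifest_values}
```

Empty cells (such as the blank doi in my example) are skipped, so they would not erase a value that was inferred elsewhere.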
Other ideas:
- Sometimes I noticed that the LLM-generated citations were completely wrong (i.e., they included journals, etc., despite these being internal documents).
- I am also slightly wondering how problematic all of this is in light of paperqa's expectations about traversable citations, etc.