citar
Files referenced in the file field with ':' in their name and/or multiple files are parsed incorrectly
Describe the bug
An ebib entry referencing two files, named e.g. a:a.pdf and b.pdf, is parsed incorrectly by the two citar functions:
- citar-file-parser-default
- citar-file-parser-triplet
There are 2 problems:
- a:a.pdf is interpreted exclusively as 'Calibre / Mendeley' format; however, this is not the case here
- the file separator ; in the ebib entry has a trailing space, which is not handled by the split-string function
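A minimal sketch of both symptoms, using only the built-in split-string and the field value from the reproduction entry. It does not call citar's parsers, so treat it as an illustration of the string handling rather than of citar's actual code path.

```elisp
;; Illustration only: plain `split-string' calls, not citar's parser code.

;; Splitting on ";" alone leaves the space in front of the second name.
(split-string "a:a.pdf; b.pdf" ";")
;; => ("a:a.pdf" " b.pdf")

;; Passing a TRIM regexp (fourth argument) strips the surrounding whitespace.
(split-string "a:a.pdf; b.pdf" ";" t "[ \t\n]+")
;; => ("a:a.pdf" "b.pdf")

;; Treating ":" as a field separator breaks the first file name apart.
(split-string "a:a.pdf" ":")
;; => ("a" "a.pdf")
```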
To Reproduce
- create an ebib entry as follows:
@Article{atakishiyev21:explain,
  file    = {a:a.pdf; b.pdf},
  author  = {Atakishiyev, Shahin and Salameh, Mohammad and Yao, Hengshuai and Goebel, Randy},
  journal = {arXiv e-prints},
  title   = {{E}xplainable artificial intelligence for autonomous driving: {A}n overview and guide for future research directions},
}
- create the two files in the corresponding papers folder
- insert citation in some org doc.
- open/follow the link
- the two files are not shown for selection
Expected behavior
The two files should be shown / presented for selection
Emacs version:
"GNU Emacs 28.1 (build 1, x86_64-pc-linux-gnu, GTK+ Version 3.24.33, cairo version 1.17.6) of 2022-04-28"
I'll take a closer look later, but I think this is related to #578/#579.
cc @roryk
The following quick-hack fix works for my use case:
(defun hgi/citar-file-parser-arxiv (dirs file-field)
  "Return a list of files from DIRS and FILE-FIELD.
Works for
- ebib entries with multiple file entries;
- file names containing ':'
This is a 'quick-hack-fix' for citar bug:
https://github.com/bdarcus/citar/issues/599"
  ;; Split only on ";" and trim a space around each file name.
  (let ((files (split-string file-field "[;]" 'omit-nulls " ")))
    (delete-dups
     (seq-mapcat
      (lambda (dir)
        (mapcar
         (lambda (file)
           (expand-file-name file dir))
         files))
      dirs))))
Activated with: (setq citar-file-parser-functions '(hgi/citar-file-parser-arxiv))
What's the significance of arXiv with respect to the colon?
I wonder also if my suggestion to @roryk to not split the parsers was a mistake: https://github.com/bdarcus/citar/issues/578#issuecomment-1107474787.
E.g. in your case, if it only split on the semi-colon, you would have never run into the problem.
Regarding the colon's significance
The elpa package arxiv-citation (which I'm using) downloads papers from arXiv.
It creates a local PDF whose file name corresponds to the paper's title.
Now, if the title contains a colon, I run into said problem.
E.g. https://arxiv.org/abs/2112.11561 creates the following filename: atakishiyev-salameh-yao-goebel_explainable-artificial-intelligence-for-autonomous-driving:-an-overview--and-guide-for-future-research-directions.pdf
Regarding your suggestion to not split the parsers (https://github.com/bdarcus/citar/issues/578#issuecomment-1107474787)
One argument for not splitting might be that traversing all the parsers could get expensive when there are many paper folders to search through: the cost is on the order of p_parsers * f_folders.
From a software-engineering perspective (decoupling), it would perhaps be better to have one parser per job.
> Now, if the title contains a colon, I run into said problem.
I wonder if it's worth a bug report to that package? Seems to me arxiv-citation-pdf-name should split the title on the colon (or on a question mark etc.), and only use the main title.
Hmm, as far as I understand, in this specific example the parts before and after the colon together form the actual title of the paper. So I would assume that building the filename from both parts, colon included, makes sense?
On the other hand, if a title also included e.g. a semicolon, then it would definitely be problematic, since bib-latex also uses the semicolon to separate multiple files in the file field.
Perhaps replacing all those special characters in a title would be the way to go?
> in this specific example, the part before and after the colon form the actual title of the paper.
No; the colon delimits title and subtitle. The title is just "Explainable artificial intelligence for autonomous driving".
Since that function already includes the author names to help disambiguate, I see no point in including the subtitle.
> Perhaps replacing all those special characters in a title would be the way to go?
That's another option, but the file name in this example still ends up ridiculously and unnecessarily long.
In any case, the colon in the file name itself is arguably the bug.
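For what it's worth, the two options discussed here (keeping only the main title, or replacing special characters) could be sketched roughly as follows. The function names are made up for illustration and are not part of arxiv-citation or citar; arxiv-citation's real naming logic may well differ.

```elisp
(require 'subr-x) ; provides `string-trim' on older Emacs versions

;; Hypothetical helpers for illustration; not part of arxiv-citation or citar.
(defun my/main-title (title)
  "Return the part of TITLE before the first colon, trimmed."
  (string-trim (car (split-string title ":"))))

(defun my/title-to-file-stem (title)
  "Downcase TITLE and replace each run of non-alphanumeric characters with \"-\"."
  (string-trim
   (replace-regexp-in-string "[^[:alnum:]]+" "-" (downcase title))
   "-+" "-+"))

(my/title-to-file-stem
 (my/main-title "Explainable artificial intelligence for autonomous driving: An overview"))
;; => "explainable-artificial-intelligence-for-autonomous-driving"
```

Either helper alone keeps ':' and ';' out of the file name; combining them also shortens it.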
Sorry to be obstinate about this one, but why do you think that the string past the colon is the subtitle?
In this specific case, arXiv's 'Export Bibtex Citation' yields:
@misc{https://doi.org/10.48550/arxiv.2112.11561,
  doi = {10.48550/ARXIV.2112.11561},
  url = {https://arxiv.org/abs/2112.11561},
  author = {Atakishiyev, Shahin and Salameh, Mohammad and Yao, Hengshuai and Goebel, Randy},
  keywords = {Artificial Intelligence (cs.AI), Computers and Society (cs.CY), FOS: Computer and information sciences, FOS: Computer and information sciences},
  title = {Explainable artificial intelligence for autonomous driving: An overview and guide for future research directions},
  publisher = {arXiv},
  year = {2021},
  copyright = {Creative Commons Attribution 4.0 International}
}
(no subtitle nor titleaddon tags here)
However, I've now also opened an issue over at arxiv-citation.
> Sorry to be obstinate about this one, but why do you think that the string past the colon is the subtitle?
It's a convention so widely understood, at least in English-language scholarship, that I didn't think I needed to explain ;-)
https://style.mla.org/punctuation-with-titles/#:~:text=Titles%20and%20Subtitles&text=1%20of%20the%20eighth%20edition,of%20the%20title%20or%20subtitle.%E2%80%9D
> no subtitle nor titleaddon tags here
There's lots of non-ideal bibliographic data.
To be clear, though, that string represents the full title, which is main title + subtitle.
Also related to #454
> It's a convention so widely understood, at least in English-language scholarship, that I didn't think I needed to explain ;-)
I respectfully disagree: at least in English-language medical and biochemistry literature (that I'm familiar with) the use of a colon in a title (where journal articles don't really have a concept of a subtitle) is terribly common. So a convention that might work well for books or other fields doesn't work everywhere. In my BibTeX file of over 5k references, over 1k have a title with a colon in it.
Thankfully my reference manager (Paperpile) didn't include any of them in the filename.
Should we close this, or is there some change we should make?