citeproc-py
improve bibtex error handling and some missing macros/types
This parsing probably needs a comprehensive review for better error handling throughout. In fact, could we just use an existing, more robust BibTeX parsing library, like bibtexparser? Still, this change has enough fixes that I could actually load my .bib file once I converted it to ASCII.
Thanks! Can you create a new issue for this and attach your BibTeX database, or a database with just the problematic entries? I'd like to output a warning when discarding a BibTeX field at the very least.
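A minimal sketch of what warning on a discarded field could look like, using the standard `warnings` module. The `SUPPORTED_FIELDS` set and the `filter_fields` helper are hypothetical names for illustration, not part of citeproc-py.

```python
import warnings

# Hypothetical set of fields a converter knows how to map;
# this is NOT citeproc-py's actual list.
SUPPORTED_FIELDS = {"author", "title", "year", "pages", "journal"}


def filter_fields(entry_key, fields):
    """Keep supported fields and warn about every field that is dropped."""
    kept = {}
    for name, value in fields.items():
        if name.lower() in SUPPORTED_FIELDS:
            kept[name.lower()] = value
        else:
            # Tell the user instead of discarding the field silently.
            warnings.warn(
                f"Discarding unsupported field '{name}' in entry '{entry_key}'"
            )
    return kept
```

With this shape, the caller still gets a usable entry, and the user can see which fields were ignored.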
For dates specifically, it might be interesting to add support for raw dates like citeproc-js. I suppose it simply dumps the raw date string (as included in the entry) where the date parts usually go in the citation or bibliography entry.
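A rough sketch of that fallback, assuming the date ends up in a dict shaped like a CSL-JSON date (`date-parts` for parsed values, `raw` for the unparsed string). The `parse_year` helper is a hypothetical name and does not reflect how citeproc-py or citeproc-js implement this.

```python
import re


def parse_year(value):
    """Return a CSL-JSON-like date dict, falling back to the raw string."""
    text = value.strip()
    if re.fullmatch(r"\d{1,4}", text):
        # A plain numeric year can be stored as structured date parts.
        return {"date-parts": [[int(text)]]}
    # Anything else is kept verbatim instead of raising an exception.
    return {"raw": text}
```

For example, `parse_year("2014")` yields structured date parts, while `parse_year("in press")` keeps the string as a raw date.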
While implementing the changes to the BibTeX parser I did look at using an existing Python BibTeX parser library. I decided not to use bibtexparser since it seems to handle accented characters by searching for all possible ways to use the accent macros instead of expanding the macros as LaTeX would. The parser included with citeproc-py even expands (simple) custom macros included in the preamble (see xampl.bib). Additionally, bibtexparser does not properly split names into first, von, last and jr parts.
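To illustrate the name splitting mentioned above, here is a simplified sketch of BibTeX's three name forms ("First von Last", "von Last, First" and "von Last, Jr, First"). It assumes names without brace groups or special characters, which real BibTeX treats specially, and it is not the code used by citeproc-py or Pybtex. The function names are made up for this example.

```python
def _split_von_last(tokens):
    """Split tokens into (von, last); von ends at the last lowercase token."""
    cut = 0
    # The final token always belongs to the last name, so skip it here.
    for i, token in enumerate(tokens[:-1]):
        if token[0].islower():
            cut = i + 1
    return tokens[:cut], tokens[cut:]


def split_name(name):
    """Split a BibTeX name into first, von, last and jr parts (simplified)."""
    parts = [part.split() for part in name.split(",")]
    if len(parts) == 1:
        # "First von Last": first runs up to the first lowercase token.
        tokens = parts[0]
        i = 0
        while i < len(tokens) - 1 and not tokens[i][0].islower():
            i += 1
        first, rest, jr = tokens[:i], tokens[i:], []
    elif len(parts) == 2:
        # "von Last, First"
        rest, first, jr = parts[0], parts[1], []
    elif len(parts) == 3:
        # "von Last, Jr, First"
        rest, jr, first = parts
    else:
        raise ValueError(f"too many commas in name: {name!r}")
    von, last = _split_von_last(rest)
    return {"first": first, "von": von, "last": last, "jr": jr}
```

Both `split_name("Ludwig van Beethoven")` and `split_name("van Beethoven, Ludwig")` give the same parts, which is the property a parser needs for consistent sorting and formatting.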
I think Pybtex looks the most promising. While it does more than just parsing BibTeX databases (it is basically a reimplementation of BibTeX), its only external dependency is PyYAML, so that shouldn't be too much of a problem. Pybtex properly splits the names into parts, but doesn't handle accented characters or macros (just like BibTeX). latexenc could perhaps handle the accented characters, but macro handling would still be missing.
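For the accented characters, a small sketch of converting a handful of LaTeX accent macros to Unicode with only the standard library. This is a regex substitution over five accents, not real macro expansion and not latexenc's API. The `expand_accents` name and the `COMBINING` table are assumptions made for this example.

```python
import re
import unicodedata

# LaTeX accent macro character -> Unicode combining mark (small subset).
COMBINING = {
    "'": "\u0301",  # acute
    "`": "\u0300",  # grave
    "^": "\u0302",  # circumflex
    '"': "\u0308",  # diaeresis
    "~": "\u0303",  # tilde
}

# Matches \'e as well as \'{e}; only single ASCII letters are handled.
ACCENT_RE = re.compile(r"""\\(['`^"~])(?:\{([A-Za-z])\}|([A-Za-z]))""")


def expand_accents(text):
    """Replace simple accent macros with precomposed Unicode characters."""
    def replace(match):
        letter = match.group(2) or match.group(3)
        combined = letter + COMBINING[match.group(1)]
        # NFC composes letter + combining mark into a single code point.
        return unicodedata.normalize("NFC", combined)

    return ACCENT_RE.sub(replace, text)
```

Braces that surround the whole macro, as in `{\'e}`, are left in place here; stripping them would be a separate step.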
That said, I don't think the citeproc-py BibTeX parser is lacking much functionality after the recent changes. bibtexparser probably handles more symbol macros (math mode macros), but these can easily be added.
What do you mean when you say "... once I converted it to ASCII"?
Issue #20 describes the ASCII problem. Pull requests #22 and #23 create test cases for the crashing failures on date and page number parsing (and include minimal .bib files to trigger the failures). This doesn't cover the missing macros/types, but hopefully it's enough to get you started.
Issue #20 and pull requests #22 and #23 have been taken care of. I added mappings for thesis and report in 6df260d. I did not add the \backslash macro, as it is only supported in math mode. Math mode is not yet handled; its content is simply passed through as is, so it shouldn't raise an exception. Is that acceptable?