python-bibtexparser
Failure converting strings to Unicode
Suppose you have this BibTeX file (call it fail.bib):
@article{a,
author = {One Two and Three{\'\i}abc-Four{\'\i}def},
}
The example program:
#! /usr/bin/python3
#
import bibtexparser
from bibtexparser.bparser import BibTexParser
from bibtexparser.customization import convert_to_unicode
# Parse fail.bib, converting LaTeX escapes to Unicode characters
with open("fail.bib") as bf:
    bib_database = BibTexParser(
        common_strings=True,
        customization=convert_to_unicode,
    ).parse_file(bf)
print(bib_database.entries)
produces the following output:
[{'author': 'One Two and Threeı́abc-Four\\d́ef', 'ENTRYTYPE': 'article', 'ID': 'a'}]
which is evidently wrong. It seems that \i is converted to ı (dotless i) too early, and the remaining \'ı is then mishandled by the later substitutions.
I am not sure what the solution could be, because I do not follow the code very well; it is too complex for my skill level, I fear.
Adding the combined pattern and sorting the substitution list, so that the longest match is substituted first, seems to work, more or less:
#! /usr/bin/python3
#
import bibtexparser
from bibtexparser.bparser import BibTexParser
from bibtexparser.customization import convert_to_unicode
# Register the combined pattern {\'\i} so that it maps directly to í
bibtexparser.latexenc.unicode_to_crappy_latex1 = (
    ('í', r"{\'\i}"), *bibtexparser.latexenc.unicode_to_crappy_latex1
)
# Try the longest LaTeX patterns first, so {\'\i} is matched before \i
bibtexparser.latexenc.unicode_to_crappy_latex1 = sorted(
    bibtexparser.latexenc.unicode_to_crappy_latex1,
    key=lambda x: len(x[1]),
    reverse=True,
)
# Parse fail.bib, converting LaTeX escapes to Unicode characters
with open("fail.bib") as bf:
    bib_database = BibTexParser(
        common_strings=True,
        customization=convert_to_unicode,
    ).parse_file(bf)
print(bib_database.entries)
which outputs
[{'author': 'One Two and Threeíabc-Fourídef', 'ENTRYTYPE': 'article', 'ID': 'a'}]
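The ordering effect can be reproduced in isolation, without bibtexparser. The following is only a toy sketch: the two-entry table and the helper names (`TABLE`, `latex_to_text`, `longest_first`) are made up for illustration and are not the library's actual code, which may substitute patterns differently. It assumes a plain sequence of string replacements and shows why trying the longest LaTeX pattern first avoids the problem.

```python
# Toy substitution table of (unicode, latex) pairs, deliberately in a bad
# order: the short pattern \i comes before the longer {\'\i}.
# Hypothetical data, not taken from bibtexparser.latexenc.
TABLE = [
    ("\u0131", r"\i"),      # dotless i
    ("\u00ed", r"{\'\i}"),  # i with acute accent
]


def latex_to_text(text, table):
    """Apply each (unicode, latex) replacement in table order."""
    for uni, latex in table:
        text = text.replace(latex, uni)
    return text


def longest_first(table):
    """Return the table sorted so that longer LaTeX patterns are tried first."""
    return sorted(table, key=lambda pair: len(pair[1]), reverse=True)


source = r"Three{\'\i}abc-Four{\'\i}def"

# Bad order: \i is replaced first, so {\'\i} never matches afterwards
# and a stray {\'...} is left around the dotless i.
naive = latex_to_text(source, TABLE)

# Longest pattern first: {\'\i} is consumed as a whole.
fixed = latex_to_text(source, longest_first(TABLE))

print(naive)
print(fixed)
```

In this sketch the naive order leaves the accent command behind, wrapped around a dotless i, while the sorted table yields the expected í in both names.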
Waiting for #264 before addressing this (might be fixed along the way)