python-bibtexparser

Failing in converting strings to unicode

Open Rmano opened this issue 5 years ago • 2 comments

Suppose you have this bibtex file (call it fail.bib):

@article{a,
  author = {One Two and Three{\'\i}abc-Four{\'\i}def},
}

The example program:

#! /usr/bin/python3
#
from bibtexparser.bparser import BibTexParser
from bibtexparser.customization import convert_to_unicode

# Parse fail.bib, converting LaTeX escapes to unicode in every entry.
parser = BibTexParser(common_strings=True, customization=convert_to_unicode)
with open("fail.bib") as bf:
    bib_database = parser.parse_file(bf)
print(bib_database.entries)

produces the following output:

[{'author': 'One Two and Threeı́abc-Four\\d́ef', 'ENTRYTYPE': 'article', 'ID': 'a'}]

which is evidently wrong. It seems that \i is converted to ı (dotless i) too early, and then \'ı creates havoc.

I am not sure what the solution could be, because I do not follow the code very well; it is too complex for my skill level, I fear.

Rmano avatar Nov 11 '20 18:11 Rmano

It seems that adding the missing pattern and sorting the substitution list (so that the longest match is substituted first) sort of works:

#! /usr/bin/python3
#
import bibtexparser
from bibtexparser.bparser import BibTexParser
from bibtexparser.customization import convert_to_unicode

# Add the missing pattern, then sort so the longest LaTeX sequence is substituted first.
bibtexparser.latexenc.unicode_to_crappy_latex1 = sorted(
    (('í', r"{\'\i}"), *bibtexparser.latexenc.unicode_to_crappy_latex1),
    key=lambda x: len(x[1]), reverse=True)
parser = BibTexParser(common_strings=True, customization=convert_to_unicode)
with open("fail.bib") as bf:
    bib_database = parser.parse_file(bf)
print(bib_database.entries)

which outputs

[{'author': 'One Two and Threeíabc-Fourídef', 'ENTRYTYPE': 'article', 'ID': 'a'}]
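To see why the ordering matters, here is a minimal sketch that does not use bibtexparser at all. The two-entry `TABLE` and the `latex_to_unicode` helper are made up for illustration and are much simpler than the library's real tables and logic; they only mimic the idea of applying substitutions one after another.

```python
# Hypothetical, simplified substitution table of (unicode, latex) pairs.
# This is NOT bibtexparser's real table, just an illustration.
TABLE = [
    ("ı", r"\i"),        # dotless i
    ("í", r"{\'\i}"),    # i with acute accent
]


def latex_to_unicode(text, table):
    """Apply every (unicode, latex) substitution in table order."""
    for uni, latex in table:
        text = text.replace(latex, uni)
    return text


source = r"Three{\'\i}abc-Four{\'\i}def"

# Short pattern first: \i is consumed before {\'\i} gets a chance to match.
naive = latex_to_unicode(source, TABLE)

# Longest pattern first: {\'\i} is replaced as a whole.
by_length = sorted(TABLE, key=lambda pair: len(pair[1]), reverse=True)
longest_first = latex_to_unicode(source, by_length)

print(naive)
print(longest_first)
```

With the short pattern first, the accent command is left behind around a dotless i; with the sorted table, both names come out with a plain í, as in the output above.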

Rmano avatar Nov 11 '20 18:11 Rmano

Waiting for #264 before addressing this (might be fixed along the way)

MiWeiss avatar Jul 10 '22 13:07 MiWeiss