python-bibtexparser BibTeX-encoded string (containing escape characters) to plain text

Open philonor opened this issue 7 years ago • 2 comments

I'm trying to match journal and conference names contained in BibTeX entries against a list of known conferences and journals. So far I'm impressed by how well bibtexparser loads all the BibTeX entries :+1:. Thanks for the great work!

Problem: The list of known conferences and journals contains names as plain text. Some BibTeX entries contain special escape sequences (1, 2) in journal or conference names, see below. These severely degrade the text matching results. Is there a way to replace all escape sequences with the corresponding plain-text characters?

Computers \& Graphics
Profesional de la Informaci{\'o}n, El: Information World en Espa{\~n}ol
Revista espa{\~n}ola de documentaci{\'o}n cient{\'\i}fica
Available at {\O}{\O}{\^O}{\guillemotright}{\guillemotright} {\^U}{\^U}{\^U}{\textordmasculine}\$\times\$ {\textordmasculine}{\`U} {\DH}{\textordmasculine} {\`U}{\guillemotright} {\"O}\$\times\$ {\"O}{\guillemotright}\$\times\$ {\^O} {\"O}\$\times\${\guillemotright} {\~N} {\O}{\"O}
KI-K{\"u}nstliche Intelligenz
8th International Prot{\'e}g{\'e} Conference Proceedings, Madrid, July

Pseudo solution code:

import bibtexparser

# Load a single article. The raw string keeps the LaTeX backslashes intact;
# in a normal string Python would silently turn \' into '.
bibtex_str = r"""@article{herrero2014universidades,
  title={Universidades y Google News: visibilidad internacional a trav{\'e}s de los medios de comunicaci{\'o}n online},
  author={Herrero-Solana, V{\'\i}ctor and Arboledas, Luis and Leger{\'e}n-{\'A}lvarez, Elisa},
  journal={Revista espa{\~n}ola de documentaci{\'o}n cient{\'\i}fica},
  volume={37},
  number={3},
  pages={052},
  year={2014}
}
"""
entry = bibtexparser.loads(bibtex_str).entries[0]

# What I get now, and what I would like (bibtexparser.normalize is a made-up name):
$ entry['title']
Universidades y Google News: visibilidad internacional a trav{\'e}s de los medios de comunicaci{\'o}n online
$ bibtexparser.normalize(entry['title'])
Universidades y Google News: visibilidad internacional a través de los medios de comunicación online

Apr 21 '17 11:04 philonor

Hi @philonor, did you have a look at customizations? I believe you could use the convert_to_unicode function.

Does it answer your issue?

Apr 21 '17 23:04 omangin

Note that the latest version released on pip has some known problems with convert_to_unicode (have a look at the open issues). If they affect you, they are already fixed in master.
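
If neither the released version nor master is an option, a rough stdlib-only fallback could cover the most common cases. This is a sketch, not part of bibtexparser: `latex_to_plain`, `ACCENTS`, `ACCENT_RE` and `SYMBOL_RE` are hypothetical names, and only the five accent commands and a few escaped symbols are handled. Commands such as `\O` or `\guillemotright` are left untouched.

```python
import re
import unicodedata

# Unicode combining marks for common LaTeX accent commands.
ACCENTS = {
    "'": "\u0301",  # acute
    "`": "\u0300",  # grave
    "^": "\u0302",  # circumflex
    '"': "\u0308",  # diaeresis
    "~": "\u0303",  # tilde
}

# Matches \'e, \'{e} and \'\i (dotless i/j), plus one trailing brace if present.
ACCENT_RE = re.compile(r"""\\(['`^"~])\{?(\\[ij]|[A-Za-z])\}?""")
# Matches escaped symbols such as \& or \%.
SYMBOL_RE = re.compile(r"\\([&%$#_])")


def _accent(match):
    mark, base = match.group(1), match.group(2)
    # \i and \j are dotless variants; the accent goes on the plain letter.
    base = {"\\i": "i", "\\j": "j"}.get(base, base)
    # NFC composes base letter + combining mark into a single character.
    return unicodedata.normalize("NFC", base + ACCENTS[mark])


def latex_to_plain(text):
    """Replace common LaTeX accent and symbol escapes, then drop braces."""
    text = ACCENT_RE.sub(_accent, text)
    text = SYMBOL_RE.sub(r"\1", text)
    return text.replace("{", "").replace("}", "")


print(latex_to_plain(r"Revista espa{\~n}ola de documentaci{\'o}n cient{\'\i}fica"))
# Revista española de documentación científica
```

Stripping all remaining braces is a deliberate simplification for matching venue names. It would be wrong for fields where braces carry meaning, such as protected capitalization in titles.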

Apr 22 '17 13:04 Phyks
