python-bibtexparser
BibTeX-encoded string (containing escape characters) to plain text
I'm trying to match journal and conference names contained in BibTeX entries against a list of known conferences and journals. So far I'm impressed by how well bibtexparser loads all the BibTeX entries :+1:, thanks for the great work!
Problem: The list of known conferences and journals contains names as plain text. Some BibTeX entries contain special escape sequences (1, 2) in journal or conference names, see below. These significantly degrade the text matching results. Is there any way to replace all escape sequences with the corresponding plain-text characters?
Computers \& Graphics
Profesional de la Informaci{\'o}n, El: Information World en Espa{\~n}ol
Revista espa{\~n}ola de documentaci{\'o}n cient{\'\i}fica
Available at {\O}{\O}{\^O}{\guillemotright}{\guillemotright} {\^U}{\^U}{\^U}{\textordmasculine}\$\times\$ {\textordmasculine}{\`U} {\DH}{\textordmasculine} {\`U}{\guillemotright} {\"O}\$\times\$ {\"O}{\guillemotright}\$\times\$ {\^O} {\"O}\$\times\${\guillemotright} {\~N} {\O}{\"O}
KI-K{\"u}nstliche Intelligenz
8th International Prot{\'e}g{\'e} Conference Proceedings, Madrid, July
Pseudo solution code:
import bibtexparser

# Load a single article (raw string, so Python keeps the LaTeX backslashes)
bibtex_str = r"""@article{herrero2014universidades,
title={Universidades y Google News: visibilidad internacional a trav{\'e}s de los medios de comunicaci{\'o}n online},
author={Herrero-Solana, V{\'\i}ctor and Arboledas, Luis and Leger{\'e}n-{\'A}lvarez, Elisa},
journal={Revista espa{\~n}ola de documentaci{\'o}n cient{\'\i}fica},
volume={37},
number={3},
pages={052},
year={2014}
}
"""
entry = bibtexparser.loads(bibtex_str).entries[0]

# Current result: the escape sequences are still there
$ entry['title']
Universidades y Google News: visibilidad internacional a trav{\'e}s de los medios de comunicaci{\'o}n online

# Desired result (normalize() is made up for illustration; it does not exist in bibtexparser)
$ bibtexparser.normalize(entry['title'])
Universidades y Google News: visibilidad internacional a través de los medios de comunicación online
Hi @philonor, have you looked at the customizations? I believe you could use the convert_to_unicode function.
Does that answer your question?
Note that the latest version released on pip has some problems with convert_to_unicode (have a look at the issues). If they affect you, they have already been fixed in master.
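If neither the pip release nor master is an option, a small stand-alone decoder may be enough for simple cases. The sketch below is not part of bibtexparser: `latex_to_text`, `ACCENTS`, and `ACCENT_RE` are hypothetical names, and it only handles a handful of accent commands and escaped symbols such as `\&`. Commands like `\O` or `\guillemotright` are left untouched.

```python
import re
import unicodedata

# Combining marks for a few common LaTeX accent commands (not exhaustive).
ACCENTS = {
    "'": "\u0301",  # acute
    "`": "\u0300",  # grave
    "^": "\u0302",  # circumflex
    '"': "\u0308",  # diaeresis
    "~": "\u0303",  # tilde
}

# Matches \'e, \'{e} and \'\i (dotless i/j), plus an optional closing brace.
ACCENT_RE = re.compile(r"""\\(['`^"~])\{?(\\[ij]|[A-Za-z])\}?""")


def latex_to_text(value):
    """Replace simple LaTeX accent escapes with Unicode characters."""

    def repl(match):
        accent, base = match.group(1), match.group(2)
        # \i and \j are the dotless variants; use the plain letters.
        base = {"\\i": "i", "\\j": "j"}.get(base, base)
        return unicodedata.normalize("NFC", base + ACCENTS[accent])

    value = ACCENT_RE.sub(repl, value)
    # Unescape symbols such as \& and \$.
    value = re.sub(r"\\([&%$#_])", r"\1", value)
    # Drop the grouping braces that are left over.
    return value.replace("{", "").replace("}", "")


print(latex_to_text(r"Revista espa{\~n}ola de documentaci{\'o}n cient{\'\i}fica"))
# Revista española de documentación científica
print(latex_to_text(r"Computers \& Graphics"))
# Computers & Graphics
```

Because the function strips every remaining brace, it is only suitable for matching names, not for round-tripping entries back to BibTeX.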