python-bibtexparser Stripping curly brackets is too greedy

In #158 a customization to strip { and } from fields was introduced. The problem with the greedy approach in the current implementation is that it will also replace curly brackets in mathematical expressions, and these are occasionally used in titles of papers.

The solution would be to only replace curly brackets in text mode, i.e. iterate over the string and keep track of text or math mode, and only then replace curly brackets.

If you ignore the abstract field, then the only thing you need to worry about is $ ... $ or \( ... \) for inline math. Inside the abstract field (and maybe similar fields, if there are any), anything can happen, and one is royally screwed in coming up with an implementation.
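To make the idea concrete, here is a rough sketch of such a mode-tracking pass. It is not the library's code: the function name strip_braces_outside_math is made up for illustration, and it assumes only $ ... $ and \( ... \) delimit math (no display math, no nesting).

```python
def strip_braces_outside_math(text):
    """Drop unescaped { and } in text mode; copy math mode verbatim."""
    out = []
    closer = None  # None in text mode, else the delimiter that ends math mode
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "\\" and i + 1 < len(text):
            # A backslash plus the next character is one unit, so escaped
            # characters such as \{, \} and \$ are copied untouched.
            pair = text[i:i + 2]
            if closer is None and pair == "\\(":
                closer = "\\)"   # entering \( ... \) math
            elif closer == pair:
                closer = None    # leaving \( ... \) math
            out.append(pair)
            i += 2
            continue
        if ch == "$":
            if closer is None:
                closer = "$"     # entering $ ... $ math
            elif closer == "$":
                closer = None    # leaving $ ... $ math
        elif ch in "{}" and closer is None:
            i += 1               # text mode: drop the brace
            continue
        out.append(ch)
        i += 1
    return "".join(out)


print(strip_braces_outside_math(r"The {Fano} variety $\mathbb{P}^{n}$"))
# prints: The Fano variety $\mathbb{P}^{n}$
```

This only covers the math-mode problem. Braces that delimit the argument of a text-mode command are still removed, so it would need more work before it could replace the current customization.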

Sep 18 '17 13:09 pbelmans

Thanks for opening an issue on this!

I thought more about it and found a couple of related issues:

There could be an issue with things such as \url{http://example.com} which I sometimes saw in BibTeX entries. It would be translated into \url http://example.com with current code.
There might be an issue with the convert_to_unicode code also.
Also, title={An FFT Algorithm} will yield An FFT Algorithm whereas it is An Fft Algorithm in LaTeX. Not sure how expected this is.

I think we should try to list some example snippets for now, before actually dealing with this issue, to ensure the solution will cover these cases. I don't think we already have BibTeX snippets which cover these cases in our test base.

Concerning the solution itself, there is a debate to be had about whether the output of plaintext_* should actually be plain text (which I think is the way to go for the use case in #116) or just LaTeX without curly braces. The first option is difficult to achieve, though :/

Sep 18 '17 14:09 Phyks

See also #193:

latex_to_unicode customization should preserve escaped braces. See https://github.com/sciunto-org/python-bibtexparser/blob/master/bibtexparser/latexenc.py#L70 and #187.

Oct 07 '18 04:10 omangin

Hey. Just dropping in here as a non-python developer and non-LaTeX user. So this comment might be uninformed. But is it possible you're focussing too heavily on the "brackets"? To me this looks like a LaTeX2e encoded string, and in python there are very good packages available to convert those to text.

This one seems to be the most prominent one: https://pylatexenc.readthedocs.io/en/latest/latex2text/

I think I have implemented it successfully as a customization somewhat as follows

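from pylatexenc.latex2text import LatexNodes2Text

# Build the converter once instead of once per field.
_converter = LatexNodes2Text()

def latex_to_text(record):
    # Convert every string field of the record from LaTeX to plain text;
    # non-string values are passed through unchanged.
    return {key: _converter.latex_to_text(value) if isinstance(value, str) else value
            for key, value in record.items()}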

If the text contains what your documentation calls "accents and weird characters" it seems to imply that it's LaTeX encoded, and hence will contain a lot more weird stuff than just the brackets that are being focussed on here ...

Hope this is of any help! Thx for the very useful toolbox!

Oct 23 '19 12:10 WouterJeuris

Appealing to an external library is a good way of letting someone else deal with the special situations. But in any case you'll need to add math_mode='verbatim' as an option, otherwise the whole point of not stripping curly brackets in math mode is defeated.
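Roughly, the customization with that option would look like the sketch below. It assumes pylatexenc 2.x is installed; the function name latex_to_plain_text and the sample record are only illustrative.

```python
def latex_to_plain_text(record):
    """Convert string fields to plain text, keeping math spans as LaTeX."""
    # Imported here so the dependency is only needed when the function runs.
    from pylatexenc.latex2text import LatexNodes2Text

    # math_mode="verbatim" asks pylatexenc to leave math spans unconverted.
    converter = LatexNodes2Text(math_mode="verbatim")
    return {
        key: converter.latex_to_text(value) if isinstance(value, str) else value
        for key, value in record.items()
    }


if __name__ == "__main__":
    sample = {"ID": "example", "title": r"On $\mathbb{P}^{n}$ and {Fano} varieties"}
    print(latex_to_plain_text(sample)["title"])
```

With this setting the braces inside the math should survive, while the protective braces around words in text mode are still dropped.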

Oct 23 '19 12:10 pbelmans

We're using pylatexenc as external library for now in v2. This may not be ideal (it's rather slow and not bibtex specific), but seems to be working well for all test cases reported so far.

If anyone wants to submit a fix for v1, I am happy to review it, but it seems to be a rather big change needed; it's probably easier to just migrate to v2.

May 26 '23 14:05 MiWeiss
