python-bibtexparser icon indicating copy to clipboard operation
python-bibtexparser copied to clipboard

Not throwing exception on error?

Open benureau opened this issue 6 years ago • 9 comments

Hi, I have noticed that the current master does not throw PyParsing error anymore on some incorrect input: for instance,

from bibtexparser.bparser import BibTexParser

btp = BibTexParser()

print(btp.parse('@misc{I am not a correct bibtex{{}', partial=False).get_entry_dict())

just returns the empty dict right now. Is that normal? What do I have to do to get a parsing error that can give feedback on what went wrong?

Oh, the BibTexParser(bibstring) behavior is not documented (or announced deprecated), and does not accept the partial argument: it probably should. One way or the other, this just seems confusing right now.

benureau avatar Aug 06 '18 05:08 benureau

See also #163 and #212.

the way the parser is implemented right now is to consider blocks that could not be parsed as implicit comments. Because of that, the parser silently ignores most failures while parsing.

I believe this is indeed not a really desirable behavior and we should probably change the default to something stricter (eventually having a permissive_implicit_comments option if deemed necessary). The remaining question is what should then be considered an implicit comment?

In my opinion a reasonable answer would be anything not starting by @alpha{.

Any thought on this?

omangin avatar Aug 20 '18 14:08 omangin

Hi!

First of all thanks a lot for the pointers and the amazing functionality that bibtexparser provides.

For many weeks now, I've been curating a dataset involving bibtex entries, and for that I've been using bibtexparser. After some random checks I've found out that faulty entries are silently ignored. This was a very unpleasant surprise, because I don't really know how much of the dataset I've actually been missing along the way.

I couldn't figure this out after reading through the docs, then I went to the issues and found this one. Do you know any way (API, other libs, changing lib code, anything) to fix this? If there is a possible fix within the lib I'll be happy to give it a shot and submit a PR. Just let me know!

Now to a small constructive rant: In my situation, and probably in most others, it is fundamental to know if the data is corrupt, and I agree with the above comment that silently ignoring exceptions is not really a desired behaviour. But more importantly, it isn't an expected behaviour: Even if it was desired, users should definitely be informed about this! I'd suggest to add information about the current behaviour to the README or somewhere visible ASAP, this is way too relevant to be relegated to issue discussions.

Cheers,
Andres

andres-fr avatar Mar 16 '21 00:03 andres-fr

I've encountered similar silent exceptions like this when it encounters unicode characters it doesn't know how to handle. In one case, a logging action from line 31 of bibtexexpression.py was actually logging the entire file as it was parsed, using my logging.basicConfig and dumping the whole thing in my error log file. This was difficult to debug. This might relate to a broader issue with unicode characters, but I thought I'd chime in as more explicit error handling would have been helpful here in my opinion.

P.S. I'd like to echo Andres, thank you for writing this library, it's been a huge help.

tyarosevich avatar Nov 10 '21 21:11 tyarosevich

Thanks for your acknowledgements. Unfortunately, I do not have the bandwidth to maintain this package anymore (as stated in the readme file). If someone wants to take care of it, I'm ready to provide what's necessary (access, whatever).

sciunto avatar Nov 17 '21 10:11 sciunto

Dear @sciunto,

Thank you very much for your answer, I totally understand the situation. I'll try to get a PR with the following done over christmas:

  1. Modify package to optionally raise exceptions for data errors (default behaviour: ON)
  2. Update documentation
  3. Roll out a new (minor?) version to PIP, check autodocs&TravisCI are consistent.

I'll try to get this done, maintenance would be probably too much for me as well. What do you think about this? Any tips where to look for point 1? Thanks again for your contributions and consideration.

Best regards, Andres

andres-fr avatar Nov 21 '21 22:11 andres-fr

@andres-fr Have you worked on this problem?

I came across this issue. In my case I want that the bibtexparser only parses valid bibtex and raises an exception otherwise. In my opinion on could change self.main_expression by deleting self.implicit_comment if one wants "strict parsing". I wrote the following wrapper around BibTexParser:

class StrictBibTexParser(BibTexParser):
    def __init__(self, *args: Any, **kwargs: Any) -> None:
        super().__init__(*args, **kwargs)
        bibexpr = BibtexExpression()
        bibexpr.main_expression = ZeroOrMore(
            bibexpr.string_def |
            bibexpr.preamble_decl |
            bibexpr.explicit_comment |
            bibexpr.entry)

        self._expr = bibexpr
        self._expr.set_string_name_parse_action(
            lambda s, l, t:
            BibDataString(self.bib_database, t[0])
        )

        # Set actions
        self._expr.entry.addParseAction(
            lambda s, l, t: self._add_entry(
                t.get('EntryType'), t.get('Key'), t.get('Fields')
            )
        )
        self._expr.explicit_comment.addParseAction(
            lambda s, l, t: self._add_comment(t[0])
        )
        self._expr.preamble_decl.addParseAction(
            lambda s, l, t: self._add_preamble(t[0])
        )
        self._expr.string_def.addParseAction(
            lambda s, l, t: self._add_string(
                t['StringName'].name,
                t['StringValue']
            )
        )

In this solution even a single wrong bibtex entry in a file with several bibtex entries will raise an error and the correct bibtex entries won't be parsed. On could define a less strict version by just adding a flag in the BibTexParser class that tells us whether or not to add a parser action to the implicit_comment expression, i.e. something like:

if self.add_implicit_comments:
    self._expr.implicit_comment.addParseAction(
                lambda s, l, t: self._add_comment(t[0])
                )

In this version errors in some bibtex entries of a larger collection are just ignored, and the correct bibtex entries will get parsed correctly.

One could add a parameter to BibTexParser to enable "strict" parsing. But I don't know which version is better, or maybe both have some use cases.

ibe-314 avatar Sep 01 '22 18:09 ibe-314

@ibe-314 unfortunately I didn't get around to do it, and iirc i didn't figure out a strategy either back then, but what you propose looks reasonabe to me, I'd say go for it!

andres-fr avatar Sep 01 '22 20:09 andres-fr

Hi @ibe-314

Thanks for posting the snippet. The swallowing of errors is a major issue and I am happy that you found a workaround that works for your use case.

What regards the strict parameter you proposed for the BibTexParser, I guess I'd be reluctant to merge such a PR: Implicit comments are a defined part of valid bibtex (and bibtex variants). Thus, failing parsing whenever there's an implicit comment would be quite unexpected. Thus I'd propose you continue with the subclass you proposed :+1:

I am afraid a clean solution is still a bit further away - but not quite as far as it once was: I have started drafting BibTexParser 2.0 which uses a fundamentally different parsing approach, which amongst other advantages would allow raising warnings whenever an implicit comment looks suspicious.

MiWeiss avatar Sep 02 '22 06:09 MiWeiss

Thank's for the information. Then I will use my workaround until BibTexParser 2.0 will be published :-)

ibe-314 avatar Sep 02 '22 06:09 ibe-314

V2 beta is now available (install from main branch). It creates "ParsingFailedBlocks". After parsing checking for the presence of such blocks allows to check for errors and their causes, without disabling the parsing of other blocks.

MiWeiss avatar May 26 '23 13:05 MiWeiss