
Improve performance on large files

Open khaeru opened this issue 6 years ago • 1 comments

I have a database of about 1000 entries / 10k lines, and a bibtexparser-based command line tool that I use to manipulate it. Some of these manipulations only touch a single entry, e.g. opening a file listed in a localfile key.

I found that the automatic parsing of all entries led to a noticeable, annoying delay (~1 second) in my tool. So I implemented a lazy-loading version of bibtexparser.bibdatabase.BibDatabase (linked below).

I'm not proposing this code specifically, as it doubtless misses some corner cases. But the more general issue is: random access of entries from large BibTeX databases should not require the CPU/memory overhead of parsing every entry.

https://github.com/khaeru/bib/blob/bd9698d3737f0abb650f97df4a101a3d876045f1/bib/util.py#L64-L169

khaeru avatar Feb 21 '19 14:02 khaeru

Thanks @khaeru! The parser being slow is indeed an issue that needs to be looked at. There is probably some room for improvement at the parser level, since it has not been optimized at all. (This could be a good start: pyparsing/Performance-Tips)

Regarding your implementation, it is a nice solution for lazy parsing of BibTeX files. I am, however, not sure whether this use case fits many bibtexparser users, nor how to integrate it into the current architecture without introducing too much complexity. If the answer to the first question turns out to be yes and you have a proposal for the second point, please open a PR.

Let's discuss the requests and use cases for performance improvements here.

omangin avatar Feb 24 '19 06:02 omangin

Fixed in v2: On Google Colab (free tier), thus a low-performance environment, parsing a file with ~45,000 entries takes between 5s (default parse stack) and 130s (with LaTeX decoding). On v1, the same file would not finish parsing even after a very long time...

MiWeiss avatar May 26 '23 13:05 MiWeiss