python-bibtexparser
Add methods to alter a database
Following #114
Add methods to alter a database (delete elements, keep a selection (slice?), more?) in the class BibDatabase()
I :+1: the deletion of elements and keeping a selection of elements. Pushing new elements to a database and replacing elements in it would be useful too, I think.
FWIW, I wrote a small module to abstract a bit on top of bibtexparser and interact directly with BibTeX files. These are the kind of functions that I think would be worth having directly for interacting with `BibDatabase` objects: https://github.com/Phyks/libbmc/blob/master/libbmc/bibtex.py
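For illustration, here is a minimal sketch of what two of the requested alterations (deleting elements, keeping a selection) could look like as helpers on top of the current API. The names `delete_entries` and `keep_entries` are hypothetical; the code only assumes that `BibDatabase.entries` is a list of dicts keyed by `'ID'`:

```python
def delete_entries(db, ids):
    """Remove every entry whose BibTeX key is in `ids` (hypothetical helper)."""
    ids = set(ids)
    db.entries = [entry for entry in db.entries if entry['ID'] not in ids]

def keep_entries(db, ids):
    """Keep only the entries whose BibTeX key is in `ids` -- a 'slice' of the database."""
    ids = set(ids)
    db.entries = [entry for entry in db.entries if entry['ID'] in ids]
```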
@sciunto Just noticed that `entries` and `entries_dict` do not behave the same way. Modifying `entries_dict` is not reflected in `entries`, and it then fails when I want to `bibtexparser.dumps` it. Not sure if this is intended behaviour, but from the API description I understood I could do this.
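A minimal sketch of the mismatch (the entry values here are made up for illustration):

```python
import bibtexparser

db = bibtexparser.loads("@article{foo, title = {Foo}}")
# Mutating the dict view...
db.entries_dict['bar'] = {'ID': 'bar', 'ENTRYTYPE': 'article', 'title': 'Bar'}
# ...does not show up in the list view, which is what dumps() serializes:
print(len(db.entries))         # still 1
print(bibtexparser.dumps(db))  # 'bar' is missing from the output
```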
Yes, that is expected: it was not designed for that so far. Just give me some time to implement what we need.
Ok about the design! Thanks a lot for this!
It would also be great to be able to read more than one file into the database. I didn't have a detailed look at the code, but what I've seen at line 173 of bparser.py is that `bib_database.entries` gets replaced by the new entries. Perhaps it is as easy as this?

```python
self.bib_database.entries.extend(records)  # extend rather than append: `records` is a list
```
It's not that easy: you must ensure that there are no duplicates, and if there is a duplicate, you need a rule for resolving it.
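For example, a minimal sketch of such a rule, keeping either the existing or the incoming entry on a key collision (the function name `merge_entries` is hypothetical):

```python
def merge_entries(existing, incoming, prefer_incoming=True):
    """Merge two lists of entry dicts, resolving duplicate BibTeX keys
    by keeping either the existing or the incoming entry."""
    merged = {entry['ID']: entry for entry in existing}
    for entry in incoming:
        if entry['ID'] in merged and not prefer_incoming:
            continue  # duplicate key: keep the existing entry
        merged[entry['ID']] = entry
    return list(merged.values())
```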
`BibDatabase` is fragile to modification of `entries`. For example:

```python
>>> import bibtexparser
>>> db = bibtexparser.bibdatabase.BibDatabase()
>>> db.entries = [{'ID': 'foo', 'title': 'Foo'}, {'ID': 'foo', 'title': 'Hehe'}]
>>> db.entries
[{'ID': 'foo', 'title': 'Foo'}, {'ID': 'foo', 'title': 'Hehe'}]
>>> db.get_entry_dict()
{'foo': {'ID': 'foo', 'title': 'Hehe'}}
```
Moreover, the method `BibDatabase.get_entry_dict` returns stale results if the attribute `entries` is modified after the first call:

```python
>>> db.entries = [{'ID': 'foo', 'title': 'Foo'}, {'ID': 'foo', 'title': 'Hehe'}]
>>> db.get_entry_dict()
{'foo': {'ID': 'foo', 'title': 'Hehe'}}
>>> db.entries = None
>>> db.get_entry_dict()  # stale: still reflects the old entries
{'foo': {'ID': 'foo', 'title': 'Hehe'}}
```
because only `_entries_dict` is memoized. The list `entries` needs to be memoized too, e.g., as `_entries`, and the check replaced with a test of whether `_entries` still matches `entries`:
```python
# initialize
self.entries = list()
self._entries = None
self._entries_dict = None

# and then in `self.get_entry_dict()`
if self._entries == self.entries:
    return self._entries_dict
self._entries = list(self.entries)
self._entries_dict = {}  # rebuild from scratch so stale keys are dropped
for entry in self.entries:
    self._entries_dict[entry['ID']] = entry
return self._entries_dict
```
If memory and time are not a concern (i.e., most people probably don't work with millions of bib entries -- it depends on the user base), then the simplest solution is to rebuild the dict on every call, without any memoization. (Note that a plain `dict(self.entries)` would not work here, since the entries are dicts themselves, not key-value pairs.)
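A minimal sketch of that non-memoized variant (a hypothetical replacement for `get_entry_dict`, not the current implementation):

```python
def get_entry_dict(self):
    # Rebuilt on every call; with duplicate keys, later entries silently win.
    return {entry['ID']: entry for entry in self.entries}
```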
I do understand that the current implementation is supposed to be simple, small (which is good), and one-pass, but:
- some small modifications, like memoizing `entries`, don't sacrifice simplicity; they may matter for performance, but it seems that optimization is not the primary goal here.
- it would be good to remark on these limitations in the docstrings or user documentation.
Moreover, it would be interesting to merge multiple BibTeX files. As remarked above, however, this introduces the issue of duplicate BibTeX keys. Note that this issue can also arise when loading entries from a single BibTeX file, because the file itself is not a dict and so nothing guarantees key uniqueness (editing by hand, for example, does not). Therefore, this is a valid issue to resolve (I am interested in remarks and preferences, because I am working on some functions for that purpose).
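One possible sketch for loading several files into one database while surfacing key collisions; the suffixing policy and the function name `load_files` are just illustrative assumptions:

```python
import bibtexparser

def load_files(paths):
    """Parse several .bib files into a single BibDatabase, renaming
    colliding keys with a numeric suffix so no entry is silently lost."""
    db = bibtexparser.bibdatabase.BibDatabase()
    seen = set()
    for path in paths:
        with open(path) as handle:
            partial = bibtexparser.load(handle)
        for entry in partial.entries:
            key, n = entry['ID'], 1
            while key in seen:
                key = '{}-{}'.format(entry['ID'], n)
                n += 1
            entry['ID'] = key
            seen.add(key)
            db.entries.append(entry)
    return db
```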
Replicated entries shouldn't be a problem: keeping a unique representative for each set of replicas suffices. However, this only covers exact entry replicas. Entries with identical keys but differences in other BibTeX fields should be treated as distinct entries; they should be flagged as similar, and dumping them should list them grouped together (thanks to the lexicographic sorting of keys), so that the user can then easily proceed with merging (this is what I am working on).
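A sketch of the exact-replica case, where two entries count as replicas only if all their fields match (the name `deduplicate_exact` is hypothetical):

```python
def deduplicate_exact(entries):
    """Keep one representative per set of exact replicas, preserving order.
    Entries sharing a key but differing in any field are all kept."""
    seen = set()
    kept = []
    for entry in entries:
        fingerprint = tuple(sorted(entry.items()))
        if fingerprint not in seen:
            seen.add(fingerprint)
            kept.append(entry)
    return kept
```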
Even better, a report of how much two entries differ can be produced by applying the Levenshtein distance to the values of the differing fields, e.g., using the `distance` package.
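For instance, a sketch of such a report for two entries sharing a key, assuming the third-party `distance` package (`pip install distance`):

```python
import distance

def diff_report(entry_a, entry_b):
    """Report, per differing field, how far apart two same-key entries are."""
    report = {}
    for field in set(entry_a) | set(entry_b):
        value_a = entry_a.get(field, '')
        value_b = entry_b.get(field, '')
        if value_a != value_b:
            report[field] = distance.levenshtein(value_a, value_b)
    return report
```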
Closing issue: not planned for v1. Supported in v2 (through custom middleware).