
Add methods to alter a database

Open sciunto opened this issue 9 years ago • 8 comments

Following #114

Add methods to alter a database (delete elements, keep a selection (slice?), more?) in the class BibDatabase()

sciunto avatar Feb 04 '16 13:02 sciunto

I :+1: deleting elements and keeping a selection of elements. Pushing new elements to / replacing elements in a database would be useful too, I think.

Phyks avatar Feb 04 '16 13:02 Phyks

fwiw, I wrote a small module that abstracts a bit on top of bibtexparser and interacts directly with BibTeX files. These are the kinds of functions that I think would be worth having to interact directly with BibDatabase objects. https://github.com/Phyks/libbmc/blob/master/libbmc/bibtex.py

Phyks avatar Feb 07 '16 14:02 Phyks

@sciunto Just noticed that entries and entries_dict do not behave the same way. Modifying entries_dict is not reflected in entries, and it then fails when I want to bibtexparser.dumps it.

Not sure whether this is intended behaviour, but from the API description I understood that I could do it.

Phyks avatar Feb 08 '16 17:02 Phyks

Yes, that's normal; it was not designed for that so far. Just give me some time to implement what we need.

sciunto avatar Feb 09 '16 01:02 sciunto

Ok about the design! Thanks a lot for this!

Phyks avatar Feb 09 '16 08:02 Phyks

It would also be great to read more than one file into the database. I didn't have a detailed look at the code, but from what I've seen at line 173 of bparser.py, bib_database.entries gets replaced by the new entries. Perhaps it is as easy as this:

self.bib_database.entries.extend(records)

p-vitt avatar Mar 18 '16 09:03 p-vitt

self.bib_database.entries.extend(records)

It's not that easy. You must ensure that there are no duplicates, and if there is a duplicate, you need a rule for resolving it.
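
For illustration, here is a minimal sketch of such a rule (the helper name and the keep/replace policy are arbitrary choices for the example, not bibtexparser API):

def merge_entries(existing, new, on_duplicate='keep'):
    """Merge two lists of entry dicts, resolving duplicate IDs with a simple rule.

    on_duplicate: 'keep' keeps the entry already present in `existing`,
                  'replace' overwrites it with the entry from `new`.
    """
    merged = {entry['ID']: entry for entry in existing}
    for entry in new:
        if entry['ID'] not in merged or on_duplicate == 'replace':
            merged[entry['ID']] = entry
    return list(merged.values())

One would then write db.entries = merge_entries(db.entries, other_db.entries) instead of extending blindly.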

sciunto avatar Mar 18 '16 11:03 sciunto

BibDatabase is fragile with respect to modifications of entries. For example:

>>> import bibtexparser
>>> db = bibtexparser.bibdatabase.BibDatabase()
>>> db.entries = [{'ID': 'foo', 'title': 'Foo'}, {'ID': 'foo', 'title': 'Hehe'}]
>>> db.entries
[{'ID': 'foo', 'title': 'Foo'}, {'ID': 'foo', 'title': 'Hehe'}]
>>> db.get_entry_dict()
{'foo': {'ID': 'foo', 'title': 'Hehe'}}

Moreover, the method BibDatabase.get_entry_dict returns stale results if the attribute entries is modified after the first call:

>>> db.entries = [{'ID': 'foo', 'title': 'Foo'}, {'ID': 'foo', 'title': 'Hehe'}]
>>> db.get_entry_dict()
{'foo': {'ID': 'foo', 'title': 'Hehe'}}
>>> db.entries = None
>>> db.get_entry_dict()
{'foo': {'ID': 'foo', 'title': 'Hehe'}}

because only _entries_dict is memoized. The list entries needs to be memoized too, e.g., as _entries, and the check replaced with a comparison of _entries against the current entries:

# initialize
self.entries = list()
self._entries = None
self._entries_dict = None

# and then in `self.get_entry_dict()`
if self._entries == self.entries:
    return self._entries_dict
self._entries = list(self.entries)
self._entries_dict = dict()  # rebuild from scratch so stale keys do not linger
for entry in self.entries:
    self._entries_dict[entry['ID']] = entry
return self._entries_dict
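
As a self-contained illustration of this approach (the subclass and its name are hypothetical, only meant to show the memoization behaviour on top of the existing class):

from bibtexparser.bibdatabase import BibDatabase

class RefreshingBibDatabase(BibDatabase):
    """BibDatabase variant that rebuilds the entry dict whenever `entries` changed."""

    def __init__(self):
        super(RefreshingBibDatabase, self).__init__()
        self._entries = None  # snapshot of `entries` used for the last dict build

    def get_entry_dict(self):
        if self._entries == self.entries:
            return self._entries_dict
        self._entries = list(self.entries)
        self._entries_dict = {entry['ID']: entry for entry in self.entries}
        return self._entries_dict

db = RefreshingBibDatabase()
db.entries = [{'ID': 'foo', 'title': 'Foo'}]
print(db.get_entry_dict())  # {'foo': {'ID': 'foo', 'title': 'Foo'}}
db.entries = [{'ID': 'bar', 'title': 'Bar'}]
print(db.get_entry_dict())  # {'bar': {'ID': 'bar', 'title': 'Bar'}}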

If memory and time are not a concern (most people probably don't work with millions of bib entries -- it depends on the user base), then the simplest solution is to rebuild the dictionary on every call, i.e., return {entry['ID']: entry for entry in self.entries}, without any memoization.

I do understand that the current implementation is supposed to be simple and small (which is good) and one-pass, but:

  1. some small modifications, like memoizing entries, don't sacrifice simplicity. They may matter for performance, but optimization does not seem to be the primary goal here.
  2. it would be good to note these limitations in the docstrings or in the user documentation.

Moreover, it would be interesting to be able to merge multiple BibTeX files. However, as remarked above, this raises the issue of duplicate BibTeX keys. Note that the issue can also arise when loading entries from a single BibTeX file, because a file is not a dict and so does not guarantee key uniqueness (unless whatever created it does -- editing by hand, for example, does not). Therefore, this is a valid issue to resolve (I am interested in remarks and preferences, because I am working on some functions for that purpose).
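
For example, duplicate keys can already be detected right after parsing, before deciding how to merge (a sketch using only the standard library on top of bibtexparser.load; the file name is a placeholder):

from collections import Counter

import bibtexparser

with open('references.bib') as bibtex_file:
    db = bibtexparser.load(bibtex_file)

key_counts = Counter(entry['ID'] for entry in db.entries)
duplicate_keys = [key for key, count in key_counts.items() if count > 1]
if duplicate_keys:
    print('Duplicate BibTeX keys:', duplicate_keys)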

Exact replicas shouldn't be a problem: keeping a unique representative for each set of replicas suffices. Entries with identical keys but differences in other BibTeX fields, however, should be treated as distinct entries. In particular, they should be identified as similar, and dumping them lists them grouped together (thanks to lexicographic sorting of keys), so that the user can then easily proceed with merging (this is what I am working on).
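
A sketch of that distinction (hypothetical helper, not bibtexparser API): exact replicas collapse to a single representative, while entries that share a key but differ in some field are collected for manual merging:

def split_replicas(entries):
    """Collapse exact replicas; collect same-key entries that differ in any field."""
    unique = {}      # key -> representative entry
    conflicts = {}   # key -> entries sharing that key but differing in some field
    for entry in entries:
        key = entry['ID']
        if key not in unique:
            unique[key] = entry
        elif entry == unique[key]:
            continue  # exact replica: the representative already covers it
        else:
            conflicts.setdefault(key, [unique[key]]).append(entry)
    return list(unique.values()), conflicts

Sorting the conflicting entries by key before dumping then groups the merge candidates, as described above.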

Even better, a report of how much two entries differ can be produced by applying the Levenshtein distance to the values of the differing fields, e.g., using the distance package.
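
A sketch of such a report, here using difflib from the standard library as a stand-in for a proper Levenshtein distance (the distance package mentioned above could be dropped in instead):

import difflib

def field_differences(entry_a, entry_b):
    """Return, for each field where the two entries differ, a similarity ratio in [0, 1]."""
    report = {}
    for field in set(entry_a) | set(entry_b):
        value_a = entry_a.get(field, '')
        value_b = entry_b.get(field, '')
        if value_a != value_b:
            report[field] = difflib.SequenceMatcher(None, value_a, value_b).ratio()
    return report

For two entries that share a key but have slightly different titles, this yields a per-field ratio close to 1, which is enough to rank merge candidates.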

johnyf avatar Sep 07 '16 11:09 johnyf

Closing issue: Not planned for v1. Supported in v2 (through custom middleware)

MiWeiss avatar May 26 '23 13:05 MiWeiss