python-bibtexparser When parsing multiple sources with same parser, database object is shared

When the same parser is used to parse multiple source files, the database object is shared and only a reference is returned. This has some (at least to me) unexpected consequences. So far I have run into:

1. If entries_dict is used, _entries_dict already exists and is not overwritten even though entries have changed. In such a case, entries and entries_dict do not match.
2. A parsed database is overwritten if the object is not explicitly copied.

Is this intended behavior? To me it feels more natural if the parser creates a new database object every time a new source file is parsed.

I have forked and created a hacky solution by moving database initialization to the parse function. It works for my purposes, but I don't fully understand how and why the database object is referenced internally in the parser, so I'm not sure this is a good idea.
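To make the idea concrete, here is a minimal sketch of the two behaviours using a toy model. It is not bibtexparser's actual code: ToyDatabase, SharedParser and FreshParser are hypothetical names, and the toy parse() takes a list of IDs instead of BibTeX text. It only assumes what the outputs in this issue suggest, namely that the database is created once per parser and that entries_dict is built lazily and then cached.

```python
class ToyDatabase:
    """Stand-in for a parsed database with a lazily built entries_dict."""

    def __init__(self):
        self.entries = []
        self._entries_dict = {}

    @property
    def entries_dict(self):
        # Built on first access only; not refreshed if entries change later.
        if not self._entries_dict:
            self._entries_dict = {e["ID"]: e for e in self.entries}
        return self._entries_dict


class SharedParser:
    """Creates its database once, so every parse() returns the same object."""

    def __init__(self):
        self.database = ToyDatabase()

    def parse(self, ids):
        # Replaces the entries on the one shared database object.
        self.database.entries = [{"ID": i} for i in ids]
        return self.database


class FreshParser(SharedParser):
    """Same parser, but a new database is created inside parse()."""

    def parse(self, ids):
        self.database = ToyDatabase()
        return super().parse(ids)


shared = SharedParser()
s1 = shared.parse(["article1"])
list(s1.entries_dict)  # first access fills the cache
s2 = shared.parse(["article2"])  # same object as s1, cache is now stale

fresh = FreshParser()
f1 = fresh.parse(["article1"])
f2 = fresh.parse(["article2"])  # independent of f1
```

In this toy model, s1 and s2 are the same object with a stale entries_dict, while f1 and f2 stay independent. Whether the real parser keeps other references to its database that such a change would break is exactly the part I have not verified.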

Thanks for a great package, by the way!

Example of 1

from bibtexparser.bparser import BibTexParser

parser = BibTexParser()

db1 = parser.parse("""
@article{article1,
    title = {Article 1}
}
""")

print("db1.entries contains: {}".format(', '.join(e["ID"] for e in db1.entries)))
print("db1.entries_dict contains: {}".format(', '.join(db1.entries_dict.keys())))

db2 = parser.parse("""
@article{article2,
    title = {Article 2}
}
""")

print("db2.entries contains: {}".format(', '.join(e["ID"] for e in db2.entries)))
print("db2.entries_dict contains: {}".format(', '.join(db2.entries_dict.keys())))

Gives the output:

db1.entries contains: article1
db1.entries_dict contains: article1
db2.entries contains: article2
db2.entries_dict contains: article1

Example of 2

from bibtexparser.bparser import BibTexParser

parser = BibTexParser()

db1 = parser.parse("""
@article{article1,
    title = {Article 1}
}
""")

db2 = parser.parse("""
@article{article2,
    title = {Article 2}
}
""")


print("db1.entries contains: {}".format(', '.join(e["ID"] for e in db1.entries)))
print("db1.entries_dict contains: {}".format(', '.join(db1.entries_dict.keys())))

print("db2.entries contains: {}".format(', '.join(e["ID"] for e in db2.entries)))
print("db2.entries_dict contains: {}".format(', '.join(db2.entries_dict.keys())))

Gives the output:

db1.entries contains: article2
db1.entries_dict contains: article2
db2.entries contains: article2
db2.entries_dict contains: article2

May 30 '17 11:05 joelgoop

Sure, that is some weird behaviour.

I would tend to think that the correct expected output should be (but I may be missing something, it's quite late :/)

db1.entries contains: article1
db1.entries_dict contains: article1
db2.entries contains: article2
db2.entries_dict contains: article2

for both examples.

There are quite a bunch of PRs waiting at the moment; I will try to handle them and will then have a deeper look at the problem. If you have a working solution and feel like you can clean it up and make a PR, that would be awesome :)

Jun 01 '17 02:06 Phyks

Yes, that's exactly what I expect! If I can find the time I'll try to clean up my solution and submit a PR. My changes are pretty minimal and seem to work fine for me, but I'll have to dig a little deeper to figure out whether it's a good idea at all and make sure it doesn't break anything else.

(In the script where I am using this, my main problem was that entries_dict wasn't updated when I parsed a new file. Just in case anyone else encounters this issue, that particular problem can be worked around by setting database._entries_dict = {} after each parse.)
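For anyone who wants to wrap that workaround, a rough sketch follows. parse_isolated is a hypothetical helper, not part of bibtexparser. It relies on the private _entries_dict attribute mentioned above, and it assumes the returned database can be deep-copied safely, which I have not checked against the library.

```python
import copy


def parse_isolated(parser, text):
    """Parse text and return a database that later parses cannot change.

    Hypothetical helper: works with any parser whose parse() returns an
    object carrying a private _entries_dict cache.
    """
    db = parser.parse(text)
    # Workaround from this thread: drop the stale cache so entries_dict
    # is rebuilt from the current entries on the next access.
    db._entries_dict = {}
    # Assumption: the database object is safe to deep-copy. The copy
    # protects the result from being overwritten by the next parse.
    return copy.deepcopy(db)
```

Usage would be db1 = parse_isolated(parser, source1) in place of db1 = parser.parse(source1). The deep copy addresses the second problem (the overwritten database) at the cost of copying every entry.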

Jun 01 '17 14:06 joelgoop

Will be solved with #308

Aug 17 '22 19:08 MiWeiss