When parsing multiple sources with the same parser, the database object is shared
When the same parser is used to parse multiple source files, the database object is shared and only a reference is returned. This has some (at least to me) unexpected consequences. So far I have run into:
1. If entries_dict is used, _entries_dict already exists and is not overwritten even though entries have changed. In such a case, entries and entries_dict do not match.
2. A parsed database is overwritten if the object is not explicitly copied.
Is this intended behavior? To me it feels more natural if the parser creates a new database object every time a new source file is parsed.
I have forked and created a hacky solution by moving database initialization to the parse function. It works for my purposes, but I don't fully understand how and why the database object is referenced internally in the parser, so I'm not sure this is a good idea.
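To illustrate the idea of moving database initialization into the parse function, here is a toy sketch. The names ToyDatabase, SharedParser and FreshParser are made up for this illustration and are not bibtexparser's internals. It only assumes that the parser creates its database once and hands back a reference, which is what the behavior described above suggests.

```python
class ToyDatabase:
    """Hypothetical stand-in for the parsed database; not the library's class."""

    def __init__(self):
        self.entries = []


class SharedParser:
    """Creates its database once, so every parse() returns the same object."""

    def __init__(self):
        self.database = ToyDatabase()

    def parse(self, ids):
        # Overwrites the entries on the one shared database object.
        self.database.entries = [{"ID": i} for i in ids]
        return self.database


class FreshParser:
    """Initializes the database inside parse(), so results are independent."""

    def parse(self, ids):
        self.database = ToyDatabase()
        self.database.entries = [{"ID": i} for i in ids]
        return self.database


def ids_of(db):
    """Return the entry IDs stored in a toy database."""
    return [e["ID"] for e in db.entries]


shared = SharedParser()
s1 = shared.parse(["article1"])
s2 = shared.parse(["article2"])
print(s1 is s2, ids_of(s1))  # True ['article2']

fresh = FreshParser()
f1 = fresh.parse(["article1"])
f2 = fresh.parse(["article2"])
print(f1 is f2, ids_of(f1))  # False ['article1']
```

In the shared variant the first result is silently replaced by the second parse. In the fresh variant each call returns its own object. Whether the real parser relies on the shared reference elsewhere is exactly the open question.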
Thanks for a great package, by the way!
Example of 1
from bibtexparser.bparser import BibTexParser
parser = BibTexParser()
db1 = parser.parse("""
@article{article1,
title = {Article 1}
}
""")
print("db1.entries contains: {}".format(', '.join(e["ID"] for e in db1.entries)))
print("db1.entries_dict contains: {}".format(', '.join(db1.entries_dict.keys())))
db2 = parser.parse("""
@article{article2,
title = {Article 2}
}
""")
print("db2.entries contains: {}".format(', '.join(e["ID"] for e in db2.entries)))
print("db2.entries_dict contains: {}".format(', '.join(db2.entries_dict.keys())))
Gives the output:
db1.entries contains: article1
db1.entries_dict contains: article1
db2.entries contains: article2
db2.entries_dict contains: article1
Example of 2
from bibtexparser.bparser import BibTexParser
parser = BibTexParser()
db1 = parser.parse("""
@article{article1,
title = {Article 1}
}
""")
db2 = parser.parse("""
@article{article2,
title = {Article 2}
}
""")
print("db1.entries contains: {}".format(', '.join(e["ID"] for e in db1.entries)))
print("db1.entries_dict contains: {}".format(', '.join(db1.entries_dict.keys())))
print("db2.entries contains: {}".format(', '.join(e["ID"] for e in db2.entries)))
print("db2.entries_dict contains: {}".format(', '.join(db2.entries_dict.keys())))
Gives the output:
db1.entries contains: article2
db1.entries_dict contains: article2
db2.entries contains: article2
db2.entries_dict contains: article2
Sure, that is some weird behaviour.
I would tend to think that the correct expected output should be (but I may be missing something, it's quite late :/)
db1.entries contains: article1
db1.entries_dict contains: article1
db2.entries contains: article2
db2.entries_dict contains: article2
for both examples.
There are quite a few PRs waiting at the moment. I will try to handle them and then have a deeper look at the problem. If you have a working solution and feel like you can clean it up and make a PR, that would be awesome :)
Yes, that's exactly what I expect! If I can find the time I'll try to clean up my solution and submit a PR. My changes are pretty minimal and seem to work fine for me, but I'll have to dig a little deeper to figure out whether it's a good idea at all and make sure it doesn't break anything else.
(In the script where I am using this, my main problem was that entries_dict wasn't updated when I parsed a new file. Just in case anyone else encounters this issue, that particular problem can be worked around by setting database._entries_dict = {} after each parse.)
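A toy sketch of why resetting _entries_dict helps. ToyDatabase is a hypothetical stand-in, and the assumption that the dictionary is built lazily and only when the cache is empty is inferred from the output in the first example, not taken from the library's source.

```python
class ToyDatabase:
    """Hypothetical stand-in: entries_dict is built lazily and cached."""

    def __init__(self):
        self.entries = []
        self._entries_dict = {}

    @property
    def entries_dict(self):
        # Assumed behavior: the cache is only filled when it is empty.
        if not self._entries_dict:
            for entry in self.entries:
                self._entries_dict[entry["ID"]] = entry
        return self._entries_dict


database = ToyDatabase()
database.entries = [{"ID": "article1"}]
first_keys = list(database.entries_dict)   # fills the cache

database.entries = [{"ID": "article2"}]    # simulates a second parse
stale_keys = list(database.entries_dict)   # cache still holds article1

database._entries_dict = {}                # the workaround from above
fresh_keys = list(database.entries_dict)   # rebuilt from current entries

print(first_keys, stale_keys, fresh_keys)
# ['article1'] ['article1'] ['article2']
```

Note that the reset only fixes the stale dictionary. It does nothing about the second problem, since both names still point at the same database object.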
Will be solved with #308