python-bibtexparser

Normalize field keys (to lowercase)

Open csware opened this issue 1 year ago • 8 comments

Describe the bug

I have several .bib files that contain (mixed) field keys that are either in lowercase or start with a capital letter, such as "Author" and "Title". No other tooling complains about this.

SeparateCoAuthors does not work and I cannot uniformly access the fields using e.g. entry['title']

In v1, field keys were normalized to lowercase.

Maybe this can be fixed using a middleware? I would be really grateful!

Reproducing

Version: e3757c13abf2784bda612464843ab30256317e6c

Code:


#!/usr/bin/python

import bibtexparser
import bibtexparser.middlewares as m

layers = [
	m.LatexDecodingMiddleware(),
	m.MonthIntMiddleware(True), # Months should be represented as int (1-12)
	m.SeparateCoAuthors(True), # Co-authors should be separated as list of strings
	m.SplitNameParts(True), # Individual Names should be split into first, von, last, jr parts
	m.MergeNameParts("last", True) # Individual Names should be merged into Last, First...
]

bib_database = bibtexparser.parse_file('data/Survey.bib', append_middleware=layers)
for entry in bib_database.entries:
	print(entry['title'])

Bibtex:

@InCollection{Name2006,
  Title                    = {A Title},
  Author                   = {Name, First and Name, Second},
  Booktitle                = {ITS},
  Publisher                = {Some publisher},
  Year                     = {2006},
  Pages                    = {61--70}
}

Remaining Questions (Optional)

Please tick all that apply:

  • [ ] I would be willing to contribute a PR to fix this issue.
  • [ ] This issue is a blocker, I'd be grateful for an early fix.

csware, Feb 10 '24 22:02

Thanks!

  • [ ] We should add a middleware that normalizes field names.
  • [ ] We could consider a default lower-case mapping.

tdegeus, Feb 11 '24 15:02

Maybe something like this (Works For Me™)?

from typing import Collection, Union

import bibtexparser
from bibtexparser.library import Library
from bibtexparser.middlewares.middleware import BlockMiddleware
from bibtexparser.model import Block, Entry

class NormalizeFieldNames(BlockMiddleware):
    """Lowercase the key of every field in each entry."""

    def __init__(self,
                 allow_inplace_modification: bool = True):
        super().__init__(allow_inplace_modification=allow_inplace_modification,
                         allow_parallel_execution=True)

    def transform_entry(self, entry: Entry, library: "Library") -> Union[Block, Collection[Block], None]:
        # Note: keys that differ only in capitalization will collide here.
        for field in entry.fields:
            field.key = field.key.lower()
        return entry

Usage example:

library = bibtexparser.parse_file(filename,
                                  append_middleware=[NormalizeFieldNames(),
                                                     bibtexparser.middlewares.SeparateCoAuthors(),
                                                     bibtexparser.middlewares.SplitNameParts()])

Technologicat, Feb 14 '24 12:02

That's probably alright. Would you be willing to convert it to a PR (adding a test)? I think this is a quite common use-case that we should support.

tdegeus, Feb 14 '24 13:02

Fully agree with @tdegeus, and would appreciate a PR by @Technologicat

Just one remark: We'd have to be able to handle "new" duplicates somehow (i.e., if two field keys exist in the original block which only differ in their capitalization). That's particularly important now that we're pushing the use of entries as dicts. In principle, we have a block type DuplicateFieldKeyBlock that should be used here, but I am also happy to support additional suggestions. These would probably have to be enabled with a corresponding constructor parameter (e.g. raising an exception). Does this make sense?
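For illustration, a minimal sketch of how such "new" duplicates could be detected before any key is rewritten. It works on plain key strings rather than on bibtexparser objects, and `find_case_collisions`, `check_keys`, and the `raise_on_conflict` flag are hypothetical names made up for this example, not part of the library:

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def find_case_collisions(keys: Iterable[str]) -> Dict[str, List[str]]:
    """Return lowercased keys that are reached from more than one field.

    Hypothetical helper: maps each colliding lowercased key to the
    original spellings, in the order they appeared.
    """
    spellings: Dict[str, List[str]] = defaultdict(list)
    for key in keys:
        spellings[key.lower()].append(key)
    # Keep only the groups that contain more than one original field.
    return {low: orig for low, orig in spellings.items() if len(orig) > 1}


def check_keys(keys: Iterable[str], raise_on_conflict: bool = False) -> Dict[str, List[str]]:
    """Report collisions, optionally raising (hypothetical opt-in flag)."""
    collisions = find_case_collisions(keys)
    if collisions and raise_on_conflict:
        raise ValueError(f"Field keys collide after lowercasing: {collisions}")
    return collisions
```

With the keys `["Title", "title", "Author"]`, only `title` would be reported, since `Author` has a single spelling. What the middleware then does with a reported collision (raise, warn, or mark the block as failed) is the open design question in this thread.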

MiWeiss, Feb 14 '24 20:02

@tdegeus: Sure.

@MiWeiss: Good point about conflicting keys. But I'll need a bit more information about the desired way to tackle it.

The way this approximately went is, yesterday I got a sudden need to extract some data from BibTeX in Python.

Within an hour, I had installed bibtexparser, upgraded it to 2.x, run into this issue (since my datafiles happened to use capitalized keys), written the simplest possible field key normalizer, and posted a copy here. So it's fair to say I'm kind of new to this project :)

Technologicat, Feb 15 '24 08:02

A solution would be to issue a warning (similar to library.failed_blocks) and use the last key value.
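A rough sketch of that "warn and let the last value win" behaviour, using plain `(key, value)` tuples as a stand-in for the entry's fields. The function name `lowercase_last_wins` is made up for this example, and the actual middleware may report the conflict differently:

```python
import warnings
from typing import Dict, Iterable, Tuple


def lowercase_last_wins(fields: Iterable[Tuple[str, str]]) -> Dict[str, str]:
    """Lowercase field keys; on a collision, warn and keep the later value."""
    normalized: Dict[str, str] = {}
    for key, value in fields:
        low = key.lower()
        if low in normalized:
            # An earlier field already produced this lowercased key.
            warnings.warn(
                f"Duplicate field key '{low}' after normalization; "
                "keeping the last value."
            )
        normalized[low] = value  # later fields overwrite earlier ones
    return normalized
```

For example, `[("Title", "A"), ("Author", "X"), ("title", "B")]` would produce one warning and leave `title` set to `"B"`.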

csware, Feb 15 '24 08:02

@csware: Thanks. Yes, that's one possible solution, and probably the simplest one that works.

~Considering alternatives, what about the DuplicateFieldKeyBlock mentioned by @MiWeiss?~ EDIT: Never mind, I think I understood what you all meant now.

Technologicat, Feb 15 '24 09:02

Implemented, using @csware's suggestion of emitting a warning and letting the last value win. Please review.

Technologicat, Feb 19 '24 11:02