Generate Bibtex Keys Improvement

Open pedropaulofb opened this issue 3 years ago • 7 comments

First of all, thank you for your work and for providing it for free for the community!

Please consider improving the Generate BibTeX Keys functionality. Currently, it takes the first word of the work's title and places it after the year, following the pattern LastnameYearFirstword, even when that first word is an article (a, an, the). Skipping articles and using the first word after them would produce a better BibTeX key. E.g., for the work "A literature review of the economics of COVID‐19" (Brodeur et al., 2021), the better BibTeX key would be brodeur2021literature (as Google Scholar does) and not brodeur2021a (as the current BibTeX Tidy version does).
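To illustrate the request, here is a minimal Python sketch of article-skipping key generation. It is not BibTeX Tidy's actual implementation; the function names and the three-word article list are assumptions made for this example.

```python
import re

# Assumption: only English articles are skipped; the list is illustrative.
ARTICLES = {"a", "an", "the"}


def first_meaningful_word(title: str) -> str:
    """Return the first lowercase word of the title that is not an article."""
    words = re.findall(r"[A-Za-z0-9]+", title.lower())
    for word in words:
        if word not in ARTICLES:
            return word
    # Fall back to the first word (or an empty string) if nothing else is left.
    return words[0] if words else ""


def make_key(surname: str, year: int, title: str) -> str:
    """Build a key following the LastnameYearFirstword pattern."""
    return f"{surname.lower()}{year}{first_meaningful_word(title)}"


print(make_key("Brodeur", 2021, "A literature review of the economics of COVID-19"))
# prints: brodeur2021literature
```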

pedropaulofb • Mar 11 '22 13:03

More generally, I think having some more customization options on how the citekey is created would be great. (casing, inclusion of second author/et als, etc.)

Also thanks for this fantastic little tool!

chrisgrieser • Mar 15 '22 00:03

Thanks @pedropaulofb, this is a current limitation of the key generation. I'd like to improve it to remove articles, but I'm conscious that this should also work for non-English titles.

@chrisgrieser I agree it would be good to have customisable keys. My current proposal is to allow entering a template string such as:

  • {authors(1)}{year}{word(title)} - surname of the first author, year, and a keyword from the title. e.g. "west2016quantified"
  • {AuthorsEtAl(2)}-{year}-{month}-{Word(journal)} - first two surnames with et al if truncated, year, month, and first word of the journal capitalized, all separated by dashes. E.g. "WestGiordanoEtAl-2016-jan-Computing"

Tokens:

  • authors(limit) - surname of authors, lowercase. limit can be omitted to list all authors.
  • authorsEtAl(limit) - as above but ends with EtAl if there are more authors than the limit.
  • word(field) - first meaningful word of the given field (i.e. not "a", "an", "the", etc.; other languages also need to be considered).
  • date(format) - outputs a date using a given pattern (e.g. YYYY, YYYY-MM-DD). This may be tricky because the month is not always present or easily parseable (it could be a number or a month name, and not always in English).

They all output lowercase but the token could be capitalised for capitalised output (e.g. Authors(1))
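A rough Python sketch of how such a template could be expanded is shown here. It only illustrates the proposal and is not the tool's code: the `render_key` function, the entry dictionary layout, the lowercase "etal" suffix, and the sample entry values are all assumptions, and the date(format) token is left out.

```python
import re

# Assumption: a tiny English-only stop list; other languages would need more.
STOP_WORDS = {"a", "an", "the"}
# Matches {name} or {name(arg)}.
TOKEN = re.compile(r"\{(\w+)(?:\((\w*)\))?\}")


def _words(text):
    return re.findall(r"[A-Za-z0-9]+", text)


def _case(word, capitalise):
    return word.capitalize() if capitalise else word.lower()


def render_key(template, entry):
    """Expand authors(), authorsEtAl(), word() and plain field tokens."""
    def expand(match):
        name, arg = match.group(1), match.group(2)
        capitalise = name[0].isupper()  # capitalised token, capitalised output
        token = name.lower()
        if token in ("authors", "authorsetal"):
            surnames = entry.get("authors", [])
            limit = int(arg) if arg else len(surnames)
            # Multi-word surnames are joined so the key contains no spaces.
            parts = ["".join(_case(w, capitalise) for w in _words(s))
                     for s in surnames[:limit]]
            if token == "authorsetal" and len(surnames) > limit:
                parts.append("EtAl" if capitalise else "etal")
            return "".join(parts)
        if token == "word":
            for w in _words(str(entry.get(arg, ""))):
                if w.lower() not in STOP_WORDS:
                    return _case(w, capitalise)
            return ""
        # Any other token is treated as a plain field lookup (year, month, ...).
        return str(entry.get(token, ""))
    return TOKEN.sub(expand, template)


# Hypothetical entry, made up to reproduce the two example keys above.
ENTRY = {
    "authors": ["West", "Giordano", "Van Kleek"],
    "year": 2016,
    "month": "jan",
    "title": "The quantified patient",
    "journal": "Computing systems",
}

print(render_key("{authors(1)}{year}{word(title)}", ENTRY))
print(render_key("{AuthorsEtAl(2)}-{year}-{month}-{Word(journal)}", ENTRY))
```

With this sample entry the two templates yield west2016quantified and WestGiordanoEtAl-2016-jan-Computing.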

Would be great to get your thoughts - think this would cover your use cases?

FlamingTempura • Mar 17 '22 16:03

I personally would be fine with {AuthorsEtAl(2)}{year}, but your template system will be very useful to other users!

An idea would be to use one of the existing template formats for citekey autogeneration, for portability/standardization? BibDesk has one, and JabRef also has one.

chrisgrieser • Mar 17 '22 16:03

Thanks - wasn't aware of the existing template formats so will check them out

FlamingTempura • Mar 17 '22 17:03

Hi @FlamingTempura!

Would be great to get your thoughts - think this would cover your use cases?

This would definitely cover all my use cases.

@chrisgrieser I agree it would be good to have customisable keys. My current proposal is to allow entering a template string such as:

Please don't wait until all tokens are implemented to publish a new version of the tool. I think that adding the new features incrementally would greatly benefit the tool's users.

I'd like to improve it to remove articles, but I'm conscious that this should also work for non-English titles.

Considering that more than 90% of scientific publications are in English (a quick Google search turned up references [1] and [2]), I think that removing English articles is already enough for most users. Trying to cover many languages could dilute your focus and sap your motivation for something that is probably not essential.

Once again, congratulations on the tool! =)

References:
[1] https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5226904/
[2] https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0238372

pedropaulofb • Mar 21 '22 16:03

@FlamingTempura Thanks for your excellent tool! I have been using it for compiling my dissertation.

Personally, I would prefer the citekey generation use letters after the year. For example, instead of having the following:

@book{sweig1942the,
	title        = {The impossible book},
	author       = {Stefa{n} Sweig},
	year         = 1942,
	month        = mar,
	publisher    = {Dead Poet Society}
}
@article{steward2003cooking,
	title        = {Cooking behind bars},
	author       = {Martha Steward},
	year         = 2003,
	publisher    = {Culinary Expert Series}
}
@book{sweig1942the2,
	title        = {The impossible book},
	author       = {Stefan Sweig},
	year         = 1942,
	month        = mar,
	publisher    = {Dead Poet Society}
}

I would prefer to have:

@book{sweig1942a,
	title        = {The impossible book},
	author       = {Stefa{n} Sweig},
	year         = 1942,
	month        = mar,
	publisher    = {Dead Poet Society}
}
@article{steward2003,
	title        = {Cooking behind bars},
	author       = {Martha Steward},
	year         = 2003,
	publisher    = {Culinary Expert Series}
}
@book{sweig1942b,
	title        = {The impossible book},
	author       = {Stefan Sweig},
	year         = 1942,
	month        = mar,
	publisher    = {Dead Poet Society}
}

Using a letter to distinguish citations that share the same author and year is common in various citation/bibliography styles. This would eliminate the worry about using title words. There would just have to be a decision about how to create an ordering. There could be some kind of rule hierarchy to create this ordering. For example, entries with a month field would be ordered first based on their month; entries without a month would be ordered by sorting the other authors' last names; if these were the same, it would go to the next criterion, etc. This would yield a numeric ordering which could be labeled by letters:

a: 1
b: 2
...
z: 26
aa: 27

For me, the rules for ordering don't much matter if I'm going to autogen the keys every time I make a change to the file. I think the letter ordering after the year provides a clean and concise list of keys. And I can just search and replace the old keys with the new keys in my tex file if they change.
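The number-to-letter mapping above (a: 1, z: 26, aa: 27) is a bijective base-26 numbering. A minimal Python sketch follows; the function names are hypothetical, and file order stands in for the ordering rules, which is an assumption rather than something settled in this thread.

```python
from collections import Counter, defaultdict


def letter_suffix(n: int) -> str:
    """Map 1 -> a, 26 -> z, 27 -> aa, ... (bijective base 26)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    letters = ""
    while n > 0:
        n, remainder = divmod(n - 1, 26)
        letters = chr(ord("a") + remainder) + letters
    return letters


def assign_keys(base_keys):
    """Append a letter only to author-year keys that occur more than once.

    Assumption: entries are lettered in the order they are given.
    """
    totals = Counter(base_keys)
    seen = defaultdict(int)
    keys = []
    for base in base_keys:
        if totals[base] == 1:
            keys.append(base)  # unique author-year: no suffix needed
        else:
            seen[base] += 1
            keys.append(base + letter_suffix(seen[base]))
    return keys


print(assign_keys(["sweig1942", "steward2003", "sweig1942"]))
# prints: ['sweig1942a', 'steward2003', 'sweig1942b']
```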

EvanEzell • Mar 22 '22 19:03

Thanks for the nice tool! I'd like to add one more request here: the feature sometimes generates keys with spaces in them:

@inproceedings{van der lee2019best,
	title        = {Best practices for the human evaluation of automatically generated text},
	author       = {van der Lee, Chris and Gatt, Albert and van Miltenburg, Emiel and Wubben, Sander and Krahmer, Emiel},
	year         = 2019,
	booktitle    = {Proc.\ of INLG},
	url          = {https://www.aclweb.org/anthology/W19-8643/},
	_pages       = {355--368},
}

which can cause issues in applications such as Overleaf.

I would have preferred van-der-lee2019best or vanderlee2019best
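Either form could be produced by normalising the surname before it goes into the key. The Python sketch below is only an illustration with a hypothetical helper name; it keeps ASCII letters and digits only, which is an assumption and would drop accented characters.

```python
import re


def sanitize_key_part(text: str, separator: str = "") -> str:
    """Lowercase the text and join its alphanumeric runs with the separator.

    Assumption: only ASCII letters and digits are kept.
    """
    words = re.findall(r"[A-Za-z0-9]+", text.lower())
    return separator.join(words)


print(sanitize_key_part("van der Lee") + "2019best")       # vanderlee2019best
print(sanitize_key_part("van der Lee", "-") + "2019best")  # van-der-lee2019best
```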

danyaljj • Jun 03 '22 23:06

Citation keys can now be customised using a template language based on JabRef. Documentation is here.

You can find the option in the web UI under Clean Up.

I'd appreciate any feedback on this. I've not been able to test this as thoroughly as I'd like (e.g. comparing to actual JabRef output) so I've marked the feature as experimental.

All the above cases should now be possible:

Please consider improving the Generate BibTeX Keys functionality. Currently, it takes the first word of the work's title and places it after the year, following the pattern LastnameYearFirstword, even when that first word is an article (a, an, the). Skipping articles and using the first word after them would produce a better BibTeX key. E.g., for the work "A literature review of the economics of COVID‐19" (Brodeur et al., 2021), the better BibTeX key would be brodeur2021literature (as Google Scholar does) and not brodeur2021a (as the current BibTeX Tidy version does).

Function words are now omitted. See list of function words here.

More generally, I think having some more customization options on how the citekey is created would be great. (casing, inclusion of second author/et als, etc.)

Casing can be controlled using modifiers. The template can also be configured to include et al.

Personally, I would prefer the citekey generation use letters after the year. For example, instead of having the following:

The default template will continue to use numbers, but [duplicateLetter] can be used to output a letter instead.

The feature sometimes generates keys with spaces in them:

This has also been fixed; spaces will no longer be output.

FlamingTempura • Nov 18 '22 16:11

Thank you so much, @FlamingTempura !!! Great job! By the way, you should add a "buy me a coffee" button on the website! You deserve it!

pedropaulofb • Nov 19 '22 10:11