Implement RefChecker in JabKit
⚠ This is a bigger "first issue". Only take it if you have enough time for it and you follow-up to review comments. ⚠
Context
There are more and more fake references. JabRef has the infrastructure to check for them, but it needs to be wired together.
A whole bib file should be checked.
There is a Python script, "RefChecker", that does this, but we want it integrated into JabRef.
Related Work
See https://github.com/markrussinovich/refchecker
(LinkedIn-Post: https://www.linkedin.com/posts/markrussinovich_github-markrussinovichrefchecker-a-tool-activity-7355654490076696576-WycH?utm_source=share&utm_medium=member_desktop&rcm=ACoAAACCUVQBYmlu_A9exTiDRiuXB95v-LNYD4c)
1. Implement logic
Goal
For each BibEntry, fetch authoritative metadata via its own identifiers and compare local vs fetched to classify into groups.
Groups to ensure (create if missing)
refcheck
├─ real paper
├─ unsure
└─ fake paper
Implementation can mirror org.jabref.gui.groups.GroupTreeViewModel#addSuggestedGroups.
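The "create if missing" step can be sketched as follows. This is a minimal illustration only: GroupNodeSketch, ensureChild, and ensureRefCheckGroups are hypothetical names, not JabRef's group API, and the real implementation would work on JabRef's group tree as GroupTreeViewModel#addSuggestedGroups does.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a group tree node; not a JabRef class. */
final class GroupNodeSketch {
    final String name;
    final List<GroupNodeSketch> children = new ArrayList<>();

    GroupNodeSketch(String name) {
        this.name = name;
    }

    /** Returns the child with the given name, creating it only if it is missing. */
    GroupNodeSketch ensureChild(String childName) {
        for (GroupNodeSketch child : children) {
            if (child.name.equals(childName)) {
                return child;
            }
        }
        GroupNodeSketch created = new GroupNodeSketch(childName);
        children.add(created);
        return created;
    }

    /** Ensures the refcheck group and its three subgroups exist below the given root. */
    static GroupNodeSketch ensureRefCheckGroups(GroupNodeSketch root) {
        GroupNodeSketch refcheck = root.ensureChild("refcheck");
        refcheck.ensureChild("real paper");
        refcheck.ensureChild("unsure");
        refcheck.ensureChild("fake paper");
        return refcheck;
    }
}
```

Calling ensureRefCheckGroups twice on the same root should leave exactly one refcheck node with three children, so a repeated check run does not duplicate groups.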
Algorithm (per BibEntry)
- Convert text to BibEntry
  - In Preferences > Web Search, there is "Default plain citation parser" configured. This one should be used.
  - Use org.jabref.logic.importer.plaincitation.SeveralPlainCitationParser to turn a text into a List of BibEntries
- Resolve by DOI (preferred)
  - If StandardField.DOI present: fetch authoritativeEntry via org.jabref.logic.importer.fetcher.DoiFetcher#performSearchById(doi).
  - Else try to find a DOI via org.jabref.logic.importer.fetcher.CrossRef#findIdentifier(entry); if found, fetch via DoiFetcher and store as authoritativeEntry
- Fallback: resolve by arXiv (authoritativeEntry still null)
  - If arXiv ID present or found via org.jabref.logic.importer.fetcher.ArXivFetcher#findIdentifier(entry), fetch its metadata and store in authoritativeEntry
- Compare: local vs authoritativeEntry
  - Use org.jabref.logic.database.DuplicateCheck#isDuplicate to determine if local is a duplicate of authoritativeEntry
  - If yes: Add to group "real paper". If not: Add to group "fake paper"
  - return
  - Now: authoritativeEntry is null
- Search paper using fetcher
  - Look up paper using org.jabref.logic.importer.fetcher.CompositeSearchBasedFetcher.
  - If something found: check if any entry is a duplicate of local. If yes: Add to group "real paper". If not: Add to group "fake paper"
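The control flow of the steps above can be sketched in plain Java. This is only an illustration under stated assumptions: RefCheckSketch and classify are hypothetical names, the entry type is left generic instead of BibEntry, and the resolvers, search, and isDuplicate parameters are stand-ins for DoiFetcher/CrossRef, ArXivFetcher, CompositeSearchBasedFetcher, and DuplicateCheck#isDuplicate. Treating an empty search result as "fake paper" is also an assumption; the description only covers the case where something is found.

```java
import java.util.List;
import java.util.Optional;
import java.util.function.BiPredicate;
import java.util.function.Function;

/** Hypothetical sketch of the per-entry decision flow; not JabRef code. */
final class RefCheckSketch {
    enum Verdict { REAL_PAPER, FAKE_PAPER }

    /**
     * @param resolvers   ordered identifier lookups (DOI first, then arXiv); each yields the authoritative entry if found
     * @param search      fallback search, consulted only when no resolver produced an entry
     * @param isDuplicate stand-in for DuplicateCheck#isDuplicate
     */
    static <E> Verdict classify(E local,
                                List<Function<E, Optional<E>>> resolvers,
                                Function<E, List<E>> search,
                                BiPredicate<E, E> isDuplicate) {
        // Steps "Resolve by DOI" and "Fallback: resolve by arXiv": the first hit decides.
        for (Function<E, Optional<E>> resolver : resolvers) {
            Optional<E> authoritative = resolver.apply(local);
            if (authoritative.isPresent()) {
                return isDuplicate.test(local, authoritative.get())
                        ? Verdict.REAL_PAPER
                        : Verdict.FAKE_PAPER;
            }
        }
        // Step "Search paper using fetcher": authoritative entry is still missing.
        // Assumption: no candidates at all is treated like "no duplicate found".
        boolean anyMatch = search.apply(local).stream()
                                 .anyMatch(candidate -> isDuplicate.test(local, candidate));
        return anyMatch ? Verdict.REAL_PAPER : Verdict.FAKE_PAPER;
    }
}
```

Passing the lookups as functions keeps the decision logic testable without network access, which fits the TDD approach requested for this task.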
The current proposal does not make use of the group "unsure". Maybe, the DuplicateCheck class needs to be adapted accordingly.
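One possible way to bring the "unsure" group into play, sketched under assumptions: if the comparison produced a similarity score between 0.0 and 1.0 instead of a boolean, two cut-offs would split entries into the three groups. VerdictThresholds, groupFor, and both threshold values are hypothetical and would need tuning against real test cases.

```java
/** Hypothetical mapping from a similarity score to the three refcheck groups; not JabRef code. */
final class VerdictThresholds {
    enum Group { REAL_PAPER, UNSURE, FAKE_PAPER }

    // Assumed cut-offs for illustration only.
    static final double REAL_AT_LEAST = 0.9;
    static final double FAKE_BELOW = 0.5;

    static Group groupFor(double similarity) {
        if (Double.isNaN(similarity) || similarity < 0.0 || similarity > 1.0) {
            throw new IllegalArgumentException("similarity must be within [0.0, 1.0]");
        }
        if (similarity >= REAL_AT_LEAST) {
            return Group.REAL_PAPER;
        }
        if (similarity < FAKE_BELOW) {
            return Group.FAKE_PAPER;
        }
        // Everything between the two cut-offs needs a human look.
        return Group.UNSURE;
    }
}
```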
2. Add test
For 1, tests need to be crafted. Think of TDD - and add tests before/while coding
3. Wire into CLI
A. Include refcheck --online/--offline <file.bib> in org.jabref.cli.ArgumentProcessor
B. Include refcheck --online/--offline <file.pdf> in org.jabref.cli.ArgumentProcessor
Note that --online and --offline are optional. If not given, the default plain citation parser is used.
For B
Import references from the PDF into a .bib file using "New library based on references". Users can pass --online or --offline; --online is the default if AI is available. It is an error if --online is given and no AI is available.
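The flag handling for case B could look roughly like this. It is a sketch, not the ArgumentProcessor integration: RefCheckModeSketch and resolve are hypothetical names, and two behaviours are assumptions not stated in the description, namely that giving both flags is rejected and that the default falls back to offline when no AI is available.

```java
/** Hypothetical resolution of the --online/--offline flags for PDF input; not JabRef code. */
final class RefCheckModeSketch {
    enum Mode { ONLINE, OFFLINE }

    static Mode resolve(boolean onlineFlag, boolean offlineFlag, boolean aiAvailable) {
        if (onlineFlag && offlineFlag) {
            // Assumption: the two flags are mutually exclusive.
            throw new IllegalArgumentException("--online and --offline cannot be combined");
        }
        if (onlineFlag) {
            if (!aiAvailable) {
                // From the description: error if --online and no AI available.
                throw new IllegalStateException("--online requested but no AI is available");
            }
            return Mode.ONLINE;
        }
        if (offlineFlag) {
            return Mode.OFFLINE;
        }
        // No flag given: online is the default if AI is available.
        // Assumption: otherwise fall back to offline instead of failing.
        return aiAvailable ? Mode.ONLINE : Mode.OFFLINE;
    }
}
```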
4. Wire into GUI
Create "Tools" > "Ref Checker"
Content:
Tab "Citations" and Tab "PDF File"
Tab Citations: Text field with citations
Tab "PDF File": Filename with "Browse" button
At the end of each tab: "Check". Then the functionality is called. On success, a new library is created in JabRef.
Code hints
Similar comparisons are done at
- org.jabref.gui.mergeentries.newmergedialog.FieldRowViewModel#autoSelectBetterValue // most similar approach
- org.jabref.logic.database.DuplicateCheck#isDuplicate / org.jabref.logic.database.DuplicateCheck#compareFieldSet // but cannot be used as we really want to rely on a "high quality" BibEntry
I am interested in this issue. Can you please assign it to me?
👋 Hey @HK-HARSH001, looks like you’re eager to work on this issue—great! 🎉 It also looks like you skipped reading our CONTRIBUTING.md, which explains exactly how to participate. No worries, it happens to the best of us. Give it a read, and you’ll discover the ancient wisdom of assigning issues to yourself. Trust me, it’s worth it. 🚀
/assign-me
👋 Hey @HK-HARSH001, thank you for your interest in this issue! 🎉
We're excited to have you on board. Start by exploring our Contributing guidelines, and don't forget to check out our workspace setup guidelines to get started smoothly.
For questions on JabRef functionality and the code base, you can consult the JabRef Guru or ask on our Gitter chat.
In case you encounter failing tests during development, please check our developer FAQs!
Having any questions or issues? Feel free to ask here on GitHub. Need help setting up your local workspace? Join the conversation on JabRef's Gitter chat. And don't hesitate to open a (draft) pull request early on to show the direction it is heading towards. This way, you will receive valuable feedback.
Happy coding! 🚀
⏰ Assignment Reminder
Hi @HK-HARSH001, this is a friendly reminder about your assignment to this issue.
[!WARNING] This issue will be automatically unassigned in 11 days if there's no activity.
Remember that you can ask the JabRef Guru or DeepWiki about anything regarding JabRef. Additionally, our contributing guide has hints on creating a pull request and a link to our Gitter chat.
How to keep your assignment
If you are working on it, you can prevent automatic unassignment by:
- Submitting a draft pull request with your progress within 11 days
- Asking for the 📌 Pinned label if you need more time
We appreciate your contribution and are here to help if needed!
Unassigned due to inactivity
Hey I'm interested in taking this up, mind if I work on this issue?
Yes. Please go through our contributing guidelines and see the process of assignment.
/assign-me
👋 Hey @AniketTelsinge-TomTom, thank you for your interest in this issue! 🎉
⏰ Assignment Reminder
Hi @AniketTelsinge-TomTom, this is a friendly reminder about your assignment to this issue.
📋 Assignment Update
Hi @AniketTelsinge-TomTom, due to inactivity, you have been unassigned from this issue.
Next steps
If you still want to work on this:
- Submit a pull request showing your current state. You will be automatically assigned again.
- Ask a maintainer to assign you again.
I am interested in undertaking this issue. I just have some questions
- Just making sure, the GUI integration is not part of this issue, correct? This only influences the logic and jabkit module.
- About the core functionality, RefCheck implements a bunch of features utilizing LLMs, I assume that this is outside our current scope. What I have understood from the current specification is as follows:
  - For each citation try to find, w/o using the DOI, the relevant paper, check (if existing) the current DOI with the found DOI.
  - Check the consistency across all other fields ensuring they are the same.
  - Finally aggregate each discrepancy and display that to the user.
  - Offline/Online mode changes where the information about the correct papers is taken from (online vs local databases)
Is this correct? Are there any additional functionalities I should keep in mind?
Finally, should I implement this for DOI only or add other identifiers as well (e.g. ISBN)?
Pinging @koppor for visibility.
I am interested in undertaking this issue. I just have some questions
OK.
It is hard to follow your questions, because I thought, I outlined it OK enough in the issue description at https://github.com/JabRef/jabref/issues/13604#issuecomment-3428957932.
I went through it and refined it.
* Just making sure, the GUI integration is not part of this issue, correct? This only influences the logic and jabkit module.
I updated the issue description and added the GUI task. This should be only 60 minutes more work.
* About the core functionality, RefCheck implements a bunch of features utilizing LLMs, I assume that this is outside our current scope.
Yes. We want to rely on well-used algorithms for calculating distances.
* For each citation try to find, w/o using the DOI, the relevant paper, check (if existing) the current DOI with the found DOI.
I refined the algorithm description.
The core idea of the issue is to wire together existing (!) JabRef functionality. JabRef has it all: converting plain text to a rich data model, searching the web for a reference, and checking whether two bibliographic entries are the same. The task here is "only" to wire these things together.
Please go through the issue description again and try to read the JabRef code base.
Finally, should I implement this for DOI only or add other identifiers as well (e.g. ISBN)?
DOI and arXiv as first approach. I think, it can be extended for ISBN easily. -- However, in this setting, one important thing is testing! You need to create test cases. Find them in the real world - maybe at LinkedIn posts - or generate some using ChatGPT...
Thank you for your clarifications, I understand it better now.
The updated issue description provides extra clarity.
/assign-me
⚠️ Assignment Limit Reached for Beginner Issues
Hi @ravatex, you've already reached the limit of 2 active assignments across the labels: ``.
These labels are meant for new contributors to get started, so we encourage you to now explore more advanced issues.
This helps new contributors take their first steps while you continue to grow with more challenging tasks 💪
What you can do next:
- ✅ Finish one of your current beginner-labeled issues, then come back for another.
- 🔄 Use /unassign-me on an issue you're no longer working on.
- 🚀 Ask a maintainer for suggestions on "next-level" issues (e.g. good second issue or beyond).
- 🙋 Request an exception from a maintainer if there’s a special case.
[!TIP] Moving to higher-level issues helps you deepen your skills and contribute more impactfully to the project.
Thanks for your great work and for helping keep the community open to newcomers! 💜
@koppor
I have made progress on the issue, but through testing I have found a maybe undesirable quirk. org.jabref.logic.database.DuplicateCheck#isDuplicate returns true early if the identifier is the same. This creates a problem as references with the same identifier but different values (e.g. wrong name) show that they are real citations.
Should I create a new function in the DuplicateCheck class?
Good idea - Maybe public static double compareEntries(BibEntry one, BibEntry two)
Refactor isDuplicate to make use of this method.
The returned double of compareEntries can be used to sort in the respective group.
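A rough sketch of that idea, under clearly stated assumptions: entries are modelled as Map<String, String> instead of BibEntry, the score is simply the fraction of shared fields whose normalized values match, and the 0.8 threshold is made up. EntryComparisonSketch is a hypothetical class; the real compareEntries would more likely build on the distance calculations already in DuplicateCheck.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Hypothetical score-based comparison; Map<String, String> stands in for BibEntry. */
final class EntryComparisonSketch {
    // Assumed threshold for illustration only.
    static final double DUPLICATE_THRESHOLD = 0.8;

    /** Returns the fraction (0.0 to 1.0) of fields present in both entries whose normalized values match. */
    static double compareEntries(Map<String, String> one, Map<String, String> two) {
        Set<String> shared = new HashSet<>(one.keySet());
        shared.retainAll(two.keySet());
        if (shared.isEmpty()) {
            return 0.0;
        }
        long matching = shared.stream()
                              .filter(field -> normalize(one.get(field)).equals(normalize(two.get(field))))
                              .count();
        return (double) matching / shared.size();
    }

    /** isDuplicate expressed in terms of the score, so an equal identifier alone no longer decides. */
    static boolean isDuplicate(Map<String, String> one, Map<String, String> two) {
        return compareEntries(one, two) >= DUPLICATE_THRESHOLD;
    }

    private static String normalize(String value) {
        return value.trim().replaceAll("\\s+", " ").toLowerCase(Locale.ROOT);
    }
}
```

With this shape, two entries that share a DOI but differ in the author field score below 1.0 instead of short-circuiting to "duplicate", which addresses the quirk reported above. The same score can then be used to sort entries into the groups.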