Implement RefChecker in JabKit

Open koppor opened this issue 4 months ago • 20 comments

⚠ This is a bigger "first issue". Only take it if you have enough time for it and you follow-up to review comments. ⚠

Context

Fake references are becoming more and more common. JabRef has the infrastructure to check for them, but the pieces need to be wired together.

A whole bib file should be checked.

There is a Python script, "RefChecker", that does this, but we want the check integrated into JabRef.

Related Work

See https://github.com/markrussinovich/refchecker

(LinkedIn-Post: https://www.linkedin.com/posts/markrussinovich_github-markrussinovichrefchecker-a-tool-activity-7355654490076696576-WycH?utm_source=share&utm_medium=member_desktop&rcm=ACoAAACCUVQBYmlu_A9exTiDRiuXB95v-LNYD4c)

1. Implement logic

Goal

For each BibEntry, fetch authoritative metadata via its own identifiers and compare local vs fetched to classify into groups.

Groups to ensure (create if missing)

refcheck
 ├─ real paper
 ├─ unsure
 └─ fake paper

Implementation can mirror org.jabref.gui.groups.GroupTreeViewModel#addSuggestedGroups.
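
The "create if missing" step can be sketched as below. This is only an illustration in plain Java: GroupNode and ensureRefCheckGroups are hypothetical stand-ins, not JabRef's group classes, and the real implementation would follow GroupTreeViewModel#addSuggestedGroups instead.

```java
import java.util.ArrayList;
import java.util.List;

public class RefCheckGroups {
    /** Hypothetical stand-in for a node in the group tree (not a JabRef class). */
    public static final class GroupNode {
        public final String name;
        public final List<GroupNode> children = new ArrayList<>();

        public GroupNode(String name) {
            this.name = name;
        }

        /** Returns the direct child with this name, creating it only if it is missing. */
        public GroupNode ensureChild(String childName) {
            for (GroupNode child : children) {
                if (child.name.equals(childName)) {
                    return child;
                }
            }
            GroupNode created = new GroupNode(childName);
            children.add(created);
            return created;
        }
    }

    /** Ensures "refcheck" and its three subgroups exist under root; calling it again adds nothing. */
    public static GroupNode ensureRefCheckGroups(GroupNode root) {
        GroupNode refcheck = root.ensureChild("refcheck");
        refcheck.ensureChild("real paper");
        refcheck.ensureChild("unsure");
        refcheck.ensureChild("fake paper");
        return refcheck;
    }
}
```

Because ensureChild looks up by name first, running the check twice on the same library would not duplicate the groups.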

Algorithm (per BibEntry)

  1. Convert text to BibEntry

    • In Preferences > Web Search, a "Default plain citation parser" is configured. This one should be used.
    • Use org.jabref.logic.importer.plaincitation.SeveralPlainCitationParser to turn a text into a List of BibEntries
  2. Resolve by DOI (preferred)

    • If StandardField.DOI present: fetch authoritativeEntry via org.jabref.logic.importer.fetcher.DoiFetcher#performSearchById(doi).
    • Else try to find a DOI via org.jabref.logic.importer.fetcher.CrossRef#findIdentifier(entry); if found, fetch via DoiFetcher and store as authoritativeEntry
  3. Fallback: resolve by arXiv (authoritativeEntry still null)

    • If arXiv ID present or found via org.jabref.logic.importer.fetcher.ArXivFetcher#findIdentifier(entry), fetch its metadata and store it in authoritativeEntry
  4. Compare: local vs authoritativeEntry

    Use org.jabref.logic.database.DuplicateCheck#isDuplicate to determine if local is a duplicate of authoritativeEntry

    If yes: Add to group real paper. If not: Add to group fake paper

    return
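
The control flow of steps 2 to 4 can be sketched as follows. This is a minimal sketch in plain Java, not JabRef code: entries are modelled as Map<String, String>, and Resolver, Matcher, and Verdict are hypothetical names. In JabRef the resolvers would wrap DoiFetcher, CrossRef, and ArXivFetcher, and the matcher would wrap DuplicateCheck#isDuplicate.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;

public class RefCheckPipeline {
    public enum Verdict { REAL, FAKE, UNRESOLVED }

    /** Hypothetical: fetches an authoritative entry for a local entry (by DOI, by arXiv ID, ...). */
    public interface Resolver {
        Optional<Map<String, String>> resolve(Map<String, String> local);
    }

    /** Hypothetical: decides whether two entries describe the same work. */
    public interface Matcher {
        boolean sameWork(Map<String, String> local, Map<String, String> authoritative);
    }

    /**
     * Tries the resolvers in order of preference and compares the local entry
     * with the first authoritative entry that could be fetched.
     */
    public static Verdict check(Map<String, String> local, List<Resolver> resolvers, Matcher matcher) {
        for (Resolver resolver : resolvers) {
            Optional<Map<String, String>> authoritative = resolver.resolve(local);
            if (authoritative.isPresent()) {
                return matcher.sameWork(local, authoritative.get()) ? Verdict.REAL : Verdict.FAKE;
            }
        }
        // authoritativeEntry is still null: the caller continues with the search-based lookup
        return Verdict.UNRESOLVED;
    }
}
```

Passing the resolvers as an ordered list keeps the DOI-before-arXiv preference in one place and would make it easy to append further identifiers later.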


Now: authoritativeEntry is null

  1. Search paper using fetcher

    Look up paper using org.jabref.logic.importer.fetcher.CompositeSearchBasedFetcher.

    If something is found: check whether any returned entry is a duplicate of local. If yes: Add to group real paper. If not: Add to group fake paper
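
A sketch of this fallback in plain Java, with entries modelled as Map<String, String>. SearchFallback and Group are hypothetical names; the search results would come from CompositeSearchBasedFetcher and the predicate from DuplicateCheck. The description does not say what happens when the search returns nothing, so mapping that case to "unsure" is purely an assumption of this sketch.

```java
import java.util.List;
import java.util.Map;
import java.util.function.BiPredicate;

public class SearchFallback {
    public enum Group { REAL_PAPER, UNSURE, FAKE_PAPER }

    /** Classifies a local entry against the candidates returned by a search-based fetcher. */
    public static Group classify(
            Map<String, String> local,
            List<Map<String, String>> searchResults,
            BiPredicate<Map<String, String>, Map<String, String>> isDuplicate) {
        if (searchResults.isEmpty()) {
            // Assumption: nothing found is treated as inconclusive rather than as fake
            return Group.UNSURE;
        }
        boolean anyMatch = searchResults.stream()
                .anyMatch(candidate -> isDuplicate.test(local, candidate));
        return anyMatch ? Group.REAL_PAPER : Group.FAKE_PAPER;
    }
}
```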


The current proposal does not make use of the group "unsure". Maybe, the DuplicateCheck class needs to be adapted accordingly.

2. Add test

Tests need to be crafted for step 1. Think of TDD, and add tests before or while coding.

3. Wire into CLI

A. Include refcheck --online/--offline <file.bib> in org.jabref.cli.ArgumentProcessor

B. Include refcheck --online/--offline <file.pdf> in org.jabref.cli.ArgumentProcessor

Note that --online and --offline are optional. If not given, the default plain citation parser is used.

For B

Import references from PDF into .bib using "New library based on references". Users can pass --online or --offline (with --online being the default if AI is available; it is an error if --online is given and no AI is available).
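
The flag semantics for A and B can be sketched in plain Java as below. RefCheckArgs, Mode, InputKind, and the aiAvailable parameter are hypothetical, and the sketch deliberately avoids any CLI library; the real command would be declared inside org.jabref.cli.ArgumentProcessor with whatever argument handling JabKit already uses. When no flag is given the sketch reports DEFAULT_PARSER and leaves it to the caller whether that resolves to online for PDFs.

```java
import java.util.List;
import java.util.Locale;

public class RefCheckArgs {
    public enum Mode { ONLINE, OFFLINE, DEFAULT_PARSER }
    public enum InputKind { BIB, PDF }

    public record Parsed(Mode mode, InputKind kind, String file) { }

    /** Parses "[--online|--offline] <file.bib|file.pdf>"; aiAvailable is a hypothetical capability flag. */
    public static Parsed parse(List<String> args, boolean aiAvailable) {
        Mode mode = Mode.DEFAULT_PARSER;
        String file = null;
        for (String arg : args) {
            if (arg.equals("--online") || arg.equals("--offline")) {
                if (mode != Mode.DEFAULT_PARSER) {
                    throw new IllegalArgumentException("Give at most one of --online and --offline");
                }
                mode = arg.equals("--online") ? Mode.ONLINE : Mode.OFFLINE;
            } else if (file == null) {
                file = arg;
            } else {
                throw new IllegalArgumentException("Expected exactly one input file");
            }
        }
        if (file == null) {
            throw new IllegalArgumentException("Missing input file");
        }
        if (mode == Mode.ONLINE && !aiAvailable) {
            // Explicit --online without AI is an error, as stated for case B
            throw new IllegalStateException("--online requires AI to be available");
        }
        String lower = file.toLowerCase(Locale.ROOT);
        InputKind kind;
        if (lower.endsWith(".bib")) {
            kind = InputKind.BIB;
        } else if (lower.endsWith(".pdf")) {
            kind = InputKind.PDF;
        } else {
            throw new IllegalArgumentException("Expected a .bib or .pdf file: " + file);
        }
        return new Parsed(mode, kind, file);
    }
}
```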

4. Wire into GUI

Create "Tools" > "Ref Checker"

Content:

Tab "Citations" and Tab "PDF File"

Tab Citations: Text field with citations

Tab "PDF File": Filename with "Browse" button

At the end of each tab: "Check". Then the functionality is called. On success, a new library is created in JabRef.

Code hints

Similar comparisons are done at

  • org.jabref.gui.mergeentries.newmergedialog.FieldRowViewModel#autoSelectBetterValue // most similar approach
  • org.jabref.logic.database.DuplicateCheck#isDuplicate / org.jabref.logic.database.DuplicateCheck#compareFieldSet // but cannot be used as we really want to rely on a "high quality" BibEntry

koppor avatar Jul 28 '25 23:07 koppor

I am interested in it. Can you please assign it to me?

HK-HARSH001 avatar Jul 29 '25 01:07 HK-HARSH001

👋 Hey @HK-HARSH001, looks like you’re eager to work on this issue—great! 🎉 It also looks like you skipped reading our CONTRIBUTING.md, which explains exactly how to participate. No worries, it happens to the best of us. Give it a read, and you’ll discover the ancient wisdom of assigning issues to yourself. Trust me, it’s worth it. 🚀

github-actions[bot] avatar Jul 29 '25 01:07 github-actions[bot]

/assign-me

HK-HARSH001 avatar Jul 29 '25 01:07 HK-HARSH001

👋 Hey @HK-HARSH001, thank you for your interest in this issue! 🎉

We're excited to have you on board. Start by exploring our Contributing guidelines, and don't forget to check out our workspace setup guidelines to get started smoothly.

For questions on JabRef functionality and the code base, you can consult the JabRef Guru or ask on our Gitter chat.

In case you encounter failing tests during development, please check our developer FAQs!

Having any questions or issues? Feel free to ask here on GitHub. Need help setting up your local workspace? Join the conversation on JabRef's Gitter chat. And don't hesitate to open a (draft) pull request early on to show the direction it is heading towards. This way, you will receive valuable feedback.

Happy coding! 🚀

github-actions[bot] avatar Jul 29 '25 01:07 github-actions[bot]

⏰ Assignment Reminder

Hi @HK-HARSH001, this is a friendly reminder about your assignment to this issue.

[!WARNING] This issue will be automatically unassigned in 11 days if there's no activity.

Remember that you can ask the JabRef Guru or DeepWiki about anything regarding JabRef. Additionally, our contributing guide has hints on creating a pull request and a link to our Gitter chat.

How to keep your assignment


If you are working on it, you can prevent automatic unassignment by:

  • Submitting a draft pull request with your progress within 11 days
  • Asking for the 📌 Pinned label if you need more time

We appreciate your contribution and are here to help if needed!

github-actions[bot] avatar Aug 14 '25 12:08 github-actions[bot]

Unassigned due to inactivity

subhramit avatar Aug 27 '25 20:08 subhramit

Hey I'm interested in taking this up, mind if I work on this issue?

AniketTelsinge-TomTom avatar Sep 15 '25 07:09 AniketTelsinge-TomTom

Yes. Please go through our contributing guidelines and see the process of assignment.

subhramit avatar Sep 15 '25 07:09 subhramit

/assign-me

AniketTelsinge-TomTom avatar Sep 15 '25 07:09 AniketTelsinge-TomTom

👋 Hey @AniketTelsinge-TomTom, thank you for your interest in this issue! 🎉

We're excited to have you on board. Start by exploring our Contributing guidelines, and don't forget to check out our workspace setup guidelines to get started smoothly.

For questions on JabRef functionality and the code base, you can consult the JabRef Guru or ask on our Gitter chat.

In case you encounter failing tests during development, please check our developer FAQs!

Having any questions or issues? Feel free to ask here on GitHub. Need help setting up your local workspace? Join the conversation on JabRef's Gitter chat. And don't hesitate to open a (draft) pull request early on to show the direction it is heading towards. This way, you will receive valuable feedback.

Happy coding! 🚀

github-actions[bot] avatar Sep 15 '25 07:09 github-actions[bot]

⏰ Assignment Reminder

Hi @AniketTelsinge-TomTom, this is a friendly reminder about your assignment to this issue.

[!WARNING] This issue will be automatically unassigned in 11 days if there's no activity.

Remember that you can ask the JabRef Guru or DeepWiki about anything regarding JabRef. Additionally, our contributing guide has hints on creating a pull request and a link to our Gitter chat.

How to keep your assignment


If you are working on it, you can prevent automatic unassignment by:

  • Submitting a draft pull request with your progress within 11 days
  • Asking for the 📌 Pinned label if you need more time

We appreciate your contribution and are here to help if needed!

github-actions[bot] avatar Sep 24 '25 12:09 github-actions[bot]

📋 Assignment Update

Hi @AniketTelsinge-TomTom, due to inactivity, you have been unassigned from this issue.

Next steps


If you still want to work on this:

  • Submit a pull request showing your current state. You will be automatically assigned again.
  • Ask a maintainer to assign you again.

github-actions[bot] avatar Oct 14 '25 12:10 github-actions[bot]

I am interested in undertaking this issue. I just have some questions

  • Just making sure, the GUI integration is not part of this issue, correct? This only influences the logic and jabkit module.

  • About the core functionality, RefCheck implements a bunch of features utilizing LLMs, I assume that this is outside our current scope. What I have understood from the current specification is as follows:

    • For each citation try to find, w/o using the DOI, the relevant paper, check (if existing) the current DOI with the found DOI.
    • Check the consistency across all other fields ensuring they are the same.
    • Finally aggregate each discrepancy and display that to the user.
    • Offline/Online mode changes where the information about the correct papers is taken from (online vs local databases)

    Is this correct? Are there any additional functionalities I should keep in mind?

Finally, should I implement this for DOI only or add other identifiers as well (e.g. ISBN)?

ravatex avatar Oct 18 '25 20:10 ravatex

Pinging @koppor for visibility.

subhramit avatar Oct 21 '25 19:10 subhramit

I am interested in undertaking this issue. I just have some questions

OK.

It is hard to follow your questions, because I thought I had outlined it well enough in the issue description at https://github.com/JabRef/jabref/issues/13604#issuecomment-3428957932.

I went through it and refined it.

* Just making sure, the GUI integration is not part of this issue, correct? This only influences the logic and jabkit module.

I updated the issue description and added the GUI task. This should be only 60 minutes more work.

* About the core functionality, RefCheck implements a bunch of features utilizing LLMs, I assume that this is outside our current scope. 

Yes. We want to rely on well-used algorithms for calculating distances.

  * For each citation try to find, w/o using the DOI, the relevant paper, check (if existing) the current DOI with the found DOI.

I refined the algorithm description.

The core idea of the issue is to wire together existing (!) JabRef functionality. JabRef has it all: Plain text to a rich data model. Searching the Web for a reference. Checking if two bibliographic entries are the same. The task here is "only" to wire these things together.

Please go through the issue description again and try to read the JabRef code base.

Finally, should I implement this for DOI only or add other identifiers as well (e.g. ISBN)?

DOI and arXiv as a first approach. I think it can be extended to ISBN easily. However, in this setting, one important thing is testing! You need to create test cases. Find them in the real world - maybe in LinkedIn posts - or generate some using ChatGPT...

koppor avatar Oct 23 '25 14:10 koppor

Thank you for your clarifications, I understand it better now.

The updated issue description provides extra clarity.

/assign-me

ravatex avatar Oct 24 '25 16:10 ravatex

⚠️ Assignment Limit Reached for Beginner Issues

Hi @ravatex, you've already reached the limit of 2 active assignments across the labels: ``.

These labels are meant for new contributors to get started, so we encourage you to now explore more advanced issues.
This helps new contributors take their first steps while you continue to grow with more challenging tasks 💪


What you can do next:

  • ✅ Finish one of your current beginner-labeled issues, then come back for another.
  • 🔄 Use /unassign-me on an issue you're no longer working on.
  • 🚀 Ask a maintainer for suggestions on "next-level" issues (e.g. good second issue or beyond).
  • 🙋 Request an exception from a maintainer if there’s a special case.

[!TIP] Moving to higher-level issues helps you deepen your skills and contribute more impactfully to the project.

Thanks for your great work and for helping keep the community open to newcomers! 💜

github-actions[bot] avatar Oct 24 '25 16:10 github-actions[bot]

@koppor I have made progress on the issue, but through testing I have found a possibly undesirable quirk. org.jabref.logic.database.DuplicateCheck#isDuplicate returns true early if the identifier is the same. This creates a problem, as references with the same identifier but different values (e.g. wrong name) are reported as real citations.

Should I create a new function in the DuplicateCheck class?

ravatex avatar Oct 26 '25 21:10 ravatex

Good idea - Maybe public static double compareEntries(BibEntry one, BibEntry two)

Refactor isDuplicate to make use of this method.

The returned double of compareEntries can be used to sort in the respective group.
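
One possible reading of this suggestion, as a self-contained sketch in plain Java. Nothing here is existing JabRef code: entries are modelled as Map<String, String>, the compared fields, the equal weighting, and the thresholds 0.9 and 0.6 are all assumptions, and normalized Levenshtein distance is just one well-known choice of string distance.

```java
import java.util.List;
import java.util.Locale;
import java.util.Map;

public class EntrySimilarity {
    // Assumptions of this sketch: which fields count, and where the group boundaries lie
    private static final List<String> FIELDS = List.of("title", "author", "year");
    private static final double REAL_THRESHOLD = 0.9;
    private static final double UNSURE_THRESHOLD = 0.6;

    /** Classic dynamic-programming Levenshtein edit distance, using two rolling rows. */
    public static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Similarity in [0, 1]; 1.0 means equal after trimming and lower-casing. */
    static double fieldSimilarity(String a, String b) {
        String x = a.trim().toLowerCase(Locale.ROOT);
        String y = b.trim().toLowerCase(Locale.ROOT);
        int longest = Math.max(x.length(), y.length());
        return longest == 0 ? 1.0 : 1.0 - (double) levenshtein(x, y) / longest;
    }

    /** Mean field similarity; identifiers are deliberately ignored, so a shared DOI cannot short-circuit. */
    public static double compareEntries(Map<String, String> one, Map<String, String> two) {
        double sum = 0;
        for (String field : FIELDS) {
            // A field missing on both sides compares as equal (empty vs. empty)
            sum += fieldSimilarity(one.getOrDefault(field, ""), two.getOrDefault(field, ""));
        }
        return sum / FIELDS.size();
    }

    /** isDuplicate expressed in terms of compareEntries, as suggested above. */
    public static boolean isDuplicate(Map<String, String> one, Map<String, String> two) {
        return compareEntries(one, two) >= REAL_THRESHOLD;
    }

    /** Maps a score to one of the three refcheck groups. */
    public static String groupFor(double score) {
        if (score >= REAL_THRESHOLD) {
            return "real paper";
        }
        return score >= UNSURE_THRESHOLD ? "unsure" : "fake paper";
    }
}
```

With a graded score, an entry that shares the DOI of a real paper but carries a wrong author would no longer count as an exact match, and the middle band gives the "unsure" group a use.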

koppor avatar Oct 30 '25 08:10 koppor