Add Integrity Checker for BookTitle

Open TheYorouzoya opened this issue 5 months ago • 14 comments

Closes #12271

This PR adds a new integrity checker for the booktitle field along with the associated cleanup actions.

I'll be doing the development in the following major phases (these will be refined further as I move along):

  • [x] Gathering and defining a clear set of requirements
  • [x] Drafting an implementation approach
  • [x] Adding the core logic/classes required
  • [x] Integrating the feature into the GUI
  • [x] Iterate till feature meets expectations

Steps to test

  1. Integrity Checker marks an improper Booktitle field image

  2. Check Integrity dialogue lists failing check for each embedded field individually image

  3. New booktitle cleanup checkbox and sub-panel added to Cleanup Entries Dialog Box image

  4. Clicking on the "Clean up 'booktitle'..." checkbox will enable the cleanup sub-panel, allowing the user to pick a cleanup action for each individual field found in booktitle image

  5. Post clean up, the fields are moved to their respective field editors image image image

Mandatory checks

  • [x] I own the copyright of the code submitted and I license it under the MIT license
  • [x] I manually tested my changes in running JabRef (always required)
  • [x] I added JUnit tests for changes (if applicable)
  • [x] I added screenshots in the PR description (if change is visible to the user)
  • [x] I described the change in CHANGELOG.md in a way that is understandable for the average user (if change is visible to the user)
  • [/] I checked the user documentation: Is the information available and up to date? If not, I created an issue at https://github.com/JabRef/user-documentation/issues or, even better, I submitted a pull request updating file(s) in https://github.com/JabRef/user-documentation/tree/main/en.

TheYorouzoya avatar Sep 11 '25 12:09 TheYorouzoya

Requirements

1. Integrity Checker should flag a booktitle with year numbers, locations, and page numbers

The booktitle field should be marked similar to when an integrity check fails for other fields (for example, Author)

image

The field should also show up in the Check Integrity dialog box in Quality -> Check Integrity

image

2. Integrity Checker should allow the user to move nested fields to their appropriate place via a cleanup action

If a booktitle is found to contain years, locations, or page numbers, the user should be allowed to perform a cleanup action in the Quality -> Clean Up Entries dialog box, which would move them to their appropriate fields.

image

Since it is possible that the field we're trying to move our extracted data to is already populated, the user should be given a choice whether to perform the move for each piece of data found. This translates to adding three cleanup actions or options to the dialog box (one each for year, page number, and location).

Implementation Approach: Details, Edge Cases, and Questions

I'll take an example from the issue post and break it down as follows:

Input: European Conference on Circuit Theory and Design, {ECCTD} 2015, Trondheim, Norway, August 24-26, 2015

  • Year: 2015, 2015
  • Month: August
  • Page Numbers: 24-26
  • Locations: Trondheim, Norway

Year Numbers

Years can be of the form XXXX with four digits on their own. Since the booktitle field deals with scientific journals, books, and conferences, we can further refine this in the range 16XX to 20XX (the first ever publications are from the 1600s).

Q1. What about multiple years present in a title? Which one do we pick to transfer to the year field? My guess would be the latest one.
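A minimal sketch of the year detection described above, assuming the 16XX to 20XX range and the "pick the latest year" guess from Q1. The class and method names are hypothetical and not part of JabRef.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not an existing JabRef class.
final class BooktitleYearFinder {
    // Four digits in the range 1600..2099 that are not part of a longer digit run.
    private static final Pattern YEAR =
            Pattern.compile("(?<!\\d)(1[6-9]\\d{2}|20\\d{2})(?!\\d)");

    // Returns every year-like token in order of appearance.
    static List<Integer> findYears(String booktitle) {
        List<Integer> years = new ArrayList<>();
        Matcher matcher = YEAR.matcher(booktitle);
        while (matcher.find()) {
            years.add(Integer.parseInt(matcher.group(1)));
        }
        return years;
    }

    // Assumption taken from Q1: when several years occur, use the latest one.
    static Optional<Integer> latestYear(String booktitle) {
        return findYears(booktitle).stream().max(Integer::compare);
    }
}
```

Note that a pattern like this would also treat a page range such as 1600-1700 as two years, so the order in which the individual checks run matters.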

Months

Months are spelled out in text (like January, February, and so on). These can be easily picked up via a regex or a simple string comparison. Once found, they can be moved to the Month field under Optional fields.
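As a rough sketch of the regex variant, assuming only fully spelled-out English month names need to be recognized. The class name is hypothetical, and mapping the result to whatever value the Month field expects is left out.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not an existing JabRef class.
final class BooktitleMonthFinder {
    // Case-sensitive on purpose: "May" and "March" are also ordinary words,
    // so requiring the capitalized form avoids some false positives.
    private static final Pattern MONTH = Pattern.compile(
            "\\b(January|February|March|April|May|June|July|August"
                    + "|September|October|November|December)\\b");

    // Returns the first spelled-out month name found, if any.
    static Optional<String> findMonth(String booktitle) {
        Matcher matcher = MONTH.matcher(booktitle);
        return matcher.find() ? Optional.of(matcher.group(1)) : Optional.empty();
    }
}
```

Abbreviated forms such as "Aug." are not covered by this sketch.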

Page Numbers

These will typically be of the form <number>-<number> (like 22-45) or <number>--<number> (like 22--45). There aren't any edge cases to discuss here, other than whether we want to support more formats.

[!NOTE] I'm assuming that the 24-26 in the example is referring to a page range and not a date. If that is not the case, then I'd like some examples of page numbers in a booktitle for reference.
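A small sketch of the two page-range forms, under the same assumption as the note above (a range like 24-26 is treated as pages, not as days of a month). The class name is hypothetical, and normalizing the result to a double hyphen is an assumption of this sketch, not a stated requirement.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not an existing JabRef class.
final class BooktitlePageRangeFinder {
    // <number>-<number> or <number>--<number>, not glued to further digits or hyphens.
    private static final Pattern RANGE =
            Pattern.compile("(?<![\\d-])(\\d+)--?(\\d+)(?![\\d-])");

    // Returns the first range found, normalized to the double-hyphen form.
    static Optional<String> findRange(String booktitle) {
        Matcher matcher = RANGE.matcher(booktitle);
        if (matcher.find()) {
            return Optional.of(matcher.group(1) + "--" + matcher.group(2));
        }
        return Optional.empty();
    }
}
```

If 24-26 turns out to be a date range, one option would be to skip ranges that directly follow a month name.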

Locations

These can refer to the names of countries and/or cities present in the field. While a previous attempt doesn't provide much in terms of details, this comment on the original issue post does suggest an offline-friendly approach.

Since we'll encounter location names only in cases of conferences, we're only looking at cities big enough to host one of them. GeoNames provides multiple datasets with cutoffs based on population (>500, >1000, >5000, >15000). Just to be on the safe side, we can pick the >1000 population dataset and incorporate that into our search database.

The data has 162,090 cities in it along with a bunch of associated information. We'll strip away all the metadata and keep only the names. Since the data provides city names in UTF-8 as well as in ASCII as separate columns, we'll flatten it further down to one city per line and deduplicate the resulting dataset. Doing this brings us down to 1.9MB from our ~30MB starting point with the total number of cities going up to 173,371 (not that many cities have different UTF-8 and ASCII names). We can add the list of countries on top of this to get our final dataset.
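The flattening step can be sketched roughly as follows, assuming the tab-separated GeoNames layout in which the second and third columns hold the UTF-8 name and the ASCII name. The class name is hypothetical, and reading the file from disk is left out.

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper for illustration only; not an existing JabRef class.
final class CityNameFlattener {
    // Keeps only the name columns and deduplicates them, one name per set entry.
    static Set<String> flatten(List<String> geoNamesLines) {
        Set<String> names = new TreeSet<>();
        for (String line : geoNamesLines) {
            String[] columns = line.split("\t", -1);
            if (columns.length < 3) {
                continue; // skip malformed rows
            }
            // Assumed layout: columns[1] = UTF-8 name, columns[2] = ASCII name.
            for (String name : new String[] {columns[1], columns[2]}) {
                if (!name.isBlank()) {
                    names.add(name.strip());
                }
            }
        }
        return names;
    }
}
```

A city whose two names are identical contributes one entry, while a city with differing UTF-8 and ASCII names contributes two.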

[!NOTE] If we really want to be stingy about space, we can further compress this down using something like gzip and get down to around ~800KB as plaintext can be compressed quite well.
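If the gzip route is taken, loading the list could look roughly like the sketch below, which only uses the JDK's `java.util.zip` classes. The class name is hypothetical, and the `compress` helper exists only to make the example self-contained.

```java
import java.io.BufferedReader;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper for illustration only; not an existing JabRef class.
final class GzipNameList {
    // Writes one name per line and gzips the result.
    static byte[] compress(List<String> names) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(bytes)) {
            gzip.write(String.join("\n", names).getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Reads a gzipped, UTF-8, one-name-per-line stream into a set.
    static Set<String> load(InputStream gzipped) {
        Set<String> names = new HashSet<>();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(gzipped), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (!line.isBlank()) {
                    names.add(line);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return names;
    }
}
```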

Loading this many entries into memory shouldn't be that big of a burden either. A GPT-assisted rough calculation puts us at around 40MB of heap usage if we're using a HashSet<String> [Edit: I'm now reconsidering the HashSet and instead using a specialized Trie data structure to accommodate punctuation and whitespace within location names].
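One way such a trie could look is sketched below: it is keyed on lowercased word tokens, so punctuation and whitespace differences inside a name are ignored, and it prefers the longest match. This is an illustration under those assumptions, not the structure used in the PR, and all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical token-based trie for illustration only; not an existing JabRef class.
final class LocationTrie {
    private static final class Node {
        final Map<String, Node> children = new HashMap<>();
        boolean terminal; // true if a known name ends at this node
    }

    private final Node root = new Node();

    // Splits on anything that is not a letter or digit and lowercases the tokens.
    private static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    void add(String name) {
        Node node = root;
        for (String token : tokenize(name)) {
            node = node.children.computeIfAbsent(token, key -> new Node());
        }
        if (node != root) {
            node.terminal = true;
        }
    }

    // Returns the normalized (lowercased, space-joined) names found, longest match first.
    List<String> findAll(String text) {
        List<String> tokens = tokenize(text);
        List<String> found = new ArrayList<>();
        int start = 0;
        while (start < tokens.size()) {
            Node node = root;
            int longestEnd = -1;
            for (int i = start; i < tokens.size(); i++) {
                node = node.children.get(tokens.get(i));
                if (node == null) {
                    break;
                }
                if (node.terminal) {
                    longestEnd = i;
                }
            }
            if (longestEnd >= 0) {
                found.add(String.join(" ", tokens.subList(start, longestEnd + 1)));
                start = longestEnd + 1;
            } else {
                start++;
            }
        }
        return found;
    }
}
```

With a dataset this large, some ordinary title words are likely to coincide with place names, so a real checker would probably need extra filtering on top of a plain lookup.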

Q2. Is this approach okay? If there are issues with the overhead, we can use a bloom filter, but that is a probabilistic data structure which can lead to some false positives.

Q3. Is there a specific field where these should be moved to as part of the cleanup action? I have noticed these fields: address, location, and venue.

TheYorouzoya avatar Sep 12 '25 12:09 TheYorouzoya

@koppor please check if the approach fits with the expectations of the feature, and help clarify those questions. I'll start laying out some of the core logic in the meantime.

TheYorouzoya avatar Sep 12 '25 12:09 TheYorouzoya

Since I have not received any feedback on the approach for the last two weeks, I'll be pushing an implementation as per the outlined approach in a couple of days.

TheYorouzoya avatar Sep 27 '25 10:09 TheYorouzoya

@koppor check if the feature implementation so far is up to expectations. Also, please help me figure out whether it requires adding a preference migration or updating preferences in the CliPreferences to have JabRef "remember" the previous choice for the cleanup panel.

TheYorouzoya avatar Sep 29 '25 18:09 TheYorouzoya

> Also, please help me figure out whether it requires adding a preference migration or updating preferences in the CliPreferences to have JabRef "remember" the previous choice for the cleanup panel.

I assume the functionality is existing for the other settings. Thus, please also support this for the new checkbox. Thank you 😅.

koppor avatar Oct 17 '25 12:10 koppor

> > Also, please help me figure out whether it requires adding a preference migration or updating preferences in the CliPreferences to have JabRef "remember" the previous choice for the cleanup panel.
>
> I assume the functionality is existing for the other settings. Thus, please also support this for the new checkbox. Thank you 😅.

My question wasn't "whether or not" to add the functionality. It asked "whether it requires A OR B" to be modified to have it enabled, i.e., which of the two options needs to be updated to get it to work.

TheYorouzoya avatar Oct 17 '25 13:10 TheYorouzoya

> > > Also, please help me figure out whether it requires adding a preference migration or updating preferences in the CliPreferences to have JabRef "remember" the previous choice for the cleanup panel.
> >
> > I assume the functionality is existing for the other settings. Thus, please also support this for the new checkbox. Thank you 😅.
>
> My question wasn't a "whether or not" to add the functionality, it says "whether it requires A OR B" to be modified to have it enabled, i.e., which of the two options need to be updated to get it to work.

I think it's both: B to have it working, and A to update existing preferences so that it is enabled for all users.

koppor avatar Nov 18 '25 19:11 koppor

Since I have not received any feedback on the approach for the last two weeks,

Sorry for this. We do this in our free time and do not have enough of it to cover all contributions. We try to invest more, currently by skipping sports and reducing our sleep. Since a day has only 24 hours, there are limits.

koppor avatar Nov 18 '25 19:11 koppor

Sorry for this. We do this in our freetime and have not enough freetime to cover all contributions. We try to invest more - currently by skipping sports and reducing our sleep. Since a day has only 24 hours, also there are limits.

Thanks for the update. I completely understand the time constraints. Open source is voluntary and I appreciate the work.

Just to clarify the context of my earlier ping: I wasn’t asking for a full code review (there was no code in the PR at that time). I was hoping to confirm the implementation approach before investing more time in writing code. Because this issue has had discussions spread across several years, it helps to re-establish common ground regarding the expectations and the architecture up front.

As for code, the last message I sent was on September 30th, 50 days ago, to ask for help when I was stuck. That has now been answered.

TheYorouzoya avatar Nov 20 '25 14:11 TheYorouzoya

As for code, the last message I sent was on September 30th, 50 days ago, to ask for help when I was stuck. That has now been answered.

Sure. We also might have missed some messages in the dev chat. It is very OK to ask after a few weeks. It is also very OK to look into other PRs and try to give feedback to other contributors. Some need help with their submodules; this is something most contributors should be able to support. That way, our community grows, contributors get more feedback, and the delay in answers on one's own project also decreases, because the whole community spends more time on the project.

koppor avatar Nov 20 '25 15:11 koppor

@TheYorouzoya Please note that we merged a PR started on September 11 that also touches the GUI (https://github.com/JabRef/jabref/pull/13852). Maybe you first remove the radio buttons and then try to merge the latest main to resolve the merge conflicts. -- You could also split up the PR into two: one for the integrity check and one for the GUI. Some of us have success with GitButler, but this tool also needs some time to get used to. Its benefit is: one sees the changes of all PRs combined and can assign local changes to PRs. -- You can also "just" finish the PR,


The license of jablib/src/main/resources/util/countries_cities1000.txt needs to be clear. It would be OK if it had a different license; however it should be permissive and not copyleft. MIT would be best.

Since we also asked for an MVStore for that, maybe, a separate PR should start with that... Because this is deep in our build pipeline and you need to get used to JBang etc.

koppor avatar Nov 20 '25 15:11 koppor

What I meant: move if empty, if equal: remove, else noop
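One possible reading of this rule, sketched with a plain `Map` standing in for a BibTeX entry: if the target field is empty, move the value out of booktitle; if the target already holds the same value, only remove it from booktitle; otherwise leave everything untouched. The class, enum, and method names are hypothetical, and the separator tidying is an assumption of this sketch.

```java
import java.util.Map;

// Hypothetical illustration of the rule; not an existing JabRef class.
final class BooktitleMoveRule {
    enum Outcome { MOVED, REMOVED_DUPLICATE, NO_OP }

    // entry maps field names to values; "booktitle" is the source field.
    static Outcome apply(Map<String, String> entry, String targetField, String extracted) {
        String booktitle = entry.getOrDefault("booktitle", "");
        String existing = entry.getOrDefault(targetField, "");
        if (existing.isBlank()) {
            // Target is empty: move the value over.
            entry.put(targetField, extracted);
            entry.put("booktitle", remove(booktitle, extracted));
            return Outcome.MOVED;
        }
        if (existing.strip().equals(extracted.strip())) {
            // Target already holds the same value: only drop it from booktitle.
            entry.put("booktitle", remove(booktitle, extracted));
            return Outcome.REMOVED_DUPLICATE;
        }
        // Target holds something else: do nothing.
        return Outcome.NO_OP;
    }

    // Removes every occurrence of value and tidies leftover commas and spaces.
    private static String remove(String booktitle, String value) {
        return booktitle.replace(value, "")
                .replaceAll("\\s+,", ",")
                .replaceAll(",(\\s*,)+", ",")
                .replaceAll("\\s{2,}", " ")
                .replaceAll("^[\\s,]+|[\\s,]+$", "");
    }
}
```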

koppor avatar Nov 20 '25 16:11 koppor

Your pull request conflicts with the target branch.

Please merge with your code. For a step-by-step guide to resolve merge conflicts, see https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/addressing-merge-conflicts/resolving-a-merge-conflict-using-the-command-line.

github-actions[bot] avatar Dec 01 '25 00:12 github-actions[bot]

You could also split-up the PR into two...

Sorry for the late reply here. I will split the PR into two as you suggested. However, I will only be able to work on it sometime next week, if that's alright.

TheYorouzoya avatar Dec 01 '25 06:12 TheYorouzoya