jabref
jabref copied to clipboard
Implement caching for citation relations
Context
PR https://github.com/JabRef/jabref/pull/10980 introduced a memory saving for citation relations to prevent JabRef's memory usage growing constantly. The drawback is that citation relations might become unavailable due to API limitations of the online service.
Quoting https://libguides.ucalgary.ca/c.php?g=732144&p=5260798
The API allows up to 100 requests per 5 minutes. To access a higher rate limit, complete the form to request authentication for your project.
That means, for a large library, I cannot step through the references, because I could hit the rate limit. I know, this is seldom, but it could happen at following setting: In a corporate setting: all requests are going through a proxy. Thus, the rate limit is not per person, but per the SUM of persons. In case 100 researchers work in parallel with JabRef, each researcher can get ONE request per 5 minutes. And per entry TWO requests are needed: For the citing and the cited by.
The following solution would really help to make the citation relations really usable. (Because the information for each DOI is stored independent of each library and is presented as soon it is availbable.)...
Task 1
Let's investiage MVStore. MVStore is a library storing the values of a hashmap on disk. Thus, NOT in memory. Thus, it takes less memory than a full hash map in memory, because it is on disk. -- MVStore routes through the request to a map entry to disk. - See https://www.h2database.com/html/mvstore.html for details.
Estimate: 100 lines of code, but scattered around JabRef. NativeDesktop needs to be touched etc. The most difficult thing will be the closing of the MVStore. Since the DOIs are globally unique, one can close the MVStore when JabRef is shut down. This makes it "easier" (in comparison to https://github.com/JabRef/jabref/pull/10937, where for each tab some closing thing were necessary). -- Nevertheless, it could be that this will be a back-and-forth code development (meaning: code reviews with significant changes could come up).
Implementation hint: Do NOT do it like org.jabref.logic.journals.JournalAbbreviationRepository
, because there, there is no direct access to the MVStore, but new hashmaps are created. We really need to work on the Maps provided by the MVStore. Because the MVStore "intelligent" does memory management.
Task 2
Follow-up requirement: Refresh the DOI information if one week passed since the last fetch. Maybe this can be baked into the HashMap designed for the MVStore.
Hey @koppor, I would like to take this up
Welcome to the vibrant world of open-source development with JabRef!
Newcomers, we're excited to have you on board. Start by exploring our Contributing guidelines, and don't forget to check out our workspace setup guidelines to get started smoothly.
Having any questions or issues? Feel free to ask here on GitHub. Need help setting up your local workspace? Join the conversation on JabRef's Gitter chat. And don't hesitate to open a (draft) pull request early on to show the direction it is heading towards. This way, you will receive valuable feedback.
Happy coding! 🚀
Hello @koppor
Is this issue still reserved ?
If not then I would be quite interested to provide a solution.
After taking a look to the main branch code I see possible improvements in the repository design that could also:
- ease third party library integration if MVStore does not fit (maintenance)
- decouple the calling code from the repository logic
- somehow simplify calling code
I do not know MVStore yet but I saw that H2 provides a fairly extensive documentation.
Hello @koppor and @ThiloteE ,
I was wondering if you could provide some insights to help me understand the whole design:
- Regarding the
citationrelationtab
package, is there any specific reason that it includes the repository, the cache and thesemanticsscholar
code as its scope belongs to the GUI ? From my point of view repository, cache and fetcher logic could live without the GUI and be exposed in a back-end eventually (should it be local or remote). Could it be interesting to move those classes to a dedicated package in thelogic
layer, likelogic.citation.relation
? - To address this issue I would like to try to provide a generic approach to answer multi layer data access but without concrete implementation. I was thinking about creating a dedicated package under
logic
:logic.util.dao
. Do you see other possibilities ? - And finally, are there specific rules to respect regarding package dependencies ?
Regards
Dear @alexandre-cremieux, we are currently either on holidays or busy with other projects. Therefore only a short answer: Your ideas sound great! Separation from GUI and logic is required.
General package dependencies are listed at https://devdocs.jabref.org/getting-into-the-code/high-level-documentation.html. Feel free to refine.
Background: We also get contributions from students with a less strong wish for a clean architecture. After a few rounds, we let the code through: Better a feature with an OKish implementation than a feature not coming, because no one has time to work on clean code...
Hello @koppor
Thanks for your answer and documentation sharing. I was quite busy those last days also.
I understand your position. I might take some time, but I'm on my way to refactor the package as we agree on this.
Depending of my commit management, I may open separate PRs for refactoring and new implementation to try to ease the review.
Regards