New Fetcher that mimics ArXiv fetcher, but imports more information from DOI
Implements #9092
- [x] Add new Fetcher that mimics the ArXiv fetcher, but intercepts all BibTeX entries and combines them with the corresponding BibTeX entry fetched directly from its ArXiv-issued DOI (see this article)
- [ ] Create feature flag for toggling between new and old behavior
- [ ] Change in CHANGELOG.md described in a way that is understandable for the average user (if applicable)
- [ ] Tests created for changes (if applicable)
- [ ] Manually tested changed features in running JabRef (always required)
- [ ] Screenshots added in PR description (for UI changes)
- [x] Checked developer's documentation: Is the information available and up to date? If not, I outlined it in this pull request.
- [x] Checked documentation: Is the information available and up to date? If not, I created an issue at https://github.com/JabRef/user-documentation/issues or, even better, I submitted a pull request to the documentation repository.
Below, you can see the difference between importing with the old ArXiv fetcher, with the ArXiv-issued DOI, and with the new ArXiv fetcher, respectively, for https://arxiv.org/abs/1811.10364. Note that when two fields clash, the chosen one is always the one from the old ArXiv fetcher.
If you have some opinions about this choice, feel free to discuss them.
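For reference, a minimal sketch of that merge rule (not the actual PR code; the helper name is illustrative and JabRef's BibEntry accessors are assumed to behave roughly as shown):

import org.jabref.model.entry.BibEntry;
import org.jabref.model.entry.field.Field;

// Copy every field from the DOI-based entry into the ArXiv-based entry,
// but never overwrite a field the ArXiv fetcher already filled in.
static BibEntry mergeWithArXivPriority(BibEntry arXivEntry, BibEntry doiEntry) {
    for (Field field : doiEntry.getFields()) {
        if (arXivEntry.getField(field).isEmpty()) {
            doiEntry.getField(field).ifPresent(value -> arXivEntry.setField(field, value));
        }
    }
    return arXivEntry;
}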
@Comment{Import with the original 'ArXiv' fetcher}
@Article{Beel2018,
author = {Joeran Beel and Andrew Collins and Akiko Aizawa},
title = {The Architecture of Mr. DLib's Scientific Recommender-System API},
year = {2018},
month = nov,
abstract = {Recommender systems in academia are not widely available. This may be in part due to the difficulty and cost of developing and maintaining recommender systems. Many operators of academic products such as digital libraries and reference managers avoid this effort, although a recommender system could provide significant benefits to their users. In this paper, we introduce Mr. DLib's "Recommendations as-a-Service" (RaaS) API that allows operators of academic products to easily integrate a scientific recommender system into their products. Mr. DLib generates recommendations for research articles but in the future, recommendations may include call for papers, grants, etc. Operators of academic products can request recommendations from Mr. DLib and display these recommendations to their users. Mr. DLib can be integrated in just a few hours or days; creating an equivalent recommender system from scratch would require several months for an academic operator. Mr. DLib has been used by GESIS Sowiport and by the reference manager JabRef. Mr. DLib is open source and its goal is to facilitate the application of, and research on, scientific recommender systems. In this paper, we present the motivation for Mr. DLib, the architecture and details about the effectiveness. Mr. DLib has delivered 94m recommendations over a span of two years with an average click-through rate of 0.12%.},
archiveprefix = {arXiv},
eprint = {1811.10364},
file = {:http\://arxiv.org/pdf/1811.10364v1:PDF},
keywords = {cs.IR, cs.AI, cs.DL, cs.LG},
primaryclass = {cs.IR},
}
@Comment{Import with only the 'DOI' fetcher}
@Misc{Beel2018a,
author = {Beel, Joeran and Collins, Andrew and Aizawa, Akiko},
title = {The Architecture of Mr. DLib's Scientific Recommender-System API},
year = {2018},
copyright = {arXiv.org perpetual, non-exclusive license},
doi = {10.48550/ARXIV.1811.10364},
keywords = {Information Retrieval (cs.IR), Artificial Intelligence (cs.AI), Digital Libraries (cs.DL), Machine Learning (cs.LG), FOS: Computer and information sciences},
publisher = {arXiv},
}
@Comment{Import with the new 'ArXivWithDoi' fetcher}
@Article{Beel2018b,
author = {Joeran Beel and Andrew Collins and Akiko Aizawa},
title = {The Architecture of Mr. DLib's Scientific Recommender-System API},
year = {2018},
month = nov,
abstract = {Recommender systems in academia are not widely available. This may be in part due to the difficulty and cost of developing and maintaining recommender systems. Many operators of academic products such as digital libraries and reference managers avoid this effort, although a recommender system could provide significant benefits to their users. In this paper, we introduce Mr. DLib's "Recommendations as-a-Service" (RaaS) API that allows operators of academic products to easily integrate a scientific recommender system into their products. Mr. DLib generates recommendations for research articles but in the future, recommendations may include call for papers, grants, etc. Operators of academic products can request recommendations from Mr. DLib and display these recommendations to their users. Mr. DLib can be integrated in just a few hours or days; creating an equivalent recommender system from scratch would require several months for an academic operator. Mr. DLib has been used by GESIS Sowiport and by the reference manager JabRef. Mr. DLib is open source and its goal is to facilitate the application of, and research on, scientific recommender systems. In this paper, we present the motivation for Mr. DLib, the architecture and details about the effectiveness. Mr. DLib has delivered 94m recommendations over a span of two years with an average click-through rate of 0.12%.},
archiveprefix = {arXiv},
copyright = {arXiv.org perpetual, non-exclusive license},
doi = {10.48550/ARXIV.1811.10364},
eprint = {1811.10364},
file = {:http\://arxiv.org/pdf/1811.10364v1:PDF},
keywords = {cs.IR, cs.AI, cs.DL, cs.LG},
primaryclass = {cs.IR},
publisher = {arXiv},
}
@thiagocferr Thank you for tackling this issue.
"Import with the new 'ArXivWithDoi' fetcher" contains more information about the reference. That is nice!
I am not sure about the field file = {:http\://arxiv.org/pdf/1811.10364v1:PDF}. Shouldn't it be converted to url = {http://arxiv.org/pdf/1811.10364v1}? @ThiloteE: any suggestion?
About which field to keep:
- I think we should keep the keywords provided by the DOI fetcher instead of the ones provided by original arXiv fetcher.
- For authors with compound last names (such as in https://arxiv.org/abs/2209.11222), the format provided by the DOI fetcher is better. Please, keep this one.
@mlep
About the file vs url fields
In the JabRef UI, having one or the other allows for quicker access to the file from either
- The entry's General tab;
- The Entry Table's columns: Linked files (for the field file) / Linked identifiers (for the field url)
The thing is, by replacing file with url, the direct reference to the file is no longer in the Linked files column (represented by a 'file' symbol) and moves to the Linked identifiers column (a 'globe' symbol). However, this field contains more than one element, like access links to the ArXiv page (not the PDF), so accessing the PDF becomes a two-step process: clicking the field opens a small menu, and clicking the url item there opens the PDF. This is shown in the image below:

By using the file field, on the other hand, you can more easily access the PDF from the Entry Table by simply pressing the 'PDF' icon, while leaving the other column as a link to the ArXiv.org page. This can be seen below as well:

Although I see this as a better workflow, I think having both file and url may have broader usefulness, both programmatically (as in having the field in the BibTeX file) and for reducing the number of Entry Table columns. What do you think?
About fields to keep
- Yes, the DOI fetcher seems to get more "human-friendly" information about keywords. If this is the intended direction, should it also remove the abbreviation that might come from the original ArXiv entry (i.e. "(cs.IR)" from "Information Retrieval (cs.IR)")?
- It does seem that the "last name, first name" format is easier on BibTeX, so I'll be leaving the one from the DOI fetcher.
I just had a look at this pr.
- It took me ages to search for https://arxiv.org/abs/1811.10364 with ArXiv websearch. Is this normal?
- I think adding the following preference would probably be overkill:
  Following the principle of "fetching as much metadata as possible", it should be sufficient to merge the old ArXiv fetcher and the new implementation, as you attempt to do and simply make this the default :-)
- My thoughts about the file field:
  - If you have downloaded the file to your local harddrive and the link points to the path on your harddrive, then the file field should be used.
  - If the URL directly refers to a PDF on a webpage, I think the file field is fair enough. Example: https://arxiv.org/pdf/1811.10364v1.pdf
  - If the URL leads to a normal webpage (not directly to the PDF), then I would put it into the url field :-) Example: https://arxiv.org/abs/1811.10364v1
The thing is, by replacing file with url, the direct reference to the file is no longer in the Linked files column (represented by a 'file' symbol) and moves to the Linked identifiers column (a 'globe' symbol). However, this field contains more than one element, like access links to the ArXiv page (not the PDF), so accessing the PDF becomes a two-step process: clicking the field opens a small menu, and clicking the url item there opens the PDF. This is shown in the image below:
This argument only holds true if there are not multiple files linked to the entry, which can happen quite easily. The reverse can also happen: multiple files are linked to an entry, but only one URL, so I think this argument does not really help in solving this dilemma.
- It does seem that the "last name, first name" format is easier on BibTeX, so I'll be leaving the one from the DOI fetcher.
@ThiloteE Do you mean "keeping the one from the DOI fetcher"?
Thank you for the rationale about the field file. Difficult to decide... So I suggest we leave the file field as it is, as long as users do not raise the issue.
~@btut~ @mlep i think you meant to tag thiagocferr
@ThiloteE huh?
oooh lol. Sry I wanted to tag @mlep :rofl:
OK... Let's get this straight...
- It does seem that the "last name, first name" format is easier on BibTeX, so I'll be leaving the one from the DOI fetcher.
@thiagocferr Do you mean "keeping the one from the DOI fetcher"?
@ThiloteE Thank you for the rationale about the field file. Difficult to decide... So I suggest we leave the file field as it is, as long as users do not raise the issue.
It does seem that the "last name, first name" format is easier on BibTeX, so I'll be leaving the one from the DOI fetcher.
@thiagocferr Do you mean "keeping the one from the DOI fetcher"?
Yes, my bad.
@ThiloteE
- It took me ages to search for https://arxiv.org/abs/1811.10364 with ArXiv websearch. Is this normal?
What exactly do you mean by that? If you mean searching using the link itself, I guess it doesn't work, as this error is logged when you use the link itself on the ArXiv Web search:
2022-09-23 17:07:17 [pool-2-thread-1] org.jabref.logic.importer.fetcher.transformers.AbstractQueryTransformer.transform()
ERROR: Unsupported case when transforming the query:
<regexp field='https' term=''/>
2022-09-23 17:07:17 [pool-2-thread-1] org.jabref.logic.importer.fetcher.transformers.AbstractQueryTransformer.transform()
ERROR: Unsupported case when transforming the query:
<regexp field='default' term='abs'/>
Using the ArXiv ID, it goes very fast for me...
- I think adding the following preference would probably be overkill:
Yes, I was wondering the same, but decided to do it anyway, both as a learning exercise and as a tool to help me debug. I may just remove it when this PR goes up for review.
Following the principle of "fetching as much metadata as possible", it should be sufficient to merge the old ArXiv fetcher and the new implementation, as you attempt to do and simply make this the default :-)
Yes, I do think that having two classes marked as the 'ArXiv fetcher' is kind of bad, and if we adhere to getting the maximum information possible, it would also be bad if any other code could directly instantiate the fetcher variant that returns less information.
However, I do think a separation between the class that does the heavy lifting (ArXiv.java) and the one that just does post-processing (ArXivWithDoi.java) is good coding practice (especially because the latter directly uses another fetcher). So, I'm about to make another commit doing exactly that: keep only one class, ArXivFetcher, a copy of ArXivWithDoi with the previous ArXiv class as an internal, private class.
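Roughly, the structure I have in mind looks like this (a sketch only; the real fetcher interfaces and method signatures are omitted or assumed):

import java.util.ArrayList;
import java.util.List;
import org.jabref.model.entry.BibEntry;

public class ArXivFetcher {

    // The former ArXiv.java, kept as an implementation detail that still does the heavy lifting.
    private static class ArXiv {
        List<BibEntry> search(String query) {
            return new ArrayList<>(); // the existing ArXiv API logic would live here
        }
    }

    private final ArXiv arXiv = new ArXiv();

    public List<BibEntry> performSearch(String query) {
        List<BibEntry> results = new ArrayList<>();
        for (BibEntry entry : arXiv.search(query)) {
            results.add(enrichFromDoi(entry)); // post-processing step that uses the DOI fetcher
        }
        return results;
    }

    private BibEntry enrichFromDoi(BibEntry entry) {
        return entry; // placeholder: merge in the record fetched via the ArXiv-issued DOI
    }
}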
My thoughts about the file field:
- If you have downloaded the file to your local harddrive and the link points to the path on your harddrive, then the file field should be used.
- If the URL directly refers to a PDF on a webpage, I think the file field is fair enough. Example: https://arxiv.org/pdf/1811.10364v1.pdf
- If the URL leads to a normal webpage (not directly to the PDF), then I would put it into the url field :-) Example: https://arxiv.org/abs/1811.10364v1
This take is reasonable, indeed...
The thing is, by replacing file with url, the direct reference to the file is no longer in the Linked files column (represented by a 'file' symbol) and moves to the Linked identifiers column (a 'globe' symbol). However, this field contains more than one element, like access links to the ArXiv page (not the PDF), so accessing the PDF becomes a two-step process: clicking the field opens a small menu, and clicking the url item there opens the PDF. This is shown in the image below:
This argument only holds true if there are not multiple files linked to the entry, which can happen quite easily. The reverse can also happen: multiple files are linked to an entry, but only one URL, so I think this argument does not really help in solving this dilemma.
In a general setting, yes, this argument does not hold. However, the imported information from ArXiv seems to follow a standard here. Not that I've done extensive testing, but the entries all seem to include at least the Eprint field, which occupies the Linked identifiers column (and occasionally a DOI field; the manually introduced ones, that is). If we were to convert file to url, this would create at least 2 Linked identifiers (and that little submenu), but no direct access via Linked files. This is why I think eliminating file would be detrimental in this situation.
Now, about the use of url, I think this could be a good addition. It could be (optionally) used in BibTeX references (together with an access date) and is, overall, more recorded information. However, I think this change is out of the scope of this PR, as it could be done with only the information retrieved from the ArXiv API (the file field is filled by default), and this PR should only focus on merging the new info from the DOI API.
Because of all of this, I think we should really just leave it the way it is: file with no url, for now.
Codewise it looks good so far. I would keep storing the URL in the linked files; in the importer, JabRef can automatically download them. Please have a look at the failing tests. I think for the architecture tests you just have to rename them as well or adapt them to the new class names.
So, I happened to get stuck on a weird bug since this commit (in reality, since this branch's first commit, according to my tests), and I made a change that seems to fix it, although I don't really get how the two could possibly be related.
From my latest commit:
Now, the thing that mostly took my time since the last commit was this weird bug: when saving a database with entries imported from the new ArXiv fetcher, a prompt saying "The library has been modified by another program." would always appear, asking to accept some changes, which always included a modification to the newly added entry. This made no sense, as there was neither any involvement from an external program nor any modification since manually saving the database.
I seem to have found a possible (and very weird) cause: this would always happen when setting the 'KEYWORD' field of the resulting BibEntry to the raw string from the DOI BibTeX (as discussed before, it contains more detailed information, so it was included in the "prioritized fields" from DOI). The thing is, this string contained a duplicated "keyword", the FOS (which I suppose stands for "Field Of Subject" or similar) of the entry. You can see this behavior by making a GET request to https://doi.org/[ArXiv-assigned DOI] with the header "Accept=application/x-bibtex". After removing this duplication, the bug suddenly disappeared (it showed up once, but not since).
Maybe future commits will include a more definitive fix for this bug, but the current fix cannot really affect the end result (unique keywords are what one would expect anyway), so I'll leave it at that for now.
You can see the related fix here.
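For illustration, this is the kind of deduplication I mean (a sketch only, not the exact code from the linked commit):

import java.util.LinkedHashSet;
import java.util.Set;

// Drop duplicated keywords from the raw DOI keyword string while preserving their order.
static String removeDuplicateKeywords(String rawKeywords) {
    Set<String> unique = new LinkedHashSet<>();
    for (String keyword : rawKeywords.split(",")) {
        unique.add(keyword.trim());
    }
    return String.join(", ", unique);
}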
From previous debugging sessions, I looked for reasons why there were differences between the internal and on-disk databases (and there were quite a few when inspecting both entries), but the discrepancies made no sense to me, so I reached the previous conclusion by trial and error (i.e. seeing the results of gradually dropping the overwrite of each field for one specific ArXiv example).
If you still see this behavior, please let me know the conditions in which it happened. I'd also appreciate it if someone could help me figure out the underlying cause of this bug and whether there is a more concrete fix. Anyway, I think I'll be moving on to fixing the failed tests next.
this would always happen when setting the 'KEYWORD' field of the resulting BibEntry to the raw string from the DOI BibTeX (as discussed before, it contains more detailed information, so it was included in the "prioritized fields" from DOI). The thing is, this string contained a duplicated "keyword", the FOS (which I suppose stands for "Field Of Subject" or similar) of the entry.
Thanks for the investigation. This library modification thing has been haunting us for a while already. I recently fixed most occurrences of it, and I noticed it with keywords as well, but I could not pinpoint it. Can you open a new issue with your findings?
@ThiloteE Thank you for the rationale about the field file. Difficult to decide... So I suggest we leave the file field as it is, as long as users do not raise the issue.
@ThiloteE 2 things to consider in the rationale:
- In the entry editor, for the field File, when you click on the "+" icon, a window opens and lets you select a file reachable by your OS. You cannot directly set a URL (to do this, you have to edit an existing item and paste the URL in place of the file path, which works, but is a weird way to do it). Hence, currently, the field File is designed for actual files (i.e. a reachable path on your system), not a link to the web.
- The biblatex documentation specifies that:
  file field (verbatim)
  A local link to a pdf or other version of the work. Not used by the standard bibliography styles.
This concurs not to use the field file for URLs. Hence, the file field provided by the 'ArXiv' fetcher should be converted to a url field.
- It took me ages to search for https://arxiv.org/abs/1811.10364 with ArXiv websearch. Is this normal?
What exactly do you mean by that? If you mean searching using the link itself, I guess it doesn't work, as this error is logged when you use the link itself on the ArXiv Web search:
2022-09-23 17:07:17 [pool-2-thread-1] org.jabref.logic.importer.fetcher.transformers.AbstractQueryTransformer.transform() ERROR: Unsupported case when transforming the query: <regexp field='https' term=''/>
2022-09-23 17:07:17 [pool-2-thread-1] org.jabref.logic.importer.fetcher.transformers.AbstractQueryTransformer.transform() ERROR: Unsupported case when transforming the query: <regexp field='default' term='abs'/>
Using the ArXiv ID, it goes very fast for me...
What I mean is, the web search seems imperfect:
- Searching for "The Architecture of Mr. DLib's Scientific Recommender-System API" takes long (I canceled) :x:
- Searching for "https://arxiv.org/abs/1811.10364" takes long (I canceled) :x:
- Searching for "1811.10364" is fast :heavy_check_mark:
- Searching for "arXiv:1811.10364" takes long (I canceled) :x:
- Searching for "https://doi.org/10.48550/arXiv.1811.10364" takes long (I canceled) :x:
- Searching for "Joeran Beel, Andrew Collins, Akiko Aizawa" takes a little, but will get found eventually :signal_strength:
@mlep JabRef will (if activated) automatically download the pdf file on import and replace the entry in the file field as far as I know
@ThiloteE The web search != the "new entry from ID" feature. Web search is for searching by author or title. For IDs -> New entry from ID
So, I've re-implemented most of the main combined fetching process to allow parallel API calls, which, from my experience, seems to reduce waiting time to about half of the previous implementation's. Initially, I was only going to parallelize requests for search queries (as it was really trivial to do), but I wanted to try more in-depth parallelization in Java and thought the extra responsiveness in the UI would be neat (and, from what I can tell, it is a little better than before).
Please point it out if this approach would be problematic for some reason I might not know about. If I continue down this path, I may look for a way to retry when API throttling happens and then add more automated tests (especially regarding processing errors and how the code handles them).
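To make the idea concrete, here is a sketch of the parallel enrichment step (the helper mergeWithDoiRecord and the surrounding fetcher objects are hypothetical, not the actual implementation):

import java.util.List;
import java.util.concurrent.CompletableFuture;
import org.jabref.model.entry.BibEntry;

// After the regular ArXiv search, look up every entry's ArXiv-issued DOI record
// concurrently instead of one request at a time, then collect the merged entries.
List<BibEntry> arXivResults = arXiv.search(query);

List<CompletableFuture<BibEntry>> enrichment = arXivResults.stream()
        .map(entry -> CompletableFuture.supplyAsync(() -> mergeWithDoiRecord(entry)))
        .toList();

List<BibEntry> combined = enrichment.stream()
        .map(CompletableFuture::join) // wait for each DOI lookup to finish
        .toList();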
But I do have an issue that I don't really know how to handle and want opinions on: the separation of the new keywords field.
For some reason, ArXiv has only two categories in its taxonomy whose expanded names contain commas: cs.CE (Computational Engineering, Finance, and Science) and cs.DC (Distributed, Parallel, and Cluster Computing). These expanded names are what goes into the new keywords field, so if we assume the default keyword separator is a comma, JabRef will misinterpret where the individual keywords begin and end.
Knowing this, what kind of approach would you suggest? Replacing the commas with something else? Just removing them? Something else? @mlep @ThiloteE @Siedlerchr
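To illustrate the problem (plain Java, independent of JabRef's actual keyword parsing):

// Splitting the DOI-provided keyword string on the default comma separator
// turns a single category name into several bogus keywords.
String keywords = "Distributed, Parallel, and Cluster Computing (cs.DC), Machine Learning (cs.LG)";
String[] parsed = keywords.split(",\\s*");
// parsed -> ["Distributed", "Parallel", "and Cluster Computing (cs.DC)", "Machine Learning (cs.LG)"]
// i.e. four "keywords" instead of the intended two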
PS: How do I make the Deployment / Create installer and portable version for macOS (pull_request) test pass anyway?
@thiagocferr Thanks for your work so far! Regarding the parallelization, I think it depends. We have to be careful not to hit any limits from the publishers.
For macOS: you can ignore this. Background is that the macOS build requires some signing keys that are stored in GitHub secrets; those are not available to outside forks, and thus it can't build.
Regarding the keywords: I am not sure, but you could try the keyword list merging:
Character delimiter = preferencesService.getGroupsPreferences().getKeywordSeparator();
return KeywordList.merge(keywordsA, keywordsB, delimiter).getAsString(delimiter);
For the keywords, I suggest keeping the ones provided by the DOI fetcher only. Removing the parts in parentheses might be a plus.
@thiagocferr What is the status here? It would be great to have this feature integrated soon!
@Siedlerchr From my position, this PR is concluded. However, I don't think an external code review was done. If necessary, I'll make corrections to it.
@thiagocferr Okay cool! Before merging, a review by two maintainers is required. I will mark the PR as ready for review and try to go through it in more detail in the next few days.
About the file vs url fields
The file field having https content is also supported by JabRef. One can (IMHO) directly click on it to download the file. At least, I saw something in the code able to handle URLs there.
Wait... is something going on with the CI server? It says Deprecated Gradle features were used in this build, making it incompatible with Gradle 8.0., but it uses Gradle 7.5.1...
Edit: Oh, I thought that was the cause of the tests failing, but it was an error somewhere else...
You can ignore the Gradle version. It's just a warning, not an error.
Oh, wait, there's still a bug on CompositeIdFetcherTest. It should be a quick fix.
Ok, now it should be good to go. The errors on Fetcher Test don't seem to be related to this PR.
Thank you very much for your PR!