pybliometrics
Implement `ScienceDirectSearch` using the `PUT` method
PR for Issue #395. To implement the PUT method, the whole search pipeline had to be extended. In summary, the following changes were made:
- Extension of `get_content` to use the `PUT` method.
- Extension of `Base`. Unfortunately, the pagination logic differs from that of `ScopusSearch`, which uses `GET`. Therefore I had to add a new conditional clause.
- Extension of the `Retrieve` class. There were also some incompatibilities here (the query is a nested dictionary), which required new conditional clauses.
- A completely new `ScienceDirectSearch` class, since the returned results were in a new format.
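To illustrate why pagination differs for a PUT body, here is a minimal sketch, not the code from this PR. It assumes that paging works by advancing `display.offset` in steps of `display.show` (both fields appear in the API's request schema) and that the total number of results is already known. The helper name `paged_payloads` and the page size of 100 are hypothetical.

```python
from copy import deepcopy


def paged_payloads(payload, total, show=100):
    """Yield one request body per page by advancing display.offset.

    Hypothetical helper: assumes the API pages via display.offset/display.show.
    """
    for offset in range(0, total, show):
        page = deepcopy(payload)  # never mutate the caller's dictionary
        # setdefault keeps any display options the caller already supplied
        display = page.setdefault("display", {})
        display["offset"] = offset
        display["show"] = show
        yield page


base = {"qs": "graphene", "date": "2024"}
pages = list(paged_payloads(base, total=250))
offsets = [p["display"]["offset"] for p in pages]
print(offsets)
```

With `GET`, the paging state can live in the URL; with `PUT`, it sits inside the nested request body, which is why a separate code path is plausible.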
It's a shame that the query string changes. I'm thinking about how to minimize the necessary changes on the user end and how to stay as close to the other classes as possible.
My first question is whether the qs key in the search dict is mandatory.
Here is the request schema from the documentation:
{
  authors: string,
  date: string,
  display: {
    highlights: boolean,
    offset: integer,
    show: integer,
    sortBy: string
  },
  filters: {
    openAccess: boolean
  },
  issue: string,
  loadedAfter: string,
  page: string,
  pub: string,
  qs: string,
  title: string,
  volume: string
}
There are no mandatory fields. However, if the query has too many results, we get the error "Rate of requests exceeds specified limits. Recommend lowering request rate and/or concurrency of requests."
Do users use either qs or the others?
There is no restriction, users can also use both. However, using for example title and qs does not make sense since qs already queries all the fields.
From the documentation:
A free text search using the GET interface is equivalent to using `qs` with the PUT interface.
In this case I would suggest the following:
The default way of interacting with this class is the qs string, which we call `query`. That's the same way other classes expect input. It will also serve as the filename (in its hashed version).
Then we enable kwds and args to take over some other fields.
Consistency is key, as is the requirement to use information for the cache file.
Can you implement that please?
I implemented the suggestion to pass qs via the query argument. Now we can use the class by passing a query string and keyword arguments:
sds = ScienceDirectSearch('"neural radiance fields" AND "3D rendering"', date='2024')
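As a rough sketch of how such a call could translate into the request body from the schema above: `query` becomes `qs`, and keyword arguments fill the remaining fields. This is not the code from this PR. The helper name `build_payload` and the routing of `display` and `filters` keys into their nested dictionaries are assumptions made for illustration.

```python
# Keys that the request schema nests under "display" and "filters"
DISPLAY_KEYS = {"highlights", "offset", "show", "sortBy"}
FILTER_KEYS = {"openAccess"}


def build_payload(query=None, **kwds):
    """Build a PUT request body from a query string and keyword arguments.

    Hypothetical helper, for illustration only.
    """
    payload = {}
    if query:
        payload["qs"] = query
    for key, value in kwds.items():
        if key in DISPLAY_KEYS:
            payload.setdefault("display", {})[key] = value
        elif key in FILTER_KEYS:
            payload.setdefault("filters", {})[key] = value
        else:
            payload[key] = value  # top-level fields such as date or title
    return payload


body = build_payload('"neural radiance fields" AND "3D rendering"',
                     date="2024", openAccess=True)
print(body)
```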
Regarding the cache, we cannot use only the query argument, since all searches with an empty query would share the same filename. Consider the following example:
sds_1 = ScienceDirectSearch(title='Assessing LLMs in malicious code deobfuscation of real-world malware campaigns', date='2024')
sds_2 = ScienceDirectSearch(title='Sampling latent material-property information from LLM-derived embedding representations', date='2024')
sds_1._cache_file_path == sds_2._cache_file_path
True
Therefore I suggest keeping the current implementation and using the complete flattened query dictionary as the name.
Agreed, but let's use that only when there is no query.
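The agreed rule could be sketched as follows: hash the query string when there is one, and fall back to the flattened keyword dictionary otherwise. This is an illustration under assumptions, not the PR's implementation. The names `flatten` and `cache_name`, the use of MD5, and the JSON serialisation are all hypothetical choices.

```python
import hashlib
import json


def flatten(d, parent=""):
    """Flatten a nested dict into dotted keys, e.g. {'display.show': 25}."""
    items = {}
    for key, value in d.items():
        name = f"{parent}.{key}" if parent else key
        if isinstance(value, dict):
            items.update(flatten(value, name))
        else:
            items[name] = value
    return items


def cache_name(query=None, **kwds):
    """Return a filename stem for the cache (hypothetical helper)."""
    if query:
        raw = query  # same convention as the other search classes
    else:
        # sort_keys makes the name independent of keyword order
        raw = json.dumps(flatten(kwds), sort_keys=True)
    return hashlib.md5(raw.encode("utf8")).hexdigest()


name_1 = cache_name(title="Assessing LLMs in malicious code deobfuscation", date="2024")
name_2 = cache_name(title="Sampling latent material-property information", date="2024")
print(name_1 == name_2)
```

With this fallback, the two title-only searches no longer share a cache file.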
It will be a problem if we remove functionality from the current classes. They're in use already. And frankly I prefer more data even if it is retrieved in a non-recommended way than less data retrieved the right way.
I would put the PR on hold until ScienceDirect removes the GET method altogether.
With the GET method we only have two extra fields which we could reconstruct (details below).
| Old Method (GET) | New Method (PUT) |
|---|---|
| authors | authors |
| first_author | Na |
| doi | doi |
| title | title |
| link | uri |
| load_date | loadDate |
| openaccess_status | openAccess |
| pii | pii |
| coverDate | publicationDate |
| endingPage | last_page |
| publicationName | sourceTitle |
| startingPage | first_page |
| api_link | Na |
| volume | volumeIssue |
The PUT method also returns the order of the authors:
{'authors': [{'order': 1, 'name': 'Constantinos Patsakis'},
{'order': 2, 'name': 'Fran Casino'},
{'order': 3, 'name': 'Nikolaos Lykousas'}]}
The api_link can be reconstructed with the pii: f"https://api.elsevier.com/content/article/pii/{pii}"
We could also keep the field names of the old method (GET).
Finally, what I find most problematic about GET is that the results cannot be filtered by date, which leads to misleading counts.
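A minimal sketch of that idea: rename the PUT fields to the old GET names following the table above, then reconstruct `first_author` from the author order and `api_link` from the `pii`. It assumes each PUT result is a flat dictionary keyed exactly as in the table. The helper name `to_get_fields` and the placeholder `pii` value are hypothetical.

```python
# New (PUT) field name -> old (GET) field name, taken from the table above
PUT_TO_GET = {
    "authors": "authors",
    "doi": "doi",
    "title": "title",
    "uri": "link",
    "loadDate": "load_date",
    "openAccess": "openaccess_status",
    "pii": "pii",
    "publicationDate": "coverDate",
    "last_page": "endingPage",
    "sourceTitle": "publicationName",
    "first_page": "startingPage",
    "volumeIssue": "volume",
}


def to_get_fields(item):
    """Rename PUT fields to GET names and rebuild the two missing fields."""
    out = {old: item.get(new) for new, old in PUT_TO_GET.items()}
    # first_author: the entry with the lowest 'order' value
    authors = sorted(item.get("authors") or [], key=lambda a: a["order"])
    out["first_author"] = authors[0]["name"] if authors else None
    # api_link: rebuilt from the pii, as described above
    pii = item.get("pii")
    out["api_link"] = (
        f"https://api.elsevier.com/content/article/pii/{pii}" if pii else None
    )
    return out


item = {
    "authors": [{"order": 2, "name": "Fran Casino"},
                {"order": 1, "name": "Constantinos Patsakis"}],
    "pii": "PLACEHOLDER-PII",  # placeholder, not a real identifier
    "uri": "https://example.org/article",
}
row = to_get_fields(item)
print(row["first_author"], row["api_link"])
```

The `authors` value is passed through unchanged here; whether it should be converted to the format the GET-based class returned is left open.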
Ok, then we can keep it. But in any case, this requires a new major version. Not soon, though.