
Implement `ScienceDirectSearch` using the `PUT` method

Open nils-herrmann opened this issue 6 months ago • 10 comments

PR for Issue #395. Implementing the PUT method required extending the whole search pipeline. In summary, I made the following changes:

  • Extended get_content to support the PUT method.
  • Extended Base. Unfortunately, the pagination logic differs from that of ScopusSearch, which uses GET, so I had to add a new conditional clause.
  • Extended the Retrieve class. There were also some incompatibilities here (the query is a nested dictionary), which required new conditional clauses.
  • Wrote a completely new ScienceDirectSearch class, since the results are returned in a new format.

nils-herrmann avatar May 07 '25 16:05 nils-herrmann

It's a shame that the query string changes. I'm thinking about how to minimize the necessary changes on the user end and how to stay as close to the other classes as possible.

My first question is whether the qs key in the search dict is mandatory.

Michael-E-Rose avatar May 12 '25 08:05 Michael-E-Rose

Here is the request schema from the documentation:

{
    authors: string,
    date: string,
    display: {
        highlights: boolean,
        offset: integer,
        show: integer,
        sortBy: string
    },
    filters: {
        openAccess: boolean
    },
    issue: string,
    loadedAfter: string,
    page: string,
    pub: string,
    qs: string,
    title: string,
    volume: string
}

There are no mandatory fields. However, if the query has too many results, we get "Rate of requests exceeds specified limits. Recommend lowering request rate and/or concurrency of requests."
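
For illustration, paging through results with this schema could look roughly like the sketch below. It only builds the request bodies and sends nothing. The helper name `iter_page_payloads`, the page size of 100 and the idea of taking the total from a first response are assumptions, not part of the PR or the API documentation.

```python
from typing import Any, Dict, Iterator


def iter_page_payloads(base: Dict[str, Any], total: int,
                       show: int = 100) -> Iterator[Dict[str, Any]]:
    """Yield one request body per page, advancing display.offset by `show`.

    Hypothetical helper: `total` is assumed to come from a first response,
    and the page size of 100 is an assumption, not a documented limit.
    """
    for offset in range(0, total, show):
        payload = dict(base)  # shallow copy so the caller's dict is untouched
        display = dict(base.get("display", {}))  # keep other display options
        display.update({"offset": offset, "show": show})
        payload["display"] = display
        yield payload


pages = list(iter_page_payloads({"qs": "graphene", "date": "2024"},
                                total=250, show=100))
offsets = [page["display"]["offset"] for page in pages]
```

Each yielded dictionary would be the body of one PUT request, so pagination here lives in the body rather than in URL parameters as with GET.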

nils-herrmann avatar May 12 '25 08:05 nils-herrmann

Do users use either qs or the others?

Michael-E-Rose avatar May 12 '25 09:05 Michael-E-Rose

There is no restriction; users can also use both. However, combining, for example, title and qs does not make sense, since qs already queries all the fields.

From the documentation:

A free text search using the GET interface is equivalent to using qs with the PUT interface.

nils-herrmann avatar May 12 '25 09:05 nils-herrmann

In this case I would suggest the following: The default way of interacting with this class is the qs string, which we call query. That's the same way other classes expect input. It will also serve as the filename (in hashed form). Then we allow kwds and args to cover some of the other fields.

Consistency is key, as is the requirement to use information for the cache file.

Can you implement that please?

Michael-E-Rose avatar May 15 '25 11:05 Michael-E-Rose

I implemented the suggestion to pass qs via the query argument. Now we can use the class by passing a query string and keyword arguments:

sds = ScienceDirectSearch('"neural radiance fields" AND "3D rendering"', date='2024')
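
As a rough sketch of how such a call might be translated into the nested request body, assuming the field names from the schema above: the helper `build_search_dict` and the key sets are hypothetical and do not reflect the actual implementation in the PR.

```python
from typing import Any, Dict, Optional

# Field names taken from the request schema quoted earlier in the thread.
TOP_LEVEL_KEYS = {"authors", "date", "issue", "loadedAfter",
                  "page", "pub", "title", "volume"}
DISPLAY_KEYS = {"highlights", "offset", "show", "sortBy"}
FILTER_KEYS = {"openAccess"}


def build_search_dict(query: Optional[str] = None,
                      **kwds: Any) -> Dict[str, Any]:
    """Map `query` to `qs` and sort keyword arguments into the nested body.

    Hypothetical helper for illustration only.
    """
    search: Dict[str, Any] = {}
    if query:
        search["qs"] = query
    for key, value in kwds.items():
        if key in TOP_LEVEL_KEYS:
            search[key] = value
        elif key in DISPLAY_KEYS:
            search.setdefault("display", {})[key] = value
        elif key in FILTER_KEYS:
            search.setdefault("filters", {})[key] = value
        else:
            raise ValueError(f"Unknown search field: {key!r}")
    return search


example = build_search_dict('"neural radiance fields" AND "3D rendering"',
                            date="2024", openAccess=True)
```

Rejecting unknown keywords early is a design choice of this sketch; it avoids sending a body the API might reject.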

Regarding the cache, we cannot rely on the query argument alone, since all searches with an empty query would get the same filename. Consider the following example:

sds_1 = ScienceDirectSearch(title='Assessing LLMs in malicious code deobfuscation of real-world malware campaigns', date='2024')
sds_2 = ScienceDirectSearch(title='Sampling latent material-property information from LLM-derived embedding representations', date='2024')
sds_1._cache_file_path == sds_2._cache_file_path

True

Therefore I suggest keeping the current implementation and using the complete flattened query dictionary as the name.

nils-herrmann avatar May 18 '25 18:05 nils-herrmann

Therefore I suggest keeping the current implementation and using the complete flattened query dictionary as the name.

Agreed, but let's use that only when there is no query.
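
A minimal sketch of that naming rule, assuming an MD5 hash of a UTF-8 string is used for the filename: the helpers `flatten` and `cache_name`, the `key=value` joining scheme and the hash choice are illustrative assumptions, not the code from the PR.

```python
import hashlib
from typing import Any, Dict, Optional


def flatten(d: Dict[str, Any], parent: str = "") -> Dict[str, Any]:
    """Flatten nested dicts into dotted keys, e.g. {'display.show': 25}."""
    flat: Dict[str, Any] = {}
    for key, value in d.items():
        name = f"{parent}.{key}" if parent else key
        if isinstance(value, dict):
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat


def cache_name(query: Optional[str], search: Dict[str, Any]) -> str:
    """Hash the query if given, otherwise the flattened search dict."""
    if query:
        source = query
    else:
        flat = flatten(search)
        # Sort keys so the name does not depend on argument order.
        source = "&".join(f"{key}={flat[key]}" for key in sorted(flat))
    return hashlib.md5(source.encode("utf8")).hexdigest()


name_1 = cache_name(None, {"title": "Paper A", "date": "2024"})
name_2 = cache_name(None, {"title": "Paper B", "date": "2024"})
name_3 = cache_name("graphene", {"qs": "graphene", "date": "2024"})
```

With this rule the two title-only searches no longer collide. Two searches with the same query but different keyword arguments would still share a name, which the thread does not address.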

Michael-E-Rose avatar May 23 '25 07:05 Michael-E-Rose

It will be a problem if we remove functionality from the current classes. They're in use already. And frankly I prefer more data even if it is retrieved in a non-recommended way than less data retrieved the right way.

I would put the PR on hold until ScienceDirect removes the GET method altogether.

Michael-E-Rose avatar Jun 16 '25 10:06 Michael-E-Rose

The GET method only gives us two extra fields, both of which we could reconstruct (details below).

| Old Method (GET)  | New Method (PUT) |
|-------------------|------------------|
| authors           | authors          |
| first_author      | N/A              |
| doi               | doi              |
| title             | title            |
| link              | uri              |
| load_date         | loadDate         |
| openaccess_status | openAccess       |
| pii               | pii              |
| coverDate         | publicationDate  |
| endingPage        | last_page        |
| publicationName   | sourceTitle      |
| startingPage      | first_page       |
| api_link          | N/A              |
| volume            | volumeIssue      |

The PUT method also returns the order of the authors:

{'authors': [{'order': 1, 'name': 'Constantinos Patsakis'},
             {'order': 2, 'name': 'Fran Casino'},
             {'order': 3, 'name': 'Nikolaos Lykousas'}]}

The api_link can be reconstructed with the pii: f"https://api.elsevier.com/content/article/pii/{pii}"
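
As a sketch, both missing fields could be derived from a PUT record roughly like this. The helper name `reconstruct_get_fields` and the placeholder pii are hypothetical, and the exact format of the old first_author value is not shown in the thread, so using the author's name as-is is an assumption.

```python
from typing import Any, Dict, Optional


def reconstruct_get_fields(record: Dict[str, Any]) -> Dict[str, Optional[str]]:
    """Derive first_author and api_link from a PUT-style record.

    Hypothetical helper; assumes each author entry has 'order' and 'name'.
    """
    authors = sorted(record.get("authors") or [], key=lambda a: a["order"])
    first_author = authors[0]["name"] if authors else None
    pii = record.get("pii")
    api_link = (f"https://api.elsevier.com/content/article/pii/{pii}"
                if pii else None)
    return {"first_author": first_author, "api_link": api_link}


record = {
    "pii": "S0000000000000000",  # placeholder, not a real identifier
    "authors": [{"order": 2, "name": "Fran Casino"},
                {"order": 1, "name": "Constantinos Patsakis"},
                {"order": 3, "name": "Nikolaos Lykousas"}],
}
extra = reconstruct_get_fields(record)
```

Sorting by order first means the result does not depend on the order in which the API lists the authors.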

We could also keep the field names of the old method (GET)
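
Keeping the old names could be as simple as a rename step over each record, sketched below with the pairs from the comparison above. `PUT_TO_GET` and `to_get_names` are hypothetical names, and the sketch only renames keys: it does not convert values, so for example volumeIssue may still hold more than the old volume field did.

```python
from typing import Any, Dict

# New (PUT) field name -> old (GET) field name, per the comparison above.
PUT_TO_GET = {
    "uri": "link",
    "loadDate": "load_date",
    "openAccess": "openaccess_status",
    "publicationDate": "coverDate",
    "last_page": "endingPage",
    "sourceTitle": "publicationName",
    "first_page": "startingPage",
    "volumeIssue": "volume",
}


def to_get_names(record: Dict[str, Any]) -> Dict[str, Any]:
    """Rename PUT keys to their GET equivalents; other keys pass through."""
    return {PUT_TO_GET.get(key, key): value for key, value in record.items()}


renamed = to_get_names({"uri": "x", "doi": "10.1/abc", "sourceTitle": "J"})
```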

Finally, what I find most problematic about GET is that the results cannot be filtered by date, which leads to misleading counts.

nils-herrmann avatar Jun 17 '25 15:06 nils-herrmann

Ok, then we can keep it. But in any case, this requires a new major version. Not soon, though.

Michael-E-Rose avatar Jun 18 '25 18:06 Michael-E-Rose