pybliometrics
Add cache status to objects
For any of the data entities (i.e. AuthorRetrieval, ContentAffiliationRetrieval, AbstractRetrieval, and conceivably also the search types) it would be helpful to include a property/method that indicates whether a local data cache already exists for that entity and, if so, how old it is. This would allow a script to check whether the data needs to be fetched/refreshed from the REST endpoint, which in turn can be used to apply throttling when needed.
Background: Note that the Scopus API endpoints enforce throttling; any requests that exceed the default requests-per-second limit will fail. Also, any client that continuously exceeds throttling limits risks having its API key suspended. This means that the client needs to monitor and control the rate at which it calls the API to avoid such failed requests, e.g. by including a timeout (`sleep`) when looping over API calls. The challenge is that this timeout is not necessary when initiating a retrieval/search object for which a cache already exists, because no API call is made for cached objects. In fact, the timeout is counterproductive there: looping with a timeout over a series of cached objects makes initiating them take longer than needed, which unnecessarily increases program run time.
(A more elegant approach would be for pybliometrics to enforce throttling itself, e.g. by building a timeout into the `get_content.py` module. That, however, requires the module to persist the timestamp of the last request made to api.elsevier.com one way or another, which isn't trivial: the timestamp either needs to be persisted on disk or maintained in memory, like the elsapy library does.)
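For illustration, the in-memory variant could look roughly like the sketch below. It is not taken from pybliometrics or elsapy; the `Throttle` class, its `wait()` method, and the `min_interval` value are hypothetical names chosen for this example, and the interval used is not the actual Scopus limit.

```python
import time


class Throttle:
    """Hypothetical helper: remember the last request time in memory."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval  # minimum seconds between requests
        self._clock = clock               # injectable for testing
        self._sleep = sleep               # injectable for testing
        self._last_request = None         # no request has been made yet

    def wait(self):
        """Pause until min_interval has passed since the previous request.

        Returns the number of seconds slept (0.0 if no pause was needed).
        """
        delay = 0.0
        if self._last_request is not None:
            elapsed = self._clock() - self._last_request
            delay = max(0.0, self.min_interval - elapsed)
        if delay > 0:
            self._sleep(delay)
        # Record the moment the caller is allowed to send the request.
        self._last_request = self._clock()
        return delay
```

If `wait()` is called only immediately before a request is actually sent, objects served from the cache never trigger a pause, which addresses the run-time concern described above. The state lives in the Python process only, so it is lost between script runs.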
Hi @ale-de-vries and thanks so much for this issue. You raise many connected issues, all of which are worth thinking about!
I respond in reverse order:
1. We cannot enforce throttling globally in pybliometrics (across different queries) without a lot of changes to the backend. But we can easily slow down requests within one query. A colleague of mine actually experimented with this once in an effort to reduce the number of broken requests and missing data within one query, but to no avail. Still, if it should help in principle, let's do it.
2. I have also long thought about adding a property to all classes that tells the user when the file was last cached (i.e. created or modified). Doing so requires a new base class from which both the Search() and the Retrieval() classes inherit. Getting the modified timestamp via `os` is easy.
3. Using the timestamp from 2., I plan to adapt the `refresh` parameters slightly. Users will be able to provide an integer in addition to a boolean. The integer will be interpreted as the maximum age of the cache in days. If the file is older than the provided value, pybliometrics refreshes the file.
4. Given these, I don't see much point in having a property that tells the user whether the file has been cached or not. For one, there is the `download` parameter in the search classes. If it's set to `False` and the file exists, the relevant parameters are still filled. So that's how users see whether the file exists. Second, I don't see a use case for having information on the cache status if it's not True. That is, why would someone be interested in knowing whether the cached file is already there and then decide not to retrieve the corresponding information? Of course, I am open to discussion here.
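As a rough sketch of how a `refresh` value that is either a boolean or a maximum age in days could be evaluated against the file's modification time: the function names `cache_age_days` and `needs_download` and the `now` argument are assumptions made for this example, not the pybliometrics implementation.

```python
import os
import time


def cache_age_days(path, now=None):
    """Return the age of the cached file in days, or None if it is missing."""
    if not os.path.exists(path):
        return None
    now = time.time() if now is None else now
    # getmtime returns the last modification time in seconds since the epoch.
    return (now - os.path.getmtime(path)) / 86400


def needs_download(path, refresh=False, now=None):
    """Decide whether the file at `path` has to be (re)fetched.

    refresh=True  -> always fetch again
    refresh=False -> fetch only if no cached file exists
    refresh=<int> -> fetch if the cached file is older than that many days
    """
    age = cache_age_days(path, now)
    if age is None:
        return True
    # bool is a subclass of int, so it must be checked first.
    if isinstance(refresh, bool):
        return refresh
    return age > refresh
```

A script could combine the result with a pause, sleeping only when `needs_download(...)` returns `True`, so that cached entries are loaded without any delay.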
With fde4a8c81f3a2f6a9de99b3dd18ae9f22c519689, any pybliometrics class can show how old the cached file is. That's certainly a good step in the right direction.
Throttling implemented in e32c349a00f29c83c5a7a92e2807cab3aa7748ef