metha
Why is data only harvested up to the last day?
The readme says
Currently, there is a limitation which only allows to harvest data up to the last day. Example: If the current date would be Thu Apr 21 14:28:10 CEST 2016, the harvester would request all data since the repositories earliest date and 2016-04-20 23:59:59.
which is indeed the current behavior. Do you remember what the reason for this limitation is? Is it something inherent in the OAI protocol, or does it come from somewhere else?
I'm using metha to harvest the arXiv and am curious about this one-day delay.
It is not a limitation of the protocol but an implementation tradeoff, one that I'd like to fix in a future version.
Basically: OAI allows two date granularities, day and second. To keep a single filename pattern on disk (e.g. 2018-04-30-00000000.xml.gz), we used the coarser granularity. We also wanted to avoid having to check for duplicates, which would occur, for example, when requesting an endpoint that only supports day granularity every hour.
It's not ideal, and I already have some prototypes for more seamless handling; I just need to weave them into metha.
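To make the tradeoff concrete, here is a minimal sketch in Python of how a day-granularity upper bound and a date-based filename could be derived. This is not metha's actual code (metha is written in Go); the function names `harvest_until` and `cache_filename` are hypothetical, and treating the eight-digit suffix as a sequence counter is an assumption based only on the example filename above.

```python
from datetime import date, datetime, time, timedelta


def harvest_until(now: datetime) -> datetime:
    """Return the last second of the day before `now`.

    With day granularity, only complete days are requested, so the
    upper bound of the harvest window is yesterday at 23:59:59.
    """
    yesterday = now.date() - timedelta(days=1)
    return datetime.combine(yesterday, time(23, 59, 59))


def cache_filename(day: date, seq: int = 0) -> str:
    """Build a filename following the 2018-04-30-00000000.xml.gz pattern.

    Assumption: the eight-digit suffix is a zero-padded counter.
    """
    return f"{day.isoformat()}-{seq:08d}.xml.gz"


if __name__ == "__main__":
    # The example from the readme: Thu Apr 21 14:28:10 2016.
    now = datetime(2016, 4, 21, 14, 28, 10)
    print(harvest_until(now))                # 2016-04-20 23:59:59
    print(cache_filename(date(2018, 4, 30)))  # 2018-04-30-00000000.xml.gz
```

Because the window always ends on a completed day, rerunning the harvester several times within the same day requests the same range, so no duplicate check is needed. The cost is the delay of up to one day discussed here.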
Thanks, that's fair enough. If you'd like some help with or review of any of those prototypes or their design or implementation, I'd be happy to be of assistance.
Thanks.