metha icon indicating copy to clipboard operation
metha copied to clipboard

Why is data only harvested up to the last day?

Open gunnihinn opened this issue 5 years ago • 3 comments

The readme says

Currently, there is a limitation which only allows to harvest data up to the last day. Example: If the current date would be Thu Apr 21 14:28:10 CEST 2016, the harvester would request all data since the repositories earliest date and 2016-04-20 23:59:59.

which is indeed the current behavior. Do you remember what the reason for this limitation is? Is it something inherent in the OAI protocol, or does it come from somewhere else?

I'm using metha to harvest the arXiv and am curious about this one-day delay.

gunnihinn avatar Mar 25 '19 15:03 gunnihinn

It is not a limitation of the protocol, but a implementation tradeoff - that I'd like to fix in some future version.

Basically: OAI allows two date granularities, day and second. In order to have a single filename type on disk (e.g. 2018-04-30-00000000.xml.gz), we used the coarser granularity. Also, we wanted to avoid having to check for duplicates (e.g. when requesting an endpoint, that only supports day-granularity every hour).

It's not ideal, and I have some prototypes for more seamless handling, already - just need to weave it into metha.

miku avatar Mar 25 '19 16:03 miku

Thanks, that's fair enough. If you'd like some help with or review of any of those prototypes or their design or implementation, I'd be happy to be of assistance.

gunnihinn avatar Mar 25 '19 21:03 gunnihinn

If you'd like some help with or review of any of those prototypes or their design or implementation, I'd be happy to be of assistance.

Thanks.

miku avatar Mar 26 '19 09:03 miku