metha
Selective Harvesting and metha-cat
Hi @miku,
We are adding more and more OAI-PMH endpoints and metha does a great job!
I have a question about selective harvesting and metha-cat. I have automated harvesting via crontab.
After an initial harvest that gets all records from the earliest day on, we do one selective harvest a week:
metha-sync -T 5m -r 20 -base-dir /mydir -format marcxml https://zenodo.org/oai2d
Since all previous harvests are written to /mydir (local cache), metha-sync implicitly sets the -from param according to the last harvest, correct?
Now with metha-cat (without providing a timestamp), I have observed that more records are returned in the virtual XML than are actually in the repo, so I assume the output also includes updates of a record (so the same record can occur multiple times in metha-cat's output). Is this interpretation correct?
EDIT: What I'd like to get is the latest version of each record via metha-cat.
Thanks and kind regards,
Tobias
Sorry for my overly delayed reply.
> Since all previous harvests are written to /mydir (local cache), metha-sync implicitly sets the -from param according to the last harvest, correct?
Yes.
> Now with metha-cat (without providing a timestamp), I have observed that more records are returned in the virtual XML than are actually in the repo, so I assume the output also includes updates of a record (so the same record can occur multiple times in metha-cat's output). Is this interpretation correct?
Yes.
> EDIT: What I'd like to get is the latest version of each record via metha-cat.
Yes, I understand. metha does not do much beyond caching responses so that subsequent invocations are faster (something I haven't seen in many other tools). So, to be on the safe side with respect to updates, one can always delete the cache for a particular endpoint and start anew.
$ rm -r "$(metha-sync -dir http://my.server.org)"
That of course requires some tolerance of possibly stale records - depending on the requirements.
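If wiping the cache is too heavy-handed, the duplicates could also be collapsed after the fact. The following is only a rough sketch, not something metha ships: it assumes the metha-cat output is a single XML document whose record elements each carry an OAI-PMH header with an identifier and a datestamp, and it keeps the record with the newest datestamp per identifier. The function name latest_records and the inline sample data are made up for illustration; element names are matched without regard to namespace or case because the exact wrapper elements are an assumption here.

```python
import io
import xml.etree.ElementTree as ET


def local(tag):
    """Strip an XML namespace and lowercase: '{ns}Record' -> 'record'."""
    return tag.rsplit("}", 1)[-1].lower()


def latest_records(stream):
    """Return {identifier: (datestamp, record_xml)}, keeping the newest datestamp.

    Assumes all datestamps of one endpoint share the same granularity, so that
    the UTC ISO 8601 strings can be compared lexicographically.
    """
    latest = {}
    for _, elem in ET.iterparse(stream, events=("end",)):
        if local(elem.tag) != "record":
            continue
        # Metadata formats such as MARCXML also contain <record> elements;
        # those have no OAI header child and are skipped untouched.
        header = next((c for c in elem if local(c.tag) == "header"), None)
        if header is None:
            continue
        fields = {local(c.tag): (c.text or "").strip() for c in header}
        ident = fields.get("identifier")
        stamp = fields.get("datestamp", "")
        # '>=' lets a later occurrence win when two datestamps are equal.
        if ident and (ident not in latest or stamp >= latest[ident][0]):
            latest[ident] = (stamp, ET.tostring(elem, encoding="unicode"))
        elem.clear()  # free the parsed record once it has been serialized
    return latest


# Hypothetical sample standing in for metha-cat output.
SAMPLE = b"""<Records xmlns="http://www.openarchives.org/OAI/2.0/">
<record><header><identifier>oai:example.org:1</identifier><datestamp>2020-01-01T00:00:00Z</datestamp></header><metadata><title>old title</title></metadata></record>
<record><header><identifier>oai:example.org:2</identifier><datestamp>2020-01-02T00:00:00Z</datestamp></header><metadata><title>other</title></metadata></record>
<record><header><identifier>oai:example.org:1</identifier><datestamp>2020-02-01T00:00:00Z</datestamp></header><metadata><title>new title</title></metadata></record>
</Records>"""

for ident, (stamp, _) in latest_records(io.BytesIO(SAMPLE)).items():
    print(ident, stamp)
```

To run this against a real harvest, one would pass sys.stdin.buffer instead of the sample and pipe metha-cat into the script. Note that the sketch holds one serialized record per identifier in memory, and that a header marked status="deleted" is kept as the latest version like any other, so deletions would still need to be filtered separately if required.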
No problem and thanks for your response. I'll have a closer look at an endpoint's cache where I assume that a lot of updated records flow in.
Otherwise, metha works nicely and is stable :-) It has been part of our automated workflow for a couple of months.