inspire-next
HoldingPen: arXiv reappears
Expected Behavior
Harvests of an article we harvested previously should be filtered out.
I'm not sure whether this is already_harvested or previously_rejected. In any case: @fschwenn and I don't want to see an article more than once in the holdingpen.
Current Behavior / Example
An arXiv article (1707.02681) was harvested and rejected yesterday (https://labs.inspirehep.net/holdingpen/675354). Today another harvest of the same article is waiting in the HP again (https://labs.inspirehep.net/holdingpen/676378).
https://labs.inspirehep.net/holdingpen/list/?page=1&size=10&q=1707.02681
Screenshots (if appropriate):
This seems to happen even for pending articles. https://labs.inspirehep.net/holdingpen/677472 was harvested while https://labs.inspirehep.net/holdingpen/676633 was already waiting for a decision. It has not been updated on arXiv (https://arxiv.org/abs/1707.02998) meanwhile.
This is normal until we merge #2423. The current situation is that we don't know whether something is an update or not. So there is a stupid check: if something has a creation date older than 2 weeks, it's rejected on the assumption that it's a duplicate. Once #2423 is merged (which will happen as soon as we implement the filtering of updated vs. new records I mentioned on standup), you will see updates treated as updates.
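For readers following along, the two-week check described above could look roughly like the sketch below. This is not the code in inspire-next: the function name, the `max_age` parameter, and the choice of comparing plain dates are all assumptions made for illustration.

```python
from datetime import date, timedelta

# "2 weeks" comes from the comment above; everything else is hypothetical.
MAX_AGE = timedelta(days=14)


def is_assumed_duplicate(created, today, max_age=MAX_AGE):
    """Return True when the creation date is more than max_age before today.

    Such a record is assumed to have been harvested already and is
    rejected without checking whether it really is a duplicate.
    """
    return (today - created) > max_age
```

Shrinking the window, as suggested further down in this thread, would only mean passing a smaller `max_age`, e.g. `timedelta(days=4)`.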
Re: what @fschwenn says... mmh...
The record you mention was last modified on the 12th according to arXiv, so it makes sense that it was harvested on the 13th (assuming my explanation that we are not yet able to distinguish updates).
We don't need the merger. We need to know whether we saw it before. I.e. it should be enough to search for the arXiv ID in INSPIRE: if it's there, it's an update, so do whatever was done on legacy; if it's not in INSPIRE, query the holdingpen, and if it's there, reject automatically. At the very least, reduce the time window to max. 4 days. Even on holidays and weekends I guess we get the first harvest within 4 days.
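The lookup order proposed above can be written down as a small sketch. The two sets stand in for real searches against INSPIRE and the holdingpen; `classify_harvest` and its return labels are hypothetical names, not part of inspire-next.

```python
def classify_harvest(arxiv_id, inspire_ids, holdingpen_ids):
    """Decide what to do with an incoming arXiv harvest.

    inspire_ids and holdingpen_ids are plain sets of arXiv IDs used here
    as stand-ins for a search in INSPIRE and in the holdingpen.
    """
    if arxiv_id in inspire_ids:
        # Already a record in INSPIRE: treat the harvest as an update.
        return "update"
    if arxiv_id in holdingpen_ids:
        # Seen in the holdingpen before but not in INSPIRE: reject.
        return "reject"
    # Never seen anywhere: a genuinely new record for the curators.
    return "new"
```

The order matters: INSPIRE is checked first so that an accepted record is never auto-rejected just because its original workflow is still listed in the holdingpen.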
there is no difference in the metadata for 676633 and 677472, maybe a status-change triggered the update. diff_676633.vs.677472.txt
we don't need the merger.
@ksachs we do, because it's all implemented in #2423, which is basically ready. No point in adding another hack that will only postpone having the functionality.
OK, understood. So we'll continue with the DESY workflow until then.
Yeah, it's still great if you monitor, just ignore the issue of duplicate harvesting.
Just a quick check of astro articles from Wednesday that were not harvested directly to INSPIRE and should be handled via HP (as far as I understand it). I don't know how to query HP via Python, otherwise I would have checked all subjects.
not in HP
[1707.03130, 1707.03155, 1707.03179, 1707.03180, 1707.03343, 1707.03345, 1707.03390]
once in HP
[1707.02980, 1707.02982, 1707.02985, 1707.02989, 1707.03064, 1707.03120, 1707.03121, 1707.03170, 1707.03171, 1707.03208]
twice in HP
[1707.03009, 1707.03026, 1707.03105, 1707.03108, 1707.03118, 1707.03256, 1707.03356, 1707.02990, 1707.02996]
3-times in HP
[1707.02997, 1707.03078]
Just to show you that @fschwenn and I have to reject (almost) everything twice. Every day! And usually independently - the records come in again the next day. It's super annoying. What I'm trying to say: the matcher is good enough - just looking for IDs. It's mostly records that are rejected, i.e. no merger needed. You would do us an enormous favor if these double records could be avoided.
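A tally like the one above can be produced with a few lines of Python once the list of holdingpen entries is available. How to fetch that list is left open here; the sketch assumes it is already in hand as a list with one arXiv ID per holdingpen entry, and `group_by_occurrences` is a hypothetical helper name.

```python
from collections import Counter, defaultdict


def group_by_occurrences(expected_ids, holdingpen_ids):
    """Group expected arXiv IDs by how often they appear in the holdingpen.

    expected_ids: the arXiv IDs that should have gone through the HP.
    holdingpen_ids: one arXiv ID per holdingpen entry, repeats included.
    Returns a dict mapping an occurrence count to the list of IDs seen
    that many times (0 means the ID is not in the HP at all).
    """
    counts = Counter(holdingpen_ids)
    groups = defaultdict(list)
    for arxiv_id in expected_ids:
        # A Counter returns 0 for IDs that never occurred.
        groups[counts[arxiv_id]].append(arxiv_id)
    return dict(groups)
```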
Indeed @ammirate and I confirm that there is such matching already in place but only for holding pen entries that are pending, not completed ones.
If we do decide to match completed ones as well, it means we all agree that, once a record is rejected, no amount of additional information could change this decision. Are we all OK with this?
This means permanent reject of records.
BTW here's the line of code to amend this rule: https://github.com/inspirehep/inspire-next/blob/master/inspirehep/modules/workflows/tasks/matching.py#L303
could it be done only for arXiv? if a rejected arXiv eprint pops up again in a user submission/journal harvest we might want to reconsider.
it doesn't work for pending ones either - otherwise we would not see identical records twice waiting for a decision.
We can safely reject 'identical' records. I.e. same ID from same source. Maybe excluding user submissions. It applies both for arXiv and re-appearing records from publishers.
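As a sketch of the "same ID from same source" rule, with user submissions left out: the field names `source` and `identifier`, the label `"submitter"`, and the function name are assumptions for illustration and do not reflect the actual workflow object layout in inspire-next.

```python
def should_auto_reject(incoming, seen_keys, exempt_sources=("submitter",)):
    """Return True when the incoming record is 'identical' to an earlier one.

    incoming: dict with hypothetical keys "source" and "identifier".
    seen_keys: set of (source, identifier) pairs already handled.
    exempt_sources: sources that are never auto-rejected, so a curator
    can reconsider (user submissions in this sketch).
    """
    if incoming["source"] in exempt_sources:
        return False
    return (incoming["source"], incoming["identifier"]) in seen_keys
```

Because the key includes the source, a rejected arXiv eprint that later shows up in a journal harvest or a user submission would still reach the curators.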
I haven't seen re-appearing records for a long time. Here is one: https://labs.inspirehep.net/holdingpen/list/?page=1&size=10&q=1712.06950
Maybe this is due to @david-caro messing around with the holdingpen. Just to let you know.
That harvest is an update of the old one, that was actually accepted. Shouldn't we be taking updates?
As for it being halted for curator approval: that should be fixed. What happened is that the accepted record was in WAITING state, waiting for the webcoll callback from legacy, and during that lapse the update came in. It seems, though, that there was some issue with prodsync, and the record was not pushed to labs until 4 hours after the update came in, so the update was not detected as such.
As such, it just passed by, and stopped the workflow in 'waiting' state, as it has newer info. This will not happen once labs is master, but in the meantime we might want to handle those cases.
It might also be related to the reindexing + remigration of all the records that happened tonight (introducing some delays and such, though I don't see how right away).
Hi David,
I doubt updates via labs work. If I accept the second record it would just cause an error for the upload job. Updates are handled on legacy directly as far as I know.
They should work if they are detected as such. We are only considering the last 15 days of updates from the arXiv categories that are not harvested on legacy, and they are not yet auto-accepted (so they show up). Once we have the merger machinery in, they will only be halted if there are any conflicts.
more examples: https://labs.inspirehep.net/holdingpen/list/?page=1&size=10&q=1801.00565 https://labs.inspirehep.net/holdingpen/list/?page=1&size=10&q=1801.00491
yep, those are updates. The error they show:
Traceback (most recent call last):
  File "/opt/inspire/lib/python2.7/site-packages/workflow/engine.py", line 529, in _process
    self.run_callbacks(callbacks, objects, obj)
  File "/opt/inspire/lib/python2.7/site-packages/workflow/engine.py", line 454, in run_callbacks
    indent + 1)
  File "/opt/inspire/lib/python2.7/site-packages/workflow/engine.py", line 454, in run_callbacks
    indent + 1)
  File "/opt/inspire/lib/python2.7/site-packages/workflow/engine.py", line 481, in run_callbacks
    self.execute_callback(callback_func, obj)
  File "/opt/inspire/lib/python2.7/site-packages/workflow/engine.py", line 564, in execute_callback
    callback(obj, self)
  File "/opt/inspire/src/inspire/inspirehep/modules/workflows/utils/__init__.py", line 143, in _decorator
    res = func(*args, **kwargs)
  File "/opt/inspire/src/inspire/inspirehep/modules/workflows/tasks/upload.py", line 42, in store_record
    record = InspireRecord.get_record(obj.extra_data['head_uuid'])
KeyError: 'head_uuid'
Should be fixed in the next deploy (and also, the auto-approval, so you will not see any more halted updates either).