Replace A_TIMESTAMP with A_FETCH_BEGAN_TIME
It seems that A_TIMESTAMP went out of favor quite a long time ago. A_FETCH_BEGAN_TIME is used within FetchHistoryProcessor and throws an exception as is because of it.
This seems reasonable but I'm not familiar enough with this module to confidently review this. I'm hoping someone more knowledgeable about this comments, otherwise I'll merge it in a few days if we don't hear anything.
This is @kngenie's code. I don't know if it's still used.
I was trying to use it in conjunction with the HBase recrawl modules but everything besides these class seems to use A_FETCH_BEGAN_TIME and so they're writing to the wrong attribute and FetchHistoryProcessor throws an exception here: https://github.com/internetarchive/heritrix3/blob/9927e1325f132adc4e31d8bf7a990f3c3a928004/modules/src/main/java/org/archive/modules/recrawl/FetchHistoryProcessor.java#L125
We use A_TIMESTAMP but can switch if that makes sense. However, I'm having trouble seeing what exactly you are referring to @hennekey because your links go to your private GitHub Enterprise instance. Any chance you can open it up? Or refer to this repo instead?
@anjackson Sorry about that, I've updated the link to the code.
Any explanation about how you make use of A_TIMESTAMP would be helpful so I can do the right thing.
Thanks, but I think your original link refers to:
https://github.com/internetarchive/heritrix3/blob/05811705ed996122bea1f4e034c1ed5f7a07240f/contrib/src/main/java/org/archive/modules/recrawl/wbm/WbmPersistLoadProcessor.java#L519-L521
?
We use it in much the same way, I think. We use it for deduplication and for checking how recently we visited to see if we need to re-crawl yet. But it looks like I hit similar issues because I populate both fields.
I'm not sure how important these semantics are -- the timestamp in question may not necessarily be the time the fetch began, but in practice I think it always is, so I think it's fine to reduce it all to that name.
Yeah, I did reference that comment.
It occurs to me now that perhaps semantically A_TIMESTAMP is meant to be the time the history map entry was created for that CrawlURI and A_FETCH_BEGAN_TIME is more specifically the point in time that the fetch action started.
I defer to you to the semantics; you've obviously got more experience with it. I'd be happy to make whatever changes we deem necessary as a result.
Your use of this constant also makes it clear that it's a more dangerous change than I thought. It is public afterall so there may be other users.
Perhaps the proper solution is to clarify the semantics and use the proper value in FetchHistoryProcessor instead?