heritrix3 icon indicating copy to clipboard operation
heritrix3 copied to clipboard

Replace A_TIMESTAMP with A_FETCH_BEGAN_TIME

Open hennekey opened this issue 5 years ago • 8 comments

It seems that A_TIMESTAMP went out of favor quite a long time ago. A_FETCH_BEGAN_TIME is used within FetchHistoryProcessor and throws an exception as is because of it.

hennekey avatar Jan 14 '20 20:01 hennekey

This seems reasonable but I'm not familiar enough with this module to confidently review this. I'm hoping someone more knowledgeable about this comments, otherwise I'll merge it in a few days if we don't hear anything.

ato avatar Jan 16 '20 06:01 ato

This is @kngenie's code. I don't know if it's still used.

nlevitt avatar Jan 16 '20 19:01 nlevitt

I was trying to use it in conjunction with the HBase recrawl modules but everything besides these class seems to use A_FETCH_BEGAN_TIME and so they're writing to the wrong attribute and FetchHistoryProcessor throws an exception here: https://github.com/internetarchive/heritrix3/blob/9927e1325f132adc4e31d8bf7a990f3c3a928004/modules/src/main/java/org/archive/modules/recrawl/FetchHistoryProcessor.java#L125

hennekey avatar Jan 16 '20 19:01 hennekey

We use A_TIMESTAMP but can switch if that makes sense. However, I'm having trouble seeing what exactly you are referring to @hennekey because your links go to your private GitHub Enterprise instance. Any chance you can open it up? Or refer to this repo instead?

anjackson avatar Jan 22 '20 23:01 anjackson

@anjackson Sorry about that, I've updated the link to the code.

Any explanation about how you make use of A_TIMESTAMP would be helpful so I can do the right thing.

hennekey avatar Jan 23 '20 14:01 hennekey

Thanks, but I think your original link refers to:

https://github.com/internetarchive/heritrix3/blob/05811705ed996122bea1f4e034c1ed5f7a07240f/contrib/src/main/java/org/archive/modules/recrawl/wbm/WbmPersistLoadProcessor.java#L519-L521

?

We use it in much the same way, I think. We use it for deduplication and for checking how recently we visited to see if we need to re-crawl yet. But it looks like I hit similar issues because I populate both fields.

I'm not sure how important these semantics are -- the timestamp in question may not necessarily be the time the fetch began, but in practice I think it always is, so I think it's fine to reduce it all to that name.

anjackson avatar Jan 23 '20 15:01 anjackson

Yeah, I did reference that comment. It occurs to me now that perhaps semantically A_TIMESTAMP is meant to be the time the history map entry was created for that CrawlURI and A_FETCH_BEGAN_TIME is more specifically the point in time that the fetch action started. I defer to you to the semantics; you've obviously got more experience with it. I'd be happy to make whatever changes we deem necessary as a result.

hennekey avatar Jan 23 '20 15:01 hennekey

Your use of this constant also makes it clear that it's a more dangerous change than I thought. It is public afterall so there may be other users.

Perhaps the proper solution is to clarify the semantics and use the proper value in FetchHistoryProcessor instead?

hennekey avatar Jan 23 '20 16:01 hennekey