
resolving a surt-timestamp collision should not replace a non-revisit record with a revisit record

Open galgeek opened this issue 5 years ago • 9 comments

I observed this with fuzzy match handling; I'm not sure whether this is an issue for other records.

The original records:

com,twitter)/i/videos/tweet/731859894710718465?conviva_environment=test&embed_source=clientlib&player_id=1&rpc_init=1 20160805135628 https://twitter.com/i/videos/tweet/731859894710718465?embed_source=clientlib&player_id=1&rpc_init=1&conviva_environment=test text/html 200 OQCJZ7JKWZI37BWVSOJMM7MYHARP5XK4 - - 3163 294009561 ...
com,twitter)/i/videos/tweet/731859894710718465?conviva_environment=test&embed_source=clientlib&player_id=2&rpc_init=1 20160805135628 https://twitter.com/i/videos/tweet/731859894710718465?embed_source=clientlib&player_id=2&rpc_init=1&conviva_environment=test warc/revisit 0 OQCJZ7JKWZI37BWVSOJMM7MYHARP5XK4 - - 1068 294017704 ...

Only one record returned from outbackcdx after ingest:

fuzzy:com,twitter)/i/videos/tweet/731859894710718465? 20160805135628 https://twitter.com/i/videos/tweet/731859894710718465?embed_source=clientlib&player_id=2&rpc_init=1&conviva_environment=test warc/revisit 0 OQCJZ7JKWZI37BWVSOJMM7MYHARP5XK4 - - 1068 294017704  ...

Note that I was able to update this record properly by re-posting only the status 200 record.
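For reference, here's a minimal sketch of the collision (hypothetical names, with a plain Java map standing in for OutbackCDX's RocksDB index, not the actual code): both captures reduce to the same fuzzy urlkey and share a timestamp, so whichever record is written last wins.

```java
import java.util.HashMap;
import java.util.Map;

public class CollisionSketch {
    public static void main(String[] args) {
        // Both captures reduce to the same fuzzy urlkey and the same second.
        String urlkey = "fuzzy:com,twitter)/i/videos/tweet/731859894710718465?";
        String timestamp = "20160805135628";
        String key = urlkey + " " + timestamp;

        Map<String, String> index = new HashMap<>(); // stand-in for the RocksDB index

        // First write: the status-200 text/html record.
        index.put(key, "text/html 200 OQCJZ7JKWZI37BWVSOJMM7MYHARP5XK4 3163 294009561");
        // Second write: the warc/revisit record with the same key silently replaces it.
        index.put(key, "warc/revisit 0 OQCJZ7JKWZI37BWVSOJMM7MYHARP5XK4 1068 294017704");

        // Only the revisit survives, matching the single record returned above.
        System.out.println(index.get(key));
    }
}
```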

galgeek avatar Jan 16 '20 01:01 galgeek

Oh, that's unfortunate. It's a pity we can't use merge operators from Java: doing a get before each put seems likely to have a significant effect on bulk data load performance. At least we'd only need to do the lookup for revisits, I guess.
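Something like this rough sketch is what I mean (Capture here is a hypothetical stand-in for the parsed record, not OutbackCDX's real classes): only pay for the extra get when the incoming record is a revisit, and skip the put if a non-revisit already holds the key.

```java
import java.nio.charset.StandardCharsets;
import org.rocksdb.ColumnFamilyHandle;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;

class RevisitAwarePut {
    // Hypothetical minimal capture record; the real one carries all the CDX fields.
    record Capture(String mimetype, byte[] encodedValue) {}

    static void put(RocksDB db, ColumnFamilyHandle cf, byte[] key, Capture capture)
            throws RocksDBException {
        if ("warc/revisit".equals(capture.mimetype())) {
            byte[] existing = db.get(cf, key);   // extra read, but only on the revisit path
            if (existing != null && !isRevisit(existing)) {
                return;                          // keep the existing non-revisit record
            }
        }
        db.put(cf, key, capture.encodedValue());
    }

    static boolean isRevisit(byte[] encodedValue) {
        // Hypothetical: would decode the stored value and inspect its MIME type.
        return new String(encodedValue, StandardCharsets.UTF_8).contains("warc/revisit");
    }
}
```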

ato avatar Jan 16 '20 03:01 ato

We've also observed this with redirects, where the 30x overwrites a 200 record for the same key (mostly http->https redirects) because the indexer happened to process the WARC records in that order.

This is breaking things for us, and because we do large crawls spread over many WARCs, I'm not sure how to fix it. I guess maybe a second pass on the indexer that only posts 200s?
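One way to do that second pass, as a rough sketch only (assuming the CDX11 field order, with the status code in the fifth field, and a placeholder collection URL): re-read the CDX and POST only the status-200 lines so they end up winning.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class RepostOkRecords {
    public static void main(String[] args) throws IOException {
        URL endpoint = new URL("http://localhost:8080/mycollection"); // placeholder collection URL

        HttpURLConnection conn = (HttpURLConnection) endpoint.openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);

        try (BufferedReader in = new BufferedReader(new FileReader(args[0], StandardCharsets.UTF_8));
             Writer out = new OutputStreamWriter(conn.getOutputStream(), StandardCharsets.UTF_8)) {
            String line;
            while ((line = in.readLine()) != null) {
                String[] fields = line.split(" ");
                if (fields.length > 4 && "200".equals(fields[4])) { // status is the fifth field
                    out.write(line);
                    out.write('\n');
                }
            }
        }
        System.out.println("Response: " + conn.getResponseCode());
    }
}
```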

anjackson avatar Jan 20 '20 10:01 anjackson

I guess another solution would be to move the original url into the database key to break the tie. That'd mean all of them would be returned in search results, and it'd then be up to wayback to do something sensible in the face of the collision.

ato avatar Jan 20 '20 10:01 ato

FWIW, based on the not-wholly-unrelated https://github.com/ukwa/ukwa-pywb/issues/53 I think pywb expects to resolve these collisions.

EDIT: But I guess we'd also need to pull in the MIME type to avoid the original warc/revisit collision.

anjackson avatar Jan 20 '20 11:01 anjackson

Well, galgeek's original example had two different URLs (player_id=1 vs player_id=2) that were only the same under fuzzy rules. But yes, an exact revisit within the same second does sound plausible, particularly in a browser-based crawling scenario. If we add the content-type, the question simply becomes: is it a problem if revisits clobber each other?

I guess the real unique identifier for a record is the combination of WARC filename and byte offset. If those are the same, then it's the same record and we should clobber. If they differ, then it's not, and we should list it separately, even if everything else is the same.

So...

keyv3 = (urlkey, timestamp)
keyv4 = (urlkey, timestamp, filename, offset)

How to handle the upgrade will need some thought.

Edit: Changed v1/v2 to v3/v4 so we have one record version scheme rather than separate version numbers for keys and values. v0,1,2,3 all use the same key format.
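Purely illustrative (the real layout is up to the implementation), a v4 key could be encoded something like this, with the filename and offset appended after the urlkey and timestamp so that distinct WARC records never share a key, while bytewise ordering still groups by urlkey and sorts by timestamp:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

class KeyV4 {
    static byte[] encode(String urlkey, long timestamp, String filename, long offset)
            throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        out.write(urlkey.getBytes(StandardCharsets.US_ASCII));
        out.write(0);                 // separator so different urlkeys can't run together
        out.writeLong(timestamp);     // big-endian, so bytewise compare preserves time order
        out.write(filename.getBytes(StandardCharsets.UTF_8));
        out.write(0);
        out.writeLong(offset);
        return buf.toByteArray();
    }
}
```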

ato avatar Jan 20 '20 15:01 ato

I like this idea.

keyv2 = (urlkey, timestamp, filename, offset)

nlevitt avatar Jan 21 '20 00:01 nlevitt

Making the system handle upgrades nicely is going to be fairly complicated. Would it help to stage the process, e.g. starting by implementing this as a start-up/CLI option where you can choose the longer key (but have to use separate index files), and looking into migration later on once we've got the thing working?

anjackson avatar Feb 17 '20 11:02 anjackson

@anjackson see #78 for roughly what I had in mind (without migration logic). I haven't really tested it much yet.

ato avatar Feb 17 '20 13:02 ato

I think we could implement an online upgrade by roughly this process:

  1. Temporarily, until the index has a 'v4 migration complete' flag:
    • PUT should issue a delete for the v3 key and then put the v4 record.
    • DELETE should issue deletes for both the v3 and the v4 key.
  2. A background thread should walk the entire index, rewriting every v3 record to v4 (roughly sketched below). After all records have been visited, flag the index as 'v4 migration complete'. It might be a good idea to periodically save the key the migration is up to, so that it can resume if the process is restarted.

I'm not sure what the performance or disk space implications of doing this are. Mass deletes in RocksDB are a little odd due to tombstones. If an excess of tombstones becomes a problem during migration, it might be necessary to periodically call CompactRange too.
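Only as a very rough sketch (isV3Key, upgradeToV4 and markMigrationComplete are hypothetical helpers; checkpointing and error handling are left out), the background walk could look something like this:

```java
import org.rocksdb.ColumnFamilyHandle;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;
import org.rocksdb.RocksIterator;
import org.rocksdb.WriteBatch;
import org.rocksdb.WriteOptions;

class V4Migration {
    static void migrate(RocksDB db, ColumnFamilyHandle cf) throws RocksDBException {
        long rewrittenSinceCompaction = 0;
        try (RocksIterator it = db.newIterator(cf)) {
            for (it.seekToFirst(); it.isValid(); it.next()) {
                byte[] oldKey = it.key();
                if (!isV3Key(oldKey)) continue;                  // already migrated
                byte[] newKey = upgradeToV4(oldKey, it.value()); // append filename + offset
                try (WriteBatch batch = new WriteBatch();
                     WriteOptions opts = new WriteOptions()) {
                    batch.put(cf, newKey, it.value());
                    batch.delete(cf, oldKey);
                    db.write(opts, batch);                       // atomic put + delete
                }
                if (++rewrittenSinceCompaction >= 1_000_000) {   // arbitrary threshold
                    db.compactRange(cf);                         // clear accumulated tombstones
                    rewrittenSinceCompaction = 0;
                }
                // Periodically persist it.key() somewhere so the walk can resume after a restart.
            }
        }
        markMigrationComplete(db); // set the 'v4 migration complete' flag
    }

    // Hypothetical helpers, left unimplemented here.
    static boolean isV3Key(byte[] key) { throw new UnsupportedOperationException(); }
    static byte[] upgradeToV4(byte[] key, byte[] value) { throw new UnsupportedOperationException(); }
    static void markMigrationComplete(RocksDB db) { throw new UnsupportedOperationException(); }
}
```

Writing the new key and deleting the old one in a single WriteBatch keeps the pair atomic, so a crash mid-walk never leaves a record indexed under both key formats at once.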

ato avatar Feb 17 '20 13:02 ato