Improve Unstructured Aggregator

Open vringar opened this issue 4 years ago • 4 comments

The last crawl has some content_hashes that can't be found on GCS, so we are losing some data there. I'll need to investigate how many entries we are missing.

We're also currently storing the data uncompressed, which we probably shouldn't do as we start thinking about 1M site crawls.
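A minimal sketch of what compressing before upload could look like. This is not the current OpenWPM code: the function names are hypothetical, and hashing with SHA-256 over the uncompressed bytes is an assumption made so that the key stays the same whether or not the blob is compressed.

```python
import gzip
import hashlib
from typing import Tuple


def compress_blob(content: bytes) -> Tuple[str, bytes]:
    """Return (content_hash, gzip-compressed payload) for one blob.

    Assumption: the hash is taken over the *uncompressed* bytes, so the
    key does not depend on the compression settings.
    """
    content_hash = hashlib.sha256(content).hexdigest()
    return content_hash, gzip.compress(content)


def decompress_blob(payload: bytes) -> bytes:
    """Inverse of the compression step, for readers of the bucket."""
    return gzip.decompress(payload)
```

Anything reading from the bucket would then need to decompress first, so existing uncompressed crawls and new compressed ones would have to be told apart somehow (for example by object metadata or a key suffix).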

vringar avatar Apr 08 '21 18:04 vringar

For the last 100k crawl running OpenWPM v0.14.0 we got the following stats:

Data that appears as a hash in the data but not in the bucket: 4725 (0.47750376192117194 %)
Blobs that appear in the bucket but aren't referenced in the data: 63088 (6.3756100173720425 %)

For reference, a crawl running on v0.10.0:
Data that appears as a hash in the data but not in the bucket: 1 (0.00010873619186283581 %)
Blobs that appear in the bucket but aren't referenced in the data: 42169 (4.585296474663924 %)
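For anyone who wants to reproduce this kind of comparison, it boils down to two set differences. The sketch below is not the script that produced the numbers above. The function name is hypothetical, and taking both percentages relative to the number of referenced hashes is an assumption, since the denominator isn't stated here.

```python
from typing import Dict, Iterable, Set, Union


def compare_hashes(
    referenced: Iterable[str], stored: Iterable[str]
) -> Dict[str, Union[Set[str], float]]:
    """Compare hashes referenced in the crawl data with blobs in the bucket."""
    referenced_set = set(referenced)
    stored_set = set(stored)
    # Referenced in the data, but no blob with that name exists.
    missing = referenced_set - stored_set
    # Blob exists, but no record in the data points at it.
    orphaned = stored_set - referenced_set
    # Assumption: both percentages use the referenced count as denominator.
    total = len(referenced_set) or 1
    return {
        "missing": missing,
        "orphaned": orphaned,
        "missing_pct": 100 * len(missing) / total,
        "orphaned_pct": 100 * len(orphaned) / total,
    }
```

Here `referenced` would be the content_hash column from the structured data and `stored` the object names listed from the bucket.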

vringar avatar Apr 14 '21 09:04 vringar

My first instinct was that we were not awaiting content, but that doesn't seem to be the case: https://github.com/mozilla/OpenWPM/blob/358c8a73373abf84e6bea4600e6e154e7a903854/openwpm/storage/storage_controller.py#L138-L141

My second thought was that there might be a bug in the implementation of GcsUnstructuredProvider, but so far I haven't found one.

vringar avatar Apr 14 '21 09:04 vringar

One thing that makes me curious, even though I don't think it's the issue, is that we only add the entry to the cache after the transaction that writes it out to disk has completed. This means there could be a window in which the same instance saves a resource twice, with one write overwriting the other. I assume that writes to GCS, especially when wrapped in a transaction, are atomic, so the last transaction to finish should win, which wouldn't result in data loss. I've found this documentation page, and from my understanding a global read-after-write consistency guarantee should mean that whoever uploads to the bucket last wins.
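To make the window concrete, here is a toy reproduction. None of these classes are OpenWPM's (`InMemoryStore`, `CacheAfterWrite` and `CacheBeforeWrite` are hypothetical stand-ins, not GcsUnstructuredProvider), and the `asyncio.sleep(0)` only simulates the point where a real upload would yield to the event loop.

```python
import asyncio
from typing import List, Set


class InMemoryStore:
    """Stand-in for the bucket; records every write it receives."""

    def __init__(self) -> None:
        self.writes: List[str] = []

    async def write(self, key: str, data: bytes) -> None:
        await asyncio.sleep(0)  # yield, like a network upload would
        self.writes.append(key)


class CacheAfterWrite:
    """Adds the key to the cache only once the write has finished."""

    def __init__(self, store: InMemoryStore) -> None:
        self.store = store
        self.cache: Set[str] = set()

    async def save(self, key: str, data: bytes) -> None:
        if key in self.cache:
            return
        await self.store.write(key, data)
        self.cache.add(key)


class CacheBeforeWrite(CacheAfterWrite):
    """Reserves the key first, and releases it again if the write fails."""

    async def save(self, key: str, data: bytes) -> None:
        if key in self.cache:
            return
        self.cache.add(key)
        try:
            await self.store.write(key, data)
        except Exception:
            self.cache.discard(key)
            raise


async def count_writes(provider_cls) -> int:
    """Save the same blob twice concurrently and count the uploads."""
    store = InMemoryStore()
    provider = provider_cls(store)
    await asyncio.gather(provider.save("h", b"x"), provider.save("h", b"x"))
    return len(store.writes)
```

Running `asyncio.run(count_writes(CacheAfterWrite))` shows the duplicate upload, while the `CacheBeforeWrite` variant uploads once. Note the trade-off: with the reservation approach the second caller returns before the first upload is durable, so it only removes the duplicate write and would not by itself explain or fix missing blobs.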

vringar avatar Apr 14 '21 09:04 vringar

Investigation update: One visit_id has lost 114 records, the second highest has lost 55, and the counts trail down to a single unavailable hash. Each affected browser lost data for 1-9 visit_ids, so I'm assuming the instance doesn't crash when it fails to save out.
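A per-visit breakdown like this can be computed roughly as follows. This is only a sketch: the function name is hypothetical, and it assumes the records are available as (visit_id, content_hash) pairs and the bucket listing as a set of hashes.

```python
from collections import Counter
from typing import Iterable, List, Set, Tuple


def missing_per_visit(
    records: Iterable[Tuple[int, str]], stored: Set[str]
) -> List[Tuple[int, int]]:
    """Count records per visit_id whose content_hash is not in the bucket.

    Returns (visit_id, missing_count) pairs, highest count first.
    """
    counts = Counter(
        visit_id
        for visit_id, content_hash in records
        if content_hash not in stored
    )
    return counts.most_common()
```

The same grouping keyed by browser instead of visit_id gives the number of affected visits per browser.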

vringar avatar Apr 14 '21 15:04 vringar