probe-scraper
Filter out missing revisions
Somewhere along the line we are retrieving revisions for hashes that don't exist on hg.mozilla.org. We should investigate how and where this happens, and strip those revisions out.
Here is the list of invalid URLs being fetched by probe-scraper: https://gist.github.com/fbertsch/f0d27f697dec888e1e7ed88a048b2ad3
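As a rough illustration of what "stripping out" could look like, here is a minimal sketch that splits a list of revision hashes into ones whose URL resolves and ones that return 404. It is not taken from probe-scraper: `url_exists`, `filter_missing_revisions`, and the `url_for` callback are hypothetical names, and it assumes a missing revision shows up as an HTTP 404 on the URL being fetched.

```python
import urllib.error
import urllib.request
from typing import Callable, Iterable, List, Tuple


def url_exists(url: str, timeout: float = 10.0) -> bool:
    """Return True if a HEAD request succeeds, False if the server answers 404.

    Assumption: a revision that does not exist yields a 404 for its URL.
    """
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return 200 <= response.status < 300
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return False
        # Other HTTP errors (5xx, rate limiting) do not prove the revision
        # is missing, so surface them instead of silently dropping it.
        raise


def filter_missing_revisions(
    revisions: Iterable[str],
    url_for: Callable[[str], str],
    exists: Callable[[str], bool] = url_exists,
) -> Tuple[List[str], List[str]]:
    """Split revisions into (kept, dropped) based on whether their URL exists.

    `url_for` maps a revision hash to the URL that would be fetched; it is
    a hypothetical hook, supplied by the caller.
    """
    kept: List[str] = []
    dropped: List[str] = []
    for revision in revisions:
        if exists(url_for(revision)):
            kept.append(revision)
        else:
            dropped.append(revision)
    return kept, dropped
```

Returning the dropped hashes alongside the kept ones makes it possible to log them, which may help track down where the invalid revisions come from.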
cc @mdboom: you mentioned your team was interested in working on bugs here. Is this something you all would have the bandwidth to take on?
Sure. I made a Bugzilla issue to point to this one (which will make it easier for my team to not lose it): https://bugzilla.mozilla.org/show_bug.cgi?id=1628725
Having trouble reproducing this. Does the probe-scraper deploy cache the repository? Maybe it's getting stale/broken?
Indeed it does. The cache is here: https://github.com/mozilla/probe-scraper/blob/master/probe_scraper/runner.py#L300
So maybe forcing a clean checkout of m-c in the cache would fix the problem? (Of course, that won't help us understand how we got to this bug in the first place...)
Agreed. Running P-S in a fresh cache location would be a good start (IIRC this should take 5-6 hours), and will hit M-C a bunch. Are you up for that, Mike?
I already did that locally last week (in fairness, I ran it for the first time on a new machine) and wasn't able to reproduce the bug. Where do I start with trying to do that in deployment?
Ah, gotcha. You'd need AWS creds to run this with a separate bucket. Sounds like this investigation needs to happen more on the ops side.