
probe scraper unable to download file for tree: integration/mozilla-inbound

Open relud opened this issue 3 years ago • 4 comments

A revision in tree integration/mozilla-inbound isn't available outside of probe-scraper's cache, so the scraper fails with a 404 when it tries to download it:

Retreiving Buildhub results for channel nightly
  4645 revisions found
...
  Downloading files for revision number 494/4645 - revision: 46fe2115d46a5bb40523b8466341d8f9a26e1bdf, tree: integration/mozilla-inbound, version: 49.0a1
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/local/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/app/probe_scraper/runner.py", line 833, in <module>
    main(
  File "/app/probe_scraper/runner.py", line 647, in main
    upload_paths += load_moz_central_probes(
  File "/app/probe_scraper/runner.py", line 323, in load_moz_central_probes
    revision_data = moz_central_scraper.scrape_channel_revisions(
  File "/app/probe_scraper/scrapers/moz_central_scraper.py", line 207, in scrape_channel_revisions
    files = download_files(
  File "/app/probe_scraper/scrapers/moz_central_scraper.py", line 123, in download_files
    raise Exception(
Exception: Request returned status 404 for https://hg.mozilla.org/releases/integration/mozilla-inbound/raw-file/46fe2115d46a5bb40523b8466341d8f9a26e1bdf/toolkit/components/telemetry/Histograms.json

This is locally reproducible for me by running:

python3 -m probe_scraper.runner --out-dir=temp/probe_data --cache-dir temp/probe_cache --moz-central --firefox-version=49 --firefox-channel=nightly

and is fixed by manually downloading s3://telemetry-airflow-cache/cache/probe-scraper/hg/46fe2115d46a5bb40523b8466341d8f9a26e1bdf/toolkit/components/telemetry/Histograms.json into my local cache.
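
A minimal sketch of that manual workaround, for anyone who needs to repeat it. It assumes the local cache mirrors the S3 layout, i.e. `<cache-dir>/hg/<revision>/<path in tree>`; that layout is inferred from the S3 key above, not checked against the scraper code. `seed_cache` is a hypothetical helper, not part of probe-scraper.

```python
import shutil
from pathlib import Path


def seed_cache(cache_dir, revision, rel_path, source_file):
    """Copy an already-downloaded file into the local cache.

    ASSUMPTION: the cache layout is <cache_dir>/hg/<revision>/<rel_path>,
    mirroring the S3 key quoted in this issue.
    """
    target = Path(cache_dir) / "hg" / revision / rel_path
    # Create the intermediate directories if they do not exist yet.
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(source_file, target)
    return target
```

For the revision in this issue, `rel_path` would be `toolkit/components/telemetry/Histograms.json` and `source_file` the copy fetched from S3.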

relud avatar Sep 08 '22 22:09 relud

I modified probe_scraper/scrapers/moz_central_scraper.py to try to find all missing revisions instead of stopping at the first failure, and this appears to be the only one.

my changes:
diff --git a/probe_scraper/scrapers/moz_central_scraper.py b/probe_scraper/scrapers/moz_central_scraper.py
index 61dea29..4c5ed1f 100644
--- a/probe_scraper/scrapers/moz_central_scraper.py
+++ b/probe_scraper/scrapers/moz_central_scraper.py
@@ -194,25 +194,34 @@ def scrape_channel_revisions(

         print("  " + str(num_revisions) + " revisions found")

+        trees = set()
         for i, rd in enumerate(revision_dates):
-            revision = rd["revision"]
+            if rd["tree"] not in trees:
+                if rd["tree"] != "integration/mozilla-inbound":
+                    trees.add(rd["tree"])

-            print(
-                (
-                    f"  Downloading files for revision number {str(i+1)}/{str(num_revisions)}"
-                    f" - revision: {revision}, tree: {rd['tree']}, version: {str(rd['version'])}"
+                revision = rd["revision"]
+
+                print(
+                    (
+                        f"  Downloading files for revision number {str(i+1)}/{str(num_revisions)}"
+                        f" - revision: {revision}, tree: {rd['tree']}, version: {str(rd['version'])}"
+                    )
                 )
-            )
-            version = extract_major_version(rd["version"])
-            files = download_files(
-                channel, revision, folder, error_cache, version, tree=rd["tree"]
-            )
-
-            results[channel][revision] = {
-                "date": rd["date"],
-                "version": version,
-                "registries": files,
-            }
-            save_error_cache(folder, error_cache)
+                version = extract_major_version(rd["version"])
+                try:
+                    files = download_files(
+                        channel, revision, folder, error_cache, version, tree=rd["tree"]
+                    )
+
+                    results[channel][revision] = {
+                        "date": rd["date"],
+                        "version": version,
+                        "registries": files,
+                    }
+                except Exception:
+                    import traceback
+                    traceback.print_exc()
+                save_error_cache(folder, error_cache)

     return results
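
The idea behind the diff above (keep going after a failed download and report what failed) can be sketched on its own, outside the scraper. This is an illustration only: `find_missing_revisions` and the `download` callable are hypothetical names, and the real `download_files` takes more arguments than shown here.

```python
def find_missing_revisions(revision_dates, download):
    """Try every revision and collect the ones that fail to download.

    `revision_dates` is a list of dicts with "revision" and "tree" keys,
    like the entries the scraper iterates over. `download` is any callable
    taking (revision, tree) that raises on failure (hypothetical stand-in
    for the scraper's download step).
    """
    missing = []
    for rd in revision_dates:
        try:
            download(rd["revision"], rd["tree"])
        except Exception as exc:
            # Record the failure and continue instead of aborting the run.
            missing.append((rd["revision"], rd["tree"], str(exc)))
    return missing
```

Unlike the debugging diff, this checks every revision rather than one per tree, so it is slower but cannot miss a second broken revision in the same tree.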

relud avatar Sep 08 '22 22:09 relud

For now I've asked Data SRE to copy the missing cache file to the new cache location (https://mozilla-hub.atlassian.net/browse/DSRE-1001?focusedCommentId=590672), but I don't know whether a long-term solution is needed here.

cc @chutten

relud avatar Sep 08 '22 22:09 relud

...why are we pulling mozilla-inbound? Surely we only care about mozilla-central? Branches on /integration/ don't ship binaries we'd expect to receive data from, so we shouldn't need to care much about what is or isn't present on them.

chutten avatar Sep 12 '22 19:09 chutten

we're pulling from that tree because it's listed by buildhub. we don't (currently) filter what buildhub returns for firefox versions when scraping legacy telemetry in prod. specifically for firefox nightly 49.0a1, buildhub returns a list that includes revision: 46fe2115d46a5bb40523b8466341d8f9a26e1bdf, tree: integration/mozilla-inbound
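
If filtering turns out to be the long-term fix, one possible shape is sketched below. Nothing in this thread decides that filtering is the right answer; `drop_integration_trees` is a hypothetical helper, and the `integration/` prefix is taken from the tree name in this issue.

```python
def drop_integration_trees(revision_dates, excluded_prefixes=("integration/",)):
    """Return only the revisions whose tree is not under an excluded prefix.

    `revision_dates` is a list of dicts with a "tree" key, as returned for
    each buildhub revision. The default prefix is an assumption based on
    the tree name "integration/mozilla-inbound" in this issue.
    """
    return [
        rd
        for rd in revision_dates
        # str.startswith accepts a tuple and matches any of its prefixes.
        if not rd["tree"].startswith(excluded_prefixes)
    ]
```

Note that this would silently drop the 49.0a1 revision that buildhub lists, so it trades a hard failure for a gap in the scraped data.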

relud avatar Sep 12 '22 19:09 relud