etl
M-Lab ingestion pipeline
SELECT * FROM `mlab-oti.ndt.traceroute` WHERE Parseinfo.TaskFileName = "gs://archive-measurement-lab/ndt/traceroute/2019/11/02/20191102T020000.909108Z-traceroute-mlab4-lhr05-ndt.tgz" LIMIT 1000 Row | partition_date | uuid | TestTime | Parseinfo.TaskFileName | Parseinfo.ParseTime | Parseinfo.ParserVersion | Parseinfo.Filename | start_time | stop_time |...
Gardener jobs/ already supports a BigQuery table as a destination, but it should also allow other kinds of destination, e.g. GCS JSONL files. Parsers need to respect the destination requested by Gardener.
Processed 723 files, 0 nil data, 0 rows committed, 722 failed, from gs://archive-measurement-lab/ndt/ndt5/2019/11/12/20191112T201953.499657Z-ndt5-mlab3-syd02-ndt.tgz into ndt5_20191112. A failure rate of 722 out of 723 files seems very high for a single tarball.
We expect those numbers to be very close (~95+%), but when it drops too low there may be another problem with data collection, parsing, or inserting to BQ...
When Gardener updates fail, the parser should start a goroutine to retry the update. Otherwise the update may be entirely lost. If the update is a state change, then the job...
Trying to find a combination of build directives and docker base images that works reliably turns out to be non-trivial. I had hoped to use alpine with appropriate static linking,...
Inserts are sometimes failing on tcpinfo buffers. This is likely due to the large number of snapshots in some rows. We should make two changes: 1. Limit the number of snapshots in a row. Perhaps...
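One possible way to cap the number of snapshots per row is sketched below. The issue text is truncated before it says how the limit should work, so the even-sampling strategy, the `Snapshot` type, and the `LimitSnapshots` name are all assumptions made for illustration.

```go
package main

// Snapshot is a hypothetical placeholder for one tcpinfo snapshot.
// The real row schema is not shown in the issue.
type Snapshot struct {
	Seq int
}

// LimitSnapshots returns at most max snapshots. When the input is too
// long it keeps the first and last snapshots and samples the rest at
// evenly spaced positions. This strategy is an assumption.
func LimitSnapshots(snaps []Snapshot, max int) []Snapshot {
	if max <= 0 {
		return nil
	}
	n := len(snaps)
	if n <= max {
		return snaps
	}
	if max == 1 {
		// Only one slot: keep the final state of the connection.
		return []Snapshot{snaps[n-1]}
	}
	out := make([]Snapshot, 0, max)
	for i := 0; i < max; i++ {
		// Integer interpolation from index 0 to index n-1.
		out = append(out, snaps[i*(n-1)/(max-1)])
	}
	return out
}
```

Plain truncation would be simpler, but it would drop the end of long connections, which is why the sketch samples across the whole series instead.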
Part of m-lab/dev-tracker#501. Use Gardener update/ to send per-task updates. Use Gardener heartbeat/ to send a per-job heartbeat, once per minute. This allows Gardener to detect ETL instance crashes.
Deployments are failing, apparently in the schema sync stage. It appears that bq.py flags may have changed, and we need to update the scripts.
There are separate Window Scale parameters for each half of the connection. They appear in tcp_info as rcv_wscale and snd_wscale. We are probably parsing both into a single field.
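To illustrate keeping the two values apart, the sketch below assumes they arrive packed into one byte, as in the Linux `struct tcp_info` bitfield where snd_wscale and rcv_wscale are 4 bits each. The nibble order (snd in the low bits) is an assumption based on the usual little-endian layout, and `WScales` and `SplitWScale` are hypothetical names, not the parser's actual code.

```go
package main

// WScales holds the window scale for each half of the connection.
type WScales struct {
	Snd uint8 // snd_wscale
	Rcv uint8 // rcv_wscale
}

// SplitWScale separates a packed byte into its two 4-bit values.
// Assumption: low nibble is snd_wscale, high nibble is rcv_wscale.
func SplitWScale(packed uint8) WScales {
	return WScales{
		Snd: packed & 0x0F,
		Rcv: packed >> 4,
	}
}
```

For example, under that assumption `SplitWScale(0x87)` yields a send scale of 7 and a receive scale of 8.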