etl

M-Lab ingestion pipeline

Results 105 etl issues

Need row filter tags on server name or identity and protocol family. At the very least the (protocol-specific) host name: ndt.iupui.mlab1v6.lga03.measurement-lab.org. Better would be to bust it out into...

P2
Story
backlog
Q4
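
As a rough illustration of what filtering on server identity and protocol family could build on, the sketch below splits a host name like the one quoted above into parts. The field names (`service`, `machine`, `family`, `site`) and the regular expression are assumptions made for this example, not the pipeline's schema, and the issue text is truncated before it says how the name should actually be broken out.

```python
import re
from typing import NamedTuple, Optional


class ServerName(NamedTuple):
    """Hypothetical breakdown of an M-Lab host name (field names are assumptions)."""
    service: str           # e.g. "ndt.iupui"
    machine: str           # e.g. "mlab1"
    family: Optional[str]  # "v4", "v6", or None when the name has no family suffix
    site: str              # e.g. "lga03"


# Assumed layout: <service>.<machine>[v4|v6].<site>.measurement-lab.org
_HOST_RE = re.compile(
    r"^(?P<service>[a-z0-9.]+)"
    r"\.(?P<machine>mlab[0-9]+)(?P<family>v[46])?"
    r"\.(?P<site>[a-z]{3}[0-9a-z]{2})"
    r"\.measurement-lab\.org$"
)


def parse_server_name(hostname: str) -> Optional[ServerName]:
    """Return the parsed parts, or None if the name does not fit the assumed layout."""
    match = _HOST_RE.match(hostname)
    if match is None:
        return None
    return ServerName(
        service=match.group("service"),
        machine=match.group("machine"),
        family=match.group("family"),
        site=match.group("site"),
    )
```

Returning `None` for names that do not fit keeps unexpected host names visible to the caller instead of silently mis-tagging rows.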

SELECT result.StartTime, _PARTITIONTIME AS pt FROM `mlab-oti.base_tables.ndt5` WHERE DATE(result.StartTime) != DATE(_PARTITIONTIME)
This shows that files created late in the day are tarred into files with next day's date, and the...

P2
Story
backlog
Q4
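
The condition in the query above can be mirrored outside BigQuery when inspecting individual rows. This is a minimal sketch, assuming the partition is identified by a UTC calendar date and that naive timestamps are meant as UTC; the function name `partition_mismatch` is hypothetical.

```python
from datetime import date, datetime, timezone


def partition_mismatch(start_time: datetime, partition_date: date) -> bool:
    """Return True when the test's UTC start date differs from its partition date."""
    if start_time.tzinfo is None:
        # Assumption: timestamps without a zone are already in UTC.
        start_time = start_time.replace(tzinfo=timezone.utc)
    return start_time.astimezone(timezone.utc).date() != partition_date
```

For example, a test that started at 23:59:30 UTC on 2019-06-19 but landed in the 2019-06-20 partition would be flagged, which matches the late-in-the-day case the issue describes.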

#685 addresses the large buffer problem, but does not address the large single row problem. We should limit the size of a single row by reducing the number of snapshots.

P3
backlog
Q4
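
One possible way to cap the number of snapshots in a row is to keep an evenly spaced subset that always includes the first and last snapshot. The issue does not say which reduction strategy is intended, so the thinning rule and the name `thin_snapshots` below are assumptions for illustration only.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def thin_snapshots(snapshots: Sequence[T], max_snapshots: int) -> List[T]:
    """Keep at most max_snapshots items, evenly spaced, including first and last."""
    if max_snapshots < 2:
        raise ValueError("max_snapshots must be at least 2")
    count = len(snapshots)
    if count <= max_snapshots:
        # Already small enough; return a copy unchanged.
        return list(snapshots)
    # Spacing between kept indices; greater than 1 here, so indices never repeat.
    step = (count - 1) / (max_snapshots - 1)
    indices = [round(i * step) for i in range(max_snapshots)]
    return [snapshots[i] for i in indices]
```

Keeping the endpoints preserves the connection's initial and final state, which is usually what later analysis needs most; a different rule (for example, keeping every Nth snapshot) would be an equally plausible reading of the issue.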

1. Design Doc 2. Large machine config 3. File iterator 4. Load management

Story
8
2019
current
Week 38

Part of #624

P1
Story
1
backlog
2019
Sprint 6

There are also anomalies in scraper files, so we should work out the proper handling there too. In the long term, we should partition by start time.

P2

TEST_SERVICE_ACCOUNT_mlab_testing is currently base64 encoded. Update to use mechanism in etl-gardener.

P2
1

TCPParser unit tests appear to depend on annotator. ``` 2019/06/19 17:56:45 task.go:138: Processed 364 files, 0 nil data, 362 rows committed, 0 failed, from testdata/20190516T013026.744845Z-tcpinfo-mlab4-arn02-ndt.tgz into ndt_test --- PASS: TestTCPParser...

P2
backlog
Q4

Should check a number of fields for negative, large, or zero values, both in web100 and tcpinfo. Also see https://github.com/m-lab/etl/issues/682

P2
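
The kind of check described above (negative, zero, or implausibly large values) could look roughly like the following. This is a sketch under stated assumptions: the row is modelled as a plain dict, the upper limits are supplied by the caller, and the function name `find_anomalies` is hypothetical. It does not reflect the actual web100 or tcpinfo schemas.

```python
from typing import Dict, List, Tuple


def find_anomalies(row: Dict[str, int], max_values: Dict[str, int]) -> List[Tuple[str, str]]:
    """Report fields in row that are negative, zero, or above their given limit.

    Only fields listed in max_values are checked; fields missing from row are skipped.
    """
    anomalies = []
    for field, limit in max_values.items():
        value = row.get(field)
        if value is None:
            continue
        if value < 0:
            anomalies.append((field, "negative"))
        elif value == 0:
            anomalies.append((field, "zero"))
        elif value > limit:
            anomalies.append((field, "too large"))
    return anomalies
```

Zero is reported separately from negative because a zero may be legitimate for some counters and suspicious for others, so the caller can decide per field how to treat it.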

Current parser/schema does not include the filename. UUID specifies a connection, but there may be more than one file for connections that are open more than 10 minutes. The dedup...

P1
backlog