python_mozetl

ETL jobs for Firefox Telemetry

Results 23 python_mozetl issues

https://github.com/mozilla/python_mozetl/blob/32d78c34dbb3c9ff5542f1ebc110f5aeb7fce340/mozetl/taar/taar_similarity.py#L131 The diversity of the donor pool is only ensured by the assumption that higher level clustering is substantially diverse. This could be improved by verification of cross-cluster diversity in...

https://github.com/mozilla/python_mozetl/blob/491fbda515f985f3156ff0c70859624fd4961ea8/mozetl/taar/taar_similarity.py#L248 Consider emphasizing the categorical features more by adding a `1 - h(x, y)` term when using Hamming distance.
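As a rough illustration of the quantities involved, the sketch below computes a normalized Hamming distance `h(x, y)` over categorical feature vectors and its complement `1 - h(x, y)`, which is the fraction of features that match. The function names and example feature values are hypothetical and not taken from `taar_similarity.py`. The issue does not say how the term would be combined with the existing distance, so that step is left out.

```python
def hamming_distance(x, y):
    """Normalized Hamming distance: fraction of positions where x and y differ."""
    if len(x) != len(y):
        raise ValueError("feature vectors must have the same length")
    if not x:
        return 0.0
    mismatches = sum(1 for a, b in zip(x, y) if a != b)
    return mismatches / float(len(x))


def categorical_agreement(x, y):
    """Complement of the Hamming distance: fraction of matching categorical features."""
    return 1.0 - hamming_distance(x, y)


# Hypothetical categorical features for two clients (illustrative values only).
client_a = ["release", "en-US", "Windows_NT", "Berlin"]
client_b = ["release", "en-US", "Darwin", "Paris"]

h = hamming_distance(client_a, client_b)            # two of four features differ
agreement = categorical_agreement(client_a, client_b)
```

A larger `1 - h(x, y)` means the two clients share more categorical features, so a term of this form rewards categorical overlap rather than only penalizing mismatches.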

Currently, `get_addon_limits_by_locale()` returns a limit of 1 for each locale in the dataset. We plan to change this to a slightly more sophisticated threshold selection method that uses a simple...

https://github.com/mozilla/python_mozetl/blob/0f8189f87f857f43e9c0142f9c612a0bcc28978c/tests/test_taar_similarity.py#L258-L263

```
________________________________________________ test_compute_donors ________________________________________________

spark =
addon_whitelist = ['system-addon-guid', 'var-0-guid-0', 'var-0-guid-1', 'var-0-guid-2', 'var-1-guid-0', 'var-1-guid-1', ...]
multi_clusters_df = DataFrame[client_id: string, normalized_channel: string, geo_city: array _, donors_df = taar_similarity.get_donors(spark, 3, 10,...
```

The current instructions include a guide for installing the JRE and snappy dependencies on Linux. On macOS, the dependencies should be available via `brew`.

P2

Currently, every PR requires running tests for all modules, even if only a single line is changed. A blog post by DigitalOcean [describing their Go monorepo](https://blog.digitalocean.com/cthulhu-organizing-go-code-in-a-scalable-repo/) showcases the benefits of an...

The logging should be consistent with telemetry-batch-view. Include documentation about good things to log (start, stop, exceptions/errors, audit information).
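A minimal sketch of what logging those events could look like with Python's standard `logging` module. The logger name, the `run_job` wrapper, and the message fields are assumptions made for illustration; they are not the format used by telemetry-batch-view, which would need to be checked before adopting any convention.

```python
import logging
import time

# Hypothetical logger name; a real job would pick its own module path.
logger = logging.getLogger("mozetl.example_job")


def run_job(job_name, job_fn, **params):
    """Run job_fn and log start, stop, errors, and basic audit information."""
    # Start: record which job is running and with what parameters.
    logger.info("job=%s status=start params=%s", job_name, params)
    started = time.time()
    try:
        result = job_fn(**params)
    except Exception:
        # Exceptions/errors: log the traceback, then re-raise so the job still fails.
        logger.exception("job=%s status=error", job_name)
        raise
    # Stop + audit information: duration and whatever the job reports back.
    elapsed = time.time() - started
    logger.info("job=%s status=stop elapsed_s=%.2f result=%s", job_name, elapsed, result)
    return result


def count_rows(rows):
    """Stand-in for a real ETL step (illustrative only)."""
    return len(rows)


logging.basicConfig(level=logging.INFO)
rows_processed = run_job("example", count_rows, rows=[1, 2, 3])
```

Wrapping the job in one place keeps the start, stop, and error messages uniform across jobs, which is the main point of the consistency request.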

P2

https://github.com/mozilla/python_mozetl/blob/05d7c1e1e3b0f4b3ea0c2d26f2f4d1f111bae478/mozetl/taar/taar_similarity.py#L97 Currently relies on a hard-coded threshold expressed as a number of clusters. The ecological validity of the model would be improved by making use of a condition based on minimal number...