python_mozetl
ETL jobs for Firefox Telemetry
https://github.com/mozilla/python_mozetl/blob/32d78c34dbb3c9ff5542f1ebc110f5aeb7fce340/mozetl/taar/taar_similarity.py#L131 The diversity of the donor pool is ensured only by the assumption that the higher-level clustering is substantially diverse. This could be improved by verifying cross-cluster diversity in...
https://github.com/mozilla/python_mozetl/blob/491fbda515f985f3156ff0c70859624fd4961ea8/mozetl/taar/taar_similarity.py#L248 Consider giving the categorical features more weight by adding a `1 - h(x, y)` term when using the Hamming distance.
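The issue above does not define `h(x, y)`. As a minimal sketch, assuming it denotes the normalized Hamming distance between two categorical feature vectors (the fraction of positions that differ), `1 - h(x, y)` is then the fraction of positions that agree. The function names and the example feature values below are hypothetical and are not taken from `taar_similarity.py`; how the term would be combined with the existing similarity score is left open.

```python
from typing import Hashable, Sequence


def hamming_distance(x: Sequence[Hashable], y: Sequence[Hashable]) -> float:
    """Normalized Hamming distance: fraction of positions where x and y differ.

    Assumption: this is what the issue calls h(x, y).
    """
    if len(x) != len(y):
        raise ValueError("feature vectors must have the same length")
    if len(x) == 0:
        return 0.0
    mismatches = sum(1 for a, b in zip(x, y) if a != b)
    return mismatches / len(x)


def categorical_agreement(x: Sequence[Hashable], y: Sequence[Hashable]) -> float:
    """The 1 - h(x, y) term: fraction of categorical features that match."""
    return 1.0 - hamming_distance(x, y)
```

For two clients that share three of four categorical values, for example `["en-US", "release", "Windows_NT", "Berlin"]` and `["en-US", "beta", "Windows_NT", "Berlin"]`, `hamming_distance` gives 0.25 and `categorical_agreement` gives 0.75.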
Currently, `get_addon_limits_by_locale()` returns a limit of 1 for each locale in the dataset. We plan to change this to a slightly more sophisticated threshold selection method that uses a simple...
https://github.com/mozilla/python_mozetl/blob/0f8189f87f857f43e9c0142f9c612a0bcc28978c/tests/test_taar_similarity.py#L258-L263
```
________________________________________________ test_compute_donors ________________________________________________
spark =
addon_whitelist = ['system-addon-guid', 'var-0-guid-0', 'var-0-guid-1', 'var-0-guid-2', 'var-1-guid-0', 'var-1-guid-1', ...]
multi_clusters_df = DataFrame[client_id: string, normalized_channel: string, geo_city: array _, donors_df = taar_similarity.get_donors(spark, 3, 10,...
```
The current instructions cover installing the JRE and snappy dependencies on Linux only. The same dependencies should be available via `brew` on macOS.
Currently, every PR requires running the tests for all modules, even if only a single line is changed. A blog post by DigitalOcean [describing their Go monorepo](https://blog.digitalocean.com/cthulhu-organizing-go-code-in-a-scalable-repo/) showcases the benefits of an...
The logging should be consistent with telemetry-batch-view. Include documentation about good things to log (start, stop, exceptions/errors, audit information).
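As a rough illustration of the four kinds of events listed above (start, stop, exceptions/errors, audit information), here is a minimal sketch using only Python's standard `logging` module. The logger name, the `logged_job` helper, and the `key=value` message layout are assumptions made for this example; they are not the telemetry-batch-view conventions, which would need to be checked separately.

```python
import logging
import time
from contextlib import contextmanager

# Hypothetical logger name, chosen only for this example.
logger = logging.getLogger("mozetl.example")


@contextmanager
def logged_job(name, **audit):
    """Log start, stop, and any exception for the wrapped block.

    Keyword arguments are treated as audit information (for example the
    submission date or input path) and are appended to the start message.
    """
    details = " ".join(f"{key}={value}" for key, value in sorted(audit.items()))
    logger.info("job=%s status=start %s", name, details)
    started = time.monotonic()
    try:
        yield
    except Exception:
        # logger.exception records the message at ERROR level with a traceback.
        logger.exception("job=%s status=error", name)
        raise
    finally:
        # Runs on both success and failure, so every start has a matching stop.
        elapsed = time.monotonic() - started
        logger.info("job=%s status=stop elapsed=%.2fs", name, elapsed)
```

A job body would then be wrapped as `with logged_job("taar_similarity", submission_date="20180101"): ...`, after configuring handlers once, for example with `logging.basicConfig(level=logging.INFO)`.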
https://github.com/mozilla/python_mozetl/blob/05d7c1e1e3b0f4b3ea0c2d26f2f4d1f111bae478/mozetl/taar/taar_similarity.py#L97 This currently relies on a hard-coded threshold expressed as a number of clusters. The ecological validity of the model would be improved by using a condition based on minimal number...