edx-analytics-pipeline
Add tool for performing curation.
This command-line tool generates curated synthetic enrollment event files from the log files that enrollment validation runs produce separately.
The event files are formatted so that they can be added directly to the eventlog source directories of the pipeline tasks.
@mulby @jab5569
Note that this uses the s3util functionality, so it's branched off of that. Once that makes it to master, this can be rebased.
@brianhw what was the genesis for this addition? I'm just wondering how / when we'd use it. Is it a replacement for synthetic event creation?
It is a replacement for how I have been doing synthetic event creation, which has involved setting up directories, running my personal s3_util tool, running a bash script to grep over files, and then manually comparing the results. This automates all of that so that someone else can perform the same operations, both here at edX and at other installations.
@mulby @HassanJaveed84 I've added some doc to this, as well as the statistics that make it more useful. I would like to merge this at some point. (After thumbs, I'll squash.)
Current coverage is 79.37% (diff: 0.00%)
@@            master    #107    diff @@
==========================================
  Files          194     195      +1
  Lines        20531   20659    +128
  Methods          0       0
  Messages         0       0
  Branches         0       0
==========================================
  Hits         16399   16399
- Misses        4132    4260    +128
  Partials         0       0
Powered by Codecov. Last update 89a8a1d...6bf43d4
I suspect that this will at least require adding a hostname argument to s3_connect() in s3util.py. Or using ScalableS3Client.s3 instead. But it still won't work with assumed roles.
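For illustration, a rough sketch of what a hostname argument on s3_connect() might look like. This is only a guess at the shape: the real signature in s3util.py may differ, the `connect` parameter is a hypothetical hook added here so the function can be exercised without credentials, and the only library call assumed is boto's `connect_s3()`, which forwards a `host` keyword to the S3 connection. As noted, this would not address assumed roles.

```python
def s3_connect(host=None, connect=None):
    """Hypothetical variant of s3_connect() that accepts an optional S3 hostname.

    `host` and `connect` are illustrative parameters, not the existing API.
    """
    kwargs = {}
    if host:
        # Only pass `host` when given, so the library default endpoint
        # is used otherwise.
        kwargs['host'] = host
    if connect is None:
        # Imported lazily so the sketch can be tested without boto installed.
        import boto
        connect = boto.connect_s3
    return connect(**kwargs)
```

A caller could then pass the endpoint explicitly, for example `s3_connect(host='s3.example.com')`, where the hostname is a placeholder.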