edx-analytics-pipeline

Add tool for performing curation.

Open brianhw opened this issue 9 years ago • 6 comments

This command-line tool generates curated synthetic enrollment event files from log files created separately by enrollment validation runs.

These event files are in a format that can be added directly to the eventlog source directories of the pipeline tasks.
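To make the idea concrete, here is a minimal sketch of what such a curation step might look like. It is not the code in this PR. It assumes (hypothetically) that each validation log line is a JSON-encoded synthetic event with an ISO-formatted `time` field, and the function names `curate_events` and `write_event_files` and the output file naming are made up for illustration.

```python
import gzip
import json
import os
from collections import defaultdict


def curate_events(validation_lines):
    """Group unique synthetic events by the date prefix of their 'time' field.

    Assumption: each input line is one JSON-encoded event with an ISO 'time'.
    Lines that do not match that shape are skipped.
    """
    events_by_date = defaultdict(list)
    seen = set()
    for line in validation_lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except ValueError:
            continue  # not JSON, so not an event line
        if not isinstance(event, dict):
            continue
        timestamp = event.get("time")
        if not isinstance(timestamp, str) or len(timestamp) < 10:
            continue
        # Use a canonical serialization to drop exact duplicates.
        key = json.dumps(event, sort_keys=True)
        if key in seen:
            continue
        seen.add(key)
        events_by_date[timestamp[:10]].append(event)
    return events_by_date


def write_event_files(events_by_date, output_dir):
    """Write one gzipped file of JSON lines per date; return the paths written."""
    os.makedirs(output_dir, exist_ok=True)
    paths = []
    for date in sorted(events_by_date):
        # Hypothetical file name; the real tool may use a different convention.
        name = "synthetic_enroll.log-{}.gz".format(date.replace("-", ""))
        path = os.path.join(output_dir, name)
        with gzip.open(path, "wt", encoding="utf-8") as output:
            for event in sorted(events_by_date[date], key=lambda e: e["time"]):
                output.write(json.dumps(event, sort_keys=True) + "\n")
        paths.append(path)
    return paths
```

In this sketch the resulting per-date files would then be copied by hand into an eventlog source directory. The actual tool reads its input via the s3util functionality mentioned in the first comment below, which the sketch leaves out.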

@mulby @jab5569

brianhw avatar May 21 '15 06:05 brianhw

Note that this uses the s3util functionality, so it's branched off of that. Once that makes it to master, this can be rebased.

brianhw avatar May 21 '15 06:05 brianhw

@brianhw what was the genesis for this addition? I'm just wondering how / when we'd use it. Is it a replacement for synthetic event creation?

johnalbaker avatar May 21 '15 09:05 johnalbaker

It is a replacement for how I have been doing synthetic event creation, which has involved setting up directories, running my personal s3_util tool, running a bash script to grep over files, and then manually comparing the results. This automates all of that, so that someone else can perform the same operations, both here at edX and at other installations.

brianhw avatar May 21 '15 20:05 brianhw

@mulby @HassanJaveed84 I've added some documentation to this, as well as statistics that make it more useful. I would like to merge this at some point. (After thumbs, I'll squash.)

brianhw avatar Jul 07 '16 17:07 brianhw

Current coverage is 79.37% (diff: 0.00%)

Merging #107 into master will decrease coverage by 0.49%

@@             master       #107   diff @@
==========================================
  Files           194        195     +1   
  Lines         20531      20659   +128   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
  Hits          16399      16399          
- Misses         4132       4260   +128   
  Partials          0          0          

Powered by Codecov. Last update 89a8a1d...6bf43d4

codecov-io avatar Jan 13 '17 05:01 codecov-io

I suspect that this will at least require adding a hostname argument to s3_connect() in s3util.py. Or using ScalableS3Client.s3 instead. But it still won't work with assumed roles.

brianhw avatar Mar 12 '19 20:03 brianhw