data-testing-tutorial Workshop ruminations

hey @ericmjl this Workshop is really exciting. I have now gone through all NBs and attempted to run everything locally.

My feedback is below. I'm on a plane and git pulled before take-off, so some of my feedback may already be out of date. I have included some check boxes, if it helps. Between now and the Workshop, I am happy to help implement any of these or other changes you see fit. Happy to chat here, but it would be good to sync and chat also. See you in Portland!

general:

[ ] what are your requirements on an attendee? Intermediate Python programming skills, for example, know how to write functions, pandas slicing, raising errors in functions (could be good to include in README.md to manage expectations); I could imagine that code such as the following may require explanation for many attendees:

hashes['concat'] = df.apply(lambda x: ''.join(str(x[col]) for col in df.columns), axis=1)
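For attendees who have not met this idiom, a rough plain-Python sketch of what the line does may help: for each row, every column value is converted to a string and the pieces are joined into one string. The rows and column names below are made up for illustration and are not the workshop data.

```python
# Hypothetical rows standing in for the DataFrame; column names are made up.
columns = ["site", "year", "value"]
rows = [
    {"site": "A", "year": 2016, "value": 1.5},
    {"site": "B", "year": 2017, "value": 2.0},
]


def concat_row(row, columns):
    """Join every column value of one row into a single string."""
    return "".join(str(row[col]) for col in columns)


# Same idea as df.apply(..., axis=1): the function is called once per row.
concatenated = [concat_row(row, columns) for row in rows]
print(concatenated)  # ['A20161.5', 'B20172.0']
```

In the pandas version, the lambda plays the role of concat_row and axis=1 means "apply it to each row rather than each column".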

[x] when i run checkenv.py, I get

ModuleNotFoundError: No module named 'colorama'

include it in the environment? it's funny that i have all packages required for the Workshop installed but not all packages required to check the environment! :D

2-pytest-intro

[x] 1st exercise says to create test_datafuncs.py but it's already in repo: perhaps this is instructor repo? in general, it seems as though many of the exercises have already been done in this repo, e.g. min-max scaler (prob intentional)
[x] it isn't obvious why running py.test should do anything; perhaps explain in a few words? The terminal is a mystical place for many.
[x] for the first test_min_max_scaler() , do you need to import numpy in test_datafuncs.py?
[x] the following line of code in the 2nd test_min_max_scaler() makes pytest throw a SyntaxError; can you reproduce this?

assert np.allclose(tfm, np.array([0, 0.5, 1])
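As quoted, the call to np.allclose( is never closed. One way to reproduce the error without running the whole test file is to ask Python to parse the line on its own. The sketch below only parses the source, so numpy does not need to be installed; the helper name parses is made up for illustration.

```python
broken = "assert np.allclose(tfm, np.array([0, 0.5, 1])"   # line as quoted
fixed = "assert np.allclose(tfm, np.array([0, 0.5, 1]))"   # one more ')'


def parses(source):
    """Return True if Python can parse the source, False on SyntaxError."""
    try:
        compile(source, "<snippet>", "exec")  # parse only, nothing is executed
        return True
    except SyntaxError:
        return False


print(parses(broken), parses(fixed))  # False True
```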

[x] i really love the textual data example; i'm wondering if you expect attendees to know what lines of code such as the following do or if you'll explain it?

return ''.join(s for s in text if s not in exclude)

similarly with the tests; also, why do you run the test functions after defining them in the NB? Just wondering :)
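For reference, a minimal sketch of how that join line could be unpacked for attendees. It assumes exclude is a set of punctuation characters; the function and test names are hypothetical and not taken from the notebook.

```python
import string


def strip_punctuation(text):
    """Return text with punctuation characters removed."""
    exclude = set(string.punctuation)  # assumed definition of `exclude`
    # Walk over the characters, keep those not in `exclude`, glue them back.
    return "".join(s for s in text if s not in exclude)


def test_strip_punctuation():
    assert strip_punctuation("Hello, world!") == "Hello world"
    assert strip_punctuation("") == ""


# In a notebook there is no test runner, so the test is called by hand.
# A silent return means every assert passed.
test_strip_punctuation()
```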

3-file-integrity

[ ] cell 2: attendees may wonder what sha256(), update() and hexdigest() are
[ ] in cell 12 ( defining hash_file), for the 1st time you're using the update() method several times; this is worth a few words;
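A short sketch that might cover both points: sha256() creates an empty hash object, update() feeds it more bytes, and hexdigest() returns the result as a hex string. Calling update() several times gives the same digest as hashing all the bytes at once, which is why a file can be hashed chunk by chunk. hash_stream below is a hypothetical stand-in, not the notebook's hash_file.

```python
import hashlib
import io


def hash_stream(stream, chunk_size=4096):
    """Return the SHA-256 hex digest of a binary stream, read in chunks."""
    hasher = hashlib.sha256()                  # empty hash object
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        hasher.update(chunk)                   # feed the next piece of data
    return hasher.hexdigest()                  # digest as a 64-char hex string


data = b"some file contents" * 1000
one_shot = hashlib.sha256(data).hexdigest()
chunked = hash_stream(io.BytesIO(data), chunk_size=64)
print(one_shot == chunked)  # True
```

Reading in chunks keeps memory use flat, which matters once the data files are large.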

love the tinydb :)

4-data-checks

[ ] typo: under Schema Checks, 'expected' is missing the 'd'
[x] yaml: you could remind them that they used one to set up their system with the conda env! :D
[x] you write 'Let's now switch roles, and pretend that we're on side of the "analyst" and are no longer the "data provider".' but this is the first time the idea of a "data provider" has come up in the NB
[x] in the exercise to write function test_data_columns, do you want them to run py.test to see the fruits of their labour?
[ ] when you write ' Take the schema spec file and write a test for it.', which schema spec file are you talking about?

i love missingno!
interesting to use pandas_summary; idiomatic pandas would generally lead me down the path of using dfs.describe() and dfs.info()

[x] after they write the function test_data_completeness() and add it to test_datafuncs.py, running py.test doesn't find a DataFrame df; similarly with test_data_range(); i'm probably missing something silly;
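One possible cause, stated as an assumption since the test code is not shown here: df lives in the notebook's namespace, while py.test imports test_datafuncs.py as a separate module where that name does not exist. A sketch of a test that builds its own data, using the standard csv module and made-up CSV text in place of the workshop file:

```python
import csv
import io

# Hypothetical stand-in for the workshop's data file.
CSV_TEXT = "site,year,value\nA,2016,1.5\nB,2017,2.0\n"


def load_rows(text=CSV_TEXT):
    """Parse CSV text into a list of dicts, one per row."""
    return list(csv.DictReader(io.StringIO(text)))


def check_completeness(rows):
    """Return True if no cell is empty or missing."""
    return all(
        value not in ("", None) for row in rows for value in row.values()
    )


def test_data_completeness():
    rows = load_rows()  # the data is created inside the test itself
    assert check_completeness(rows)
```

With pandas the same idea would be to read the file inside the test, or to provide the DataFrame through a pytest fixture.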

great ECDFs! ;)

[ ] you write 'We can take the EDA portion further, by doing an empirical cumulative distribution plot for each data column.' but then do something else first; explain why we need compute_dimensions()?
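For attendees new to ECDFs, a minimal plain-Python sketch of the computation behind the plot (this does not attempt to reproduce compute_dimensions(), whose body is not shown here):

```python
def ecdf(values):
    """Return the sorted values and the cumulative fraction at each one."""
    xs = sorted(values)
    n = len(xs)
    ys = [(i + 1) / n for i in range(n)]  # 1/n, 2/n, ..., 1
    return xs, ys


xs, ys = ecdf([3.0, 1.0, 2.0, 4.0])
print(xs)  # [1.0, 2.0, 3.0, 4.0]
print(ys)  # [0.25, 0.5, 0.75, 1.0]
```

Plotting ys against xs for each column gives one ECDF per column.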

wrt K-S test, also check out Lilliefors: https://en.wikipedia.org/wiki/Lilliefors_test
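The relevance of Lilliefors: when the mean and standard deviation of the reference normal are estimated from the same sample, the statistic is computed exactly as in K-S, but the standard K-S critical values no longer apply. A rough standard-library sketch of the statistic only (no p-value); the function names are made up for illustration.

```python
import math
import statistics


def normal_cdf(x, mu, sigma):
    """CDF of a normal distribution with mean mu and std sigma."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def ks_statistic_fitted_normal(values):
    """Largest gap between the sample ECDF and a normal fitted to the sample."""
    xs = sorted(values)
    n = len(xs)
    mu = statistics.mean(xs)
    sigma = statistics.stdev(xs)  # parameters estimated from the sample
    d = 0.0
    for i, x in enumerate(xs):
        cdf = normal_cdf(x, mu, sigma)
        # The ECDF jumps from i/n to (i+1)/n at x, so check both sides.
        d = max(d, (i + 1) / n - cdf, cdf - i / n)
    return d


print(ks_statistic_fitted_normal([2.1, 2.9, 3.4, 3.6, 4.2, 5.0, 5.3]))
```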

5-test-coverage

coverage seems really cool! running py.test --cov, though, throws

py.test: error: unrecognized arguments: --cov

do you plan to expand on this NB? Can I help at all? One way would be to play around with the functions and test functions and see the differing results;
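Both this error and the colorama one above look like packages missing from the environment: the --cov option comes from the pytest-cov plugin, not from pytest itself. A sketch of an environment check that lists every missing module instead of stopping at the first failed import. The module list is hypothetical and the real checkenv.py may check different names.

```python
import importlib.util

# Hypothetical list; pytest_cov is the import name of the pytest-cov plugin.
REQUIRED_MODULES = ["pytest", "pytest_cov", "colorama"]


def missing_modules(names):
    """Return the top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules(REQUIRED_MODULES)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules found.")
```

Because it only uses the standard library, a check like this cannot itself fail on a missing third-party package.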

May 13 '17 02:05 hugobowne

Given today's dry run at Boston Python, I have the following epiphanies:

Environments are difficult. I've had relatively smooth sailing with NAMS because environment instructions included both Anaconda + venv systems. I've pushed up instructions for both now.
You were right about py.test being "oh my gosh unicorn magic!" to a lot of people. Going to make sure that I provide an explanation of what's going on.
I was able to pinpoint with greater certainty what kind of attendee I'm expecting for the workshop. It's now documented on the README file.
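On the py.test "magic": by default pytest collects files named test_*.py or *_test.py, runs every function in them whose name starts with test_, and counts a test as passed if it raises no exception. A minimal sketch of such a file, with min_max_scaler written in plain Python as a stand-in for the tutorial's version:

```python
# Contents of a hypothetical test_datafuncs.py


def min_max_scaler(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def test_min_max_scaler():
    # pytest finds this function by its test_ prefix and calls it.
    scaled = min_max_scaler([1, 2, 3])
    assert scaled == [0.0, 0.5, 1.0]
```

Running py.test in the directory containing this file would report one passing test; a failing assert would be reported with the values involved.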

Also a few good suggestions from the crowd:

Less is more, especially for a first-time tutorial. I've thus restructured the material as such:
1. Three "mandatory" notebooks. (intro, pytest, data checks).
2. Four "bonus material" notebooks. (file integrity, test coverage, property-based testing, projects)
In conjunction with the re-structuring, I'll use whatever time is needed to "lecture + hands-on" for the first three, and leave whatever's remaining for the students to independently explore on the "bonus material". They can literally choose between three of the four, and it'll be self-contained; the fourth (mini-projects) are for those who are fast enough to finish everything.
I was advised to take out Hypothesis from the tutorial, but I think it's better as a standalone self-exploration notebook.
Add in a notebook documenting resources to learn more things.

Responding to a select subset of your issues:

pandas_summary: it returns a dataframe that explicitly lets us select missing from the index. We can programmatically check for "no missing values" using this. Not possible with regular pandas, I think.
YAML: Yes! Totally forgot about that today. Going to make sure that's emphasized for the tutorial.
min_max_scaler error: I forgot to close the parentheses. (!!!!!!! noob mistake)
File integrity is going to be made an independent exploration notebook. I will beef up the amount of text in there to enable this.
"provider" + "analyst": classic error on my side! Fixed.
Environment checks now work. I've also provided a requirements.txt, and a venv-setup.sh script.
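On the pandas_summary point, plain pandas can also express a programmatic "no missing values" check, via DataFrame.isnull(). A small sketch with made-up data; the helper name has_no_missing is hypothetical.

```python
import pandas as pd


def has_no_missing(df):
    """Return True if the DataFrame contains no NaN/None values."""
    return not df.isnull().values.any()


complete = pd.DataFrame({"a": [1, 2], "b": [3.0, 4.0]})
gappy = pd.DataFrame({"a": [1, 2], "b": [3.0, float("nan")]})
print(has_no_missing(complete), has_no_missing(gappy))  # True False
```

df.isnull().sum() gives the per-column counts if a test should report which column has gaps.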

May 13 '17 21:05 ericmjl

Btw, @hugobowne, thanks for taking the time to provide the feedback! I've incorporated parts of it as I've updated the material this afternoon!

May 14 '17 00:05 ericmjl

@ericmjl it was an absolute pleasure going through these materials and I am PUMPED for thursday.

Everything above looks good to me; pre-req knowledge in readme is great; i am happy for you to close this issue at any point. Let me know what else you need from me :)

May 15 '17 20:05 hugobowne

Thanks @hugobowne! I'm flying tomorrow, see you in Portland! I'll close this issue once the tutorial is over; it'll give me a bit of a visual reminder about what might be left.

May 16 '17 02:05 ericmjl
