data-testing-tutorial
Workshop ruminations
hey @ericmjl this Workshop is really exciting. I have now gone through all NBs and attempted to run everything locally.
My feedback is below. I'm on a plane and git pulled before take-off, so some of my feedback may already be out of date. I have included some check boxes, if it helps. Between now and the Workshop, I am happy to help implement any of these or other changes you see fit. Happy to chat here but would be good to sync and chat also. See you in Portland!
general:
- [ ] what are your requirements on an attendee? Intermediate Python programming skills, for example, know how to write functions, pandas slicing, raising errors in functions (could be good to include in README.md to manage expectations); I could imagine that code such as the following may require explanation for many attendees:
hashes['concat'] = df.apply(lambda x: ''.join(str(x[col]) for col in df.columns), axis=1)
- [x] when i run `checkenv.py`, I get `ModuleNotFoundError: No module named 'colorama'`; include in environment? it's funny that i have all packages required for the Workshop installed but not all packages required to check the environment! :D
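
For attendees unfamiliar with the `hashes['concat']` line quoted above, here is a minimal sketch of what it does. The toy DataFrame, its column names and its values are made up for illustration and are not taken from the workshop data.

```python
import pandas as pd

# Hypothetical toy data; the workshop notebooks use their own DataFrame.
df = pd.DataFrame({"city": ["Boston", "Portland"], "temp": [12, 18]})

hashes = pd.DataFrame(index=df.index)

# axis=1 hands each row to the lambda as a Series, so x[col] is a single cell.
# Every cell is converted to a string and the strings are joined end to end.
hashes["concat"] = df.apply(
    lambda x: "".join(str(x[col]) for col in df.columns), axis=1
)

print(hashes["concat"].tolist())
```

Each row collapses into one string, which can then be hashed or compared between two copies of the data.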
2-pytest-intro
- [x] 1st exercise says to create test_datafuncs.py but it's already in repo: perhaps this is instructor repo? in general, it seems as though many of the exercises have already been done in this repo, e.g. min-max scaler (prob intentional)
- [x] it isn't obvious why running `py.test` should do anything; perhaps explain in a few words? The terminal is a mystical place for many.
- [x] for the first `test_min_max_scaler()`, do you need to import numpy in `test_datafuncs.py`?
- [x] the following line of code in the 2nd `test_min_max_scaler()` makes pytest throw a `SyntaxError`; can you reproduce this?
assert np.allclose(tfm, np.array([0, 0.5, 1])
- [x] i really love the textual data example; i'm wondering if you expect attendees to know what lines of code such as the following do or if you'll explain it?
return ''.join(s for s in text if s not in exclude)
- similarly with the tests; also, why do you run the test functions after defining them in the NB? Just wondering :)
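
To make the points above concrete, here is a rough sketch of what a file like `test_datafuncs.py` might contain. The bodies of `min_max_scaler` and `strip_punctuation`, and the default `exclude` set, are assumptions written for this example; the workshop repo may implement them differently.

```python
import numpy as np


def min_max_scaler(x):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())


def strip_punctuation(text, exclude=frozenset(".,!?;:")):
    """Keep only the characters of `text` that are not in `exclude`."""
    # The generator walks the string one character at a time;
    # "".join(...) glues the surviving characters back together.
    return "".join(s for s in text if s not in exclude)


# By default pytest collects functions whose names start with `test` from
# files named `test_*.py` or `*_test.py`, calls each one, and reports any
# assertion that fails. That is why running `py.test` "does something".
def test_min_max_scaler():
    tfm = min_max_scaler(np.array([1, 2, 3]))
    # Both closing parentheses are needed; dropping one gives a SyntaxError.
    assert np.allclose(tfm, np.array([0, 0.5, 1]))


def test_strip_punctuation():
    assert strip_punctuation("Hello, world!") == "Hello world"
```

Note that `numpy` has to be imported in the test file itself, since the test calls `np.allclose` directly.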
3-file-integrity
- [ ] cell 2: attendees may wonder what `sha256()`, `update()` and `hexdigest()` are
- [ ] in cell 12 (defining `hash_file`), for the 1st time you're using the `update()` method several times; this is worth a few words;
- love the tinydb :)
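
A short sketch of the three `hashlib` calls mentioned above, and of why `update()` can be called repeatedly. This `hash_file` is an illustrative version written for this note (the `chunk_size` parameter is an assumption), not necessarily the one in the notebook.

```python
import hashlib


def hash_file(path, chunk_size=65536):
    """Return the SHA-256 hex digest of the file at `path`, read in chunks."""
    hasher = hashlib.sha256()  # a fresh, empty hash object
    with open(path, "rb") as f:
        # Each update() call feeds more bytes into the running hash,
        # so a large file never has to fit in memory all at once.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()  # the digest as a hexadecimal string


# Feeding data in pieces gives the same digest as feeding it all at once.
piecewise = hashlib.sha256()
piecewise.update(b"hello ")
piecewise.update(b"world")
assert piecewise.hexdigest() == hashlib.sha256(b"hello world").hexdigest()
```

Two files with identical bytes give identical digests, so comparing digests is a cheap way to detect that a data file has changed.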
4-data-checks
- [ ] typo: under Schema Checks, 'expected' is missing the 'd'
- [x] yaml: you could remind them that they used one to set up their system with the conda env! :D
- [x] you write 'Let's now switch roles, and pretend that we're on side of the "analyst" and are no longer the "data provider".' but this is the first time the idea of a "data provider" has come up in the NB
- [x] in the exercise to write function `test_data_columns`, do you want them to run `py.test` to see the fruits of their labour?
- [ ] when you write 'Take the schema spec file and write a test for it.', which schema spec file are you talking about?
- i love `missingno`!
- interesting to use `pandas_summary`; idiomatic pandas would generally lead me down the path of using `dfs.describe()` and `dfs.info()`
- [x] after they write the function `test_data_completeness()` and add it to `test_datafuncs.py`, running `py.test` doesn't find a DataFrame `df`; similarly with `test_data_range()`; i'm probably missing something silly;
- great ECDFs! ;)
- [ ] you write 'We can take the EDA portion further, by doing an empirical cumulative distribution plot for each data column.' but then do something else first; explain why we need `compute_dimensions()`?
- wrt K-S test, also check out Lilliefors: https://en.wikipedia.org/wiki/Lilliefors_test
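
On the missing `df` in `test_data_completeness()` and `test_data_range()`: one possible explanation is that a test file run by `py.test` is its own module and cannot see variables defined in a notebook, so the DataFrame has to be built inside the test file. A minimal sketch under that assumption follows; the inline CSV, the `load_data` helper and the range bounds are all hypothetical stand-ins.

```python
import io

import pandas as pd

# Hypothetical stand-in for the workshop CSV, inlined to keep the sketch
# self-contained. A real test would read the actual data file instead.
CSV_TEXT = "station,temp\nA,12.0\nB,18.5\n"


def load_data():
    """Build the DataFrame here rather than relying on a notebook variable."""
    return pd.read_csv(io.StringIO(CSV_TEXT))


def test_data_completeness():
    df = load_data()
    # isnull().sum() counts missing values per column; the total should be 0.
    assert df.isnull().sum().sum() == 0


def test_data_range():
    df = load_data()
    # Illustrative bounds only; real limits depend on the dataset.
    assert df["temp"].between(-50, 60).all()
```

With the data loaded inside each test (or in a shared helper), the tests no longer depend on any state from the notebook session.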
5-test-coverage
- `coverage` seems really cool! running `py.test --cov`, though, throws
py.test: error: unrecognized arguments: --cov
- do you plan to expand on this NB? Can I help at all? One way would be to play around with the functions and test functions and see the differing results;
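
A note on the `--cov` error above: that flag is provided by the separate `pytest-cov` plugin, so pytest rejects it when the plugin is not installed. A quick way to check from Python is sketched below; the helper name `has_pytest_cov` is made up for this example.

```python
import importlib.util


def has_pytest_cov():
    """Return True if the pytest-cov plugin (module `pytest_cov`) is importable."""
    return importlib.util.find_spec("pytest_cov") is not None


if not has_pytest_cov():
    print("pytest-cov is missing; install it (e.g. `pip install pytest-cov`) to enable --cov")
```

A check like this could sit alongside the other package checks in `checkenv.py`.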
Given today's dry run at Boston Python, I have the following epiphanies:
- Environments are difficult. I've had relatively smooth sailing with NAMS because environment instructions included both Anaconda + venv systems. I've pushed up instructions for both now.
- You were right about `py.test` being "oh my gosh unicorn magic!" to a lot of people. Going to make sure that I provide an explanation of what's going on.
- I was able to pinpoint with greater certainty what kind of attendee I'm expecting for the workshop. It's now documented on the README file.
Also a few good suggestions from the crowd:
- Less is more, especially for a first-time tutorial. I've thus restructured the material as such:
  - Three "mandatory" notebooks. (intro, pytest, data checks).
  - Four "bonus material" notebooks. (file integrity, test coverage, property-based testing, projects)
- In conjunction with the re-structuring, I'll use whatever time is needed to "lecture + hands-on" for the first three, and leave whatever's remaining for the students to independently explore on the "bonus material". They can literally choose between three of the four, and it'll be self-contained; the fourth (mini-projects) are for those who are fast enough to finish everything.
- I was advised to take out Hypothesis from the tutorial, but I think it's better as a standalone self-exploration notebook.
- Add in a notebook documenting resources to learn more things.
Responding to a select subset of your issues:
- `pandas_summary`: it returns a dataframe that explicitly lets us select `missing` from the index. We can programmatically check for "no missing values" using this. Not possible with regular `pandas`, I think.
- YAML: Yes! Totally forgot about that today. Going to make sure that's emphasized for the tutorial.
- `min_max_scaler` error: I forgot to close the parentheses. (!!!!!!! noob mistake)
- File integrity is going to be made an independent exploration notebook. I will beef up the amount of text in there to enable this.
- "provider" + "analyst": classic error on my side! Fixed.
- Environment checks now work. I've also provided a `requirements.txt`, and a `venv-setup.sh` script.
Btw, @hugobowne, thanks for taking the time to provide the feedback! I've incorporated parts of it as I've updated the material this afternoon!
@ericmjl it was an absolute pleasure going through these materials and I am PUMPED for thursday.
Everything above looks good to me; pre-req knowledge in readme is great; i am happy for you to close this issue at any point. Let me know what else you need from me :)
Thanks @hugobowne! I'm flying tomorrow, see you in Portland! I'll close this issue once the tutorial is over; it'll give me a bit of a visual reminder about what might be left.