Define a "standard" process to follow when solving simple data science problems
Some short things this should talk about:
- Data preparation
  - cleaning (removal of incomplete data, lookup of further data, get everything into one matrix/table/format)
  - whitening (normalize, center & de-correlate - Warning: throws away correlation information!)
  - dimension reduction (PCA, ICA, ... - Warning: throws away information!) - see the sketch after this list
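A minimal sketch of the whitening and dimension-reduction steps, assuming tabular data in a NumPy array and scikit-learn; the data and the parameter choices are placeholders, not part of the proposal:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))   # placeholder data: rows = samples, cols = features

# Center and scale each feature to zero mean / unit variance.
X_std = StandardScaler().fit_transform(X)

# PCA with whiten=True de-correlates the features (warning from above:
# the correlation information is discarded). n_components=0.95 keeps
# just enough components to explain 95% of the variance, dropping the rest.
pca = PCA(n_components=0.95, whiten=True)
X_white = pca.fit_transform(X_std)
print(X_white.shape)
```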
- Algorithm selection
  - supervised? unsupervised?
  - typical solutions for typical problems (classification, correlation, non-metric solutions (e.g. NLP with suffix trees, edit distance, etc.)) - see the baseline sketch after this list
  - for each algorithm
    - when and when NOT to use
    - further reading
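To make the "typical solutions for typical problems" bullet concrete, here is a hedged baseline sketch for a supervised classification task; the dataset and model choice are illustrative only, not a recommendation from the proposal:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Supervised setting: we have labels y, so start with a simple,
# well-understood baseline before reaching for anything fancier.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(f"test accuracy: {clf.score(X_test, y_test):.2f}")
```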
- Ways to present/interpret results
  - statistical significance?
  - typical tests/metrics (AUC, F1 score, sensitivity/specificity, etc.) - see the metrics sketch after this list
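A short sketch of how the listed metrics could be computed with scikit-learn; the labels and scores below are made-up placeholders:

```python
from sklearn.metrics import confusion_matrix, f1_score, roc_auc_score

y_true  = [0, 0, 1, 1, 1, 0, 1, 0]                    # placeholder ground truth
y_score = [0.1, 0.4, 0.8, 0.7, 0.3, 0.2, 0.9, 0.6]    # placeholder model scores
y_pred  = [1 if s >= 0.5 else 0 for s in y_score]     # thresholded predictions

print("AUC:", roc_auc_score(y_true, y_score))  # uses raw scores, not labels
print("F1: ", f1_score(y_true, y_pred))

# Sensitivity and specificity from the binary confusion matrix.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("sensitivity:", tp / (tp + fn))  # true positive rate
print("specificity:", tn / (tn + fp))  # true negative rate
```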
@ixxie wants to write about reproducibility with Jupyter and Nix; I've added him to the DH members' list, so he should see this soon as well.
Hmmm, I am not sure this is quite relevant here; my goal is more to create easily reproducible infrastructure as code, i.e. to allow anybody to deploy a data science platform relatively easily. Reproducibility of individual computations is also of great interest, and Nix can help with this, but I don't know much about it at the moment (I'd be willing to look into it some time!).
FWIW, it seems a bit far-fetched to specify a simple decision-tree recipe for doing data science; the way I would approach this is to think of it as a bipartite graph: list some problems (e.g. tokenization, classification, clustering) and some algorithms (CRFs, RNNs, HDBSCAN) and link between them.
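One possible sketch of that bipartite mapping as a plain Python dict; the entries are only the examples from the comment above, not an authoritative list:

```python
# Bipartite problems-to-algorithms mapping; extend both sides as needed.
PROBLEM_TO_ALGORITHMS = {
    "tokenization":   ["CRFs", "RNNs"],
    "classification": ["CRFs", "RNNs"],
    "clustering":     ["HDBSCAN"],
}

def algorithms_for(problem: str) -> list[str]:
    """Look up candidate algorithms for a given problem type."""
    return PROBLEM_TO_ALGORITHMS.get(problem, [])

print(algorithms_for("clustering"))  # ['HDBSCAN']
```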