
Define a "standard" process to follow when solving simple data science problems

Open NickSeagull opened this issue 7 years ago • 3 comments

NickSeagull, Jan 18 '18

Some short things this should talk about:

  • Data preparation

    • cleaning (removal of incomplete data, lookup of further data, get everything into one matrix/table/format)
    • whitening (normalize, center & de-correlate - Warning: throws away correlation information!)
    • dimension reduction (PCA, ICA, ... - Warning: throws away information!) - see the first sketch below the list
  • Algorithm selection

    • supervised? unsupervised?
    • typical solutions for typical problems (classification, correlation, non-metric solutions (e.g. NLP with suffix trees, edit distance, etc.))
    • for each algorithm
      • when and when NOT to use
      • further reading
  • Ways to present/interpret results

    • statistical significance?
    • typical tests/metrics (AUC, F_1 score, sensitivity/specificity, etc.) - see the second sketch below
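
A minimal sketch of the preparation steps above, using Python with numpy/pandas/scikit-learn purely for illustration (the file name and numeric-column handling are placeholders, not a proposed standard):

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# cleaning: drop incomplete rows and get everything into one table
df = pd.read_csv("measurements.csv")    # placeholder file name
df = df.dropna()                        # removal of incomplete data
X = df.select_dtypes(include=[np.number]).to_numpy()

# whitening: center and normalize each feature
# (de-correlation happens via PCA below; remember this throws away
#  the correlation information)
X = (X - X.mean(axis=0)) / X.std(axis=0)

# dimension reduction: keep the components explaining 95% of the variance
# (this throws away information as well)
pca = PCA(n_components=0.95, whiten=True)
X_reduced = pca.fit_transform(X)
```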
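
And for the "present/interpret results" part, a second sketch of the typical metrics (the model, the split and the synthetic data are arbitrary; only the metric calls matter here):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, f1_score, confusion_matrix

# synthetic binary classification data, so the example runs on its own
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

model = LogisticRegression().fit(X_train, y_train)
probs = model.predict_proba(X_test)[:, 1]
preds = model.predict(X_test)

auc = roc_auc_score(y_test, probs)                        # AUC
f1 = f1_score(y_test, preds)                              # F_1 score
tn, fp, fn, tp = confusion_matrix(y_test, preds).ravel()
sensitivity = tp / (tp + fn)                              # true positive rate
specificity = tn / (tn + fp)                              # true negative rate
```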

Drezil, Jan 19 '18

@ixxie wants to write about reproducibility with Jupyter and Nix. I've added him to the DH members' list, so he should see this soon as well.

ocramz, Jan 25 '18

Hmmm, I am not sure this is quite relevant here; my goal is more to create easily reproducible infrastructure as code, i.e. to allow anybody to deploy a data science platform relatively easily. Reproducibility of individual computations is also of great interest, and Nix can help with this, but I don't know much about that at the moment (I'd be willing to look into it some time!).

FWIW, it seems a bit far-fetched to specify a simple decision-tree recipe for doing data science; the way I would approach this is to think of it like a bipartite graph: list some problems (e.g. tokenization, classification, clustering) and some algorithms (CRFs, RNNs, HDBSCAN) and link between them, roughly as in the sketch below.
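
Something like this, as a rough and deliberately non-exhaustive Python sketch of the idea (the entries are examples, not a curated mapping):

```python
# problems on one side, algorithms on the other; an edge means
# "this algorithm is a reasonable candidate for this problem"
problem_to_algorithms = {
    "tokenization":   ["regex rules", "CRFs", "RNNs"],
    "classification": ["logistic regression", "random forests", "RNNs"],
    "clustering":     ["k-means", "HDBSCAN"],
}
```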

ixxie, Apr 28 '18