data_science_delivered
notes on stuff I should add
- if you lack constraints on datastores then duplicates will occur
- how to create setup.py
- hypothesis can fuzz mysql to make sure the data going in and back out is the same
- assume during data ingestion that you'll have duplications/redundancy - how to spot and remove?
- starting point for data ingestion - assume a sequence of processes that build on each other, not a single process that does every step at once. This way you can swap steps in and out, test each one in isolation and scale out to more machines
- list some text similarity metrics (fuzzywuzzy, Levenshtein); note the choice between char-based, word-based and char n-gram similarity; maybe removing punctuation/case/unicode first is useful?
- pandas read_csv uses dayfirst=False by default; consider dayfirst=True for poorly specified European (day-first) dates
- consider linking to http://datapatterns.org/pattern/
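
A minimal sketch of the duplicate-spotting and similarity notes above, using only the standard library rather than fuzzywuzzy or a Levenshtein package. The function names, the trigram size and the 0.8 threshold are illustrative assumptions, not recommendations from these notes; the all-pairs loop is only sensible for small inputs.

```python
import string
import unicodedata


def normalise(text):
    """Strip accents and punctuation, casefold, and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", text)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    no_punct = no_accents.translate(str.maketrans("", "", string.punctuation))
    return " ".join(no_punct.casefold().split())


def char_ngrams(text, n=3):
    """Set of character n-grams; short strings yield a single gram."""
    return {text[i:i + n] for i in range(max(len(text) - n + 1, 1))}


def ngram_similarity(a, b, n=3):
    """Jaccard similarity of the two character n-gram sets (0.0 to 1.0)."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def near_duplicates(records, threshold=0.8):
    """Return pairs of original records whose normalised forms look alike."""
    cleaned = [normalise(record) for record in records]
    pairs = []
    for i in range(len(cleaned)):
        for j in range(i + 1, len(cleaned)):
            if ngram_similarity(cleaned[i], cleaned[j]) >= threshold:
                pairs.append((records[i], records[j]))
    return pairs


if __name__ == "__main__":
    names = ["Électricité de France, S.A.", "ELECTRICITE DE FRANCE SA", "Alfa Bank"]
    for left, right in near_duplicates(names):
        print(f"possible duplicate: {left!r} ~ {right!r}")
```

Swapping `char_ngrams` for a word-level split gives word-based similarity with the same Jaccard calculation.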
learning strategies
- more clean data (probably) beats smarter algorithms
clustering for EDA
- t-sne in sklearn, visualisations https://lvdmaaten.github.io/tsne/ to help understand what to expect (stuff close in n-dimensions should be close in 2d)
cleaning
- glueviz should be noted for EDA (and qgrid?) and https://pypi.python.org/pypi/pivottablejs
- if during cleaning you have to deal with internationalised text (e.g. Russian "Альфа-Банк"), be aware that without tests a naive bit of processing (e.g. lowercasing and some cleaning rules in C#) might give you "?????-????", which you then blindly store in the database. This is a particular danger for mixed-programming-language transformations (C#'s .NET rules vs Python's rules), where each language does different things
- example of bad encoding twice " Électricité de France "
- date parsing: http://blog.scrapinghub.com/2015/11/09/parse-natural-language-dates-with-dateparser/
- https://github.com/aparrish/pycorpora/blob/master/README.rst lots of nice mappings using small JSON datasets
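
The two encoding problems noted above can be reproduced in a few lines, which may help when writing tests for a cleaning step. This is a sketch under assumptions: forcing text through ASCII stands in for whatever the lossy step really was, and cp1252 is assumed as the wrong codec in the double-encoding case. The helper names are hypothetical.

```python
name = "Альфа-Банк"

# A naive step that forces text through ASCII silently destroys it:
# every Cyrillic character becomes "?" while the hyphen survives.
lossy = name.lower().encode("ascii", errors="replace").decode("ascii")


def survives_round_trip(text, encoding):
    """True if text can be encoded and decoded again without any change."""
    try:
        return text.encode(encoding).decode(encoding) == text
    except UnicodeError:
        return False


def mojibake(text, times=1):
    """Simulate UTF-8 bytes being wrongly decoded as cp1252, `times` over."""
    for _ in range(times):
        text = text.encode("utf-8").decode("cp1252")
    return text


def repair(text, times=1):
    """Undo mojibake() by reversing the wrong decode the same number of times."""
    for _ in range(times):
        text = text.encode("cp1252").decode("utf-8")
    return text


if __name__ == "__main__":
    print(lossy)
    print(survives_round_trip(name, "ascii"), survives_round_trip(name, "utf-8"))
    twice = mojibake("Électricité de France", times=2)
    print(twice, "->", repair(twice, times=2))
```

A round-trip assertion like `survives_round_trip` is cheap to run at every boundary where data crosses between languages or datastores.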
process
- list project-types that might work and why, @springcoil talks on the requirement to invest in tooling to deliver working systems
- r&d != engineering
- how might r&d (e.g. 1 person) interface with an eng team?
- which bits of an agile process seem to work well? do sprints work well (depends on the task-type)?
- who 'owns' the data/process, can that cause problems?
- does the lack of a shared language hinder things?
- data scientists need clean data but the system will probably always have some dirty data, so there is a need for a data-cleaning process (data eng team?) that tries to improve the data quality to an agreed schema and can export/transform the data so it can be used by the r&d team
- building mini-monolithic-blocks is normal, remember to break them up into smaller services that can be tested, otherwise critical testing is easily skipped (costing development speed later)
- add logging early for anything production-like
- luigi for task pipelines to avoid manual steps
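
As a rough illustration of the last few points (small testable steps, logging from the start, no manual steps), here is a standard-library sketch. It is not Luigi's API: the step functions, file names and the "skip if the output already exists" rule are assumptions chosen to mimic the general idea of a task pipeline.

```python
import json
import logging
import tempfile
from pathlib import Path

logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s %(levelname)s %(name)s %(message)s")
log = logging.getLogger("pipeline")


def run_step(name, func, source, target):
    """Apply func to the rows in source and write target, unless it exists."""
    if target.exists():
        log.info("%s: %s already exists, skipping", name, target.name)
        return False
    rows = json.loads(source.read_text(encoding="utf-8"))
    result = func(rows)
    target.write_text(json.dumps(result, ensure_ascii=False), encoding="utf-8")
    log.info("%s: %d rows in, %d rows out", name, len(rows), len(result))
    return True


def strip_whitespace(rows):
    return [row.strip() for row in rows]


def drop_duplicates(rows):
    # dict.fromkeys keeps the first occurrence and preserves order
    return list(dict.fromkeys(rows))


def run_pipeline(workdir):
    """Chain the steps: each one reads the file the previous one wrote."""
    steps = [
        ("strip", strip_whitespace, workdir / "stripped.json"),
        ("dedupe", drop_duplicates, workdir / "deduped.json"),
    ]
    source = workdir / "raw.json"
    for name, func, target in steps:
        run_step(name, func, source, target)
        source = target
    return json.loads(source.read_text(encoding="utf-8"))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        workdir = Path(tmp)
        (workdir / "raw.json").write_text(json.dumps(["a ", "a", " b"]))
        print(run_pipeline(workdir))
```

Because each step is a plain function over rows, it can be unit-tested without touching the filesystem, and the intermediate files make it possible to rerun only the later steps.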
getting hired:
- what you need to show if you want to get hired (github, talks)
- minimal stuff you should do to be more visible
list of tools I'd like to see
- auto-possible-euro-datetime-checker (icy.py?) for pandas when reading ambiguous datetimes
- string->unit converter (e.g. for relative times like "7 minutes" and weights and measures e.g. "23cm", "1inch", "1 in.", "2000m", "2kilometres", "1 pound", "23oz.", "0.25kg")
- datetime parsing http://crsmithdev.com/arrow/ (stronger parser than labix dateutil I think), https://github.com/bear/parsedatetime/ (human friendly input?), https://dateparser.readthedocs.org/en/latest/ (relative dates as input)
- anonymisation http://blog.applied.ai/approaches-to-data-anonymisation/
- data generators eg https://github.com/jbrambleDC/simulacram?files=1
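
The first tool on this wish list could start out as something like the sketch below: classify each date string by whether it parses day-first, month-first or both, then suggest a value for pandas' `dayfirst`. It assumes numeric `DD/MM/YYYY`-style strings with a single separator; `classify_date` and `suggest_dayfirst` are hypothetical names, not an existing library.

```python
from collections import Counter
from datetime import datetime


def parse_or_none(text, fmt):
    try:
        return datetime.strptime(text, fmt)
    except ValueError:
        return None


def classify_date(text, sep="/"):
    """Label one date string by which day/month order it can be read in."""
    day_first = parse_or_none(text, f"%d{sep}%m{sep}%Y")
    month_first = parse_or_none(text, f"%m{sep}%d{sep}%Y")
    if day_first is not None and month_first is not None:
        # Both readings are valid; they only agree when day == month.
        return "same" if day_first == month_first else "ambiguous"
    if day_first is not None:
        return "day-first"
    if month_first is not None:
        return "month-first"
    return "unparseable"


def suggest_dayfirst(dates, sep="/"):
    """Return True, False, or None when the column gives no clear evidence."""
    counts = Counter(classify_date(d, sep) for d in dates)
    if counts["day-first"] and not counts["month-first"]:
        return True
    if counts["month-first"] and not counts["day-first"]:
        return False
    return None  # mixed evidence or nothing but ambiguous values


if __name__ == "__main__":
    column = ["03/04/2015", "13/04/2015", "05/05/2015"]
    print([classify_date(d) for d in column])
    print("dayfirst suggestion:", suggest_dayfirst(column))
```

A `None` result is the interesting case: it means every value in the column could be read either way, so the order has to come from the data's source rather than from the data itself.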
further reading
- https://github.com/rasbt/python-machine-learning-book/tree/master/faq notes from the book
- https://github.com/hangtwenty/dive-into-machine-learning
pipeline building
- add https://github.com/airbnb/airflow
tools on my radar
- https://github.com/ceumicrodata/mETL for ETL via YAML, no programming required
review:
- https://svaksha.github.io/pythonidae/