data_science_delivered

notes on stuff I should add

Open: ianozsvald opened this issue 9 years ago • 2 comments

  • if you lack constraints on datastores then duplicates will occur
  • how to create setup.py
  • hypothesis can fuzz mysql to make sure the data going in and back out is the same
  • assume during data ingestion that you'll have duplications/redundancy - how to spot and remove?
  • starting point for data ingestion: assume this is a sequence of processes that build on each other, not a single process with all the steps done at once. This way you can swap things in, test in isolation and scale to more machines
  • list some text similarity metrics (fuzzywuzzy, Levenshtein); note whether you're doing char-based, word-based or char n-gram similarity; maybe removing punctuation/case/unicode is useful?
  • pandas read_csv has dayfirst=False by default; consider dayfirst=True for poorly specified European (day-first) dates
  • consider linking to http://datapatterns.org/pattern/
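
A minimal sketch of the normalisation and char n-gram similarity ideas above, using only the standard library rather than fuzzywuzzy or a Levenshtein package. The helper names (`normalise`, `char_ngrams`, `ngram_similarity`), the use of trigrams and the Jaccard overlap score are all assumptions picked for illustration, not something these notes prescribe.

```python
import string
import unicodedata


def normalise(text):
    """Lowercase, strip accents and ASCII punctuation, collapse whitespace."""
    # NFKD splits accented characters into base character + combining mark
    decomposed = unicodedata.normalize("NFKD", text)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    no_punct = "".join(ch for ch in no_accents if ch not in string.punctuation)
    return " ".join(no_punct.lower().split())


def char_ngrams(text, n=3):
    """Return the set of character n-grams in text."""
    if len(text) < n:
        return {text} if text else set()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def ngram_similarity(a, b, n=3):
    """Jaccard overlap of character n-grams after normalisation (0.0 to 1.0)."""
    grams_a = char_ngrams(normalise(a), n)
    grams_b = char_ngrams(normalise(b), n)
    if not grams_a and not grams_b:
        return 1.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)
```

Pairs scoring above some threshold (to be tuned per dataset) are candidate duplicates for review. Note that stripping accents is lossy and can merge distinct letters in some scripts, so it is worth testing on your own data first.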

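On the first note in the list above (no constraints on a datastore means duplicates), here is a small sketch of letting the datastore reject duplicates. It uses an in-memory SQLite database from the standard library as a stand-in for MySQL; the `companies` table and the `insert_company` helper are hypothetical names for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The UNIQUE constraint is what stops the second "Acme Ltd" being stored
conn.execute("CREATE TABLE companies (name TEXT NOT NULL UNIQUE)")


def insert_company(conn, name):
    """Insert a name; return False if the UNIQUE constraint rejects it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute("INSERT INTO companies (name) VALUES (?)", (name,))
        return True
    except sqlite3.IntegrityError:
        return False


results = [insert_company(conn, n) for n in ["Acme Ltd", "Acme Ltd", "Альфа-Банк"]]
stored = [row[0] for row in conn.execute("SELECT name FROM companies ORDER BY rowid")]
```

Reading `stored` back also gives a basic check that what went in comes out unchanged, which is the property the hypothesis note above wants to fuzz more thoroughly.
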
learning strategies

  • more clean data (probably) beats smarter algorithms

clustering for EDA

  • t-sne in sklearn, visualisations https://lvdmaaten.github.io/tsne/ to help understand what to expect (stuff close in n-dimensions should be close in 2d)

cleaning

  • glueviz should be noted for EDA (and qgrid?) and https://pypi.python.org/pypi/pivottablejs
  • if during cleaning you have to deal with internationalised text (e.g. Russian "Альфа-Банк"), be aware that without tests a naive bit of processing (e.g. lowercasing and some cleaning rules in C#) might give you "?????-????", which you then blindly store in the database. This is a danger for mixed-programming-language transformations (C#'s .NET rules vs Python's rules), where the two languages do different things
  • example of bad encoding twice " Électricité de France "
  • date parsing: http://blog.scrapinghub.com/2015/11/09/parse-natural-language-dates-with-dateparser/
  • https://github.com/aparrish/pycorpora/blob/master/README.rst lots of nice mappings using small JSON datasets
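
A rough sketch of the two encoding failures noted above, reproduced in Python. The lossy ASCII conversion is only a Python imitation of what a naive step in another language might do, and cp1252 as the wrongly assumed codec is an assumption for illustration; real pipelines may go wrong in other ways. The function names are hypothetical.

```python
def ascii_with_replacement(text):
    """Mimic a lossy conversion: anything outside ASCII becomes '?'."""
    return text.encode("ascii", errors="replace").decode("ascii")


def double_encode(text):
    """Simulate UTF-8 bytes being misread as cp1252, twice over."""
    for _ in range(2):
        text = text.encode("utf-8").decode("cp1252")
    return text


def undo_double_encode(text):
    """Reverse double_encode; raises UnicodeError if text was not mangled this way."""
    for _ in range(2):
        text = text.encode("cp1252").decode("utf-8")
    return text


original = "Électricité de France"
mangled = double_encode(original)
print(ascii_with_replacement("Альфа-Банк"))  # ?????-????
```

A test that asserts `undo_double_encode(mangled) == original`, or that no stored name consists only of `?` characters, is the kind of cheap check that catches this before it reaches the database.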

process

  • list project-types that might work and why, @springcoil talks on the requirement to invest in tooling to deliver working systems
  • r&d != engineering
  • how might r&d (e.g. 1 person) interface with an eng team?
  • which bits of an agile process seem to work well? do sprints work well (depends on the task-type)?
  • who 'owns' the data/process, can that cause problems?
  • does the lack of a shared language hinder things?
  • data scientists need clean data, but the system will probably always have some dirty data. There is a need for a data-cleaning process (data eng team?) that tries to improve data quality to an agreed schema and can export/transform the data so it can be used by the r&d team
  • building mini-monolithic-blocks is normal; remember to break them up into smaller services that can be tested, else critical testing can easily be avoided (costing development speed later)
  • add logging early for anything production-like
  • luigi for task pipelines to avoid manual steps
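
A sketch of the "small testable steps plus early logging" points above, in plain Python rather than luigi. The stage names (`parse`, `clean`, `deduplicate`), the `name,amount` input format and the `run_pipeline` helper are assumptions for illustration; a task-pipeline tool would add scheduling and dependency tracking on top of the same idea.

```python
import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(name)s %(levelname)s %(message)s",
)
log = logging.getLogger("ingest")


def parse(lines):
    """Stage 1: split raw 'name,amount' lines into dicts of strings."""
    return [dict(zip(("name", "amount"), line.split(","))) for line in lines]


def clean(rows):
    """Stage 2: trim whitespace and convert amounts to float."""
    return [{"name": r["name"].strip(), "amount": float(r["amount"])} for r in rows]


def deduplicate(rows):
    """Stage 3: keep the first occurrence of each (name, amount) pair."""
    seen, unique = set(), []
    for r in rows:
        key = (r["name"], r["amount"])
        if key not in seen:
            seen.add(key)
            unique.append(r)
    return unique


def run_pipeline(lines, stages=(parse, clean, deduplicate)):
    """Run each stage in order, logging record counts between stages."""
    data = lines
    for stage in stages:
        data = stage(data)
        log.info("%s produced %d records", stage.__name__, len(data))
    return data
```

Because each stage is an ordinary function taking and returning a list, it can be unit-tested alone, swapped out, or moved to its own process later.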

getting hired:

  • what you need to show if you want to get hired (github, talks)
  • minimal stuff you should do to be more visible

list of tools I'd like to see

  • auto-possible-euro-datetime-checker (icy.py?) for pandas when reading ambiguous datetimes
  • string->unit converter (e.g. for relative times like "7 minutes" and weights and measures e.g. "23cm", "1inch", "1 in.", "2000m", "2kilometres", "1 pound", "23oz.", "0.25kg")
  • datetime parsing http://crsmithdev.com/arrow/ (stronger parser than labix dateutil I think), https://github.com/bear/parsedatetime/ (human friendly input?), https://dateparser.readthedocs.org/en/latest/ (relative dates as input)
  • anonymisation http://blog.applied.ai/approaches-to-data-anonymisation/
  • data generators eg https://github.com/jbrambleDC/simulacram?files=1
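
One possible shape for the ambiguous-datetime checker wished for above: scan a column of numeric date strings and report whether the evidence points to day-first, month-first, or neither. This is a standard-library sketch; `classify_date_order`, the regex and the four return labels are hypothetical and only cover N/N/YYYY-style strings.

```python
import re

# Two 1-2 digit fields then a 2-4 digit year, separated by '/', '.' or '-'
DATE_RE = re.compile(r"^\s*(\d{1,2})[/.-](\d{1,2})[/.-](\d{2,4})\s*$")


def classify_date_order(date_strings):
    """Guess the field order used by a collection of numeric date strings.

    Returns "dayfirst", "monthfirst", "ambiguous" or "inconsistent".
    Strings that do not match DATE_RE are ignored.
    """
    day_first_evidence = month_first_evidence = False
    for s in date_strings:
        m = DATE_RE.match(s)
        if not m:
            continue
        first, second = int(m.group(1)), int(m.group(2))
        if first > 12 and second <= 12:
            day_first_evidence = True    # e.g. 25/12/2015 can only be day/month
        elif second > 12 and first <= 12:
            month_first_evidence = True  # e.g. 12/25/2015 can only be month/day
    if day_first_evidence and month_first_evidence:
        return "inconsistent"
    if day_first_evidence:
        return "dayfirst"
    if month_first_evidence:
        return "monthfirst"
    return "ambiguous"
```

A "dayfirst" result could be used to decide on `dayfirst=True` when calling pandas `read_csv`; "ambiguous" and "inconsistent" results probably need a human to look at the source of the data.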

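For the string->unit converter idea in the list above, a deliberately tiny sketch that handles the example strings given there. The `UNITS` table, the choice of base units (metres, kilograms, seconds) and the `parse_quantity` name are assumptions; a real tool would need far more units and better handling of ambiguity (e.g. "m" as metres vs minutes).

```python
import re

# unit -> (dimension, factor to base unit: metres, kilograms or seconds)
UNITS = {
    "cm": ("length", 0.01), "m": ("length", 1.0),
    "km": ("length", 1000.0), "kilometre": ("length", 1000.0),
    "in": ("length", 0.0254), "inch": ("length", 0.0254),
    "kg": ("mass", 1.0), "oz": ("mass", 0.028349523125),
    "lb": ("mass", 0.45359237), "pound": ("mass", 0.45359237),
    "second": ("time", 1.0), "minute": ("time", 60.0),
}
# A number, optional space, a unit word, optional trailing full stop
QUANTITY_RE = re.compile(r"^\s*([-+]?\d*\.?\d+)\s*([a-zA-Z]+)\.?\s*$")


def parse_quantity(text):
    """Parse e.g. '23cm' or '1 in.' into (dimension, value in base units)."""
    m = QUANTITY_RE.match(text)
    if not m:
        raise ValueError(f"cannot parse quantity: {text!r}")
    value, unit = float(m.group(1)), m.group(2).lower()
    # Try the unit as written, then with a plural 's' or 'es' removed
    candidates = [unit]
    if unit.endswith("s"):
        candidates.append(unit[:-1])
    if unit.endswith("es"):
        candidates.append(unit[:-2])
    for candidate in candidates:
        if candidate in UNITS:
            dimension, factor = UNITS[candidate]
            return dimension, value * factor
    raise ValueError(f"unknown unit in: {text!r}")
```

For example, `parse_quantity("7 minutes")` gives a time in seconds and `parse_quantity("2kilometres")` a length in metres, so values written in mixed units become comparable.
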
further reading

  • https://github.com/rasbt/python-machine-learning-book/tree/master/faq notes from the book
  • https://github.com/hangtwenty/dive-into-machine-learning

pipeline building

  • add https://github.com/airbnb/airflow

tools on my radar

  • https://github.com/ceumicrodata/mETL for ETL via YAML, no programming required

review:

  • https://svaksha.github.io/pythonidae/

ianozsvald, Oct 19 '15 13:10

I'm @springcoil on this

springcoil, Mar 14 '16 14:03

Ooops, fixed, cheers :-) /cc @springcoil

ianozsvald, Mar 14 '16 20:03