machine-learning-articles
Good coding practices for Data Science
TL;DR
An efficient workflow for data science
Article Link
https://towardsdatascience.com/good-coding-practices-for-data-science-e9237783784c
Author
Key Takeaways
Code organization
- Specification Files: Files to specify various parameters for the code (YAML or JSON). Benefit: use the code in different ways with no code changes
- Utilities: Save reusable, generic helper code in separate files so it can be carried over to future projects.
- Core Functionality: Separate the pipeline of your project into different files (data extraction, data exploration, data engineering, modeling). Benefit: each file is easy to change and rerun without executing the entire pipeline, and the project stays organized for easy reviewing
- Main Executable: main.py executes the entire pipeline. It should be short enough for someone else to understand how the pieces of the project are integrated together (see the sketch after this list)
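A minimal sketch of what such a main.py could look like, assuming a hypothetical config.yaml specification file and hypothetical pipeline modules (extract, engineer, model); PyYAML is needed to read the YAML file:

```python
# main.py -- minimal sketch only; the module names (extract, engineer, model),
# their functions, and config.yaml are hypothetical placeholders.
import yaml  # PyYAML

from extract import load_raw_data        # data extraction step (hypothetical)
from engineer import build_features      # data engineering step (hypothetical)
from model import train_and_evaluate     # modeling step (hypothetical)


def main(config_path="config.yaml"):
    # Read run parameters from the YAML specification file,
    # so behaviour can change without touching the code.
    with open(config_path) as f:
        config = yaml.safe_load(f)

    raw = load_raw_data(config["data"])
    features = build_features(raw, config["features"])
    train_and_evaluate(features, config["model"])


if __name__ == "__main__":
    main()
```

Keeping main.py this short makes it easy for a reviewer to see how the pieces fit together.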
Documentation
Maintain a README page to keep track of code changes. It is useful for others who want to look at your code and understand how to use it.
Commenting
Add a comment at the top of every file to help you stay organized and to help readers understand what each file does.
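For example, a file-level docstring (the file name and description below are made up for illustration):

```python
"""
data_engineering.py (hypothetical example)

Cleans the raw input table and builds model-ready features.
Reads paths and parameters from the specification file and writes
the engineered features to disk for the modeling step.
"""
```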
Version Control
Benefits: easier collaboration and the ability to switch back to an older version. Useful for experimenting, editing, and comparing different versions
Automated testing
Use unittest to validate the functionality of different parts of the code
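A minimal sketch of a unittest test case; clean_column is a hypothetical helper defined inline just to keep the example self-contained:

```python
import unittest


def clean_column(values):
    """Hypothetical helper: strip whitespace and lowercase strings."""
    return [v.strip().lower() for v in values]


class TestCleanColumn(unittest.TestCase):
    def test_strips_and_lowercases(self):
        self.assertEqual(clean_column(["  Foo ", "BAR"]), ["foo", "bar"])


if __name__ == "__main__":
    unittest.main()
```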
Useful Code Snippets
Useful Tools
Comments / Questions
- Knowing these helpful techniques, we should gradually adopt them for more efficient project management
- Things we could add to this workflow:
  - Metrics and logging: keep track of metrics and data with MLflow (see the sketch below)
  - A tool for easily creating comprehensible configs: Hydra.cc
  - If a project's workflow turns out to be efficient, we can turn it into a template with Cookiecutter
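A minimal sketch of logging parameters and metrics with MLflow; the experiment name, parameter names, and values are illustrative only:

```python
import mlflow

mlflow.set_experiment("my-experiment")  # illustrative experiment name

with mlflow.start_run():
    # Log the hyperparameters used for this run
    mlflow.log_param("learning_rate", 0.01)
    mlflow.log_param("n_estimators", 100)
    # Log an evaluation metric so runs can be compared later
    mlflow.log_metric("rmse", 0.84)
```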