Data-Science

Projects and awesome list for all Data Science fields

🔄 Constantly updated. Subscribe so you don't miss anything.

  • [ ] For Deep Learning algorithms, please check the Deep Learning repository.

Data Science Tasks

Folders with all materials for specific task/domain

Educational Platforms

University courses 👩‍🎓

Title Description
MIT OpenCourseWare

Julia language

Title Description
Introduction to Computational Thinking

Time Series

Title Description
MIT 18.S096 Topics in Mathematics with Applications in Finance

    The purpose of the class is to expose undergraduate and graduate students to the mathematical concepts and techniques used in the financial industry. Mathematics lectures are mixed with lectures illustrating the corresponding application in the financial industry. MIT mathematicians teach the mathematics part while industry professionals give the lectures on applications in finance.

  • Video lectures

Online courses

GitHub Repositories :octocat:

| Title | Description |
| --- | --- |
| Data Science for Beginners - A Curriculum | Azure Cloud Advocates at Microsoft are pleased to offer a 10-week, 20-lesson curriculum all about Data Science. Each lesson includes pre-lesson and post-lesson quizzes, written instructions to complete the lesson, a solution, and an assignment. Our project-based pedagogy allows you to learn while building, a proven way for new skills to 'stick'. |
| Machine Learning for Beginners - A Curriculum | Azure Cloud Advocates at Microsoft are pleased to offer a 12-week, 26-lesson curriculum all about Machine Learning. In this curriculum, you will learn about what is sometimes called classic machine learning, using primarily Scikit-learn as a library and avoiding deep learning, which is covered in our forthcoming 'AI for Beginners' curriculum. |
| start-machine-learning | A complete guide to start and improve in machine learning (ML) and artificial intelligence (AI) in 2021 without ANY background in the field, and to stay up-to-date with the latest news and state-of-the-art techniques. |
| [Data Science Specialization Johns Hopkins Coursera](https://github.com/mGalarnyk/datasciencecoursera) | |

Books

GitHub Repositories :octocat:

| Title | Description |
| --- | --- |
| Awesome Artificial Intelligence (AI) | A curated list of Artificial Intelligence (AI) courses, books, video lectures and papers. |
| ml-surveys | Survey papers summarizing advances in deep learning, NLP, CV, graphs, reinforcement learning, recommendations, etc. |
| awesome-analytics-engineering | Awesome list of resources for analytics engineers. |

Tools

| Title | Description |
| --- | --- |
| Weight Watcher | WeightWatcher (WW) is an open-source diagnostic tool for analyzing Deep Neural Networks (DNN) without needing access to training or even test data. |

Papers

| Title | Description, Information |
| --- | --- |
| 2021: A Year Full of Amazing AI Papers - A Review 📌 | [work in progress] A curated list of the latest breakthroughs in AI by release date with a clear video explanation, link to a more in-depth article, and code. |

Certifications

Online Conferences, Meetups, Data Summer Schools

Twitter

Podcasts

Blogs

Companies Blogs

Other Blogs

Articles

Communities

| Title | Description |
| --- | --- |
| Coursera Community Data Science | |
| Locally Optimistic | A community for current and aspiring data analytics leaders. Started in NYC in early 2018 as an outgrowth of a Slack channel / extremely informal meetup group, we hope to share our thoughts / opinions / experiences / trials / tribulations with others in the community. |
| Deepchecks Community | A place to talk about MLOps news, articles, conferences, and really just anything in the MLOps space. |

Telegram Channels

  • DataScience Digest
    • Collection of the top articles, videos, events, books and jobs on Machine Learning, Deep Learning, NLP, Computer Vision and other aspects of Data Science.

Main skills required by data science vacancies

Research conducted by the Faculty of Applied Sciences at UCU. Link to the main article.

Big Data Software Engineer / Data Engineer

  1. Linear algebra. Calculus. Statistics and Probability Theory.
  2. Machine Learning Algorithms: regression, simulation, scenario analysis, modeling, clustering, decision trees, etc.
  3. Python 3, Pandas, scikit-learn, Keras, TensorFlow, NumPy, PyTorch.
  4. Data visualization.
  5. Software engineering methodologies, functional programming or object-oriented programming.
  6. DevOps: containerization and orchestration.
  7. Classic DBs (relational or object): MySQL, PostgreSQL, RDS.
  8. NoSQL (document, key-value, wide-column, search): MongoDB, Cassandra, HBase, Elasticsearch, Redis, DynamoDB.
  9. NewSQL (hybrid/in-memory): MemSQL, VoltDB.
  10. Query engines: Impala, Presto.
  11. Cloud platforms (GCP, AWS). Cloud computation (Dataflow, Dataproc). Streaming (Pub/Sub, Kafka). Data storage (BigQuery, Cloud SQL, Cloud Spanner, Firestore, BigTable).
  12. ETL Concepts / Processes.
  13. Data Warehouse technologies, Data Lake architecture.
  14. Data modeling: Bachman diagrams, Chen’s Notation, Object-relational mapping, etc.
  15. Processing frameworks: Apache Spark (PySpark/SparkR/sparklyr), Flink, Beam, Kafka Streams.
  16. Data pipeline and workflow management tools: Azkaban, Luigi, Airflow, etc.

Data Scientist

  1. Python (PyCharm, Pandas, NumPy, bs4, sklearn, scipy). R.
  2. Linear algebra. Calculus. Statistics.
  3. Machine Learning techniques (Decision Trees, Random Forest, SVM, Bayesian methods, XGBoost, K-Nearest Neighbors) and concepts: regression and classification, clustering, feature selection, feature engineering, the curse of dimensionality, bias-variance tradeoff.
  4. Data visualization.
  5. Data Mining (Clustering, Frequent Pattern Mining, Outliers Detection).
  6. Neural Networks and ML Packages (sklearn/XGBoost/TensorFlow/Keras, H2O).
  7. Cloud platforms (GCP, AWS). Cloud computation (Dataflow, Dataproc). Streaming (Pub/Sub, Kafka). Data storage (BigQuery, Cloud SQL, Cloud Spanner, Firestore, BigTable).
  8. Databases: SQL and NoSQL, AWS cloud storage, GDPR data privacy.
  9. Processing frameworks: Hadoop, Spark.
  10. Business Intelligence Software (Power BI, Tableau, Qlik, Cognos Analytics).

Machine Learning Engineer

  1. Computer science fundamentals, algorithms, mathematics, linear algebra, probability, and statistics.
  2. Python (Pandas, NumPy, scikit-learn, TensorFlow, Keras).
  3. Python visualization tools: matplotlib/seaborn, Plotly.
  4. Machine Learning techniques (Decision Trees, Random Forest, SVM, Bayesian methods, XGBoost, K-Nearest Neighbors) and concepts: regression and classification, clustering, feature selection, feature engineering, the curse of dimensionality, bias-variance tradeoff.
  5. Deep Learning: Recurrent Neural Network (LSTM/GRU units), Convolutional Neural Network.
  6. Machine learning frameworks (TensorFlow, Caffe2, PyTorch, Spark ML, scikit-learn) and ML techniques: GAN, ASR, RL.
  7. Databases: SQL and NoSQL. Hadoop ecosystem.
  8. Processing frameworks: Apache Spark (PySpark/SparkR/sparklyr).
  9. Cloud platforms (GCP, AWS).

Data Analyst

  1. Math, Statistics (regression, properties of distributions, statistical tests, and proper usage, etc.) and Probability Theory.
  2. Statistical programming software (R, Python, SAS, MATLAB).
  3. Predictive analytics (regression models, time-series analysis and forecasting, survival or duration analysis).
  4. BI tools: Google Data Studio / Microsoft Power BI / Tableau.
  5. Classic DBs: MySQL.
  6. MS Excel.
  7. A/B testing.

NLP Engineer / NLP Data Scientist

  1. Python (sklearn, NLTK, gensim, spaCy, TensorFlow, PyTorch, Keras) and Python Data Science toolkit: Jupyter Notebook, Pandas, NumPy, Matplotlib/Seaborn, SciPy.
  2. Databases: SQL and NoSQL (MySQL, MongoDB, PostgreSQL).
  3. NLP libraries: NLTK, spaCy, Stanford CoreNLP, etc.
  4. NLP techniques for text representation (TF-IDF, Word2Vec), semantic extraction, data structures and modeling.
  5. Methods of Information Extraction (NER, terminology extraction, keyword extraction, etc.).
  6. Machine Learning techniques and concepts (regression, trees, SVM, ensembles) for NLP tasks.

CV Engineer

  1. Linear Algebra. Geometry. Calculus. Statistics and Probability theory.
  2. Python 3: NumPy, pandas, seaborn, SciPy.
  3. Computer vision / image processing libraries such as OpenCV and Pillow.
  4. Neural network architectures: Convolutional Neural Networks (Inception, residual networks), LSTM, GAN.
  5. Neural network frameworks: TensorFlow, PyTorch.
  6. Computer vision algorithms and architectures: object detection, segmentation, face recognition, image processing, video processing.
  7. Real-time CV systems based on Deep Learning.
  8. Cloud model training (GCP, AWS), Cloud integration, Cloud Platforms.
  9. Performance metrics in object detection and classification, such as mAP and related.
  10. Big Data (Hadoop, Spark, Hive).

Deep Learning Engineer / Deep Learning Research Engineer

  1. Python 3: NumPy, scikit-learn, pandas, SciPy.
  2. Statistics (regression, properties of distributions, statistical tests, and proper usage, etc.) and probability theory.
  3. Deep learning frameworks: TensorFlow, PyTorch, MXNet, Caffe, Keras.
  4. Deep learning architectures: VGG, ResNet, Inception, MobileNet.
  5. Deepnets, hyperparameter optimization, visualization, interpretation.
  6. Machine learning models.

The Data Science Interview Preparation

Typical interview structure

  1. Software Engineering (for more, visit the Interview Preparation repository)
  2. Applied Statistics
  3. Machine Learning
  4. Data Wrangling, Manipulation and Visualisation

2. Applied Statistics

  • Descriptive statistics (What distribution does my data follow? What are the modes of the distribution, the expectation, the variance?)
  • Probability theory (Given that my data follows a Binomial distribution, what is the probability of observing 5 paying customers in 10 click-through events?)
  • Hypothesis testing (the basis of any question on A/B testing: t-tests, ANOVA, chi-squared tests, etc.)
  • Regression (Is the relationship between my variables linear? What are potential sources of bias? What are the assumptions behind the ordinary least squares solution?)
  • Bayesian Inference (What are some advantages/disadvantages vs frequentist methods?)
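
The binomial question above (5 paying customers in 10 click-through events) can be worked through in a few lines. This is a minimal sketch: the conversion rate of 0.3 is an assumption made for illustration, since the question itself does not give one, and `binomial_pmf` is a hypothetical helper name.

```python
from math import comb


def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in n independent trials,
    each succeeding with probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# Assumed conversion rate (not given in the question above).
conversion_rate = 0.3

# P(exactly 5 paying customers out of 10 click-through events)
prob_five = binomial_pmf(5, 10, conversion_rate)
print(f"P(X = 5) = {prob_five:.4f}")  # prints "P(X = 5) = 0.1029"
```

In an interview it is worth stating the assumptions behind the model as well: each click-through is independent and has the same probability of converting.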
  1. Introduction to Probability and Statistics, an open course on everything listed above including questions and an exam to help you test your knowledge.
  2. Machine Learning: A Bayesian and Optimization Perspective by Sergios Theodoridis. This is more a machine learning text than a specific primer on applied statistics, but the linear algebra approaches outlined here really help drive home the key statistical concepts on regression.
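
For the regression questions above, it helps to be able to write down the ordinary least squares solution directly. The sketch below is not taken from either resource: it fits a single-predictor line using the closed-form estimates (slope = covariance of x and y divided by variance of x), and the data points and the `fit_line` name are made up for illustration.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    # Sum of squared deviations of x, and of cross deviations of x and y.
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Hypothetical data, roughly following y = 1 + 2x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.1, 4.9, 7.2, 8.8, 11.0]

intercept, slope = fit_line(xs, ys)
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
print(f"intercept = {intercept:.2f}, slope = {slope:.2f}")
```

A useful sanity check is that the residuals of a fit with an intercept sum to (numerically) zero. Plotting them against `xs` is a quick way to probe the linearity assumption.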