
Revamp of lesson structure + content

Open mike-ivs opened this issue 1 year ago • 4 comments

Hi Team! (the repo looked a bit quiet... I hope this hasn't gone stale! <3 )

We recently ran a Carpentries-style "Introduction to Python/ML/DL" workshop, in which we included this incubator lesson (chosen over other pre-alpha/alpha Carpentries incubators) alongside Novice-inflammation and Intro-to-Deep-learning (an incubator in beta).

We were a bit surprised that there is no formal "intro to ML" lesson in The Carpentries, so we decided (as others have, e.g. #37 and here) to pick this incubator lesson as the most established and best suited, and to make a few further changes to its content and structure before we delivered.

Now that we've made and delivered the first batch of these changes, we thought it would be useful to feed them back into the lesson and community to get some wider feedback, and hopefully help The Carpentries gain an established "intro to ML" lesson.

I've submitted our changes all at once and will summarise them below in a bit more detail. I'm happy to re-submit them in smaller, by-episode chunks if that is easier for you.

Changes

Overall structure

We've adjusted the overall structure of the lesson to give a more balanced overview of supervised and unsupervised learning, with examples of regression, classification (new), clustering, and dimension reduction.

For each of those episodes we made sure to show and compare two different techniques to give a flavour of the topics:

  • regression - linear vs polynomial
  • classification - decision tree vs SVM
  • clustering - k-means vs spectral
  • dimension reduction - PCA vs t-SNE
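To give a flavour of the first of these comparisons, a minimal scikit-learn sketch of linear vs polynomial regression could look like the following (the toy data here is invented for illustration, not the lesson's dataset):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Toy quadratic data with noise (illustrative, not from the lesson)
rng = np.random.default_rng(42)
x = np.linspace(0, 10, 50).reshape(-1, 1)
y = 0.5 * x.ravel() ** 2 + rng.normal(0, 2, 50)

# Fit a straight line, then a degree-2 polynomial, with the same API
linear = LinearRegression().fit(x, y)
poly = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(x, y)

print("linear R^2:", linear.score(x, y))
print("poly   R^2:", poly.score(x, y))
```

Since the polynomial model nests the linear one, its training R^2 is at least as high; the interesting teaching point is what happens on data the model hasn't seen.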

We also tried to reduce the conceptual overhead for ML / gradually introduce concepts as the lesson progressed:

  • in ep.1 we touch on "what if we compare against new data" and in ep.2 we introduce train-test splits
  • in ep.2 we touch on "over-fitting vs model complexity" and in ep.3 we play more with hyper-parameters
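The "over-fitting vs model complexity" progression above might be sketched as follows (noisy sine data and the polynomial degrees are illustrative assumptions, not the lesson's actual examples):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Noisy sine curve: illustrative data, not the lesson's dataset
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 40).reshape(-1, 1)
y = np.sin(2 * np.pi * x).ravel() + rng.normal(0, 0.2, 40)

# Hold out a test set so we can see over-fitting, not just fitting
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.25, random_state=0)

scores = {}
for degree in (1, 4, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(x_train, y_train)
    scores[degree] = (model.score(x_train, y_train),
                      model.score(x_test, y_test))
    print(degree, scores[degree])
```

Training R^2 only ever improves with degree, while test R^2 eventually suffers — which is exactly the "compare against new data" hook in ep.1 paying off in ep.2.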

We also made some tweaks across the whole lesson to improve text flow, clarity, and formatting, and added a few more figures and more plotting code to reinforce concepts visually.

Introduction

We overhauled the introduction to give a clearer explanation of:

  • what machine learning is
  • where it is used in our daily lives
  • AI vs ML vs DL (very similar to the intro-to-DL lesson, shameless figure reuse)
  • types of machine learning, with a summary of which are covered in the lesson
  • limitations of ML

We removed the "over hyping" section as, while it may be true that ML/AI is overhyped, it felt like too negative a tone for an introduction to the topic.

Regression

We decided to remove the "create your own Python regression" episode in favour of using scikit-learn throughout, combining the two regression episodes into one. We needed the extra time to teach classification, and while I understand the reasoning behind doing a manual regression before using scikit-learn, it felt like quite a time sink in a lesson about "ML with sklearn".

We added a quick section to introduce supervised learning and scikit-learn before moving on to regression. We also used a small test dataset instead of the gapminder dataset (as done in #39) to try to reduce the burden on learners of having to understand the dataset while also learning ML for the first time. (Maybe it's too small a dataset...)
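For a sense of what a "small test dataset" buys in terms of learner burden, here is a minimal sketch with a handful of made-up points (hypothetical values, not the lesson's actual data):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# A tiny, obviously-linear dataset: nothing to explain except the ML itself
x = np.array([1, 2, 3, 4, 5, 6]).reshape(-1, 1)
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.2])

model = LinearRegression().fit(x, y)
print("slope:", model.coef_[0], "intercept:", model.intercept_)
print("prediction for x=7:", model.predict([[7]])[0])
```

With data this small, learners can sanity-check the fitted slope by eye, which is the whole point of swapping out gapminder.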

Classification

This one felt like it was missing from the original! We made a quick classification episode based on the same penguin dataset as the "intro-to-DL" lesson. It steps up the coding complexity from a simple two-list dataset, but it feels like a nice intermediate between the regression episode and the eventual "intro to DL" lesson.
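A rough sketch of the decision-tree-vs-SVM comparison; since the penguin dataset isn't bundled with scikit-learn, the built-in iris dataset is used here as a self-contained stand-in:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Iris as a stand-in for the penguin data: same kind of tabular,
# multi-class species-classification problem
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
svm = SVC().fit(X_train, y_train)

print("tree accuracy:", tree.score(X_test, y_test))
print("svm  accuracy:", svm.score(X_test, y_test))
```

Both classifiers share the same fit/score API, which is what makes the side-by-side comparison cheap to teach.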

Clustering

We added a section to explain the idea of unsupervised learning, touched a little on the concept of hyper-parameters, and broke up the code to make a few more plots and give a bit more visualisation of the clustering process.
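The k-means-vs-spectral comparison shows up most clearly on non-convex clusters; here is a minimal sketch using `make_moons` (an illustrative choice, not necessarily the lesson's data):

```python
from sklearn.cluster import KMeans, SpectralClustering
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two interleaved half-moons: k-means struggles, spectral clustering copes
X, y_true = make_moons(n_samples=200, noise=0.05, random_state=0)

kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
spectral_labels = SpectralClustering(
    n_clusters=2, affinity="nearest_neighbors", random_state=0).fit_predict(X)

# Compare each clustering against the true moon membership
print("k-means ARI: ", adjusted_rand_score(y_true, kmeans_labels))
print("spectral ARI:", adjusted_rand_score(y_true, spectral_labels))
```

Plotting the two label sets side by side (as the lesson now does) makes the failure mode of k-means on curved clusters immediately visible.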

Dimension reduction

We expanded this section to try to give a better overview of the MNIST dataset and the high dimensionality of its images. We also tried to give a better explanation of PCA; though I have only just glanced through #39, it would be worth including some of those changes in the lesson!
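For a self-contained sketch of the high-dimensionality point, scikit-learn's built-in 8x8 digits dataset (a small MNIST-like set) can stand in for MNIST:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

# Each 8x8 image is a point in 64-dimensional space
X, y = load_digits(return_X_y=True)
print("original shape:", X.shape)

# PCA squashes those 64 dimensions down to 2 for plotting
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)
print("reduced shape:", X_2d.shape)
print("variance explained:", pca.explained_variance_ratio_.sum())
```

Swapping `PCA` for `sklearn.manifold.TSNE` in the same spot gives the t-SNE half of the comparison with almost no extra code.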

Neural Networks

We left this section mostly unchanged (apart from minor grammar/flow changes). Given that we ran "Intro to ML" AND "Intro to DL", we actually left the NN section to the "Intro to DL" part of our workshop, in favour of covering classical ML techniques.

My two cents on the direction of development

Given the advanced development of the "intro to DL" lesson, it might be worth dropping the NN section of this lesson and instead focusing on ensemble learning and/or reinforcement learning in future expansions. They seem to be the only big "ML" topics that aren't covered, whereas NNs are a mandatory concept for "intro to DL".

Thanks for all the effort put in so far, and happy to discuss this PR :)

mike-ivs avatar May 03 '23 04:05 mike-ivs

Dear @mike-ivs,

Thank you for your (massive) contribution! I would be happy to review it. Upon a glance, it seems to me that it will improve the lesson significantly. However, since @colinsauze is the main author of this lesson, it is best that we hear from them before making any changes.

@colinsauze could you please let us know what you think?

Thanks, V

vinisalazar avatar May 12 '23 02:05 vinisalazar

Apologies for not replying sooner, unfortunately I'm really busy with other things right now and haven't had much time to develop this lesson. Just as a bit of background, this lesson is older than the other ML lessons in the incubator, so it does include things which have since ended up being covered in more detail by other lessons.

At a glance I think I agree with about 90% of your suggested changes, but I need to look through them in more detail first. Before doing that, what I think we really need is a clearer picture of what the purpose of this lesson is and how it fits with the other lessons. I do find it interesting that you picked this lesson as giving the best introduction to ML; perhaps targeting complete beginners for a one-day intro to ML is how this lesson fits in with the others. I'd really like to write up some personas and a lesson design before accepting any more major pull requests. This should give the lesson a clearer distinction from other lessons and a more obvious direction for future development. Perhaps you could write up one or two of these based on the people who attended your workshop and submit them as a separate PR?

Once that is done I would also like to get other people to take on more of a maintainer role on this lesson and help see it through to the beta and carpentries lab stages as I really don't have the time to do that myself. Would you be interested in becoming a maintainer too @mike-ivs?

colinsauze avatar May 12 '23 08:05 colinsauze

Apologies for my sluggish reply @vinisalazar and @colinsauze! 90% is a cracking number - happy to have detailed feedback and input!

We used the lesson as part of a 5-day Carpentries-style workshop, "Introduction to Python, ML, and DL" (alongside Novice-inflammation and Intro-to-DL).

We'd pre-surveyed our attendees and yes the majority of them were complete beginners to machine learning and so we chose to go wide with the content, rather than deep, so that they could get a general understanding of ML (types, concepts) and how they could start applying it to their work.

[personas] Perhaps you could write up one or two of these based upon the people who attended your workshop and submit them as a separate PR?

Sure, I can try and rustle something up along these lines based on our pre/post survey info.

Would you be interested in becoming a maintainer too @mike-ivs?

Sure, I'd be happy to try to keep the momentum up and get this to the next stages :)

mike-ivs avatar May 22 '23 05:05 mike-ivs

Would it make sense to try to organise a community discussion for this?

I think there would definitely be interest from the community in transitioning this lesson to the Carpentries Lab, and it could potentially be the seed for a ML Curriculum in the long run.

cc @tobyhodges

vinisalazar avatar May 22 '23 05:05 vinisalazar

Closing for now due to significant changes

mike-ivs avatar Jul 30 '24 22:07 mike-ivs

Reopening after a chat with Colin :)

I'll go through and make a summary of all the changes we've done along the way, a combination of the initial changes we mentioned in the PR and all the additional changes we built upon those.

The new lesson can be previewed here - https://mike-ivs.github.io/machine-learning-novice-sklearn/

mike-ivs avatar Sep 23 '24 21:09 mike-ivs

Overall structure

We've adjusted the overall structure of the lesson to give a broad overview of basic ML: what ML is (vs DL and AI), supervised vs unsupervised learning, regression, classification, clustering, dimensionality reduction, and ensemble learning.

For each of those episodes we made sure to show and compare two different techniques to give a flavour of the topics:

  • regression: linear vs polynomial
  • classification: decision tree vs SVM
  • ensemble: bagging vs stacking
  • clustering: k-means vs spectral
  • dimensionality reduction: PCA vs t-SNE
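The new bagging-vs-stacking pairing might be sketched as follows (the dataset and estimator choices here are illustrative assumptions, not necessarily what the lesson uses):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier, StackingClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Bagging: many copies of one model, trained on bootstrap samples
bagging = BaggingClassifier(DecisionTreeClassifier(random_state=0),
                            n_estimators=50, random_state=0)

# Stacking: different models, combined by a meta-learner
stacking = StackingClassifier(estimators=[
    ("tree", DecisionTreeClassifier(random_state=0)),
    ("svm", make_pipeline(StandardScaler(), SVC())),
])

for name, model in (("bagging", bagging), ("stacking", stacking)):
    model.fit(X_train, y_train)
    print(name, model.score(X_test, y_test))
```

The contrast is pedagogically neat: bagging reduces variance of one model family, while stacking combines heterogeneous models.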

We also tried to reduce the conceptual overhead for ML / gradually introduce concepts as the lesson progressed:

  • in ep.1 we introduce the general "ML/DL" workflow, fit some data, and ease towards the concept of overfitting on a data subset.
  • in ep.2 we introduce "train-test-split" and the concept of hyper-parameters.
  • in ep.3 we build on regression/classification using ensemble techniques (random forest).
  • in ep.4 we build on the concept of hyper-parameters and introduce the idea of performance (trade-offs).
  • in ep.5 we look at larger/more complex data, and frame dimensionality reduction as a useful step prior to other ML techniques.

We've tried to "function-ise" the code as much as possible: the idea is that we slowly go through the process of creating reusable workflow functions before putting them into practice multiple times (new data, hyper-parameter changes, etc.), i.e. teaching the underlying workflow before practising it a few times.
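The "function-ise" idea might look something like this (the function name, dataset, and hyper-parameter sweep are hypothetical, for illustration only):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def fit_and_score(model, X, y, test_size=0.25, random_state=0):
    """Reusable workflow: split the data, fit the model, return both scores."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state)
    model.fit(X_train, y_train)
    return model.score(X_train, y_train), model.score(X_test, y_test)

# Reuse the same workflow across several hyper-parameter settings
X, y = load_iris(return_X_y=True)
results = {}
for depth in (1, 3, None):
    results[depth] = fit_and_score(
        DecisionTreeClassifier(max_depth=depth, random_state=0), X, y)
    print(depth, results[depth])
```

Once the function exists, swapping in new data or a new model is a one-line change, which is the practice loop the episodes build towards.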

We've also tried to keep the datasets as "built-in" as possible to reduce any prep overhead prior to teaching a workshop.

mike-ivs avatar Sep 24 '24 02:09 mike-ivs