DataScienceTutorials.jl Create Tutorial 3Models.ipynb by Cruz

A new tutorial showcasing MLJ by C. Cruz

Jun 08 '20 14:06 drcxcruz

this is ongoing work on /cruz2; it will take me some time to go through the full tutorial and adjust a few things, sorry

Jun 19 '20 17:06 tlienart

Great, thank you so much


Jun 20 '20 13:06 drcxcruz

Actually Clarman, now that I've read two thirds of your tutorial and fixed a few things, I'm a bit uncomfortable with the fact that it uses synthetic data. I think a tutorial with this kind of depth would work best with real data, because people could relate to the data, do further analysis, and uncover things that match their expectations or surprise them. Synthetic data is great for small tutorials where you show one thing, but here it's a bit awkward: the explanations go into quite some depth to give context, yet ultimately the data is generated.

What do you think?

Jun 20 '20 19:06 tlienart

Hi

You have a good point. Random data might confuse the reader. I couldn't find real data when I created the lab, so I generated the data randomly.

I will try to find some "real" data for the lab. If I find some, I will create another version of the lab with it and with less analytics.

Thanks for your time and expertise.


Jun 22 '20 23:06 drcxcruz

Thanks a lot, this is much appreciated!

For good data sources, try https://datasetsearch.research.google.com and also UCI (https://archive.ics.uci.edu/ml/datasets.php?format=&task=&att=&area=&numAtt=&numIns=&type=&sort=dateDown&view=table). For UCI, I'd suggest taking anything that's more recent than 2010 and seems interesting to you.

Jun 23 '20 08:06 tlienart

this one might be fun: https://archive.ics.uci.edu/ml/datasets/Beijing+Multi-Site+Air-Quality+Data

Jun 23 '20 08:06 tlienart

If you find one at OpenML, you can load it directly from MLJ using OpenML.load.
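
As a rough sketch of what that could look like (assuming an MLJ version that exposes the `OpenML` module, and using dataset ID 61, which I believe is iris on OpenML, purely as a placeholder for whichever dataset gets picked):

```julia
using MLJ  # assumption: this MLJ version makes the OpenML module available

# 61 is a placeholder ID (believed to be iris); replace it with the ID
# of the dataset you choose on openml.org. Requires network access.
data = OpenML.load(61)

# Inspect column names and scientific types before building models.
schema(data)
```

The returned object should be usable as a table, so it can be converted (for example to a `DataFrame`) before further processing.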

Jun 24 '20 20:06 ablaom