Create Tutorial 3Models.ipynb by Cruz
A new tutorial showcasing MLJ by C. Cruz
This is ongoing work on /cruz2. It will take me some time to go through the full tutorial and adjust a few things, sorry.
Great, thank you so much
Actually Clarman, having now read 2/3 of your tutorial and fixed a few things, I'm a bit uncomfortable with the fact that it uses synthetic data. I think a tutorial with this kind of depth would work better with real data, because people could relate to the data, do further analysis, and uncover things that match their expectations or surprise them. Synthetic data is great for small tutorials where you show one thing, but here it's a bit awkward: the explanations go into quite some depth to give context, yet ultimately the data is generated.
What do you think?
Hi
You have a good point. Random data might confuse the reader. I couldn't find real data when I created the lab, so I generated the data randomly.
I will try to find some "real" data for the lab. If I find some, I will create another version of the lab with it and with less analytics.
Thanks for your time and expertise.
Thanks a lot, this is much appreciated!
For good data sources, try https://datasetsearch.research.google.com and the UCI repository (https://archive.ics.uci.edu/ml/datasets.php?format=&task=&att=&area=&numAtt=&numIns=&type=&sort=dateDown&view=table). For UCI, I'd suggest picking anything more recent than 2010 that seems interesting to you.
This one might be fun: https://archive.ics.uci.edu/ml/datasets/Beijing+Multi-Site+Air-Quality+Data
If you find one at OpenML, you can load it directly from MLJ using OpenML.load.
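For instance, a minimal sketch of what that could look like, assuming MLJ and DataFrames are installed and using OpenML dataset ID 61 (the classic iris data) purely as a placeholder; you would swap in the ID of whatever dataset you pick:

```julia
using MLJ            # provides the OpenML module used below
import DataFrames

# Assumption: 61 is only an illustrative ID (iris); replace it with your dataset's ID.
table = OpenML.load(61)

# Convert the returned table to a DataFrame for easier inspection.
df = DataFrames.DataFrame(table)

# Look at the first rows and at the column types MLJ infers.
first(df, 3)
schema(df)
```

This needs network access the first time it runs, and the exact return type of `OpenML.load` has changed between MLJ versions, so converting to a `DataFrame` right away keeps the rest of the tutorial independent of that detail.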