[FEATURE] Tutorial notebooks for popular datasets

Open kristinagrig06 opened this issue 3 years ago • 35 comments

🚨🚨 Feature Request

  • [ ] Related to an existing Issue
  • [x] A new implementation (Improvement, Extension)

If your feature will improve HUB

Create notebooks with training pipelines using popular datasets available in Hub. A list of all datasets from activeloop can be found by running: activeloop list-datasets --workspace activeloop

Difficulty: Easy

Note: If you have a solution to this issue, please make a Pull Request to our Examples Repository and not to this repository!

kristinagrig06 avatar Sep 12 '21 19:09 kristinagrig06

Can we use open-source libraries/packages for this? I think I have an approach in mind.

Eeshaan-Dutt avatar Oct 02 '21 03:10 Eeshaan-Dutt

Hey @Eeshaan-Dutt, you can use them, but please keep in mind that Hub should be the star of these tutorials - whatever can be done with Hub, should be done!

dhiganthrao avatar Oct 02 '21 06:10 dhiganthrao

@dhiganthrao, is this issue closed, or is it still open for contribution?

Anaxagoras7 avatar Oct 12 '21 05:10 Anaxagoras7

@Anaxagoras7, there's a PR open for the same, but it hasn't been updated for some time. If you think you have a good solution to this, go for it!

dhiganthrao avatar Oct 12 '21 05:10 dhiganthrao

General update: If you have a solution to this issue, please make a Pull Request to our Examples Repository and not to this repository!

dhiganthrao avatar Oct 12 '21 12:10 dhiganthrao

Sure @dhiganthrao

Anaxagoras7 avatar Oct 16 '21 14:10 Anaxagoras7

Hey @Eeshaan-Dutt and @Anaxagoras7! Any updates/questions you want to share?

dhiganthrao avatar Oct 20 '21 13:10 dhiganthrao

@dhiganthrao, apologies for the delay; I got caught up in something. I wanted to know whether the pipeline is just a demonstration of the various available datasets using ML algorithms, or am I missing something? Could you please elaborate a little on this issue if possible, as I am a bit of a newbie in the open-source world? Also, I'm not able to access the dataset list using the command listed above; is there a way to fix that?

Anaxagoras7 avatar Oct 22 '21 15:10 Anaxagoras7

@Anaxagoras7, you can create a Jupyter notebook containing details on how to build an ML pipeline using Hub. An ML pipeline involves loading the data, preprocessing it, loading an ML/DL model, and training that model on your data. Instead of reading local data, the notebook should load its data through Hub. You can refer to this example for what it looks like. You don't need to write code for uploading a Hub dataset, but it would be helpful if you do!
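
The four stages described above can be sketched as follows. This is a minimal illustration, not code from the Examples Repository: it assumes Hub 2.x (`hub.load`) and a dataset exposing `images` and `labels` tensors, and the function names (`hub_samples`, `preprocess`, `predict`, `train`) are hypothetical. A toy two-class perceptron stands in for the PyTorch or TensorFlow model a real notebook would use.

```python
from typing import Iterable, Iterator, List, Tuple

Sample = Tuple[List[float], int]  # (feature vector, binary label: 0 or 1)


def hub_samples(path: str) -> Iterator[Sample]:
    """Load: stream samples from a Hub dataset instead of local files.

    Assumption: Hub 2.x and a dataset with `images` and `labels` tensors.
    """
    import hub  # lazy import so the other stages run without Hub installed

    ds = hub.load(path)
    for i in range(len(ds)):
        pixels = ds.images[i].numpy().ravel().tolist()
        label = int(ds.labels[i].numpy().ravel()[0])
        yield pixels, label


def preprocess(samples: Iterable[Sample], scale: float = 255.0) -> Iterator[Sample]:
    """Preprocess: scale raw pixel values into the range [0, 1]."""
    for features, label in samples:
        yield [x / scale for x in features], label


def predict(weights: List[float], features: List[float]) -> int:
    """Model: a perceptron; the last weight is the bias term."""
    score = sum(w * x for w, x in zip(weights, features)) + weights[-1]
    return 1 if score > 0 else 0


def train(samples: Iterable[Sample], n_features: int,
          epochs: int = 10, lr: float = 0.5) -> List[float]:
    """Train: apply the perceptron update rule for a fixed number of epochs."""
    data = list(samples)  # materialize so every epoch sees all samples
    weights = [0.0] * (n_features + 1)
    for _ in range(epochs):
        for features, label in data:
            error = label - predict(weights, features)
            for j, x in enumerate(features):
                weights[j] += lr * error * x
            weights[-1] += lr * error
    return weights
```

With Hub installed, `train(preprocess(hub_samples(path)), n_features)` would run the same stages on a real dataset, where `path` points to a Hub dataset with two classes; a multi-class dataset such as CIFAR-10 would need a real model in place of the perceptron.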

Regarding your not being able to access the list of datasets, can you please elaborate? It would help with debugging if you could post the error traceback you get when you run the command 😄

Feel free to ping me again if you have any questions, and please consider joining our Slack Community for all updates on everything Hub!

dhiganthrao avatar Oct 22 '21 16:10 dhiganthrao

Thank you for the help, @dhiganthrao. Also, I got the problem resolved! Will send a PR shortly.

Anaxagoras7 avatar Oct 23 '21 14:10 Anaxagoras7

@Anaxagoras7 did you send a PR for this? I had a hard time tracking this down. If not, this issue is still up for grabs in case anyone is interested!

mikayelh avatar Jan 13 '22 19:01 mikayelh

Hi! I'm thinking of grabbing this issue, but I want to recap what I've understood first; please feel free to correct me. Basically, I have to make a Jupyter notebook that takes different datasets, applies any ML algorithm, and builds a pipeline using Hub. Will I also have to cover several different ML algorithms?

jaivanti avatar Mar 11 '22 11:03 jaivanti

It's OK to stick to one model, but ideally the training should happen with both PyTorch and TensorFlow!
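
One way to keep a single notebook framework-agnostic is to build the input pipeline through a small dispatch helper. The sketch below is an illustration, not code from the thread: `make_loader` is a hypothetical name, and it assumes a Hub 2.x dataset whose `pytorch()` method returns a PyTorch data loader and whose `tensorflow()` method returns a `tf.data.Dataset`.

```python
from typing import Any


def make_loader(ds: Any, framework: str, batch_size: int = 32) -> Any:
    """Wrap one Hub dataset as a PyTorch or TensorFlow input pipeline.

    Assumption: `ds` is a Hub 2.x dataset (for example, the result of
    `hub.load(...)`) exposing `pytorch()` and `tensorflow()` methods.
    """
    if framework == "pytorch":
        # Hub batches and shuffles inside the returned data loader.
        return ds.pytorch(batch_size=batch_size, shuffle=True)
    if framework == "tensorflow":
        # The tf.data.Dataset is batched with its own `batch` method.
        return ds.tensorflow().batch(batch_size)
    raise ValueError(f"unsupported framework: {framework!r}")
```

The same model definition and training loop can then be written once per framework while the data loading stays in Hub.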

mikayelh avatar Mar 11 '22 15:03 mikayelh

Alright! I will give it a try then

jaivanti avatar Mar 12 '22 06:03 jaivanti

https://colab.research.google.com/drive/13rkYj5qfAn8YdoomNV8fLcH7--gb_vBQ#scrollTo=iKEAxW7FENld This is a mock notebook of an ML pipeline I have prepared using CIFAR-10 and Hub for image classification with TensorFlow. Is this fine to proceed with? Please let me know so I can make more changes.

jaivanti avatar Mar 12 '22 19:03 jaivanti

This is the PyTorch implementation using Hub: https://colab.research.google.com/drive/1K1zTX0Xmh8DNKkDhDERK-uX8pf-aLp_5 Do let me know if it needs any updates.

jaivanti avatar Mar 14 '22 17:03 jaivanti

@mikayelh Should I raise the pull request with this work? Do you all want me to make any changes to it?

jaivanti avatar Mar 15 '22 07:03 jaivanti

@jaivanti hi! thanks for following up. @farizrahman4u will review this and get back to you asap (@tatevikh FYI). Thanks a lot for the contribution (upon quick glance looks ok, but @farizrahman4u definitely will have more tips).

Maybe you can add a screenshot to the colab from app.activeloop.ai and say "you can also visualize the dataset at [dataset link]".

mikayelh avatar Mar 15 '22 07:03 mikayelh

Thanks @mikayelh for the response! I have added whatever changes you mentioned.

jaivanti avatar Mar 19 '22 17:03 jaivanti

I created a Docker, Hub, TensorBoard, and Jupyter notebook example based on the PyTorch MNIST example. I'm wondering whether that is of any use.

https://github.com/ubergeekNZ/pytorch_and_hub

ubergeekNZ avatar Mar 21 '22 09:03 ubergeekNZ

@jaivanti The notebooks look good; maybe format the cells with Black? Also, in some places it's more appropriate to use text cells instead of comments.

farizrahman4u avatar Mar 21 '22 12:03 farizrahman4u

@ubergeekNZ just make sure to call the example "Using Activeloop Hub as a dataloader with TensorBoard & Docker to train a model in PyTorch".

"Load mnist data from activeloop.ai hub" -> this is Fashion MNIST, not MNIST. We also refer to Hub as either hub or Activeloop Hub (not activeloop.ai hub). Please fix this before we merge it into activeloopai/examples!

mikayelh avatar Mar 21 '22 15:03 mikayelh

@farizrahman4u I have applied the Black extension to the cells and also used text cells instead of comments.

jaivanti avatar Mar 21 '22 18:03 jaivanti

Should I open a PR for this? I have made most of the changes as suggested. Thanks

jaivanti avatar Mar 23 '22 09:03 jaivanti

Thanks for the ping, @jaivanti! Adding @tatevikh to the thread.

mikayelh avatar Mar 24 '22 16:03 mikayelh

@jaivanti Sure, go ahead.

farizrahman4u avatar Mar 24 '22 18:03 farizrahman4u

Can multiple people contribute to creating Notebook Tutorials? If yes then I can try this one.

brlrb avatar Mar 31 '22 03:03 brlrb

Yes, @brlrb, absolutely. Do you have a tutorial in mind?

mikayelh avatar Mar 31 '22 06:03 mikayelh

@mikayelh What I had in mind is to pick any dataset from https://docs.activeloop.ai/datasets/ that either has no tutorial or has one that could be improved, and then write a tutorial for it. For example, if a dataset already has a PyTorch tutorial, I could write one in TensorFlow. A couple of questions for you:

  • Is there an existing tutorial I can refer to?
  • Which datasets most need tutorials?

I am interested in NLP datasets and want to work with the Hugging Face APIs, but I am open to any other dataset that is a priority.

brlrb avatar Mar 31 '22 19:03 brlrb

@brlrb I'm tagging @istranic, who has some ideas about which tutorials would be most interesting or highest priority for Hub. Thanks for your ideas!

mikayelh avatar Apr 04 '22 15:04 mikayelh