[FEATURE] Text dataset example

Open farizrahman4u opened this issue 2 years ago • 2 comments

🚨🚨 Feature Request

Create a notebook demonstrating uploading a text dataset such as Squad. Raw text tensors can be accompanied by token index tensors which can be used for training.
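As a rough illustration of the "raw text plus token indices" idea, here is a minimal pure-Python sketch. It assumes a naive lowercase whitespace tokenizer; the names `build_vocab` and `encode` and the two sample sentences are hypothetical and are not taken from Squad or from the deeplake API.

```python
from typing import Dict, List


def build_vocab(texts: List[str]) -> Dict[str, int]:
    """Map each distinct lowercase whitespace token to an integer id,
    assigned in order of first appearance."""
    vocab: Dict[str, int] = {}
    for text in texts:
        for token in text.lower().split():
            if token not in vocab:
                vocab[token] = len(vocab)
    return vocab


def encode(text: str, vocab: Dict[str, int]) -> List[int]:
    """Convert one raw string into its list of token indices."""
    return [vocab[token] for token in text.lower().split()]


# Hypothetical stand-ins for raw text samples (not real Squad rows).
texts = ["Where is the library", "The library is closed"]
vocab = build_vocab(texts)

# One list of token indices per raw text sample.
token_ids = [encode(text, vocab) for text in texts]
```

Each raw string would go into a text tensor and the matching list from `token_ids` into a second tensor, so a training loop can read the indices directly. A real notebook would most likely use the tokenizer of the model being trained rather than whitespace splitting.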

farizrahman4u avatar Mar 15 '22 09:03 farizrahman4u

Hi, I want to work on this issue.

uday-uppal avatar Mar 19 '22 14:03 uday-uppal

Hi @uday-uppal, thanks for your interest in Hub. I've assigned it to you. Let us know if you have any questions.

tatevikh avatar Mar 19 '22 17:03 tatevikh

Hi @tatevikh, I want to work on this issue. Can you please elaborate on what needs to be done?

R-Yash avatar Jan 22 '23 10:01 R-Yash

Hi @R-Yash. Thanks for your interest in Activeloop. The issue requires creating a notebook using deeplake that showcases uploading a text dataset. Let me know if you have any other questions.

tatevikh avatar Jan 22 '23 15:01 tatevikh

@tatevikh So basically I have to create a notebook with steps on how to make a deeplake dataset from a normal one, right? Can I use a dataset like Spam Text Message Classification?

R-Yash avatar Jan 22 '23 17:01 R-Yash

Yep, you've got the idea. Could you do one of the following datasets instead, though? That would be very helpful to us :) https://paperswithcode.com/dataset/multinli https://paperswithcode.com/dataset/natural-questions https://paperswithcode.com/dataset/wikitext-2

tatevikh avatar Jan 22 '23 20:01 tatevikh

Sure. I'll work on that.

R-Yash avatar Jan 22 '23 21:01 R-Yash

@tatevikh Can you please give me a brief rundown of what to do? I am having some trouble following the documentation, as it is written for image datasets.

R-Yash avatar Jan 23 '23 14:01 R-Yash

Hi @R-Yash,

The API for uploading text data is very similar to uploading images:

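import deeplake as dp

# Create an empty dataset at the given path.
ds = dp.empty("path/for/ds")
with ds:
    # Text samples are stored in a tensor with htype="text".
    ds.create_tensor("x", htype="text")
    ds.x.append("hi")
    ds.x.append("hello")
    ...
    # Read the stored samples back as a NumPy array.
    print(ds.x.numpy())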

There is a lot more you can do, such as using transforms to speed up your uploads or using compression to reduce storage, but this should get you started. Let me know if you have any questions.

farizrahman4u avatar Jan 23 '23 15:01 farizrahman4u

@farizrahman4u I have created the notebook. Where should I upload it?

R-Yash avatar Jan 23 '23 16:01 R-Yash

Hey @R-Yash. Can you please share a Colab notebook so we can have a look at how it works before uploading? Thanks!

tatevikh avatar Jan 23 '23 18:01 tatevikh

@tatevikh https://colab.research.google.com/drive/1yU4cSBm-4JnrakZ5ODjyxcvkKzUcpPWo?usp=sharing

R-Yash avatar Jan 23 '23 18:01 R-Yash

Thanks @R-Yash. This looks good. Can you please upload it to https://github.com/activeloopai/examples? Thanks again for the contribution!

tatevikh avatar Jan 24 '23 19:01 tatevikh

Can I do these other two datasets as well? https://paperswithcode.com/dataset/natural-questions https://paperswithcode.com/dataset/wikitext-2

R-Yash avatar Jan 24 '23 19:01 R-Yash

Hey @R-Yash, priority-wise, we need those datasets uploaded in Deep Lake format for Deep Lake community use. Just a notebook would be insufficient in this case. :)

mikayelh avatar Jan 24 '23 19:01 mikayelh

@mikayelh OK. I will try my best to make them ready for community use and upload them.

R-Yash avatar Jan 24 '23 19:01 R-Yash

Once you are approximately ready, please let us know. We recommend uploading a small subset and asking us to verify it against the QA guidelines that we have available.

mikayelh avatar Jan 24 '23 19:01 mikayelh

OK. Can you please share a link to the QA guidelines?

R-Yash avatar Jan 24 '23 20:01 R-Yash

Apologies for the late reply on this, @R-Yash. Please join the community Slack for better coordination. Here's the link to the worksheet.

mikayelh avatar Jan 27 '23 23:01 mikayelh

Hello @mikayelh, I have made a deeplake dataset from a small sample of the WikiText dataset. Please have a look. https://colab.research.google.com/drive/1553A8hz4Tbi5RM789hP6PHZLJrDa8kja?usp=sharing

R-Yash avatar Jan 30 '23 19:01 R-Yash

I would like to contribute.

rajveer43 avatar Sep 30 '23 07:09 rajveer43