[FEATURE] Text dataset example
🚨🚨 Feature Request
Create a notebook demonstrating how to upload a text dataset such as SQuAD. Raw text tensors can be accompanied by token-index tensors that can be used for training.
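The raw-text-plus-token-indices pairing the request describes could look roughly like this. The whitespace tokenizer and `<unk>`-based vocabulary below are simplified placeholders, not SQuAD's real preprocessing, and the Deep Lake calls in the trailing comment are only an assumed shape for how the two tensors would sit side by side:

```python
def build_vocab(texts):
    # Assign each unique lowercased token an integer id; 0 is reserved for <unk>
    vocab = {"<unk>": 0}
    for text in texts:
        for tok in text.lower().split():
            vocab.setdefault(tok, len(vocab))
    return vocab

def encode(text, vocab):
    # Map tokens to indices, falling back to <unk> (0) for unseen tokens
    return [vocab.get(tok, 0) for tok in text.lower().split()]

texts = ["What is Deep Lake?", "Deep Lake stores tensors."]
vocab = build_vocab(texts)
print(encode("deep lake stores text", vocab))

# In a Deep Lake dataset the two tensors could then be appended in lockstep,
# e.g. (assumed sketch, not verified against a running install):
#   ds.create_tensor("text", htype="text")
#   ds.create_tensor("tokens")  # integer token ids
#   ds.text.append(t); ds.tokens.append(encode(t, vocab))
```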
hi, I want to work on this issue.
Hi @uday-uppal thanks for your interest in Hub. I've assigned it to you. Let us know if you have any questions.
Hi @tatevikh, I want to work on this issue. Can you please elaborate on what needs to be done?
Hi @R-Yash. Thanks for your interest in Activeloop. The issue requires creating a notebook using deeplake that showcases uploading a text dataset. Let me know if you have any other questions.
@tatevikh So basically I have to create a notebook with steps on how to make a deeplake dataset from a regular one, right? Can I use a dataset like Spam Text Message Classification?
Yep, you've got the idea. Could you use one of the following datasets instead, though? That would be very helpful to us :)
https://paperswithcode.com/dataset/multinli
https://paperswithcode.com/dataset/natural-questions
https://paperswithcode.com/dataset/wikitext-2
Sure. I'll work on that.
@tatevikh Can you please give me a brief rundown of what to do? I'm having some trouble following the documentation, since it focuses on image datasets.
Hi @R-Yash,
The API for uploading text data is very similar to uploading images:
```python
import deeplake as dp

# Create an empty dataset at the given path (local, s3://, etc.)
ds = dp.empty("path/for/ds")

with ds:
    # A tensor with htype="text" stores raw strings
    ds.create_tensor("x", htype="text")
    ds.x.append("hi")
    ds.x.append("hello")
    ...

# Read the samples back as numpy arrays
print(ds.x.numpy())
```
There is a lot more you can do, such as using transforms to speed up your uploads and compression to reduce storage, but this should get you started. Let me know if you have any questions.
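On the transforms point: Deep Lake's docs describe a `@deeplake.compute` decorator whose function receives an input sample and an output dataset, run across workers via `.eval(..., num_workers=N)`. Since that requires a Deep Lake install, here is the shape of the same per-sample fan-out in plain Python, with the real calls noted in comments (a sketch of the pattern, not Activeloop's implementation):

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_upload(texts, out, workers=4):
    # Stand-in for a @deeplake.compute transform, roughly:
    #   @deeplake.compute
    #   def upload(sample_in, sample_out):
    #       sample_out.x.append(sample_in)
    #   upload().eval(texts, ds, num_workers=workers)
    # Each sample is processed independently, so the work parallelizes
    # cleanly; ex.map preserves input order for the final write.
    with ThreadPoolExecutor(max_workers=workers) as ex:
        results = list(ex.map(lambda t: t.strip(), texts))
    out.extend(results)
    return out

print(parallel_upload(["hi ", " hello"], []))
```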
@farizrahman4u I have created the notebook. Where should I upload it ?
Hey @R-Yash, can you please share a Colab notebook so we can have a look at how it works before uploading? Thanks!
@tatevikh https://colab.research.google.com/drive/1yU4cSBm-4JnrakZ5ODjyxcvkKzUcpPWo?usp=sharing
Thanks @R-Yash. This looks good. Can you please upload it to https://github.com/activeloopai/examples. Thanks again for the contribution!
Can I do these other two datasets as well? https://paperswithcode.com/dataset/natural-questions https://paperswithcode.com/dataset/wikitext-2
Hey @R-Yash, priority-wise, we need those datasets uploaded in Deep Lake format for Deep Lake community use. Just a notebook would be insufficient in this case. :)
@mikayelh Ok. I will try my best to make them ready for community use and upload them.
Once you are approximately ready, please let us know (we recommend uploading a small subset and asking us to verify it against the QA guidelines we have available).
Ok. Can you please share a link to the QA guidelines?
Apologies for the late reply on this @R-Yash , please join the community slack for better coordination. Here's the link to the worksheet.
Hello @mikayelh, I have made a deeplake dataset from a small sample of the wikitext dataset. Please have a look: https://colab.research.google.com/drive/1553A8hz4Tbi5RM789hP6PHZLJrDa8kja?usp=sharing
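For reviewers following along: raw wikitext lines include blanks and ` = Heading = ` section markers, so a small cleanup pass is typically needed before appending each line as a text sample. A plausible sketch of that step (hypothetical helper names, not the notebook's actual code):

```python
def is_heading(line):
    # Wikitext marks section headings as "= Title =", "== Subtitle ==", etc.
    s = line.strip()
    return s.startswith("=") and s.endswith("=") and len(s) > 1

def clean_wikitext(lines, keep_headings=False):
    # Drop blank lines and, optionally, section-heading markers,
    # returning stripped lines ready to append to a text tensor
    out = []
    for line in lines:
        s = line.strip()
        if not s:
            continue
        if is_heading(s) and not keep_headings:
            continue
        out.append(s)
    return out

raw = ["", " = Example Article = ", "Some example sentence.", ""]
print(clean_wikitext(raw))
```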
Would like to contribute.