
Handling datasets using Hugging Face datasets and Hub

Open dvsrepo opened this issue 4 years ago • 5 comments

Hi,

Love this initiative, congrats!

Would it be possible to integrate the datasets into the Hugging Face Hub? Apart from the technical effort, would there be any copyright or licensing issues? If not, I'd be happy to help out with this.

dvsrepo avatar Sep 29 '21 20:09 dvsrepo

Hi,

Thank you!! Would you mind waiting until after the ICLR deadline? I will be back soon!

JieyuZ2 avatar Sep 30 '21 08:09 JieyuZ2

Thanks for your quick response! That's perfect. Ping me if you'd like me to help out.

dvsrepo avatar Oct 08 '21 07:10 dvsrepo

@dvsrepo Hey, I think it's a good idea, though I'm not familiar with the Hugging Face Hub. One potential issue is that each dataset is coupled with a matrix of weak labels. I'm wondering whether that could also be incorporated, or only the raw data?

JieyuZ2 avatar Oct 09 '21 17:10 JieyuZ2

Hi @JieyuZ2, I think that shouldn't be a problem.

Just to be sure, can the datasets be instantiated from the JSON files here? https://drive.google.com/drive/folders/1v55IKG2JN9fMtKJWU48B_5_DcPWGnpTq?usp=sharing

And do they follow the format described here?

https://github.com/JieyuZ2/wrench/wiki/Dataset:-Format-and-Usage

Or are there additional matrix data files?

dvsrepo avatar Oct 15 '21 08:10 dvsrepo

@dvsrepo Yes, the additional matrix data is stored in the "weak_labels" field of the JSON. There are no other data files.

JieyuZ2 avatar Oct 15 '21 10:10 JieyuZ2
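
For illustration, here is a minimal sketch of how a JSON file with a "weak_labels" field might be flattened into columns, so that each row of the weak-label matrix travels with its example. Only the "weak_labels" field name comes from this thread; the "label", "data" and "text" keys, the sample records, and the `to_columns` helper are assumptions made for the example, not the confirmed wrench format.

```python
import json

# Hypothetical sample; only the "weak_labels" field name is taken from the thread.
# The "label", "data" and "text" keys are assumed for illustration.
sample_json = """
{
  "0": {"label": 1, "weak_labels": [1, -1, 1], "data": {"text": "great movie"}},
  "1": {"label": 0, "weak_labels": [-1, 0, 0], "data": {"text": "terrible plot"}}
}
"""


def to_columns(records):
    """Turn {id: example} records into a dict of equal-length columns."""
    columns = {"id": [], "label": [], "weak_labels": [], "text": []}
    for example_id, example in records.items():
        columns["id"].append(example_id)
        columns["label"].append(example["label"])
        # One row of the weak-label matrix, kept as a per-example list.
        columns["weak_labels"].append(list(example["weak_labels"]))
        columns["text"].append(example["data"]["text"])
    return columns


columns = to_columns(json.loads(sample_json))
print(columns["weak_labels"])
```

A column dictionary of this shape is the kind of input that `datasets.Dataset.from_dict` accepts, so the weak labels could be stored as an ordinary list-valued column next to the raw data rather than in a separate file.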