wrench
Handling datasets using Hugging Face datasets and Hub
Hi,
Love this initiative, congrats!
Would it be possible to integrate the datasets into the Hugging Face Hub? Apart from the technical effort, would there be any copyright or licensing issues? If not, I'd be happy to help out with this.
Hi,
Thank you!! Would you mind waiting until after the ICLR deadline? I'll be back soon!
Thanks for your quick response! That's perfect, ping me if you'd like me to help out
@dvsrepo Hey, I think it's a good idea, though I'm not familiar with the Hugging Face Hub. One potential issue is that each dataset is coupled with a matrix of weak labels. I'm wondering whether that could also be incorporated, or only the raw data?
Hi @JieyuZ2, I think it shouldn't be a problem.
Just to be sure, can the datasets be instantiated from the JSON files here? https://drive.google.com/drive/folders/1v55IKG2JN9fMtKJWU48B_5_DcPWGnpTq?usp=sharing
And do they follow the format described here?
https://github.com/JieyuZ2/wrench/wiki/Dataset:-Format-and-Usage
Or are there additional matrix data files?
@dvsrepo Yes, the matrix data is stored in the "weak_labels" field of the JSON. There are no other data files.
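For anyone picking this up, here is a rough sketch of how the weak-label matrix could be rebuilt from such a JSON file and laid out as columns. Only the "weak_labels" field name comes from this thread; the "label" and "data"/"text" fields, the integer-like ids, the sample values, and the `to_columns` helper are assumptions made for illustration, so check them against the wiki page before relying on them.

```python
import json

# Hypothetical sample standing in for one dataset file. Only the
# "weak_labels" field is confirmed in this thread; "label", "data"
# and "text" are assumed names used for illustration.
SAMPLE_JSON = """
{
  "0": {"label": 1, "weak_labels": [1, -1, 1], "data": {"text": "first example"}},
  "1": {"label": 0, "weak_labels": [0, 0, -1], "data": {"text": "second example"}}
}
"""


def to_columns(raw):
    """Turn an {id: record} mapping into column lists, one entry per example."""
    ids = sorted(raw, key=int)  # assumes ids are integer-like strings
    return {
        "id": ids,
        "label": [raw[i]["label"] for i in ids],
        # Each row holds one example's weak labels, so the list of rows
        # is the (n_examples x n_labeling_functions) matrix.
        "weak_labels": [raw[i]["weak_labels"] for i in ids],
        "text": [raw[i]["data"]["text"] for i in ids],
    }


columns = to_columns(json.loads(SAMPLE_JSON))
print(columns["weak_labels"])
```

Because the weak labels end up as an ordinary list-of-lists column, a dictionary shaped like this could presumably be passed to `datasets.Dataset.from_dict`, which would keep the matrix alongside the raw data rather than in a separate file.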