Add support for text datasets

Open radandreicristian opened this issue 3 years ago • 1 comments

Hello!

I would like to suggest adding support for text datasets, for NLP. As I am planning to implement and use some of them for my own research anyway, I thought adding them to the Avalanche library could be beneficial for the community.

If this is something compatible with the current roadmap/plans, I would be happy to contribute.

Some datasets:

Etc.

Let me know what you think!

radandreicristian avatar Sep 20 '22 10:09 radandreicristian

Hi @radandreicristian, this is an important issue. We do some research on text datasets at UNIPI, so we are definitely interested in supporting them. Currently:

  • I'm working on refactoring Avalanche datasets to make them more general and to abstract away the details specific to supervised image classification tasks #1118
  • @AndreaCossu did some work with pretrained language models using huggingface, and we are evaluating whether it's possible to integrate its datasets somehow.

If you have a specific dataset that you really care about, we can help you with the integration. On our side, we are working on the integration but it may take us a bit of time.

AntonioCarta avatar Sep 20 '22 11:09 AntonioCarta

Hello! Thanks for your response. I'd also like to ask you to consider adding support for Hugging Face datasets, for both translation and text classification. The underlying datasets in the hf library are based on Apache Arrow. Generally speaking, each item in the dataset is a dictionary, e.g.: {'label': 1, 'text': 'the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .'}

It would be convenient to specify a list of keys so that the Avalanche Dataset can use it internally without having to override the _process_pattern() method.
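
To make the request concrete, here is a rough sketch of the behaviour I have in mind. It is plain Python and not Avalanche's API: the class name KeySelectingDataset and its keys argument are hypothetical, and a list of dicts stands in for a Hugging Face dataset (which, as far as I know, also returns a dict when indexed with an integer).

```python
from typing import Any, Mapping, Sequence, Tuple


class KeySelectingDataset:
    """Hypothetical wrapper (not part of Avalanche): maps dict samples to tuples."""

    def __init__(self, data: Sequence[Mapping[str, Any]], keys: Sequence[str]):
        self.data = data          # any indexable collection of dict-like samples
        self.keys = list(keys)    # the order of keys fixes the order of tuple fields

    def __len__(self) -> int:
        return len(self.data)

    def __getitem__(self, index: int) -> Tuple[Any, ...]:
        item = self.data[index]
        # Pick only the requested fields, in the requested order.
        return tuple(item[key] for key in self.keys)


# Stand-in samples with the same shape as the item shown above.
samples = [
    {"label": 1, "text": "a great movie"},
    {"label": 0, "text": "a dull movie"},
]

dataset = KeySelectingDataset(samples, keys=["text", "label"])
text, label = dataset[0]
```

With something like this, the strategy would receive (text, label) tuples and the per-sample processing method would not need to be overridden just to unpack the dictionary.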

Thanks in advance!

m-resta avatar Sep 26 '22 18:09 m-resta

@radandreicristian here you can find an example creating a benchmark starting from huggingface datasets. It's just a toy example for the moment (same data for all tasks) but it shows the support of NLP stuff. You can also see how to modify strategies to handle huggingface samples.

I'm closing this issue but feel free to open others if you have any suggestions.

AntonioCarta avatar Nov 17 '22 15:11 AntonioCarta