Add support for text datasets
Hello!
I would like to suggest adding support for text datasets, for NLP. As I am planning to implement and use some of them for my own research anyway, I thought adding them to the Avalanche library could be beneficial for the community.
If this is something compatible with the current roadmap/plans, I would be happy to contribute.
Some datasets:
- Lifelong Learning for Language Models (A. Hussain et al., 2021) - open-source, MIT license.
- decaNLP (McCann et al., 2018) - open-source, BSD 3-clause license.
- CALM (G. Kruszewski et al.) - open-source, BSD license.
- Lifelong Text Classification (d'Autume et al., 2019) - open-source, MIT license.
Etc.
Let me know what you think!
Hi @radandreicristian, this is an important issue. We do some research on text datasets at UNIPI, so we are definitely interested in supporting them. Currently:
- I'm working on refactoring Avalanche datasets to make them more generic and abstract away the details specific to supervised image classification tasks (#1118).
- @AndreaCossu did some work with pretrained language models from Hugging Face, and we are evaluating whether it's possible to integrate its datasets somehow.
If you have a specific dataset that you really care about, we can help you with the integration. On our side, we are working on it, but it may take us a bit of time.
Hello!
Thanks for your response. I'd also like to ask you to consider adding support for Hugging Face datasets, both for translation and for text classification.
The underlying datasets in the Hugging Face `datasets` library are based on Apache Arrow. Generally speaking, each item in a dataset is a dictionary, e.g.:
{'label': 1, 'text': 'the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .'}
It would be convenient to be able to specify a list of keys so that the AvalancheDataset can use them internally without having to override the `_process_pattern()` method; for instance, something along the lines of the sketch below.
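Just to illustrate the idea, here is a minimal sketch of such a key-based wrapper. The class name `HFKeyedDataset` and its arguments are hypothetical, not an existing Avalanche or `datasets` API; `rotten_tomatoes` is only used as an example dataset with `text`/`label` keys:

```python
from datasets import load_dataset  # Hugging Face `datasets` library
from torch.utils.data import Dataset


class HFKeyedDataset(Dataset):
    """Hypothetical wrapper: exposes a Hugging Face dataset as
    (input, target) tuples by picking the given dictionary keys."""

    def __init__(self, hf_dataset, input_key="text", target_key="label"):
        self.hf_dataset = hf_dataset
        self.input_key = input_key
        self.target_key = target_key

    def __len__(self):
        return len(self.hf_dataset)

    def __getitem__(self, idx):
        item = self.hf_dataset[idx]  # a dict, e.g. {'label': 1, 'text': '...'}
        return item[self.input_key], item[self.target_key]


# Example usage: each sample becomes a (text, label) pair instead of a dict.
train_data = HFKeyedDataset(load_dataset("rotten_tomatoes", split="train"))
print(train_data[0])
```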
Thanks in advance!
@radandreicristian here you can find an example of creating a benchmark starting from Hugging Face datasets. It's just a toy example for now (the same data for all tasks), but it shows that NLP data can be supported. You can also see how to modify strategies to handle Hugging Face samples.
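Roughly, the idea looks like the sketch below. This is not the linked example itself, just an illustration that wraps Hugging Face splits as plain PyTorch datasets and feeds them to the generic `dataset_benchmark` generator; the adapter class and key names are made up for the sketch, and the exact behavior of `dataset_benchmark` with non-image datasets may vary across Avalanche versions:

```python
from datasets import load_dataset
from torch.utils.data import Dataset

from avalanche.benchmarks.generators import dataset_benchmark


class HFTextClassification(Dataset):
    """Illustrative adapter returning (text, label) tuples from a HF split."""

    def __init__(self, hf_split, text_key="text", label_key="label"):
        self.hf_split = hf_split
        self.text_key = text_key
        self.label_key = label_key
        # Expose a `targets` attribute, which Avalanche's dataset wrappers
        # look for when building classification benchmarks.
        self.targets = list(hf_split[label_key])

    def __len__(self):
        return len(self.hf_split)

    def __getitem__(self, idx):
        item = self.hf_split[idx]
        return item[self.text_key], item[self.label_key]


train_split = HFTextClassification(load_dataset("rotten_tomatoes", split="train"))
test_split = HFTextClassification(load_dataset("rotten_tomatoes", split="test"))

# Toy benchmark: the same data repeated for two experiences, mirroring
# the toy example mentioned above.
benchmark = dataset_benchmark(
    [train_split, train_split],  # train datasets, one per experience
    [test_split, test_split],    # test datasets, one per experience
)

for experience in benchmark.train_stream:
    print(experience.current_experience, len(experience.dataset))
```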
I'm closing this issue but feel free to open others if you have any suggestions.