Richard Nagyfi
First of all, thank you very much for this learning material; I really wish there were similar tutorials for other engineering areas! I think the central limit theorem is slightly...
A notebook that automatically generates Q&A pairs from the contents of the WikiData knowledge graph to accelerate prompt generation. Added README.md with step-by-step instructions as well as the Jupyter...
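The core idea can be sketched as a SPARQL query against the public WikiData endpoint whose results are filled into a question template. This is a minimal illustration, not the notebook's actual code; the single relation (P36, "capital") and the question wording are assumptions.

```python
import requests

# Minimal sketch: build factoid Q&A pairs from one WikiData relation.
SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"
QUERY = """
SELECT ?countryLabel ?capitalLabel WHERE {
  ?country wdt:P31 wd:Q6256;   # instance of: country
           wdt:P36 ?capital.   # capital
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 100
"""

response = requests.get(
    SPARQL_ENDPOINT,
    params={"query": QUERY, "format": "json"},
    headers={"User-Agent": "qa-pair-generator/0.1"},  # WikiData asks clients to identify themselves
)
response.raise_for_status()

qa_pairs = [
    {
        "question": f"What is the capital of {row['countryLabel']['value']}?",
        "answer": row["capitalLabel"]["value"],
    }
    for row in response.json()["results"]["bindings"]
]
print(qa_pairs[:3])
```

Templating more relations (birthplace, author, inventor, ...) is then just a matter of swapping the property ID and the question template.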
https://www.gutenberg.org/ has an extensive collection of ebooks in multiple languages and formats that would make great training data.
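A minimal sketch of turning one Gutenberg book into training text, assuming the common cache URL pattern https://www.gutenberg.org/cache/epub/&lt;id&gt;/pg&lt;id&gt;.txt (which holds for most, but not all, books) and stripping the license boilerplate around the body:

```python
import requests

def fetch_gutenberg_text(book_id: int) -> str:
    """Download one plain-text ebook; the cache URL pattern is an assumption."""
    url = f"https://www.gutenberg.org/cache/epub/{book_id}/pg{book_id}.txt"
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return response.text

def strip_license_boilerplate(text: str) -> str:
    # Project Gutenberg wraps the book in *** START/END ... *** markers;
    # training data should keep only the text in between.
    start = text.find("*** START")
    end = text.find("*** END")
    if start != -1 and end != -1:
        text = text[text.index("\n", start) + 1 : end]
    return text.strip()

book = strip_license_boilerplate(fetch_gutenberg_text(1342))  # 1342 = Pride and Prejudice
print(book[:200])
```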
Scrape MEK OSZK (Hungarian Electronic Library) for books and upload the data to HF.
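Once the books are scraped, uploading them is straightforward with the `datasets` library; the record schema and the repo id below are placeholders, not an agreed format.

```python
from datasets import Dataset

# Hypothetical records scraped from MEK; schema and repo id are placeholders.
records = {
    "title": ["Példa könyv"],
    "author": ["Ismeretlen"],
    "text": ["A könyv teljes szövege ..."],
    "source_url": ["https://mek.oszk.hu/..."],
}

dataset = Dataset.from_dict(records)
dataset.push_to_hub("your-username/mek-oszk-books")  # requires `huggingface-cli login` first
```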
Copy the "Manually-generated factoid question/answer pairs with difficulty ratings from Wikipedia articles. Dataset includes articles, questions, and answers." dataset to HF.

> Please cite this paper if you write any...
Copy the Ubuntu Dialogue Corpus to HF: https://www.kaggle.com/datasets/rtatman/ubuntu-dialogue-corpus. See if it can be further cleaned (some answers are low quality); a rough pass is sketched below.
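A first cleaning pass might drop empty, one-word, and link-only turns. The column names assume the Kaggle CSV schema (folder, dialogueID, date, from, to, text), and the filters are heuristics, not agreed-upon rules.

```python
import pandas as pd

# Rough cleaning sketch; file name and column names are assumptions
# based on the Kaggle release of the corpus.
df = pd.read_csv("Ubuntu-dialogue-corpus/dialogueText.csv")

df = df.dropna(subset=["text"])
df = df[df["text"].str.len() > 3]                        # drop one-word noise like "ok"
df = df[~df["text"].str.match(r"^\s*https?://\S+\s*$")]  # drop link-only turns

df.to_csv("ubuntu_dialogues_cleaned.csv", index=False)
```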
1) Copy the OpenSubtitles dataset to HF: https://opus.nlpl.eu/OpenSubtitles-v2018.php
2) Optionally scrape more subtitles from other sources, as long as they are multilingual and their timestamps can be matched with other languages (see the alignment sketch below).
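Cross-language matching can be approximated by pairing subtitle lines whose start times are close. A sketch, assuming .srt input and a one-second tolerance; the file names and the heuristic itself are hypothetical, not how OPUS actually aligns.

```python
import re
from datetime import timedelta

TIME_RE = re.compile(r"(\d+):(\d+):(\d+)[,.](\d+) --> ")

def parse_srt(path: str) -> list[tuple[timedelta, str]]:
    """Return (start_time, text) pairs from a .srt file."""
    entries = []
    for block in open(path, encoding="utf-8").read().split("\n\n"):
        lines = [line for line in block.strip().splitlines() if line]
        if len(lines) < 3:
            continue  # need index line, time line, and at least one text line
        match = TIME_RE.match(lines[1])
        if not match:
            continue
        h, m, s, ms = map(int, match.groups())
        start = timedelta(hours=h, minutes=m, seconds=s, milliseconds=ms)
        entries.append((start, " ".join(lines[2:])))
    return entries

def align(src, tgt, tolerance=timedelta(seconds=1)):
    """Pair source/target lines whose start times are within `tolerance`."""
    pairs, j = [], 0
    for start, text in src:
        while j < len(tgt) and tgt[j][0] < start - tolerance:
            j += 1
        if j < len(tgt) and abs(tgt[j][0] - start) <= tolerance:
            pairs.append((text, tgt[j][1]))
    return pairs

pairs = align(parse_srt("movie.en.srt"), parse_srt("movie.hu.srt"))
```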
I think this could boost the competitiveness of smaller or newer languages. Even though they have no chance of ever beating the larger ones in the number of new messages, seeing...
- Reupload the data to HF
- Move all metadata columns to JSON meta
- Move the Gutenberg crawler to datasets/
- Update its loader / init scripts
- updated...
- Updated the dataset to match the new schema for both English and multilingual Project Gutenberg eBooks
- Added a link to HF text datasets to __init__.py
- Moved the Gutenberg Crawler from...