Leandro von Werra

Results 160 comments of Leandro von Werra

Yes, this is work in progress. We'll hopefully update it soon with more info.

Two questions about integrating this for language modeling: 1. Should title and abstract be concatenated? 2. The description on the hub and here states that the dataset also includes full...

Also note that the dataset contains other languages as well. Looking at a few examples in the dataset viewer on the hub, one can see Chinese and Japanese...

Done: https://huggingface.co/datasets/bigscience-catalogue-lm-data/lm_en_s2orc Note: For now this only contains the titles and abstracts, separated by a newline. Documents without an abstract are filtered (it seems this also helps a bit with the...
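The preprocessing described in that note could look roughly like the sketch below. It is not the script used for the dataset: the field names `title` and `abstract`, the helper names, and the sample records are all assumptions made for illustration.

```python
def build_lm_text(record):
    """Join title and abstract with a newline; return None if the abstract is missing."""
    # Field names "title" and "abstract" are assumed, not taken from the dataset schema.
    title = (record.get("title") or "").strip()
    abstract = (record.get("abstract") or "").strip()
    if not abstract:
        return None
    return f"{title}\n{abstract}"


def build_corpus(records):
    """Keep only the records that have a non-empty abstract."""
    texts = (build_lm_text(r) for r in records)
    return [t for t in texts if t is not None]


# Made-up records for illustration only.
records = [
    {"title": "Paper A", "abstract": "We study X."},
    {"title": "Paper B", "abstract": None},
    {"title": "Paper C", "abstract": "   "},
]
corpus = build_corpus(records)
```

Records with a missing or whitespace-only abstract are dropped, so only the first record survives here.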

From GCP region `us-central1-a`.

@mariosasko I did not turn it off but I can try the next time - I have to run the pipeline again, anyway. @bhavitvyamalik Yes, I also sharded the dataset...

Maybe a quick solution could be to change the processing function that's passed to `.map` to a standalone function rather than a class method. cc @kashif
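A rough sketch of that idea, with hypothetical names throughout (`truncate_text`, `Preprocessor`, `max_words`): the processing logic lives in a module-level function, and the class only binds the plain values it needs with `functools.partial`, so the callable handed to `.map` carries no reference to the instance. The built-in `map` stands in for the dataset's `.map` to keep the example self-contained.

```python
from functools import partial


def truncate_text(example, max_words):
    """Standalone processing function: uses only its arguments, no `self`."""
    words = example["text"].split()
    return {"text": " ".join(words[:max_words])}


class Preprocessor:
    """Hypothetical class that owns the configuration."""

    def __init__(self, max_words):
        self.max_words = max_words

    def processing_fn(self):
        # Bind only plain values, not the instance itself.
        return partial(truncate_text, max_words=self.max_words)


fn = Preprocessor(max_words=3).processing_fn()
rows = [{"text": "a b c d e"}, {"text": "x y"}]
# The built-in map is a stand-in for the dataset's `.map` here.
processed = list(map(fn, rows))
```

The same `fn` could then be passed wherever the class method was passed before.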

Is it possible that you didn't have `EOS` tokens in the fine-tuning/DPO phase? Then the model wouldn't know what token to produce after the letter and would just keep generating things....
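If that is the cause, one way to check and fix it is to make sure every training target ends with the EOS token before tokenization. The sketch below is only an illustration: the `"</s>"` value and the helper `ensure_eos` are assumptions, and the real token should come from the tokenizer in use.

```python
EOS_TOKEN = "</s>"  # assumed value; in practice read it from the tokenizer


def ensure_eos(text, eos_token=EOS_TOKEN):
    """Append the EOS token unless the text already ends with it."""
    text = text.rstrip()  # note: this also drops trailing whitespace
    return text if text.endswith(eos_token) else text + eos_token


# Made-up completions for illustration only.
completions = [
    "Dear Sam, thanks for your help. Best, Alex",
    "Hi there!</s>",
]
with_eos = [ensure_eos(c) for c in completions]
```

Applying this to both the fine-tuning and the preference data keeps the two phases consistent.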

Those changes are currently only on main. Did you install TRL from source?