Leandro von Werra comments

Results: 160 comments of Leandro von Werra

[Bug] Content not shown in docs

Yes, this is work in progress. We'll hopefully update it soon with more info.

Create dataset s2orc_the_semantic_scholar_open_research_corpus

#self-assign

Create dataset s2orc_the_semantic_scholar_open_research_corpus

Two questions about integrating this for language modeling: 1. Should title and abstract be concatenated? 2. The description on the hub and here states that the dataset also includes full...

Create dataset s2orc_the_semantic_scholar_open_research_corpus

Also note that the dataset contains other languages as well. Looking at a few examples in the dataset viewer on the hub, one can see Chinese and Japanese...

Create dataset s2orc_the_semantic_scholar_open_research_corpus

Done: https://huggingface.co/datasets/bigscience-catalogue-lm-data/lm_en_s2orc Note: For now this only contains the titles and abstracts, separated by a newline. Documents without an abstract are filtered out (it seems this also helps a bit with the...
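The preprocessing described in this comment (title and abstract joined by a newline, documents without an abstract dropped) could look roughly like the sketch below. The field names `title` and `abstract` and the helper names are assumptions for illustration, not taken from the actual dataset script.

```python
from typing import Iterable, Iterator, Mapping, Optional


def to_lm_text(record: Mapping[str, Optional[str]]) -> Optional[str]:
    """Join title and abstract with a newline; return None when there is no abstract.

    Assumption: records expose `title` and `abstract` keys (hypothetical names).
    """
    title = (record.get("title") or "").strip()
    abstract = (record.get("abstract") or "").strip()
    if not abstract:
        return None  # documents without an abstract are filtered out
    return f"{title}\n{abstract}"


def build_lm_texts(records: Iterable[Mapping[str, Optional[str]]]) -> Iterator[str]:
    """Yield one training text per record that has a non-empty abstract."""
    for record in records:
        text = to_lm_text(record)
        if text is not None:
            yield text


# Toy records, made up for this example.
records = [
    {"title": "A study of X", "abstract": "We study X."},
    {"title": "Untitled note", "abstract": None},
    {"title": "Another paper", "abstract": "   "},
]
texts = list(build_lm_texts(records))
```

Only the first toy record survives the filter here, since the other two have a missing or whitespace-only abstract.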

Performance of `datasets` at scale

From GCP region `us-central1-a`.

Performance of `datasets` at scale

@mariosasko I did not turn it off but I can try the next time - I have to run the pipeline again, anyway. @bhavitvyamalik Yes, I also sharded the dataset...

DPOTrainer.tokenize_row is not hashable

Maybe a quick solution could be to change the processing function that's passed to `.map` to a standalone function rather than a class method. cc @kashif
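A minimal sketch of that suggestion: the row-processing logic lives in a module-level function that receives everything it needs as arguments, instead of a method bound to a trainer object. The function and field names below are hypothetical, and the whitespace tokenizer is a toy stand-in; this does not reproduce the real `DPOTrainer.tokenize_row`.

```python
from functools import partial
from typing import Callable, Dict, List


def tokenize_row(
    example: Dict[str, str],
    tokenize: Callable[[str], List[str]],
) -> Dict[str, List[str]]:
    """Standalone processing function: there is no `self`, so no trainer state is captured."""
    return {
        "prompt_tokens": tokenize(example["prompt"]),
        "chosen_tokens": tokenize(example["chosen"]),
        "rejected_tokens": tokenize(example["rejected"]),
    }


def whitespace_tokenize(text: str) -> List[str]:
    # Toy stand-in for a real tokenizer (assumption for this example).
    return text.split()


# Bind the extra argument up front so the callable takes a single row.
process = partial(tokenize_row, tokenize=whitespace_tokenize)

row = {"prompt": "Hello there", "chosen": "Hi !", "rejected": "Go away"}
processed = process(row)
```

With `datasets`, such a function would presumably be passed to `.map` directly, with extra arguments supplied through `fn_kwargs` rather than through the object the method was bound to.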

DPO models generate multiple / corrupted responses

Is it possible that you didn't have `EOS` tokens in the fine-tuning/DPO phase? Then the model wouldn't know what token to produce after the letter and would just keep generating things....
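One way to check for this is to make sure every training response ends with the end-of-sequence token before tokenization. The sketch below is an illustration under assumptions: the `"</s>"` string and the `prompt`/`chosen`/`rejected` keys are hypothetical, and in practice the token should come from the tokenizer in use.

```python
EOS_TOKEN = "</s>"  # assumption: placeholder; use the tokenizer's own EOS token in practice


def ensure_eos(text: str, eos_token: str = EOS_TOKEN) -> str:
    """Return `text` ending in exactly one trailing EOS token."""
    stripped = text.rstrip()
    if stripped.endswith(eos_token):
        return stripped
    return stripped + eos_token


# Toy preference pair, made up for this example.
pair = {
    "prompt": "Write a short letter.",
    "chosen": "Dear Sam, thanks. Best, Alex",
    "rejected": "Dear Sam</s>",
}
fixed = {
    **pair,
    "chosen": ensure_eos(pair["chosen"]),
    "rejected": ensure_eos(pair["rejected"]),
}
```

The prompt is left untouched; only the responses are terminated, so the model sees an explicit stopping point after each completed answer.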

update in DPO raise several problems...

Those changes are currently only on main. Did you install TRL from source?