transformers icon indicating copy to clipboard operation
transformers copied to clipboard

[don't merge yet] Fix RAG

Open ydshieh opened this issue 6 months ago • 1 comments

What does this PR do?

datasets introduce trust_remote_code at some point (probably 2024/09), but RAG's modeling code isn't handling this, and we get

ValueError: The repository for wiki_dpr contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/wiki_dpr.

This PR makes necessary changes for users could at least specify this argument in from_pretrained that would pass to the datasets call.

ydshieh avatar Jun 04 '25 14:06 ydshieh

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@Cyrilvallez ready for a review.

For the changes in the tests, there is something for @itazap to check.

ydshieh avatar Jun 19 '25 10:06 ydshieh

The additional 'space' after the comma in the test june 22 , 2018 likely should be expected (ie, not stripped), since rag uses BART. test_tokenization_rag is minimal so it isn't explicitly clear how to handle spaces around punctuation. Since https://github.com/huggingface/transformers/pull/31938 removed clean_up_tokenization_spaces=True by default, and the bart/rag tests continued to pass, and so for me it's safe to conclude that clean_up_tokenization_spaces=False, and the space after the comma should be preserved -> as updated in this PR!

TLDR: LGTM :)

itazap avatar Jun 20 '25 14:06 itazap

Thank you @itazap for double checking 👍

ydshieh avatar Jun 20 '25 15:06 ydshieh