[don't merge yet] Fix RAG
What does this PR do?
datasets introduce trust_remote_code at some point (probably 2024/09), but RAG's modeling code isn't handling this, and we get
ValueError: The repository for wiki_dpr contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/wiki_dpr.
This PR makes necessary changes for users could at least specify this argument in from_pretrained that would pass to the datasets call.
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
@Cyrilvallez ready for a review.
For the changes in the tests, there is something for @itazap to check.
The additional 'space' after the comma in the test june 22 , 2018 likely should be expected (ie, not stripped), since rag uses BART. test_tokenization_rag is minimal so it isn't explicitly clear how to handle spaces around punctuation. Since https://github.com/huggingface/transformers/pull/31938 removed clean_up_tokenization_spaces=True by default, and the bart/rag tests continued to pass, and so for me it's safe to conclude that clean_up_tokenization_spaces=False, and the space after the comma should be preserved -> as updated in this PR!
TLDR: LGTM :)
Thank you @itazap for double checking 👍