add SDPA to ViT [follow-up of #29325]
What does this PR do?
Adds support for SDPA (scaled dot-product attention) to ViT.
This PR is a follow-up to #29325; most (all) of the work was done by @lyaronskaya.
Fixes https://github.com/huggingface/transformers/issues/28005.
This PR also includes a minor fix in the SDPA doc checks.
I am currently running RUN_SLOW=1 pytest tests/models/ on a GPU and will report the results in the thread.
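For reference, here is a minimal usage sketch of what this PR enables (google/vit-base-patch16-224 is only an example checkpoint; passing attn_implementation="sdpa" explicitly is optional once SDPA becomes the default for supported models with a recent enough torch):

```python
import requests
import torch
from PIL import Image
from transformers import AutoImageProcessor, ViTForImageClassification

# Example image from the COCO validation set
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224")
# Opt in to the SDPA attention implementation added in this PR
model = ViTForImageClassification.from_pretrained(
    "google/vit-base-patch16-224",
    attn_implementation="sdpa",
).eval()

inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(model.config.id2label[logits.argmax(-1).item()])
```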
Before submitting
- [x] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
- [x] Did you read the contributor guideline, Pull Request section?
- [x] Was this discussed/approved via a Github issue or the forum? Please add a link to it if that's the case.
- [x] Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
- [ ] Did you write any new necessary tests?
Who can review?
Anyone in the community is free to review the PR once the tests have passed. @ArthurZucker and @fxmarty have already reviewed this PR, and @amyeroberts may be interested as she commented on the original PR.
To make things faster, I tried running the following on a GPU:
RUN_SLOW=1 pytest tests/models/audio_spectrogram_transformer/ tests/models/deit/ tests/models/videomae/ tests/models/vision_encoder_decoder/ tests/models/vision_text_dual_encoder/ tests/models/vit/ tests/models/vit_mae/ tests/models/vit_msn/ tests/models/yolos/
So far I am getting a few failures, some of which (the OOM errors) are unrelated to this PR:
====================================================================================== short test summary info =======================================================================================
FAILED tests/models/vision_encoder_decoder/test_modeling_flax_vision_encoder_decoder.py::FlaxViT2GPT2EncoderDecoderModelTest::test_pt_flax_equivalence - RuntimeError: Input type (torch.FloatTensor) and weight type (torch.cuda.FloatTensor) should be the same or input should be a MKLDNN tensor and weight is a dense tensor
FAILED tests/models/vision_encoder_decoder/test_modeling_tf_vision_encoder_decoder.py::TFViT2GPT2EncoderDecoderModelTest::test_pt_tf_model_equivalence - AssertionError: False is not true : outputs.encoder_attentions_0: `pt_outputs` should a tensor when `tf_outputs` is
FAILED tests/models/vision_encoder_decoder/test_modeling_vision_encoder_decoder.py::DeiT2RobertaModelTest::test_encoder_decoder_model_output_attentions - AttributeError: 'NoneType' object has no attribute 'shape'
FAILED tests/models/vision_encoder_decoder/test_modeling_vision_encoder_decoder.py::ViT2BertModelTest::test_encoder_decoder_model_output_attentions - ValueError: You have to specify pixel_values
FAILED tests/models/vision_encoder_decoder/test_modeling_vision_encoder_decoder.py::ViT2TrOCR::test_encoder_decoder_model_output_attentions - ValueError: You have to specify pixel_values
FAILED tests/models/vision_encoder_decoder/test_modeling_vision_encoder_decoder.py::TrOCRModelIntegrationTest::test_inference_handwritten - torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU
FAILED tests/models/vision_encoder_decoder/test_modeling_vision_encoder_decoder.py::TrOCRModelIntegrationTest::test_inference_printed - torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU
FAILED tests/models/vision_encoder_decoder/test_modeling_vision_encoder_decoder.py::ViT2GPT2ModelIntegrationTest::test_inference_coco_en - torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU
FAILED tests/models/vision_encoder_decoder/test_modeling_vision_encoder_decoder.py::DonutModelIntegrationTest::test_inference_cordv2 - torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU
FAILED tests/models/vision_encoder_decoder/test_modeling_vision_encoder_decoder.py::DonutModelIntegrationTest::test_inference_docvqa - torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 150.00 MiB. GPU
FAILED tests/models/vision_encoder_decoder/test_modeling_vision_encoder_decoder.py::DonutModelIntegrationTest::test_inference_rvlcdip - torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU
FAILED tests/models/vision_text_dual_encoder/test_modeling_flax_vision_text_dual_encoder.py::FlaxViTBertModelTest::test_pt_flax_equivalence - RuntimeError: Input type (torch.FloatTensor) and weight type (torch.cuda.FloatTensor) should be the same or input should be a MKLDNN tensor and weight is a dense tensor
FAILED tests/models/vision_text_dual_encoder/test_modeling_flax_vision_text_dual_encoder.py::FlaxCLIPVisionBertModelTest::test_pt_flax_equivalence - RuntimeError: Input type (torch.FloatTensor) and weight type (torch.cuda.FloatTensor) should be the same or input should be a MKLDNN tensor and weight is a dense tensor
FAILED tests/models/vision_text_dual_encoder/test_modeling_vision_text_dual_encoder.py::ViTBertModelTest::test_pt_flax_equivalence - TypeError: can't convert cuda:0 device type tensor to numpy. Use Tensor.cpu() to copy the tensor to host memory first.
FAILED tests/models/vision_text_dual_encoder/test_modeling_vision_text_dual_encoder.py::ViTBertModelTest::test_vision_text_output_attention - AttributeError: 'NoneType' object has no attribute 'shape'
FAILED tests/models/vision_text_dual_encoder/test_modeling_vision_text_dual_encoder.py::DeiTRobertaModelTest::test_vision_text_output_attention - AttributeError: 'NoneType' object has no attribute 'shape'
FAILED tests/models/vision_text_dual_encoder/test_modeling_vision_text_dual_encoder.py::CLIPVisionBertModelTest::test_pt_flax_equivalence - TypeError: can't convert cuda:0 device type tensor to numpy. Use Tensor.cpu() to copy the tensor to host memory first.
FAILED tests/models/yolos/test_image_processing_yolos.py::YolosImageProcessingTest::test_batched_coco_detection_annotations - ImportError: Pycocotools is not installed in your environment.
FAILED tests/models/yolos/test_modeling_yolos.py::YolosModelTest::test_attention_outputs - AttributeError: 'NoneType' object has no attribute 'shape'
================================================================ 19 failed, 820 passed, 427 skipped, 95 warnings in 607.33s (0:10:07) ===============================================================
Thanks for working on this and enabling this for our models, @hyenal!
We've literally just merged in a new feature which should help us run slow tests. To enable this, I've added the run-slow label to this PR. To trigger a run of the slow tests could you:
- Rebase on `main` to include https://github.com/huggingface/transformers/pull/30540
- Push an empty commit with the message: [run-slow] audio_spectrogram_transformer,deit,vit,vit_hybrid,vit_mae,vit_msn,videomae
@amyeroberts I rebased and ran the pipeline as indicated. The last run should have failed (I know the yolos and encoder/decoder tests are not ready yet), so I am not sure if I did something incorrectly.
@hyenal It was just waiting for me to approve the run :) We don't run automatically for security reasons and to prevent running slow, heavy tests unnecessarily
Thank you @amyeroberts! I will fix the tests and request a new slow run once things are fixed :)
@amyeroberts when you have some time, could you trigger the latest slow run I pushed? I fixed most of the issues, but there are 3 failures (ViT2BertModelTest.test_real_model_save_load_from_pretrained, NougatModelIntegrationTest.test_forward_pass, NougatModelIntegrationTest.test_generation) that I did not manage to reproduce locally.
Is there any specific command I should run for these tests?
@hyenal Sure! I've approved the workflow run, which should trigger these tests. I don't think there's anything special you need to do to run them. If you're unable to reproduce locally, and they're being run (not skipped), then it's likely just an env or runner issue, and we can help debug that.
The PR is now ready. 3 slow tests are failing, but I am unable to find their source (a precision error due to SDPA?); if possible I would like to get some help on them.
To further check that everything is working, I could push a slow-test pipeline commit to verify that all of the changed models pass in slow mode.
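To investigate the precision hypothesis, this is a rough sketch (not part of the PR) of the kind of check I have in mind, comparing eager and SDPA outputs on the same inputs; the checkpoint and the expected tolerance are arbitrary choices:

```python
import torch
from transformers import ViTModel

torch.manual_seed(0)
pixel_values = torch.randn(2, 3, 224, 224)

hidden_states = {}
for impl in ("eager", "sdpa"):
    model = ViTModel.from_pretrained(
        "google/vit-base-patch16-224", attn_implementation=impl
    ).eval()
    with torch.no_grad():
        hidden_states[impl] = model(pixel_values=pixel_values).last_hidden_state

max_diff = (hidden_states["eager"] - hidden_states["sdpa"]).abs().max().item()
print(f"max abs diff eager vs sdpa: {max_diff:.2e}")
# Differences around 1e-6 in fp32 are expected numerical noise; anything
# much larger would point to an actual implementation bug.
```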
@amyeroberts I am afraid that I cannot find a direct link between this PR and the current failures:
- tests/models/vision_encoder_decoder/test_modeling_vision_encoder_decoder.py::ViT2BertModelTest::test_real_model_save_load_from_pretrained: this test also fails for me on `main`. According to the stderr, it seems that some parameters are not properly initialised:
Some weights of ViTModel were not initialized from the model checkpoint at hf-internal-testing/tiny-random-vit and are newly initialized: ['vit.pooler.dense.bias', 'vit.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Some weights of BertLMHeadModel were not initialized from the model checkpoint at hf-internal-testing/tiny-bert and are newly initialized: ['bert.encoder.layer.0.crossattention.output.LayerNorm.bias', 'bert.encoder.layer.0.crossattention.output.LayerNorm.weight', 'bert.encoder.layer.0.crossattention.output.dense.bias', 'bert.encoder.layer.0.crossattention.output.dense.weight', 'bert.encoder.layer.0.crossattention.self.key.bias', 'bert.encoder.layer.0.crossattention.self.key.weight', 'bert.encoder.layer.0.crossattention.self.query.bias', 'bert.encoder.layer.0.crossattention.self.query.weight', 'bert.encoder.layer.0.crossattention.self.value.bias', 'bert.encoder.layer.0.crossattention.self.value.weight', 'bert.encoder.layer.1.crossattention.output.LayerNorm.bias', 'bert.encoder.layer.1.crossattention.output.LayerNorm.weight', 'bert.encoder.layer.1.crossattention.output.dense.bias', 'bert.encoder.layer.1.crossattention.output.dense.weight', 'bert.encoder.layer.1.crossattention.self.key.bias', 'bert.encoder.layer.1.crossattention.self.key.weight', 'bert.encoder.layer.1.crossattention.self.query.bias', 'bert.encoder.layer.1.crossattention.self.query.weight', 'bert.encoder.layer.1.crossattention.self.value.bias', 'bert.encoder.layer.1.crossattention.self.value.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
- Nougat tests: Nougat uses Swin, which is not part of this PR. The currently failing tests are also failing on `main` on my side.
The last thing to do is to add performance numbers for the models, e.g. like here for Mistral. It's not necessary to run this for all of the models (although that would be great!), but numbers for deit, vit, vit_mae and yolos should be included as they're quite popular.
@amyeroberts that can be done! If you have any script I could use so that we keep the same format for the images, that would be great!
Also, do you mind resolving the comments that are left open? Just to confirm we agree :)
> Also, do you mind resolving the comments that are left open? Just to confirm we agree :)
@hyenal Sure! I think I've resolved all of them. Let me know if there's any I missed.
> @amyeroberts that can be done! If you have any script I could use so that we keep the same format for the images, that would be great!
I don't have a script to hand, unfortunately. In terms of measuring the speed ups, it's OK to use different images/formats across the different models, as long as the settings for e.g. ViT are consistent.
I copied the style of https://github.com/huggingface/transformers/pull/30390; let me know if the docs are okay.
Code for reproducibility
from collections import defaultdict
from time import perf_counter_ns
import numpy as np
import pandas as pd
import requests
import torch
from PIL import Image
from tabulate import tabulate
BATCH_SIZES = [1, 2, 4, 8]
ATTN_IMPLEMENTATION = ["eager", "sdpa"]
def profile_ast(
attn_implementation: str = "eager",
n_trial: int = 10,
batch_size: int = 1,
use_cuda: bool = False,
dtype=torch.float32,
) -> int:
import torch
from datasets import load_dataset
from transformers import ASTForAudioClassification, AutoFeatureExtractor
dataset = load_dataset(
"hf-internal-testing/librispeech_asr_demo", "clean", split="validation"
)
dataset = dataset.sort("id")
sampling_rate = dataset.features["audio"].sampling_rate
feature_extractor = AutoFeatureExtractor.from_pretrained(
"MIT/ast-finetuned-audioset-10-10-0.4593",
torch_dtype=dtype,
)
model = ASTForAudioClassification.from_pretrained(
"MIT/ast-finetuned-audioset-10-10-0.4593",
attn_implementation=attn_implementation,
torch_dtype=dtype,
)
inputs = feature_extractor(
dataset[0]["audio"]["array"],
sampling_rate=sampling_rate,
return_tensors="pt",
) # .to("cuda")
inputs["input_values"] = inputs["input_values"].tile((batch_size, 1, 1))
if use_cuda:
inputs["input_values"] = inputs["input_values"].to("cuda")
total_time = 0.0
if use_cuda:
model = model.to("cuda")
for _ in range(n_trial):
time_start = perf_counter_ns()
with torch.no_grad():
model(**inputs)
time_end = perf_counter_ns()
total_time += (time_end - time_start) / 1e6
return int(total_time / n_trial)
def profile_deit(
attn_implementation: str = "eager",
n_trial: int = 10,
batch_size: int = 1,
use_cuda: bool = False,
dtype=torch.float32,
) -> int:
from transformers import AutoImageProcessor, DeiTForImageClassification
torch.manual_seed(3)
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
# note: we load DeiTForImageClassification from a distilled (with-teacher) checkpoint,
# so its classification head will be randomly initialized, hence the predictions will be random
image_processor = AutoImageProcessor.from_pretrained(
"facebook/deit-base-distilled-patch16-224",
torch_dtype=dtype,
)
model = DeiTForImageClassification.from_pretrained(
"facebook/deit-base-distilled-patch16-224",
attn_implementation=attn_implementation,
torch_dtype=dtype,
)
if use_cuda:
model = model.to("cuda")
inputs = image_processor(images=image, return_tensors="pt")
inputs["pixel_values"] = inputs["pixel_values"].tile((batch_size, 1, 1, 1))
if use_cuda:
inputs["pixel_values"] = inputs["pixel_values"].to("cuda")
total_time = 0.0
for _ in range(n_trial):
time_start = perf_counter_ns()
with torch.no_grad():
model(**inputs)
time_end = perf_counter_ns()
total_time += (time_end - time_start) / 1e6
return int(total_time / n_trial)
def profile_vit(
attn_implementation: str = "eager",
n_trial: int = 10,
batch_size: int = 1,
use_cuda: bool = False,
dtype=torch.float32,
):
import torch
from datasets import load_dataset
from transformers import AutoImageProcessor, ViTForImageClassification
dataset = load_dataset("huggingface/cats-image")
image = dataset["test"]["image"][0]
image_processor = AutoImageProcessor.from_pretrained(
"google/vit-base-patch16-224",
torch_dtype=dtype,
)
model = ViTForImageClassification.from_pretrained(
"google/vit-base-patch16-224",
attn_implementation=attn_implementation,
torch_dtype=dtype,
)
if use_cuda:
model = model.to("cuda")
inputs = image_processor(image, return_tensors="pt")
inputs["pixel_values"] = inputs["pixel_values"].tile((batch_size, 1, 1, 1))
if use_cuda:
inputs["pixel_values"] = inputs["pixel_values"].to("cuda")
total_time = 0.0
for _ in range(n_trial):
time_start = perf_counter_ns()
with torch.no_grad():
model(**inputs)
time_end = perf_counter_ns()
total_time += (time_end - time_start) / 1e6
return int(total_time / n_trial)
def profile_vit_hybrid(
attn_implementation: str = "eager",
n_trial: int = 10,
batch_size: int = 1,
use_cuda: bool = False,
dtype=torch.float32,
):
import torch
from datasets import load_dataset
from transformers import AutoImageProcessor, ViTHybridForImageClassification
dataset = load_dataset("huggingface/cats-image")
image = dataset["test"]["image"][0]
image_processor = AutoImageProcessor.from_pretrained(
"google/vit-hybrid-base-bit-384",
torch_dtype=dtype,
)
model = ViTHybridForImageClassification.from_pretrained(
"google/vit-hybrid-base-bit-384",
attn_implementation=attn_implementation,
torch_dtype=dtype,
)
if use_cuda:
model = model.to("cuda")
inputs = image_processor(image, return_tensors="pt")
inputs["pixel_values"] = inputs["pixel_values"].tile((batch_size, 1, 1, 1))
if use_cuda:
inputs["pixel_values"] = inputs["pixel_values"].to("cuda")
total_time = 0.0
for _ in range(n_trial):
time_start = perf_counter_ns()
with torch.no_grad():
model(**inputs)
time_end = perf_counter_ns()
total_time += (time_end - time_start) / 1e6
return int(total_time / n_trial)
def profile_vit_mae(
attn_implementation: str = "eager",
n_trial: int = 10,
batch_size: int = 1,
use_cuda: bool = False,
dtype=torch.float32,
):
# Vit Mae
from transformers import AutoImageProcessor, ViTMAEModel
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
image_processor = AutoImageProcessor.from_pretrained(
"facebook/vit-mae-base",
torch_dtype=dtype,
)
model = ViTMAEModel.from_pretrained(
"facebook/vit-mae-base",
attn_implementation=attn_implementation,
torch_dtype=dtype,
)
if use_cuda:
model = model.to("cuda")
inputs = image_processor(images=image, return_tensors="pt")
inputs["pixel_values"] = inputs["pixel_values"].tile((batch_size, 1, 1, 1))
if use_cuda:
inputs["pixel_values"] = inputs["pixel_values"].to("cuda")
total_time = 0.0
for _ in range(n_trial):
time_start = perf_counter_ns()
with torch.no_grad():
model(**inputs)
time_end = perf_counter_ns()
total_time += (time_end - time_start) / 1e6
return int(total_time / n_trial)
def profile_vit_msn(
attn_implementation: str = "eager",
n_trial: int = 10,
batch_size: int = 1,
use_cuda: bool = False,
dtype=torch.float32,
):
from transformers import AutoImageProcessor, ViTMSNForImageClassification
torch.manual_seed(2)
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
image_processor = AutoImageProcessor.from_pretrained(
"facebook/vit-msn-base",
torch_dtype=dtype,
)
model = ViTMSNForImageClassification.from_pretrained(
"facebook/vit-msn-base",
attn_implementation=attn_implementation,
torch_dtype=dtype,
)
if use_cuda:
model = model.to("cuda")
inputs = image_processor(images=image, return_tensors="pt")
inputs["pixel_values"] = inputs["pixel_values"].tile((batch_size, 1, 1, 1))
if use_cuda:
inputs["pixel_values"] = inputs["pixel_values"].to("cuda")
total_time = 0.0
for _ in range(n_trial):
time_start = perf_counter_ns()
with torch.no_grad():
model(**inputs)
time_end = perf_counter_ns()
total_time += (time_end - time_start) / 1e6
return int(total_time / n_trial)
def profile_yolo(
attn_implementation: str = "eager",
n_trial: int = 10,
batch_size: int = 1,
use_cuda: bool = False,
dtype=torch.float32,
):
from transformers import AutoImageProcessor, AutoModelForObjectDetection
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
image_processor = AutoImageProcessor.from_pretrained(
"hustvl/yolos-base",
torch_dtype=dtype,
)
model = AutoModelForObjectDetection.from_pretrained(
"hustvl/yolos-base",
attn_implementation=attn_implementation,
torch_dtype=dtype,
)
if use_cuda:
model = model.to("cuda")
inputs = image_processor(images=image, return_tensors="pt")
inputs["pixel_values"] = inputs["pixel_values"].tile((batch_size, 1, 1, 1))
if use_cuda:
inputs["pixel_values"] = inputs["pixel_values"].to("cuda")
total_time = 0.0
for _ in range(n_trial):
time_start = perf_counter_ns()
with torch.no_grad():
model(**inputs)
time_end = perf_counter_ns()
total_time += (time_end - time_start) / 1e6
return int(total_time / n_trial)
def profile_videomae(
attn_implementation: str = "eager",
n_trial: int = 10,
batch_size: int = 1,
use_cuda: bool = False,
dtype=torch.float32,
):
import av
from huggingface_hub import hf_hub_download
from transformers import AutoImageProcessor, VideoMAEForVideoClassification
np.random.seed(0)
def read_video_pyav(container, indices):
"""
Decode the video with PyAV decoder.
Args:
container (`av.container.input.InputContainer`): PyAV container.
indices (`List[int]`): List of frame indices to decode.
Returns:
result (np.ndarray): np array of decoded frames of shape (num_frames, height, width, 3).
"""
frames = []
container.seek(0)
start_index = indices[0]
end_index = indices[-1]
for i, frame in enumerate(container.decode(video=0)):
if i > end_index:
break
if i >= start_index and i in indices:
frames.append(frame)
return np.stack([x.to_ndarray(format="rgb24") for x in frames])
def sample_frame_indices(clip_len, frame_sample_rate, seg_len):
"""
Sample a given number of frame indices from the video.
Args:
clip_len (`int`): Total number of frames to sample.
frame_sample_rate (`int`): Sample every n-th frame.
seg_len (`int`): Maximum allowed index of sample's last frame.
Returns:
indices (`List[int]`): List of sampled frame indices
"""
converted_len = int(clip_len * frame_sample_rate)
end_idx = np.random.randint(converted_len, seg_len)
start_idx = end_idx - converted_len
indices = np.linspace(start_idx, end_idx, num=clip_len)
indices = np.clip(indices, start_idx, end_idx - 1).astype(np.int64)
return indices
# video clip consists of 300 frames (10 seconds at 30 FPS)
file_path = hf_hub_download(
repo_id="nielsr/video-demo",
filename="eating_spaghetti.mp4",
repo_type="dataset",
)
container = av.open(file_path)
# sample 16 frames
indices = sample_frame_indices(
clip_len=16, frame_sample_rate=1, seg_len=container.streams.video[0].frames
)
video = read_video_pyav(container, indices)
image_processor = AutoImageProcessor.from_pretrained(
"MCG-NJU/videomae-base-finetuned-kinetics",
torch_dtype=dtype,
)
model = VideoMAEForVideoClassification.from_pretrained(
"MCG-NJU/videomae-base-finetuned-kinetics",
attn_implementation=attn_implementation,
torch_dtype=dtype,
)
if use_cuda:
model = model.to("cuda")
inputs = image_processor(list(video), return_tensors="pt")
inputs["pixel_values"] = inputs["pixel_values"].tile((batch_size, 1, 1, 1, 1))
if use_cuda:
inputs["pixel_values"] = inputs["pixel_values"].to("cuda")
total_time = 0.0
for _ in range(n_trial):
time_start = perf_counter_ns()
with torch.no_grad():
model(**inputs)
time_end = perf_counter_ns()
total_time += (time_end - time_start) / 1e6
return int(total_time / n_trial)
def print_comparison(
name: str, batch_sizes: list[int], time_eager: list[float], time_sdpa: list[float]
) -> None:
df = pd.DataFrame(
{
"Batch size": batch_sizes,
"Average inference time (ms), eager mode": time_eager,
"Average inference time (ms), sdpa model": time_sdpa,
"Speed up, Sdpa / Eager (x)": np.array(time_eager) / np.array(time_sdpa),
}
)
print(f"Model: {name}")
print(tabulate(df, headers=df.columns, showindex=False, tablefmt="github"))
MODELS = {
"AST": profile_ast,
"Deit": profile_deit,
"ViT": profile_vit,
"ViT Hybrid": profile_vit_hybrid,
"ViT MAE": profile_vit_mae,
"ViT MSN": profile_vit_msn,
"Yolos": profile_yolo,
"VideoMAE": profile_videomae
}
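# Run every profiler with both attention implementations across all batch sizes,
# then print a comparison table per model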
for model_name, profiler in MODELS.items():
times = defaultdict(list)
for attn_implementation in ATTN_IMPLEMENTATION:
for b in BATCH_SIZES:
times[attn_implementation].append(
profiler(
attn_implementation=attn_implementation,
batch_size=b,
dtype=torch.float32,
use_cuda=True,
)
)
print_comparison(model_name, BATCH_SIZES, times["eager"], times["sdpa"])
Is there anything left to do on this PR? Since it has been approved, I am wondering about the next steps in order to merge :)
@hyenal The only thing left is the ViT2Bert vision encoder-decoder integration test. Agreed that this PR shouldn't have any effect on the nougat/donut tests, and we can ignore those.
@amyeroberts do you have any recent slow pipeline run on `main` where ViT2Bert passed? Using an A100 or my local machine (CPU), I have twice gotten the same error as on this PR.
Steps to reproduce:
# Clone repository
pip install -e ".[dev]"
RUN_SLOW=1 pytest tests/models/vision_encoder_decoder/test_modeling_vision_encoder_decoder.py::ViT2BertModelTest::test_real_model_save_load_from_pretrained
I have tried to look for slow pipelines on `main`, but all I can find are pending or cancelled runs.
@hyenal Let me dig into it and see 🕵️ It'll be tomorrow though, as I'm signing off soon
Hi @hyenal, I got yesterday's full slow model CI run here: https://github.com/huggingface/transformers/actions/runs/9089085470/job/24979852319
And good news: all of the failing tests (nougat, donut, vit2bert) are failing there too 🥳
I'll merge now. Thanks for all the work and patience adding this impactful feature!
Thank you so much @amyeroberts!!