FlaxDataCollatorForT5MLM: ValueError: all input arrays must have the same shape
System Info
- transformers version: 4.27.1
- Platform: Linux-5.18.10-76051810-generic-x86_64-with-glibc2.35
- Python version: 3.10.6
- Huggingface_hub version: 0.11.1
- PyTorch version (GPU?): 2.0.0.dev20230202+cu116 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: yes
- Using distributed or parallel set-up in script?: No
Who can help?
@patil-suraj @patrickvonplaten
Information
- [ ] The official example scripts
- [X] My own modified scripts
Tasks
- [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- [X] My own task or dataset (give details below)
Reproduction
I am following this script to reproduce the above: https://github.com/huggingface/transformers/blob/main/examples/flax/language-modeling/run_t5_mlm_flax.py#L336-L346
If I set mean_noise_span_length > 1, for any value of noise_density, I get the output as expected:
prompt = "The cute dog walks in the green park"
encoded = tokenizer(prompt, truncation=False, padding=False, return_tensors="pt").input_ids
batch_size =1
input_length = encoded.shape[1]
denoiser = FlaxDataCollatorForT5MLM(tokenizer,.35,3)
mask_indices = np.asarray([denoiser.random_spans_noise_mask(input_length) for i in range(batch_size)])
labels_mask = ~mask_indices
input_ids_sentinel = denoiser.create_sentinel_ids(mask_indices.astype(np.int8))
labels_sentinel = denoiser.create_sentinel_ids(labels_mask.astype(np.int8))
input_ids = denoiser.filter_input_ids(encoded, input_ids_sentinel)
labels = denoiser.filter_input_ids(encoded, labels_sentinel)
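For reference, the corrupted input and the targets can be inspected by decoding the filtered ids (just a quick sketch; the exact output varies because the noise mask is random):
# Sketch: decode the filtered ids to see the sentinel-masked input and targets
# (output varies from run to run because the noise mask is sampled randomly).
print(tokenizer.decode(input_ids[0]))
print(tokenizer.decode(labels[0]))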
If I set mean_noise_span_length == 1, for many values of noise_density, I get the error:
Traceback (most recent call last):
File "/home/alex/coding/tranformer_learn/t5_denoising.py", line 133, in <module>
mask_indices = np.asarray([denoiser.random_spans_noise_mask(input_length) for i in range(batch_size)])
File "/home/alex/coding/tranformer_learn/t5_denoising.py", line 133, in <listcomp>
mask_indices = np.asarray([denoiser.random_spans_noise_mask(input_length) for i in range(batch_size)])
File "/home/alex/coding/tranformer_learn/t5_denoising.py", line 94, in random_spans_noise_mask
np.stack([nonnoise_span_lengths, noise_span_lengths], axis=1), [num_noise_spans * 2]
File "<__array_function__ internals>", line 200, in stack
File "/home/alex/.local/lib/python3.10/site-packages/numpy/core/shape_base.py", line 464, in stack
raise ValueError('all input arrays must have the same shape')
ValueError: all input arrays must have the same shape
Basically, the two arrays passed to numpy's stack have different lengths:
interleaved_span_lengths = np.reshape(
np.stack([nonnoise_span_lengths, noise_span_lengths], axis=1), [num_noise_spans * 2]
)
From what I could make out, this happens when num_noise_spans == num_noise_tokens, which is the case when mean_noise_span_length == 1:
num_noise_spans = int(np.round(num_noise_tokens / self.mean_noise_span_length))
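For concreteness, here is a minimal standalone sketch of the failure mode. The _random_segmentation helper is reproduced from inside random_spans_noise_mask in the example script; the concrete numbers (length = 8, noise_density = 0.6) are only an illustration, and the token-count clamping in the real code is omitted:
import numpy as np

# Helper reproduced from random_spans_noise_mask in run_t5_mlm_flax.py:
# partition num_items into num_segments non-empty segments and return the lengths.
def _random_segmentation(num_items, num_segments):
    mask_indices = np.arange(num_items - 1) < (num_segments - 1)
    np.random.shuffle(mask_indices)
    first_in_segment = np.pad(mask_indices, [[1, 0]])
    segment_id = np.cumsum(first_in_segment)
    _, segment_length = np.unique(segment_id, return_counts=True)
    return segment_length

length = 8                                                   # illustrative value
num_noise_tokens = int(np.round(length * 0.6))               # 5
num_nonnoise_tokens = length - num_noise_tokens              # 3
num_noise_spans = int(np.round(num_noise_tokens / 1.0))      # 5 == num_noise_tokens

noise_span_lengths = _random_segmentation(num_noise_tokens, num_noise_spans)
nonnoise_span_lengths = _random_segmentation(num_nonnoise_tokens, num_noise_spans)
print(noise_span_lengths.shape, nonnoise_span_lengths.shape)  # (5,) vs (3,)

# np.stack then fails because the two arrays have different shapes
np.stack([nonnoise_span_lengths, noise_span_lengths], axis=1)  # ValueError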
Code that can be run https://gist.github.com/alexcpn/b9bb2b0f01833d1bb862502faf99bab8
Expected behavior
There should not be an exception.
cc @sanchit-gandhi @ArthurZucker maybe
Hey @alexcpn - great job digging into the issue and thanks for the gist! It does indeed look like we're hitting this error because of how we compute num_noise_spans:
https://github.com/huggingface/transformers/blob/aec10d162f59d809ead3990ef78c51918b622f38/examples/flax/language-modeling/run_t5_mlm_flax.py#L274
Would you like to open a PR to fix this so that it's robust for mean_noise_span_length == 1?
The code is largely ported from the original T5 pre-processing, which can be found here: https://github.com/google-research/text-to-text-transfer-transformer/blob/main/t5/data/preprocessors.py
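One possible direction (just a sketch of a guard, not a prescription for what the PR should do) would be to clamp the span count so that both calls to _random_segmentation return arrays of the same length:
# Sketch of a possible guard (hypothetical; not necessarily what the PR does):
# compute num_nonnoise_tokens first, then never request more spans than there
# are tokens on either side, so both _random_segmentation calls return arrays
# of length num_noise_spans and np.stack succeeds.
num_nonnoise_tokens = length - num_noise_tokens
num_noise_spans = int(np.round(num_noise_tokens / self.mean_noise_span_length))
num_noise_spans = max(min(num_noise_spans, num_noise_tokens, num_nonnoise_tokens), 1)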
Hi @sanchit-gandhi; I have tried to demonstrate the problem and a possible correction. Please find the pull request here: https://github.com/huggingface/transformers/pull/22938