
FlaxDataCollatorForT5MLM :ValueError: all input arrays must have the same shape


System Info

  • transformers version: 4.27.1
  • Platform: Linux-5.18.10-76051810-generic-x86_64-with-glibc2.35
  • Python version: 3.10.6
  • Huggingface_hub version: 0.11.1
  • PyTorch version (GPU?): 2.0.0.dev20230202+cu116 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: yes
  • Using distributed or parallel set-up in script?: No

Who can help?

@patil-suraj @patrickvonplaten

Information

  • [ ] The official example scripts
  • [X] My own modified scripts

Tasks

  • [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [X] My own task or dataset (give details below)

Reproduction

I am following the script to reproduce the above https://github.com/huggingface/transformers/blob/main/examples/flax/language-modeling/run_t5_mlm_flax.py#L336-L346

If I set mean_noise_span_length > 1, the following runs without error for any value of noise_density:

 prompt = "The cute dog walks in the green park"
    encoded = tokenizer(prompt, truncation=False, padding=False, return_tensors="pt").input_ids
    batch_size =1
    input_length = encoded.shape[1]
    denoiser = FlaxDataCollatorForT5MLM(tokenizer,.35,3)
    mask_indices = np.asarray([denoiser.random_spans_noise_mask(input_length) for i in range(batch_size)])
    labels_mask = ~mask_indices
    input_ids_sentinel = denoiser.create_sentinel_ids(mask_indices.astype(np.int8))
    labels_sentinel = denoiser.create_sentinel_ids(labels_mask.astype(np.int8))
    input_ids = denoiser.filter_input_ids(encoded, input_ids_sentinel)
    labels  =  denoiser.filter_input_ids(encoded, labels_sentinel)
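
For reference, decoding the filtered ids shows what the collator produces (the outputs in the comments are illustrative only; spans are picked randomly each run, and <extra_id_*> are the T5 sentinel tokens):

    # Decode to see the sentinel-masked input and its target (output varies per run)
    print(tokenizer.decode(input_ids[0]))  # e.g. "The cute <extra_id_0> in the green park ..."
    print(tokenizer.decode(labels[0]))     # e.g. "<extra_id_0> dog walks <extra_id_1> ..."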

If I set mean_noise_span_length == 1, then for many values of noise_density I get this error:

Traceback (most recent call last):
  File "/home/alex/coding/tranformer_learn/t5_denoising.py", line 133, in <module>
    mask_indices = np.asarray([denoiser.random_spans_noise_mask(input_length) for i in range(batch_size)])
  File "/home/alex/coding/tranformer_learn/t5_denoising.py", line 133, in <listcomp>
    mask_indices = np.asarray([denoiser.random_spans_noise_mask(input_length) for i in range(batch_size)])
  File "/home/alex/coding/tranformer_learn/t5_denoising.py", line 94, in random_spans_noise_mask
    np.stack([nonnoise_span_lengths, noise_span_lengths], axis=1), [num_noise_spans * 2]
  File "<__array_function__ internals>", line 200, in stack
  File "/home/alex/.local/lib/python3.10/site-packages/numpy/core/shape_base.py", line 464, in stack
    raise ValueError('all input arrays must have the same shape')
ValueError: all input arrays must have the same shape

Basically, the two arrays passed to np.stack have different lengths:

    interleaved_span_lengths = np.reshape(
        np.stack([nonnoise_span_lengths, noise_span_lengths], axis=1), [num_noise_spans * 2]
    )

From what I could make out, this happens because num_noise_spans == num_noise_tokens when mean_noise_span_length == 1:

num_noise_spans = int(np.round(num_noise_tokens / self.mean_noise_span_length))
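
To make the failure concrete, here is the arithmetic with illustrative numbers (length and noise_density chosen just for demonstration):

    length = 10
    noise_density = 0.7
    mean_noise_span_length = 1

    num_noise_tokens = int(np.round(length * noise_density))                    # 7
    num_noise_spans = int(np.round(num_noise_tokens / mean_noise_span_length))  # 7
    num_nonnoise_tokens = length - num_noise_tokens                             # 3

    # _random_segmentation(num_items, num_segments) can return at most num_items
    # segments, so splitting 3 non-noise tokens into 7 spans yields only 3
    # lengths, while the noise side yields 7:
    #   noise_span_lengths.shape    == (7,)
    #   nonnoise_span_lengths.shape == (3,)
    # np.stack([...], axis=1) then raises "all input arrays must have the same shape".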

A runnable reproduction is in this gist: https://gist.github.com/alexcpn/b9bb2b0f01833d1bb862502faf99bab8

Expected behavior

There should be no exception: random_spans_noise_mask should return a valid mask for mean_noise_span_length == 1 as well.
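
One way to make this robust (a sketch of one possible fix, using only the quantities already computed inside random_spans_noise_mask) is to never ask for more spans than there are tokens on the smaller side of the noise/non-noise split:

    # Sketch: cap the span count so both segmentations can produce
    # num_noise_spans spans; keep at least one span to avoid degeneracy.
    num_noise_spans = int(
        np.round(min(num_noise_tokens, num_nonnoise_tokens) / self.mean_noise_span_length)
    )
    num_noise_spans = max(num_noise_spans, 1)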

alexcpn avatar Mar 18 '23 15:03 alexcpn

cc @sanchit-gandhi @ArthurZucker maybe

patrickvonplaten avatar Mar 21 '23 10:03 patrickvonplaten

Hey @alexcpn - great job at digging into the issue and thanks for the gist! It does indeed look like we're hitting this error because of how we compute num_noise_spans: https://github.com/huggingface/transformers/blob/aec10d162f59d809ead3990ef78c51918b622f38/examples/flax/language-modeling/run_t5_mlm_flax.py#L274

Would you like to open a PR to fix this so that it's robust for mean_noise_span_length == 1?

The code is largely ported from the original T5 pre-processing, which can be found here: https://github.com/google-research/text-to-text-transfer-transformer/blob/main/t5/data/preprocessors.py

sanchit-gandhi avatar Apr 18 '23 17:04 sanchit-gandhi

Hi @sanchit-gandhi, I have tried to demonstrate the problem and a possible correction. Please find the pull request here: https://github.com/huggingface/transformers/pull/22938

alexcpn avatar Apr 22 '23 14:04 alexcpn