
TF: Can't create sharded XGLM model

Open · gante opened this issue 1 year ago • 2 comments

System Info

  • transformers version: 4.22.0.dev0
  • Platform: Linux-5.15.0-33-generic-x86_64-with-glibc2.35
  • Python version: 3.8.13
  • Huggingface_hub version: 0.9.0
  • PyTorch version (GPU?): 1.12.0+cu116 (True)
  • Tensorflow version (GPU?): 2.9.1 (True)
  • Flax version (CPU?/GPU?/TPU?): 0.5.0 (gpu)
  • Jax version: 0.3.5
  • JaxLib version: 0.3.5
  • Using GPU in script?: No
  • Using distributed or parallel set-up in script?: No

Who can help?

@ArthurZucker

Information

  • [X] The official example scripts
  • [ ] My own modified scripts

Tasks

  • [X] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [ ] My own task or dataset (give details below)

Reproduction

Running this CLI command

CUDA_VISIBLE_DEVICES="" TOKENIZERS_PARALLELISM=false NVIDIA_TF32_OVERRIDE=0 transformers-cli pt-to-tf --model-name facebook/xglm-2.9B --new-weights --max-error 3e-3

gets you the following exception (in the sharding code):

Traceback (most recent call last):
  File "/home/joao/hf/bin/transformers-cli", line 8, in <module>
    sys.exit(main())
  File "/home/joao/transformers/src/transformers/commands/transformers_cli.py", line 55, in main
    service.run()
  File "/home/joao/transformers/src/transformers/commands/pt_to_tf.py", line 309, in run
    tf_from_pt_model.save_pretrained(self._local_dir)
  File "/home/joao/transformers/src/transformers/modeling_tf_utils.py", line 2020, in save_pretrained
    param_dset = shard_file.create_dataset(
  File "/home/joao/hf/lib/python3.8/site-packages/h5py/_hl/group.py", line 161, in create_dataset
    dsid = dataset.make_new_dset(group, shape, dtype, data, name, **kwds)
  File "/home/joao/hf/lib/python3.8/site-packages/h5py/_hl/dataset.py", line 156, in make_new_dset
    dset_id = h5d.create(parent.id, name, tid, sid, dcpl=dcpl, dapl=dapl)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5d.pyx", line 84, in h5py.h5d.create
TypeError: expected bytes, str found

Expected behavior

Successful sharding :D

gante · Aug 26 '22 16:08

cc @ArthurZucker

gante · Aug 26 '22 16:08

Hey! Little update on this: the problem comes from the previously introduced "hack":

    return tf.Variable(emb, trainable=False, name="model.embed_positions.weights")

This appears here; the same hack can also be seen in BART.
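
For context, here is a minimal sketch of what I believe is the failure mode. My assumption (paraphrasing the sharding code in modeling_tf_utils.py) is that the dataset name is derived by splitting layer.name on "/"; a name like "model.embed_positions.weights" contains no "/", so the derived name collapses to an empty string, and in my reading of h5py 3.x an empty name skips the str-to-bytes encoding step and reaches the low-level h5d.create as a str, producing exactly the TypeError above:

    import h5py

    layer_name = "model.embed_positions.weights"
    # Assumed paraphrase of the name derivation in save_pretrained: drop the
    # first "/"-separated segment. With no "/" in the name, this yields "".
    derived_name = "/".join(layer_name.split("/")[1:])

    with h5py.File("demo.h5", "w") as shard_file:
        # With an empty name, h5py's encoding branch is skipped (my reading of
        # h5py 3.x), so the low-level h5d.create receives a str and raises
        # TypeError: expected bytes, str found
        shard_file.create_dataset(derived_name, (4,), dtype="float32")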

In order to introduce as few breaking changes as possible, I think we can add the following:

if "model." in layer.name : # potentially all models that have the hack will have model. something" 
    param_dset = shard_file.create_dataset(
                            ".".join(layer.name.split(".")[1:]), layer.numpy().shape, dtype=layer.numpy().dtype
                        )

I think we have to keep the "." separation for coherence (a quick sanity check is sketched below). Will see if I can open a PR on that soon.
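
Sanity check: the stripped, dot-separated name round-trips through h5py. This is a minimal sketch resting on the assumption that HDF5 treats "/" as its only path separator, so "." is an ordinary character in dataset names; the file name here is illustrative:

    import h5py
    import numpy as np

    # Strip the leading "model." segment, as in the snippet above
    name = ".".join("model.embed_positions.weights".split(".")[1:])  # "embed_positions.weights"

    with h5py.File("sanity.h5", "w") as f:
        param_dset = f.create_dataset(name, (2, 3), dtype="float32")
        param_dset[:] = np.zeros((2, 3), dtype="float32")
        assert f[name].shape == (2, 3)  # the dotted name is stored and read back verbatim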

ArthurZucker · Sep 15 '22 19:09