
[trainer] param count for deepspeed zero3

Open · stas00 opened this pull request 2 years ago · 1 comment

As reported in https://github.com/huggingface/transformers/issues/22179, the trainer code doesn't handle sharded models correctly when reporting the "Number of trainable parameters" - I'm not sure if FSDP models have the same issue.

This PR fixes the situation for DeepSpeed ZeRO-3, which otherwise reported a count of 0.
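For reference, the naive count is roughly the following (a minimal sketch, not the exact Trainer code). Under ZeRO-3 each partitioned parameter's storage is replaced by an empty placeholder, so `p.numel()` returns 0 for every parameter and the sum comes out as 0:

```python
# Minimal sketch of the naive count (not the exact Trainer code).
# Under DeepSpeed ZeRO-3 each partitioned parameter's data is an empty
# placeholder tensor, so p.numel() == 0 and the total is reported as 0.
def naive_trainable_param_count(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)
```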

Fixes: https://github.com/huggingface/transformers/issues/22179

stas00 · Mar 16 '23

The documentation is not available anymore as the PR was closed or merged.

An explanation is needed here.

The DeepSpeed team had to invent their own tensor substitute, since two years ago nothing of the kind existed in PyTorch: they replace tensors with placeholders in order to support sharded tensors.

Meta tensors were added only recently, so they are looking at possibly switching to those.

The API I used in this PR is not public per se. The "clean" way would be to gather the tensors and then call the normal t.numel() on them, but that is extremely wasteful and expensive both memory- and time-wise. So I hacked it to read the internal equivalent, which makes it almost instant.
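Roughly, the two options look like this (a hedged sketch, not the exact PR diff; it assumes DeepSpeed's `deepspeed.zero.GatheredParameters` context manager and the `ds_numel` attribute that ZeRO-3 attaches to partitioned parameters):

```python
import deepspeed

# "Clean" but expensive: materialize each full parameter just to read its size.
def param_count_via_gather(model):
    total = 0
    for p in model.parameters():
        with deepspeed.zero.GatheredParameters(p):
            total += p.numel()
    return total

# Fast path along the lines of this PR: read the element count that ZeRO-3
# stores on the placeholder (`ds_numel`), a non-public internal attribute.
def param_count_via_internals(model, trainable_only=True):
    return sum(
        p.ds_numel if hasattr(p, "ds_numel") else p.numel()
        for p in model.parameters()
        if p.requires_grad or not trainable_only
    )
```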

I'm not planning on leaving it this way and am asking DeepSpeed to provide an efficient method for returning the sizes, so that I don't have to use a non-public API.

There are many other hidden issues w.r.t. this tensor substitution that impact only ZeRO stage 3 (https://github.com/microsoft/DeepSpeed/issues/2650) - and yesterday, while debugging the user report that led to this PR, I discovered at least one bug in our examples because of it. All examples resize the embedding under ZeRO-3, because their check of whether the vocab is larger than the embedding size always returns True: the embedding size is reported as 0 since it's not gathered :(
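Concretely, the example scripts do roughly this before training (simplified; `model` and `tokenizer` come from the example script); with an un-gathered ZeRO-3 embedding the reported size is 0, so the resize branch is always taken:

```python
# Simplified version of the check in the example scripts. Under ZeRO-3 the
# un-gathered embedding weight is a size-0 placeholder, so embedding_size
# is reported as 0 and the comparison is always True.
embedding_size = model.get_input_embeddings().weight.shape[0]
if len(tokenizer) > embedding_size:
    model.resize_token_embeddings(len(tokenizer))
```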

I'm working on ensuring that DeepSpeed addresses this issue, because it's subtle and very problematic.

Please let me know if you're OK with merging this now that you know more details. I can also easily recode it to gather the tensors first, but that would be very inefficient.

stas00 · Mar 17 '23