FSDP: GPT2LMHeadModel object has no attribute model
First of all, thank you so much for putting together these tutorials! I am slowly working through them and trying to better understand how it all fits together.
I have the DDP example working well on my 4 x L40S single-node server, but I can't seem to get the FSDP example to work on a single node (maybe that's my problem?).
# torchrun --nproc-per-node gpu train_llm.py -d tatsu-lab/alpaca -m openai-community/gpt2 --cpu-offload
[rank1]: Traceback (most recent call last):
[rank1]: File "/workspace/distributed-training-guide/04-fully-sharded-data-parallel/train_llm.py", line 389, in <module>
[rank1]: main()
[rank1]: File "/usr/local/lib/python3.12/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
[rank1]: return f(*args, **kwargs)
[rank1]: ^^^^^^^^^^^^^^^^^^
[rank1]: File "/workspace/distributed-training-guide/04-fully-sharded-data-parallel/train_llm.py", line 88, in main
[rank1]: for decoder in model.model.layers:
[rank1]: ^^^^^^^^^^^
[rank1]: File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1940, in __getattr__
[rank1]: raise AttributeError(
[rank1]: AttributeError: 'GPT2LMHeadModel' object has no attribute 'model'
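If I'm reading the traceback right, the script assumes a Llama-style layout where the decoder blocks live at model.model.layers, whereas GPT2LMHeadModel keeps them at model.transformer.h. Here is a minimal sketch of the kind of fallback I had in mind; the get_decoder_blocks helper is hypothetical and not from the guide:

```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("openai-community/gpt2")

def get_decoder_blocks(model):
    # Llama-style models expose their decoder blocks at model.model.layers,
    # while GPT-2 keeps them at model.transformer.h.
    if hasattr(model, "model") and hasattr(model.model, "layers"):
        return model.model.layers      # e.g. LlamaForCausalLM
    if hasattr(model, "transformer") and hasattr(model.transformer, "h"):
        return model.transformer.h     # e.g. GPT2LMHeadModel
    raise AttributeError("Could not locate decoder blocks on this model")

for decoder in get_decoder_blocks(model):
    ...  # wrap each block for FSDP here, as the example does for Llama
```

With something like that, the per-block FSDP wrapping would presumably work for GPT-2 too, though I haven't verified it end to end.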
I haven't gotten as far as testing multi-node yet, but that was going to be my next step.
I think I have all the correct requirements (transformers==4.57.0), but the PyTorch version is the somewhat customised 2.8.0 build that ships in the NVIDIA PyTorch container (2.8.0a0+5228986c39.nv25.06).
The model and dataset are downloaded, cached, and verified to work with the DDP example (even "offline").
Off topic: the inclusion of the llama-405b tutorial is great, but I doubt many people will be able to run it! It would be awesome if you could also include a more modest but still larger-than-GPT-2 training example, for instance for people with a few nodes of dual or quad 48 GB GPUs.
Anyway, thanks again for putting this together, much appreciated.