Loading a model trained with multi-core TPU into an SPMD model
Hello
I have models trained using the multi-core TPU option. I saved their checkpoints using:
import torch_xla.core.xla_model as xm
xm.save(checkpoint, path_checkpoints_file, master_only=True)
Now, for new models to be trained with the SPMD option, I want to instantiate these checkpoints and freeze the parameters of the loaded models during training.
When instantiating the models from the checkpoints in the forward function of the new model, should I use the multi-core loading function as below:
checkpoint = xser.load(checkpoint_file_with_path)
or should I load the model in SPMD style as below:
checkpoint = get_checkpoint_template(config, model, optimizer)
tracked_steps = chkpt_mgr.all_steps()
if len(tracked_steps):
    chkpt_mgr.restore(max(tracked_steps), checkpoint)
    print(f"Loaded checkpoint step {max(tracked_steps)} for SPMD operation")
    break
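Whichever loading path turns out to be correct, the loaded sub-model can then be frozen with standard PyTorch calls. A minimal sketch, continuing from the snippets above, with `pretrained` standing for the instantiated pre-trained sub-model and `checkpoint` for the loaded dictionary (the `model_state_dict` key is an assumption):

```python
# Freeze the loaded sub-model (generic PyTorch, independent of which loader is used).
pretrained.load_state_dict(checkpoint['model_state_dict'])  # assumed key name
pretrained.requires_grad_(False)  # keep its parameters out of the gradient computation
pretrained.eval()                 # fix dropout/batch-norm behaviour during training
```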
@jonb377, maybe you could pitch in here? Thanks!
Hey @mfatih7! The checkpoint format is different between normal checkpointing and distributed checkpointing. The two are not cross-compatible.
To use `xser.save` checkpoints with SPMD, you'll need to load the checkpoint first with `xser.load`, then apply sharding. Subsequent checkpoints can be taken with `CheckpointManager` and will be in the distributed checkpoint format, which is compatible with `chkpt_mgr.restore`.
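For example, a minimal sketch of that flow. The `model` object, the `checkpoint_file_with_path` variable, the `model_state_dict` key, and the mesh shape and partition spec are all assumptions; only `xser.load` and `xs.mark_sharding` are the APIs referred to above:

```python
import numpy as np
import torch_xla.runtime as xr
import torch_xla.core.xla_model as xm
import torch_xla.utils.serialization as xser
import torch_xla.distributed.spmd as xs

xr.use_spmd()

# Load the old-style (xser.save) checkpoint and move the weights onto the XLA device.
state_dict = xser.load(checkpoint_file_with_path)
model.load_state_dict(state_dict['model_state_dict'])  # assumed key name
model = model.to(xm.xla_device())

# Apply sharding to the parameters you want distributed, e.g. shard the first
# dimension of each 2-D weight across all devices (illustrative spec only).
num_devices = xr.global_runtime_device_count()
mesh = xs.Mesh(np.arange(num_devices), (num_devices, 1), ('fsdp', 'model'))
for p in model.parameters():
    if p.dim() == 2:
        xs.mark_sharding(p, mesh, ('fsdp', None))
```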
If you want to translate an ordinary checkpoint into a distributed checkpoint for use with `CheckpointManager`, you can do something like this in a separate script:
# This script will convert an ordinary checkpoint into a distributed checkpoint compatible with `CheckpointManager`.
import torch_xla.runtime as xr
import torch_xla.utils.serialization as xser
from torch_xla.experimental.distributed_checkpoint import CheckpointManager

# Enable SPMD mode before loading the checkpoint
xr.use_spmd()
# Load the checkpoint in the old format
state_dict = xser.load(path)
# Re-save the checkpoint using distributed checkpointing
chkpt_mgr = CheckpointManager(new_checkpoint_dir, save_interval=1)  # new_checkpoint_dir: destination directory
chkpt_mgr.save(0, state_dict)
Let me know if you hit any issues with this approach.
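For completeness, a hedged sketch of how a training script might then restore from the converted checkpoint (the directory name, `save_interval`, and the state-dict layout are placeholders; the template must match the structure that was saved):

```python
from torch_xla.experimental.distributed_checkpoint import CheckpointManager

chkpt_mgr = CheckpointManager('dist_ckpt_dir', save_interval=20)  # placeholder path/interval

# Build a state_dict template with the same structure that was saved, then restore in place.
state_dict = {'model': model.state_dict(), 'optim': optimizer.state_dict()}  # assumed layout
tracked_steps = chkpt_mgr.all_steps()
if tracked_steps:
    chkpt_mgr.restore(max(tracked_steps), state_dict)
    model.load_state_dict(state_dict['model'])
    optimizer.load_state_dict(state_dict['optim'])
```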
Hello @jonb377
For the last few days, my experiments have been running with a structure similar to the one you described. So far I have not observed any irregularities.
Hello @jonb377
I was running experiments on TPU SPMD with a model that contains a pre-trained model.
Although I do not observe any errors during training, my model does not perform as well as expected. To double-check the behavior of my model, I added another mode to my training scripts. In this new mode, I run the pre-trained model in my dataloaders, as part of input data generation on the CPU cores.
When the pre-trained model runs in the dataloaders (CPUs), my model performs as expected.
This implies that placing the pre-trained model on the TPU, as part of the experimented model, does not give the same results as running the pre-trained model in the dataloaders on the CPUs.
There must be an error either in the checkpoint libraries or in my training scripts.
My training scripts do not do anything different from what you explained above.
Hi @mfatih7, could you share some more details about the error? It sounds like you're getting different results on TPU compared to CPU. This can be expected to a degree, since TPU matmuls inherently use lower precision (you can verify by running a simple matmul on CPU and TPU; the results will differ slightly, as in the sketch after the questions below).
- How significant is the difference between CPU and TPU?
- Is the deviation present after training, or immediately upon checkpoint loading?
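A quick way to see that baseline precision difference (a toy comparison, unrelated to the checkpoint itself):

```python
import torch
import torch_xla.core.xla_model as xm

torch.manual_seed(0)
a = torch.randn(1024, 1024)
b = torch.randn(1024, 1024)

cpu_result = a @ b                                                   # float32 matmul on CPU
tpu_result = (a.to(xm.xla_device()) @ b.to(xm.xla_device())).cpu()   # TPU matmul, copied back

# The maximum element-wise difference is small but non-zero due to TPU matmul precision.
print((cpu_result - tpu_result).abs().max())
```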
Hello @jonb377
Thank you for your answer.
The difference between the TPU and CPU runs is observed immediately upon checkpoint loading.
When the pre-trained model is placed in the dataloader (CPU), the loss of the model on the TPU at the start of training is close to the loss of the pre-trained model. But when the pre-trained model is placed on the TPU, the loss at the start of training is not close to the loss of the pre-trained model.
The difference is dramatic and cannot be due to a slight implementation difference. To give a quantitative example, for a loss that should converge to 0.014, the CPU implementation starts from 0.026 but the TPU implementation starts from 0.063. The CPU implementation converges towards 0.014, but the TPU implementation does not drop below 0.020.
I do not set `os.environ['XLA_USE_BF16'] = '1'`.
Can you go through my scripts if I provide a minimized repo with both the CPU and TPU options for the pre-trained model?
@JackCaoG is such a difference expected between TPU and CPU?
@mfatih7 Do you know if this is also the case for non-SPMD TPU? If you can share a minimal repro that would be useful.
@jonb377
So far I have not loaded pre-trained models (trained with multi-core TPU) into another multi-core TPU training run. The reason I use SPMD TPU is that I want to test bigger models that cannot fit into multi-core TPU implementations.
So, in the repo, should I share a pre-trained model instantiated on SPMD TPU or one instantiated on multi-core TPU?
It would be easiest to debug if we have a standalone reproduction, like if the issue occurs with a simple `nn.Linear`. I can work on a minimal repro, but just to clarify the steps:
- Train the model using non-SPMD TPU
- Save the model using `xser.save`
- Convert the checkpoint by using the script in https://github.com/pytorch/xla/issues/6660#issuecomment-1977457694
- Compare the loss on CPU and SPMD TPU
Does that look accurate?
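To make sure I'm capturing the intent, a toy sketch of those steps with a simple `nn.Linear` might look like this. The file names, sizes, and the two-script split are assumptions; in practice the SPMD part would run as a separate process:

```python
import torch
import torch.nn as nn
import torch_xla.core.xla_model as xm
import torch_xla.utils.serialization as xser

# --- Script 1: train a toy model on non-SPMD TPU and save it with xser.save ---
device = xm.xla_device()
model = nn.Linear(16, 16).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
for _ in range(10):
    optimizer.zero_grad()
    loss = model(torch.randn(8, 16, device=device)).pow(2).mean()
    loss.backward()
    optimizer.step()
    xm.mark_step()
xser.save({'model': model.state_dict()}, 'linear_ckpt.pt', master_only=True)

# --- Script 2 (run as a separate process): convert, then compare CPU vs SPMD TPU ---
import torch_xla.runtime as xr
from torch_xla.experimental.distributed_checkpoint import CheckpointManager

xr.use_spmd()
state_dict = xser.load('linear_ckpt.pt')
CheckpointManager('dist_ckpt_dir', save_interval=1).save(0, state_dict)  # distributed format

x = torch.randn(8, 16)
cpu_model = nn.Linear(16, 16)
cpu_model.load_state_dict(state_dict['model'])
tpu_model = nn.Linear(16, 16).to(xm.xla_device())
tpu_model.load_state_dict(state_dict['model'])
print((cpu_model(x) - tpu_model(x.to(xm.xla_device())).cpu()).abs().max())
```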
I have the pre-trained model. I will add its checkpoint to the repo, together with the inputs needed to train the actual model.
In a config file, you will be able to select CPU or TPU training for the actual model. The scripts will load the checkpoints and perform training according to your selection. Then you can trace the code in these training modes. Is this OK for you?
That sounds fine, please go ahead and share the repo.
I suspect this is just related to the matmul precision difference between TPU and CPU. I will run some tests to verify this hypothesis.
Hello @jonb377
Here is the repo
You need a TPUv4 device to run the experiment.
Activate these lines and run `run_train_TPU` for TPU operation.
Activate these lines and run `run_train_TPU` for CPU operation.
Among the prints to the console, you can search for `Exp Train` and `Exp Val` to observe the printed outputs.
By examining the first prints, you can see that the loss `lGeo` is dramatically lower for the CPU operation than for the TPU operation.
Please ask me for anything else I can provide.
@jonb377
Is there anything wrong with the repo?
Hi @mfatih7, I'll have some time this afternoon to look. Will keep you posted!
@mfatih7 I was able to reproduce the numbers you reported, and increasing the TPU matmul precision didn't help.
I'm curious to understand the use case a bit more. I noticed that the two config files for the TPU and CPU experiments had very different model code; would it be possible to run with identical models?
Also, all of my tests had the output `Training starts for train_1_1_each_sample_in_single_batch_TPU_spmd` - would this not mean both configs are running on TPU SPMD?
@jonb377
Thank you for your effort
> Also, all of my tests had the output `Training starts for train_1_1_each_sample_in_single_batch_TPU_spmd` - would this not mean both configs are running on TPU SPMD?
So we have two models. The first model is loaded from the checkpoint and supplies inputs to the actual model being trained. We freeze the parameters of this model and, with these frozen parameters, run inference either on the CPU or on the TPU. If CPU is selected, the model is instantiated in each dataloader and the inference becomes part of the pre-processing. If TPU is selected, the model is instantiated on the TPU.
In both cases, the actual model (the one not loaded from a checkpoint) is trained on TPU SPMD. If CPU is selected, the actual model gets its pre-processed data from the dataloaders. If TPU is selected, the actual model gets its pre-processed data from the model on the TPU.
That is why you get the same message in both cases: the actual model is always trained on the TPU.
> I'm curious to understand the use case a bit more. I noticed that the two config files for the TPU and CPU experiments had very different model code; would it be possible to run with identical models?
I deleted some unnecessary lines in the model files and made another commit. Since model20 contains both the actual model and the checkpoint model on the TPU, while model23 contains only the actual model on the TPU, I think the model code has to differ a little.
Please don't hesitate to tell me anything that can help.
Could you trace the code and review the parts responsible for the SPMD checkpoint operations?
Thanks @mfatih7 for the context! Just to confirm my high-level understanding:
- The actual model is being trained on SPMD TPU.
- Inputs to the training are generated from another model with frozen parameters, which is either on TPU or CPU.
- In both cases, the second model is used to preprocess the data, and we find that the CPU model's preprocessing gives better results than the TPU model's preprocessing (loss of 0.01 vs 0.06).
It sounds like we should cut out the actual model training for now and verify that the data preprocessing is working correctly. Would you be able to make a repro which runs the CPU and TPU preprocessing models concurrently and verify their output is identical?
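Something along these lines might serve as that check, with a toy module standing in for the frozen preprocessing model (in the real repro the module would instead be loaded from the xser checkpoint on both sides):

```python
import torch
import torch.nn as nn
import torch_xla.core.xla_model as xm

# Toy stand-in for the frozen preprocessing model.
torch.manual_seed(0)
frozen_cpu = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 32)).eval()
frozen_tpu = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 32)).eval()
frozen_tpu.load_state_dict(frozen_cpu.state_dict())
frozen_tpu = frozen_tpu.to(xm.xla_device())

# 1) Verify the parameters really are identical after moving to the TPU.
for (name, p_cpu), p_tpu in zip(frozen_cpu.named_parameters(), frozen_tpu.parameters()):
    print(name, (p_cpu - p_tpu.cpu()).abs().max().item())

# 2) Verify the preprocessing outputs agree on the same batch.
x = torch.randn(16, 32)
with torch.no_grad():
    out_cpu = frozen_cpu(x)
    out_tpu = frozen_tpu(x.to(xm.xla_device())).cpu()
print('max abs diff:', (out_cpu - out_tpu).abs().max().item())
print('allclose(atol=1e-3):', torch.allclose(out_cpu, out_tpu, atol=1e-3))
```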
@jonb377
Yes. I think we are now at the same point.
I can make a repro, but it will take time because this kind of concurrent data flow through frozen models on both TPU and CPU does not exist in my training setup.
I understand your concern about the actual model. But would it be hard for you to just check the critical parts of the code that are responsible for creating the model on the TPU from the checkpoint?
One concern I have is that the `make_optimizer_prime` call happens on every invocation of `get_checkpoint_template`, so we're doing an extra dummy optimizer step for each checkpoint. Depending on the optimizer, this can impact training performance.
The optimizer priming step should only happen once, before any real data has been processed. I'll try disabling it and see if it improves the loss. If I'm understanding correctly, though, this would impact the model training for both the CPU and TPU cases.
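As a generic illustration of why an extra priming step per checkpoint matters (a toy example, not the repro's actual `make_optimizer_prime`): with a stateful optimizer such as Adam, even a single dummy step shifts the weights and the optimizer's moment estimates.

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 4)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)

before = model.weight.detach().clone()
# A "dummy" step on random data still updates Adam's moments and the weights,
# which is why priming should only happen once, before real training steps.
optimizer.zero_grad()
model(torch.randn(1, 4)).sum().backward()
optimizer.step()
print((model.weight.detach() - before).abs().max())  # non-zero
```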
@mfatih7 I've done a couple of tests related to the dataloading:
- Inputs agree between CPU and TPU before passing through the frozen model.
- After passing through the frozen model, they differ slightly, even with higher precision matmuls.
The model parameters look largely the same, but I didn't do a direct comparison. It still seems like a precision issue; perhaps something deeper in the compiler is causing this.
Was the frozen model originally trained on TPU?
@jonb377
Thank you very much.
> Was the frozen model originally trained on TPU?
Yes. The frozen model was trained on TPUv3 with the multi-core option.
The checkpoints on TPUv3 were saved using
import torch_xla.core.xla_model as xm
xm.save(checkpoint, path_checkpoints_file, master_only=True)
And in both the TPU and CPU cases, the checkpoints are loaded using
checkpoint = xser.load(checkpoint_file_with_path)
in this line.
@jonb377
Is there any progress on this issue?