
Question about loading checkpoint

Open finger92 opened this issue 3 years ago • 6 comments

Before hivemind 1.0.0, I was able to resume training by calling 'load_state_dict' only in the monitor peer. The code looks like this:

# monitor peer
if load_from_pretrained:
  self.model.load_state_dict(torch.load("pytorch_model.bin", map_location="cpu"), strict=False)
  ...
  self.collaborative_optimizer.load_state_dict(torch.load("optimizer.pt", map_location="cpu"))

The training peers would then load the monitor's state after startup.

However, in version 1.0.0 and on master, calling 'load_state_dict' in the monitor no longer seems to work. Am I using the wrong method, or should I load the checkpoint on the worker peer instead?

finger92 avatar Jan 14 '22 03:01 finger92

Hi! Thanks for the report. I'd appreciate it if you could run a few tests to isolate the problem.

Note: there are quite a few of them. If you find a discrepancy, you don't need to run the rest of the tests. Also, feel free to reach out if you need any assistance running these checks.

Q0: If possible, please specify the hivemind versions you used: both the old one and the new one.

Q1: In the new version, when you launch one monitor and one training peer, does the training peer print Downloading parameter from <long PeerID string>? (and not "Failed to load state" / "Cowardly refusing to load state")

Q2: Are you loading a checkpoint saved with an earlier hivemind version or with master? If the new hivemind fails to load an old checkpoint, does it load a checkpoint that was saved with the new version?

Q3: Please print the model and optimizer checksums on the monitor right after it loads parameters from file:

print("Local epoch:", self.collaborative_optimizer.local_epoch)
print("Params checksum:", sum(p.sum().item() for p in self.model.parameters()))
print("Optimizer checksum:", sum(v.data.numpy().sum() for k,v in self.collaborative_optimizer.state_dict().items()))

Then print the same values in a training peer right after it loads state from a peer for the first time.

  • Does local_epoch match the epoch that was used when you last saved state?
  • Do these values match between the monitor and the trainer? If they do not, did they match in the earlier version?

Q4: Can you please check whether all keys match successfully? (Print the output of whatever.load_state_dict(...): it contains a report of which keys matched and which did not.)
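
For the model part, a minimal sketch of how to inspect that report (assuming a plain torch.nn.Module and the same checkpoint file as in your snippet; variable names are illustrative):

# load_state_dict(strict=False) returns a named tuple listing the keys that were
# missing from the checkpoint and the keys the model did not expect
state = torch.load("pytorch_model.bin", map_location="cpu")
result = self.model.load_state_dict(state, strict=False)
print("Missing keys:", result.missing_keys)
print("Unexpected keys:", result.unexpected_keys)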

justheuristic avatar Jan 14 '22 04:01 justheuristic

Q0: 0.10.0 vs 1.0.0

Q1: I found a warning:

[WARN] [hivemind.optim.state_averager.load_state_from_peers:669] Failed to load state from peer, received inconsistent number of optimizer statistics

Could this be the reason the checkpoint fails to load?

finger92 avatar Jan 14 '22 09:01 finger92

Could this be the reason the checkpoint fails to load?

I think so. Are you sure that the model and the optimizer are defined in the same way in the monitor and the trainer?

If possible, can you please send the code and the CLI args you're running it with, so we can reproduce the issue locally?

borzunov avatar Jan 14 '22 12:01 borzunov

@finger92 JFYI: the warning is thrown by this line: state_averager.py:669.

This warning is triggered by a StopIteration, which means that the loaded state contained fewer tensors than you expected:

  • either the peer that sent you the state has a different model and/or optimizer configuration (e.g. number of layers, Adam vs Lamb, or different options),
  • or there was a connection error, in which case you will see that connection error above the warning (e.g. TimeoutError, BrokenPipeError).

It would be great if you could check directly that the states have the same shape. On both the aux and GPU peers, run:

metadata, tensors, infos = self.collaborative_optimizer.state_averager.get_current_state()
print("Number of tensors in state:", len(tensors))

If they match, please also print(metadata["optimizer_metadata"]) on both peers and check whether it has the same type and number of elements.

If either of the two mismatches between the trainer and the aux peer, then the two peers created the model/optimizer differently, and we should look for the problem in the client code (as in "not in hivemind core"). If they match, then the state somehow got corrupted in transit, and we'll help you investigate that.
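
To make that comparison concrete, here is a small sketch building on the get_current_state() call above; run it on both peers and diff the output (the exact metadata structure may vary by version):

metadata, tensors, infos = self.collaborative_optimizer.state_averager.get_current_state()
# shape and dtype of every tensor in the averaged state
for i, t in enumerate(tensors):
    print(i, tuple(t.shape), t.dtype)
# rough summary of the optimizer metadata
opt_meta = metadata["optimizer_metadata"]
print("optimizer_metadata type:", type(opt_meta))
try:
    print("optimizer_metadata length:", len(opt_meta))
except TypeError:
    print("optimizer_metadata has no length")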

justheuristic avatar Jan 15 '22 08:01 justheuristic

I solved this! In example/albert, the trainer peer used a scheduler while the monitor peer did not, which results in differences between the optimizer state_dicts of the two peers (the scheduler adds an 'initial_lr' entry to the optimizer's param groups). After adding a non-functional scheduler in the monitor peer, it works fine.
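
A minimal sketch of the idea, with a stand-in model and optimizer (any scheduler that leaves the learning rate unchanged should do):

# attaching an LR scheduler adds an 'initial_lr' entry to each param group, matching
# the structure the trainer's optimizer has after it creates its own scheduler;
# a LambdaLR with a constant factor of 1.0 keeps the learning rate itself unchanged
model = torch.nn.Linear(4, 4)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
noop_scheduler = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda=lambda step: 1.0)
print("initial_lr" in opt.param_groups[0])  # True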

By the way, I also changed the "prefix" of the state_averager in the monitor peer's code so that the trainers could download state from the monitor:

self.state_averager = TrainingStateAverager(
    dht=dht,
    optimizer=opt,
    prefix=f"{experiment_prefix}_state_averager",
    state_compression=hivemind.Float16Compression(),
    bandwidth=optimizer_args.bandwidth,
    client_mode=optimizer_args.client_mode,
    start=True,
    **asdict(averager_args),
)
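
(The prefix is just a DHT key that both sides have to agree on: the trainers look for the shared state under the state averager's prefix, so the monitor has to publish its state under the same "{experiment_prefix}_state_averager" string the trainers use, which is presumably why changing it made the download work.)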

finger92 avatar Jan 19 '22 03:01 finger92

Hi! Awesome work! Feel free to ping us if you encounter any more oddities :)

We'll incorporate your fixes into the example in the coming days (within a week or two at most) and write back to you with an update.

justheuristic avatar Jan 19 '22 20:01 justheuristic