Stas Bekman
So it looks like `clone()` was there but was removed to optimize speed. Also, the problem happens only with torch==1.10; it probably has to do with the new backward hook function...
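To illustrate the trade-off (a hypothetical sketch, not the code that was actually changed): a hook that stashes tensors without `clone()` saves a copy, but the stashed tensor may alias autograd buffers that get mutated later.

```python
import torch

grads = []

def grad_hook(module, grad_input, grad_output):
    # clone() costs an extra copy but guarantees the recorded tensor can't
    # be corrupted by later in-place ops on the same buffer; dropping it
    # was the speed optimization mentioned above
    grads.append(grad_output[0].detach().clone())

layer = torch.nn.Linear(4, 4)
layer.register_full_backward_hook(grad_hook)  # the new-style hook (torch>=1.8)
loss = layer(torch.randn(2, 4)).sum()
loss.backward()
```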
> @stas00 I'm not sure if there's a way to exactly replicate the formatting you used for the underflow/overflow debugger. Here is what I have so far.
>
> ```
> ...
> ```
- With regards to controlling when to log, we need to think about how we want to use it. Let's start with just logging as is.
- Also remember that we...
I don't think this is the case. Remember, this module will not have access to all the data from all processes, so it'll only see a sliver of the model that...
The original debug tool was used in just a simple 1-GPU setup, so it required no special output handling. Here we are dealing with hundreds of GPUs, so a simple...
@jaketae, if that's OK with you, I will start experimenting inside this PR with how we should log things. I am going to backport `DebugUnderflowOverflow` here from `transformers` and experiment...
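For reference, the upstream tool is tiny to enable; per the `transformers` docs it is used roughly like this (`model` stands for the already-built model):

```python
from transformers.debug_utils import DebugUnderflowOverflow

# registers forward hooks on every submodule and prints a per-frame report
# of min/max abs values when an inf/nan is detected
debug_overflow = DebugUnderflowOverflow(model)
```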
This is very experimental, but it dumps all forward data into TensorBoard as per-rank data, which can then be studied. Testing with just `max(tensor)` for now. Here is a...
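In spirit, the experiment looks something like this minimal sketch (the names, paths, and the single `max()` stat are placeholders; `step` would be the training iteration):

```python
import torch
import torch.distributed as dist
from torch.utils.tensorboard import SummaryWriter

rank = dist.get_rank() if dist.is_initialized() else 0
writer = SummaryWriter(log_dir=f"debug-logs/tbs/rank-{rank}")  # one dir per rank
step = 0  # in real code this would track the training iteration

def make_hook(name):
    def hook(module, inputs, output):
        if isinstance(output, torch.Tensor):
            writer.add_scalar(f"{name}/max", output.detach().max().item(), step)
    return hook

model = torch.nn.Sequential(torch.nn.Linear(8, 8), torch.nn.ReLU())
for name, module in model.named_modules():
    module.register_forward_hook(make_hook(name))

model(torch.randn(2, 8))  # each forward now logs max(output) per module
```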
So we can't use `register_full_backward_hook`, as it leads to a huge leak in DeepSpeed (reported here: https://github.com/microsoft/DeepSpeed/issues/1572). We will use the deprecated `register_backward_hook` for now.
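i.e. the workaround amounts to a one-line swap (sketch; `grad_stats_hook` is a placeholder for the same kind of stats collection as the forward hooks above):

```python
import torch

def grad_stats_hook(module, grad_input, grad_output):
    pass  # record per-tensor gradient stats here, e.g. grad_output[0].max()

layer = torch.nn.Linear(8, 8)
# layer.register_full_backward_hook(grad_stats_hook)  # leaks under deepspeed,
#   see https://github.com/microsoft/DeepSpeed/issues/1572
layer.register_backward_hook(grad_stats_hook)  # deprecated, but leak-free
```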
If you want to play with the output, add `--tensorboard-debug-dir debug-logs/tbs` (any path) to the options; once it has run for a dozen or so iterations and exited, you can view...
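(To actually browse the per-rank logs, the standard TensorBoard CLI should work: `tensorboard --logdir debug-logs/tbs`.)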
Found possibly another interesting datum to log: Adafactor: Adaptive Learning Rates with Sublinear Memory Cost, https://arxiv.org/abs/1804.04235. In section "5. A Problem with Adam: Out-of-Date Second Moment Estimator" it talks...
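For reference, the quantity that section is about is Adam's second-moment EMA (bias correction omitted); with beta2 close to 1 it lags the current gradient magnitude, so the normalized update can transiently blow up, which would make it a natural extra stat to dump into the same per-rank logs:

```math
v_t = \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2, \qquad
u_t = \frac{g_t}{\sqrt{v_t} + \epsilon}
```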