Alex J. Champandard

Results: 60 comments by Alex J. Champandard

All of the issues that are still open have not been done yet! Contributions are still welcome.

I managed to avoid Steam updates until today, but the latest update broke this repository with the same issue. The device is basically unusable on Linux...

As well as the `seq_len: 256,` changes to the JSON config, here is the `run_train.sh` script I'm using:

```
#!/bin/bash
export OMP_NUM_THREADS=1
torchrun --nproc-per-node 2 -m open_lm.main --model open_lm_11m \
...
```

I'm pretty sure 3090s do support bf16, but I'll test regular float16! Update, with `--precision fp16` (no AMP):
- **160m model** reaches `batch_size: 38` without OOM; 40 fails.
- **11m...
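For context on why changing the training dtype frees less memory than one might expect, here is a rough back-of-envelope sketch. The assumptions are mine, not open_lm's actual allocator behavior: an Adam-style optimizer keeping two fp32 moment buffers per parameter, with parameters and gradients stored in the training dtype. Under those assumptions, halving the parameter dtype halves the params and grads but leaves the optimizer state untouched.

```python
def model_state_bytes(n_params: int, param_bytes: int) -> int:
    """Parameters + gradients in the training dtype, plus two fp32
    Adam moment buffers (exp_avg, exp_avg_sq) per parameter."""
    params = n_params * param_bytes
    grads = n_params * param_bytes
    adam_moments = 2 * n_params * 4  # both moments kept in fp32
    return params + grads + adam_moments

GiB = 1024 ** 3
for name, n in [("11m", 11_000_000), ("160m", 160_000_000)]:
    fp32 = model_state_bytes(n, 4)
    fp16 = model_state_bytes(n, 2)
    print(f"{name}: fp32 states ~ {fp32 / GiB:.2f} GiB, "
          f"fp16 states ~ {fp16 / GiB:.2f} GiB")
```

By this estimate the model states themselves are small; most of the peak usage reported below has to come from activations and framework overhead, not weights.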

Using `amp_bf16` instead of `amp_bfloat16` results in the same memory usage: 160m reaches `bs=52`, 11m reaches `bs=56`. So far I don't believe it's related to hardware support or data type, and I...

OK, so right before entering `train_one_epoch()` the memory usage of GPU 0 is proportional to the number of workers + 1. With only one worker and one GPU, this is...

UPDATE: Results above look like the cost of doing business with `DistributedDataParallel`, nothing too out of the ordinary? The memory usage at peak in `train_one_epoch` is interesting, just before the...
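A toy linear model of the observation above, under my own guess at the mechanism: each DataLoader worker process plus the main process pays a fixed per-process cost on GPU 0 (e.g. a CUDA context), on top of the model replica that `DistributedDataParallel` holds. All numbers are illustrative, not measured values.

```python
def gpu0_gib(model_replica: float, per_process_overhead: float,
             num_workers: int) -> float:
    """Hypothetical linear model: GPU-0 usage grows with
    (num_workers + 1) processes, each paying a fixed overhead,
    on top of one DDP model replica."""
    return model_replica + (num_workers + 1) * per_process_overhead

# Illustrative numbers only (GiB): a 0.5 GiB replica and a
# 0.6 GiB per-process CUDA context.
print(gpu0_gib(model_replica=0.5, per_process_overhead=0.6, num_workers=1))
```

With one worker and one GPU that gives two processes touching GPU 0, which matches the "workers + 1" proportionality described above.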

@anas-awadalla Yes, it helps to know I'm not chasing a white rabbit! 🐇🕳 For `float32`, peak allocated is `10.3G` for the 11m model before `backward()`, compared to `9.5G` with `bfloat16`. - The relative memory...
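One way to read those two peak numbers, under an assumption of mine (not confirmed in the thread): if autocast only halves activation memory while parameters, gradients and optimizer state stay in fp32, then the saving between the two runs equals half the fp32 activation pool.

```python
fp32_peak_gib = 10.3   # reported peak before backward(), float32
bf16_peak_gib = 9.5    # same measurement point, bfloat16 autocast

# If only activations are halved by autocast, the observed saving is
# half the fp32 activation pool, so the implied pool is twice it.
saving = fp32_peak_gib - bf16_peak_gib
implied_fp32_activations = 2 * saving
print(f"implied fp32 activation pool ~ {implied_fp32_activations:.1f} GiB")
```

That would imply only a small slice of the 10.3G peak is castable activations, consistent with the relatively small gap between the two precisions.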

Some progress: isolating the problem to a single GPU and a single worker helps, and the problems are present there too. 1) Possible bug? Models with a single GPU/worker don't seem to be correctly using...

@mitchellnw OK, thanks. The good news is that it's easy to isolate and reproduce! Having removed the autocast and fixed the raw fp16 problem, I made a chart of the...