botbw comments

Results: 39 comments of botbw

[gemini] async grad chunk reduce (all-reduce&reduce-scatter)

# trace previous now # benchmark before colossalai run --nproc_per_node 8 --hostfile hosts.txt benchmark.py -g -x -b 4 -s 100 ``` num_samples: 392, dp_world_size: 8, flop_megatron: 9.7808637396779e+16, flop: 86555325938794496, avg_duration:...

[Gemini] Prefetch next chunk before each op

# LLaMa trace (static) ## Prefetch = 0 ## Prefetch = 10 # Benchmark

Training stage2 interruption

Hey @Little-devil1, this should have been resolved. You are welcome to pull the main branch and try again.

Training stage2 interruption

> Hello, after pulling the main branch and running a training test with the above configuration on 8XH100(80G), the above situation still appears. Is it an 8XH100(80G) memory problem? > @Little-devil1...

Training stage2 interruption

> > > Hello, after pulling the main branch and running a training test with the above configuration on 8XH100(80G), the above situation still appears. Is it an 8XH100(80G) memory problem?...

fix a bug that caused to pop all `v2v` conditions during training

@hadipash Could you please specify any exact bugs this usage causes? Popping a config might be useful if we don't want it to be consumed twice (but I'm not sure...

fix a bug that caused to pop all `v2v` conditions during training

> @botbw It doesn't cause bugs with the current configs (as they use `i2v` configuration only), but if one were to add `v2v` configurations, a single short video would cause...

fix a bug that caused to pop all `v2v` conditions during training

@zhengzangw A tiny change for the robustness of open-sourced code.

Training stucked after Epoch 0

@xilanhua12138 At this point, I would suggest: 1. Set `pin_memory_cache_pre_alloc_numels = None` in `cfg` or `train.py`. 2. Set `pin_memory = False` when initializing the dataloader in `train.py`. Note: you might still...

[DTensor] use P2P for complicated transformation when redistributing tensor

@tianyu-l Thanks for sharing the information! Regarding points 2 and 3, I'm not sure why `ProcessGroup` initialization affects P2P communication. Could you please explain further? I thought that all P2P...