Maozhou Ge

7 issue results for Maozhou Ge

Alpa's two-level hierarchical parallelism space is based on the observation that inter-node bandwidth (e.g., InfiniBand) is much lower than intra-node bandwidth (e.g., NVLink). But the latest NVLink Switch systems support...
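As a toy illustration of that observation (all names here are hypothetical, not Alpa's actual API): devices form a (num_nodes, gpus_per_node) grid, and Alpa maps inter-operator (pipeline) parallelism onto the slow inter-node axis while keeping intra-operator (tensor) parallelism on the fast intra-node axis.

```python
# Hypothetical sketch of a two-level device mesh: rows are nodes
# (linked by slower inter-node fabric such as InfiniBand), columns are
# GPUs within a node (linked by fast NVLink).
num_nodes, gpus_per_node = 2, 4
mesh = [[f"node{n}:gpu{g}" for g in range(gpus_per_node)]
        for n in range(num_nodes)]

# Alpa-style mapping: pipeline stages across the slow inter-node axis,
# tensor-parallel groups along the fast intra-node axis.
pipeline_axis = mesh       # one pipeline stage per node
intra_op_group = mesh[0]   # tensor-parallel group inside node 0
```

An NVLink Switch system blurs this distinction, since inter-node links can then approach intra-node bandwidth, which is presumably the point the issue raises.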

We can get a ~4x speedup on A100 80GB for these shapes: ``` out_grad: torch.Size([10, 1, 192, 256, 128]), torch.float32 depth_grad: torch.Size([10, 7, 120, 64, 120]), torch.float32 feat_grad: torch.Size([10, 7, 64,...

Related to **BERT/TensorFlow2** Regarding the hyperparameters used for BERT Large pretraining: the cmd in the doc is not aligned with the config in `scripts/configs/pretrain_config.sh`. - doc: https://github.com/NVIDIA/DeepLearningExamples/blob/master/TensorFlow2/LanguageModeling/BERT/README.md#pre-training The following sample code...

bug

The [docs](https://github.com/rapidsai/rmm?tab=readme-ov-file#script-to-build-rmm-from-source) say: build, install, and test the `rmm` Python package, in the `python` folder: ```bash $ python -m pip install -e ./python ``` But I got an error: ```...

bug
doc

On a single device, we can initialize RMM with
```python
import rmm
from rmm.allocators.torch import rmm_torch_allocator
import torch

rmm.reinitialize(pool_allocator=True)
torch.cuda.memory.change_current_allocator(rmm_torch_allocator)
```
What about distributed training with DDP on 32 cards? Is...
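Not an authoritative answer, but a plausible per-process sketch: DDP runs one process per GPU, so each rank can initialize RMM for its own device before creating the process group. The helper names (`local_rank_from_env`, `setup_rmm_for_rank`) are hypothetical; `rmm.reinitialize(devices=...)` and `torch.cuda.memory.change_current_allocator` are the same public APIs as in the single-device snippet.

```python
import os

def local_rank_from_env(env=os.environ) -> int:
    """torchrun exports LOCAL_RANK; default to 0 for single-process runs."""
    return int(env.get("LOCAL_RANK", "0"))

def setup_rmm_for_rank(local_rank: int) -> None:
    """Sketch: per-process RMM init for DDP, assuming one GPU per process.
    Must run before any CUDA allocation and before init_process_group."""
    import torch
    import rmm
    from rmm.allocators.torch import rmm_torch_allocator

    torch.cuda.set_device(local_rank)
    # Build a memory pool only on this rank's device.
    rmm.reinitialize(pool_allocator=True, devices=local_rank)
    # Route PyTorch's CUDA allocations through RMM for this process.
    torch.cuda.memory.change_current_allocator(rmm_torch_allocator)

# Usage, per process launched by torchrun (e.g. 4 nodes x 8 GPUs = 32 cards):
#   setup_rmm_for_rank(local_rank_from_env())
#   torch.distributed.init_process_group("nccl")
#   model = torch.nn.parallel.DistributedDataParallel(model.cuda())
```

Since each process only sees its own `LOCAL_RANK` device, the single-device pattern should extend rank-by-rank; whether a shared or per-device pool size is advisable at 32 cards is exactly the kind of question the issue asks.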

doc

I found that FSDP2 fails to load large (32B or 72B) model state_dicts. It works after I changed the "fsdp2" part in the cmd below to "fsdp": ```bash actor_rollout_ref.actor.strategy=fsdp \...
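For context, the workaround is a single-flag change in the launch command. A minimal sketch, assuming verl's Hydra-style overrides; the entry point shown and the elided arguments (`...`) are illustrative, not recovered from the truncated command:

```shell
# Workaround from the issue: use "fsdp" instead of "fsdp2" as the actor strategy.
# Entry point is an assumption; the remaining overrides are elided as in the original.
python -m verl.trainer.main_ppo \
    actor_rollout_ref.actor.strategy=fsdp \
    ...
```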