You can go through these examples to convert training checkpoints between the distributed version and the single-device version: https://github.com/microsoft/tutel#how-to-convert-checkpoint-files-that-adapt-to-different-distributed-world-sizes
> Thanks for your quick update on this feature! I notice you use mpiexec to launch the job and save the ckpt. If I use torch.distributed.launch to train my MoE,...
Yes, TutelDistributedOptimizer is a replacement for PyTorch DDP in that example (helloworld_ddp_tutel), making whole-model synchronization transparent. TutelDistributedOptimizer not only implements ZeRO optimization, but also leverages built-in mask...
To use TutelDistributedOptimizer, which has parameter synchronization included, you should no longer wrap the model with `DistributedDataParallel`.
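For illustration only, here is a minimal sketch of that setup. The import path and constructor arguments of `TutelDistributedOptimizer` are assumptions, as are the `build_moe_model` and `train_loader` names; the `helloworld_ddp_tutel` example in the repo is the authoritative reference.

```python
import torch
# Assumption: TutelDistributedOptimizer is exposed under tutel.net; check
# examples/helloworld_ddp_tutel.py in the tutel repo for the real import path.
from tutel import net

# Hypothetical helper returning a model that contains tutel MoE layers.
model = build_moe_model().cuda()

# Note: no DistributedDataParallel wrapper here. The optimizer is assumed to
# handle gradient synchronization itself (ZeRO-style sharding plus the
# built-in skip_allreduce masks) for both shared and expert parameters.
optimizer = net.TutelDistributedOptimizer(model.parameters(), lr=1e-3)

for x, y in train_loader:  # hypothetical data loader
    optimizer.zero_grad()
    loss = torch.nn.functional.cross_entropy(model(x), y)
    loss.backward()
    optimizer.step()  # parameter synchronization happens inside the optimizer
```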
It is a version that manually distinguishes parameter types, following `helloworld_ddp.py`.
To use Tutel MoE with the PyTorch DDP backend, you not only need to set `skip_allreduce` to true in the MoE scan function, but also recollect the parameters carrying those masks, and...
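A rough sketch of that DDP path, under assumptions: `model` and `local_rank` already exist, the MoE layers were built with a scan function that tags expert parameters with a `skip_allreduce` attribute as described above, and telling DDP to ignore those parameters by name via its private `_set_params_and_buffers_to_ignore_for_model` hook is one possible way to use the recollected masks, not necessarily the exact code in `helloworld_ddp.py`.

```python
import torch
from torch.nn.parallel import DistributedDataParallel as DDP

# Assumption: `model` already contains tutel MoE layers constructed with a
# scan function such as
#   scan_expert_func=lambda name, param: setattr(param, 'skip_allreduce', True)
expert_param_names = [
    name for name, p in model.named_parameters()
    if getattr(p, 'skip_allreduce', False)
]

# One possible way to use the recollected masks: hand the tagged names to DDP
# so it does not all-reduce expert gradients (private PyTorch API; whether
# helloworld_ddp.py does exactly this is an assumption).
DDP._set_params_and_buffers_to_ignore_for_model(model, expert_param_names)
model = DDP(model, device_ids=[local_rank])  # local_rank comes from your launcher

# DDP now synchronizes only the shared (non-expert) gradients; expert
# gradients stay local to each rank, matching expert parallelism.
```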
OK, what about Ubuntu 22.04's official kernel, which is based on 5.19.x? Are ashmem and binder integrated by default?
Awesome, thanks!
I checked my Ubuntu 22.04 kernel version, which is 5.15.0-48-generic, but I didn't find the device files `/dev/ashmem` and `/dev/binder`. Is this not an environment where Anbox can work directly?
So strange, those entries don't show up either:
```sh
root@ubuntu-pc:~# grep ashmem /proc/misc
root@ubuntu-pc:~# grep binder /proc/filesystems
root@ubuntu-pc:~# uname -a
Linux ubuntu-pc 5.15.0-48-generic #54-Ubuntu SMP Fri Aug 26 13:26:29 UTC...
```