Ma, Guokai comments

Results: 180 comments of Ma, Guokai

[RFC] add device abstraction to allow other device than CUDA be used

> @delock, is this PR still actively developed for merging?

Hi @tjruwase, we have validated this branch in our environment. The latest Intel extension for DeepSpeed works with this PR...

[RFC] add device abstraction to allow other device than CUDA be used

Hi @tjruwase, I want to know whether this PR is in the merge queue or still needs some changes. Currently the DeepSpeed engine is already integrated with the accelerator abstraction and the...

[RFC] add device abstraction to allow other device than CUDA be used

> @delock it feels like we could add a pre-commit hook to ensure that our formatter fails if someone tries to use `torch.cuda` outside the new get_accelerator api. Similar to...

[RFC] add device abstraction to allow other device than CUDA be used

https://github.com/microsoft/DeepSpeed/pull/2981 has been created for the pre-commit check. @jeffra
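
For illustration only, a check of this kind could look roughly like the sketch below. This is not the script from the PR above; the names `find_torch_cuda` and `main` are hypothetical, and the comment stripping is deliberately naive. It flags any line that references `torch.cuda` directly so that a hook can fail the commit.

```python
import re
from pathlib import Path

# Assumption: any direct reference to torch.cuda counts as a violation.
TORCH_CUDA = re.compile(r"\btorch\.cuda\b")


def find_torch_cuda(text):
    """Return (line_number, stripped_line) pairs for lines that reference torch.cuda."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        # Naive comment stripping: does not handle '#' inside string literals.
        code = line.split("#", 1)[0]
        if TORCH_CUDA.search(code):
            hits.append((lineno, line.strip()))
    return hits


def main(paths):
    """Print every violation in the given files; return 1 if any was found, else 0."""
    failed = False
    for path in paths:
        text = Path(path).read_text(encoding="utf-8")
        for lineno, line in find_torch_cuda(text):
            print(f"{path}:{lineno}: direct torch.cuda use: {line}")
            failed = True
    return 1 if failed else 0
```

A hook entry point would pass the staged file names to `main` and exit with its return value, for example `sys.exit(main(sys.argv[1:]))`. Files that legitimately wrap `torch.cuda`, such as the CUDA accelerator implementation itself, would need to be excluded in the hook configuration.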

Better core binding in torch.backends.xeon.run_cpu when launched from torchrun with --nproc-per-node

Hi @ezyang, this PR needs approval from a maintainer to be merged. Can you help review this PR? Thanks!

run the intel reinforcement on multiple nodes

I used the following commands and it still works. I don't have full details about your issue, but here are some suggestions: 1. Make sure gsutil is installed for python2 rather than python3...

run the intel reinforcement on multiple nodes

Google has reorganized the directory to accommodate the training v0.7 submission, so the directory structure is no longer compatible with the previous submission code for v0.6. The directory path for the v0.6 checkpoint is...

Create simple_local_chat.py

Hi, thanks for this example! When I keep talking with it, I get this error. Is there a way to avoid it? ``` ValueError: Requested tokens (527) exceed context...

run-example.sh fails with urllib3.exceptions.ProtocolError: Response ended prematurely

@awan-10 @lekurile Thanks for starting this thread. I hit this error when I tried to run this example on a Xeon server with CPU. I suspect this is a configuration issue....

run-example.sh fails with urllib3.exceptions.ProtocolError: Response ended prematurely

Hi @lekurile, now I can start the server from a separate command line and run the benchmark on this server with a reduced test size (max batch 128, avg prompt 128) to start with....