tutel icon indicating copy to clipboard operation
tutel copied to clipboard

Qs

Open zws98 opened this issue 1 year ago • 3 comments

I trained MOE on 8 gpus with 8 experts. When I conducted the inference in parallel, I found each process had a similar but different result. I would like to ask you what could be the cause of this?

zws98 avatar Apr 16 '24 09:04 zws98

Maybe you can consider if drop-less MOE mode can solve your issue, which is achieved by setting capacity_factor=0

ghostplant avatar Apr 16 '24 09:04 ghostplant

The results are still diverse for each process and the results are different from setting capacity_factor=1.25.

zws98 avatar Apr 16 '24 09:04 zws98

Do you have more information? I didn't get what you said.

Outputs from different GPUs:

STEP-10: loss = 21.11541, step_time = 3.628716 sec, perf = 0.08 tflops.

[Summary] Average synchronized step_time = 0.3628715753555298 sec.
STEP-10: loss = 21.11541, step_time = 3.670310 sec, perf = 0.07 tflops.

[Summary] Average synchronized step_time = 0.36703104972839357 sec.
STEP-10: loss = 21.11541, step_time = 3.689584 sec, perf = 0.07 tflops.

[Summary] Average synchronized step_time = 0.3689584493637085 sec.
STEP-10: loss = 21.11541, step_time = 3.675405 sec, perf = 0.07 tflops.

[Summary] Average synchronized step_time = 0.36754045486450193 sec.
STEP-10: loss = 21.11541, step_time = 3.681213 sec, perf = 0.07 tflops.

[Summary] Average synchronized step_time = 0.36812126636505127 sec.
STEP-10: loss = 21.11541, step_time = 3.629702 sec, perf = 0.08 tflops.

[Summary] Average synchronized step_time = 0.3629701852798462 sec.
STEP-10: loss = 21.11541, step_time = 3.700365 sec, perf = 0.07 tflops.

[Summary] Average synchronized step_time = 0.37003653049468993 sec.
STEP-10: loss = 21.11541, step_time = 3.658189 sec, perf = 0.08 tflops.

[Summary] Average synchronized step_time = 0.3658188819885254 sec.

ghostplant avatar Apr 17 '24 01:04 ghostplant