Cheng Li
There are multiple ways to address OOM: use GPUs with larger memory (e.g. 80GB); use more GPUs and apply tensor parallelism or expert parallelism (if it's a MoE model); use...
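As a minimal sketch of the tensor-parallel option, assuming a vLLM-style serving setup (the model id below is a placeholder, not from this thread):

```python
# Hypothetical sketch: shard a large model across 4 GPUs with tensor parallelism
# to avoid OOM. Assumes vLLM; the model id is a placeholder.
from vllm import LLM

llm = LLM(
    model="some-org/large-moe-model",  # placeholder model id
    tensor_parallel_size=4,            # split weights across 4 GPUs
)
outputs = llm.generate(["Hello, world"])
```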
This double init does not seem to affect memory usage. I printed the memory allocation before and after https://github.com/stanford-futuredata/megablocks/blob/main/megablocks/layers/dmoe_test.py#L41; although the MLP init and the mlp_impl init are both called, the...
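A rough sketch of the kind of check described above (measuring allocated GPU memory around an init; the layer construction here is a placeholder, not the actual megablocks test code):

```python
import torch

def report(tag: str) -> None:
    # torch.cuda.memory_allocated() returns bytes currently held by tensors on the device
    print(f"{tag}: {torch.cuda.memory_allocated() / 1e6:.1f} MB allocated")

report("before init")
layer = torch.nn.Linear(4096, 4096, device="cuda")  # placeholder for the MLP / mlp_impl init
report("after init")
```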
Hi @LucasWilkinson, I ran the w4a8 benchmark in the PR; the GEMM perf from Machete is 15-20% slower than the Marlin kernels. Is that expected? Thanks.
> Hi @cli99, I want to follow up on this PR, thanks for contributing! I haven't encountered this issue during my training; do you know under what circumstances the...
Removing the comment at `output = self.model(**batch)` gives

```
  File "/usr/lib/python3/dist-packages/torch/jit/frontend.py", line 407, in __call__
    return method(ctx, node)
           ^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/torch/jit/frontend.py", line 1245, in build_DictComp
    raise NotSupportedError(r, "Comprehension ifs are not...
```
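The error comes from TorchScript's frontend rejecting dict comprehensions that contain an `if` filter. A common workaround, sketched here on a made-up function rather than the actual model code, is to rewrite the comprehension as an explicit loop:

```python
import torch
from typing import Dict

@torch.jit.script
def keep_nonempty(batch: Dict[str, torch.Tensor]) -> Dict[str, torch.Tensor]:
    # TorchScript rejects {k: v for k, v in batch.items() if v.numel() > 0},
    # so build the dict with an explicit loop instead.
    out: Dict[str, torch.Tensor] = {}
    for k, v in batch.items():
        if v.numel() > 0:
            out[k] = v
    return out
```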
@callmekris can you also check the transformers version in both envs?
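For reference, a quick way to print it in each environment:

```python
import transformers
print(transformers.__version__)
```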