Frank Seide

38 comments by Frank Seide

Thanks for the quick response. I clarified in my issue description that this is not about specifying the target path relative to the buck root, but about being able to...

Can we fake the intrinsics? How many different intrinsics are actually used? Maybe we can just emulate those that are used.

The total number of model parameters must be divisible by the number of GPUs. It’s a limitation of NCCL that we have not yet worked around.

Not entirely easy, as we also pad parameter sizes to multiples of 256 bytes or so. The easiest would be to just add dummy parameter values equal to the largest...

So e.g. to distribute 999 elements to 4 GPUs, one would first do a 4-way exchange of 249 elements, and then another of 1 element each but only for the...
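The arithmetic in the comment above can be sketched in a few lines of Python. This is only an illustration: the function name `two_phase_split` is hypothetical, and since the comment is cut off, the choice to give the leftover elements to the first few GPUs is an assumption, not something the source confirms.

```python
def two_phase_split(total, num_devices):
    """Split `total` elements into an even exchange plus a remainder exchange."""
    base, remainder = divmod(total, num_devices)
    # Phase 1: every device takes part with `base` elements.
    phase1 = [base] * num_devices
    # Phase 2 (assumption): only the first `remainder` devices get 1 more element.
    phase2 = [1 if rank < remainder else 0 for rank in range(num_devices)]
    return phase1, phase2


phase1, phase2 = two_phase_split(999, 4)
print(phase1)  # per-device counts for the 4-way exchange
print(phase2)  # per-device counts for the leftover exchange
```

For 999 elements on 4 GPUs this gives 249 elements per GPU in the first exchange and leaves 3 elements for the second one.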

The Nvidia contact responded that we can roll our own `ncclReduceScatter()` that supports this by a combination of `ncclGroupStart()`, `ncclReduce()`, and `ncclGroupEnd()`. That would work indeed, although I'd rather have...
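To make the suggested workaround concrete, here is a pure-Python simulation of the data flow, not actual NCCL code. It assumes a sum reduction and that each rank receives one contiguous chunk, with chunk sizes allowed to differ. The helpers `chunk_bounds`, `reduce_to_root`, and `emulated_reduce_scatter` are hypothetical names; in a real implementation each `reduce_to_root` call would presumably correspond to one `ncclReduce()` with a different root, issued between `ncclGroupStart()` and `ncclGroupEnd()`.

```python
def chunk_bounds(total, num_ranks):
    """Return (begin, end) index pairs, one per rank, covering `total` elements."""
    base, remainder = divmod(total, num_ranks)
    bounds, start = [], 0
    for rank in range(num_ranks):
        # Assumption: the first `remainder` ranks take one extra element.
        size = base + (1 if rank < remainder else 0)
        bounds.append((start, start + size))
        start += size
    return bounds


def reduce_to_root(send_buffers, begin, end):
    """Stand-in for a single reduce call: sum one chunk across all ranks."""
    return [sum(buf[i] for buf in send_buffers) for i in range(begin, end)]


def emulated_reduce_scatter(send_buffers):
    """One reduce per rank, each over that rank's own (possibly uneven) chunk."""
    total = len(send_buffers[0])
    bounds = chunk_bounds(total, len(send_buffers))
    # Entry r of the result is what rank r would hold afterwards.
    return [reduce_to_root(send_buffers, b, e) for b, e in bounds]


# 3 simulated ranks, 7 elements each: not divisible by the rank count.
buffers = [[i + 10 * rank for i in range(7)] for rank in range(3)]
for rank, chunk in enumerate(emulated_reduce_scatter(buffers)):
    print(rank, chunk)
```

The point of the sketch is only that a reduce-scatter decomposes into one reduce per destination rank, which removes the requirement that every rank receive the same number of elements.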

So do I understand your question correctly: you would like to have predictable CPU RAM/core usage during *GPU* decoding? (For *CPU* decoding, I think setting `--cpu-threads` and `--workspace` should help.)

I think it does for the numeric tensors flowing through the network. It does not for C++-side state structures used by the decoder, e.g. the arrays of active beams and...

AFAIK, NVLink works for all GPUs inside a single box. Not sure if cross-box setups are possible and/or common. Can you try to set this environment variable and run it...

"the multi-node training available currently in Marian is usually slow" Just to be clear, that is usually due to a slow network. It works nicely if you have a fast...