Frank Seide comments

Results: 38 comments of Frank Seide

[feature request] option to set working directory

Thanks for the quick response. I clarified in my issue description that this is not about specifying the target path relative to the buck root, but about being able to...

Compile error: immintrin.h: No such file or directory

Can we fake the intrinsics? How many different intrinsics are actually used? Maybe we can just emulate those that are used.

all shards must have the same size -- problem with 6GPUs but not with 5

The total number of model parameters must be divisible by the number of GPUs. It’s a limitation of NCCL that we have not yet worked around.

all shards must have the same size -- problem with 6GPUs but not with 5

Not entirely easy, as we also pad parameter sizes to multiples of 256 bytes or so. The easiest would be to just add dummy parameter values equal to the largest...
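To make the divisibility constraint concrete, here is a minimal sketch of how many dummy elements would be needed so that a parameter count splits evenly across GPUs. The function name and the parameter count of 1000 are hypothetical and chosen only for illustration; this is not Marian's code, and it ignores the 256-byte padding mentioned above.

```python
def padding_for_equal_shards(num_params: int, num_gpus: int) -> int:
    """Return how many dummy elements make num_params divisible by num_gpus."""
    if num_gpus <= 0:
        raise ValueError("num_gpus must be positive")
    # (-n) % k is the distance from n up to the next multiple of k.
    return (-num_params) % num_gpus


# Hypothetical parameter count, used only to illustrate the 5-vs-6 GPU case.
num_params = 1000
for gpus in (5, 6):
    pad = padding_for_equal_shards(num_params, gpus)
    shard = (num_params + pad) // gpus
    print(f"{gpus} GPUs: pad by {pad}, shard size {shard}")
```

With this made-up count, 5 GPUs need no padding while 6 GPUs need two extra elements, which mirrors the kind of mismatch reported in the issue title.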

all shards must have the same size -- problem with 6GPUs but not with 5

So e.g. to distribute 999 elements to 4 GPUs, one would first do a 4-way exchange of 249 elements, and then another of 1 element each but only for the...
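The arithmetic of this two-step idea can be sketched as follows. The comment is cut off before it says which GPUs take part in the second exchange, so giving the leftover elements to the lowest-numbered ranks is an assumption of this sketch, and the function name is hypothetical.

```python
def two_phase_split(num_elements: int, num_gpus: int):
    """Split num_elements into an equal-size phase and a leftover phase.

    Returns (first, second): per-rank counts for the equal exchange and
    per-rank counts (0 or 1) for the leftover exchange.
    """
    base, remainder = divmod(num_elements, num_gpus)
    first = [base] * num_gpus
    # Assumption: the leftover elements go to the lowest-numbered ranks.
    second = [1 if rank < remainder else 0 for rank in range(num_gpus)]
    return first, second


first, second = two_phase_split(999, 4)
print(first)   # [249, 249, 249, 249]
print(second)  # [1, 1, 1, 0]
```

For 999 elements on 4 GPUs, the first phase covers 4 × 249 = 996 elements and the second phase covers the remaining 3.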

all shards must have the same size -- problem with 6GPUs but not with 5

The Nvidia contact responded that we can roll our own `ncclReduceScatter()` that supports this by a combination of `ncclGroupStart()`, `ncclReduce()`, and `ncclGroupEnd()`. That would work indeed, although I'd rather have...
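The idea of building a reduce-scatter out of individual reduce operations can be illustrated with a pure-Python simulation. This is only a sketch of the data movement: it makes no NCCL calls, the function name is hypothetical, and in a real implementation the per-shard reduces would presumably be issued between `ncclGroupStart()` and `ncclGroupEnd()` as described above.

```python
def reduce_scatter_via_reduces(buffers, shard_sizes):
    """Simulate a sum reduce-scatter with uneven shards, one reduce per shard.

    buffers[r] is rank r's full-length input vector.
    shard_sizes[r] is the number of elements rank r should end up owning.
    Returns a list where entry r is the summed shard received by rank r.
    """
    if len(shard_sizes) != len(buffers):
        raise ValueError("need one shard size per rank")
    total = sum(shard_sizes)
    if any(len(buf) != total for buf in buffers):
        raise ValueError("every buffer must have sum(shard_sizes) elements")

    results = []
    offset = 0
    for root, size in enumerate(shard_sizes):
        # One reduce per shard: every rank contributes its slice
        # [offset, offset + size), and only rank `root` receives the sum.
        shard = [sum(buf[offset + i] for buf in buffers) for i in range(size)]
        results.append(shard)
        offset += size
    return results


# Two simulated ranks with 5 elements each, split unevenly as 3 + 2.
buffers = [[1, 2, 3, 4, 5], [10, 20, 30, 40, 50]]
print(reduce_scatter_via_reduces(buffers, [3, 2]))  # [[11, 22, 33], [44, 55]]
```

Because each shard is reduced separately to its own root, the shard sizes no longer have to be equal, which is the property the equal-size restriction rules out.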

Using as little CPU RAM and cores as possible, when decoding

So do I understand your question correctly: you would like to have predictable CPU RAM/core usage during *GPU* decoding? (For *CPU* decoding, I think setting `--cpu-threads` and `--workspace` should help.)

Using as little CPU RAM and cores as possible, when decoding

I think it does for the numeric tensors flowing through the network. It does not for C++-side state structures used by the decoder, e.g. the arrays of active beams and...

Training speed of Transformer-Big

AFAIK, NVLink works for all GPUs inside a single box. Not sure if cross-box setups are possible and/or common. Can you try to set this environment variable and run it...

MPI-training seems not working in the current version.

"the multi-node training available currently in Marian is usually slow" Just to be clear, that is usually due to a slow network. It works nicely if you have a fast...