turboderp
You need to define how weights are to be split across the GPUs. There's a bit of trial and error to it at the moment, since you're only supplying the maximum allocation...
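For reference, something along these lines (a sketch, assuming you're running from the exllama repo root; `set_auto_map` takes per-GPU VRAM budgets in GB):

```python
from model import ExLlama, ExLlamaConfig  # assumes the repo root is on the path

config = ExLlamaConfig("/path/to/model/config.json")
config.model_path = "/path/to/model/model.safetensors"

# Allow up to ~10 GB on GPU 0 and ~24 GB on GPU 1. If loading fails or one
# device runs out of memory, adjust the numbers and retry -- that's the
# trial-and-error part.
config.set_auto_map("10,24")

model = ExLlama(config)
```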
Where is this? Invite me! Edit: Never mind I found it.
Well, there's a whole discussion about the "correct" way to tokenize inputs to language models. According to the SentencePiece authors, it's not correct to pass control symbols to the encoder,...
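To illustrate with the sentencepiece Python bindings (a minimal sketch, assuming a standard `tokenizer.model`):

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="tokenizer.model")

# Not correct, per the SentencePiece authors: the encoder treats "<s>" as
# ordinary text and splits it into pieces like "<", "s", ">".
bad_ids = sp.encode("<s>Hello")

# Correct: encode only the raw text, then prepend the control symbol's id.
good_ids = [sp.bos_id()] + sp.encode("Hello")
```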
I don't think this is the right approach. Truncation is now a regular part of how the context window is adjusted, so it would spam hundreds of lines in the...
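To sketch why: once the context fills up, truncation happens on essentially every call, so any per-event message repeats constantly. About the only non-spammy variant would be a warn-once guard (hypothetical names, not actual exllama code):

```python
import logging

logger = logging.getLogger(__name__)
_warned_truncation = False

def truncate_context(ids, max_len):
    """Drop the oldest tokens so the sequence fits the context window."""
    global _warned_truncation
    if len(ids) > max_len:
        if not _warned_truncation:
            logger.warning("Context exceeded %d tokens; truncating from the "
                           "left (further truncations will be silent).", max_len)
            _warned_truncation = True
        ids = ids[-max_len:]
    return ids
```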
I'm sorry, I really don't know anything about Docker. @nopperl did the Docker stuff, maybe they can help?
As far as I can tell, it's very hard to use efficiently in CUDA, since you need to run every quantized element through a lookup table, or, as they've done...
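A NumPy sketch of the difference (illustration only, nothing to do with the actual kernels):

```python
import numpy as np

rng = np.random.default_rng(0)
idx = rng.integers(0, 16, size=1_000_000).astype(np.uint8)  # 4-bit indices

# Codebook quantization: every element is a gather through a lookup table.
# On a GPU that means a scattered memory read per element, which is hard
# to make fast.
codebook = rng.standard_normal(16).astype(np.float32)
deq_lut = codebook[idx]

# Linear (scale/zero-point) quantization: pure arithmetic, no table lookups,
# so it maps cleanly onto fused multiply-adds in a kernel.
scale, zero = np.float32(0.1), np.float32(8.0)
deq_linear = (idx.astype(np.float32) - zero) * scale
```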
Yes, this is planned. Stay tuned.
The kernels are very specifically optimized for matrix-vector operations (batch size = 1). They also do well on matrix-matrix operations by reconstructing full-precision matrices on the fly and relying on cuBLAS...
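Roughly, the dispatch looks like this (hypothetical helper names, not the real kernel interface):

```python
import torch

def quant_forward(x, qweight, dequantize, matvec_kernel):
    # x: (batch, in_features). `matvec_kernel` stands in for the hand-written
    # fused dequant+GEMV path; `dequantize` reconstructs the fp16 weights.
    if x.shape[0] == 1:
        # Batch size 1: the specialized matrix-vector kernel.
        return matvec_kernel(x, qweight)
    # Larger batches: reconstruct the full-precision matrix on the fly and
    # let cuBLAS (via torch.matmul) handle the matrix-matrix product.
    w = dequantize(qweight)
    return torch.matmul(x, w)
```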
There's no support for concurrency, no. You'd need a separate instance for each thread, with its own generator and cache, and some mechanism for sensibly splitting the work between threads,...
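A rough sketch of that arrangement (`make_generator` is a hypothetical per-thread factory; `generate_simple` follows the generator API, but check the signature against your version):

```python
import threading
import queue

def make_generator():
    # Hypothetical factory: each thread builds its own model instance,
    # cache and generator, since none of those can be shared.
    raise NotImplementedError

def worker(jobs, results):
    gen = make_generator()              # private instance per thread
    while True:
        job_id, prompt = jobs.get()
        if job_id is None:              # sentinel: stop this worker
            break
        results.put((job_id, gen.generate_simple(prompt, max_new_tokens=128)))

jobs, results = queue.Queue(), queue.Queue()
workers = [threading.Thread(target=worker, args=(jobs, results), daemon=True)
           for _ in range(2)]
for t in workers:
    t.start()
```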
I'm not sure how they arrive at those results. Plain HF Transformers can be mighty slow, but you have to really try to make it *that* slow, I feel. As...