aikitoria
@byshiue any chance of this being added soon? 👀 Most other engines have it now; it's in vLLM, and HF Transformers is adding it as well.
If NVIDIA doesn't want to do it (why not? the inference results are much better...), maybe we can add it ourselves? It looks like the sampling layers are part of the code that is...
Is that not much slower than it would be if it were properly implemented in the CUDA sampling layers? Just saw the PR above as well. That's an interesting approach. Highly doubt...
Yes please. Command-R+ support is needed!
A simpler alternative solution might be to launch the next hf_transfer instance when the previous file is about 75% done rather than waiting for 100%. But reusing connections from a...
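The overlap idea above can be sketched in Python. This is a minimal illustration, not hf_transfer's actual API: `overlapped_downloads` and the `download(name, on_progress)` callback signature are hypothetical names invented here to show the scheduling logic of starting the next file once the previous one reports ~75% progress.

```python
import threading

def overlapped_downloads(files, download, threshold=0.75):
    """Start each file's download once the previous one reports
    `threshold` progress, instead of waiting for it to hit 100%.

    `download(name, on_progress)` is a hypothetical worker that calls
    on_progress(fraction) as the transfer advances.
    """
    threads = []
    for name in files:
        gate = threading.Event()

        def run(name=name, gate=gate):
            def on_progress(frac):
                if frac >= threshold:
                    gate.set()  # previous file is "far enough": release the next one
            download(name, on_progress)
            gate.set()  # safety net if progress never reached the threshold

        t = threading.Thread(target=run)
        t.start()
        threads.append(t)
        gate.wait()  # block until this file is ~75% done before starting the next
    for t in threads:
        t.join()
```

The point of the sketch is that the tail of one transfer (where TCP windows are already grown) overlaps with the slow-start phase of the next, so the link never sits idle between files.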
Sure, here is an example where it would make a difference: https://huggingface.co/databricks/dbrx-instruct/tree/main When downloading this on a server with a 10 Gbit/s link, restarting the connections and growing the sliding window...
@Wauplin here is another model where this would be very beneficial: https://huggingface.co/CohereForAI/c4ai-command-r-plus
> Or maybe you're targeting a server that doesn't support http2 I dunno? Hugging Face is hosting the server themselves. Wasn't even aware you can use hf_transfer for something other than...
This sometimes happens for me as well, on maybe 1% of downloads. It makes huggingface-cli freeze up, and I have to kill the process to retry.
Happened again. It froze with a file at 99% and locked up the whole session. Had to switch to another session and kill the process to retry, which immediately worked.