Stas Bekman
I also wonder whether the policy should be arch-specific or model-specific - what if someone wants to do 8-bit only for FFN or only for Embedding? If model-specific then the...
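(As a hedged illustration only - none of these names exist in `transformers`, they are placeholders - a model-specific policy could be expressed as a per-module mapping, e.g.:)

```python
# Hypothetical sketch: a per-module quantization policy letting a user opt
# into int8 for just the FFN or just the Embedding. Illustrative names only.
quantization_policy = {
    "default": "fp16",            # everything not listed stays in fp16
    "modules": {
        "mlp": "int8",            # quantize only the FFN blocks
        "embed_tokens": "fp16",   # keep embeddings in higher precision
    },
}

def dtype_for_module(name: str, policy: dict) -> str:
    """Resolve the target dtype for a given module name from the policy."""
    for pattern, dtype in policy["modules"].items():
        if pattern in name:
            return dtype
    return policy["default"]
```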
> What I would add is what kind of int8 data type is used. Did you mean to say something different here, Tim? Unless I misunderstood, int8 is already a...
Sounds good, Tim. So I trust you will come up with the different names then. We just need to think about how to make it easily extendable in the future to...
> I've [implemented](https://github.com/deniskamazur/transformers/tree/gpt-j-8bit) the «hardcoded» version of this issue. Awesome news, @deniskamazur! I won't have time at the moment to support this process very closely, but I trust there will...
Hi Denis, it has been a long time... Perhaps there has been a misunderstanding - we have been waiting for you to complete the PR, so nothing has happened...
I suppose the advantage of loading in int8 is that with fp16 you need 2x the memory upfront, but since we now have sharded checkpoints this can be overcome by sharding...
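(To make the 2x point concrete, a back-of-the-envelope sketch - GPT-J-6B's roughly 6B parameters are used purely as an example:)

```python
# Rough weight-memory arithmetic, illustrative numbers only.
n_params = 6_050_000_000  # ~GPT-J-6B

fp32_gb = n_params * 4 / 2**30   # ~22.5 GB
fp16_gb = n_params * 2 / 2**30   # ~11.3 GB - loaded first if converting to int8 afterwards
int8_gb = n_params * 1 / 2**30   # ~5.6 GB  - the final footprint

print(f"fp32: {fp32_gb:.1f} GB, fp16: {fp16_gb:.1f} GB, int8: {int8_gb:.1f} GB")
```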
Thank you for the detailed breakdown, @lhoestq > I'm curious, what would you expect to happen in this situation? 1. The simplest solution is to add a flag to...
Yes, so that you always have the cached entry for any dataset, but the "payload" doesn't have to be physically in the cache if it's already on the local filesystem...
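(A minimal sketch of the idea - purely hypothetical names and layout, not the actual `datasets` cache format - where the entry's metadata always lives in the cache but the data files are only referenced in place:)

```python
import json, os

# Hypothetical cache entry: metadata is written into the cache directory,
# while the payload stays wherever it already is on the local filesystem.
cache_entry = {
    "dataset_name": "my_local_dataset",
    "fingerprint": "abc123",
    "data_files": ["/data/corpus/train.jsonl"],  # payload referenced, not copied
    "in_cache": False,
}

cache_dir = os.path.expanduser("~/.cache/example")
os.makedirs(cache_dir, exist_ok=True)
with open(os.path.join(cache_dir, "entry.json"), "w") as f:
    json.dump(cache_entry, f, indent=2)
```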
Your outline spec is very sound and clear, @lhoestq - thank you! @thomasw21, indeed that would be a wonderful extra feature. In Megatron-Deepspeed we manually drained the dataloader for the...
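(For context, the manual draining mentioned above boils down to fast-forwarding the data iterator past the batches already consumed before the interruption; a simplified sketch, not the actual Megatron-Deepspeed code:)

```python
from torch.utils.data import DataLoader, Dataset

class ToyDataset(Dataset):
    def __len__(self):
        return 1000
    def __getitem__(self, idx):
        return idx

# On resume, skip the batches that were already seen so training continues
# from the next unseen sample.
consumed_batches = 42  # restored from the training checkpoint

loader = DataLoader(ToyDataset(), batch_size=8, shuffle=False)
it = iter(loader)
for _ in range(consumed_batches):
    next(it)  # discard already-consumed batches

next_batch = next(it)  # first batch of the resumed run
print(next_batch)
```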
I totally agree, @iliaschalkidis! In general, pretty much any DeepSpeed-specific question should go to https://github.com/microsoft/DeepSpeed - please feel free to tag me if it's related to `transformers` though, since most...