Remove unnecessary memset for reshape, concat and contiguous
This is a reopened version of PR #1680; it improves performance by around 5%-10% for every LLM model in candle-examples. The change was also adopted by @EricLBuehler in his Mistralrs project and proved effective there. We therefore hope it can be merged into the candle main branch.
--- previous comments ---

Some tensor operations (e.g., reshape, concat, and contiguous) perform redundant memset operations: alloc_zeros is called to zero-fill the destination buffer even though the fill is unnecessary. This patch introduces an alloc operation that takes an optional fill value. By default, when there is no value to fill, alloc calls device::alloc directly. This avoids the unnecessary memset for operations such as reshape, concat, and contiguous, thereby speeding up inference.
To streamline the implementation, it would be more intuitive to add an alloc_values function to the external crate "cudarc" that accepts an optional init_value parameter. If init_value is None, the function would allocate an uninitialized buffer; otherwise, it would allocate a buffer and memset it with init_value (see the changes in cuda_backend.rs -> alloc_impl). This would eliminate the need for separate implementations such as ones_impl, zeros_impl, and so on.
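For illustration, here is a minimal sketch of what such a unified entry point could look like, assuming cudarc exposes `CudaDevice::alloc_zeros` and an unsafe `CudaDevice::alloc` roughly as it does today; the `alloc_impl` name, signature, and the restriction to a zero fill are simplifications for this sketch, not the exact code in this PR:

```rust
use std::sync::Arc;

use cudarc::driver::{CudaDevice, CudaSlice, DeviceRepr, DriverError, ValidAsZeroBits};

/// Hypothetical unified allocation entry point for the CUDA backend.
/// `init = Some(..)` keeps today's behavior (allocate + fill);
/// `init = None` skips the memset because the caller promises to overwrite
/// every element of the buffer before reading it.
fn alloc_impl<T: DeviceRepr + ValidAsZeroBits>(
    dev: &Arc<CudaDevice>,
    len: usize,
    init: Option<T>,
) -> Result<CudaSlice<T>, DriverError> {
    match init {
        // For simplicity this sketch only covers the zero case; a general
        // non-zero init_value would need a fill kernel or a driver memset.
        Some(_zero) => dev.alloc_zeros::<T>(len),
        // Uninitialized path: no memset. The buffer must be fully written
        // (as reshape/concat/contiguous do) before any element is read.
        None => unsafe { dev.alloc::<T>(len) },
    }
}
```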
Could you provide more details on these 5%-10% speedups? I would rather avoid having undefined values if possible, and I would have thought these initializations represented a very small amount of time, but it looks like I was mistaken, so I would like to see which of them are actually taking that much time.
The contiguous, reshape and cat operations in current candle require creating an output buffer that is zero-filled (using memset). That memset takes a noticeable amount of time (especially for large buffers, e.g., long-sequence or batched outputs), and more importantly, the output buffer is immediately overwritten with other values, so the zero fill is unnecessary. There is no method in candle to allocate an uninitialized buffer for tensor use, so I created an alloc_impl in the backend with an optional fill value. It can serve as a unified entry point for buffer allocation, removing the need for standalone zeros_impl, ones_impl, etc.
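To make the redundancy concrete, here is a small CPU-flavored sketch (a hypothetical helper, not candle's actual kernel): a contiguous-style copy writes every element of the destination, so the upfront zero fill is pure overhead.

```rust
/// Copy a row-major view with a row stride into a fresh contiguous buffer.
/// Hypothetical helper for illustration only.
fn to_contiguous(src: &[f32], rows: usize, cols: usize, row_stride: usize) -> Vec<f32> {
    // Today: the destination is zero-filled up front...
    let mut dst = vec![0f32; rows * cols];
    // ...but every element of `dst` is overwritten right away, so the zero
    // fill (a memset over rows * cols floats) is wasted work.
    for r in 0..rows {
        let s = &src[r * row_stride..r * row_stride + cols];
        dst[r * cols..(r + 1) * cols].copy_from_slice(s);
    }
    dst
}

fn main() {
    // A 2x3 view taken from a 2x4 parent (row stride 4).
    let parent = [1., 2., 3., 0., 4., 5., 6., 0.];
    assert_eq!(to_contiguous(&parent, 2, 3, 4), vec![1., 2., 3., 4., 5., 6.]);
}
```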
Ok, I agree that this could be problematic in some use cases, especially as I've spent the last couple of days optimizing a few places in the cuda backend and these initializations now appear more clearly in my profile. As this PR now has a bunch of conflicts, plus some unrelated changes around tokenizers (?), I've extracted the relevant bits in #1901 with a couple of tweaks (marking it as unsafe, adding a couple of comments, also a speedup on the cpu side, etc.). I have to think more about it, as it's certainly a bit of a scary change, but the speedups (~4% in my case) certainly sound interesting to have. Thanks for pinpointing this.
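For readers following along, one common way to surface such an entry point while keeping the contract explicit, and roughly what "marking it as unsafe" suggests here (a sketch of the general idea, not necessarily what #1901 actually does), is to put the safety requirement in the API itself:

```rust
/// Hypothetical trait sketch: the uninitialized allocation is opt-in and
/// unsafe, with the contract spelled out at the call site.
trait BackendAlloc {
    type Buffer;

    /// Allocate `len` elements without initializing them.
    ///
    /// # Safety
    /// The returned buffer must be fully written before any element is read;
    /// reading uninitialized memory is undefined behavior.
    unsafe fn alloc_uninit(&self, len: usize) -> Self::Buffer;
}
```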