Why is TP memory usage under the Hybrid Parallel Plugin higher than DeepSpeed under the same configuration?
Hello, may I ask why TP memory usage under the Hybrid Parallel Plugin is higher than DeepSpeed under the same configuration?
It seems to be caused by GPU memory fragmentation, and it is quite severe. Are there any corresponding optimization or mitigation measures?
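Since the follow-up above asks about fragmentation mitigations: one allocator setting that is commonly tried for PyTorch CUDA fragmentation is sketched below. This is a general PyTorch option, not a fix confirmed for this specific issue, and whether it helps depends on the workload.

```shell
# Let the CUDA caching allocator grow segments instead of splitting
# fixed-size blocks, which often reduces fragmentation (recent PyTorch).
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True

# Alternative knob: cap the size of blocks the allocator may split.
# export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128
```

Comparing `torch.cuda.memory_reserved()` against `torch.cuda.memory_allocated()` before and after the change is one way to judge whether fragmentation was actually the cause.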
DeepSpeed ZeRO-3 fully partitions all weights, whereas TP does not partition every layer (e.g., non-Linear/Embedding layers remain replicated on each rank). When activations are small, this can make TP use more memory. Please provide more detailed information.
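The point above can be illustrated with a back-of-the-envelope sketch. All numbers here (hidden size, world size, the rough per-block parameter breakdown) are assumptions for illustration, not measurements from the reported setup.

```python
# Hypothetical per-GPU parameter memory for one transformer block
# under ZeRO-3 vs tensor parallelism (TP). Assumed sizes for illustration.
hidden = 4096          # assumed hidden size
world_size = 8         # assumed number of GPUs
bytes_per_param = 2    # fp16/bf16

# Rough parameter breakdown of one block:
linear_params = 12 * hidden * hidden   # attention + MLP projection weights
norm_params = 4 * hidden               # LayerNorm weights/biases

# ZeRO-3 partitions *every* parameter across all ranks.
zero3_per_gpu = (linear_params + norm_params) / world_size * bytes_per_param

# TP slices only the Linear/Embedding weights; LayerNorm-style
# parameters stay fully replicated on every rank.
tp_per_gpu = (linear_params / world_size + norm_params) * bytes_per_param

print(f"ZeRO-3 per GPU: {zero3_per_gpu / 2**20:.2f} MiB")
print(f"TP     per GPU: {tp_per_gpu / 2**20:.2f} MiB")
```

The gap per block is small, but it is exactly the replicated (non-sliced) parameters times `(1 - 1/world_size)`, and it grows with the number of layers; on top of that, each replicated parameter also carries replicated gradient and optimizer state during training.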