
[BUG]: Single-machine single-GPU and single-machine eight-GPU training take the same time

Open Vvvvvvsysy opened this issue 1 year ago • 1 comments

🐛 Describe the bug

Why do the three training setups — single-machine single-GPU, single-machine four-GPU (TP--2D), and single-machine eight-GPU (TP--3D) — all show identical GPU memory usage and training time? I don't see Colossal-AI accelerating my training at all! How can I tell whether my model is actually doing parallel computation?

Environment

No response

Vvvvvvsysy avatar Mar 09 '23 10:03 Vvvvvvsysy


Hi, I'm running into the same problem. I tested with the https://github.com/hpcaitech/ColossalAI-Examples/tree/main/features/tensor_parallel project. For example, a single GPU uses 1707 MiB of memory, so with four-way tensor parallelism each GPU should use roughly 430 MiB, but in practice each GPU uses 1723 MiB. I only changed the key parameter (nproc per node) and left everything else unchanged. Have I misunderstood or misconfigured something? Please correct me, thanks.

pilipala818 avatar Mar 10 '23 01:03 pilipala818

Do you see the same thing with the official code? This problem has bothered us for a long time: under both the single-machine single-GPU and single-machine eight-GPU settings, memory usage and training time are identical, as if the parallelism were fake. If you manage to solve it, please do reply to me, thank you very much!

Vvvvvvsysy avatar Mar 10 '23 04:03 Vvvvvvsysy


I haven't solved it yet, and my experiments didn't record time consumption. When verifying the official code, the tensor splitting itself looks correct, but the memory usage doesn't match. For example, with four GPUs: under a 1D split, a 256×1024 input is sharded into 256×256 per GPU; under a 2D split, a 256×1024 input is sharded into 128×512 per GPU.

pilipala818 avatar Mar 10 '23 04:03 pilipala818
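The shard shapes quoted above can be checked with a little arithmetic. The sketch below is a hypothetical helper (not a ColossalAI API) that computes the expected per-GPU shard of a 256×1024 tensor across 4 GPUs, assuming 1D TP splits one dimension across all ranks and 2D TP splits both dimensions over a square rank mesh:

```python
def shard_shape(rows, cols, mode, world_size=4):
    """Expected per-rank shard shape under a simple tensor-parallel split."""
    if mode == "1d":
        # 1D TP: split a single dimension (columns here) across all ranks.
        return (rows, cols // world_size)
    if mode == "2d":
        # 2D TP: arrange ranks in a q x q mesh and split both dimensions.
        q = int(world_size ** 0.5)
        return (rows // q, cols // q)
    raise ValueError(f"unknown mode: {mode}")

print(shard_shape(256, 1024, "1d"))  # (256, 256)
print(shard_shape(256, 1024, "2d"))  # (128, 512)
```

Both results match the shapes reported in the comment, which suggests the splitting itself works; the memory question is separate, since per-rank memory also includes replicated buffers, activations, and allocator overhead.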


Come to my profile page and let's discuss over email!

Vvvvvvsysy avatar Mar 10 '23 06:03 Vvvvvvsysy


@Vvvvvvsysy Hi, have you solved this problem? I'm seeing the same thing: when training https://github.com/hpcaitech/ColossalAI/tree/main/examples/language/opt, GPU memory usage is almost identical with one, four, or eight GPUs.

laozhanghahaha avatar Mar 22 '23 04:03 laozhanghahaha

Hi @pilipala818 @Vvvvvvsysy, using torch.cuda.memory_allocated and torch.cuda.max_memory_allocated gives a more accurate measure of GPU memory usage. When timing, remember to call torch.cuda.synchronize as well.

kurisusnowdeng avatar Mar 22 '23 05:03 kurisusnowdeng
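Following that advice, a minimal measurement sketch might look like the following. This assumes a CUDA device; step_fn is a placeholder for your forward/backward/optimizer step:

```python
import time
import torch

def measure_step(step_fn, device="cuda"):
    """Run one training step and report wall time plus GPU memory stats."""
    torch.cuda.reset_peak_memory_stats(device)
    torch.cuda.synchronize(device)            # drain any pending kernels first
    start = time.perf_counter()
    step_fn()
    torch.cuda.synchronize(device)            # wait for async CUDA work to finish
    elapsed = time.perf_counter() - start
    allocated = torch.cuda.memory_allocated(device) / 2**20      # live tensors
    peak = torch.cuda.max_memory_allocated(device) / 2**20       # peak during the step
    print(f"step: {elapsed:.3f} s, allocated: {allocated:.1f} MiB, peak: {peak:.1f} MiB")
    return elapsed, allocated, peak
```

Without the synchronize calls, CUDA's asynchronous execution means the timer can stop before the kernels have actually run, making all configurations look equally "fast".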

When running the GPT2 Gemini examples, nvidia-smi shows the same memory usage for single-GPU and multi-GPU runs: with batch=16, a single GPU uses 15935 MiB. But the log says:

[03/27/23 17:24:22] INFO colossalai - colossalai - INFO: ./train_gpt2.py:366 train_step
INFO colossalai - colossalai - INFO: [1/20] Forward GPU memory usage: 2034.80 MB, CPU memory usage: 15982.80 MB
[03/27/23 17:24:33] INFO colossalai - colossalai - INFO: ./train_gpt2.py:377 train_step
INFO colossalai - colossalai - INFO: [1/20] Backward GPU memory usage: 486.07 MB, CPU memory usage: 15982.81 MB

Is the memory figure shown by nvidia-smi the same number as the CPU memory value in the log?

zhangyuanscall avatar Mar 27 '23 09:03 zhangyuanscall


@zhangyuanscall The nvidia-smi figure isn't accurate because it also includes PyTorch's reserved memory. That said, your result does indirectly confirm that most of the GPU memory overhead has been offloaded to RAM.

kurisusnowdeng avatar Mar 27 '23 10:03 kurisusnowdeng
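The distinction matters: nvidia-smi reports the whole CUDA context plus the pool held by PyTorch's caching allocator, while only a fraction of that is live tensor data. A small sketch for comparing the two PyTorch-side numbers (assuming a CUDA device):

```python
import torch

def memory_report(device="cuda"):
    """Compare live tensor memory with the caching allocator's pool."""
    allocated = torch.cuda.memory_allocated(device)   # bytes in live tensors
    reserved = torch.cuda.memory_reserved(device)     # bytes held by the allocator pool
    print(f"allocated: {allocated / 2**20:.1f} MiB, "
          f"reserved: {reserved / 2**20:.1f} MiB")
    return allocated, reserved
```

The nvidia-smi number is typically larger than even the reserved figure, since it also counts the CUDA context itself; comparing allocated across single-GPU and multi-GPU runs is the more meaningful check of whether parallelism is reducing per-GPU memory.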


This issue was closed due to inactivity. If you have further questions, please open another new issue and provide details. Thanks.

binmakeswell avatar Apr 27 '23 10:04 binmakeswell