Is there any training strategy that can run directly on a single machine with a single GPU? [DOC]:

Open chingfeng2021 opened this issue 1 year ago • 15 comments

📚 The doc issue

Is there any documentation that covers this?

chingfeng2021 avatar Mar 03 '23 06:03 chingfeng2021

Bot detected the issue body's language is not English, translate it automatically. 👯👭🏻🧑‍🤝‍🧑👫🧑🏿‍🤝‍🧑🏻👩🏾‍🤝‍👨🏿👬🏿


Title: Is there any training strategy that can be run directly on a single computer with a single graphics card? [DOC]:

Issues-translate-bot avatar Mar 03 '23 06:03 Issues-translate-bot

Try this one out: https://github.com/hpcaitech/ColossalAI/tree/main/examples/language/gpt/gemini

JThh avatar Mar 04 '23 03:03 JThh

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "./train_gpt_demo.py", line 353, in <module>
    main()
  File "./train_gpt_demo.py", line 267, in main
    model = zero_model_wrapper(model, zero_stage, gemini_config)
  File "/opt/conda/lib/python3.8/site-packages/colossalai/nn/parallel/zero_wrapper.py", line 43, in zero_model_wrapper
    wrapped_model = GeminiDDP(model, **gemini_config)
  File "/opt/conda/lib/python3.8/site-packages/colossalai/nn/parallel/gemini_parallel.py", line 63, in __init__
    super().__init__(module, gemini_manager, pin_memory, force_outputs_fp32, strict_ddp_mode)
  File "/opt/conda/lib/python3.8/site-packages/colossalai/nn/parallel/data_parallel.py", line 242, in __init__
    self._init_chunks(param_order=param_order,
  File "/opt/conda/lib/python3.8/site-packages/colossalai/nn/parallel/data_parallel.py", line 670, in _init_chunks
    self.chunk_manager.register_tensor(tensor=fp32_p,
  File "/opt/conda/lib/python3.8/site-packages/colossalai/gemini/chunk/manager.py", line 71, in register_tensor
    self.__close_one_chunk(chunk_group[-1])
  File "/opt/conda/lib/python3.8/site-packages/colossalai/gemini/chunk/manager.py", line 228, in __close_one_chunk
    chunk.close_chunk()
  File "/opt/conda/lib/python3.8/site-packages/colossalai/gemini/chunk/chunk.py", line 302, in close_chunk
    self.cpu_shard = torch.empty(self.shard_size, dtype=self.dtype, pin_memory=self.pin_memory)
RuntimeError: CUDA error: out of memor

chingfeng2021 avatar Mar 08 '23 08:03 chingfeng2021

What configuration is needed to get this running? @JThh

chingfeng2021 avatar Mar 08 '23 08:03 chingfeng2021

Bot detected the issue body's language is not English, translate it automatically. 👯👭🏻🧑‍🤝‍🧑👫🧑🏿‍🤝‍🧑🏻👩🏾‍🤝‍👨🏿👬🏿


What configuration does this need to run? @JThh

Issues-translate-bot avatar Mar 08 '23 08:03 Issues-translate-bot

Could you share your current configuration?

JThh avatar Mar 08 '23 10:03 JThh

Bot detected the issue body's language is not English, translate it automatically. 👯👭🏻🧑‍🤝‍🧑👫🧑🏿‍🤝‍🧑🏻👩🏾‍🤝‍👨🏿👬🏿


What is your current configuration?

Issues-translate-bot avatar Mar 08 '23 10:03 Issues-translate-bot

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02              Driver Version: 531.18       CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce GTX 1070         On | 00000000:01:00.0 Off |                  N/A |
| 45%   47C    P8                8W / 150W|      0MiB /  8192MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce RTX 3080         On | 00000000:03:00.0  On |                  N/A |
|  0%   46C    P8               21W / 340W|   1126MiB / 10240MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

chingfeng2021 avatar Mar 09 '23 09:03 chingfeng2021

@JThh This is my current configuration: two graphics cards, a 3080 and a 1070.

chingfeng2021 avatar Mar 09 '23 09:03 chingfeng2021

Bot detected the issue body's language is not English, translate it automatically. 👯👭🏻🧑‍🤝‍🧑👫🧑🏿‍🤝‍🧑🏻👩🏾‍🤝‍👨🏿👬🏿


@JThh This is my current configuration: two graphics cards, a 3080 and a 1070.

Issues-translate-bot avatar Mar 09 '23 09:03 Issues-translate-bot
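
Since the machine above has two different cards, a single-GPU run may need to be pinned to one of them. The following is a minimal sketch, not part of the ColossalAI example: it parses text in the shape of `nvidia-smi --query-gpu=index,name,memory.total,memory.used --format=csv,noheader,nounits`, using the values from the table posted above as sample input, and picks the card with the most free memory. The helper names `parse_gpus` and `pick_gpu` and the "most free memory" rule are assumptions made for illustration.

```python
import csv
import io
import os

# Sample text in the shape produced by (assumed command):
#   nvidia-smi --query-gpu=index,name,memory.total,memory.used --format=csv,noheader,nounits
# The numbers are copied from the nvidia-smi table posted in this thread.
SAMPLE_QUERY_OUTPUT = """\
0, NVIDIA GeForce GTX 1070, 8192, 0
1, NVIDIA GeForce RTX 3080, 10240, 1126
"""


def parse_gpus(text):
    """Parse CSV rows of (index, name, total MiB, used MiB) into dicts."""
    gpus = []
    for row in csv.reader(io.StringIO(text), skipinitialspace=True):
        if not row:
            continue  # skip blank lines
        index, name, total, used = (field.strip() for field in row)
        gpus.append({
            "index": int(index),
            "name": name,
            "free_mib": int(total) - int(used),
        })
    return gpus


def pick_gpu(gpus):
    """Hypothetical selection rule: the GPU with the most free memory."""
    return max(gpus, key=lambda gpu: gpu["free_mib"])


best = pick_gpu(parse_gpus(SAMPLE_QUERY_OUTPUT))

# Both variables must be set before the process initializes CUDA.
# PCI_BUS_ID ordering makes CUDA device indices match nvidia-smi's.
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = str(best["index"])
print(best["name"], best["free_mib"], "MiB free")
```

The same effect can be had from the shell by exporting `CUDA_VISIBLE_DEVICES` before launching the training script; the Python form is shown only to make the selection logic explicit.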

I set the batch size very small, but it still reports that there is not enough memory. export PLACEMENT=${PLACEMENT:-"cput"} ----> Should this parameter be set to cuda?

chingfeng2021 avatar Mar 09 '23 09:03 chingfeng2021

Bot detected the issue body's language is not English, translate it automatically. 👯👭🏻🧑‍🤝‍🧑👫🧑🏿‍🤝‍🧑🏻👩🏾‍🤝‍👨🏿👬🏿


I set the batch size very small, but it still reports that there is not enough memory. export PLACEMENT=${PLACEMENT:-"cput"} ----> Should this parameter be set to cuda?

Issues-translate-bot avatar Mar 09 '23 09:03 Issues-translate-bot
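
The shell line quoted above uses `${PLACEMENT:-...}`, which falls back to a default when the variable is unset or empty. As a rough sketch of the same behavior in Python, with a validation step that would catch a misspelled value such as `cput`: the function name `resolve_placement` and the tuple of accepted names are assumptions for illustration, and the example script's own argument parser remains the authority on which values it accepts. The thread itself does not settle which value is right for this machine.

```python
import os

# Assumption: placement names accepted by the training script.
# Check the script's argument parser for the authoritative list.
ASSUMED_PLACEMENTS = ("cpu", "cuda", "auto")


def resolve_placement(environ, default="cpu"):
    """Mimic ${PLACEMENT:-default}: use the default when unset OR empty."""
    value = environ.get("PLACEMENT") or default
    if value not in ASSUMED_PLACEMENTS:
        raise ValueError(
            f"unknown placement {value!r}; expected one of {ASSUMED_PLACEMENTS}"
        )
    return value


if __name__ == "__main__":
    print(resolve_placement(os.environ))
```

Validating early like this turns a typo into an immediate, readable error instead of a failure deep inside model wrapping.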

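I set the batch size very small, but it still reports that there is not enough memory. export PLACEMENT=${PLACEMENT:-"cput"} ----> Should this parameter be set to cuda?

How large a GPT-2 model were you running when the error occurred?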

pecanjk avatar Mar 16 '23 09:03 pecanjk

Bot detected the issue body's language is not English, translate it automatically. 👯👭🏻🧑‍🤝‍🧑👫🧑🏿‍🤝‍🧑🏻👩🏾‍🤝‍👨🏿👬🏿


I set the batch size very small, but it still reports that there is not enough memory. export PLACEMENT=${PLACEMENT:-"cput"} ----> Should this parameter be set to cuda?

How large a GPT-2 model were you running when the error occurred?

Issues-translate-bot avatar Mar 16 '23 09:03 Issues-translate-bot
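
Model size matters here because, with mixed-precision Adam, the model states alone take roughly 16 bytes per parameter (fp16 weights and gradients, plus fp32 master weights and two fp32 Adam moments), before any activations. The back-of-the-envelope sketch below compares that figure with the free memory on the RTX 3080 from the table above. The parameter counts are the approximate published sizes of the public GPT-2 checkpoints and may not match the model configurations in the ColossalAI example; `model_state_gib` and `fits` are hypothetical helper names. Offloading strategies such as Gemini exist precisely to move part of this footprint to CPU memory, so the numbers are an upper-level estimate, not a prediction of what the example will do.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2

# fp16 weights + fp16 grads + fp32 master weights + Adam momentum + Adam variance
BYTES_PER_PARAM = 2 + 2 + 4 + 4 + 4

# Approximate parameter counts of the public GPT-2 checkpoints (assumption:
# the example's own model configurations may differ).
GPT2_PARAMS = {
    "gpt2": 124_000_000,
    "gpt2-medium": 355_000_000,
    "gpt2-large": 774_000_000,
    "gpt2-xl": 1_558_000_000,
}

# Free memory on the RTX 3080 in the posted table: 10240 MiB - 1126 MiB.
FREE_MIB_3080 = 10240 - 1126


def model_state_gib(n_params):
    """Memory for weights, gradients and optimizer states, in GiB."""
    return n_params * BYTES_PER_PARAM / GIB


def fits(n_params, free_mib):
    """True if the model states alone fit; activations are NOT counted."""
    return n_params * BYTES_PER_PARAM <= free_mib * MIB


for name, n_params in GPT2_PARAMS.items():
    verdict = "fits" if fits(n_params, FREE_MIB_3080) else "does not fit"
    print(f"{name}: {model_state_gib(n_params):.2f} GiB of model states, {verdict}")
```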

Take a look at this.

JThh avatar Apr 17 '23 09:04 JThh

We have updated a lot. Please check the latest code. This issue was closed due to inactivity. Thanks.

binmakeswell avatar Apr 27 '23 08:04 binmakeswell