GLM-130B

Problems encountered with int4 quantization

Open chensiyao12 opened this issue 1 year ago • 10 comments

CPU memory: 256 GB; GPUs: 6× RTX 3090.

WARNING:torch.distributed.run:


Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.


Please install apex to use fused_layer_norm, fall back to torch.nn.LayerNorm   (printed 4×, once per rank)
Please install apex to use FusedScaleMaskSoftmax, otherwise the inference efficiency will be greatly reduced   (printed 4×)
WARNING: No training data specified   (printed 4×)
using world size: 4 and model-parallel size: 4

padded vocab (size: 150528) with 0 dummy tokens (new size: 150528)
initializing model parallel with size 4
Set tokenizer as a icetk-glm-130B tokenizer! Now you can get_tokenizer() everywhere.
global rank 3 is loading checkpoint glm130b_t4/49300/mp_rank_03_model_states.pt
global rank 2 is loading checkpoint glm130b_t4/49300/mp_rank_02_model_states.pt
global rank 0 is loading checkpoint glm130b_t4/49300/mp_rank_00_model_states.pt
global rank 1 is loading checkpoint glm130b_model/glm130b_t4/49300/mp_rank_01_model_states.pt

WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 18293 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 18294 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 18295 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 0 (pid: 18292) of binary: /usr/local/bin/python3.10
Fatal Python error: Segmentation fault

Current thread 0x00007f7a9e4d5280 (most recent call first):
  File "/usr/local/lib/python3.10/site-packages/torch/distributed/elastic/utils/store.py", line 31 in get_all
  File "/usr/local/lib/python3.10/site-packages/torch/distributed/elastic/utils/store.py", line 53 in synchronize
  File "/usr/local/lib/python3.10/site-packages/torch/distributed/elastic/utils/store.py", line 67 in barrier
  File "/usr/local/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 906 in _exit_barrier
  File "/usr/local/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 877 in _invoke_run
  File "/usr/local/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 709 in run
  File "/usr/local/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 125 in wrapper
  File "/usr/local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 236 in launch_agent
  File "/usr/local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 131 in __call__
  File "/usr/local/lib/python3.10/site-packages/torch/distributed/run.py", line 715 in run
  File "/usr/local/lib/python3.10/site-packages/torch/distributed/run.py", line 724 in main
  File "/usr/local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345 in wrapper
  File "/usr/local/bin/torchrun", line 8 in

Extension modules: backports.lzma._lzma, torch._C, torch._C._fft, torch._C._linalg, torch._C._nn, torch._C._sparse, torch._C._special, numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator (total: 20)

./scripts/generate.sh: line 38: 18226 Segmentation fault (core dumped)
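For reference: exitcode -9 means the rank-0 worker was killed with SIGKILL before it finished loading its shard, which during checkpoint loading usually points to the Linux OOM killer rather than a genuine bug in the model code (the segmentation fault reported afterwards is the torchrun launcher dying in its exit barrier). A 130B-parameter model in FP16 is roughly 242 GiB of weights, so four ranks deserializing their shards into 256 GB of host RAM at the same time is already at the limit. A quick back-of-the-envelope sketch, assuming Linux /proc/meminfo and the 4-way model-parallel layout shown in the log:

# Back-of-the-envelope check: do the FP16 shards fit in host RAM before quantization?
# Assumptions (not taken from the log): weights are stored in FP16 and all four
# model-parallel ranks deserialize their shard into CPU memory at the same time.

def meminfo_gib(field: str) -> float:
    """Return a /proc/meminfo field (reported in kB) in GiB; Linux only."""
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith(field + ":"):
                return int(line.split()[1]) / 1024 ** 2
    raise KeyError(field)

PARAMS = 130e9          # GLM-130B parameter count
FP16_BYTES = 2          # bytes per parameter before int4 quantization
RANKS = 4               # model-parallel size reported in the log

total_gib = PARAMS * FP16_BYTES / 1024 ** 3
print(f"MemTotal:      {meminfo_gib('MemTotal'):7.1f} GiB")
print(f"MemAvailable:  {meminfo_gib('MemAvailable'):7.1f} GiB")
print(f"FP16 weights:  {total_gib:7.1f} GiB total, {total_gib / RANKS:.1f} GiB per rank")
# If MemAvailable is not comfortably above the FP16 total, the kernel OOM killer firing
# during torch.load is the likely cause of the exitcode -9 above; `dmesg` on the host
# should then show an "Out of memory: Killed process ..." entry for the python worker.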

chensiyao12 avatar May 23 '23 02:05 chensiyao12

Has this been solved?

wenshuop avatar May 23 '23 08:05 wenshuop

No. Are you running into this issue too?

chensiyao12 avatar May 24 '23 01:05 chensiyao12

Yes, it's quite frustrating. Increasing the VM's memory didn't help either.
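One thing that might be worth trying (an assumption based on the exitcode -9 above, not something confirmed in this thread) is to process the shards one at a time on CPU, so peak host memory stays near the size of a single shard instead of all four at once. A rough sketch; the paths merely mirror the naming in the log, and the quantize/re-save step is left as a placeholder rather than the repo's actual tooling:

import gc
import torch

# Hypothetical shard paths following the naming in the log above; adjust to your layout.
SHARDS = [f"glm130b_t4/49300/mp_rank_0{r}_model_states.pt" for r in range(4)]

for path in SHARDS:
    shard = torch.load(path, map_location="cpu")   # deserialize one ~60 GiB shard at a time
    state = shard.get("module", shard)             # DeepSpeed-style shards usually nest weights under "module"
    n_params = sum(t.numel() for t in state.values() if isinstance(t, torch.Tensor))
    print(f"{path}: {n_params / 1e9:.1f}B parameters in this shard")
    # ... quantize and re-save this shard here, then drop it before loading the next one ...
    del shard, state
    gc.collect()                                   # return host RAM before touching the next shard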

wenshuop avatar May 24 '23 01:05 wenshuop


If it's convenient, could you share the model? The files provided for 130B only let me download part of the weights.

GXKIM avatar May 25 '23 02:05 GXKIM

I've run into this as well. Has anyone solved it?

wei-potato avatar May 26 '23 06:05 wei-potato

> I've run into this as well. Has anyone solved it?

I solved it.

GXKIM avatar Jun 07 '23 08:06 GXKIM

Could you share how you solved it?

wei-potato avatar Jun 08 '23 03:06 wei-potato


Please walk us through it.

wenshuop avatar Jun 08 '23 03:06 wenshuop

> I've run into this as well. Has anyone solved it?

> I solved it.

Could you share the quantized program?
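For anyone curious what the int4 conversion itself involves, below is a minimal sketch of symmetric per-row absmax weight quantization with 4-bit packing. It illustrates the general technique only and is not the code GLM-130B actually ships:

import torch

def quantize_weight_int4(w: torch.Tensor):
    """Symmetric per-row absmax quantization of a 2-D weight matrix to packed 4-bit codes.

    Illustrative only (assumes an even number of columns); returns the packed
    codes plus a per-row FP16 scale needed to dequantize at inference time.
    """
    w32 = w.float()
    scale = w32.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / 7.0   # int4 range [-7, 7]
    q = torch.round(w32 / scale).clamp(-7, 7).to(torch.int8)             # one code per weight
    nib = (q & 0x0F).to(torch.uint8)                                     # two's-complement nibbles
    packed = nib[:, 0::2] | (nib[:, 1::2] << 4)                          # two codes per byte
    return packed, scale.half()

def dequantize_weight_int4(packed: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Unpack the 4-bit codes and rescale back to FP16 (lossy)."""
    lo = (packed & 0x0F).to(torch.int16)
    hi = (packed >> 4).to(torch.int16)
    q = torch.stack((lo, hi), dim=-1).reshape(packed.shape[0], -1)
    q = torch.where(q > 7, q - 16, q)                                    # sign-extend the nibbles
    return q.to(torch.float16) * scale

w = torch.randn(8, 16, dtype=torch.float16)
packed, scale = quantize_weight_int4(w)
print("packed bytes:", packed.numel(), "vs fp16 bytes:", w.numel() * 2)
print("max abs error:", (w - dequantize_weight_int4(packed, scale)).abs().max().item())

The per-row scales are what get multiplied back in at inference time; relative to FP16 the packed form is 4× smaller, which is the whole point of the int4 path.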

rchanggogogo avatar Jun 09 '23 01:06 rchanggogogo

How did you solve it?

chensiyao12 avatar Jun 15 '23 01:06 chensiyao12