text-generation-inference
Could not run 'c10d::allreduce_' with arguments from the 'Meta' backend.
I am trying to start a large version of the model using Docker:
docker run -p 10249:80 -e RUST_BACKTRACE=full -e FLASH_ATTENTION=1 -e CUDA_VISIBLE_DEVICES=4,7 --privileged --security-opt="seccomp=unconfined" -v /download:/data ghcr.io/huggingface/text-generation-inference:0.5 --model-id /data/llama-13b-hf --num-shard 2 --max-total-tokens 2048
The shards initialize successfully:
Details
2023-04-18T08:00:20.891953Z INFO text_generation_launcher: Args { model_id: "/data/llama-13b-hf", revision: None, sharded: None, num_shard: Some(2), quantize: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_input_length: 1000, max_total_tokens: 2048, max_batch_size: 32, max_waiting_tokens: 20, port: 80, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/data"), weights_cache_override: None, disable_custom_kernels: false, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None }
2023-04-18T08:00:20.891982Z INFO text_generation_launcher: Sharding model on 2 processes
2023-04-18T08:00:20.892328Z INFO text_generation_launcher: Starting shard 0
2023-04-18T08:00:20.892328Z INFO text_generation_launcher: Starting shard 1
2023-04-18T08:00:26.396665Z INFO text_generation_launcher: Shard 0 ready in 5.503382395s
2023-04-18T08:00:26.396665Z INFO text_generation_launcher: Shard 1 ready in 5.503381293s
2023-04-18T08:00:26.495656Z INFO text_generation_launcher: Starting Webserver
2023-04-18T08:00:27.467600Z WARN text_generation_router: router/src/main.rs:134: no pipeline tag found for model /data/llama-13b-hf
2023-04-18T08:00:27.472098Z INFO text_generation_router: router/src/main.rs:149: Connected
but an error occurred when I called it:
2023-04-18T08:00:33.236816Z ERROR shard-manager: text_generation_launcher: "Method Prefill encountered an error.
Traceback (most recent call last):
File \"/opt/miniconda/envs/text-generation/bin/text-generation-server\", line 8, in <module>
sys.exit(app())
File \"/opt/miniconda/envs/text-generation/lib/python3.9/site-packages/typer/main.py\", line 311, in __call__
return get_command(self)(*args, **kwargs)
File \"/opt/miniconda/envs/text-generation/lib/python3.9/site-packages/click/core.py\", line 1130, in __call__
return self.main(*args, **kwargs)
File \"/opt/miniconda/envs/text-generation/lib/python3.9/site-packages/typer/core.py\", line 778, in main
return _main(
File \"/opt/miniconda/envs/text-generation/lib/python3.9/site-packages/typer/core.py\", line 216, in _main
rv = self.invoke(ctx)
File \"/opt/miniconda/envs/text-generation/lib/python3.9/site-packages/click/core.py\", line 1657, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File \"/opt/miniconda/envs/text-generation/lib/python3.9/site-packages/click/core.py\", line 1404, in invoke
return ctx.invoke(self.callback, **ctx.params)
File \"/opt/miniconda/envs/text-generation/lib/python3.9/site-packages/click/core.py\", line 760, in invoke
return __callback(*args, **kwargs)
File \"/opt/miniconda/envs/text-generation/lib/python3.9/site-packages/typer/main.py\", line 683, in wrapper
return callback(**use_params) # type: ignore
File \"/opt/miniconda/envs/text-generation/lib/python3.9/site-packages/text_generation_server/cli.py\", line 55, in serve
server.serve(model_id, revision, sharded, quantize, uds_path)
File \"/opt/miniconda/envs/text-generation/lib/python3.9/site-packages/text_generation_server/server.py\", line 135, in serve
asyncio.run(serve_inner(model_id, revision, sharded, quantize))
File \"/opt/miniconda/envs/text-generation/lib/python3.9/asyncio/runners.py\", line 44, in run
return loop.run_until_complete(main)
File \"/opt/miniconda/envs/text-generation/lib/python3.9/asyncio/base_events.py\", line 634, in run_until_complete
self.run_forever()
File \"/opt/miniconda/envs/text-generation/lib/python3.9/asyncio/base_events.py\", line 601, in run_forever
self._run_once()
File \"/opt/miniconda/envs/text-generation/lib/python3.9/asyncio/base_events.py\", line 1905, in _run_once
handle._run()
File \"/opt/miniconda/envs/text-generation/lib/python3.9/asyncio/events.py\", line 80, in _run
self._context.run(self._callback, *self._args)
File \"/opt/miniconda/envs/text-generation/lib/python3.9/site-packages/grpc_interceptor/server.py\", line 153, in invoke_intercept_method
return await self.intercept(
> File \"/opt/miniconda/envs/text-generation/lib/python3.9/site-packages/text_generation_server/interceptor.py\", line 20, in intercept
return await response
File \"/opt/miniconda/envs/text-generation/lib/python3.9/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py\", line 82, in _unary_interceptor
raise error
File \"/opt/miniconda/envs/text-generation/lib/python3.9/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py\", line 73, in _unary_interceptor
return await behavior(request_or_iterator, context)
File \"/opt/miniconda/envs/text-generation/lib/python3.9/site-packages/text_generation_server/server.py\", line 46, in Prefill
generations, next_batch = self.model.generate_token(batch)
File \"/opt/miniconda/envs/text-generation/lib/python3.9/contextlib.py\", line 79, in inner
return func(*args, **kwds)
File \"/opt/miniconda/envs/text-generation/lib/python3.9/site-packages/text_generation_server/models/flash_causal_lm.py\", line 278, in generate_token
out, present = self.forward(
File \"/opt/miniconda/envs/text-generation/lib/python3.9/site-packages/text_generation_server/models/flash_causal_lm.py\", line 262, in forward
return self.model.forward(
File \"/opt/miniconda/envs/text-generation/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py\", line 607, in forward
hidden_states, present = self.model(
File \"/opt/miniconda/envs/text-generation/lib/python3.9/site-packages/torch/nn/modules/module.py\", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File \"/opt/miniconda/envs/text-generation/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py\", line 523, in forward
hidden_states = self.embed_tokens(input_ids)
File \"/opt/miniconda/envs/text-generation/lib/python3.9/site-packages/torch/nn/modules/module.py\", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File \"/opt/miniconda/envs/text-generation/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py\", line 221, in forward
torch.distributed.all_reduce(out, group=self.process_group)
File \"/opt/miniconda/envs/text-generation/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py\", line 1436, in wrapper
return func(*args, **kwargs)
File \"/opt/miniconda/envs/text-generation/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py\", line 1687, in all_reduce
work = group.allreduce([tensor], opts)
NotImplementedError: Could not run 'c10d::allreduce_' with arguments from the 'Meta' backend. This could be because the operator doesn't exist for this backend, or was omitted during the selective/custom build process (if using custom build). If you are a Facebook employee using PyTorch on mobile, please visit https://fburl.com/ptmfixes for possible resolutions. 'c10d::allreduce_' is only available for these backends: [CPU, CUDA, SparseCPU, SparseCUDA, BackendSelect, Python, FuncTorchDynamicLayerBackMode, Functionalize, Named, Conjugate, Negative, ZeroTensor, ADInplaceOrView, AutogradOther, AutogradCPU, AutogradCUDA, AutogradXLA, AutogradMPS, AutogradXPU, AutogradHPU, AutogradLazy, AutogradMeta, Tracer, AutocastCPU, AutocastCUDA, FuncTorchBatched, FuncTorchVmapMode, Batched, VmapMode, FuncTorchGradWrapper, PythonTLSSnapshot, FuncTorchDynamicLayerFrontMode, PythonDispatcher].
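For context, the failing call is torch.distributed.all_reduce on a tensor that is still on PyTorch's meta device (shape and dtype only, no storage), which the dispatcher-based c10d ops in PyTorch 2.x reject. A minimal sketch outside TGI, assuming PyTorch 2.x and a single-process gloo group, reproduces the same class of error:

import os
import torch
import torch.distributed as dist

# Hedged sketch (not the TGI code path): a tensor on the "meta" device carries
# only shape/dtype metadata and has no storage, so collectives cannot run on it.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29501")
dist.init_process_group("gloo", rank=0, world_size=1)

real = torch.ones(4)
dist.all_reduce(real)                  # works: real storage, CPU backend

meta = torch.empty(4, device="meta")   # metadata only, no storage
try:
    dist.all_reduce(meta)              # 'c10d::allreduce_' has no Meta kernel
except NotImplementedError as err:
    print(err)

dist.destroy_process_group()

If that is what happens inside the shard, it would suggest that some weights were never materialized off the meta device during loading.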
same error here
I think maybe the llama weights you are trying to load do not have the correct layout. Are you sure they are hf compatible?
Yes, I can run inference with transformers and accelerate.
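For reference, a minimal sketch of the kind of transformers + accelerate check meant here (the checkpoint path, dtype, and generation parameters are assumptions, not the exact command that was run):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hedged sketch: plain transformers + accelerate inference on the same
# checkpoint directory; /data/llama-13b-hf is taken from the docker command above.
tokenizer = AutoTokenizer.from_pretrained("/data/llama-13b-hf")
model = AutoModelForCausalLM.from_pretrained(
    "/data/llama-13b-hf",
    torch_dtype=torch.float16,
    device_map="auto",  # accelerate places layers on the visible GPUs
)

inputs = tokenizer("Hello, my name is", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(output[0], skip_special_tokens=True))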
I think maybe the llama weights you are trying to load do not have the correct layout. Are you sure they are hf compatible?
(1) I converted the llama weights with this script and still hit the same bug (a sketch of the conversion command is shown after this list): https://github.com/huggingface/transformers/blob/main/src/transformers/models/llama/convert_llama_weights_to_hf.py
(2) I changed the tokenizer path with --tokenizer_id /path/to/huggingface/hub/models--fxmarty--tiny-llama-fast-tokenizer/snapshots, and I still hit this bug.
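For item (1), the conversion script is presumably invoked along these lines (paths are placeholders; --model_size 13B matches the checkpoint discussed above):

python src/transformers/models/llama/convert_llama_weights_to_hf.py --input_dir /path/to/original/llama/weights --model_size 13B --output_dir /data/llama-13b-hf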
I hit the same problem with the bloomz model https://huggingface.co/bigscience/bloomz-mt on ghcr.io/huggingface/text-generation-inference:0.6.
same problem with bigscience/bloomz-mt