text-generation-inference
Could not run 'c10d::allreduce_' with arguments from the 'Meta' backend.
I am trying to start a large version of the model using Docker:
docker run -p 10249:80 -e RUST_BACKTRACE=full -e FLASH_ATTENTION=1 -e CUDA_VISIBLE_DEVICES=4,7 --privileged --security-opt="seccomp=unconfined" -v /download:/data ghcr.io/huggingface/text-generation-inference:0.5 --model-id /data/llama-13b-hf --num-shard 2 --max-total-tokens 2048
The shards initialize successfully:
Details
2023-04-18T08:00:20.891953Z INFO text_generation_launcher: Args { model_id: "/data/llama-13b-hf", revision: None, sharded: None, num_shard: Some(2), quantize: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_input_length: 1000, max_total_tokens: 2048, max_batch_size: 32, max_waiting_tokens: 20, port: 80, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/data"), weights_cache_override: None, disable_custom_kernels: false, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None }
2023-04-18T08:00:20.891982Z INFO text_generation_launcher: Sharding model on 2 processes
2023-04-18T08:00:20.892328Z INFO text_generation_launcher: Starting shard 0
2023-04-18T08:00:20.892328Z INFO text_generation_launcher: Starting shard 1
2023-04-18T08:00:26.396665Z INFO text_generation_launcher: Shard 0 ready in 5.503382395s
2023-04-18T08:00:26.396665Z INFO text_generation_launcher: Shard 1 ready in 5.503381293s
2023-04-18T08:00:26.495656Z INFO text_generation_launcher: Starting Webserver
2023-04-18T08:00:27.467600Z WARN text_generation_router: router/src/main.rs:134: no pipeline tag found for model /data/llama-13b-hf
2023-04-18T08:00:27.472098Z INFO text_generation_router: router/src/main.rs:149: Connected
but an error occurred when I called it:
2023-04-18T08:00:33.236816Z ERROR shard-manager: text_generation_launcher: "Method Prefill encountered an error.
Traceback (most recent call last):
File \"/opt/miniconda/envs/text-generation/bin/text-generation-server\", line 8, in <module>
sys.exit(app())
File \"/opt/miniconda/envs/text-generation/lib/python3.9/site-packages/typer/main.py\", line 311, in __call__
return get_command(self)(*args, **kwargs)
File \"/opt/miniconda/envs/text-generation/lib/python3.9/site-packages/click/core.py\", line 1130, in __call__
return self.main(*args, **kwargs)
File \"/opt/miniconda/envs/text-generation/lib/python3.9/site-packages/typer/core.py\", line 778, in main
return _main(
File \"/opt/miniconda/envs/text-generation/lib/python3.9/site-packages/typer/core.py\", line 216, in _main
rv = self.invoke(ctx)
File \"/opt/miniconda/envs/text-generation/lib/python3.9/site-packages/click/core.py\", line 1657, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File \"/opt/miniconda/envs/text-generation/lib/python3.9/site-packages/click/core.py\", line 1404, in invoke
return ctx.invoke(self.callback, **ctx.params)
File \"/opt/miniconda/envs/text-generation/lib/python3.9/site-packages/click/core.py\", line 760, in invoke
return __callback(*args, **kwargs)
File \"/opt/miniconda/envs/text-generation/lib/python3.9/site-packages/typer/main.py\", line 683, in wrapper
return callback(**use_params) # type: ignore
File \"/opt/miniconda/envs/text-generation/lib/python3.9/site-packages/text_generation_server/cli.py\", line 55, in serve
server.serve(model_id, revision, sharded, quantize, uds_path)
File \"/opt/miniconda/envs/text-generation/lib/python3.9/site-packages/text_generation_server/server.py\", line 135, in serve
asyncio.run(serve_inner(model_id, revision, sharded, quantize))
File \"/opt/miniconda/envs/text-generation/lib/python3.9/asyncio/runners.py\", line 44, in run
return loop.run_until_complete(main)
File \"/opt/miniconda/envs/text-generation/lib/python3.9/asyncio/base_events.py\", line 634, in run_until_complete
self.run_forever()
File \"/opt/miniconda/envs/text-generation/lib/python3.9/asyncio/base_events.py\", line 601, in run_forever
self._run_once()
File \"/opt/miniconda/envs/text-generation/lib/python3.9/asyncio/base_events.py\", line 1905, in _run_once
handle._run()
File \"/opt/miniconda/envs/text-generation/lib/python3.9/asyncio/events.py\", line 80, in _run
self._context.run(self._callback, *self._args)
File \"/opt/miniconda/envs/text-generation/lib/python3.9/site-packages/grpc_interceptor/server.py\", line 153, in invoke_intercept_method
return await self.intercept(
> File \"/opt/miniconda/envs/text-generation/lib/python3.9/site-packages/text_generation_server/interceptor.py\", line 20, in intercept
return await response
File \"/opt/miniconda/envs/text-generation/lib/python3.9/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py\", line 82, in _unary_interceptor
raise error
File \"/opt/miniconda/envs/text-generation/lib/python3.9/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py\", line 73, in _unary_interceptor
return await behavior(request_or_iterator, context)
File \"/opt/miniconda/envs/text-generation/lib/python3.9/site-packages/text_generation_server/server.py\", line 46, in Prefill
generations, next_batch = self.model.generate_token(batch)
File \"/opt/miniconda/envs/text-generation/lib/python3.9/contextlib.py\", line 79, in inner
return func(*args, **kwds)
File \"/opt/miniconda/envs/text-generation/lib/python3.9/site-packages/text_generation_server/models/flash_causal_lm.py\", line 278, in generate_token
out, present = self.forward(
File \"/opt/miniconda/envs/text-generation/lib/python3.9/site-packages/text_generation_server/models/flash_causal_lm.py\", line 262, in forward
return self.model.forward(
File \"/opt/miniconda/envs/text-generation/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py\", line 607, in forward
hidden_states, present = self.model(
File \"/opt/miniconda/envs/text-generation/lib/python3.9/site-packages/torch/nn/modules/module.py\", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File \"/opt/miniconda/envs/text-generation/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py\", line 523, in forward
hidden_states = self.embed_tokens(input_ids)
File \"/opt/miniconda/envs/text-generation/lib/python3.9/site-packages/torch/nn/modules/module.py\", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File \"/opt/miniconda/envs/text-generation/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py\", line 221, in forward
torch.distributed.all_reduce(out, group=self.process_group)
File \"/opt/miniconda/envs/text-generation/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py\", line 1436, in wrapper
return func(*args, **kwargs)
File \"/opt/miniconda/envs/text-generation/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py\", line 1687, in all_reduce
work = group.allreduce([tensor], opts)
NotImplementedError: Could not run 'c10d::allreduce_' with arguments from the 'Meta' backend. This could be because the operator doesn't exist for this backend, or was omitted during the selective/custom build process (if using custom build). If you are a Facebook employee using PyTorch on mobile, please visit https://fburl.com/ptmfixes for possible resolutions. 'c10d::allreduce_' is only available for these backends: [CPU, CUDA, SparseCPU, SparseCUDA, BackendSelect, Python, FuncTorchDynamicLayerBackMode, Functionalize, Named, Conjugate, Negative, ZeroTensor, ADInplaceOrView, AutogradOther, AutogradCPU, AutogradCUDA, AutogradXLA, AutogradMPS, AutogradXPU, AutogradHPU, AutogradLazy, AutogradMeta, Tracer, AutocastCPU, AutocastCUDA, FuncTorchBatched, FuncTorchVmapMode, Batched, VmapMode, FuncTorchGradWrapper, PythonTLSSnapshot, FuncTorchDynamicLayerFrontMode, PythonDispatcher].
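For context, the failing call is torch.distributed.all_reduce on a tensor that is still on PyTorch's meta device (shape and dtype only, no storage), which the dispatcher-based c10d ops in PyTorch 2.x reject. A minimal sketch outside TGI, assuming PyTorch 2.x and a single-process gloo group, reproduces the same class of error:

import os
import torch
import torch.distributed as dist

# Hedged sketch (not the TGI code path): a tensor on the "meta" device carries
# only shape/dtype metadata and has no storage, so collectives cannot run on it.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29501")
dist.init_process_group("gloo", rank=0, world_size=1)

real = torch.ones(4)
dist.all_reduce(real)                  # works: real storage, CPU backend

meta = torch.empty(4, device="meta")   # metadata only, no storage
try:
    dist.all_reduce(meta)              # 'c10d::allreduce_' has no Meta kernel
except NotImplementedError as err:
    print(err)

dist.destroy_process_group()

If that is what happens inside the shard, it would suggest that some weights were never materialized off the meta device during loading.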
same error here
I think maybe the llama weights you are trying to load do not have the correct layout. Are you sure they are hf compatible?
Yes, I can run inference with transformers and accelerate.
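For reference, a minimal sketch of the kind of transformers + accelerate check meant here (the checkpoint path, dtype, and generation parameters are assumptions, not the exact command that was run):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hedged sketch: plain transformers + accelerate inference on the same
# checkpoint directory; /data/llama-13b-hf is taken from the docker command above.
tokenizer = AutoTokenizer.from_pretrained("/data/llama-13b-hf")
model = AutoModelForCausalLM.from_pretrained(
    "/data/llama-13b-hf",
    torch_dtype=torch.float16,
    device_map="auto",  # accelerate places layers on the visible GPUs
)

inputs = tokenizer("Hello, my name is", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(output[0], skip_special_tokens=True))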
I think maybe the llama weights you are trying to load do not have the correct layout. Are you sure they are hf compatible?
(1) I converted the llama weights with this script and still hit the same bug (a sketch of the conversion command is shown after this list): https://github.com/huggingface/transformers/blob/main/src/transformers/models/llama/convert_llama_weights_to_hf.py
(2) I changed the tokenizer path with --tokenizer_id /path/to/huggingface/hub/models--fxmarty--tiny-llama-fast-tokenizer/snapshots, and I still hit this bug.
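For item (1), the conversion script is presumably invoked along these lines (paths are placeholders; --model_size 13B matches the checkpoint discussed above):

python src/transformers/models/llama/convert_llama_weights_to_hf.py --input_dir /path/to/original/llama/weights --model_size 13B --output_dir /data/llama-13b-hf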
I hit the same problem with the bloomz model https://huggingface.co/bigscience/bloomz-mt on ghcr.io/huggingface/text-generation-inference:0.6.
same problem with bigscience/bloomz-mt