
[Segmentation fault] python3 torchchat.py export stories15M --dtype fp32 --quantize '{"embedding": {"bitwidth": 4, "groupsize":32}, "linear:a8w4dq": {"groupsize" : 256}}' --output-pte-path stories15M.pte

Open mikekgfb opened this issue 1 year ago • 4 comments

https://github.com/pytorch/torchchat/actions/runs/9047866134/job/24860312456?pr=751

This is a launch blocker for torchchat because it causes a failure for users following the example commands in our docs.

  + python3 torchchat.py export stories15M --dtype fp32 --quantize '{"embedding": {"bitwidth": 4, "groupsize":32}, "linear:a8w4dq": {"groupsize" : 256}}' --output-pte-path stories15M.pte
  /opt/homebrew/Caskroom/miniconda/base/envs/test-quantization-mps-macos/lib/python3.10/site-packages/executorch/exir/emit/_emitter.py:1474: UserWarning: Mutation on a buffer in the model is detected. ExecuTorch assumes buffers that are mutated in the graph have a meaningless initial state, only the shape and dtype will be serialized.
    warnings.warn(
  Using device=cpu
  Loading model...
  Time to load model: 0.01 seconds
  Quantizing the model with: {'embedding': {'bitwidth': 4, 'groupsize': 32}, 'linear:a8w4dq': {'groupsize': 256}}
  Time to quantize model: 7.83 seconds
  Exporting model using ExecuTorch to /Users/ec2-user/runner/_work/torchchat/torchchat/pytorch/torchchat/stories15M.pte
  The methods are:  {'forward'}
  + python3 generate.py stories15M --pte-path stories15M.pte --prompt 'Hello my name is'
  [program.cpp:130] InternalConsistency verification requested but not available
  [method.cpp:939] Overriding output data pointer allocated by memory plan is not allowed.
  ./run-quantization.sh: line 27: 18269 Segmentation fault: 11  python3 generate.py stories15M --pte-path stories15M.pte --prompt "Hello my name is"
  Error: Process completed with exit code 1.
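For reference, the `--quantize` flag in the failing command takes a JSON string of per-scheme options. A minimal sketch of parsing and sanity-checking such a config before export (the `parse_quantize_config` helper is hypothetical, not torchchat's actual parser):

```python
import json

# The exact JSON string from the failing command above.
raw = '{"embedding": {"bitwidth": 4, "groupsize": 32}, "linear:a8w4dq": {"groupsize": 256}}'

def parse_quantize_config(raw: str) -> dict:
    """Parse a torchchat-style --quantize JSON string and sanity-check it.

    Hypothetical helper for illustration only; torchchat's real handling
    of this flag may differ.
    """
    config = json.loads(raw)
    for scheme, options in config.items():
        if not isinstance(options, dict):
            raise ValueError(f"options for {scheme!r} must be a JSON object")
        if "groupsize" in options and options["groupsize"] <= 0:
            raise ValueError(f"groupsize for {scheme!r} must be positive")
    return config

config = parse_quantize_config(raw)
print(config["embedding"]["bitwidth"])       # 4
print(config["linear:a8w4dq"]["groupsize"])  # 256
```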

mikekgfb avatar May 12 '24 01:05 mikekgfb

Also https://github.com/pytorch/torchchat/actions/runs/9054732211/job/24874908070?pr=768

mikekgfb avatar May 12 '24 23:05 mikekgfb

Thanks for reporting.

@mikekgfb, I tried reproducing locally but can't so far. Is it reproducible for you consistently, or did it happen randomly?

mergennachin avatar May 13 '24 14:05 mergennachin

Consistently reproducible, both in CI and locally.

mikekgfb avatar May 13 '24 15:05 mikekgfb

I wonder if this is caused by the CI flow exporting to the same file name, with multiple threads colliding on the same named .pte file. Then, when the model was run, the corrupted file caused the segfault.

Do you mind sharing the model artifact causing the segfault? That can help jumpstart debugging this.

mcr229 avatar May 13 '24 16:05 mcr229
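One way to rule out the file-collision theory above would be to give each CI job its own uniquely named .pte output instead of a shared `stories15M.pte`, so concurrent exports can never clobber one another. A minimal sketch (hypothetical helper, not part of torchchat or its CI scripts):

```python
import os
import uuid

def unique_pte_path(base: str = "stories15M", out_dir: str = ".") -> str:
    """Build a collision-resistant .pte output path for one CI job.

    Combines the process id with a random suffix so two jobs exporting
    the same model on the same runner cannot write to the same file.
    """
    suffix = f"{os.getpid()}-{uuid.uuid4().hex[:8]}"
    return os.path.join(out_dir, f"{base}-{suffix}.pte")

path = unique_pte_path()
print(path)  # e.g. ./stories15M-18269-a1b2c3d4.pte
```

The same path would then be passed to both the export step (`--output-pte-path`) and the generate step (`--pte-path`) within a single job.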

> I wonder if this is caused by the CI flow exporting to the same file name, with multiple threads colliding on the same named .pte file. Then, when the model was run, the corrupted file caused the segfault.
>
> Do you mind sharing the model artifact causing the segfault? That can help jumpstart debugging this.

I don't think we use multithreading? That being said, this works now.

mikekgfb avatar May 20 '24 12:05 mikekgfb