
Generate Command phi3 Error

sgupta1007 opened this issue 1 year ago · 13 comments

I have used the command `tune run generate --config custom_quantization.yaml prompt='Explain some topic'` to generate inference from a finetuned phi3 model through torchtune.

Config custom_quantization.yaml

model:
  _component_: torchtune.models.phi3.qlora_phi3_mini

checkpointer:
  _component_: torchtune.utils.FullModelHFCheckpointer
  checkpoint_dir: //fine_tuned_phi/
  checkpoint_files: [
    hf_model_0001_0.pt, hf_model_0002_0.pt, adapter_0.pt
  ]
  model_type: PHI3
  output_dir: /fine_tuned_legal_phi/

device: cuda
dtype: bf16
seed: 1234

quantizer:
  _component_: torchtune.utils.quantization.Int4WeightOnlyQuantizer
  groupsize: 256

Error flagged: `KeyError: 'PHI3'`

sgupta1007 avatar Sep 13 '24 20:09 sgupta1007

I believe it should be "model_type: PHI3_MINI"

https://github.com/pytorch/torchtune/blob/4fbe7b2d4956b3790c51d7a255c0040cf5c38fad/recipes/configs/phi3/mini_lora.yaml#L46
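For reference, here is the checkpointer block from the config above with only the model type changed (all paths and filenames are taken verbatim from the original config):

```yaml
checkpointer:
  _component_: torchtune.utils.FullModelHFCheckpointer
  checkpoint_dir: //fine_tuned_phi/
  checkpoint_files: [
    hf_model_0001_0.pt,
    hf_model_0002_0.pt,
    adapter_0.pt
  ]
  model_type: PHI3_MINI
  output_dir: /fine_tuned_legal_phi/
```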

felipemello1 avatar Sep 13 '24 21:09 felipemello1

The model type change resolved this error but led to a `FullModelHFCheckpointer.load_checkpoint() got an unexpected keyword argument 'weights_only'` error.

sgupta1007 avatar Sep 13 '24 22:09 sgupta1007

@joecummings , have you seen this before?

felipemello1 avatar Sep 13 '24 22:09 felipemello1

@sgupta1007, I am not too familiar with the generate recipe; however, we are working on a V2 of it (https://github.com/pytorch/torchtune/pull/1563). There are opportunities to improve the quantization experience in it.

To unblock you for now, are you able to use generate without the quantization?
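Turning quantization off should only be a config change; a minimal sketch of the relevant part of `custom_quantization.yaml` (this mirrors how the llama config later in this thread does it):

```yaml
# Replace the quantizer block with null; the rest of the config stays unchanged.
quantizer: null
```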

felipemello1 avatar Sep 13 '24 22:09 felipemello1

I am not able to use generate without quantization.

I will try to explain my approach for generation:

1. Perform phi3 qlora finetuning for 1 epoch
2. Supply the adapter and model weights to the checkpointer files in the config file
3. Keep the model component as torchtune.models.phi3.qlora_phi3_mini.
4. Run the generation command `tune run generate --config custom_quantization.yaml prompt='Explain some topic'`

sgupta1007 avatar Sep 15 '24 10:09 sgupta1007

@sgupta1007 since the adapter is already merged, why do we need to give both the adapter and model weights?

apthagowda97 avatar Sep 16 '24 20:09 apthagowda97

model:
  _component_: torchtune.models.llama3_1.llama3_1_8b

checkpointer:
  _component_: torchtune.utils.FullModelHFCheckpointer
  checkpoint_dir: /path/output/
  checkpoint_files: [
    hf_model_0001_0.pt,
    hf_model_0002_0.pt,
    hf_model_0003_0.pt,
    hf_model_0004_0.pt
  ]
  output_dir: /path/output/
  model_type: LLAMA3

device: cuda
dtype: bf16

seed: 1234

# Tokenizer arguments
tokenizer:
  _component_: torchtune.models.llama3.llama3_tokenizer
  path: /path/llama3.1-8b/original/tokenizer.model

# Generation arguments; defaults taken from gpt-fast
prompt: "Tell me a joke?"
instruct_template: null
chat_format: null
max_new_tokens: 300
temperature: 0.6 # 0.8 and 0.6 are popular values to try
top_k: 300
# It is recommended to set enable_kv_cache=False for long-context models like Llama3.1
enable_kv_cache: True

quantizer: null

I am getting CUDA out of memory on this on an A100 GPU for an 8B model ... strange!!!

apthagowda97 avatar Sep 16 '24 20:09 apthagowda97

Can you run `nvidia-smi` and confirm that there isn't any dead process consuming your memory before you run generate.py?

However, there was a known issue where kvcache was in FP32 and was initialized with max_seq_len=131k, consuming a lot of memory before generation started. There were a couple of PRs up to fix this.

I will let @joecummings and @SalmanMohammadi reply, since they were working on this.

Thanks for sharing this info!

felipemello1 avatar Sep 16 '24 20:09 felipemello1

> Can you run `nvidia-smi` and confirm that there isn't any dead process consuming your memory before you run generate.py?
>
> However, there was a known issue where kvcache was in FP32 and was initialized with max_seq_len=131k, consuming a lot of memory before generation started. There were a couple of PRs up to fix this.
>
> I will let @joecummings and @SalmanMohammadi reply, since they were working on this.
>
> Thanks for sharing this info!

Yep, this is almost certainly due to the fact that the KV cache is being initialized for 131k context length, which OOMs. Once #1449 lands, we can set a max length on the cache itself so that it doesn't initialize for the whole context length. In the meantime, here are some mitigations:

  • Modify the Llama3.1 8B model definition to set a max_seq_length=8192
  • Turn off kv caching and use compile instead for some speed-up
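For a sense of scale, a back-of-envelope calculation (using the published Llama 3.1 8B attention shape: 32 layers, 8 KV heads, head dim 128) shows why an fp32 KV cache at the full 131k context exceeds an A100's 40 GB on its own, before the ~16 GB of bf16 weights are even counted:

```python
# Back-of-envelope KV-cache memory for Llama 3.1 8B at full 131k context.
# Architecture numbers are from the published Llama 3.1 8B config; fp32
# cache dtype reflects the bug described above.
num_layers = 32
num_kv_heads = 8        # grouped-query attention
head_dim = 128
bytes_fp32 = 4
max_seq_len = 131_072   # Llama 3.1 default context length

# K and V tensors, per layer, per token
bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_fp32
cache_gib = bytes_per_token * max_seq_len / 2**30
print(f"{bytes_per_token} bytes/token -> {cache_gib:.0f} GiB KV cache")
# -> 262144 bytes/token -> 32 GiB KV cache
```

Capping the cache at 8192 tokens shrinks it 16x, to 2 GiB, which is why both mitigations above work.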

joecummings avatar Sep 16 '24 21:09 joecummings

This should be addressed with #1603 now that #1449 is in.

salmanmohammadi avatar Sep 16 '24 21:09 salmanmohammadi

Hey @apthagowda97 - give this a try on our latest nightly build, it should work for you : )

salmanmohammadi avatar Sep 17 '24 14:09 salmanmohammadi

@felipemello1 @SalmanMohammadi I am not able to generate inference even if I drop the adapter weights and run inference without quantization. @apthagowda97 I am specifically talking about phi3, not llama. If you try to prompt without the adapter weights, you will run into the same error.

sgupta1007 avatar Sep 24 '24 20:09 sgupta1007

@joecummings , do you mind taking a look?

felipemello1 avatar Sep 26 '24 14:09 felipemello1