How do you evaluate reasoning models like QwQ-32B, given that their responses are very long and slow to generate? Are there any adjustments to the hyperparameters in pred.py?
We use Temperature=0.6, TopP=0.95, MinP=0, max_new_tokens=30000, max_input_len=100000. Remember to enable YaRN whilst evaluating.
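Concretely, these settings map onto a request roughly as follows. This is a minimal sketch assuming an OpenAI-compatible endpoint (for example a local vLLM server), with a placeholder `base_url`, model name, and prompt rather than the exact code in pred.py; note that `min_p` is not a standard OpenAI parameter, so it is passed via `extra_body`:

```python
from openai import OpenAI

# Placeholders: point these at whatever OpenAI-compatible server hosts QwQ-32B.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
prompt = "..."  # truncate the input to max_input_len=100000 tokens before sending

completion = client.chat.completions.create(
    model="Qwen/QwQ-32B",
    messages=[{"role": "user", "content": prompt}],
    temperature=0.6,           # Temperature=0.6
    top_p=0.95,                # TopP=0.95
    max_tokens=30000,          # max_new_tokens=30000
    extra_body={"min_p": 0},   # MinP=0; non-standard field, accepted by vLLM
)
print(completion.choices[0].message.content)
```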
I wonder whether you need to add an additional hyperparameter, "timeout", in the following place during evaluation:
```python
completion = client.chat.completions.create(
    model=model,
    messages=[{"role": "user", "content": prompt}],
    temperature=temperature,
    max_tokens=max_new_tokens,
    timeout=40000,  # or larger?
)
```
And for enabling YaRN, is the following QwQ-32B config right for evaluation?
{ "architectures": [ "Qwen2ForCausalLM" ], "attention_dropout": 0.0, "bos_token_id": 151643, "eos_token_id": 151645, "hidden_act": "silu", "hidden_size": 5120, "initializer_range": 0.02, "intermediate_size": 27648, "max_position_embeddings": 40960, "max_window_layers": 64, "model_type": "qwen2", "num_attention_heads": 40, "num_hidden_layers": 64, "num_key_value_heads": 8, "rms_norm_eps": 1e-05, "rope_theta": 1000000.0, "rope_scaling": { "factor": 4.0, "original_max_position_embeddings": 32768, "type": "yarn" }, "sliding_window": 131072, "tie_word_embeddings": false, "torch_dtype": "bfloat16", "transformers_version": "4.43.1", "use_cache": true, "use_sliding_window": false, "vocab_size": 152064 }
I set timeout=3600. Your YaRN configuration is correct.
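For reference, the openai-python client interprets `timeout` in seconds, so `timeout=3600` allows up to an hour per request (the `timeout=40000` above would be over 11 hours). A minimal sketch, with placeholder connection details, of setting it once on the client instead of per call:

```python
from openai import OpenAI

# Placeholder connection details; this timeout applies to every request
# made with the client unless overridden on an individual call.
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="EMPTY",
    timeout=3600,  # seconds: up to one hour per request
)
```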
Thanks for your reply!
Have you evaluated QwQ-32B on LongBench v1? If so, are there any adjustments to the hyperparameters in pred.py?