Baber Abbasi

51 comments of Baber Abbasi

Are you using `limit 1` for the second error? Might be because it divides by N - 1 to calculate the sample standard deviation. cc @lintangsutawika
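For context, a minimal sketch of why a single evaluated example breaks the stderr computation: the sample standard deviation divides by N - 1, which is zero when N = 1 (the exact call sites in the harness may differ; this just illustrates the failure mode):

```python
import statistics

scores = [1.0]  # metric values from a single evaluated example, e.g. with limit 1

try:
    # Sample standard deviation uses an N - 1 denominator,
    # so it is undefined for fewer than two data points.
    stderr = statistics.stdev(scores) / len(scores) ** 0.5
except statistics.StatisticsError as e:
    print(f"stderr undefined: {e}")

# With two or more samples the computation succeeds:
scores = [1.0, 0.0]
stderr = statistics.stdev(scores) / len(scores) ** 0.5
print(stderr)  # 0.5
```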

Hey. Have you tried caching the weights by running with DP=1 until they are downloaded? I found it prone to hang with DP otherwise.

> Yes, the weights are cached. The process is hanging after `llm.generate` returns results.

Hmm, it's working for me with `0.3.2`. Have you tried running in a fresh virtual environment?

> Just tried it on a separate server and a new env, still facing the same issue.

What version of ray do you have? Mine is `ray==2.10.0`. Probably the latest one....

It's probably because of #1308. So the fewshot samples used for a particular `doc_id` will vary depending on whether DP is used and the number of ranks. Best way to...
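A hypothetical sketch of the mechanism (the pool, seed, and sharding scheme here are illustrative, not the harness's actual code): when fewshot examples are drawn from a single sequentially-advancing RNG per shard, the draw a given `doc_id` receives depends on its position within that shard, which changes with the number of DP ranks:

```python
import random

# Illustrative fewshot pool, not the harness's real data.
FEWSHOT_POOL = [f"example_{i}" for i in range(100)]

def fewshot_for_shard(doc_ids, seed=1234, k=2):
    """Draw k fewshot examples per doc from one sequential RNG.

    The RNG state advances once per document *within the shard*, so the
    samples a doc_id sees depend on how many docs preceded it there.
    """
    rng = random.Random(seed)
    return {d: rng.sample(FEWSHOT_POOL, k) for d in doc_ids}

all_docs = list(range(8))

# Single process (DP=1): doc 5 is the 6th draw.
single = fewshot_for_shard(all_docs)

# Two ranks (DP=2), odd docs on rank 1: doc 5 is now the 3rd draw.
rank1 = fewshot_for_shard(all_docs[1::2])

print(single[5] == rank1[5])  # almost certainly False: different RNG positions
print(single[0] == rank1[1])  # True: both are the first draw after seeding
```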

Hi! There's [Real Toxicity](https://github.com/EleutherAI/lm-evaluation-harness/tree/big-refactor/lm_eval/tasks/realtoxicityprompts) in the big-refactor (soon to be main) branch which evaluates the generations with the Perspective API (need a key but it's free) using a custom `metric.py`....
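As a rough sketch of what the Perspective API expects (request shape per its public docs; the actual `metric.py` in the harness may structure this differently), you build a small JSON body per generation and POST it with your API key:

```python
import json

API_URL = "https://commentanalyzer.googleapis.com/v1alpha1/comments:analyze"

def build_toxicity_request(text):
    """Build the JSON body for a Perspective API toxicity query.

    The toxicity score in the response lives under
    attributeScores.TOXICITY.summaryScore.value.
    """
    return {
        "comment": {"text": text},
        "requestedAttributes": {"TOXICITY": {}},
    }

body = build_toxicity_request("a model generation to score")
print(json.dumps(body))
# To actually score it, POST this body to API_URL with ?key=<your free key>,
# e.g. requests.post(API_URL, params={"key": key}, json=body).
```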

Hi! @juletx should be able to confirm but I think just using `{{answer_number|string}}` without the condition should [work](https://huggingface.co/datasets/juletxara/mgsm) here. Not quite sure what we are indexing here. The COT prompts...

Thanks for the confirmation @juletx! Also, it looks like the `\nAnswer:` string in `doc_to_text` should be in the native language for the `direct` variation, which doesn't seem to be true...

> It looks like `answer[6+1]` is intended to skip the predefined `ANSWER` string, "Step-by-Step Answer:", which has a len of 6, from the `answer` string. However, for such purpose, we need...
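If the goal is just to drop a fixed answer prefix, slicing by the prefix's actual length (or using `str.removeprefix`) avoids hard-coded offsets entirely; a small sketch with an illustrative prefix and answer string:

```python
# Illustrative prefix and answer; the real mgsm COT strings may differ.
PREFIX = "Step-by-Step Answer:"

answer = "Step-by-Step Answer: There are 5 + 6 = 11 apples. The answer is 11."

# Hard-coded offsets like answer[7:] silently break if the prefix changes;
# slicing by len(PREFIX), or removeprefix (Python 3.9+), keeps them in sync.
stripped = answer[len(PREFIX):].strip()
also_stripped = answer.removeprefix(PREFIX).strip()

print(stripped)
print(stripped == also_stripped)  # True
```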

Like @haileyschoelkopf said, I think for a fair comparison you should use an auto batch size to take advantage of vLLM's continuous batching. I don't know if it slows down when logprobs...