Koan-Sin Tan
*gemma 3 1b*

| Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
|-----------------|------:|--------|------:|--------|---|------:|---|-------|
| tinyBenchmarks | N/A | | | | | | | |
| - tinyArc | 0 | none | ...
> > > After some exploration, the use cases we are trying to enable (say summarization, context generation, etc.) are not properly captured by the datasets used in tinyBenchmarks....
> [@freedomtan](https://github.com/freedomtan) From the tinyBenchmarks page, the ones I have marked as Single-Token had descriptions similar to single-token outputs (PFB). We have not run them to be exactly sure. But...
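For context on what "single-token output" means here: harnesses like lm_eval typically score a multiple-choice task by comparing the model's log-likelihood of each candidate answer token (e.g. "A".."D") and taking the argmax, rather than generating free-form text. A minimal sketch of that idea, with made-up log-probabilities standing in for a real model:

```python
def pick_choice(logprobs: dict) -> str:
    """Pick the answer whose single answer token the model finds most likely.

    `logprobs` maps each candidate answer token (e.g. "A".."D") to the
    model's log-probability of that token given the question prompt.
    """
    return max(logprobs, key=logprobs.get)

# Made-up numbers standing in for a real model's next-token log-probs.
example = {"A": -2.3, "B": -0.4, "C": -1.9, "D": -3.1}
print(pick_choice(example))  # "B": the most likely single answer token
```

This is why such tasks are cheap to run: each question needs only one forward pass per choice (or one pass total, reading four logits), no decoding loop.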
How about the quantized models from the Meta folks? We know they are available on Hugging Face too:
- https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct-QLORA_INT4_EO8
- https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct-SpinQuant_INT4_EO8

Well, they are not in the Hugging Face safetensors format, that is, we...
@mohitmundhragithub @Aswinoss and @Mostelk OpenOrca is a dataset, not a benchmark.
> This paper https://arxiv.org/pdf/2208.03299 also has an interesting code base that may be easier to integrate than lm-eval or tiny lm-eval; just focus on the zero-shot cases for our use case:...
As I said, Meta's quantized Llama 3.2 3B models could be evaluated with ExecuTorch code. With

```bash
export LLAMA_DIR="/Users/freedom/.llama/checkpoints"
export LLAMA_QUANTIZED_CHECKPOINT=${LLAMA_DIR}/"Llama3.2-3B-Instruct-int4-qlora-eo8/consolidated.00.pth"
export LLAMA_PARAMS=${LLAMA_DIR}/"Llama3.2-3B-Instruct-int4-qlora-eo8/params.json"
export LLAMA_TOKENIZER=${LLAMA_DIR}/"Llama3.2-3B-Instruct-int4-qlora-eo8/tokenizer.model"
python -m executorch.examples.models.llama.eval_llama \
  --model...
```
To get baseline numbers, Llama 3.2 3B Instruct MMLU with `lm_eval`:

```bash
$ lm_eval --model hf --model_args pretrained=meta-llama/Llama-3.2-3B-Instruct --tasks mmlu --num_fewshot 5
```

I got

hf (pretrained=meta-llama/Llama-3.2-3B-Instruct), gen_kwargs: (None), limit:...
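When collecting several such runs, the pipe-delimited results table `lm_eval` prints can be parsed back into structured records with a small helper. A sketch (this `parse_lm_eval_row` helper is hypothetical, not part of lm_eval; it assumes the standard column order Tasks | Version | Filter | n-shot | Metric | arrow | Value | ± | Stderr, and the numbers in the sample row are illustrative):

```python
def parse_lm_eval_row(line: str) -> dict:
    """Parse one data row of lm_eval's pipe-delimited results table.

    Hypothetical helper: assumes the standard column order
    Tasks | Version | Filter | n-shot | Metric | (arrow) | Value | (±) | Stderr.
    """
    cells = [c.strip() for c in line.strip().strip("|").split("|")]
    return {
        "task": cells[0],
        "metric": cells[4],
        "value": float(cells[6]),
        "stderr": float(cells[8]),
    }

# Illustrative row in lm_eval's output format (numbers are made up).
row = "|mmlu | 2|none | 5|acc |↑ |0.6048|± |0.0038|"
print(parse_lm_eval_row(row))
```

Handy when comparing the quantized variants against the fp16 baseline across tasks.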
How does ExecuTorch's `executorch.examples.models.llama.eval_llama` work? Mainly, it calls lm_eval's `evaluator.simple_evaluate()`; see https://github.com/pytorch/executorch/blob/main/examples/models/llama/eval_llama_lib.py#L295-L320 and https://github.com/EleutherAI/lm-evaluation-harness/blob/8bc4afff22e73995883de41018388428e39f8a92/lm_eval/evaluator.py#L47
Evaluated with `lm_eval --model hf --model_args pretrained=meta-llama/... --tasks mmlu --num_fewshot 5` on Colab (w/ L4 GPU):

| model | MMLU (5-shot) |
|-------|---------------|
| 3.2 1B Instruct | 0.4557 ± 0.0041 |
| 3.2 ...
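The ± column is a standard error. For an accuracy metric over n questions it is roughly the binomial standard error sqrt(p·(1−p)/n); the aggregate MMLU stderr lm_eval reports is a pooled variant, so this is only an approximation. A quick sanity check, assuming MMLU's test split of 14,042 questions (an assumption on my part):

```python
import math

def binomial_stderr(p: float, n: int) -> float:
    """Approximate standard error of an accuracy p measured over n items."""
    return math.sqrt(p * (1.0 - p) / n)

# 0.4557 accuracy over ~14,042 MMLU test questions (assumed split size).
print(round(binomial_stderr(0.4557, 14042), 4))  # ~0.0042, close to the reported ±0.0041
```

Close enough to the table's ±0.0041 to confirm the numbers are in the right ballpark.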