Jeremy D comments

Results: 7 comments by Jeremy D

multilingual ability

We haven't performed any multilingual evaluation yet. Are there any multilingual benchmarks you'd like us to evaluate on?

Reproduce result of Boolq on LLaMA-7B

We found that the zero-shot performance of LLaMA on BoolQ was 0.767. Can you let me know how you produced the 0.62 number?
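
For readers comparing numbers like these, the snippet below is a minimal sketch of one common way to score a yes/no task such as BoolQ in the zero-shot setting: compare the model's log-likelihood for each answer and count how often the higher-scoring answer matches the label. It is not taken from the evaluation code discussed in this thread. The function names and the example scores are hypothetical, and the real harness may normalize or prompt differently, which is one way two runs can disagree.

```python
from typing import Sequence, Tuple


def pick_answer(loglik_yes: float, loglik_no: float) -> bool:
    """Return True ("yes") when the "yes" continuation scores at least as high."""
    return loglik_yes >= loglik_no


def zero_shot_accuracy(
    scores: Sequence[Tuple[float, float]], labels: Sequence[bool]
) -> float:
    """Fraction of examples where the higher-scoring answer matches the label.

    `scores` holds (loglik_yes, loglik_no) pairs, one per example.
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if not labels:
        raise ValueError("at least one example is required")
    correct = sum(
        pick_answer(yes, no) == label for (yes, no), label in zip(scores, labels)
    )
    return correct / len(labels)


# Hypothetical log-likelihoods, made up purely for illustration.
example_scores = [(-1.2, -2.5), (-3.0, -0.7), (-0.9, -1.1), (-2.2, -2.0)]
example_labels = [True, False, False, False]
accuracy = zero_shot_accuracy(example_scores, example_labels)
```

With the made-up inputs above, three of the four predictions match their labels.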

Reproduce result of Boolq on LLaMA-7B

It may be because we used this model: https://huggingface.co/huggyllama/llama-7b. I will try to rerun with the model you linked and see how it performs.

I just had success running `composer eval/eval.p YAML_NAME.yaml` with the following YAML:

```
seed: 1
max_seq_len: 1024
device_eval_batch_size: 4
fsdp_config:
  mixed_precision: PURE
  sharding_strategy: FULL_SHARD
icl_tasks:
- label: winograd
  dataset_uri: eval/local_data/winograd_wsc.jsonl...
```

Jeremy D

multilingual ability

Reproduce result of Boolq on LLaMA-7B

Reproduce result of Boolq on LLaMA-7B

Evaluation result mismatch

HumanEval Benchmark

Add FreebaseQA to tasks and gauntlet

Tessa/callibration script