Jeremy D
We haven't performed any multi-lingual evaluation yet. Are there any multi-lingual benchmarks you'd want us to evaluate on?
We found that the zero-shot performance of LLaMA on BoolQ was 0.767. Can you let me know how you produced this 0.62 number?
It may be because we used this model: https://huggingface.co/huggyllama/llama-7b. I will try to rerun with the model you linked and see how it performs.
I just had success running `composer eval/eval.py YAML_NAME.yaml` with the following YAML:
```
seed: 1
max_seq_len: 1024
device_eval_batch_size: 4
fsdp_config:
  mixed_precision: PURE
  sharding_strategy: FULL_SHARD
icl_tasks:
- label: winograd
  dataset_uri: eval/local_data/winograd_wsc.jsonl...
```
Hi! We are working on integrating HumanEval into the current coding suite. Thank you for your patience while we do so :)
Hi @moeiniamir, this dataset is very cool. Can you explain the MARLIN section a bit more?
Would you mind adding the MCLI name of a test run you launched so I can go back and `describe run` and view logs later? Additionally, a screenshot of the...