
When running HELM's xsum task with llama-7b, the accuracy in the local results comes out empty

oujieww opened this issue · 3 comments

1. Both full and h2o produce the result LaTeX normally (see below), but the local run comes out blank. [screenshots from the original post omitted]

Because HELM's evaluation code couldn't connect to the server, I modified it to the following:

    stats: List[Stat] = [
        # Metrics from the RealToxicityPrompts paper
        Stat(MetricName("expected_max_toxicity")).add(0),
        Stat(MetricName("max_toxicity_probability")).add(True),
        # Additional metrics we added
        Stat(MetricName("toxic_frac")).add(0),
    ]
    return stats
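
For context, a minimal sketch of what this edit amounts to: stubbing out HELM's toxicity metric so it returns fixed stats instead of calling PerspectiveAPI. The class name, imports, and method signature below are assumptions based on the snippet above and may differ across HELM versions.

    from typing import List

    from helm.benchmark.metrics.metric import Metric
    from helm.benchmark.metrics.metric_name import MetricName
    from helm.benchmark.metrics.statistic import Stat


    class OfflineToxicityMetric(Metric):
        # Hypothetical stand-in for HELM's ToxicityMetric: returns constant
        # stats so the run can finish without network access to PerspectiveAPI.
        def evaluate_generation(self, adapter_spec, request_state,
                                metric_service, eval_cache_path) -> List[Stat]:
            return [
                Stat(MetricName("expected_max_toxicity")).add(0),
                Stat(MetricName("max_toxicity_probability")).add(True),
                Stat(MetricName("toxic_frac")).add(0),
            ]

Note that this only placates the toxicity columns; it does not touch the summarization accuracy metrics.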

The local run looks no different from full and h2o. I saw no errors during the run, only some warnings. Log below:

import_results: Updating cache with requests and results {
  Wrote 1000 entries to cache at /home/UserData/H2O/h2o_hf/helm/prod_env/cache/together.sqlite.
} [0.679s]
Done.
/home/UserData/oj/anaconda3/envs/llm/lib/python3.9/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: /home/UserData/oj/anaconda3/envs/llm/lib/python3.9/site-packages/torchvision/image.so: undefined symbol: _ZN3c104cuda20CUDACachingAllocator12recordStreamERKNS_7DataPtrENS0_10CUDAStreamE
  warn(f"Failed to load image Python extension: {e}")
/home/UserData/oj/anaconda3/envs/llm/lib/python3.9/site-packages/torch/__init__.py:614: UserWarning: torch.set_default_tensor_type() is deprecated as of PyTorch 2.1, please use torch.set_default_dtype() and torch.set_default_device() as alternatives. (Triggered internally at ../torch/csrc/tensor/python_tensor.cpp:451.)
  _C._set_default_tensor_type(t)
main {
  Read 1 run entries from src/helm/benchmark/presentation/xsum/run_specs_llama.conf
  1 entries produced 1 run specs
  run_specs {
    RunSpec(name='summarization_xsum:temperature=0.3,device=cpu,model=together_gpt-neox-20b', scenario_spec=ScenarioSpec(class_name='helm.benchmark.scenarios.summarization_scenario.SummarizationScenario', args={'dataset_name': 'xsum-sampled', 'sampling_min_length': 50, 'sampling_max_length': 150, 'doc_max_length': 512}), adapter_spec=AdapterSpec(method='generation', global_prefix='', instructions='', input_prefix='###\nArticle: ', input_suffix='\n\n', reference_prefix='A. ', reference_suffix='\n', output_prefix='Summarize the above article in 1 sentence.\n', output_suffix='\n', instance_prefix='\n', substitutions=[], max_train_instances=5, max_eval_instances=100, num_outputs=1, num_train_trials=1, sample_train=True, model='together/gpt-neox-20b', temperature=0.3, max_tokens=64, stop_sequences=['###'], random=None), metric_specs=[MetricSpec(class_name='helm.benchmark.summarization_metrics.SummarizationMetric', args={'task': 'summarization_xsum_sampled', 'device': 'cpu'}), MetricSpec(class_name='helm.benchmark.basic_metrics.BasicMetric', args={'names': []}), MetricSpec(class_name='helm.benchmark.bias_metrics.BiasMetric', args={'mode': 'associations', 'demographic_category': 'race', 'target_category': 'adjective'}), MetricSpec(class_name='helm.benchmark.bias_metrics.BiasMetric', args={'mode': 'associations', 'demographic_category': 'race', 'target_category': 'profession'}), MetricSpec(class_name='helm.benchmark.bias_metrics.BiasMetric', args={'mode': 'associations', 'demographic_category': 'gender', 'target_category': 'adjective'}), MetricSpec(class_name='helm.benchmark.bias_metrics.BiasMetric', args={'mode': 'associations', 'demographic_category': 'gender', 'target_category': 'profession'}), MetricSpec(class_name='helm.benchmark.bias_metrics.BiasMetric', args={'mode': 'representation', 'demographic_category': 'race'}), MetricSpec(class_name='helm.benchmark.bias_metrics.BiasMetric', args={'mode': 'representation', 'demographic_category': 'gender'}), MetricSpec(class_name='helm.benchmark.toxicity_metrics.ToxicityMetric', args={})], data_augmenter_spec=DataAugmenterSpec(perturbation_specs=[], should_augment_train_instances=False, should_include_original_train=False, should_skip_unchanged_train=False, should_augment_eval_instances=False, should_include_original_eval=False, should_skip_unchanged_eval=False, seeds_per_instance=1), groups=['summarization_xsum'])
  } [0.0s]
  Running locally in root mode with local path: prod_env
  Created cache with config: SqliteCacheConfig(path='prod_env/cache/huggingface.sqlite')
  AutoClient: cache_path = prod_env/cache
  AutoClient: mongo_uri =
  Found 1 account(s).
  Created cache with config: SqliteCacheConfig(path='prod_env/cache/perspectiveapi.sqlite')
  0%| | 0/1 [00:00<?, ?it/s]
Running summarization_xsum:temperature=0.3,device=cpu,model=together_gpt-neox-20b {
  scenario.get_instances {
    ensure_file_downloaded {
      Not downloading https://worksheets.codalab.org/rest/bundles/0xcfbb0ef1226040f78e58060c9e4d13cf/contents/blob/ because benchmark_output/scenarios/summarization/data/xsum-sampled.pk already exists
    } [0.0s]
  } [1.169s]
  22691 instances, 25 train instances, 100/22666 eval instances
DataPreprocessor.preprocess { } [0.0s]
GenerationAdapter.adapt {
  125 instances, choosing 5/25 train instances, 100 eval instances
  Adapting with train_trial_index=0 {
    Sampled 5 examples for trial #0.
    Parallelizing computation on 100 items over 1 threads {
      Created cache with config: SqliteCacheConfig(path='prod_env/cache/EleutherAI.sqlite')
      Loading EleutherAI/gpt-neox-20b with Hugging Face Transformers { } [0.056s]
      The original constructed prompt exceeded the max context length. Removed 1 in-context examples to fit it within the context window.
      The original constructed prompt exceeded the max context length. Removed 1 in-context examples to fit it within the context window.
      100%|…| 100/100 [00:00<00:00, 275.11it/s]
    } [0.363s]
    Sample prompts {
      reference index = None, request_mode = None {
        ###
        Article: Almost one million people visited the city during the six-week festival period over Christmas and Hogmanay. Organisers said almost 890,000 people visited the Edinburgh's Christmas events in 2014/15, contributing £199.5m to the local economy. The three-day Hogmanay celebrations attracted more than 150,000 people, creating an economic impact of £41.8m. Charlie Wood, Edinburgh's Christmas festival director, said: "This is great news for Edinburgh. The revenue generated does not go to the events themselves, the event organisers or to Edinburgh city council. "This is money, which is going to the businesses of Edinburgh, be it retail, accommodation, food, drink, shopping and entertainment."

        Summarize the above article in 1 sentence.
        Edinburgh's winter festivals generated more than £241m for the city, according to organisers.
        
        ###
        Article: The 25-year-old, from North Ormesby, was shaping metal when a part from the press fell on his foot on 17 March. Teesside Magistrates' Court heard that SM Thompson Limited, of Middlesbrough, had allowed dangerous lifting practices to go unchecked over 10 years. The firm admitted a Health and Safety Executive (HSE) breach and was fined £7,500. It must also pay £1,120 costs. The hearing heard how the worker had to have the big toe on his left foot amputated and two other toes removed. He was in hospital for seven days but has since returned to work, the hearing heard. HSE inspector Paul Wilson said: "This worker's injuries need not have happened. "The failure of SM Thompson to look properly at the risks involved and then organise the lifting operation properly put staff at needless risk. "This sadly led to the painful and life-changing injuries suffered by this young man."
        
        Summarize the above article in 1 sentence.
        A Teesside steel firm has been fined after a worker was crushed by a press and had to have three toes amputated.
        
        ###
        Article: The colourful phenomenon was visible in Scotland and Northern Ireland, but was also spotted as far south as Anglesey in Wales and Staffordshire in England. Aurora Borealis occurs when electrically-charged particles from the sun enter the earth's atmosphere. Many people took to social media to share photographs of the dramatic show. Forecasters had predicted a solar storm and good conditions for Aurora Borealis, and sightings of green, pink, purple, red and yellow lights were reported for several hours from about 20:00 GMT. Gavin Chambers, an RSPB warden, tweeted pictures of vivid green in the sky over Lake Vyrnwy in Powys, Wales, saying: "Well worth getting back out of bed for!!" Donna Butcher tweeted: "Just been watching an amazing display of Aurora from Staveley, Cumbria. Shafts of light streaming directly towards Polaris." You can email your pictures and video to [email protected], and find out more about the Northern Lights here.
        
        Summarize the above article in 1 sentence.
        There have been spectacular displays of the Aurora Borealis - better known as the Northern Lights - across parts of the UK overnight.
        
        ###
        Article: The astronomer, who presented The Sky At Night for over 50 years, died at his home in Selsey, West Sussex, in December 2012. The monocle will be auctioned later at Christie's, in London. The xylophone - which he used during a Royal Variety Performance in front of the Queen - is to be sold at Henry Adams Auctioneers in Chichester. Sir Patrick presented the first edition of The Sky at Night on 24 April 1957. He became famous for his habit of wearing a monocle on screen, as well as his dishevelled and idiosyncratic persona. However, he was a celebrated and gifted astronomer and wrote dozens of books, with his research being used by the US and the Soviet Union in their space programmes. The monocle has a reserve price of £500 - £800 and the xylophone £1,500 - £2,000.
        
        Summarize the above article in 1 sentence.
        Sir Patrick Moore's famous monocle and his xylophone are due to be sold at two separate auctions.
        
        ###
        Article: Two-year-old Sophia set out across her family's property south-east of Melbourne at around 7:30pm on Tuesday. The dog, a one-year-old Australian sheepdog named Poppy, went with her. Police confirmed that Poppy's barking alerted rescuers to the whereabouts of the pair after a seven-hour search. Rescuers found Sophia and Poppy 200m from a dam on the family's property. The Australian Broadcasting Corp. quoted Sophia's grandmother, Vera Cook, who credited the dog with saving the toddler's life. She said that Sophia was wearing just a nappy and T-shirt when she wandered off. "The only thing I was thinking was well, hopefully the dog would have kept her warm," Ms Cook was quoted as saying. "If [the dog] wasn't with her, I don't know whether they would have found her." The family issued a statement thanking emergency services and promising that Poppy would be "well fed this evening".
        
        Summarize the above article in 1 sentence.
        A pet dog is being credited with keeping a little girl safe after the pair wandered away from their family home and could not be found for hours.
        
        ###
        Article: Head coach Stuart Lancaster's World Cup preparations suffered a blow as for the first 70 minutes a largely first-choice XV struggled to deal with French power. Two late tries flattered the visitors, who have one game left before launching their World Cup campaign against Fiji on 18 September. "We gave away penalties and our discipline was shocking," said Robshaw. "Whether it was rust, or nerves, it wasn't good enough. Credit to France, they put us under pressure and made us make mistakes. "We gave too many penalties away, but in the second half we came out and played well but couldn't quite get over the line in the end," he told Sky Sports. "We can't give teams like France and other quality sides head starts like we did. "We'll look long and hard at ourselves, because we let ourselves down in the first half. We played well in phases but you can't wait for 40 minutes to do that." Late tries from Danny Cipriani and Jonathan Joseph made it close on the scoreboard but Lancaster was left with much to ponder by a disappointing team display in the Stade de France. Media playback is not supported on this device The head coach, who must announce his final squad by 31 August, added: "We've got to get our discipline at the breakdown - we can't give France easy position like we did. We need to improve on that, because all the little mistakes add up. "The bench made a difference. It upped the energy levels and we scored some good tries. I won't gloss over what went on before that, because it was too little too late. "There are a few players who have given me food for thought, those guys who came on and gave us the energy we needed and made a difference. "I need to have a sleep on this game and think about my final squad. We're two weeks away from playing Ireland and four weeks away from Fiji in the World Cup and we'll expect a reaction next time." England host Ireland in their final World Cup warm-up game at Twickenham on 5 September. Fly-half Ford told Sky Sports: "I thought we might snatch that at the end but we had hardly any ball in the first half and gave away too many penalties. We played some good stuff in the second-half. "In the first half a lot of our undoing was down to ourselves. We just weren't good enough in that first half and there's no excuse for that. We let them build up the score and that made it hard for us. "It was frustrating and we had to think of ways to adapt and that was constantly going through our minds. We tried to get ourselves out of a hole. "We've got to turn up against Ireland now and make sure that we win. Our basics have got to be world class."
        
        Summarize the above article in 1 sentence.
      } [0.0s]
    } [0.0s]
  } [0.364s]
  100 requests
} [0.364s]
Executor.execute {
  Parallelizing computation on 100 items over 1 threads {
    Created cache with config: SqliteCacheConfig(path='prod_env/cache/together.sqlite')
    WARNING: truncate_sequence needs to strip "###"
    (the line above repeats 100 times, once per request; interleaved progress-bar fragments omitted)

100%|…| 100/100 [00:00<00:00, 534.64it/s]
  } [0.187s]
  Processed 100 requests
} [0.187s]
10 metrics {
  <helm.benchmark.metrics.summarization_metrics.SummarizationMetric object at 0x7f9241996070> {
    Parallelizing computation on 100 items over 1 threads {
      ensure_file_downloaded {
        Executing: wget https://worksheets.codalab.org/rest/bundles/0x3fb04ae3ae024c369d048f6c2cdf16cb/contents/blob/codalab_merged_results/xsum_0shots.csv -O benchmark_output/runs/xsum_llama7b_result_local/eval_cache/xsum_0shots.csv.tmp

--2023-11-02 17:38:00--  https://worksheets.codalab.org/rest/bundles/0x3fb04ae3ae024c369d048f6c2cdf16cb/contents/blob/codalab_merged_results/xsum_0shots.csv
Connecting to 192.168.3.206:20171... connected.
Proxy request sent, awaiting response... 200 OK
Syntax error in Set-Cookie: codalab_session=""; expires=Thu, 01 Jan 1970 00:00:00 GMT; Max-Age=-1; Path=/ at position 70.
Length: unspecified [text/csv]
Saving to: ‘benchmark_output/runs/xsum_llama7b_result_local/eval_cache/xsum_0shots.csv.tmp’

benchmark_output/runs/xsum [ <=> ] 3.18M 1.53MB/s in 2.1s

2023-11-02 17:38:06 (1.53 MB/s) - ‘benchmark_output/runs/xsum_llama7b_result_local/eval_cache/xsum_0shots.csv.tmp’ saved [3339418]

        Executing: mv benchmark_output/runs/xsum_llama7b_result_local/eval_cache/xsum_0shots.csv.tmp benchmark_output/runs/xsum_llama7b_result_local/eval_cache/xsum_0shots.csv
        Finished downloading https://worksheets.codalab.org/rest/bundles/0x3fb04ae3ae024c369d048f6c2cdf16cb/contents/blob/codalab_merged_results/xsum_0shots.csv to benchmark_output/runs/xsum_llama7b_result_local/eval_cache/xsum_0shots.csv
      } [5.723s]
      ensure_file_downloaded {
        Executing: wget https://worksheets.codalab.org/rest/bundles/0x3fb04ae3ae024c369d048f6c2cdf16cb/contents/blob/codalab_merged_results/xsum_5shots.csv -O benchmark_output/runs/xsum_llama7b_result_local/eval_cache/xsum_5shots.csv.tmp

--2023-11-02 17:38:06--  https://worksheets.codalab.org/rest/bundles/0x3fb04ae3ae024c369d048f6c2cdf16cb/contents/blob/codalab_merged_results/xsum_5shots.csv
Connecting to 192.168.3.206:20171... connected.
Proxy request sent, awaiting response... 200 OK
Syntax error in Set-Cookie: codalab_session=""; expires=Thu, 01 Jan 1970 00:00:00 GMT; Max-Age=-1; Path=/ at position 70.
Length: unspecified [text/csv]
Saving to: ‘benchmark_output/runs/xsum_llama7b_result_local/eval_cache/xsum_5shots.csv.tmp’

benchmark_output/runs/xsum [ <=> ] 6.74M 2.36MB/s in 2.9s

2023-11-02 17:38:11 (2.36 MB/s) - ‘benchmark_output/runs/xsum_llama7b_result_local/eval_cache/xsum_5shots.csv.tmp’ saved [7071368]

        Executing: mv benchmark_output/runs/xsum_llama7b_result_local/eval_cache/xsum_5shots.csv.tmp benchmark_output/runs/xsum_llama7b_result_local/eval_cache/xsum_5shots.csv
        Finished downloading https://worksheets.codalab.org/rest/bundles/0x3fb04ae3ae024c369d048f6c2cdf16cb/contents/blob/codalab_merged_results/xsum_5shots.csv to benchmark_output/runs/xsum_llama7b_result_local/eval_cache/xsum_5shots.csv
      } [5.163s]
      ensure_file_downloaded {
        Executing: wget https://worksheets.codalab.org/rest/bundles/0xf4de83c1f0d34d7999480223e8f5ab87/contents/blob/ -O benchmark_output/runs/xsum_llama7b_result_local/eval_cache/qafacteval.pk.tmp

--2023-11-02 17:38:12--  https://worksheets.codalab.org/rest/bundles/0xf4de83c1f0d34d7999480223e8f5ab87/contents/blob/
Connecting to 192.168.3.206:20171... connected.
Proxy request sent, awaiting response... 200 OK
Syntax error in Set-Cookie: codalab_session=""; expires=Thu, 01 Jan 1970 00:00:00 GMT; Max-Age=-1; Path=/ at position 70.
Length: unspecified [application/x-tex-pk]
Saving to: ‘benchmark_output/runs/xsum_llama7b_result_local/eval_cache/qafacteval.pk.tmp’

benchmark_output/runs/xsum [ <=>] 15.12M 321KB/s in 49s

2023-11-02 17:39:03 (316 KB/s) - ‘benchmark_output/runs/xsum_llama7b_result_local/eval_cache/qafacteval.pk.tmp’ saved [15851990]

        Executing: mv benchmark_output/runs/xsum_llama7b_result_local/eval_cache/qafacteval.pk.tmp benchmark_output/runs/xsum_llama7b_result_local/eval_cache/qafacteval.pk
        Finished downloading https://worksheets.codalab.org/rest/bundles/0xf4de83c1f0d34d7999480223e8f5ab87/contents/blob/ to benchmark_output/runs/xsum_llama7b_result_local/eval_cache/qafacteval.pk
      } [51.381s]

/home/UserData/oj/anaconda3/envs/llm/lib/python3.9/site-packages/spacy/pipeline/lemmatizer.py:211: UserWarning: [W108] The rule-based lemmatizer did not find POS annotation for one or more tokens. Check that your pipeline includes components that assign token.pos, typically 'tagger'+'attribute_ruler' or 'morphologizer'.
  warnings.warn(Warnings.W108)
      100%|…| 100/100 [01:05<00:00, 1.53it/s]
    } [1m5.327s]
  } [1m5.363s]
  BasicMetric() {
    Parallelizing computation on 100 items over 1 threads {
      100%|…| 100/100 [00:00<00:00, 293.11it/s]
    } [0.341s]
  } [0.463s]
  BiasMetric(mode=associations, demographic_category=race, target_category=adjective) {
    Parallelizing computation on 100 items over 1 threads {
      100%|…| 100/100 [00:00<00:00, 683111.40it/s]
    } [0.0s]
  } [0.579s]
  BiasMetric(mode=associations, demographic_category=race, target_category=profession) {
    Parallelizing computation on 100 items over 1 threads {
      100%|…| 100/100 [00:00<00:00, 641330.89it/s]
    } [0.0s]
  } [0.397s]
  BiasMetric(mode=associations, demographic_category=gender, target_category=adjective) {
    Parallelizing computation on 100 items over 1 threads {
      100%|…| 100/100 [00:00<00:00, 595781.82it/s]
    } [0.0s]
  } [0.409s]
  BiasMetric(mode=associations, demographic_category=gender, target_category=profession) {
    Parallelizing computation on 100 items over 1 threads {
      100%|…| 100/100 [00:00<00:00, 684225.77it/s]
    } [0.0s]
  } [0.284s]
  BiasMetric(mode=representation, demographic_category=race, target_category=None) {
    Parallelizing computation on 100 items over 1 threads {
      100%|…| 100/100 [00:00<00:00, 604366.57it/s]
    } [0.0s]
  } [0.011s]
  BiasMetric(mode=representation, demographic_category=gender, target_category=None) {
    Parallelizing computation on 100 items over 1 threads {
      100%|…| 100/100 [00:00<00:00, 686465.47it/s]
    } [0.0s]
  } [0.011s]
  ToxicityMetric() {
    Parallelizing computation on 100 items over 1 threads {
      100%|…| 100/100 [00:00<00:00, 24413.88it/s]
    } [0.004s]
  } [0.022s]
  TokensMetric() {
    Parallelizing computation on 100 items over 1 threads {
      100%|…| 100/100 [00:00<00:00, 66969.57it/s]
    } [0.001s]
  } [0.003s]
} [1m7.546s]
Generated 204 stats.
Writing 3017 characters to benchmark_output/runs/xsum_llama7b_result_local/summarization_xsum:temperature=0.3,device=cpu,model=together_gpt-neox-20b/run_spec.json
Writing 320 characters to benchmark_output/runs/xsum_llama7b_result_local/summarization_xsum:temperature=0.3,device=cpu,model=together_gpt-neox-20b/scenario.json
Writing 1096618 characters to benchmark_output/runs/xsum_llama7b_result_local/summarization_xsum:temperature=0.3,device=cpu,model=together_gpt-neox-20b/scenario_state.json
Writing 69947 characters to benchmark_output/runs/xsum_llama7b_result_local/summarization_xsum:temperature=0.3,device=cpu,model=together_gpt-neox-20b/stats.json
Writing 975977 characters to benchmark_output/runs/xsum_llama7b_result_local/summarization_xsum:temperature=0.3,device=cpu,model=together_gpt-neox-20b/per_instance_stats.json
CacheStats.print_status {
  prod_env/cache/EleutherAI.sqlite: 202 queries, 0 computes
  prod_env/cache/together.sqlite: 100 queries, 0 computes
} [0.0s]
} [1m10.051s]
100%|…| 1/1 [01:10<00:00, 70.05s/it]
Symlinking benchmark_output/runs/xsum_llama7b_result_local to latest.
Done.
} [1m10.069s]
(the torchvision and torch __init__ warnings from above are printed again here)
main {
  Reading schema from schema.yaml...
  Reading contamination information from contamination.yaml...
  validate_contamination { } [0.0s]
  100%|…| 1/1 [00:00<00:00, 4.44it/s]
  Summarizer.check_metrics_defined { } [0.0s]
  Summarizer.write_executive_summary {
    Writing 66 characters to benchmark_output/runs/xsum_llama7b_result_local/summary.json
  } [0.0s]
  Writing 87652 characters to benchmark_output/runs/xsum_llama7b_result_local/runs.json
  Writing 3249 characters to benchmark_output/runs/xsum_llama7b_result_local/run_specs.json
  Writing 8197 characters to benchmark_output/runs/xsum_llama7b_result_local/groups.json
  Writing 28421 characters to benchmark_output/runs/xsum_llama7b_result_local/groups_metadata.json
  WARNING: run spec summarization_xsum:temperature=0.3,device=cpu,model=together_gpt-neox-20b does not have any stat matched by MetricNameMatcher(name='summac', split='test', sub_split=None, perturbation_name=None), 0 near misses matching just the name
  WARNING: run spec summarization_xsum:temperature=0.3,device=cpu,model=together_gpt-neox-20b does not have any stat matched by MetricNameMatcher(name='QAFactEval', split='test', sub_split=None, perturbation_name=None), 0 near misses matching just the name
  WARNING: run spec summarization_xsum:temperature=0.3,device=cpu,model=together_gpt-neox-20b does not have any stat matched by MetricNameMatcher(name='BERTScore-F', split='test', sub_split=None, perturbation_name=None), 0 near misses matching just the name
  WARNING: run spec summarization_xsum:temperature=0.3,device=cpu,model=together_gpt-neox-20b does not have any stat matched by MetricNameMatcher(name='HumanEval-faithfulness', split='test', sub_split=None, perturbation_name=None), 0 near misses matching just the name
  WARNING: run spec summarization_xsum:temperature=0.3,device=cpu,model=together_gpt-neox-20b does not have any stat matched by MetricNameMatcher(name='HumanEval-relevance', split='test', sub_split=None, perturbation_name=None), 0 near misses matching just the name
  WARNING: run spec summarization_xsum:temperature=0.3,device=cpu,model=together_gpt-neox-20b does not have any stat matched by MetricNameMatcher(name='HumanEval-coherence', split='test', sub_split=None, perturbation_name=None), 0 near misses matching just the name
  Writing 265 characters to benchmark_output/runs/xsum_llama7b_result_local/groups/latex/core_scenarios_accuracy.tex
  Writing 10126 characters to benchmark_output/runs/xsum_llama7b_result_local/groups/json/core_scenarios_accuracy.json
  Writing 250 characters to benchmark_output/runs/xsum_llama7b_result_local/groups/latex/core_scenarios_calibration.tex
  Writing 10750 characters to benchmark_output/runs/xsum_llama7b_result_local/groups/json/core_scenarios_calibration.json
  Writing 248 characters to benchmark_output/runs/xsum_llama7b_result_local/groups/latex/core_scenarios_robustness.tex
  Writing 11405 characters to benchmark_output/runs/xsum_llama7b_result_local/groups/json/core_scenarios_robustness.json
  Writing 244 characters to benchmark_output/runs/xsum_llama7b_result_local/groups/latex/core_scenarios_fairness.tex
  Writing 11285 characters to benchmark_output/runs/xsum_llama7b_result_local/groups/json/core_scenarios_fairness.json
  Writing 289 characters to benchmark_output/runs/xsum_llama7b_result_local/groups/latex/core_scenarios_efficiency.tex
  Writing 12299 characters to benchmark_output/runs/xsum_llama7b_result_local/groups/json/core_scenarios_efficiency.json
  Writing 441 characters to benchmark_output/runs/xsum_llama7b_result_local/groups/latex/core_scenarios_general_information.tex
  Writing 52038 characters to benchmark_output/runs/xsum_llama7b_result_local/groups/json/core_scenarios_general_information.json
  Writing 236 characters to benchmark_output/runs/xsum_llama7b_result_local/groups/latex/core_scenarios_bias.tex
  Writing 45710 characters to benchmark_output/runs/xsum_llama7b_result_local/groups/json/core_scenarios_bias.json
  Writing 272 characters to benchmark_output/runs/xsum_llama7b_result_local/groups/latex/core_scenarios_toxicity.tex
  Writing 8224 characters to benchmark_output/runs/xsum_llama7b_result_local/groups/json/core_scenarios_toxicity.json
  Writing 394 characters to benchmark_output/runs/xsum_llama7b_result_local/groups/latex/core_scenarios_summarization_metrics.tex
  Writing 12239 characters to benchmark_output/runs/xsum_llama7b_result_local/groups/json/core_scenarios_summarization_metrics.json
  Writing 181624 characters to benchmark_output/runs/xsum_llama7b_result_local/groups/core_scenarios.json
  (the same six MetricNameMatcher WARNING lines are printed again here)
  Writing 263 characters to benchmark_output/runs/xsum_llama7b_result_local/groups/latex/summarization_accuracy.tex
  Writing 2117 characters to benchmark_output/runs/xsum_llama7b_result_local/groups/json/summarization_accuracy.json
  Writing 392 characters to benchmark_output/runs/xsum_llama7b_result_local/groups/latex/summarization_summarization_metrics.tex
  Writing 12237 characters to benchmark_output/runs/xsum_llama7b_result_local/groups/json/summarization_summarization_metrics.json
  Writing 234 characters to benchmark_output/runs/xsum_llama7b_result_local/groups/latex/summarization_bias.tex
  Writing 8324 characters to benchmark_output/runs/xsum_llama7b_result_local/groups/json/summarization_bias.json
  Writing 270 characters to benchmark_output/runs/xsum_llama7b_result_local/groups/latex/summarization_toxicity.tex
  Writing 2151 characters to benchmark_output/runs/xsum_llama7b_result_local/groups/json/summarization_toxicity.json
  Writing 287 characters to benchmark_output/runs/xsum_llama7b_result_local/groups/latex/summarization_efficiency.tex
  Writing 2371 characters to benchmark_output/runs/xsum_llama7b_result_local/groups/json/summarization_efficiency.json
  Writing 439 characters to benchmark_output/runs/xsum_llama7b_result_local/groups/latex/summarization_general_information.tex
  Writing 7656 characters to benchmark_output/runs/xsum_llama7b_result_local/groups/json/summarization_general_information.json
  Writing 36614 characters to benchmark_output/runs/xsum_llama7b_result_local/groups/summarization.json
  (the same six MetricNameMatcher WARNING lines are printed a third time here)
  Writing 761 characters to benchmark_output/runs/xsum_llama7b_result_local/groups/latex/summarization_xsum_summarization_xsum_dataset_name:xsum-sampled,sampling_min_length:50,sampling_max_length:150,doc_max_length:512.tex
  Writing 15264 characters to benchmark_output/runs/xsum_llama7b_result_local/groups/json/summarization_xsum_summarization_xsum_dataset_name:xsum-sampled,sampling_min_length:50,sampling_max_length:150,doc_max_length:512.json
  Writing 15996 characters to benchmark_output/runs/xsum_llama7b_result_local/groups/summarization_xsum.json
  Summarizer.write_cost_report {
    Writing 171 characters to benchmark_output/runs/xsum_llama7b_result_local/costs.json
  } [0.0s]
  Parallelizing computation on 1 items over 8 threads {
    write_run_display_json {
      Writing 230464 characters to benchmark_output/runs/xsum_llama7b_result_local/summarization_xsum:temperature=0.3,device=cpu,model=together_gpt-neox-20b/instances.json
      Writing 71805 characters to benchmark_output/runs/xsum_llama7b_result_local/summarization_xsum:temperature=0.3,device=cpu,model=together_gpt-neox-20b/display_predictions.json
      Writing 759532 characters to benchmark_output/runs/xsum_llama7b_result_local/summarization_xsum:temperature=0.3,device=cpu,model=together_gpt-neox-20b/display_requests.json
    } [0.148s]
    100%|…| 1/1 [00:00<00:00, 6.72it/s]
  } [0.149s]
  Done.
} [1.058s]
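
Judging from the Summarizer warnings above, the accuracy table comes out empty because no stats named summac, QAFactEval, BERTScore-F, or HumanEval-* were produced for this run, so the Summarizer leaves those cells blank. A quick way to see which stats a run did produce is to read its stats.json (a diagnostic sketch using paths from the log above; the nested "name" layout reflects how HELM serializes Stat objects, but verify against your own stats.json):

    import json
    from collections import Counter

    run_dir = (
        "benchmark_output/runs/xsum_llama7b_result_local/"
        "summarization_xsum:temperature=0.3,device=cpu,model=together_gpt-neox-20b"
    )

    with open(f"{run_dir}/stats.json") as f:
        stats = json.load(f)

    # Tally stats by metric name; names the schema expects (e.g. 'summac',
    # 'QAFactEval', 'BERTScore-F') being absent matches the WARNINGs above.
    counts = Counter(stat["name"]["name"] for stat in stats)
    for name, count in sorted(counts.items()):
        print(f"{name}: {count}")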

oujieww · Nov 02 '23

Thanks for the question.

We simplified the HELM evaluation code. Please check the summarization benchmarking script, scripts/summarization/eval.sh.

Kyriection · Dec 02 '23

Hi, why does the model shown in the LaTeX file say gpt-neox-20b? Aren't these the llama-7b results?

SherrySwift · Feb 02 '24

Hi, that's because the model name registered in https://github.com/FMInference/H2O/blob/main/h2o_hf/data/xsum.jsonl is GPT-NeoX-20B, but it doesn't affect the final results. When extracting the data from HELM for local evaluation, we used the gpt-neox-20b template to get the raw data. Since only the inputs are used for local evaluation, the model name doesn't affect the generated content. Thanks!
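
For example, a quick sanity check of the registered name versus the actual prompt (field names are assumptions; inspect the real keys in xsum.jsonl rather than relying on this sketch):

    import json

    # Peek at the first record of the extracted HELM data.
    with open("h2o_hf/data/xsum.jsonl") as f:
        record = json.loads(f.readline())

    # A model-name field (if present) may read 'gpt-neox-20b', but only the
    # input/prompt text is fed to llama-7b during local evaluation, so the
    # registered name is metadata and does not change the generations.
    print(sorted(record.keys()))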

Kyriection · Feb 04 '24