
Failure to run eval script

Open · kyduff opened this issue 2 years ago · 1 comment

I'm trying to do a smoke test for your eval suite but cannot get the script to run properly.

I've followed the setup instructions:

  1. run make in the root directory
  2. add nl2bash to PYTHONPATH
  3. run make data in scripts
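For reference, the setup steps above amount to the following (the checkout path is my local layout, so treat it as an assumption):

```shell
# Sketch of the setup steps; /workspace/sempar/nl2bash is my own checkout path.
cd /workspace/sempar/nl2bash
make                                  # step 1: build in the root directory
export PYTHONPATH="$PWD:$PYTHONPATH"  # step 2: add nl2bash to PYTHONPATH
cd scripts
make data                             # step 3: prepare the filtered data files
```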

From here I attempt to confirm that the dev set evaluates perfectly against itself: from scripts I run

./bash-run.sh --data bash --prediction_file ../data/bash/dev.cm.filtered --eval

This produces the following stdout and traceback:

Reading data from /workspace/sempar/nl2bash/encoder_decoder/../data/bash
Saving models to /workspace/sempar/nl2bash/encoder_decoder/../model/seq2seq
Loading data from /workspace/sempar/nl2bash/encoder_decoder/../data/bash
source file: /workspace/sempar/nl2bash/encoder_decoder/../data/bash/train.nl.filtered
target file: /workspace/sempar/nl2bash/encoder_decoder/../data/bash/train.cm.filtered
9985 data points read.
source file: /workspace/sempar/nl2bash/encoder_decoder/../data/bash/dev.nl.filtered
target file: /workspace/sempar/nl2bash/encoder_decoder/../data/bash/dev.cm.filtered
782 data points read.
source file: /workspace/sempar/nl2bash/encoder_decoder/../data/bash/test.nl.filtered
target file: /workspace/sempar/nl2bash/encoder_decoder/../data/bash/test.cm.filtered
779 data points read.
(Auto) evaluating ../data/bash/dev.cm.filtered
782 predictions loaded from ../data/bash/dev.cm.filtered
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/opt/conda/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/workspace/sempar/nl2bash/encoder_decoder/translate.py", line 378, in <module>
    tf.compat.v1.app.run()
  File "/workspace/sempar/sempar.env/lib/python3.7/site-packages/tensorflow/python/platform/app.py", line 36, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "/workspace/sempar/sempar.env/lib/python3.7/site-packages/absl/app.py", line 312, in run
    _run_main(main, args)
  File "/workspace/sempar/sempar.env/lib/python3.7/site-packages/absl/app.py", line 258, in _run_main
    sys.exit(main(argv))
  File "/workspace/sempar/nl2bash/encoder_decoder/translate.py", line 301, in main
    eval(dataset, FLAGS.prediction_file)
  File "/workspace/sempar/nl2bash/encoder_decoder/translate.py", line 176, in eval
    return eval_tools.automatic_eval(prediction_path, dataset, top_k=3, FLAGS=FLAGS, verbose=verbose)
  File "/workspace/sempar/nl2bash/eval/eval_tools.py", line 246, in automatic_eval
    "{} vs. {}".format(len(grouped_dataset), len(prediction_list)))
ValueError: ground truth and predictions length must be equal: 701 vs. 782

You can see it's evaluating against a ground-truth set containing only 701 bash utterances, even though data_utils.load_data successfully read 782 examples from the dev set. Do you know why this is happening?
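My guess is that the mismatch comes from the grouping step: if the ground truth is grouped by identical NL utterance while predictions are counted per line, duplicates collapse and the two lengths diverge. A minimal, self-contained sketch of that effect (the function below is a hypothetical stand-in, not the repo's actual group_parallel_data):

```python
from collections import OrderedDict

def group_parallel_data(pairs):
    """Group (nl, cmd) pairs by identical NL utterance, preserving order."""
    groups = OrderedDict()
    for nl, cmd in pairs:
        groups.setdefault(nl, []).append(cmd)
    return groups

# Toy dev set: two of the four examples share the same NL description.
dev = [
    ("list files", "ls"),
    ("list files", "ls -a"),
    ("show date", "date"),
    ("print cwd", "pwd"),
]
grouped = group_parallel_data(dev)
print(len(dev), len(grouped))  # 4 raw pairs collapse to 3 groups
```

The same collapse from 782 raw pairs to 701 NL groups would explain the ValueError above.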

(If it helps, I'm on Python 3.7.11 running on a fresh install of Ubuntu 18.04.6.)

kyduff avatar Jun 09 '22 11:06 kyduff

Update

I've overridden encoder_decoder.data_utils.group_parallel_data so it no longer aggregates examples with matching NL utterances, which makes the dataset sizes match. This lets the eval logic run, but the code then trips the following assertion in eval.token_based.corpus_bleu_score:

assert(loose_constraints or node.get_num_of_children() == 1)

The root nodes frequently have more than one child in the dev set. I changed the invocation of data_tools.bash_tokenizer to set the loose_constraints flag to True, which sidesteps the assertion error.
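To illustrate what the flag changes, here is a minimal sketch with a hypothetical Node class (not the repo's AST types): the assertion only fires on a multi-child root when the constraints are strict.

```python
class Node:
    """Toy stand-in for an AST node with child accounting."""
    def __init__(self, children=None):
        self.children = children or []

    def get_num_of_children(self):
        return len(self.children)

def check_root(node, loose_constraints=False):
    # Mirrors the failing assertion: a strict check demands exactly one child.
    assert loose_constraints or node.get_num_of_children() == 1

# A root with two children, e.g. a command list or pipeline.
root = Node(children=[Node(), Node()])
check_root(root, loose_constraints=True)   # passes
try:
    check_root(root, loose_constraints=False)
except AssertionError:
    print("assertion fires with strict constraints")
```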

I'm not sure if I am taking on undesirable assumptions by making this change. Is there a reason the ground truth ASTs break the tokenizer when loose_constraints is left False?

kyduff avatar Jun 17 '22 12:06 kyduff