MathVista: Add .devcontainer, update GPT to use OpenAI >1.x, make Claude and Bard imports dynamic and optional, use HuggingFace datasets

Open mattmazzola opened this issue 11 months ago • 0 comments

I was working closely with this repo a few weeks ago and thought I would contribute some of my modifications back so that others can benefit.

Issues

The installation and setup of the repo weren't explicitly specified. See https://github.com/lupantech/MathVista/issues/13#issuecomment-1936732324
The code in the repo was still set up to use locally downloaded data, but the data is now available on HuggingFace
1. The HuggingFace dataset has all splits, is easier to manage, and hides the data-management problem from the developer
The gpt.py model was using an old version of the openai library
The Bard and Claude libraries were supposed to be optional but were not
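
For reference, loading the HuggingFace copy can look roughly like the sketch below. The dataset name AI4Math/MathVista and the testmini split are the ones that appear in the sample output further down; the helper name load_mathvista is hypothetical, and the datasets package is assumed to be installed.

```python
def load_mathvista(split="testmini"):
    """Load one split of MathVista from the HuggingFace Hub.

    Hypothetical helper for illustration; needs the `datasets` package
    and network access (or a populated local cache) when called.
    """
    # Imported inside the function so that merely importing this module
    # does not require `datasets` to be installed.
    from datasets import load_dataset

    return load_dataset("AI4Math/MathVista", split=split)


if __name__ == "__main__":
    dataset = load_mathvista()
    print(f"Loaded {len(dataset)} problems")
```

Because the split is just an argument, the same call covers every split without any manual download step.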

Solutions

Use .devcontainer to standardize the development environment and dependency installation
Change the evaluation files to all use the dataset from HuggingFace
Update the GPT file to use the newer OpenAI library (>1.x), with environment variables for Azure OpenAI
Import claude, openai, and bard dynamically, only when that model type is chosen
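
A minimal sketch of the dynamic-import idea, not the exact code in this change: each client is built by a small factory that imports its library only when called, so a missing optional package only matters if that model type is requested. The function names, the registry, and the environment variable names are assumptions for illustration. AzureOpenAI is the client class in openai>=1.0, and the use of the anthropic package for Claude is also an assumption. A Bard factory would follow the same pattern.

```python
import os
from typing import Any, Callable, Dict


def build_azure_openai_client() -> Any:
    # Lazy import: only needed when a GPT model is selected (openai>=1.0).
    from openai import AzureOpenAI

    # Environment variable names are assumptions, not confirmed by this change.
    return AzureOpenAI(
        api_key=os.environ["AZURE_OPENAI_API_KEY"],
        azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
        api_version=os.environ["OPENAI_API_VERSION"],
    )


def build_claude_client() -> Any:
    # Lazy import: only needed when a Claude model is selected.
    from anthropic import Anthropic

    return Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])


# Hypothetical registry mapping a model type to its factory.
CLIENT_FACTORIES: Dict[str, Callable[[], Any]] = {
    "gpt": build_azure_openai_client,
    "claude": build_claude_client,
}


def load_client(model_type: str, factories: Dict[str, Callable[[], Any]] = CLIENT_FACTORIES) -> Any:
    """Build the client for the chosen model type, importing its library on demand."""
    if model_type not in factories:
        raise ValueError(f"Unknown model type: {model_type!r}")
    try:
        return factories[model_type]()
    except ImportError as exc:
        raise ImportError(
            f"Model type {model_type!r} needs an optional package that is not installed ({exc})"
        ) from exc
```

With this shape, running the GPT path never touches the Claude or Bard dependencies, and vice versa.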

Other Misc

Use proper logging with rich formatting
Separate metrics calculation from logging
Use pandas DataFrame for metric printing in nicer tables (See below)
Add ability to limit all steps of evaluation (generate, extract, calculate) to a max number of problems
1. Allows easier testing of functionality on small subsets
Remove duplicate definitions of get_chat_response (Fixes: #16)
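
To illustrate the split between calculation and presentation, here is a rough sketch under stated assumptions. The function names are hypothetical, and the true_false field name is assumed for the per-problem correctness flag. The calculation is plain Python; pandas is only used by the function that renders the table.

```python
from collections import defaultdict
from itertools import islice


def limit_problems(problems, max_num_problems=None):
    """Return at most max_num_problems items (all of them when the limit is None)."""
    if max_num_problems is None:
        return list(problems)
    return list(islice(problems, max_num_problems))


def accuracy_by_group(problems, key):
    """Pure calculation: count correct/total per group, with no printing or logging."""
    counts = defaultdict(lambda: {"correct": 0, "total": 0})
    for problem in problems:
        group = counts[problem[key]]
        group["total"] += 1
        group["correct"] += int(problem["true_false"])
    return {
        name: {**c, "accuracy": c["correct"] / c["total"]}
        for name, c in counts.items()
    }


def metrics_table(metrics):
    """Presentation only: render already-computed metrics as a text table."""
    import pandas as pd  # imported here so the calculation has no pandas dependency

    rows = {
        name: {
            "Accuracy": f"{m['accuracy']:.2%}",
            "Correct/Total": f"({m['correct']}/{m['total']})",
        }
        for name, m in metrics.items()
    }
    return pd.DataFrame.from_dict(rows, orient="index").to_string()
```

A caller would compute metrics = accuracy_by_group(limit_problems(problems, 20), "question_type") and then pass metrics_table(metrics) to the logger, so the numbers can also be saved or tested without any output formatting involved.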

Sample Output

Generate Responses

[18:09:52] INFO     [root] MathVista: Generating Responses - Start                                              
[18:09:52] INFO     [root] Loading dataset AI4Math/MathVista, split testmini...                                 
[18:10:01] INFO     [root] Creating new query...                                                                
[18:10:01] INFO     [root] Loading gpt-4-32k...                                                                 
[18:10:01] INFO     [root] Model loaded.                                                                        
[18:10:01] INFO     [root] Results already exist.                                                               
[18:10:01] INFO     [root] Reading _results/eval/mathvista/gpt4/debug/gpt4.json...                              
[18:10:01] WARNING  [root] Limiting number of problems to 20.                                                   
[18:10:01] INFO     [root] Number of test problems to run: 20                                                   
  0%|                                                                                    | 0/20 [00:00<?, ?it/s][18:10:01] DEBUG    [root] --------------------------------------------------------------                       
[18:10:01] DEBUG    [root] Generating response for problem: 1...                                            
[18:10:14] DEBUG    [root] Query:                                                                               
                    Question: When a spring does work on an object, we cannot find the work by simply           
                    multiplying the spring force by the object's displacement. The reason is that there is no   
                    one value for the force-it changes. However, we can split the displacement up into an       
                    infinite number of tiny parts and then approximate the force in each as being constant.     
                    Integration sums the work done in all those parts. Here we use the generic result of the    
                    integration.                                                                                
                                                                                                                
                    In Figure, a cumin canister of mass $m=0.40 \mathrm{~kg}$ slides across a horizontal        
                    frictionless counter with speed $v=0.50 \mathrm{~m} / \mathrm{s}$. It then runs into and    
                    compresses a spring of spring constant $k=750 \mathrm{~N} / \mathrm{m}$. When the canister  
                    is momentarily stopped by the spring, by what distance $d$ is the spring compressed?        
                    Hint: Please answer the question requiring a floating-point number with one decimal place   
                    and provide the final value, e.g., 1.2, 1.3, 1.4, at the end.                               
                    Solution:                                                                                   
[18:10:14] DEBUG    [root] Response:                                                                            
                    The spring does work on the canister, bringing it to rest. The work done by the spring is   
                    equal to the kinetic energy of the canister before it hits the spring. The work done by the 
                    spring is given by the equation $W = \frac{1}{2}kx^2$, where $x$ is the distance the spring 
                    is compressed. The kinetic energy of the canister is given by the equation $KE =            
                    \frac{1}{2}mv^2$. Setting these two equal to each other gives:                              
                                                                                                                
                    $\frac{1}{2}kx^2 = \frac{1}{2}mv^2$                                                         
                                                                                                                
                    Solving for $x$ gives:                                                                      
                                                                                                                
                    $x = \sqrt{\frac{mv^2}{k}}$                                                                 
                                                                                                                
                    Substituting the given values gives:                                                        
                                                                                                                
                    $x = \sqrt{\frac{(0.40 \mathrm{~kg})(0.50 \mathrm{~m/s})^2}{750 \mathrm{~N/m}}}$            
                                                                                                                
                    $x = 0.01 \mathrm{~m}$                                                                      
                                                                                                                
                    So, the spring is compressed by a distance of 0.01 m or 1.0 cm.                             
  5%|███▊                                                                        | 1/20 [00:13<04:08, 13.05s/it][18:10:14] DEBUG    [root] --------------------------------------------------------------                       
[18:10:14] DEBUG    [root] Generating response for problem: 2...      
...
[18:11:18] DEBUG    [root] Query:                                                                               
                    Question: Is the sum of smallest two bar is greater then the largest bar?                   
                    Choices:                                                                                    
                    (A) Yes                                                                                     
                    (B) No                                                                                      
                    Hint: Please answer the question and provide the correct option letter, e.g., A, B, C, D, at
                    the end.                                                                                    
                    Solution:                                                                                   
[18:11:18] DEBUG    [root] Response:                                                                            
                    The question does not provide enough information for a solution. It refers to "bars" but    
                    does not specify their sizes or quantities.                                                 
[18:11:18] INFO     [root] Saved results to _results/eval/mathvista/gpt4/debug/gpt4.json                        
100%|███████████████████████████████████████████████████████████████████████████| 20/20 [01:17<00:00,  3.89s/it]
[18:11:18] INFO     [root] MathVista: Generating Responses - Finish
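
The timestamp, level, and [root] prefix in the lines above are the kind of output rich's logging handler gives. A minimal sketch of such a setup follows; it assumes the rich package is available and falls back to a plain stream handler otherwise. The function name configure_logging is hypothetical and the exact format string used in this change may differ.

```python
import logging


def configure_logging(level="INFO"):
    """Configure the root logger, using rich's handler when it is installed."""
    try:
        from rich.logging import RichHandler

        handler = RichHandler()
    except ImportError:
        # rich is optional in this sketch; fall back to standard output formatting.
        handler = logging.StreamHandler()

    logging.basicConfig(
        level=level,
        format="[%(name)s] %(message)s",
        datefmt="[%X]",
        handlers=[handler],
        force=True,  # replace any handlers configured earlier (Python 3.8+)
    )
    return logging.getLogger()


if __name__ == "__main__":
    logger = configure_logging("DEBUG")
    logger.info("MathVista: Generating Responses - Start")
    logger.warning("Limiting number of problems to 20.")
```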

Extract Answer

[18:16:09] INFO     [root] MathVista: Extract Answers - Start                                                   
[18:16:09] INFO     [root] Reading _results/eval/mathvista/gpt4/debug/gpt4.json...                              
[18:16:09] INFO     [root] Number of test problems to run: 20                                                   
 95%|███████████████████████████████████████████████████████████████████████▎   | 19/20 [00:35<00:02,  2.91s/it][18:16:46] INFO     [root] Saved results to _results/eval/mathvista/gpt4/debug/gpt4.json                        
100%|███████████████████████████████████████████████████████████████████████████| 20/20 [00:37<00:00,  1.86s/it]
[18:16:46] INFO     [root] MathVista: Extract Answers - Finish

Calculate Score

[18:21:17] INFO     [root] MathVista: Calculating Scores - Start                                                
[18:21:17] INFO     [root] Loading dataset AI4Math/MathVista, split testmini...                                 
[18:21:25] INFO     [root] Reading _results/eval/mathvista/gpt4/debug/gpt4.json...                              
[18:21:25] INFO     [root] Number of testing problems: 20                                                       
[18:21:25] INFO     [root] For each problem normalize extractions and get True False value                      
100%|████████████████████████████████████████████████████████████████████████| 20/20 [00:00<00:00, 34735.44it/s]
[18:21:25] INFO     [root] Calculate the average accuracy                                                       
100%|███████████████████████████████████████████████████████████████████████| 20/20 [00:00<00:00, 353949.70it/s]
/workspaces/MathVista/evaluation/calculate_score.py:249: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
  values += results_df[key][i]
[18:21:25] INFO     [root] Correct: 8/20 - Accuracy: 40.00%                                                     
                    ========================================                                                    
                                                                                                                
                    question_type                                                                               
                    ========================================                                                    
                                 Accuracy Correct/Total                                                         
                    multi_choice   61.54%        (8/13)                                                         
                    free_form       0.00%         (0/7)                                                         
                                                                                                                
                    answer_type                                                                                 
                    ========================================                                                    
                            Accuracy Correct/Total                                                              
                    text      61.54%        (8/13)                                                              
                    float      0.00%         (0/1)                                                              
                    integer    0.00%         (0/6)                                                              
                                                                                                                
                    language                                                                                    
                    ========================================                                                    
                            Accuracy Correct/Total                                                              
                    chinese   66.67%         (2/3)                                                              
                    english   35.29%        (6/17)                                                              
                                                                                                                
                    source                                                                                      
                    ========================================                                                    
                                Accuracy Correct/Total                                                          
                    UniGeo       100.00%         (1/1)                                                          
                    Super-CLEVR  100.00%         (3/3)                                                          
                    TQA          100.00%         (1/1)                                                          
                    ScienceQA    100.00%         (1/1)                                                          
                    GeoQA+        66.67%         (2/3)                                                          
                    SciBench       0.00%         (0/1)                                                          
                    TextVQA        0.00%         (0/2)                                                          
                    CLEVR-Math     0.00%         (0/2)                                                          
                    Geometry3K     0.00%         (0/1)                                                          
                    IconQA         0.00%         (0/1)                                                          
                    IQTest         0.00%         (0/1)                                                          
                    DVQA           0.00%         (0/2)                                                          
                    ChartQA        0.00%         (0/1)                                                          
                                                                                                                
                    category                                                                                    
                    ========================================                                                    
                                      Accuracy Correct/Total                                                    
                    general-vqa         50.00%        (5/10)                                                    
                    math-targeted-vqa   30.00%        (3/10)                                                    
                                                                                                                
                    task                                                                                        
                    ========================================                                                    
                                                Accuracy Correct/Total                                          
                    textbook question answering   66.67%         (2/3)                                          
                    visual question answering     60.00%         (3/5)                                          
                    geometry problem solving      60.00%         (3/5)                                          
                    math word problem              0.00%         (0/3)                                          
                    figure question answering      0.00%         (0/4)                                          
                                                                                                                
                    context                                                                                     
                    ========================================                                                    
                                      Accuracy Correct/Total                                                    
                    geometry diagram    60.00%         (3/5)                                                    
                    synthetic scene     60.00%         (3/5)                                                    
                    scientific figure   50.00%         (1/2)                                                    
                    natural image       33.33%         (1/3)                                                    
                    abstract scene       0.00%         (0/1)                                                    
                    puzzle test          0.00%         (0/1)                                                    
                    bar chart            0.00%         (0/3)                                                    
                                                                                                                
                    grade                                                                                       
                    ========================================                                                    
                                      Accuracy Correct/Total                                                    
                    high school         66.67%         (4/6)                                                    
                    daily life          37.50%         (3/8)                                                    
                    elementary school   20.00%         (1/5)                                                    
                    college              0.00%         (0/1)                                                    
                                                                                                                
                    skills                                                                                      
                    ========================================                                                    
                                          Accuracy Correct/Total                                                
                    scientific reasoning    66.67%         (2/3)                                                
                    algebraic reasoning     60.00%         (3/5)                                                
                    geometry reasoning      50.00%         (3/6)                                                
                    arithmetic reasoning    42.86%         (3/7)                                                
                    statistical reasoning    0.00%         (0/3)                                                
                    numeric commonsense      0.00%         (0/3)                                                
                    logical reasoning        0.00%         (0/1)                                                
                                                                                                                
[18:21:25] INFO     [root] Saved scores to: _results/eval/mathvista/gpt4/debug/gpt4_metric.json                 
[18:21:25] INFO     [root] MathVista: Calculating Scores - Finish

Mar 03 '24 19:03 mattmazzola
