MathVista
Add .devcontainer, update GPT to use OpenAI >1.x, make Claude and Bard imports dynamic and optional, use HuggingFace datasets
I worked closely with this repo a few weeks ago and would like to contribute some of my modifications back so that others can benefit.
Issues
- The installation and setup of the repo weren't explicitly documented. See https://github.com/lupantech/MathVista/issues/13#issuecomment-1936732324
- The code in the repo was still set up to use locally downloaded data, but the data is now available on HuggingFace
  - The HuggingFace dataset has all splits, is easier to manage, and abstracts data management away from the developer
- The gpt.py model was using an old version of the openai library
- The Bard and Claude libraries were supposed to be optional but were not
Solutions
- Use .devcontainer to standardize development environment and dependency installation
- Change all evaluation files to use the dataset from HuggingFace
- Update the GPT file to use the newer OpenAI library, with environment variables for Azure OpenAI
- Make the imports of claude, openai and bard dynamic, so each is imported only if that model type is chosen
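The optional-import and environment-variable changes could look roughly like the sketch below. It illustrates the idea under assumptions and is not the code in this PR: the `REQUIRED_PACKAGE` mapping, the function names, and the package names `anthropic` and `bardapi` are hypothetical, and the environment variable names are the ones the `openai` >1.x `AzureOpenAI` client reads by default, which may differ from what `gpt.py` actually uses.

```python
import importlib
import os

# Hypothetical mapping from model-name prefix to the third-party package it
# needs. The package names are assumptions for illustration only.
REQUIRED_PACKAGE = {"gpt": "openai", "claude": "anthropic", "bard": "bardapi"}


def import_backend(model_name, registry=REQUIRED_PACKAGE):
    """Import only the package needed by the selected model type."""
    for prefix, package in registry.items():
        if model_name.startswith(prefix):
            try:
                return importlib.import_module(package)
            except ImportError as exc:
                raise ImportError(
                    f"Model '{model_name}' requires the optional package '{package}'."
                ) from exc
    raise ValueError(f"Unknown model type: {model_name}")


def azure_openai_settings(env=None):
    """Collect Azure OpenAI client settings from environment variables."""
    env = os.environ if env is None else env
    names = {
        "api_key": "AZURE_OPENAI_API_KEY",
        "azure_endpoint": "AZURE_OPENAI_ENDPOINT",
        "api_version": "OPENAI_API_VERSION",
    }
    missing = [var for var in names.values() if not env.get(var)]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return {argument: env[var] for argument, var in names.items()}


def build_gpt_client(env=None):
    """Create an Azure OpenAI client; openai is imported only when this runs."""
    openai = import_backend("gpt")
    return openai.AzureOpenAI(**azure_openai_settings(env))
```

With this shape, running a Claude or Bard evaluation never touches `openai`, and a missing package fails with a message that names the model that needed it.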
Other Misc
- Use proper logging with rich formatting
- Separate metrics calculation from logging
- Use pandas DataFrame for metric printing in nicer tables (See below)
- Add the ability to limit all steps of evaluation (generate, extract, calculate) to a maximum number of problems
  - Allows easier testing of functionality on small subsets
- Remove duplicate definitions of get_chat_response (Fixes: #16)
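As a rough illustration of the problem limit, the sketch below truncates a list of problem IDs and logs a warning when it does so. The function name `limit_problems` and the parameter `max_num_problems` are hypothetical and may not match the names used in the evaluation scripts.

```python
import logging

logger = logging.getLogger(__name__)


def limit_problems(problem_ids, max_num_problems=None):
    """Return at most max_num_problems IDs, preserving their order.

    None means "no limit". Names here are illustrative assumptions.
    """
    ids = list(problem_ids)
    if max_num_problems is None:
        return ids
    if max_num_problems < 0:
        raise ValueError("max_num_problems must be non-negative")
    if max_num_problems < len(ids):
        logger.warning("Limiting number of problems to %d.", max_num_problems)
        ids = ids[:max_num_problems]
    return ids


# Usage: keep the first two of five problem IDs.
subset = limit_problems(["1", "2", "3", "4", "5"], max_num_problems=2)
```

Applying the same helper in each step (generate, extract, calculate) keeps the three steps working on the same subset.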
Sample Output
Generate Responses
[18:09:52] INFO [root] MathVista: Generating Responses - Start
[18:09:52] INFO [root] Loading dataset AI4Math/MathVista, split testmini...
[18:10:01] INFO [root] Creating new query...
[18:10:01] INFO [root] Loading gpt-4-32k...
[18:10:01] INFO [root] Model loaded.
[18:10:01] INFO [root] Results already exist.
[18:10:01] INFO [root] Reading _results/eval/mathvista/gpt4/debug/gpt4.json...
[18:10:01] WARNING [root] Limiting number of problems to 20.
[18:10:01] INFO [root] Number of test problems to run: 20
0%| | 0/20 [00:00<?, ?it/s][18:10:01] DEBUG [root] --------------------------------------------------------------
[18:10:01] DEBUG [root] Generating response for problem: 1...
[18:10:14] DEBUG [root] Query:
Question: When a spring does work on an object, we cannot find the work by simply
multiplying the spring force by the object's displacement. The reason is that there is no
one value for the force-it changes. However, we can split the displacement up into an
infinite number of tiny parts and then approximate the force in each as being constant.
Integration sums the work done in all those parts. Here we use the generic result of the
integration.
In Figure, a cumin canister of mass $m=0.40 \mathrm{~kg}$ slides across a horizontal
frictionless counter with speed $v=0.50 \mathrm{~m} / \mathrm{s}$. It then runs into and
compresses a spring of spring constant $k=750 \mathrm{~N} / \mathrm{m}$. When the canister
is momentarily stopped by the spring, by what distance $d$ is the spring compressed?
Hint: Please answer the question requiring a floating-point number with one decimal place
and provide the final value, e.g., 1.2, 1.3, 1.4, at the end.
Solution:
[18:10:14] DEBUG [root] Response:
The spring does work on the canister, bringing it to rest. The work done by the spring is
equal to the kinetic energy of the canister before it hits the spring. The work done by the
spring is given by the equation $W = \frac{1}{2}kx^2$, where $x$ is the distance the spring
is compressed. The kinetic energy of the canister is given by the equation $KE =
\frac{1}{2}mv^2$. Setting these two equal to each other gives:
$\frac{1}{2}kx^2 = \frac{1}{2}mv^2$
Solving for $x$ gives:
$x = \sqrt{\frac{mv^2}{k}}$
Substituting the given values gives:
$x = \sqrt{\frac{(0.40 \mathrm{~kg})(0.50 \mathrm{~m/s})^2}{750 \mathrm{~N/m}}}$
$x = 0.01 \mathrm{~m}$
So, the spring is compressed by a distance of 0.01 m or 1.0 cm.
5%|███▊ | 1/20 [00:13<04:08, 13.05s/it][18:10:14] DEBUG [root] --------------------------------------------------------------
[18:10:14] DEBUG [root] Generating response for problem: 2...
...
[18:11:18] DEBUG [root] Query:
Question: Is the sum of smallest two bar is greater then the largest bar?
Choices:
(A) Yes
(B) No
Hint: Please answer the question and provide the correct option letter, e.g., A, B, C, D, at
the end.
Solution:
[18:11:18] DEBUG [root] Response:
The question does not provide enough information for a solution. It refers to "bars" but
does not specify their sizes or quantities.
[18:11:18] INFO [root] Saved results to _results/eval/mathvista/gpt4/debug/gpt4.json
100%|███████████████████████████████████████████████████████████████████████████| 20/20 [01:17<00:00, 3.89s/it]
[18:11:18] INFO [root] MathVista: Generating Responses - Finish
Extract Answer
[18:16:09] INFO [root] MathVista: Extract Answers - Start
[18:16:09] INFO [root] Reading _results/eval/mathvista/gpt4/debug/gpt4.json...
[18:16:09] INFO [root] Number of test problems to run: 20
95%|███████████████████████████████████████████████████████████████████████▎ | 19/20 [00:35<00:02, 2.91s/it][18:16:46] INFO [root] Saved results to _results/eval/mathvista/gpt4/debug/gpt4.json
100%|███████████████████████████████████████████████████████████████████████████| 20/20 [00:37<00:00, 1.86s/it]
[18:16:46] INFO [root] MathVista: Extract Answers - Finish
Calculate Score
[18:21:17] INFO [root] MathVista: Calculating Scores - Start
[18:21:17] INFO [root] Loading dataset AI4Math/MathVista, split testmini...
[18:21:25] INFO [root] Reading _results/eval/mathvista/gpt4/debug/gpt4.json...
[18:21:25] INFO [root] Number of testing problems: 20
[18:21:25] INFO [root] For each problem normalize extractions and get True False value
100%|████████████████████████████████████████████████████████████████████████| 20/20 [00:00<00:00, 34735.44it/s]
[18:21:25] INFO [root] Calculate the average accuracy
100%|███████████████████████████████████████████████████████████████████████| 20/20 [00:00<00:00, 353949.70it/s]
/workspaces/MathVista/evaluation/calculate_score.py:249: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
values += results_df[key][i]
[18:21:25] INFO [root] Correct: 8/20 - Accuracy: 40.00%
========================================
question_type
========================================
Accuracy Correct/Total
multi_choice 61.54% (8/13)
free_form 0.00% (0/7)
answer_type
========================================
Accuracy Correct/Total
text 61.54% (8/13)
float 0.00% (0/1)
integer 0.00% (0/6)
language
========================================
Accuracy Correct/Total
chinese 66.67% (2/3)
english 35.29% (6/17)
source
========================================
Accuracy Correct/Total
UniGeo 100.00% (1/1)
Super-CLEVR 100.00% (3/3)
TQA 100.00% (1/1)
ScienceQA 100.00% (1/1)
GeoQA+ 66.67% (2/3)
SciBench 0.00% (0/1)
TextVQA 0.00% (0/2)
CLEVR-Math 0.00% (0/2)
Geometry3K 0.00% (0/1)
IconQA 0.00% (0/1)
IQTest 0.00% (0/1)
DVQA 0.00% (0/2)
ChartQA 0.00% (0/1)
category
========================================
Accuracy Correct/Total
general-vqa 50.00% (5/10)
math-targeted-vqa 30.00% (3/10)
task
========================================
Accuracy Correct/Total
textbook question answering 66.67% (2/3)
visual question answering 60.00% (3/5)
geometry problem solving 60.00% (3/5)
math word problem 0.00% (0/3)
figure question answering 0.00% (0/4)
context
========================================
Accuracy Correct/Total
geometry diagram 60.00% (3/5)
synthetic scene 60.00% (3/5)
scientific figure 50.00% (1/2)
natural image 33.33% (1/3)
abstract scene 0.00% (0/1)
puzzle test 0.00% (0/1)
bar chart 0.00% (0/3)
grade
========================================
Accuracy Correct/Total
high school 66.67% (4/6)
daily life 37.50% (3/8)
elementary school 20.00% (1/5)
college 0.00% (0/1)
skills
========================================
Accuracy Correct/Total
scientific reasoning 66.67% (2/3)
algebraic reasoning 60.00% (3/5)
geometry reasoning 50.00% (3/6)
arithmetic reasoning 42.86% (3/7)
statistical reasoning 0.00% (0/3)
numeric commonsense 0.00% (0/3)
logical reasoning 0.00% (0/1)
[18:21:25] INFO [root] Saved scores to: _results/eval/mathvista/gpt4/debug/gpt4_metric.json
[18:21:25] INFO [root] MathVista: Calculating Scores - Finish
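For readers who want to see how metrics calculation can be kept separate from logging, here is a minimal sketch. It uses only the standard library instead of the pandas DataFrame this PR uses, and the function names and the `true_false` field name are assumptions. The demo records reproduce the question_type counts from the output above (8 of 13 multi_choice correct, 0 of 7 free_form correct).

```python
import logging
from collections import defaultdict


def accuracy_by(records, key):
    """Group records by `key` and return {group: (correct, total, accuracy %)}.

    A record whose value is a list (for example several skills) counts once
    in each listed group.
    """
    counts = defaultdict(lambda: [0, 0])  # group -> [correct, total]
    for record in records:
        value = record[key]
        groups = value if isinstance(value, list) else [value]
        for group in groups:
            counts[group][1] += 1
            if record["true_false"]:
                counts[group][0] += 1
    return {g: (c, t, 100.0 * c / t) for g, (c, t) in counts.items()}


def format_rows(scores):
    """Render scores as text rows, highest accuracy first."""
    ordered = sorted(scores.items(), key=lambda item: item[1][2], reverse=True)
    return [
        f"{group:<24}{accuracy:>7.2f}%  ({correct}/{total})"
        for group, (correct, total, accuracy) in ordered
    ]


def log_scores(scores, logger):
    """Logging lives here, so accuracy_by stays a pure calculation."""
    for row in format_rows(scores):
        logger.info(row)


# Demo data matching the question_type counts in the sample output.
records = [{"question_type": "multi_choice", "true_false": i < 8} for i in range(13)]
records += [{"question_type": "free_form", "true_false": False} for _ in range(7)]
scores = accuracy_by(records, "question_type")
log_scores(scores, logging.getLogger(__name__))
```

Because `accuracy_by` returns plain data, it can be tested without capturing log output, and the same result can be written to the metric JSON file or rendered as a table.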