Image validation improvements
Pull Request
- [x] Api Handler split
- [x] Image evaluation split
- [x] Json-Structs for requests & responses
- [x] Proper error handling + more error details
- [x] Gemini implementation
- [x] Api-Info adjustment
- [x] .env adjustments
- [x] secret updates for new api tests
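To make the "proper error handling + more error details" item concrete, here is a minimal sketch of what a dedicated error type for the image evaluation path could look like. The type name and variants are hypothetical; the actual enum in this PR may differ.

```rust
use std::fmt;

/// Hypothetical error type illustrating "more error details";
/// the real variants in this PR may differ.
#[derive(Debug)]
enum ImageError {
    /// The uploaded file is not a supported image format.
    InvalidFormat(String),
    /// The evaluation API rejected the request, with the reason it gave.
    ApiRejected { reason: String },
}

impl fmt::Display for ImageError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            ImageError::InvalidFormat(ext) => {
                write!(f, "unsupported image format: {ext}")
            }
            ImageError::ApiRejected { reason } => {
                write!(f, "evaluation API rejected the request: {reason}")
            }
        }
    }
}

impl std::error::Error for ImageError {}

fn main() {
    // Errors carry enough context to be logged or shown to an admin.
    let err = ImageError::ApiRejected { reason: "image too large".into() };
    println!("{err}");
}
```

Implementing `std::error::Error` lets callers pass these errors up with `?`, which matches the "errors are passed up" approach discussed below in the logging thread.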
Backend
Make sure you have done the following before merging this pull request:
- [x] Your code is documented using doc comments.
- [ ] You added appropriate logging in your code.
- [x] You added unit tests to all your implemented functions.
- [x] You ran `cargo fmt` and `cargo clippy` to format your code and check for improvements.
Here are some testing results. This test should decide whether the Gemini API turns out to be useful. As far as I can tell, the results are not optimal, but considering the temporary state during rush hours, this seems to be much better than what we have now. The two diagrams show how the way the problem is represented in the prompt influences the result.
Just a small change in the sentence, or one simple extra word in the request, improves the outcome by ~15%!
Almost all invalid images are detected by the AI. Images an admin would accept do not score as well with the AI; more on that in the next diagram. For now, the AI seems much stricter than a human.
When we look at the reasons why Gemini rejects an image, some concerns arise:
- Even if just one ingredient is missing, or the egg is boiled instead of cooked, the AI rejects the image.
- Sometimes Gemini can't differentiate or detect the right food. For example, Kartoffelsalat is incorrectly recognized as Kartoffelbrei.
- Some custom meal names like "koerifries" cause issues, as it's hard to know what "koerifries" are. The AI expects curry sauce or something similar with the fries.
- There are issues with the naming of the meals. Meals like "Buffet with x, y, z" cause problems because Gemini expects x, y, and z, even when there's a buffet selection.
- Homemade mensa noodles aren't recognized as homemade by the AI because they look industrial.
- A dish should contain cashews, but the cafeteria doesn't provide enough, so they're not visible in the image.

I guess you can see where this is going...
But despite these issues, I see a lot of potential and am implementing the API for our purposes. For modularity and future improvements, I'll add .env variables for the request body and specify whether to use Gemini at all.
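The .env-based toggle could be read roughly like this. This is a minimal sketch, assuming hypothetical variable names (`USE_GEMINI`, `GEMINI_PROMPT`); the actual keys in this PR's `.env` adjustments may differ.

```rust
use std::env;

// Hypothetical .env keys; the real names in this PR may differ.
const USE_GEMINI_KEY: &str = "USE_GEMINI";
const GEMINI_PROMPT_KEY: &str = "GEMINI_PROMPT";

/// Interprets a raw .env value as a boolean toggle.
fn parse_enabled(value: &str) -> bool {
    value == "1" || value.eq_ignore_ascii_case("true")
}

/// Returns the evaluation prompt if Gemini is enabled, otherwise None.
fn gemini_prompt() -> Option<String> {
    let enabled = env::var(USE_GEMINI_KEY)
        .map(|v| parse_enabled(&v))
        .unwrap_or(false);
    if enabled {
        env::var(GEMINI_PROMPT_KEY).ok()
    } else {
        None
    }
}

fn main() {
    // With neither variable set, image evaluation is simply skipped.
    println!("Gemini prompt: {:?}", gemini_prompt());
}
```

Returning `None` when the toggle is off means the calling code can skip the AI check entirely without a separate flag.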
This PR should be complete aside from the open comments. We should consider adding logging in some places.
Codecov Report
Attention: Patch coverage is 92.20779% with 6 lines in your changes missing coverage. Please review.
Project coverage is 93.41%. Comparing base (3c3e93e) to head (0c069cf). Report is 1 commit behind head on main.
Additional details and impacted files
@@ Coverage Diff @@
## main #191 +/- ##
==========================================
- Coverage 93.45% 93.41% -0.04%
==========================================
Files 42 45 +3
Lines 1849 1914 +65
==========================================
+ Hits 1728 1788 +60
- Misses 121 126 +5
As for logging, I think we don't need to add anything additional. Errors are passed up, and we don't perform any "independent" actions.
One thing that is still missing AFAIK:
- [x] finding/evaluating the proper query question and ensuring it satisfies our needs (low false positives). Maybe we could automate that by creating a test that queries a folder of images, compares the answers against the required answer, and creates a short report (that test can be `#[ignore]`d for automatic testing). This way, we can easily compare different query questions at the press of a button.
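The core of such a harness could look like the sketch below. The `Sample` struct and `report` helper are hypothetical names; in the real test, each sample would come from sending one image of the labelled folder to Gemini, which is omitted here.

```rust
/// One labelled sample: the decision an admin would make vs. what the AI returned.
struct Sample {
    expected: bool,
    got: bool,
}

/// Summarises evaluator performance on a labelled image set.
fn report(samples: &[Sample]) -> String {
    let total = samples.len();
    let correct = samples.iter().filter(|s| s.expected == s.got).count();
    format!("Correct: {correct}/{total}")
}

fn main() {
    // Stand-in data; the real harness would fill this by querying Gemini
    // for every image in the labelled folder.
    let samples = vec![
        Sample { expected: true, got: true },
        Sample { expected: false, got: false },
        Sample { expected: true, got: false },
    ];
    println!("{}", report(&samples));
}
```

Marking the real test `#[ignore]` keeps it out of CI, while `cargo test -- --ignored` still runs it on demand for comparing query questions.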
The newly created test "test_evaluate_images" runs the Gemini evaluation over a predefined set of images. At the end, it provides a short report on how it performed with the provided question/query. The best performance could be achieved with the query mentioned in .env.example:
Starting gemini evaluation test with: 'Is a meal from a mensaria visible in the picture?' and 132 samples
Correct: 113/132
Correct images rejected: 2/19
Incorrect images accepted: 17/19
More interesting would be the actual false positive/negative rates. https://en.wikipedia.org/wiki/False_positive_rate
The new output contains the FPR as well as the FNR. (FP images can be manually deleted via review; FN images are lost in the process.)
--- Results: ---
Correct decisions: 110/129 (85.27132%).
Wrong decisions: 19/129 (14.728682%).
Images that got accepted but should not (FP): 18/129.
False positive rate (FPR): 47.368423%.
Images that got not accepted but should be (FN): 1/129.
False negative rate (FNR): 1.0989012%.
----------------
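For reference, the printed rates can be reproduced from the raw counts. The 110 correct decisions split into TP = 90 (accepted, should be accepted) and TN = 20 (rejected, should be rejected); that split follows from the printed rates, since FNR = 1/91 and FPR = 18/38. A minimal sketch of the two definitions:

```rust
/// False positive rate: share of should-be-rejected images the AI accepted.
fn fpr(fp: u32, tn: u32) -> f64 {
    f64::from(fp) / f64::from(fp + tn)
}

/// False negative rate: share of should-be-accepted images the AI rejected.
fn fnr(fn_count: u32, tp: u32) -> f64 {
    f64::from(fn_count) / f64::from(fn_count + tp)
}

fn main() {
    // Counts from the run above: FP = 18, TN = 20, FN = 1, TP = 90.
    println!("FPR: {:.4}%", fpr(18, 20) * 100.0); // ~47.37%
    println!("FNR: {:.4}%", fnr(1, 90) * 100.0);  // ~1.10%
}
```

Note that the denominators are the class sizes (38 should-reject, 91 should-accept images), not the total of 129, which is why the FPR looks so much worse than the overall error rate.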