Remove species_ prefix dependency from image path
Related to https://github.com/drivendataorg/zamba-web/issues/512
🚀 Deployed on https://deploy-preview-383--silly-keller-664934.netlify.app
Codecov Report
Attention: Patch coverage is 89.47368% with 2 lines in your changes missing coverage. Please review.
Project coverage is 82.9%. Comparing base (
ce5d28e) to head (37acdd8).
| Files with missing lines | Patch % | Lines |
|---|---|---|
| zamba/images/manager.py | 50.0% | 2 Missing :warning: |
Additional details and impacted files
@@ Coverage Diff @@
## master #383 +/- ##
========================================
+ Coverage 82.8% 82.9% +0.1%
========================================
Files 37 37
Lines 3191 3194 +3
========================================
+ Hits 2645 2651 +6
+ Misses 546 543 -3
| Files with missing lines | Coverage Δ | |
|---|---|---|
| zamba/images/config.py | 92.4% <100.0%> (+<0.1%) |
:arrow_up: |
| zamba/models/config.py | 96.2% <100.0%> (+<0.1%) |
:arrow_up: |
| zamba/images/manager.py | 83.4% <50.0%> (+1.6%) |
:arrow_up: |
:rocket: New features to boost your workflow:
- :snowflake: Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
What are the implications of just doing this on save versus ahead of when training starts? Does this mean you couldn't resume training a model? Do we need to the species_ addition at all in the image path if we are already tracking the species columns?
Good questions.
Re implications, just reasoning through: Label inputs and outputs are the important bits. On first train, input labels have no prefixes. There are no label outputs on the first training pass prior to model save unless the user is using the code directly, which feels like a rare case. Predicting from that saved model currently produces labels with prefixes. These labels do not match the original input labels. Without any more thinking, it seems like the current state isn't great.
Re resuming training, my current understanding: We take the checkpoint and the new labels from the user. We add the the species prefix then remove (the first instance of) it to check for label matches between the passed labels and the checkpoint hparam species. If we had stripped the prefix on checkpoint save, the user could pass labels without the prefix and they'd match. If we don't strip the prefix on save (which we're currently aren't), the user labels would only match if they included the prefix.
Re removing the prefix dependency entirely: I've found three places where this dependency still exists on the image path:
- [x] Calculating the class weights. Easily changed.
- [x] Model instantiation, called here and here, also easily changed.
- [x] Dataset splitting
- [x] We would also change the training config that introduces the prefix.
@pjbull I think this PR means an improvement over status quo, but it's kinda complicated, so I'm not totally confident on that. Removing this complexity looks straightforward, but potentially unnecessary. Thoughts?
Predicting from that saved model currently produces labels with prefixes. These labels do not match the original input labels. Without any more thinking, it seems like the current state isn't great.
Yep, this is the crux of the issue we want fixed.
There are no label outputs on the first training pass prior to model save unless the user is using the code directly, which feels like a rare case.
They could be displayed in the metrics/logs, so we want those to match the final state.
We would also change the training config that introduces the prefix.
I think I'm missing a step here. Why does get_dummies add the species_ prefix? It looks like if prefix is None it uses the value directly.
Overall, I'd take a more complex change to not add the prefix species_ at all and use the list that we generate of ordered species names anywhere we need them.
I think I'm missing a step here. Why does get_dummies add the species_ prefix? It looks like if prefix is None it uses the value directly.
That's why it took me a while to track down. It's here in the pandas code. Here's a little test:
Overall, I'd take a more complex change to not add the prefix species_ at all and use the list that we generate of ordered species names anywhere we need them.
:+1:
Ah, this is a series vs. DataFrame gotcha:
In [4]: pd.get_dummies(pd.Series(list("aabbabab")))
Out[4]:
a b
0 True False
1 True False
2 False True
3 False True
4 True False
5 False True
6 True False
7 False True
In [5]: pd.get_dummies(pd.DataFrame(list("aabbabab")))
Out[5]:
0_a 0_b
0 True False
1 True False
2 False True
3 False True
4 True False
5 False True
6 True False
7 False True
I don't think these failures have anything to do with this PR. Might be version updates or something?