zamba Remove species_ prefix dependency from image path

Related to https://github.com/drivendataorg/zamba-web/issues/512

May 05 '25 20:05 jdcc

🚀 Deployed on https://deploy-preview-383--silly-keller-664934.netlify.app

May 05 '25 20:05 github-actions[bot]

Codecov Report

Attention: Patch coverage is 89.47368% with 2 lines in your changes missing coverage. Please review.

Project coverage is 82.9%. Comparing base (ce5d28e) to head (37acdd8).

Files with missing lines	Patch %	Lines
zamba/images/manager.py	50.0%	2 Missing :warning:

Additional details and impacted files

@@           Coverage Diff            @@
##           master    #383     +/-   ##
========================================
+ Coverage    82.8%   82.9%   +0.1%     
========================================
  Files          37      37             
  Lines        3191    3194      +3     
========================================
+ Hits         2645    2651      +6     
+ Misses        546     543      -3

Files with missing lines	Coverage Δ
zamba/images/config.py	`92.4% <100.0%> (+<0.1%)`	:arrow_up:
zamba/models/config.py	`96.2% <100.0%> (+<0.1%)`	:arrow_up:
zamba/images/manager.py	`83.4% <50.0%> (+1.6%)`	:arrow_up:

:rocket: New features to boost your workflow:

:snowflake: Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

May 05 '25 20:05 codecov[bot]

What are the implications of just doing this on save versus ahead of when training starts? Does this mean you couldn't resume training a model? Do we need to the species_ addition at all in the image path if we are already tracking the species columns?

May 07 '25 05:05 pjbull

Good questions.

Re implications, just reasoning through: Label inputs and outputs are the important bits. On first train, input labels have no prefixes. There are no label outputs on the first training pass prior to model save unless the user is using the code directly, which feels like a rare case. Predicting from that saved model currently produces labels with prefixes. These labels do not match the original input labels. Without any more thinking, it seems like the current state isn't great.

Re resuming training, my current understanding: We take the checkpoint and the new labels from the user. We add the the species prefix then remove (the first instance of) it to check for label matches between the passed labels and the checkpoint hparam species. If we had stripped the prefix on checkpoint save, the user could pass labels without the prefix and they'd match. If we don't strip the prefix on save (which we're currently aren't), the user labels would only match if they included the prefix.

Re removing the prefix dependency entirely: I've found three places where this dependency still exists on the image path:

[x] Calculating the class weights. Easily changed.
[x] Model instantiation, called here and here, also easily changed.
[x] Dataset splitting
[x] We would also change the training config that introduces the prefix.

@pjbull I think this PR means an improvement over status quo, but it's kinda complicated, so I'm not totally confident on that. Removing this complexity looks straightforward, but potentially unnecessary. Thoughts?

May 07 '25 14:05 jdcc

Predicting from that saved model currently produces labels with prefixes. These labels do not match the original input labels. Without any more thinking, it seems like the current state isn't great.

Yep, this is the crux of the issue we want fixed.

There are no label outputs on the first training pass prior to model save unless the user is using the code directly, which feels like a rare case.

They could be displayed in the metrics/logs, so we want those to match the final state.

We would also change the training config that introduces the prefix.

I think I'm missing a step here. Why does get_dummies add the species_ prefix? It looks like if prefix is None it uses the value directly.

Overall, I'd take a more complex change to not add the prefix species_ at all and use the list that we generate of ordered species names anywhere we need them.

May 07 '25 22:05 pjbull

I think I'm missing a step here. Why does get_dummies add the species_ prefix? It looks like if prefix is None it uses the value directly.

That's why it took me a while to track down. It's here in the pandas code. Here's a little test:

Overall, I'd take a more complex change to not add the prefix species_ at all and use the list that we generate of ordered species names anywhere we need them.

:+1:

May 08 '25 14:05 jdcc

Ah, this is a series vs. DataFrame gotcha:

In [4]: pd.get_dummies(pd.Series(list("aabbabab")))
Out[4]: 
       a      b
0   True  False
1   True  False
2  False   True
3  False   True
4   True  False
5  False   True
6   True  False
7  False   True

In [5]: pd.get_dummies(pd.DataFrame(list("aabbabab")))
Out[5]: 
     0_a    0_b
0   True  False
1   True  False
2  False   True
3  False   True
4   True  False
5  False   True
6   True  False
7  False   True

May 08 '25 15:05 pjbull

I don't think these failures have anything to do with this PR. Might be version updates or something?

May 12 '25 17:05 jdcc