xbatcher
Prepending batch dimensions to match Keras interfaces
See issues #36 and #38
@jhamman @rabernat I'm not sure if this solves the problem in all cases, can you have a look?
Thanks a lot for getting this PR started, Chris!
I'm not sure if this solves the problem in all cases, can you have a look?
It's not possible for me to know that just by looking at the code. That's why we have tests!
As you can see, your changes have caused existing tests to fail. That's because the generated batch dataset is being compared with an "expected" dataset:
https://github.com/pangeo-data/xbatcher/blob/d98ad21f65affa306dfd09d4b1d00e14f599fe65/xbatcher/tests/test_generators.py#L23-L32
Since the expected dataset does not have the extra dimension, the test fails. This raises the following question: should this feature be optional? If so, we need to create a new option for it. Just changing the behavior of the code in this way is a "breaking change" to our API, and could cause problems for other xbatcher users. Tests help guard against this.
Changing the API is not out of the question--this is a very new and experimental package. But it would need to be discussed with the other developers and motivated by a clear argument.
If we do add an option for this feature (e.g. `squeeze_batch_dimension=True` as the default, but optionally `False` to enable the behavior in this PR), that needs to be covered by tests as well.
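As a rough illustration of the proposed option (the name `squeeze_batch_dimension` comes from the comment above; this is a sketch, not the actual xbatcher API), the two behaviors can be emulated with plain xarray operations:

```python
import numpy as np
import xarray as xr

# A small stand-in dataset; one generated batch is emulated by adding an
# explicit length-1 batch dimension, as the Keras-facing behavior would.
ds = xr.Dataset({"foo": (("x", "y"), np.arange(20).reshape(4, 5))})

# squeeze_batch_dimension=False would keep the batch dimension:
batch = ds.expand_dims("batch")
assert batch["foo"].dims == ("batch", "x", "y")

# squeeze_batch_dimension=True (the proposed default) would drop it:
squeezed = batch.squeeze("batch")
assert squeezed["foo"].dims == ("x", "y")
```

A test for the new option could compare the generated batches against both of these expected shapes.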
You can run the test suite yourself locally with:
`pytest -v`
(You'll need pytest installed.)
Sounds good, I'll try adding the squeeze option and see what happens. I'm guessing xbatcher is supporting non-Keras use cases as well?
Still need to add unit test(s)
Codecov Report
Merging #39 (f569190) into main (decee57) will not change coverage. The diff coverage is 100.00%.
@@           Coverage Diff           @@
##              main      #39   +/-  ##
========================================
  Coverage   100.00%  100.00%
========================================
  Files            2        2
  Lines           77       82    +5
  Branches        18       20    +2
========================================
+ Hits            77       82    +5
| Impacted Files | Coverage Δ |
|---|---|
| xbatcher/generators.py | 100.00% <100.00%> (ø) |
Continue to review full report at Codecov.
Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data.
Last update decee57...f569190.
@jhamman @rabernat I think this is ready to be merged
Doing some cleanup atm
@jhamman @rabernat Sorry for the confusion, I think this is in a mergeable state now.
Actually, if we go with the proposal of defaulting to `False` and relying on `xr.DataArray.squeeze`, this option becomes completely unnecessary.
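To illustrate the point (a sketch with made-up variable names, not the xbatcher API): if generated batches always carry a length-1 dimension, a user who does not want it can drop it themselves with xarray's `squeeze`, so no generator option is strictly needed:

```python
import numpy as np
import xarray as xr

# A stand-in for one generated batch with an explicit length-1 sample dim.
da = xr.DataArray(np.zeros((1, 4, 5)), dims=("sample", "x", "y"))

# Dropping the length-1 dimension is a one-liner on the user's side.
squeezed = da.squeeze("sample")
assert squeezed.dims == ("x", "y")
```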
Hmm, let's get a second opinion from someone who uses Keras. If they agree, we can close this Pull Request?
Sorry for taking so long to respond here. My understanding of the conversation above is that the purpose of this PR can be broken down into two issues:
- Should there be an easier option to retain all the original dataset dimensions in the generated batches? Currently one would need to include all the original dims in the dict provided to `input_dims`. My opinion is that this feature would be useful. One option would be for the user to specify `input_dims=False` to avoid any stacking step. Since this is not the main purpose of this PR, I suggest opening a separate issue to discuss further if any of you agree that this is worthwhile.
- Should there always be a `sample` dimension (previously called the `batch` dimension)? This is the topic of https://github.com/xarray-contrib/xbatcher/issues/127. If we choose to implement the feature suggested in point 1, I think it makes sense to include a `sample` dimension in the case in which the user specifies that stacking should take place. In the case without a `.stack` step, I think it's trivial to add a length-1 dimension, and that it is not worth the API complexity to support that through the batch generator.
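The "trivial to add a length-1 dimension" point above can be shown with plain xarray (illustrative names, not the xbatcher API): a batch lacking a sample dimension can be given one with `expand_dims`.

```python
import numpy as np
import xarray as xr

# A stand-in for a batch that has no sample dimension.
batch = xr.DataArray(np.ones((4, 5)), dims=("x", "y"))

# Adding a length-1 sample dimension on the user's side is one call.
with_sample = batch.expand_dims("sample")
assert with_sample.dims == ("sample", "x", "y")
assert with_sample.sizes["sample"] == 1
```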
If either of these points would benefit from some in-depth discussion, we could put this PR on the agenda for next Monday’s Pangeo ML working group meeting.
Hi all, just wondering what the status of this is?
I'm writing some xbatcher example notebooks and I'm running into this issue again.
Apologies again for the delay; my attention was drawn elsewhere due to illness, travel, and some other urgent tasks. If I recall correctly, we proposed at our last meeting to always have a defined axis order including a sample dimension, and to provide the option to specify that no transposing/stacking should occur using `input_dims=None`. I will try to draft this alternative solution by Monday.