
Prepending batch dimensions to match Keras interfaces

cmdupuis3 opened this pull request · 12 comments

See issues #36 and #38

@jhamman @rabernat I'm not sure if this solves the problem in all cases, can you have a look?

cmdupuis3 · Nov 16 '21 21:11

Thanks a lot for getting this PR started Chris!

> I'm not sure if this solves the problem in all cases, can you have a look?

It's not possible for me to know that just by looking at the code. That's why we have tests!

As you can see, your changes have caused existing tests to fail. That's because the generated batch dataset is being compared with an "expected" dataset.

https://github.com/pangeo-data/xbatcher/blob/d98ad21f65affa306dfd09d4b1d00e14f599fe65/xbatcher/tests/test_generators.py#L23-L32

Since the expected dataset does not have the extra dimension, the test fails. This raises the following question: should this feature be optional? If so, we need to create a new option for it. Just changing the behavior of the code in this way is a "breaking change" to our API, and could cause problems for other xbatcher users. Tests help guard against this.

Changing the API is not out of the question; this is a very new and experimental package. But it would need to be discussed with the other developers and motivated by a clear argument.

If we do add an option for this feature (e.g. `squeeze_batch_dimension=True` as the default, but optionally `False` to enable the behavior in this PR), it needs to be covered by tests as well.
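To make the proposal concrete, here is a minimal sketch of how such an option could behave. Note that `batch_generator` and the `squeeze_batch_dimension` keyword as written here are hypothetical illustrations, not xbatcher's actual API, and numpy stands in for xarray:

```python
import numpy as np

def batch_generator(data, batch_size, squeeze_batch_dimension=True):
    """Yield consecutive batches along the first axis of `data`.

    With squeeze_batch_dimension=False, a batch of size 1 keeps its
    leading length-1 batch axis, matching the (batch, ...) layout
    that Keras models expect.
    """
    for start in range(0, data.shape[0], batch_size):
        batch = data[start:start + batch_size]
        if squeeze_batch_dimension and batch.shape[0] == 1:
            # drop the length-1 batch axis (the current default behavior)
            batch = np.squeeze(batch, axis=0)
        yield batch

data = np.zeros((4, 8, 8))
batches = list(batch_generator(data, batch_size=1,
                               squeeze_batch_dimension=False))
# each batch keeps shape (1, 8, 8) rather than (8, 8)
```

A test for the option would then assert on the presence or absence of that leading axis for both settings.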

You can run the test suite yourself locally with:

```shell
pytest -v
```

(You'll need pytest installed.)

rabernat · Nov 16 '21 21:11

Sounds good, I'll try adding the squeeze option and see what happens. I'm guessing xbatcher supports non-Keras use cases as well?

cmdupuis3 · Nov 17 '21 14:11

Still need to add unit test(s)

cmdupuis3 · Nov 18 '21 20:11

Codecov Report

Merging #39 (f569190) into main (decee57) will not change coverage. The diff coverage is 100.00%.


```diff
@@            Coverage Diff            @@
##              main       #39   +/-   ##
=========================================
  Coverage   100.00%   100.00%
=========================================
  Files            2         2
  Lines           77        82    +5
  Branches        18        20    +2
=========================================
+ Hits            77        82    +5
```

| Impacted Files | Coverage Δ |
| --- | --- |
| xbatcher/generators.py | 100.00% <100.00%> (ø) |

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data.

codecov[bot] · Nov 19 '21 18:11

@jhamman @rabernat I think this is ready to be merged

cmdupuis3 · Nov 19 '21 19:11

Doing some cleanup atm

cmdupuis3 · Nov 19 '21 21:11

@jhamman @rabernat Sorry for the confusion, I think this is in a mergeable state now.

cmdupuis3 · Nov 30 '21 21:11

Actually, if we go with the proposal of default False and rely on xr.DataArray.squeeze, this option becomes completely unnecessary.

cmdupuis3 · Nov 28 '22 23:11
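For context, the squeeze the comment above refers to is a one-liner on the user side. A minimal sketch, with numpy's `squeeze` standing in for the analogous `xr.DataArray.squeeze`:

```python
import numpy as np

# a batch carrying a leading length-1 sample axis, as Keras expects
batch = np.zeros((1, 16, 16))

# a user who doesn't want that axis simply squeezes it away
flat = batch.squeeze(axis=0)
# flat.shape is (16, 16)
```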

> Actually, if we go with the proposal of default False and rely on xr.DataArray.squeeze, this option becomes completely unnecessary.

Hmm, let's get a second opinion from someone who uses Keras. If they agree, we can close this Pull Request?

weiji14 · Nov 30 '22 20:11

Sorry for taking so long to respond here. My understanding of the conversation above is that the purpose of this PR can be broken down into two issues:

  1. Should there be an easier option to retain all the original dataset dimensions in the generated batches? Currently one would need to include all the original dims in the dict provided to input_dims. My opinion is that this feature would be useful. One option would be for the user to specify input_dims=False to avoid any stacking step. Since this is not the main purpose of this PR, I suggest opening a separate issue to discuss further if any of you agree that this is worthwhile.
  2. Should there always be a sample dimension (previously called the batch dimension)? This is the topic of https://github.com/xarray-contrib/xbatcher/issues/127. If we choose to implement the feature suggested in point 1, I think it makes sense to include a sample dimension in the case in which the user specifies that stacking should take place. In the case without a .stack step, I think it's trivial for the user to add a length-1 dimension themselves, and not worth the API complexity to support that through the batch generator.
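To illustrate point 2, adding a length-1 sample dimension on the user side is a single call. A sketch with numpy's `expand_dims` standing in for the analogous `xr.DataArray.expand_dims`:

```python
import numpy as np

patch = np.zeros((16, 16))              # a single unstacked patch
sample = np.expand_dims(patch, axis=0)  # prepend a length-1 sample axis
# sample.shape is (1, 16, 16)
```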

If either of these points would benefit from some in-depth discussion, we could put this PR on the agenda for next Monday’s Pangeo ML working group meeting.

maxrjones · Dec 01 '22 16:12

Hi all, just wondering what the status of this is?

I'm writing some xbatcher example notebooks and I'm running into this issue again.

cmdupuis3 · Jan 18 '23 23:01

> Hi all, just wondering what the status of this is?
>
> I'm writing some xbatcher example notebooks and I'm running into this issue again.

Apologies again for the delay - my attention was drawn elsewhere due to illness, travel, and some other urgent tasks. If I recall correctly we proposed at our last meeting to always have a defined axis order including a sample dimension and provide the option to specify that no transposing/stacking should occur using input_dims=None. I will try to draft this alternative solution by Monday.
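As a rough sketch of that proposal (the `to_fixed_layout` helper below is hypothetical, not xbatcher code, and numpy stands in for xarray): every batch would come out in a fixed (sample, ...) axis order, with no stacking or transposing applied beyond adding the sample axis.

```python
import numpy as np

def to_fixed_layout(patches):
    """Stack equally shaped patches along a new leading sample axis,
    so every batch shares the same (sample, ...) layout."""
    return np.stack(patches, axis=0)

patches = [np.zeros((8, 8)) for _ in range(3)]
batch = to_fixed_layout(patches)
# batch.shape is (3, 8, 8): sample axis first, original dims untouched
```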

maxrjones · Jan 19 '23 21:01