[WIP] Combiner refactor
[Edit: Was apparently working off an old master, fixing conflicts right now] [Edit 2: Resolved conflicts]
See discussion in #578.
This is my first pass/prototype of the new structure laid out in #578. I have some things I'd like people to look at... okay, a lot of things. I updated the tests to reflect the new structure and they are all passing now.
- There is still the same amount of repeated code here as there was before. The only way I can think of getting rid of it, while still retaining `combine_method` as an abstract class (which I think makes it clearer to users that they should never use the base class directly), is to put some of the repeated code bits into separate non-abstract functions. But that might end up making the code more piecemeal/confusing than it needs to be... (see the sketch after this list).
- We should rename `combine_method`; anybody got any suggestions?
- I left the combine-function override parameter with `combine_method`; it seemed a little more obvious to users that way, but I'm not set on it.
- The same could be said of `scale_to`. Do we think this is something users will want to change between different combination runs? In that case, I think it makes sense to leave it as a parameter to `combine_method`.
- Uncertainty still needs work/filling out, but I put in some basic structure for it (also covered in the sketch below). I'm not sure I'm happy with what I have, though. The simplest approach was a boolean flag that chooses between constructing the uncertainty from the data and constructing it from the input uncertainties, with the default being to recalculate uncertainties from the input data values. But there's probably a more elegant solution.
- With this restructure it seems like it shouldn't be too bad to move the chunking/memory handling into the BaseClass, per @crawfordsm's wish list in our tag-up.
- I think a test needs to be added for weighting with the `combiner` function. I didn't get any errors related to that while I was still updating `combiner`, and it should have been broken, so that seems to be a gap.
- Obviously more docstrings etc. to come.
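To make the trade-offs concrete, here is a minimal sketch of the structure in the first and the uncertainty bullets above. Everything except the names `combine_method`, `scale_to`, and `fresh` (which appear in this discussion) is hypothetical, and the shared helper is only one way the repeated bits could be factored out:

```python
from abc import ABC, abstractmethod

import numpy as np


class CombinerBase(ABC):
    """Hypothetical abstract base; users should never instantiate this."""

    def __init__(self, fresh=True):
        # fresh=True (default): recalculate uncertainty from the input data
        # values; fresh=False: propagate the inputs' own uncertainties.
        self.fresh = fresh

    @abstractmethod
    def combine_method(self, data, scale_to=None):
        """Combine a masked stack of shape (n, ny, nx) along axis 0."""

    def _scale(self, data, scale_to):
        # A shared, non-abstract helper: one repeated code bit pulled out of
        # the concrete subclasses, at the cost of a more piecemeal layout.
        if scale_to is None:
            return data
        factors = scale_to / np.array([im.mean() for im in data])
        return data * factors[:, np.newaxis, np.newaxis]


class AverageCombiner(CombinerBase):
    def combine_method(self, data, scale_to=None):
        data = self._scale(data, scale_to)
        # "fresh" uncertainty: recomputed from the data being combined.
        uncertainty = np.ma.std(data, axis=0) if self.fresh else None
        return np.ma.mean(data, axis=0), uncertainty
```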
Please have a look at the following list and replace the "[ ]" with a "[x]" if the answer to this question is yes.
- [ ] For new contributors: Did you add yourself to the "Authors.rst" file?
For documentation changes:
- [ ] For documentation changes: Does your commit message include a "[skip ci]"? Note that it should not if you changed any examples!
For bugfixes:
- [ ] Did you add an entry to the "Changes.rst" file?
- [ ] Did you add a regression test?
- [ ] Does the commit message include a "Fixes #issue_number" (replace "issue_number")?
- [ ] Does this PR add, rename, move or remove any existing functions or parameters?
For new functionality:
- [ ] Did you add an entry to the "Changes.rst" file?
- [ ] Did you include a meaningful docstring with Parameters, Returns and Examples?
- [ ] Does the commit message include a "Fixes #issue_number" (replace "issue_number")?
- [ ] Did you include tests for the new functionality?
- [ ] Does this PR add, rename, move or remove any existing functions or parameters?
Please note that the last point is not a requirement. It is meant as a check if the pull request potentially breaks backwards-compatibility.
Thank you @SaOgaz for your work here 👍.
I plan to have a look over the changes and the points you mentioned soon-ish (although I cannot promise anything currently).
@SaOgaz thanks for the implementation -- I think this is great and really gives a concrete example to discuss!
While there isn't a reduction in the repetition of the code, it does enable more precise handling of the uncertainty.
I think there is something very interesting in having different methods for the uncertainty, and that would be incredibly useful. I'd like to see a different name than `fresh`, but I don't have a useful suggestion at the moment.
If this would make memory optimization easier, that is a very big positive for this implementation, so if you can expand on that point, that would be very helpful.
I'm okay with `combine_method`, though we might want to have a shortcut for each of just average, median, and sum (a rough sketch of what those could look like is below).
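For what it's worth, the shortcuts could be as thin as this (a hypothetical sketch; the class names are the illustrative ones from the sketch earlier in the thread, not settled API):

```python
# Hypothetical one-call shortcuts wrapping the class-based interface.
def average_combine(images, **kwargs):
    return AverageCombiner(**kwargs).combine_method(images)


def median_combine(images, **kwargs):
    return MedianCombiner(**kwargs).combine_method(images)


def sum_combine(images, **kwargs):
    return SumCombiner(**kwargs).combine_method(images)
```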
As to `scale_to`: probably not a common use case, but it might be useful to provide it.
There are two concerns this raises:
- If someone wants to compare a median combine and an average combine, they need to create the class twice, which would result in a large memory footprint and possibly needing to delete the object. I only really saw this once I saw the implementation, which is what makes having a concrete example so worthwhile. This may become less of a concern if the memory usage is optimized, and it is also likely not a common use case.
- As another comment: if I search on usage of `combine` vs `Combiner`, that shows about a 70/30 split in the community over who is using which interface. We would want to maintain the `Combiner` class for a while, or at least provide a wrapper around the current classes for the future, just to provide backward compatibility for our community (a minimal sketch of such a wrapper follows below). I don't think this would be too hard given the implementation you've made.
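To illustrate the wrapper idea (names and details hypothetical, reusing the illustrative classes from the sketch above):

```python
import warnings


class Combiner:
    """Thin backward-compatible shim over the new combiner classes."""

    def __init__(self, ccd_list):
        warnings.warn("Combiner is deprecated; use the new combiner classes",
                      DeprecationWarning, stacklevel=2)
        self._ccd_list = ccd_list

    def average_combine(self):
        # Delegate to the hypothetical new class while keeping the old API.
        return AverageCombiner().combine_method(self._ccd_list)
```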
cc @eteq
I haven't had a chance to give this a detailed look-over yet (weekend, hopefully), but looks very promising.
One change -- I think we want to keep the old `Combiner` class around for backwards compatibility. We can stop mentioning it in favor of the new combiner classes, but I don't think we should drop it entirely.
That said, it is great to have you working on this @SaOgaz -- this is an area of the code that has needed attention for a while!
About halfway through removing the `data_arr` instance attribute, I realized this is going to be a problem for the `sigma_clipping`, `clip_extreme`, and `minmax_clipping` methods. What are people's thoughts on turning these three methods into functions that run on a list of `CCDData` objects? I'd like to avoid users (or the code) having to re-run the same sigma clipping etc. on the same dataset multiple times, in the case of a user wanting the same data set clipped but combined with different combination types.
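A rough sketch of the function version, with a hypothetical signature, so one clipping pass can feed several combiners:

```python
import numpy as np


def sigma_clipping(ccd_list, low_thresh=3, high_thresh=3):
    """Return a combined mask for a list of CCDData images, flagging pixels
    that deviate from the per-pixel median by more than the thresholds."""
    data = np.ma.array(
        [ccd.data for ccd in ccd_list],
        mask=[ccd.mask if ccd.mask is not None
              else np.zeros(ccd.data.shape, dtype=bool)
              for ccd in ccd_list])
    center = np.ma.median(data, axis=0)
    std = np.ma.std(data, axis=0)
    clipped = (data < center - low_thresh * std) | \
              (data > center + high_thresh * std)
    return data.mask | clipped.filled(False)
```

The returned mask could then be handed to both an average and a median combine without re-running the clipping on the same dataset.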
@eteq I've actually just run across another problem when trying to use the built-in CCDData arithmetic. Right now the combination math is done on masked numpy arrays. This means when a sum is done, for example, the masks are used (and, as you know, uncertainties are calculated fresh using that masking information).
The current CCDData arithmetic methods do not use the masks. I could probably apply the masks to the data pre-calculation as a workaround, but that wouldn't account for the masked values in the CCDData built-in uncertainty calculation... defeating our original purpose...
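A tiny plain-numpy illustration of the mismatch (not ccdproc code): masked-array arithmetic keeps masked pixels excluded, while arithmetic on the raw arrays, effectively what happens when the masks are ignored, does not:

```python
import numpy as np

a = np.ma.array([1.0, 2.0, 3.0], mask=[False, True, False])
b = np.ma.array([1.0, 1.0, 1.0], mask=[False, False, False])

print(a + b)            # [2.0 -- 4.0]: the masked pixel stays masked
print(a.data + b.data)  # [2. 3. 4.]: the mask is ignored entirely
```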
Since I'm posting here anyway, I should add: the combination of pulling the data out of an instance variable (something I do think is worthwhile) with trying to use the native CCDData type is basically resulting in a complete re-write of ALL the code. It ends up touching every single function/method. So if we do end up continuing down this route it will require a lot more time* on my part.
*A lot more time meaning more than a few more hours; maybe half a week of effort, including updating the tests, docs, etc.
@SaOgaz -- I'll try to digest your comments by early next week. I need to review the current code base and your changes to make sure I understand the issues you are hitting.
Off the top of my head, I think CCDData's uncertainty propagation is only going to be useful for sum or average combine (at best).
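For reference, the tractable case is quadrature propagation for a linear combine, which is why sum and average work while a median has no comparably simple closed form (a sketch, not existing code):

```python
import numpy as np


def propagate_average_uncertainty(std_devs):
    """std_devs: stack of per-image standard deviations, shape (n, ny, nx).
    Returns the propagated standard deviation of the average combine."""
    n = std_devs.shape[0]
    return np.sqrt(np.sum(std_devs ** 2, axis=0)) / n
```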
Just so I understand, does this

> Since I'm posting here anyway, I should add: the combination of pulling the data out of an instance variable (something I do think is worthwhile) with trying to use the native CCDData type is basically resulting in a complete re-write of ALL the code.

mean all of the code in `combine`? Or in `CCDData`?
@mwcraig, all of the code in `combine`.