Implement checksums for CircleCI integration tests
The current mechanism for integration testing with CircleCI is to run the full `tedana` pipeline and then compare the content of the output files to previously generated files using `np.allclose`. This procedure ensures that the results of the `tedana` pipeline haven't changed in any appreciable way, which is critical as refactoring and updating occurs.

I was recently inspired by @WhitakerLab's BrainNetworksInPython repository, however, which uses Python's `hashlib` library to implement checksum comparisons of their generated outputs. I think this procedure could be applied with great success to the `tedana` repository!

Rather than having CircleCI re-download the previously generated outputs from Dropbox each time, they could be hashed once and the hashes kept in a JSON file in the `tests/data` folder. The outputs generated during testing could then be hashed and compared against this reference. This would also make testing locally a lot easier (no need to download the 1.9GB output directory!).
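For anyone who wants to see the shape of the idea, here's a minimal sketch; the helper names, reference JSON layout, and output file name are made up for illustration and don't correspond to anything that exists in the repository yet:

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path, blocksize=65536):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fobj:
        for block in iter(lambda: fobj.read(blocksize), b""):
            digest.update(block)
    return digest.hexdigest()


def check_outputs(out_dir, reference_json):
    """Compare checksums of generated outputs against a stored JSON reference."""
    with open(reference_json) as fobj:
        # hypothetical layout, e.g. {"some_output.nii.gz": "ab12...", ...}
        reference = json.load(fobj)
    mismatches = [
        fname for fname, expected in reference.items()
        if sha256_of(Path(out_dir) / fname) != expected
    ]
    assert not mismatches, f"Checksums changed for: {mismatches}"
```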
Eeeeeexcellent - I’m glad that’s helpful! I’m tagging @islast because she did all that work so she could potentially pass on any advice etc :sparkles:
I'm thrilled that you like our method @rmarkello :smile:
My warning is that because hashing doesn't allow you to use tolerances the way `np.allclose` does, you have to be very careful about random states (unless `tedana` is deterministic). We have a different hash for each environment we test in.

This is manageable, but it often feels like our continuous integration is more effective at alerting us to dependency updates interfering with the random state or the formatting of written outputs than it is at checking for behavioural changes in BrainNetworksInPython.
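(To make that contrast concrete, a toy example: even negligible floating point drift breaks a byte-level checksum while a tolerance-based comparison still passes.)

```python
import hashlib

import numpy as np

a = np.zeros(5)
b = a + 1e-12  # tiny numerical drift, e.g. from a dependency update

# A tolerance-based comparison shrugs this off...
print(np.allclose(a, b))  # True

# ...but the byte-level checksums no longer match.
print(hashlib.sha256(a.tobytes()).hexdigest()
      == hashlib.sha256(b.tobytes()).hexdigest())  # False
```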
This point is well taken, @Islast! Thank you so much for raising it (and thanks for pinging her in, @KirstieJane!).
While `tedana` is not deterministic, we do try to ensure consistent seeding for test purposes. However, you're completely right: given the sheer number of floating point operations that occur on the data in `tedana`, we might need `numpy`'s ability to set tolerance thresholds!

I'm hopeful that with testing occurring inside a Docker container, things will be more-or-less consistent from run to run, enough so to at least give this a shot. Indeed, in some respects being made aware of dependency changes or alterations to output formats may be a feature of using the hashing approach, not a bug!
Very cool. We have a similar testing strategy in NIAK and this idea will help a lot. BTW in our case, with proper seeding and a container we exactly replicate outputs.
This is a great idea, @rmarkello !
I think I'd like to wait on implementing it until we resolve #14 (which is yielding issues in #18), since the ideal workflow will allow developers to test locally by just downloading a JSON of the hash values, rather than the current 2GB of Dropbox data. This is definitely something for our `0.1.0` milestone, though!
Lately we've been making a bunch of breaking changes, so we've been fine with just checking for the existence of specific output files. However, once the final breaking PRs (#164 and #152) are merged in, I want to work on refactoring a lot of the more confusing code (e.g., `fitmodels_direct` and the component selection). I think we should try implementing this before that point, because I would like to make sure that the contents of the files are correct after cleaning up the code.
@tsalo @KirstieJane is this formally going to be part of the GSoC student's work if accepted?
If they're able to do it, then definitely. It will have to come after unit tests though (since they're the low-hanging fruit).
Oh, no, totally, unit testing is higher priority. I'm just trying to make sure that there's a path forward for this issue. It seems like we should just see if the student can get to it this summer, and if not then we'll have to revisit it ourselves. Seem okay?
Just a quick note that recent nibabel now has a `nib-diff` CLI tool that lets you set `np.allclose`-style tolerances on neuroimages, in case you do need something looser than a checksum.
Added to the sprint agenda.
Is this still being considered, @rmarkello? I didn't see it in #418?
Still being considered, for sure! We're going to need to add regression tests to actually compare the outputs of the integration tests to pre-generated data, but it doesn't (didn't) make sense to do that when so many PRs were going to totally mess with the actual data generation! All our tests would consistently be failing 😢
Once #435, #436, and @handwerkerd's eventual PR to overhaul the decision tree are integrated we can do this. In the interim I can open a PR with an example of what this would look like, though?
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions to tedana :tada:!
@rmarkello do you still think this is workable?
I do, especially if we use `nib-diff` rather than checksums. That said, I've been totally delinquent on `tedana`, so I'm not sure about the status of some of the more major changes (e.g., the decision tree overhaul). I think it would probably be best to hold off on implementing these until those land, because we know they're going to be breaking changes. But if others want to see these done sooner than that, that's fine; we'll just have to acknowledge those PRs will fail all the tests until we update the data to compare against!
Forgive me if this is incorrect, but doesn't nib-diff require that both files be in memory? In other words, don't you still have to download or upload a large amount of data to use it?
You're totally right, they would both have to be in memory. I think checksums are just gonna be a real pain with floating point data, though, so I think we need to do something that's a bit more `np.allclose`-like. Since the test data are all down-sampled pretty aggressively, do you think it would be a problem to load two images in memory at a time? My thought is that it would never be more than two (the new image and the comparison / baseline).

As it stands, we are (or used to be? at one point we definitely were) downloading the comparison output data on CircleCI every time (and, if I remember correctly, just not using it)!
We're definitely uploading comparison output to CircleCI every time, and not using it. I thought the point of the checksums was to reduce the download/upload need, but I forgot about the upload of artifacts to CI rather than just checksums.
I don't think that there's a memory constraint problem, especially with the truncated data. You would have to go pretty extreme to get a memory issue.
I wonder about simply running `np.allclose` on the floating-point data, then. You could add this to the integration test file pretty straightforwardly, correct? What's the advantage of a checksum over that strategy?
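As a rough illustration of what that could look like in an integration test (the helper name and file arguments here are hypothetical, and the tolerances are just numpy's defaults written out explicitly):

```python
import nibabel as nib
import numpy as np


def assert_images_allclose(test_file, reference_file, rtol=1e-05, atol=1e-08):
    """Assert that two NIfTI images have matching data arrays within tolerance."""
    test_data = nib.load(test_file).get_fdata()
    ref_data = nib.load(reference_file).get_fdata()
    assert test_data.shape == ref_data.shape, "image shapes differ"
    assert np.allclose(test_data, ref_data, rtol=rtol, atol=atol), (
        f"{test_file} deviates from {reference_file} beyond tolerance"
    )
```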
At this point I'm definitely not recommending checksums! The benefit of `nib-diff` over a straight `np.allclose` is that the former checks things like the file header, etc., to make sure you're not altering other things besides the actual data content.

Finally, it's worth noting we don't have to call `nib-diff` from the command line; we could just import the relevant functions (`nibabel.cmdline.diff.diff`) and use those in the integration tests!
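For concreteness, here's a rough sketch of that approach. It assumes `nibabel.cmdline.diff.diff` accepts a list of file paths plus a `data_max_abs_diff` keyword argument and returns an empty mapping when nothing differs; that's worth verifying against whichever nibabel version gets pinned:

```python
from nibabel.cmdline.diff import diff as nib_diff


def assert_outputs_match(reference_file, test_file, max_abs=1e-06):
    """Fail if nibabel reports header or data differences beyond tolerance.

    Assumes ``diff`` takes file paths and a ``data_max_abs_diff`` tolerance
    and returns an empty mapping when the images agree; double-check the
    signature for the pinned nibabel version before relying on this.
    """
    differences = nib_diff([reference_file, test_file],
                           data_max_abs_diff=max_abs)
    assert not differences, f"nib-diff found differences: {dict(differences)}"
```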
I'm following this now. I think we'd discussed it briefly at the hackathon, but unfortunately my memory is trash and I didn't write it down, so we're having this conversation a second time. :man_facepalming: Okay, awesome, thanks @rmarkello. Other stuff is higher priority, but I think we have a pretty good way forward for this.
This issue has been automatically marked as stale because it has not had any activity in 90 days. It will be closed in 600 days if no further activity occurs. Thank you for your contributions to tedana :tada:!
It seems like we've converged on `nib-diff` over other possible checks. I noted that in #452. Once we have tests working for T2* estimation and optimal combination, I think we should at least look at doing it for the full integration tests.
This issue has been automatically marked as stale because it has not had any activity in 90 days. It will be closed in 600 days if no further activity occurs. Thank you for your contributions to tedana :tada:!