
Implement checksums for CircleCI integration tests

Open rmarkello opened this issue 6 years ago • 22 comments

The current mechanism for integration testing with CircleCI is to run the full tedana pipeline and then compare the content of the output files to previously generated files using np.allclose. This procedure ensures that the results of the tedana pipeline haven't changed in any appreciable way, which is critical as refactoring and updating occurs.

I was recently inspired by @WhitakerLab's BrainNetworksInPython repository, however, which uses Python's hashlib library to implement checksum comparisons of their generated outputs. I think this procedure could be applied with great success to the tedana repository!

Rather than having CircleCI re-download the previously generated outputs from Dropbox each time, they could be hashed once and kept in a JSON file in the tests/data folder. The outputs generated during testing could then be hashed and compared against this reference. This would also make testing locally a lot easier (no need to download the 1.9GB output directory!).
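For illustration, here's a minimal sketch of what that comparison could look like. The `tests/data/checksums.json` filename and the helper names are placeholders, not anything that already exists in the repo:

```python
# Sketch only: compare freshly generated outputs against stored reference digests.
# The reference file (e.g., tests/data/checksums.json) would map each output
# filename to a previously computed SHA-256 hash.
import hashlib
import json
import os


def sha256sum(path, blocksize=1 << 20):
    """Hash a file in chunks so large outputs don't need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fobj:
        for chunk in iter(lambda: fobj.read(blocksize), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_outputs(out_dir, reference_json):
    """Assert that every reference file's checksum matches the new output."""
    with open(reference_json) as fobj:
        reference = json.load(fobj)
    mismatches = [
        fname for fname, expected in reference.items()
        if sha256sum(os.path.join(out_dir, fname)) != expected
    ]
    assert not mismatches, f"Outputs changed: {mismatches}"
```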

rmarkello avatar May 17 '18 20:05 rmarkello

Eeeeeexcellent - I’m glad that’s helpful! I’m tagging @islast because she did all that work so she could potentially pass on any advice etc :sparkles:

KirstieJane avatar May 18 '18 06:05 KirstieJane

I'm thrilled that you like our method @rmarkello :smile: My warning is that, because hashing doesn't allow you to use tolerances the way np.allclose does, you have to be very careful about random states (unless tedana is deterministic). We have a different hash for each environment we test in. This is manageable, but it often feels like our continuous integration is more effective at alerting us to dependency updates interfering with the random state or the formatting of written outputs than it is at checking for behavioural changes in BrainNetworksInPython.

Islast avatar May 18 '18 09:05 Islast

This point is well taken, @Islast! Thank you so much for raising it (and thanks for pinging her in, @KirstieJane!).

While tedana is not deterministic, we do try to ensure consistent seeding for test purposes. However, you're completely right: given the sheer number of floating point operations that occur on the data in tedana, we might need numpy's ability to set tolerance thresholds!
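Just to make the tolerance point concrete (a toy example, not tedana data): two arrays that differ only by floating-point noise will essentially never share a checksum, but np.allclose can still call them equal:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = a + 1e-9  # differs only by floating-point-scale noise

np.allclose(a, b)                      # True with the defaults (rtol=1e-5, atol=1e-8)
np.allclose(a, b, rtol=0, atol=1e-12)  # False once the tolerances are tightened
```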

I'm hopeful that with testing occurring inside a Docker container, things will be more or less consistent from run to run, enough so to at least give this a shot. Indeed, in some respects being made aware of dependency changes or alterations to output formats may be a feature of using the hashing approach, not a bug!

rmarkello avatar May 18 '18 14:05 rmarkello

Very cool. We have a similar testing strategy in NIAK and this idea will help a lot. BTW in our case, with proper seeding and a container we exactly replicate outputs.

pbellec avatar May 19 '18 01:05 pbellec

This is a great idea, @rmarkello !

I think I'd like to wait on implementing it until we resolve #14 (which is yielding issues in #18), since the ideal workflow will allow developers to test locally by just downloading a json of the hash values, rather than the current 2GB of dropbox data. This is definitely something for our 0.1.0 milestone, though !

emdupre avatar May 21 '18 19:05 emdupre

Lately we've been making a bunch of breaking changes, so we've been fine with just checking for the existence of specific output files. However, once the final breaking PRs (#164 and #152) are merged in, I want to work on refactoring a lot of the more confusing code (e.g., fitmodels_direct and the component selection). I think we should try implementing this before that point, because I would like to make sure that the contents of the files are correct after cleaning up the code.

tsalo avatar Jan 11 '19 14:01 tsalo

@tsalo @KirstieJane is this formally going to be part of the GSoC student's work if accepted?

jbteves avatar Apr 20 '19 13:04 jbteves

If they're able to do it, then definitely. It will have to come after unit tests though (since they're the low-hanging fruit).

tsalo avatar Apr 22 '19 16:04 tsalo

Oh, no, totally, unit testing is higher priority. I'm just trying to make sure that there's a path forward for this issue. It seems like we should just see if the student can get to it this summer, and if not then we'll have to revisit it ourselves. Seem okay?

jbteves avatar Apr 22 '19 16:04 jbteves

Just a quick note that recent nibabel now has a nib-diff CLI tool that lets you set np.allclose-style tolerances on neuroimages, in case you do need something looser than a checksum.

effigies avatar Sep 19 '19 16:09 effigies

Added to the sprint agenda.

jbteves avatar Oct 29 '19 01:10 jbteves

Is this still being considered @rmarkello ? I didn't see it in #418 ?

emdupre avatar Nov 08 '19 16:11 emdupre

Still being considered, for sure! We're going to need to add regression tests to actually compare the outputs of the integration tests to pre-generated data, but it doesn't (didn't) make sense to do that when so many PRs were going to totally mess with the actual data generation! All our tests would consistently be failing 😢

Once #435, #436, and @handwerkerd's eventual PR to overhaul the decision tree are integrated we can do this. In the interim I can open a PR with an example of what this would look like, though?

rmarkello avatar Nov 10 '19 18:11 rmarkello

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions to tedana :tada:!

stale[bot] avatar Feb 08 '20 18:02 stale[bot]

@rmarkello do you still think this is workable?

jbteves avatar Feb 13 '20 15:02 jbteves

I do, especially if we use nib-diff rather than checksums. That said, I've been totally delinquent on tedana so I'm not sure about the status of some of the more major changes (e.g., the decision tree overhaul). I think it would probably be best to hold off on implementing these until then, because we know those are going to be breaking changes. But if others want to see these done sooner, that's fine; we'll just have to acknowledge those PRs will fail all the tests until we update the data to compare against!

rmarkello avatar Feb 14 '20 14:02 rmarkello

Forgive me if this is incorrect, but doesn't nib-diff require that both files be in memory? In other words, don't you still have to download or upload a large amount of data to use it?

jbteves avatar Feb 14 '20 14:02 jbteves

You're totally right, they would both have to be in memory. I think checksums are just gonna be a real pain with floating-point data, though, so I think we need to do something that's a bit more np.allclose-like. Since the test data are all down-sampled pretty aggressively, do you think it would be a problem to load two images in memory at a time? My thought is that it would never be more than two (the new image and the comparison / baseline).

As it stands, we are (or used to be? at one point we definitely were) downloading the comparison output data on CircleCI every time (and, if I remember correctly, just not using it)!

rmarkello avatar Feb 14 '20 14:02 rmarkello

We're definitely uploading comparison output to CircleCI every time, and not using it. I thought the point of the checksums was to reduce the download/upload need, but I forgot about the upload of artifacts to CI rather than just checksums. I don't think that there's a memory constraint problem, especially with the truncated data. You would have to go pretty extreme to get a memory issue. I wonder about simply running np.allclose then on the floating-point data. You could add this to the integration test file pretty straightforwardly, correct? What's the advantage of a checksum over that strategy?
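Something like the following could go into the integration test file; the paths and helper name here are placeholders, but it shows the two-images-at-a-time comparison being discussed:

```python
# Sketch of an np.allclose-style regression check; only the new image and its
# baseline are ever loaded into memory at once. Paths and names are placeholders.
import nibabel as nib
import numpy as np


def assert_images_allclose(test_file, baseline_file, rtol=1e-5, atol=1e-8):
    test_data = nib.load(test_file).get_fdata()
    baseline_data = nib.load(baseline_file).get_fdata()
    assert np.allclose(test_data, baseline_data, rtol=rtol, atol=atol), (
        f"{test_file} no longer matches baseline {baseline_file}"
    )
```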

jbteves avatar Feb 14 '20 14:02 jbteves

At this point I'm definitely not recommending checksums! The benefit of nib-diff over a straight np.allclose is that the former checks things like the file header, etc., to make sure you're not altering other things besides the actual data content.

Finally, it's worth noting we don't have to call nib-diff from the command line—we could just import the relevant functions (nibabel.cmdline.diff.diff) and use those in the integration tests!
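Roughly like this (I haven't double-checked the exact signature, so the tolerance keyword name below is an assumption to verify against the installed nibabel):

```python
# Rough sketch of calling nibabel's diff machinery directly instead of shelling
# out to nib-diff. The tolerance keyword name is an assumption and should be
# checked against the installed nibabel version.
from nibabel.cmdline.diff import diff


def assert_no_nib_diff(test_file, baseline_file, max_abs=1e-5):
    report = diff([test_file, baseline_file], data_max_abs_diff=max_abs)
    # Assuming diff() returns an empty mapping when headers and data agree
    # within the requested tolerance.
    assert not report, f"nib-diff found differences: {report}"
```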

rmarkello avatar Feb 14 '20 14:02 rmarkello

I'm following this now. I think we'd discussed it briefly at the hackathon, but my memory is trash and I didn't write it down, so unfortunately we're having this conversation a second time. :man_facepalming: Okay, awesome, thanks @rmarkello. Other stuff is higher priority, but I think we have a pretty good way forward for this.

jbteves avatar Feb 14 '20 14:02 jbteves

This issue has been automatically marked as stale because it has not had any activity in 90 days. It will be closed in 600 days if no further activity occurs. Thank you for your contributions to tedana :tada:!

stale[bot] avatar May 14 '20 15:05 stale[bot]

It seems like we've converged on nib-diff over other possible checks. I noted that in #452. Once we have tests working for T2* estimation and optimal combination, I think we should at least look at doing it for the full integration tests.

tsalo avatar Aug 06 '20 15:08 tsalo

This issue has been automatically marked as stale because it has not had any activity in 90 days. It will be closed in 600 days if no further activity occurs. Thank you for your contributions to tedana :tada:!

stale[bot] avatar Nov 05 '20 00:11 stale[bot]