gcp/saver: Only return errors.KindAlreadyExists if all three exist
What is the problem I am trying to address?
In #1124, a GCP lock type was added as a singleflight backend. As part of this work, the GCP storage backend's Save() was made serial, likely because moduploader.Upload requires a call to Exists() before it, rendering the GCP lock less useful, by doubling the calls to GCS.
However, by doing this, the existence check was now only checking the existence of the mod file, and not the info or zip. This meant that if during a Save(), the zip or info uploads failed, on subsequent rquests, that when using the GCP singleflight backend, Athens woul assume everything had been stashed and saved properly, and then fail to serve up the info or zip that had failed upload, meaning the cache was in an unhealable broklen state, requiring a manual intervention.
How is the fix applied?
To fix this, without breaking the singleflight behavior, introduce a metadata key that is set on the mod file during its initial upload, indicating that a Stash is still in progress on subsequent files, which gets removed once all three files are uploaded successfully, which can be checked if it it is determined that the mod file already exists. That way we can return a errors.KindAlreadyExists if a Stash is in progress, but also properly return it when a Stash is not currently in progress if and only if all three files exist on GCS, which prevents the cache from becoming permanently poisoned.
One note is that it is possible the GCS call to remove the metadata key fails, which would mean it is left on the mod object forever. To avoid this, consider it stale after 2 minutes.
What GitHub issue(s) does this PR fix or close?
N/A
Extra Notes
This is my first time touching the Athens codebase, so apologies for any faux pas. We hit this at $dayjob, so thought I would take a stab at fixing it.
It is a more complex than I would like, so comments very welcome.
This also contains a small fix I found when writing the test: if io.Copy failed during an upload to GCS, it would call Close() on the writer, which would leave a partial/broken file. The storage docs say to cancel the context to prevent this.
@dwbuiten thanks for your contribution to Athens! I'm doing a release today, but I'll get to reviewing this and we'll see if we can get it into one of the fix releases.
Just off the bat, one thing that'd help show what this is fixing is if you're able to write tests to it in E2E or unit tests.
@matt0x6F Sure - adding a unit test to with_gcs_test.go. And it is a good thing too... I noticed the test there does not run by default (without GCP creds in the env), and when I ran the existing test, I realized this fix actually breaks singleflight, while fixing the poison cache problem. I need to ponder a bit on the subject, but will update this PR with a different fix + test after.
@matt0x6F It ended up being more complex than I thought (or would like it to be), but old and new tests all pass now. I updated the PR description to reflect that.
CCing @dfinkel who helped me think this through.