bits-service
bits-service copied to clipboard
Proper upload stategy to prevent multiple BITS services from uploadin…
…g to the same blob, and updated Azure/Go SDK.
- When uploading a blob, the upload now contains a Content-MD5.
- Uploads are only done if necessary, i.e.
- if the blob doesn't yet exist, or
- if it has a different size, or
- if it has the same size, but due to lack of a Content-MD5 can't be tracked whether the existing blob has the same contents, or
- if it has a ContentMD5, but the local / to-be-uploaded contents differ from the existing blob.
- When uploading to a certain location, we first upload to a temporary name, and then we copy within the storage account to the proper location.
- This copy operation is protected by an Azure blob lease, to prevent two concurrent copy operations.
- Each block blob upload is also protected against accidental modifications with a MD5
- The SDK has been upgraded from a really old one, to the current Azure Go SDK (which supports context.Context) to allow subsequent parallelization of uploads
We have created an issue in Pivotal Tracker to manage this:
https://www.pivotaltracker.com/story/show/164656890
The labels on this github issue will be updated when the story is started.
:x: Hey chgeuer!
All pull request submitters and commit authors must have a Contributor License Agreement (CLA). Click here for details on the CLA process.
The following github user @chgeuer is not covered by a CLA.
After the CLA process is complete, this pull request will need to be closed & reopened. DreddBot will then validate the CLA(s).
I guess Microsoft should have a CLA in place, I need to check... I published my potential memberships...
We have created an issue in Pivotal Tracker to manage this:
https://www.pivotaltracker.com/story/show/164664843
The labels on this github issue will be updated when the story is started.
:white_check_mark: Hey chgeuer! The commit authors and yourself have already signed the CLA.
Hi @smoser-ibm , @petergtz mentioned he'd be OOF for the next 2 months... Is there any chance you have a look at this PR?
Hi @suhlig, is that something you could help with?
Hi @chgeuer ,
Thanks for all your work. Your changes obviously do a lot more than just fix the issues we’ve seen with SAP’s Azure envs. This is very welcome, because we were aware of the room for optimization here, but simply weren’t able to address it due to bandwidth constraints.
Unfortunately, it makes it more difficult to merge, because we cannot merge portions that are uncontroversial, while iterating on those portions that need more discussion. That was the reason why I couldn’t merge it before I went on leave. I think what would make sense here is to tear apart those different portions and submit them as separate PRs. We’re happy to merge them then.
What we discussed via Slack, was: Avoid conflicting concurrent uploads via an “upload to temp path, then copy and delete”. That’s what I have done in my trivial commit in https://github.com/cloudfoundry-incubator/bits-service/commit/0422b42f2677037947bc920d20ff801a66d73eac ahead of your PR submission. My change is trivial, but should exactly address the problems that we saw in SAP’s Azure envs.
Assuming the above fixes the uploads, we can work on optimizations and improvements that you started:
- Size comparison, MD5 sum comparison etc. to avoid unnecessary uploads
- concurrent block uploads to further optimize times
- SDK update
Each should go into a separate PR as explained above.
Further comments in regards to your changes:
- The leases introduced for the copy operation will break the logic: if multiple concurrent uploads to the same blob take place, one of them will now fail with an error, likely the latter one. As discussed with you, this will not work in our use case. We need a last writer wins strategy.
- Your code fails when reads from local files don’t return all bytes (see doc for https://golang.org/pkg/os/#File.Read which explains that a read doesn’t necessarily return all bytes).
- The new version of the Put function is now long enough that it should probably be refactored. Ideally the refactoring would allow to partially unit-test the logic.
I hope my comments are helpful and I’m happy to discuss this further with you. Please keep in mind though that we’re still limited in bandwidth with what we can support and merge at this point in time. So please be patient with us as we cannot give any promises as to when we can actually merge stuff.
Thanks, Peter
Hi everyone,
we have validated this PR on CF v8.1.0 with the corresponding Cloud Foundry Acceptance Tests (CATS). All tests that are applicable for our CF deployment are green. The Bits-Service logs are more informative now.
As we are still interested to get the PR merged, can you please consider refactoring and merging the fix?
Thanks and Best Regards,
@FloThinksPi and Jochen.
Hey @jochenehret many thanks for running the CATS test suite, that's great. Thanks for the effort.
Hi @jochenehret, hi @FloThinksPi,
Thanks for your help here. Note that CATs are not sufficient to prove that the code works. We run CATs multiple times a day with Azure storage and still, we never saw the errors that Michael from SAP saw in their large Azure landscape.
Hi @chgeuer, finally able to follow up on this. Before diving too deep into the details of this change, could you please restate the problem that you are trying to solve with this PR?