Upload issue reported by Codefi_keyz
After investigating the issue, the only problem I noticed on the Atlas side is that we blacklist the operator and try the next one (out of max 3) even if the connection problem was on the user's side.
The only thing I can propose on the Atlas side is a check of the user's connection after the initial upload request fails. This information can be handled in a few ways:
- We can introduce an interval that will check if the connection is restored and try to upload the video (from the beginning) without blacklisting the operator.
- We can abort and just inform the user that the upload failed because of their connection, and that they will need to retry the upload after regaining internet access.
- (REQUIRES INFRA CHANGE) Theoretically if the problem is on the user side, we could save the offset on which the upload stopped and when the connection is back we could resume uploading the file.
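To make the first two options concrete, here is a minimal sketch of a retry loop that checks the client's connection before blaming the operator. The names (`uploadFn`, `isOnline`, `waitMs`) and the poll count are illustrative assumptions, not actual Atlas APIs; in the browser `isOnline` could be `() => navigator.onLine`.

```typescript
type UploadFn = () => Promise<void>;

// Sketch: retry an upload without blacklisting the operator when the
// failure looks like a client-side connectivity problem.
async function uploadWithConnectionCheck(
  uploadFn: UploadFn,
  isOnline: () => boolean, // e.g. () => navigator.onLine in the browser
  waitMs: (ms: number) => Promise<void>,
  maxChecks = 10,
): Promise<'ok' | 'client-offline' | 'operator-error'> {
  try {
    await uploadFn();
    return 'ok';
  } catch {
    // Poll until the connection returns, then retry from the beginning.
    for (let i = 0; i < maxChecks; i++) {
      if (isOnline()) {
        try {
          await uploadFn();
          return 'ok';
        } catch {
          // Online but still failing: this is the operator's fault,
          // so the existing blacklist-and-switch path applies.
          return 'operator-error';
        }
      }
      await waitMs(1000);
    }
    // Still offline after maxChecks: abort and tell the user to retry later.
    return 'client-offline';
  }
}
```

Note the upload still restarts from the beginning on retry; resuming from an offset is the option marked as requiring an infra change.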
I think we need to get away from this client-side model entirely; it's an artefact of the first model of Atlas, where there was no backend. I think we need to bite the bullet and have Atlas just pick a random operator which is supposed to be suitable, retry if it fails, and report the errors. The fundamental operational point is that the infrastructure has to be responsive to errors and self-correct, which is not for Orion to explicitly deal with. A v2 version of this is to have Orion be slightly more helpful, and have it suggest operators based on some local state about what is failing or not, but even this seems like a bad path really. What I'm suggesting will possibly lead to more problems afflicting users, but it will force us, the logging, and the leads to mature enough to handle this properly.
The solution we discussed further was to introduce some sort of state for operators kept in Orion, which can persist across user sessions (because right now a new session would start uploading to the same buckets).
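A minimal sketch of what that persisted operator state could look like. The class name, storage key shape, and the 30-minute TTL are assumptions for illustration; the storage is injected as a `Map` so the same logic could sit behind `localStorage` on the client or a table in Orion.

```typescript
interface OperatorState {
  blacklistedAt: number; // epoch ms when the operator last failed
}

// Assumed policy: forgive a blacklisted operator after 30 minutes.
const BLACKLIST_TTL_MS = 30 * 60 * 1000;

class OperatorBlacklist {
  // `storage` abstracts the persistence layer; `now` is injectable for tests.
  constructor(
    private storage: Map<string, string>,
    private now: () => number = Date.now,
  ) {}

  blacklist(operatorId: string): void {
    const state: OperatorState = { blacklistedAt: this.now() };
    this.storage.set(operatorId, JSON.stringify(state));
  }

  isBlacklisted(operatorId: string): boolean {
    const raw = this.storage.get(operatorId);
    if (!raw) return false;
    const state: OperatorState = JSON.parse(raw);
    // Entries older than the TTL no longer count as blacklisted.
    return this.now() - state.blacklistedAt < BLACKLIST_TTL_MS;
  }
}
```

With a TTL, a new session would skip recently failing buckets instead of uploading to the same ones again, without permanently shrinking the operator pool.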
However, this would not solve the issue raised above without infra changes that allow resuming the asset upload when a client-side upload fails.
After another round of discussions, the suggestion was to first introduce the means to distinguish between client and operator failure. In this case we can inform the user and the logging system correctly.
User perspective:
- When network issues prevent a successful upload, show an error message to the user and try the upload again (from scratch).
Issues with current logging:
- a failure is logged for each failing bucket (the error remains even when switching buckets leads to a successful upload)
- errors are logged the same way whether they come from the client or the infra side
Scope
From systems perspective:
- [ ] Log the error appropriately
From user perspective
- [ ] Customise the error copy: "Check your network and try again." for client-side failures, "Infrastructure had an unexpected error. Please try again." for infra-side failures.
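A sketch of the two scope items together: classify the failure origin, then pick the copy. Only the two message strings come from the discussion above; the failure type names and the 5xx heuristic are assumptions (a `fetch()` that never got a response usually means a client network problem, while an HTTP 5xx means the infra answered but failed).

```typescript
type UploadFailure = 'client-network' | 'infra';

// Hypothetical heuristic: no HTTP status means the request never completed
// (client network issue); a 5xx status means the infra side failed.
// Handling of other statuses is out of scope for this sketch.
function classifyFailure(status?: number): UploadFailure {
  if (status !== undefined && status >= 500) return 'infra';
  return 'client-network';
}

// The two copy variants from the scope checklist above.
function uploadErrorCopy(failure: UploadFailure): string {
  switch (failure) {
    case 'client-network':
      return 'Check your network and try again.';
    case 'infra':
      return 'Infrastructure had an unexpected error. Please try again.';
  }
}
```

The same `UploadFailure` value can be attached to the log event, which addresses the "log errors the same way" issue as well.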
I think this fix is quite easy, so let's go with it for now; user uploads are still a minority of cases with yt-synch, so it's quite acceptable.
Is it today the case that at least on downloads, Atlas is not having to deal with what specific host to download from, that Orion just gives a full URL, or is that also not true?
> However this would not solve the issue raised above without changes to the infra which allows to continue uploading the asset when client upload fails.
Can I ask for a more specific reference on this?
From the updated logs we know that this issue is caused by clients with unstable connections. Currently, we have 3 redirects with a 1s interval for the storage bucket, and after that the upload starts over. It's now clear to me that this is not an optimal approach because:
- only 3s of lost connection is enough to fail the upload
- after a failed upload, a new storage bucket will be picked, but the upload will have to start all over again
- the new bucket will most certainly be located even further from the client's location
Because of these 3 points, it's almost impossible to achieve a successful upload by switching buckets for clients with unstable connections. Here's what we can do instead:
a) if a client-side error is detected, progressively increase the interval between retries and maybe increase the number of retries, e.g. 7 retries with a 1-7s timeout (timeout in seconds == retry number)
b) send the file in basic chunks with a calculated resumable point (e.g. 1 MB chunks, with the number of chunks sent kept in localStorage)
c) use a 3rd-party library for chunked uploads (I'm only familiar with Resumable, but maybe there's something more suitable for our case)
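The arithmetic behind options (a) and (b) can be sketched in a few lines; the constants mirror the numbers suggested above (1 MB chunks, wait time equal to the retry number in seconds), and the function names are placeholders.

```typescript
const CHUNK_SIZE = 1024 * 1024; // 1 MB, as in option (b)

// Option (a): linear backoff, retry n waits n seconds (1s, 2s, ..., 7s).
function backoffMs(retryNumber: number): number {
  return retryNumber * 1000;
}

// Option (b): with the number of acknowledged chunks persisted
// (e.g. in localStorage), the upload can resume at this byte offset
// instead of starting from scratch.
function resumeOffset(chunksSent: number): number {
  return chunksSent * CHUNK_SIZE;
}

function totalChunks(fileSize: number): number {
  return Math.ceil(fileSize / CHUNK_SIZE);
}
```

Option (a) needs no infra change; options (b) and (c) both depend on the storage side accepting ranged/chunked uploads, which is the infra change flagged earlier in the thread.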
We are going to move to an upload model which is proxied via Orion, where Orion accepts the upload, does post-processing (transcoding, image previews, subtitling, toxic content detection, etc.), and then executes the full upload to the infra. So let's try to pick a fix that limits the complexity introduced in Atlas itself, since the long-term path is to make Atlas simpler and simpler, and Orion more and more complex.