yas3fs behavior when async background upload fails after 3 attempts
Piggybacking on https://github.com/danilop/yas3fs/issues/17
You noted that yas3fs by default attempts 3 times to upload a file in the background after committing it locally to the cache (and reporting to the writer/caller that the write succeeded).
However, what is the behavior if all 3 retries fail? Does yas3fs delete the locally cached file?
If not, could the following options be exposed?
a) deleteCachedFileOrphanAfterUploadFile = i.e. enable purging the locally cached file if the S3 upload process has exhausted all retries
b) Some sort of option to log locally a list of all files (paths) that were written OK to the local cache but failed to upload to S3. This would permit integrations with calling applications, so they could consult this file to clean up metadata that now points to orphaned files (i.e. files that yas3fs said were OK (written locally) but that were never truly written to S3 in the background)
I see your point, but if yas3fs is trying to upload a file to S3 it is because it is a new or updated file, so deleting it means you lose any possibility of recovering it. I would prefer (after the 3rd failure) to wait for some time (e.g. 5 or 15 minutes) and then try again. What do you think?
Yes, I think waiting an additional (configurable) period of time before a second retry cycle would be good. So: different levels of "retry cycles" with a "give up" behavior:
a) retry config = numberOfUploadAttempts = N, sleepMSTime = N
b) retry cycle config = numberOfCycles = N, sleepMSTime = N
c) retryCycleExhaustedAction = {deleteCachedFileOrphan=true, uploadFailureLog=/path/to/log/file?, otherOptionB?}
So personally I would configure this something like:
a) retryConfig: { numberOfUploadAttempts = 3, sleepMSTime = 30000 } // 30s
b) retryCycleConfig: { numberOfCycles = 2, sleepMSTime = 600000 } // 10 min
c) retryCycleExhaustedAction = { deleteCachedFileOrphan = true, uploadFailureLog = /path/to/log/file, localBackupDir = /path/to/dir/to/move/orphan/to }
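To make the two-level retry idea concrete, here is a rough sketch (not yas3fs code; retryConfig/retryCycleConfig are just the proposed option names from above) of how the attempt/cycle nesting could behave:

import time

# Hypothetical sketch of the proposed two-level retry; none of these option
# names exist in yas3fs today.
def upload_with_retry_cycles(upload, retry_config, cycle_config):
    last_exception = None
    for cycle in range(cycle_config['numberOfCycles']):
        for attempt in range(retry_config['numberOfUploadAttempts']):
            try:
                return upload()
            except Exception as e:
                last_exception = e
                time.sleep(retry_config['sleepMSTime'] / 1000.0)
        time.sleep(cycle_config['sleepMSTime'] / 1000.0)
    # all cycles exhausted; the caller would apply retryCycleExhaustedAction
    raise last_exception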
The format of the uploadFailureLog might be as simple as listing the files that failed to upload, along with their local paths, for manual or automated recovery action against the localBackupDir.
The format of this uploadFailureLog file should be pretty clean, straightforward, and simple. In my use case I would likely ingest it via something like logstash and ship it off to an event system etc.
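For illustration only (the field names here are invented, not an existing yas3fs format), one JSON line per failed upload would be trivial to tail with logstash:

import json
import time

# Hypothetical failure-log writer; the path/cache_file/error fields are made up.
def log_upload_failure(log_path, path, cache_file, error):
    entry = {
        'time': time.strftime('%Y-%m-%dT%H:%M:%SZ', time.gmtime()),
        'path': path,              # path as seen through the mount point
        'cache_file': cache_file,  # local copy left behind (the orphan)
        'error': str(error),
    }
    with open(log_path, 'a') as f:
        f.write(json.dumps(entry) + '\n')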
Thoughts?
Any thoughts on this idea?
Bigger architecture change?... Add a --with-plugin-file option that loads a subclass of YAS3FSPlugin; each plugin wraps YAS3FS methods (decorated with @withplugin).
This also means the methods in YAS3FS should be broken up a bit more, i.e. do_on_s3, do_on_s3_now, do_cmd_on_s3_now_w_retries, do_cmd_on_s3_now, and perhaps do_delete_on_s3_now, do_copy_on_s3_now, do_set_c_from_file_on...
For this scenario the yas3fs method would be decorated:
@withplugin
def do_cmd_on_s3_now_w_retries(...):
    last_exception = None
    for i in range(self.retries):
        try:
            self.do_cmd_on_s3_now(...)
            return pub  # pub is among the elided arguments
        except Exception as e:
            last_exception = e
    raise last_exception
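The @withplugin decorator itself isn't shown above; as a rough sketch (assuming the YAS3FS instance keeps the loaded plugin on self.plugin, which is an assumption, not existing code), it could look like:

import functools

def withplugin(fn):
    # If a plugin is loaded and defines a hook with the same name as the
    # wrapped method, let the hook wrap the call; otherwise call it unchanged.
    @functools.wraps(fn)
    def wrapper(self, *args, **kwargs):
        plugin = getattr(self, 'plugin', None)
        if plugin is not None and hasattr(plugin, fn.__name__):
            return getattr(plugin, fn.__name__)(fn)(self, *args, **kwargs)
        return fn(self, *args, **kwargs)
    return wrapper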
and the plugin would be (MyYas3fsPlugin.py)
import os

from yas3fs.YAS3FSPlugin import YAS3FSPlugin

class MyYAS3FSPlugin(YAS3FSPlugin):
    def do_cmd_on_s3_now_w_retries(self, fn):
        def wrapper(*args, **kargs):
            try:
                return fn(*args, **kargs)
            except Exception as e:
                # do failover here... i.e.:
                path = args[1][1]
                action = args[2]
                if args[1][0] == 'upload':
                    cache_file = args[0].cache.cache.get_cache_filename(path)
                    cache_stat = os.stat(cache_file)
                    emailCacheFile(cache_file)  # user-provided helper, not part of yas3fs
                return args[2]  # pub
        return wrapper
It would be run as:
yas3fs ... --with-plugin-file MyYas3fsPlugin.py
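As a sketch of what the loader behind the proposed --with-plugin-file option could do (this option does not exist yet; the discovery logic below is an assumption):

import imp
import inspect

from yas3fs.YAS3FSPlugin import YAS3FSPlugin

def load_plugin(plugin_file):
    # Import the given file and return an instance of the first
    # YAS3FSPlugin subclass found in it.
    module = imp.load_source('yas3fs_plugin', plugin_file)
    for name, cls in inspect.getmembers(module, inspect.isclass):
        if issubclass(cls, YAS3FSPlugin) and cls is not YAS3FSPlugin:
            return cls()
    raise Exception('No YAS3FSPlugin subclass found in %s' % plugin_file)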
What's the difference between do_on_s3 vs do_on_s3_now?
Sync vs async exec?
do_on_s3 adds commands to the S3 queue...
do_on_s3_now runs the commands.
Yes, the _now command is executed immediately, the other one adds it to a queue for async execution.
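For anyone skimming, a simplified illustration of that split (not the actual yas3fs code): do_on_s3 only enqueues a command, and a background worker drains the queue and hands each command to do_on_s3_now.

import threading
try:
    import queue            # Python 3
except ImportError:
    import Queue as queue   # Python 2 (yas3fs' vintage)

s3_queue = queue.Queue()

def do_on_s3_now(cmd):
    pass                    # placeholder for the synchronous S3 call

def do_on_s3(cmd):
    s3_queue.put(cmd)       # returns to the caller immediately

def s3_worker():
    while True:
        do_on_s3_now(s3_queue.get())
        s3_queue.task_done()

worker = threading.Thread(target=s3_worker)
worker.daemon = True
worker.start()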
Minor, but potentially consider renaming some of the methods so the behavior is clearer by just the method name.
We use yas3fs to archive files by tarring them up onto the mounted filesystem, and if that succeeds we remove the originals. With the current behavior, data loss is very likely if we lose write access to the bucket (or never had it). I think there are a few opportunities for improvement, sketched below.
1) Mark files in the cache (perhaps using a different directory in the cache) as pending writes, then move them to the read cache after they are successfully written. Then try to upload them at some point in the future, at least on a subsequent mount.
2) If all writes are failing, stop accepting writes by returning an I/O error or permission denied to the filesystem write.
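On point 2, a minimal sketch of how that could look with fusepy (which yas3fs is built on); the upload_is_known_to_fail() health check is hypothetical, standing in for "recent uploads to the bucket are all failing":

import errno

from fuse import FuseOSError  # fusepy, used by yas3fs

class WriteGuardMixin(object):
    def write(self, path, data, offset, fh):
        # Hypothetical guard: if uploads are known to be failing, surface an
        # I/O error to the writer instead of silently caching the data.
        if self.upload_is_known_to_fail():
            raise FuseOSError(errno.EIO)
        return super(WriteGuardMixin, self).write(path, data, offset, fh)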
I would like to add that I tried to work around this problem by using --s3-num 0. Unfortunately, when I remove the bucket permissions, the yas3fs-mounted filesystem still accepts writes:
# while date > /mnt/testbucket/writetest; do echo -n .; sleep 5; done
.........................
I would expect it to work more like this:
# while date > /root/writetest; do echo -n .; sleep 5; done
-bash: /root/writetest: Permission denied