
Intermittent partial files issue

Open robmoore opened this issue 9 years ago • 14 comments

We are experiencing cases where files are uploaded to one of the nodes but something goes wrong during the upload to S3, so we end up with a partial file. It's infrequent, but it occurs often enough that we won't be able to use yas3fs if we can't resolve it. The log files are clean, and there's no indication of an error on the server the file was initially uploaded to.

We're using the default settings for yas3fs with version 2.3.0.
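
For reference, a mount for a setup like this looks roughly as follows; the bucket, topic ARN, and paths are placeholders, and the flag names should be confirmed against `yas3fs --help` for 2.3.0 (everything not shown is left at its default):

```sh
# Hypothetical two-node style mount: bucket, SNS topic ARN, and paths are placeholders.
# Flag names are taken from the yas3fs README from memory -- verify with `yas3fs --help`.
yas3fs s3://my-bucket/my-prefix /mnt/yas3fs \
    --region us-east-1 \
    --topic arn:aws:sns:us-east-1:123456789012:yas3fs-topic \
    --new-queue \
    --cache-path /var/cache/yas3fs \
    --log /var/log/yas3fs.log
```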

robmoore avatar Apr 02 '15 21:04 robmoore

I would appreciate any tips that might help track down the issue here. yas3fs works most of the time and we'd like to use it but this issue effectively blocks that from happening.

robmoore avatar Apr 07 '15 00:04 robmoore

How big is the file?

What is your multipart split size set to?

Once written on a yas3fs node, does the file exist in its entirety in the local cache? (files are written locally before any s3 upload).

What do the logs state for this file specifically?

Do the other nodes' logs show the receipt of an event for this file after a partial upload to S3?

If you try to retrieve the file from the other nodes, do they also show it in its partial state?

If there is a failure in uploading or a premature termination of the connection, it should be in the logs.

Do you have the s3 retries option set?
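
To answer the cache and S3 questions concretely, a quick check along these lines can be run on the node that received the write; the cache layout under `--cache-path`, the bucket, and the key are assumptions to adjust for your setup:

```sh
# Copy held in the local yas3fs cache (the files/<bucket>/<key> layout is an
# assumption -- look under your --cache-path to find the actual location).
LOCAL=/var/cache/yas3fs/files/my-bucket/my-prefix/path/to/file
ls -l "$LOCAL"
md5sum "$LOCAL"

# Size and ETag of the object S3 actually has (for a non-multipart upload the
# ETag is the MD5 of the content, so the two can be compared directly).
aws s3api head-object --bucket my-bucket --key my-prefix/path/to/file \
    --query '[ContentLength, ETag]'
```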

bitsofinfo avatar Apr 07 '15 16:04 bitsofinfo

@bitsofinfo Thanks for the response.

We are using the defaults for the split size and s3 retry parameters.

The files are 646 bytes.

I'm not sure this is the case for all instances, but the situation I've seen most frequently is that the file is correct on the node it was uploaded to, but not in S3 and, as a result, not on the other node (we are running a 2-node cluster).

Unfortunately, the logs haven't been much help here. We aren't running with debug mode on but could do so if you think it would be helpful.

robmoore avatar Apr 07 '15 18:04 robmoore

OK, so how many bytes of the 646 make it to S3?

Enable debug mode as well, which will give more output as to what is going on, though it is quite verbose.
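
Something like this, run in the foreground until the problem reproduces, should give you what we need; `-d`/`--debug` and `-f`/`--foreground` are the flags I'd expect here, but confirm the exact names with `yas3fs --help`:

```sh
# Foreground mount with debug logging sent to a file (bucket and paths are placeholders).
yas3fs s3://my-bucket/my-prefix /mnt/yas3fs -d -f --log /var/log/yas3fs-debug.log
```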

bitsofinfo avatar Apr 07 '15 19:04 bitsofinfo

For debugging, check that the error still happens when yas3fs is running with one (or two) threads.

This will make your logs much more readable (though still verbose) in debug mode.

ewah avatar Apr 07 '15 21:04 ewah

@ewah Thanks. Just so I'm clear, I'm running the service in debug mode using the -d flag. Regarding the threading, I'm running it in foreground mode (via an upstart script) and don't know how to alter the number of threads. Perhaps you just mean that if I'm running two instances of yas3fs on the same server, I shouldn't enable debug for both?

@bitsofinfo We've seen a variety of sizes from 100KB to 600KB (I just realized I said bytes before when I meant KB).

robmoore avatar Apr 07 '15 21:04 robmoore

Only run one instance per node and just enable debugging as you have, wait for an occurrence of this partial file issue, then include all relevant logs containing a reference to that file path here.

bitsofinfo avatar Apr 07 '15 22:04 bitsofinfo

Also, to clarify the threads point: I think @ewah was just referring to limiting the number of calling threads (writers) hitting the yas3fs mount (i.e., your calling application), to cut down the amount of log activity generated and make it easier to debug. Basically, try to reproduce the scenario with fewer callers to yas3fs.

bitsofinfo avatar Apr 08 '15 13:04 bitsofinfo

Please also check that the temporary file system used as cache is not full when this issue happens.
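
A quick way to verify that on the node that received the write (assuming the cache lives under /var/cache/yas3fs; substitute your `--cache-path` value):

```sh
# Free space on the filesystem backing the yas3fs cache, and the cache's own size.
df -h /var/cache/yas3fs
du -sh /var/cache/yas3fs
```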

danilop avatar Apr 09 '15 11:04 danilop

@danilop Thanks for the tip. We are using a non-temp filesystem for caching and it has plenty of space.

@bitsofinfo I appreciate the clarification. I've been trying to reproduce the issue without success, so I'm working with the QA engineer who originally came across it. One thing to note is that we were originally running the yas3fs script in the background and have since changed to an upstart script with it running in the foreground. I'm curious whether something (perhaps environment variables such as the AWS creds) might behave differently in that situation, since I've only been working with the upstart-script version and haven't been able to reproduce the issue there.
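
In case it helps, a rough reproduction sketch along these lines could be scripted against the mount; the mount point, bucket, prefix, file size, and sleep below are all placeholders/assumptions:

```sh
#!/bin/bash
# Write a batch of small files through the yas3fs mount and flag any whose size
# in S3 does not match what was written locally. All names below are placeholders.
MOUNT=/mnt/yas3fs
BUCKET=my-bucket
PREFIX=my-prefix

for i in $(seq 1 200); do
    f="repro-$i.bin"
    head -c 200K /dev/urandom > "$MOUNT/$f"
    sync
    sleep 5   # arbitrary pause to give yas3fs time to flush the upload to S3
    s3_size=$(aws s3api head-object --bucket "$BUCKET" --key "$PREFIX/$f" \
              --query ContentLength --output text 2>/dev/null)
    local_size=$(stat -c %s "$MOUNT/$f")
    if [ "$s3_size" != "$local_size" ]; then
        echo "MISMATCH $f local=$local_size s3=${s3_size:-missing}"
    fi
done
```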

robmoore avatar Apr 10 '15 14:04 robmoore

There should not be any difference between running it in the foreground vs. the background (other than the obvious case of not running it in a screen session and having your session disconnected).

bitsofinfo avatar Apr 10 '15 14:04 bitsofinfo

@bitsofinfo Thanks. Another difference between the initial script and the upstart script is that we export the AWS creds rather than just defining the environment variables locally before running yas3fs. It occurred to me while creating the upstart script that if anything is launched in a separate process, it wouldn't have access to the creds. I would expect that to result in an explicit error, but perhaps it fails silently?
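
For reference, the upstart job is roughly of this shape; the paths, ARN, and credentials below are placeholders, and the `export` lines inside the `script` block are what make the creds visible to yas3fs and to anything it spawns:

```
# /etc/init/yas3fs.conf -- hypothetical upstart job; every value here is a placeholder.
description "yas3fs S3 mount"
start on (filesystem and net-device-up IFACE!=lo)
stop on runlevel [!2345]
respawn

script
    # Exported here so yas3fs and any child process it starts inherit the creds.
    export AWS_ACCESS_KEY_ID=AKIAPLACEHOLDER
    export AWS_SECRET_ACCESS_KEY=placeholder-secret
    exec yas3fs s3://my-bucket/my-prefix /mnt/yas3fs \
        --topic arn:aws:sns:us-east-1:123456789012:yas3fs-topic \
        --new-queue -f --log /var/log/yas3fs.log
end script
```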

robmoore avatar Apr 10 '15 20:04 robmoore

@robmoore Out of curiosity was this ever solved?

rusterholz avatar Sep 25 '15 02:09 rusterholz

@rusterholz We were able to get past this but unfortunately I can't recall what steps we took. I no longer work at the company I was at when I ran into this so I no longer have access to the environment. Sorry I couldn't be helpful here.

robmoore avatar Sep 27 '15 17:09 robmoore