
ls: cannot open directory '...': Transport endpoint is not connected

tchaton opened this issue 1 year ago · 9 comments

Mountpoint for Amazon S3 version

1.1.1 with caching

AWS Region

us-east-1

Describe the running environment

Running on Amazon EC2

What happened?

This happens quite frequently for us, in roughly 7 out of 10 runs of our filesystem tests.

ls: cannot open directory `....`: Transport endpoint is not connected

Relevant log output

The only relevant log lines I can see are the following.

2023-11-24T13:46:44.170754913Z 2023-11-24T13:46:44.170582Z  WARN lookup{req=44 ino=1 name="Uploads"}: mountpoint_s3::fuse: lookup failed: inode error: file does not exist
2023-11-24T13:46:44.458244689Z 2023-11-24T13:46:44.458094Z  WARN lookup{req=46 ino=2 name="01hg0s363ta4kkvwyhcgvk83zc"}: mountpoint_s3::fuse: lookup failed: inode error: file does not exist
2023-11-24T13:46:51.310283712Z 2023-11-24T13:46:51.310113Z  WARN readdirplus{req=52 ino=1 fh=2 offset=1}: mountpoint_s3::fuse: readdirplus failed: out-of-order readdir, expected=4, actual=1

cc @dannycjones @passaro

tchaton avatar Nov 24 '23 12:11 tchaton

Additionally, we are observing a CPU spike every minute with --enable-metadata-caching --metadata-cache-ttl 60. I was hoping the listing would be lazy, i.e. if users don't list or otherwise interact with the mount, no listing is done.

tchaton avatar Nov 24 '23 14:11 tchaton

Hi @tchaton, thanks for raising the issue. I see you were using a custom build of 1.1.1 with caching. Have you since upgraded to 1.2.0? Note that the flags to configure caching are different from the pre-release version. Once you upgrade, could you report if you are still observing the issue on 1.2.0?

Are you able to share more details on the workload you ran before seeing the error on ls? Do you get similar errors when running other commands? Or just ls? Is the mount-s3 process still running when the error occurs?

EDIT: for help with the new configuration flags, see this section in the docs.
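For reference, a minimal example of mounting with the 1.2.0 caching flags (the bucket name and paths here are placeholders; see the linked docs for the authoritative options):

mount-s3 my-bucket /mnt/my-bucket --cache /tmp/mountpoint-cache --metadata-ttl 60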

passaro avatar Nov 27 '23 15:11 passaro

About the CPU spikes: Mountpoint does not proactively refresh metadata when it expires. So it should behave just as you were expecting. I suspect that the activity you are observing is due to applications accessing the filesystem and the kernel in turn requesting updated metadata from Mountpoint.
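If you want to confirm that, one option (just a suggestion, not an official diagnostic) is to sample the CPU usage of the mount-s3 process and check whether the spikes line up with application access:

# sample mount-s3 CPU usage once per second (requires the sysstat package;
# assumes a single mount-s3 process is running)
pidstat -p "$(pgrep -x mount-s3)" 1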

passaro avatar Nov 27 '23 15:11 passaro

Hey @passaro, let me update and give you more feedback.

tchaton avatar Nov 30 '23 09:11 tchaton

@passaro But if you want to see some failures, you can do something like this:

Create a bucket with 1M files of random sizes ranging from 100 KB to 10 GB (a sketch for generating such files follows below).

Then copy all the files from the mount to another bucket while pushing the machine's CPU usage toward 100% (I am using a machine with 32 or 64 CPU cores):

docker run --rm -v ~/.aws:/root/.aws -v /{mount_to_bucket_1}/:/data/ peakcom/s5cmd --numworkers {2 * cpu_cores} cp /data/ s3://bucket_2

This always fails for me. However, other open-source solutions are more reliable under the same stress.
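For anyone trying to reproduce: a rough sketch of how one might populate bucket_1 (scaled down; the file count, size range, and bucket name are placeholders to adapt):

# generate files of random sizes and upload them with s5cmd
COUNT=1000   # scale up toward 1M for the real test
mkdir -p /tmp/testdata
for i in $(seq 1 "$COUNT"); do
  size=$(( (RANDOM % 100 + 1) * 102400 ))   # 100 KB to ~10 MB here; widen the range for the real test
  head -c "$size" /dev/urandom > "/tmp/testdata/file_$i"
done
s5cmd --numworkers 64 cp "/tmp/testdata/*" s3://bucket_1/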

tchaton avatar Nov 30 '23 19:11 tchaton

@tchaton, unfortunately, I was not able to reproduce the issue with the command you suggested. It may depend on specific factors like the content of your bucket or the load on your instance.

However, my (unconfirmed) suspicion is that you are seeing the result of an out-of-memory issue, similar to the one reported in #502. Would you be able to check whether your syslog contains lines similar to these (once you reproduce the Transport endpoint is not connected error):

kernel: Out of memory: Killed process 2684 (mount-s3)
systemd[1]: session-1.scope: A process of this unit has been killed by the OOM killer. 
systemd[1]: session-1.scope: Killing process 3172 (docker) with signal SIGKILL.

passaro avatar Dec 05 '23 18:12 passaro

Hey @passaro, I will try again. For the syslog, what do you mean exactly? How can I check it?

tchaton avatar Dec 13 '23 13:12 tchaton

You can probably use journalctl. For example, the lines I copied above were extracted from the output of this command:

journalctl -t systemd -t kernel

journalctl should be available on most modern Linux distributions, including Amazon Linux. On other systems, syslog entries are likely written to a file such as /var/log/syslog.
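If there is a lot of output, you can narrow it down to OOM-killer messages, e.g. (the grep pattern is just a suggestion):

journalctl -t systemd -t kernel | grep -iE 'out of memory|oom'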

passaro avatar Dec 13 '23 14:12 passaro

I also encountered this error, first when using s3fs and now with mountpoint-s3.

I am applying a solution that I described in this comment: https://github.com/s3fs-fuse/s3fs-fuse/issues/2356#issuecomment-1791770501
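For context, a common mitigation for a stale FUSE mount (which may or may not be the approach described in that comment) is a small watchdog that remounts when the error shows up. A rough sketch, with placeholder names:

#!/bin/sh
# hypothetical watchdog: remount when the FUSE endpoint goes stale;
# MOUNT_POINT and BUCKET are placeholders, not the linked solution
MOUNT_POINT=/mnt/my-bucket
BUCKET=my-bucket
while true; do
  if ! ls "$MOUNT_POINT" >/dev/null 2>&1; then
    fusermount -uz "$MOUNT_POINT" 2>/dev/null   # lazily unmount the dead endpoint
    mount-s3 "$BUCKET" "$MOUNT_POINT"
  fi
  sleep 30
done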

nguyenminhdungpg avatar May 07 '24 14:05 nguyenminhdungpg