glusterfs icon indicating copy to clipboard operation
glusterfs copied to clipboard

error in writing data to file with kernel-writeback-cache=on

Open jirireischig opened this issue 4 years ago • 33 comments

Description of problem:

Error in slow writing data to file on Gluster file system mounted with kernel-writeback-cache=on.

Error in volume.log file:

[2020-09-29 20:46:45.710690] W [MSGID: 114031] [client-rpc-fops_v2.c:2688:client4_0_readv_cbk] 0-webs-h1-client-0: remote operation failed. [{errno=9}, {error=Bad file descriptor}] [2020-09-29 20:46:45.711228] W [MSGID: 114031] [client-rpc-fops_v2.c:2688:client4_0_readv_cbk] 0-webs-h1-client-1: remote operation failed. [{errno=9}, {error=Bad file descriptor}] [2020-09-29 20:46:45.711257] W [fuse-bridge.c:2961:fuse_readv_cbk] 0-glusterfs-fuse: 34417: READ => -1 gfid=53a590d9-1846-4747-a034-c2afed1b3db0 fd=0x7efbd8122ce8 (Bad file descriptor)

The exact command to reproduce the issue:

sample test code in php:

error_reporting(E_ALL);

$testFile = 'tmp/test.log'; $fp = fopen($testFile, 'ab'); if ($fp == FALSE) { echo "write error - $testFile\n"; } else { echo "file $testFile is open for write\n"; $fwrite = fwrite($fp, "chunk 1\n"); if ($fwrite === false) { echo "write error - $testFile\n"; } sleep(40); $message = "looooooooooooong chung 2\n"; $fwrite = fwrite($fp, $message); if ($fwrite === false) { echo "write error after sleep to $testFile\n"; } else { echo "written size $fwrite from ".strlen($message)." after sleep to $testFile\n"; } fclose($fp); };

The full output of the command that failed:

file tmp/test.log is open for write written size 0 from 25 after sleep to tmp/test.log/n

Expected results:

file tmp/test.log is open for write written size 25 from 25 after sleep to tmp/test.log/n

Additional info:

- The output of the gluster volume info command:

Volume Name: webs-h1Type: Replicate Volume ID: 6f95d12b-98f1-4fb2-bbb6-01b18fc04d5e Status: Started Snapshot Count: 0 Number of Bricks: 1 x (2 + 1) = 3 Transport-type: tcp Bricks: Brick1: h1-internal:/data/gluster-webs-h1/brick Brick2: table-internal:/data/gluster-webs-h1/brick Brick3: vona-internal:/data/gluster-webs-h1/brick (arbiter) Options Reconfigured: network.inode-lru-limit: 200000 performance.md-cache-timeout: 600 performance.cache-invalidation: on performance.stat-prefetch: on features.cache-invalidation-timeout: 600 features.cache-invalidation: on transport.address-family: inet storage.fips-mode-rchecksum: on nfs.disable: on performance.client-io-threads: on performance.open-behind: off performance.write-behind-window-size: 2MB

- The operating system / glusterfs version:

CentOS 8.2.2004 Gluster 8.1

jirireischig avatar Sep 29 '20 20:09 jirireischig

@jirireischig The program you gave is performing writes where as the logs you mentioned are doing READs. Are you sure the log messages correspond to the program you mentioned in the issue?

pranithk avatar Sep 30 '20 00:09 pranithk

I'm sorry, I don't known why is any READ in the log. But when i run php code I find this in the Gluster's log file.

I try strace also with this fragment:

3585237 getcwd("/data/webs", 4096) = 11 3585237 lstat("/data/webs/tmp/test.log", {st_mode=S_IFREG|0644, st_size=131, ...}) = 0 3585237 lstat("/data/webs/tmp", {st_mode=S_IFDIR|0755, st_size=3, ...}) = 0 3585237 openat(AT_FDCWD, "/data/webs/tmp/test.log", O_WRONLY|O_CREAT|O_APPEND, 0666) = 5 3585237 fstat(5, {st_mode=S_IFREG|0644, st_size=131, ...}) = 0 3585237 lseek(5, 0, SEEK_CUR) = 0 3585237 lseek(5, 0, SEEK_CUR) = 0 3585237 write(1, "file tmp/test.log is open for wr"..., 36) = 36 3585237 write(5, "chunk 1\n", 8) = 8 3585237 nanosleep({tv_sec=40, tv_nsec=0}, 0x7ffbffff7960) = 0 3585237 write(5, "looooooooooooong chung 2\n", 25) = -1 EBADF (Bad file descriptor) 3585237 write(1, "written size 0 from 25 after sle"..., 52) = 52 3585237 close(5) = 0

First chunk is written successfully immediately after file is open. Then second chunk is not written because of EBADF error.

jirireischig avatar Sep 30 '20 08:09 jirireischig

@jirireischig This amplifies my suspicion. The file is opened for O_WRONLY so reads on the file would fail with EBADFD. I wonder why reads are sent on this fd. Is this a test setup? Can you enable gluster volume profile when this script is executed and attach the output?

You can refer to this link for volume profile usage: https://docs.gluster.org/en/latest/Administrator%20Guide/Monitoring%20Workload/#start-profiling

pranithk avatar Oct 01 '20 00:10 pranithk

@jirireischig Basically we need to know why reads are sent on a file opened for Writing. If you could send the bricks logs as well along with profile info output I mentioned in the comment before, it would be great.

pranithk avatar Oct 01 '20 02:10 pranithk

OK, no problem. I create simple one brick gluster for testing:

gluster volume create write-test h1-internal:/data/gluster-webs-h1/brick-test

Than start and mount it with this options: defaults,_netdev,noatime,acl,direct-io-mode=disable,resolve-gids,kernel-writeback-cache=on

Start profile.

Run the test script with output:

php write-test.php

file tmp/test.log is open for write written size 0 from 25 after sleep to tmp/test.log

Output from profile info:

gluster volume profile write-test info

Brick: h1-internal:/data/gluster-webs-h1/brick-test

Cumulative Stats: Block Size: 32b+ 64b+ 128b+ No. of Reads: 0 0 0 No. of Writes: 7 1 2

Block Size: 256b+ 512b+ 1024b+ No. of Reads: 0 2 0 No. of Writes: 1 2 3

Block Size: 2048b+ 4096b+ 8192b+ No. of Reads: 0 0 0 No. of Writes: 8 41 14

Block Size: 16384b+ 65536b+ 131072b+ No. of Reads: 0 0 0 No. of Writes: 16 3 1

%-latency Avg-latency Min-Latency Max-Latency No. of calls Fop


  0.00       0.00 us       0.00 us       0.00 us             97      FORGET
  0.00       0.00 us       0.00 us       0.00 us             96     RELEASE
  0.00       0.00 us       0.00 us       0.00 us             53  RELEASEDIR
  0.71       8.48 us       7.97 us       8.98 us              2       FLUSH
  3.93      47.28 us      47.15 us      47.41 us              2       FSTAT
  4.18      50.29 us      49.64 us      50.94 us              2        OPEN
  5.22      62.76 us      45.72 us      79.80 us              2        READ
  7.35      88.34 us      86.08 us      90.60 us              2       WRITE
 12.81      76.98 us      66.52 us      82.57 us              4     SETATTR
 15.67      47.11 us      22.29 us      73.86 us              8      LOOKUP
 50.13      29.40 us      21.96 us      39.76 us             41      STATFS

Duration: 858 seconds

Data Read: 1172 bytes Data Written: 1190571 bytes

Interval 1 Stats: Block Size: 32b+ No. of Reads: 0 No. of Writes: 1 %-latency Avg-latency Min-Latency Max-Latency No. of calls Fop


  0.00       0.00 us       0.00 us       0.00 us              1     RELEASE
  0.58       7.97 us       7.97 us       7.97 us              1       FLUSH
  3.35      45.72 us      45.72 us      45.72 us              1        READ
  3.45      47.15 us      47.15 us      47.15 us              1       FSTAT
  3.64      49.64 us      49.64 us      49.64 us              1        OPEN
  6.64      90.60 us      90.60 us      90.60 us              1       WRITE
 12.04      82.18 us      81.79 us      82.57 us              2     SETATTR
 14.77      50.40 us      22.29 us      73.86 us              4      LOOKUP
 55.52      29.14 us      21.96 us      38.74 us             26      STATFS

Duration: 571 seconds

Data Read: 0 bytes Data Written: 57 bytes

And log: cat /var/log/glusterfs/data-write-test.log [2020-10-01 09:54:40.506437] I [glusterfsd-mgmt.c:77:mgmt_cbk_spec] 0-mgmt: Volume file changed [2020-10-01 09:54:40.537981] I [glusterfsd-mgmt.c:2170:mgmt_getspec_cbk] 0-glusterfs: Received list of available volfile servers: vona-internal:24007 table-internal:24007 [2020-10-01 09:54:40.537999] I [MSGID: 101221] [common-utils.c:3822:gf_set_volfile_server_common] 0-gluster: duplicate entry for volfile-server [{errno=17}, {error=File exists}] [2020-10-01 09:54:40.538202] I [MSGID: 0] [options.c:1240:xlator_option_reconf_int32] 0-write-test-client-0: option ping-timeout using set value 42 [2020-10-01 09:54:40.538219] I [MSGID: 0] [options.c:1245:xlator_option_reconf_bool] 0-write-test-client-0: option send-gids using set value true [2020-10-01 09:54:40.538226] I [MSGID: 0] [options.c:1245:xlator_option_reconf_bool] 0-write-test-client-0: option strict-locks using set value off [2020-10-01 09:54:40.538250] I [MSGID: 0] [options.c:1245:xlator_option_reconf_bool] 0-write-test-dht: option lock-migration using set value off [2020-10-01 09:54:40.538259] I [MSGID: 0] [options.c:1245:xlator_option_reconf_bool] 0-write-test-dht: option force-migration using set value off [2020-10-01 09:54:40.538289] I [MSGID: 0] [options.c:1245:xlator_option_reconf_bool] 0-write-test-utime: option noatime using set value on [2020-10-01 09:54:40.538304] I [MSGID: 0] [options.c:1242:xlator_option_reconf_size_uint64] 0-write-test-readdir-ahead: option rda-request-size using set value 131072 [2020-10-01 09:54:40.538312] I [MSGID: 0] [options.c:1242:xlator_option_reconf_size_uint64] 0-write-test-readdir-ahead: option rda-cache-limit using set value 10MB [2020-10-01 09:54:40.538318] I [MSGID: 0] [options.c:1245:xlator_option_reconf_bool] 0-write-test-readdir-ahead: option parallel-readdir using set value off [2020-10-01 09:54:40.538344] I [MSGID: 0] [options.c:1245:xlator_option_reconf_bool] 0-write-test-md-cache: option cache-posix-acl using set value true [2020-10-01 09:54:40.538377] I [MSGID: 0] [options.c:1245:xlator_option_reconf_bool] 0-write-test: option count-fop-hits using set value on [2020-10-01 09:54:40.538383] I [MSGID: 0] [options.c:1245:xlator_option_reconf_bool] 0-write-test: option latency-measurement using set value on [2020-10-01 09:54:40.538395] I [MSGID: 0] [options.c:1236:xlator_option_reconf_str] 0-write-test: option log-level using set value INFO [2020-10-01 09:54:40.538411] I [MSGID: 0] [options.c:1240:xlator_option_reconf_int32] 0-write-test: option threads using set value 16 [2020-10-01 09:54:40.538012] I [MSGID: 101221] [common-utils.c:3822:gf_set_volfile_server_common] 0-gluster: duplicate entry for volfile-server [{errno=17}, {error=File exists}] [2020-10-01 09:54:44.398935] C [mem-pool.c:873:mem_put] (-->/lib64/libglusterfs.so.0(fd_close+0x6a) [0x7efbff20ddda] -->/usr/lib64/glusterfs/8.1/xlator/performance/open-behind.so(+0x744a) [0x7efbf65a144a] -->/lib64/libglusterfs.so.0(mem_put+0x100) [0x7efbff211060] ) 0-mem-pool: invalid argument hdr->pool_list NULL [Invalid argument] [2020-10-01 09:55:24.399855] W [MSGID: 114031] [client-rpc-fops_v2.c:2688:client4_0_readv_cbk] 0-write-test-client-0: remote operation failed. [{errno=9}, {error=Bad file descriptor}] [2020-10-01 09:55:24.399893] W [fuse-bridge.c:2961:fuse_readv_cbk] 0-glusterfs-fuse: 1663: READ => -1 gfid=2d3f810a-8956-48a8-bcdc-d08b9a543118 fd=0x7efbe00172b8 (Bad file descriptor) [2020-10-01 09:56:14.440633] I [glusterfsd-mgmt.c:77:mgmt_cbk_spec] 0-mgmt: Volume file changed [2020-10-01 09:56:14.465776] I [glusterfsd-mgmt.c:2170:mgmt_getspec_cbk] 0-glusterfs: Received list of available volfile servers: vona-internal:24007 table-internal:24007 [2020-10-01 09:56:14.465792] I [MSGID: 101221] [common-utils.c:3822:gf_set_volfile_server_common] 0-gluster: duplicate entry for volfile-server [{errno=17}, {error=File exists}] [2020-10-01 09:56:14.465974] I [MSGID: 0] [options.c:1240:xlator_option_reconf_int32] 0-write-test-client-0: option ping-timeout using set value 42 [2020-10-01 09:56:14.465990] I [MSGID: 0] [options.c:1245:xlator_option_reconf_bool] 0-write-test-client-0: option send-gids using set value true [2020-10-01 09:56:14.465997] I [MSGID: 0] [options.c:1245:xlator_option_reconf_bool] 0-write-test-client-0: option strict-locks using set value off [2020-10-01 09:56:14.466025] I [MSGID: 0] [options.c:1245:xlator_option_reconf_bool] 0-write-test-dht: option lock-migration using set value off [2020-10-01 09:56:14.466035] I [MSGID: 0] [options.c:1245:xlator_option_reconf_bool] 0-write-test-dht: option force-migration using set value off [2020-10-01 09:56:14.466062] I [MSGID: 0] [options.c:1245:xlator_option_reconf_bool] 0-write-test-utime: option noatime using set value on [2020-10-01 09:56:14.466077] I [MSGID: 0] [options.c:1242:xlator_option_reconf_size_uint64] 0-write-test-readdir-ahead: option rda-request-size using set value 131072 [2020-10-01 09:56:14.466085] I [MSGID: 0] [options.c:1242:xlator_option_reconf_size_uint64] 0-write-test-readdir-ahead: option rda-cache-limit using set value 10MB [2020-10-01 09:56:14.466091] I [MSGID: 0] [options.c:1245:xlator_option_reconf_bool] 0-write-test-readdir-ahead: option parallel-readdir using set value off [2020-10-01 09:56:14.466116] I [MSGID: 0] [options.c:1245:xlator_option_reconf_bool] 0-write-test-md-cache: option cache-posix-acl using set value true [2020-10-01 09:56:14.466150] I [MSGID: 0] [options.c:1245:xlator_option_reconf_bool] 0-write-test: option count-fop-hits using set value off [2020-10-01 09:56:14.466157] I [MSGID: 0] [options.c:1245:xlator_option_reconf_bool] 0-write-test: option latency-measurement using set value off [2020-10-01 09:56:14.466171] I [MSGID: 0] [options.c:1236:xlator_option_reconf_str] 0-write-test: option log-level using set value INFO [2020-10-01 09:56:14.466193] I [MSGID: 0] [options.c:1240:xlator_option_reconf_int32] 0-write-test: option threads using set value 16

jirireischig avatar Oct 01 '20 10:10 jirireischig

@jirireischig Looks like a bug in fuse-kernel? When I attach the process to gdb while testing it, fuse-kernel seems to be sending reads on a file opened with 'O_WRONLY'

If I change open flags to ab+, the script executes successfully.

Adding @csabahenk in case he knows more about this issue

pranithk avatar Oct 01 '20 10:10 pranithk

Found a similar problem from 6 years ago! http://fuse.996288.n3.nabble.com/PATCH-fuse-fix-fuse-writeback-problem-with-O-WRONLY-td12699.html

pranithk avatar Oct 01 '20 10:10 pranithk

I have also similar problem with uploaded file by ssh. When upload is fast everything is OK, but when upload is slow (big file) upload is failed with same error in gluster log.

I'm sorry, I can't simulate this easy. It's reason why I create test php code for this. I also have some problem in php application (Joomla - logs and backup) like this.

It's look like there is some problem when writing is "slow".

jirireischig avatar Oct 01 '20 11:10 jirireischig

@jirireischig You are correct. When I am debugging in gdb, if I put a breakpoint, it fails at the first write itself. I think kernel fuse is sending read when there is a pause on the file? Do you see significant bump in performance with kernel-writeback-cache?

pranithk avatar Oct 01 '20 11:10 pranithk

@jirireischig Could you follow up with kernel fuse mailing list as well? Let me include @amarts also in case he knows more about this.You can probably point to the link I pasted to tell them that the same issue happens if the application waits for some time before issuing the next write.

pranithk avatar Oct 01 '20 11:10 pranithk

@jirireischig You are correct. When I am debugging in gdb, if I put a breakpoint, it fails at the first write itself. I think kernel fuse is sending read when there is a pause on the file? Do you see significant bump in performance with kernel-writeback-cache?

I'll be testing gluster without enabled kernel-writeback-cache for our usage for some days. But, it's not good behaviour to fail writing when kernel-writeback-cache is enabled. :-)

jirireischig avatar Oct 02 '20 07:10 jirireischig

@jirireischig Could you follow up with kernel fuse mailing list as well? Let me include @amarts also in case he knows more about this.You can probably point to the link I pasted to tell them that the same issue happens if the application waits for some time before issuing the next write.

I'm sorry, I'm not in kernel fuse mailing list. I don't known if the problem is directly in kernel fuse or in Gluster implementation and how to test fuse with kernel-writeback-cache without Gluster.

jirireischig avatar Oct 02 '20 07:10 jirireischig

@jirireischig You are correct. When I am debugging in gdb, if I put a breakpoint, it fails at the first write itself. I think kernel fuse is sending read when there is a pause on the file? Do you see significant bump in performance with kernel-writeback-cache?

I'll be testing gluster without enabled kernel-writeback-cache for our usage for some days. But, it's not good behaviour to fail writing when kernel-writeback-cache is enabled. :-)

I totally agree. The bug is in kernel fuse as per my debugging. I will let @csabahenk (Contributor to fuse) shine some light on this.

pranithk avatar Oct 02 '20 16:10 pranithk

@jirireischig Could you follow up with kernel fuse mailing list as well? Let me include @amarts also in case he knows more about this.You can probably point to the link I pasted to tell them that the same issue happens if the application waits for some time before issuing the next write.

I'm sorry, I'm not in kernel fuse mailing list. I don't known if the problem is directly in kernel fuse or in Gluster implementation and how to test fuse with kernel-writeback-cache without Gluster.

Gluster is sending the read received from kernel-fuse to the server process as is and it is XFS (In my case the brick is formatted with XFS) which failed the read operation. XFS' behaviour is correct. Read shouldn't have been sent on an fd opened with WRONLY flag. I will close the issue as there won't be a fix in glusterfs as per my understanding. Please feel free to re-open the issue if my debugging is found to be wrong. Please note the discussions can continue. But the issue will be closed as there won't be a patch in glusterfs project that can fix the problem as per my understanding. Happy to be proven wrong

pranithk avatar Oct 02 '20 16:10 pranithk

@jirireischig Could you follow up with kernel fuse mailing list as well? Let me include @amarts also in case he knows more about this.You can probably point to the link I pasted to tell them that the same issue happens if the application waits for some time before issuing the next write.

I'm sorry, I'm not in kernel fuse mailing list. I don't known if the problem is directly in kernel fuse or in Gluster implementation and how to test fuse with kernel-writeback-cache without Gluster.

Gluster is sending the read received from kernel-fuse to the server process as is and it is XFS (In my case the brick is formatted with XFS) which failed the read operation. XFS' behaviour is correct. Read shouldn't have been sent on an fd opened with WRONLY flag. I will close the issue as there won't be a fix in glusterfs as per my understanding. Please feel free to re-open the issue if my debugging is found to be wrong. Please note the discussions can continue. But the issue will be closed as there won't be a patch in glusterfs project that can fix the problem as per my understanding. Happy to be proven wrong

I'm sorry, but its not good close this issue because there is potential of data loss when kernel-writeback-cache is on.

I'm not developer and I don't known in which layer is error, but, may be, its possible that Gluster can "translate" WRONLY flag to any useful variant for under-layer file system.

jirireischig avatar Oct 13 '20 09:10 jirireischig

@jirireischig I raised this issue in the glusterfs community meeting. @csabahenk who has knowledge about kernel fuse will respond to the issue. Translating WRONLY to something more will lead to other kinds of bugs. So it is best to avoid it.

pranithk avatar Oct 13 '20 10:10 pranithk

I will keep this open until Csaba responds to this.

pranithk avatar Oct 13 '20 10:10 pranithk

I plan to look at this in one week's time.

csabahenk avatar Oct 13 '20 10:10 csabahenk

Thank you for your contributions. Noticed that this issue is not having any activity in last ~6 months! We are marking this issue as stale because it has not had recent activity. It will be closed in 2 weeks if no one responds with a comment here.

stale[bot] avatar May 12 '21 04:05 stale[bot]

Please, open this bug, because it's not resolved and there is potential of data loss when kernel-writeback-cache is on.

jirireischig avatar May 12 '21 11:05 jirireischig

If there's a risk, we shouldn't use this. I've read somewhere (#435) that it's not safe:

This mode
assumes that all changes to the filesystem go through the FUSE kernel module
(size and atime/ctime/mtime attributes are kept up-to-date by the kernel), so
it's generally not suitable for network filesystems.

mykaul avatar May 12 '21 12:05 mykaul

I think, that main problem is how writeback-cache is working. Specialy https://github.com/gluster/glusterfs/issues/435#issuecomment-392009971

This means, that even for files opened for O_WRONLY it is possible that READ requests will be generated by the kernel.

May be, its possible that Gluster can "translate" O_WRONLY flag to any useful variant.

jirireischig avatar May 12 '21 15:05 jirireischig

@csabahenk did you get a chance to look into this?

pranithk avatar May 12 '21 16:05 pranithk

I'm not sure why FUSE needs to send a read request in case of partial writes (it could simply send the partial write). Receiving reads on a file opened for write only could be problematic. It would need a more detailed analysis.

But I think that combining the writeback-cache with proper invalidation from gluster to kernel (we have invalidation, but it's not strict enough to ensure consistency right now), we could keep consistency while still getting the maximum performance from the kernel cache.

xhernandez avatar May 12 '21 17:05 xhernandez

This bug can be trivially replicated by the following sequence:

echo "Test" > testfile ; echo "Test 2" >> testfile

This is rather basic filesystem use and I suspect it breaks quite a lot of potential uses.

jpwit avatar Jul 09 '21 20:07 jpwit

Thank you for your contributions. Noticed that this issue is not having any activity in last ~6 months! We are marking this issue as stale because it has not had recent activity. It will be closed in 2 weeks if no one responds with a comment here.

stale[bot] avatar Feb 05 '22 08:02 stale[bot]

Please, don't close this bug, because it's not resolved and there is potential of data loss when kernel-writeback-cache is on.

jirireischig avatar Feb 07 '22 14:02 jirireischig

Please, don't close this bug, because it's not resolved and there is potential of data loss when kernel-writeback-cache is on.

Do you have a patch to fix this? Otherwise, I think it's best either not to use it, ensure it's off by default and provide a warning when enabling it.

mykaul avatar Feb 07 '22 14:02 mykaul

It seems issue is already fixed in libfuse (https://github.com/libfuse/libfuse/commit/b3109e71faf2713402f70d226617352815f6c72e)

mohit84 avatar Feb 07 '22 15:02 mohit84

It seems issue is already fixed in libfuse (libfuse/libfuse@b3109e7)

I mean we have to handle similar way in our codepath

mohit84 avatar Feb 07 '22 15:02 mohit84