glusterfs
glusterfs copied to clipboard
error in writing data to file with kernel-writeback-cache=on
Description of problem:
Error in slow writing data to file on Gluster file system mounted with kernel-writeback-cache=on.
Error in volume.log file:
[2020-09-29 20:46:45.710690] W [MSGID: 114031] [client-rpc-fops_v2.c:2688:client4_0_readv_cbk] 0-webs-h1-client-0: remote operation failed. [{errno=9}, {error=Bad file descriptor}] [2020-09-29 20:46:45.711228] W [MSGID: 114031] [client-rpc-fops_v2.c:2688:client4_0_readv_cbk] 0-webs-h1-client-1: remote operation failed. [{errno=9}, {error=Bad file descriptor}] [2020-09-29 20:46:45.711257] W [fuse-bridge.c:2961:fuse_readv_cbk] 0-glusterfs-fuse: 34417: READ => -1 gfid=53a590d9-1846-4747-a034-c2afed1b3db0 fd=0x7efbd8122ce8 (Bad file descriptor)
The exact command to reproduce the issue:
sample test code in php:
error_reporting(E_ALL);
$testFile = 'tmp/test.log'; $fp = fopen($testFile, 'ab'); if ($fp == FALSE) { echo "write error - $testFile\n"; } else { echo "file $testFile is open for write\n"; $fwrite = fwrite($fp, "chunk 1\n"); if ($fwrite === false) { echo "write error - $testFile\n"; } sleep(40); $message = "looooooooooooong chung 2\n"; $fwrite = fwrite($fp, $message); if ($fwrite === false) { echo "write error after sleep to $testFile\n"; } else { echo "written size $fwrite from ".strlen($message)." after sleep to $testFile\n"; } fclose($fp); };
The full output of the command that failed:
file tmp/test.log is open for write written size 0 from 25 after sleep to tmp/test.log/n
Expected results:
file tmp/test.log is open for write written size 25 from 25 after sleep to tmp/test.log/n
Additional info:
- The output of the gluster volume info
command:
Volume Name: webs-h1Type: Replicate Volume ID: 6f95d12b-98f1-4fb2-bbb6-01b18fc04d5e Status: Started Snapshot Count: 0 Number of Bricks: 1 x (2 + 1) = 3 Transport-type: tcp Bricks: Brick1: h1-internal:/data/gluster-webs-h1/brick Brick2: table-internal:/data/gluster-webs-h1/brick Brick3: vona-internal:/data/gluster-webs-h1/brick (arbiter) Options Reconfigured: network.inode-lru-limit: 200000 performance.md-cache-timeout: 600 performance.cache-invalidation: on performance.stat-prefetch: on features.cache-invalidation-timeout: 600 features.cache-invalidation: on transport.address-family: inet storage.fips-mode-rchecksum: on nfs.disable: on performance.client-io-threads: on performance.open-behind: off performance.write-behind-window-size: 2MB
- The operating system / glusterfs version:
CentOS 8.2.2004 Gluster 8.1
@jirireischig The program you gave is performing writes where as the logs you mentioned are doing READs. Are you sure the log messages correspond to the program you mentioned in the issue?
I'm sorry, I don't known why is any READ in the log. But when i run php code I find this in the Gluster's log file.
I try strace also with this fragment:
3585237 getcwd("/data/webs", 4096) = 11 3585237 lstat("/data/webs/tmp/test.log", {st_mode=S_IFREG|0644, st_size=131, ...}) = 0 3585237 lstat("/data/webs/tmp", {st_mode=S_IFDIR|0755, st_size=3, ...}) = 0 3585237 openat(AT_FDCWD, "/data/webs/tmp/test.log", O_WRONLY|O_CREAT|O_APPEND, 0666) = 5 3585237 fstat(5, {st_mode=S_IFREG|0644, st_size=131, ...}) = 0 3585237 lseek(5, 0, SEEK_CUR) = 0 3585237 lseek(5, 0, SEEK_CUR) = 0 3585237 write(1, "file tmp/test.log is open for wr"..., 36) = 36 3585237 write(5, "chunk 1\n", 8) = 8 3585237 nanosleep({tv_sec=40, tv_nsec=0}, 0x7ffbffff7960) = 0 3585237 write(5, "looooooooooooong chung 2\n", 25) = -1 EBADF (Bad file descriptor) 3585237 write(1, "written size 0 from 25 after sle"..., 52) = 52 3585237 close(5) = 0
First chunk is written successfully immediately after file is open. Then second chunk is not written because of EBADF error.
@jirireischig This amplifies my suspicion. The file is opened for O_WRONLY so reads on the file would fail with EBADFD. I wonder why reads are sent on this fd. Is this a test setup? Can you enable gluster volume profile when this script is executed and attach the output?
You can refer to this link for volume profile usage: https://docs.gluster.org/en/latest/Administrator%20Guide/Monitoring%20Workload/#start-profiling
@jirireischig Basically we need to know why reads are sent on a file opened for Writing. If you could send the bricks logs as well along with profile info output I mentioned in the comment before, it would be great.
OK, no problem. I create simple one brick gluster for testing:
gluster volume create write-test h1-internal:/data/gluster-webs-h1/brick-test
Than start and mount it with this options: defaults,_netdev,noatime,acl,direct-io-mode=disable,resolve-gids,kernel-writeback-cache=on
Start profile.
Run the test script with output:
php write-test.php
file tmp/test.log is open for write written size 0 from 25 after sleep to tmp/test.log
Output from profile info:
gluster volume profile write-test info
Brick: h1-internal:/data/gluster-webs-h1/brick-test
Cumulative Stats: Block Size: 32b+ 64b+ 128b+ No. of Reads: 0 0 0 No. of Writes: 7 1 2
Block Size: 256b+ 512b+ 1024b+ No. of Reads: 0 2 0 No. of Writes: 1 2 3
Block Size: 2048b+ 4096b+ 8192b+ No. of Reads: 0 0 0 No. of Writes: 8 41 14
Block Size: 16384b+ 65536b+ 131072b+ No. of Reads: 0 0 0 No. of Writes: 16 3 1
%-latency Avg-latency Min-Latency Max-Latency No. of calls Fop
0.00 0.00 us 0.00 us 0.00 us 97 FORGET
0.00 0.00 us 0.00 us 0.00 us 96 RELEASE
0.00 0.00 us 0.00 us 0.00 us 53 RELEASEDIR
0.71 8.48 us 7.97 us 8.98 us 2 FLUSH
3.93 47.28 us 47.15 us 47.41 us 2 FSTAT
4.18 50.29 us 49.64 us 50.94 us 2 OPEN
5.22 62.76 us 45.72 us 79.80 us 2 READ
7.35 88.34 us 86.08 us 90.60 us 2 WRITE
12.81 76.98 us 66.52 us 82.57 us 4 SETATTR
15.67 47.11 us 22.29 us 73.86 us 8 LOOKUP
50.13 29.40 us 21.96 us 39.76 us 41 STATFS
Duration: 858 seconds
Data Read: 1172 bytes Data Written: 1190571 bytes
Interval 1 Stats: Block Size: 32b+ No. of Reads: 0 No. of Writes: 1 %-latency Avg-latency Min-Latency Max-Latency No. of calls Fop
0.00 0.00 us 0.00 us 0.00 us 1 RELEASE
0.58 7.97 us 7.97 us 7.97 us 1 FLUSH
3.35 45.72 us 45.72 us 45.72 us 1 READ
3.45 47.15 us 47.15 us 47.15 us 1 FSTAT
3.64 49.64 us 49.64 us 49.64 us 1 OPEN
6.64 90.60 us 90.60 us 90.60 us 1 WRITE
12.04 82.18 us 81.79 us 82.57 us 2 SETATTR
14.77 50.40 us 22.29 us 73.86 us 4 LOOKUP
55.52 29.14 us 21.96 us 38.74 us 26 STATFS
Duration: 571 seconds
Data Read: 0 bytes Data Written: 57 bytes
And log: cat /var/log/glusterfs/data-write-test.log [2020-10-01 09:54:40.506437] I [glusterfsd-mgmt.c:77:mgmt_cbk_spec] 0-mgmt: Volume file changed [2020-10-01 09:54:40.537981] I [glusterfsd-mgmt.c:2170:mgmt_getspec_cbk] 0-glusterfs: Received list of available volfile servers: vona-internal:24007 table-internal:24007 [2020-10-01 09:54:40.537999] I [MSGID: 101221] [common-utils.c:3822:gf_set_volfile_server_common] 0-gluster: duplicate entry for volfile-server [{errno=17}, {error=File exists}] [2020-10-01 09:54:40.538202] I [MSGID: 0] [options.c:1240:xlator_option_reconf_int32] 0-write-test-client-0: option ping-timeout using set value 42 [2020-10-01 09:54:40.538219] I [MSGID: 0] [options.c:1245:xlator_option_reconf_bool] 0-write-test-client-0: option send-gids using set value true [2020-10-01 09:54:40.538226] I [MSGID: 0] [options.c:1245:xlator_option_reconf_bool] 0-write-test-client-0: option strict-locks using set value off [2020-10-01 09:54:40.538250] I [MSGID: 0] [options.c:1245:xlator_option_reconf_bool] 0-write-test-dht: option lock-migration using set value off [2020-10-01 09:54:40.538259] I [MSGID: 0] [options.c:1245:xlator_option_reconf_bool] 0-write-test-dht: option force-migration using set value off [2020-10-01 09:54:40.538289] I [MSGID: 0] [options.c:1245:xlator_option_reconf_bool] 0-write-test-utime: option noatime using set value on [2020-10-01 09:54:40.538304] I [MSGID: 0] [options.c:1242:xlator_option_reconf_size_uint64] 0-write-test-readdir-ahead: option rda-request-size using set value 131072 [2020-10-01 09:54:40.538312] I [MSGID: 0] [options.c:1242:xlator_option_reconf_size_uint64] 0-write-test-readdir-ahead: option rda-cache-limit using set value 10MB [2020-10-01 09:54:40.538318] I [MSGID: 0] [options.c:1245:xlator_option_reconf_bool] 0-write-test-readdir-ahead: option parallel-readdir using set value off [2020-10-01 09:54:40.538344] I [MSGID: 0] [options.c:1245:xlator_option_reconf_bool] 0-write-test-md-cache: option cache-posix-acl using set value true [2020-10-01 09:54:40.538377] I [MSGID: 0] [options.c:1245:xlator_option_reconf_bool] 0-write-test: option count-fop-hits using set value on [2020-10-01 09:54:40.538383] I [MSGID: 0] [options.c:1245:xlator_option_reconf_bool] 0-write-test: option latency-measurement using set value on [2020-10-01 09:54:40.538395] I [MSGID: 0] [options.c:1236:xlator_option_reconf_str] 0-write-test: option log-level using set value INFO [2020-10-01 09:54:40.538411] I [MSGID: 0] [options.c:1240:xlator_option_reconf_int32] 0-write-test: option threads using set value 16 [2020-10-01 09:54:40.538012] I [MSGID: 101221] [common-utils.c:3822:gf_set_volfile_server_common] 0-gluster: duplicate entry for volfile-server [{errno=17}, {error=File exists}] [2020-10-01 09:54:44.398935] C [mem-pool.c:873:mem_put] (-->/lib64/libglusterfs.so.0(fd_close+0x6a) [0x7efbff20ddda] -->/usr/lib64/glusterfs/8.1/xlator/performance/open-behind.so(+0x744a) [0x7efbf65a144a] -->/lib64/libglusterfs.so.0(mem_put+0x100) [0x7efbff211060] ) 0-mem-pool: invalid argument hdr->pool_list NULL [Invalid argument] [2020-10-01 09:55:24.399855] W [MSGID: 114031] [client-rpc-fops_v2.c:2688:client4_0_readv_cbk] 0-write-test-client-0: remote operation failed. [{errno=9}, {error=Bad file descriptor}] [2020-10-01 09:55:24.399893] W [fuse-bridge.c:2961:fuse_readv_cbk] 0-glusterfs-fuse: 1663: READ => -1 gfid=2d3f810a-8956-48a8-bcdc-d08b9a543118 fd=0x7efbe00172b8 (Bad file descriptor) [2020-10-01 09:56:14.440633] I [glusterfsd-mgmt.c:77:mgmt_cbk_spec] 0-mgmt: Volume file changed [2020-10-01 09:56:14.465776] I [glusterfsd-mgmt.c:2170:mgmt_getspec_cbk] 0-glusterfs: Received list of available volfile servers: vona-internal:24007 table-internal:24007 [2020-10-01 09:56:14.465792] I [MSGID: 101221] [common-utils.c:3822:gf_set_volfile_server_common] 0-gluster: duplicate entry for volfile-server [{errno=17}, {error=File exists}] [2020-10-01 09:56:14.465974] I [MSGID: 0] [options.c:1240:xlator_option_reconf_int32] 0-write-test-client-0: option ping-timeout using set value 42 [2020-10-01 09:56:14.465990] I [MSGID: 0] [options.c:1245:xlator_option_reconf_bool] 0-write-test-client-0: option send-gids using set value true [2020-10-01 09:56:14.465997] I [MSGID: 0] [options.c:1245:xlator_option_reconf_bool] 0-write-test-client-0: option strict-locks using set value off [2020-10-01 09:56:14.466025] I [MSGID: 0] [options.c:1245:xlator_option_reconf_bool] 0-write-test-dht: option lock-migration using set value off [2020-10-01 09:56:14.466035] I [MSGID: 0] [options.c:1245:xlator_option_reconf_bool] 0-write-test-dht: option force-migration using set value off [2020-10-01 09:56:14.466062] I [MSGID: 0] [options.c:1245:xlator_option_reconf_bool] 0-write-test-utime: option noatime using set value on [2020-10-01 09:56:14.466077] I [MSGID: 0] [options.c:1242:xlator_option_reconf_size_uint64] 0-write-test-readdir-ahead: option rda-request-size using set value 131072 [2020-10-01 09:56:14.466085] I [MSGID: 0] [options.c:1242:xlator_option_reconf_size_uint64] 0-write-test-readdir-ahead: option rda-cache-limit using set value 10MB [2020-10-01 09:56:14.466091] I [MSGID: 0] [options.c:1245:xlator_option_reconf_bool] 0-write-test-readdir-ahead: option parallel-readdir using set value off [2020-10-01 09:56:14.466116] I [MSGID: 0] [options.c:1245:xlator_option_reconf_bool] 0-write-test-md-cache: option cache-posix-acl using set value true [2020-10-01 09:56:14.466150] I [MSGID: 0] [options.c:1245:xlator_option_reconf_bool] 0-write-test: option count-fop-hits using set value off [2020-10-01 09:56:14.466157] I [MSGID: 0] [options.c:1245:xlator_option_reconf_bool] 0-write-test: option latency-measurement using set value off [2020-10-01 09:56:14.466171] I [MSGID: 0] [options.c:1236:xlator_option_reconf_str] 0-write-test: option log-level using set value INFO [2020-10-01 09:56:14.466193] I [MSGID: 0] [options.c:1240:xlator_option_reconf_int32] 0-write-test: option threads using set value 16
@jirireischig Looks like a bug in fuse-kernel? When I attach the process to gdb while testing it, fuse-kernel seems to be sending reads on a file opened with 'O_WRONLY'
If I change open flags to ab+, the script executes successfully.
Adding @csabahenk in case he knows more about this issue
Found a similar problem from 6 years ago! http://fuse.996288.n3.nabble.com/PATCH-fuse-fix-fuse-writeback-problem-with-O-WRONLY-td12699.html
I have also similar problem with uploaded file by ssh. When upload is fast everything is OK, but when upload is slow (big file) upload is failed with same error in gluster log.
I'm sorry, I can't simulate this easy. It's reason why I create test php code for this. I also have some problem in php application (Joomla - logs and backup) like this.
It's look like there is some problem when writing is "slow".
@jirireischig You are correct. When I am debugging in gdb, if I put a breakpoint, it fails at the first write itself. I think kernel fuse is sending read when there is a pause on the file? Do you see significant bump in performance with kernel-writeback-cache?
@jirireischig Could you follow up with kernel fuse mailing list as well? Let me include @amarts also in case he knows more about this.You can probably point to the link I pasted to tell them that the same issue happens if the application waits for some time before issuing the next write.
@jirireischig You are correct. When I am debugging in gdb, if I put a breakpoint, it fails at the first write itself. I think kernel fuse is sending read when there is a pause on the file? Do you see significant bump in performance with kernel-writeback-cache?
I'll be testing gluster without enabled kernel-writeback-cache for our usage for some days. But, it's not good behaviour to fail writing when kernel-writeback-cache is enabled. :-)
@jirireischig Could you follow up with kernel fuse mailing list as well? Let me include @amarts also in case he knows more about this.You can probably point to the link I pasted to tell them that the same issue happens if the application waits for some time before issuing the next write.
I'm sorry, I'm not in kernel fuse mailing list. I don't known if the problem is directly in kernel fuse or in Gluster implementation and how to test fuse with kernel-writeback-cache without Gluster.
@jirireischig You are correct. When I am debugging in gdb, if I put a breakpoint, it fails at the first write itself. I think kernel fuse is sending read when there is a pause on the file? Do you see significant bump in performance with kernel-writeback-cache?
I'll be testing gluster without enabled kernel-writeback-cache for our usage for some days. But, it's not good behaviour to fail writing when kernel-writeback-cache is enabled. :-)
I totally agree. The bug is in kernel fuse as per my debugging. I will let @csabahenk (Contributor to fuse) shine some light on this.
@jirireischig Could you follow up with kernel fuse mailing list as well? Let me include @amarts also in case he knows more about this.You can probably point to the link I pasted to tell them that the same issue happens if the application waits for some time before issuing the next write.
I'm sorry, I'm not in kernel fuse mailing list. I don't known if the problem is directly in kernel fuse or in Gluster implementation and how to test fuse with kernel-writeback-cache without Gluster.
Gluster is sending the read received from kernel-fuse to the server process as is and it is XFS (In my case the brick is formatted with XFS) which failed the read operation. XFS' behaviour is correct. Read shouldn't have been sent on an fd opened with WRONLY flag. I will close the issue as there won't be a fix in glusterfs as per my understanding. Please feel free to re-open the issue if my debugging is found to be wrong. Please note the discussions can continue. But the issue will be closed as there won't be a patch in glusterfs project that can fix the problem as per my understanding. Happy to be proven wrong
@jirireischig Could you follow up with kernel fuse mailing list as well? Let me include @amarts also in case he knows more about this.You can probably point to the link I pasted to tell them that the same issue happens if the application waits for some time before issuing the next write.
I'm sorry, I'm not in kernel fuse mailing list. I don't known if the problem is directly in kernel fuse or in Gluster implementation and how to test fuse with kernel-writeback-cache without Gluster.
Gluster is sending the read received from kernel-fuse to the server process as is and it is XFS (In my case the brick is formatted with XFS) which failed the read operation. XFS' behaviour is correct. Read shouldn't have been sent on an fd opened with WRONLY flag. I will close the issue as there won't be a fix in glusterfs as per my understanding. Please feel free to re-open the issue if my debugging is found to be wrong. Please note the discussions can continue. But the issue will be closed as there won't be a patch in glusterfs project that can fix the problem as per my understanding. Happy to be proven wrong
I'm sorry, but its not good close this issue because there is potential of data loss when kernel-writeback-cache is on.
I'm not developer and I don't known in which layer is error, but, may be, its possible that Gluster can "translate" WRONLY flag to any useful variant for under-layer file system.
@jirireischig I raised this issue in the glusterfs community meeting. @csabahenk who has knowledge about kernel fuse will respond to the issue. Translating WRONLY to something more will lead to other kinds of bugs. So it is best to avoid it.
I will keep this open until Csaba responds to this.
I plan to look at this in one week's time.
Thank you for your contributions. Noticed that this issue is not having any activity in last ~6 months! We are marking this issue as stale because it has not had recent activity. It will be closed in 2 weeks if no one responds with a comment here.
Please, open this bug, because it's not resolved and there is potential of data loss when kernel-writeback-cache is on.
If there's a risk, we shouldn't use this. I've read somewhere (#435) that it's not safe:
This mode
assumes that all changes to the filesystem go through the FUSE kernel module
(size and atime/ctime/mtime attributes are kept up-to-date by the kernel), so
it's generally not suitable for network filesystems.
I think, that main problem is how writeback-cache is working. Specialy https://github.com/gluster/glusterfs/issues/435#issuecomment-392009971
This means, that even for files opened for O_WRONLY it is possible that READ requests will be generated by the kernel.
May be, its possible that Gluster can "translate" O_WRONLY flag to any useful variant.
@csabahenk did you get a chance to look into this?
I'm not sure why FUSE needs to send a read request in case of partial writes (it could simply send the partial write). Receiving reads on a file opened for write only could be problematic. It would need a more detailed analysis.
But I think that combining the writeback-cache with proper invalidation from gluster to kernel (we have invalidation, but it's not strict enough to ensure consistency right now), we could keep consistency while still getting the maximum performance from the kernel cache.
This bug can be trivially replicated by the following sequence:
echo "Test" > testfile ; echo "Test 2" >> testfile
This is rather basic filesystem use and I suspect it breaks quite a lot of potential uses.
Thank you for your contributions. Noticed that this issue is not having any activity in last ~6 months! We are marking this issue as stale because it has not had recent activity. It will be closed in 2 weeks if no one responds with a comment here.
Please, don't close this bug, because it's not resolved and there is potential of data loss when kernel-writeback-cache is on.
Please, don't close this bug, because it's not resolved and there is potential of data loss when kernel-writeback-cache is on.
Do you have a patch to fix this? Otherwise, I think it's best either not to use it, ensure it's off by default and provide a warning when enabling it.
It seems issue is already fixed in libfuse (https://github.com/libfuse/libfuse/commit/b3109e71faf2713402f70d226617352815f6c72e)
It seems issue is already fixed in libfuse (libfuse/libfuse@b3109e7)
I mean we have to handle similar way in our codepath