Volume becomes unresponsive when data is being written onto it via FUSE, same shard located on 2 instances of same subvol according to vol log
Description of problem: I applied https://github.com/gluster/glusterfs/pull/2304 and https://github.com/gluster/glusterfs/pull/3720 on top of Gluster 10.2-1 on Debian 11, created a distributed volume with sharding enabled, and ran a script that creates files of a fixed size (100 MB) and then checks their crc32 and size (https://github.com/gluster/glusterfs/issues/2246#issuecomment-796708460). After a short time of testing, the script stops making progress: no errors, it just hangs. Looking into the logs, I found messages like the following in the volume log:
[2022-09-11 22:38:28.436095 +0000] W [MSGID: 109007] [dht-common.c:2759:dht_lookup_everywhere_cbk] 2-bbb-dht: multiple subvolumes (bbb-client-2 and bbb-client-2) have file /.shard/9fdbc0ff-d12e-4e22-929c-2a41c78bee1e.1 (preferably rename the file in the backend, and do a fresh lookup)
This message repeats hundreds of thousands of times.
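To gauge how many shards are affected, the distinct shard paths can be pulled out of the mount log. A hedged sketch (the log path is an assumption; substitute the actual FUSE mount log — the sketch falls back to the single warning quoted above so it is self-contained):

```shell
#!/bin/sh
# Extract the distinct shard paths named in the dht "multiple subvolumes"
# warnings. LOG is an assumption -- point it at the real FUSE mount log.
LOG=${LOG:-/var/log/glusterfs/mnt-bbb.log}

# Fall back to one sample line (the warning quoted above) so the sketch
# runs even without a real log present.
if [ ! -r "$LOG" ]; then
  LOG=$(mktemp)
  cat > "$LOG" <<'EOF'
[2022-09-11 22:38:28.436095 +0000] W [MSGID: 109007] [dht-common.c:2759:dht_lookup_everywhere_cbk] 2-bbb-dht: multiple subvolumes (bbb-client-2 and bbb-client-2) have file /.shard/9fdbc0ff-d12e-4e22-929c-2a41c78bee1e.1 (preferably rename the file in the backend, and do a fresh lookup)
EOF
fi

# List each affected shard once, then count them.
shards=$(grep 'multiple subvolumes' "$LOG" | grep -o '/\.shard/[^ ]*' | sort -u)
echo "$shards"
echo "distinct shards: $(echo "$shards" | grep -c .)"
```

In my case the count kept growing as the script ran, which is why the log reaches hundreds of thousands of lines.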
The exact command to reproduce the issue:
The full output of the command that failed:
Expected results: continued I/O to the volume without stalls
Mandatory info:
- The output of the gluster volume info
command:
Volume Name: bbb
Type: Distribute
Volume ID: a0ae42f3-1c7f-48e4-8dc9-3dd6318c1410
Status: Started
Snapshot Count: 0
Number of Bricks: 3
Transport-type: tcp
Bricks:
Brick1: 6ae01cdf-63ed-4360-9b04-ab25a7d7e07c:/storages/zfs/zpool
Brick2: 38869f5b-3cd8-4f38-91f3-c81e3d74b8d9:/storages/zfs/zpool
Brick3: 8e916a59-9435-4dc4-8377-9aee6bc6da00:/storages/zfs/zpool
Options Reconfigured:
nfs.disable: on
transport.address-family: inet
storage.fips-mode-rchecksum: on
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.low-prio-threads: 32
network.remote-dio: disable
features.shard: on
user.cifs: off
client.event-threads: 4
server.event-threads: 4
performance.client-io-threads: on
cluster.lookup-optimize: off
performance.strict-o-direct: on
cluster.eager-lock: enable
cluster.quorum-type: none
cluster.server-quorum-type: none
cluster.data-self-heal-algorithm: full
cluster.locking-scheme: granular
cluster.choose-local: off
network.ping-timeout: 20
server.tcp-user-timeout: 20
server.keepalive-time: 10
server.keepalive-interval: 2
server.keepalive-count: 5
storage.owner-gid: 931
storage.owner-uid: 931
- The output of the gluster volume status
command:
Status of volume: bbb
Gluster process TCP Port RDMA Port Online Pid
------------------------------------------------------------------------------
Brick 6ae01cdf-63ed-4360-9b04-ab25a7d7e07c:/storages/zfs/zpool  50669  0  Y  2438
Brick 38869f5b-3cd8-4f38-91f3-c81e3d74b8d9:/storages/zfs/zpool  51532  0  Y  1442054
Brick 8e916a59-9435-4dc4-8377-9aee6bc6da00:/storages/zfs/zpool  55524  0  Y  1407824
Task Status of Volume bbb
------------------------------------------------------------------------------
There are no active volume tasks
- The output of the gluster volume heal
command:
Not applicable: this is a pure distribute volume with no redundancy, so there is nothing to heal.
- Provide logs present on following locations of client and server nodes: see the attachment.
- Is there any crash? Provide the backtrace and coredump: no crash.
Additional info: With sharding disabled, no such issue appears. The problem also seems to exist without https://github.com/gluster/glusterfs/pull/2304 applied.
PS: this behaviour reproduces across different hardware configurations (I tried a cluster on VMs and on physical servers; the result is the same).
P.P.S.: the problem appears only when I/O is performed via FUSE. With GFAPI access everything seems to be fine.
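For reference, the write-and-verify loop described above can be sketched roughly as follows. The mount point, file count, and the use of `cksum` in place of crc32 are assumptions of this sketch (the linked script in issue #2246 is the authoritative version, and the real test used 100 MB files; 1 MB is used here only to keep the sketch quick):

```shell
#!/bin/sh
# Rough sketch of the failing workload: write fixed-size files over the
# mount, then verify size and checksum. MNT is an assumption -- point it
# at the FUSE mount of the volume (a temp dir is used as a placeholder).
MNT=${MNT:-$(mktemp -d)}
SIZE_MB=1   # the real test used 100 MB files
FILES=3

i=0
while [ "$i" -lt "$FILES" ]; do
  f="$MNT/testfile.$i"
  dd if=/dev/urandom of="$f" bs=1M count="$SIZE_MB" conv=fsync 2>/dev/null

  # verify the file landed with the expected size
  expected=$((SIZE_MB * 1024 * 1024))
  actual=$(stat -c %s "$f")
  [ "$actual" -eq "$expected" ] || { echo "size mismatch on $f"; exit 1; }

  # checksum twice; a divergence would indicate unstable reads
  # (cksum stands in here for the crc32 used by the linked script)
  sum1=$(cksum "$f" | awk '{print $1}')
  sum2=$(cksum "$f" | awk '{print $1}')
  [ "$sum1" = "$sum2" ] || { echo "checksum mismatch on $f"; exit 1; }

  i=$((i + 1))
done
echo "all files verified"
```

On the affected volume the loop simply stops making progress after a short time, with no error reported back to the script.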
- The operating system / glusterfs version:
Debian 11, Gluster 10.2-1 with patches applied: https://github.com/gluster/glusterfs/pull/2304 https://github.com/gluster/glusterfs/pull/3720/commits/64da12cdb7d82ba3dc69ee53f71c2e166603dbab