
5.6: [fd_lru] fsal_start_fd_work :RW LOCK :CRIT :Error 22, acquiring mutex

Open aepotapov opened this issue 1 year ago • 5 comments

Hello, we recently enabled the fd reaper in the config, but we occasionally get the following error:

ganesha.nfsd-3236423[fd_lru] fsal_start_fd_work :RW LOCK :CRIT :Error 22, acquiring mutex 0x7f6d74004910 (&fsal_fd->work_mutex) at /usr/src/debug/nfs-ganesha/5.6-r0/src/FSAL/commonlib.c:2796

and the nfs-ganesha service exited with status=6/ABRT.

So it looks like the memory for fsal_fd was corrupted or freed.

Do you have any ideas on how to fix this?

Thanks, Alexey

aepotapov avatar Feb 17 '24 21:02 aepotapov

We have some patches fixing lifetime of fsal_fd:

https://review.gerrithub.io/c/ffilz/nfs-ganesha/+/1176997?usp=search

At least one is merged into V6-dev.5 or earlier (we will soon backport some patches and tag V5.8).

ffilz avatar Feb 18 '24 06:02 ffilz

Thank you, I'll try these fixes.

aepotapov avatar Feb 18 '24 10:02 aepotapov

Unfortunately, after applying the https://review.gerrithub.io/c/ffilz/nfs-ganesha/+/1176997?usp=search fix we still get the error, but much more rarely.

aepotapov avatar Feb 19 '24 22:02 aepotapov

@aepotapov - Please check whether your Ganesha fork has picked up the patches below. They should resolve the lock-ordering scenario above. Please confirm once validated.

https://review.gerrithub.io/c/ffilz/nfs-ganesha/+/1170151
https://review.gerrithub.io/c/ffilz/nfs-ganesha/+/1174224

Also, there have been many fixes in the fd_lru code path, and some in the FSALs as well. Please make sure the community-approved patches are incorporated soon.

If it recurs, could you please share more logs, with backtraces added next to the mutex pointer dumped in PTHREAD_MUTEX_(init/lock/unlock/destroy)? It seems someone has destroyed the mutex, or there is some ordering problem here.
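
One way to collect the traces requested above is to wrap each mutex call so that it logs the mutex address together with a backtrace. The following is only a minimal sketch, assuming glibc (`backtrace` and `backtrace_symbols_fd` from `<execinfo.h>`). `trace_mutex_op` and the `TRACED_MUTEX_*` macros are hypothetical names for illustration and are not part of nfs-ganesha; the real PTHREAD_MUTEX_* macros would need an equivalent change.

```c
#include <execinfo.h>
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper (not nfs-ganesha code): log one mutex operation
 * with the mutex address, the return code, and the call site, then dump
 * the current backtrace to stderr. Returns rc unchanged so the caller
 * can still check it. */
static int trace_mutex_op(const char *op, pthread_mutex_t *m, int rc,
                          const char *file, int line)
{
    void *frames[32];
    int n;

    fprintf(stderr, "mutex %s %p rc=%d at %s:%d\n",
            op, (void *)m, rc, file, line);
    fflush(stderr);

    /* backtrace_symbols_fd writes directly to the fd without malloc. */
    n = backtrace(frames, 32);
    backtrace_symbols_fd(frames, n, STDERR_FILENO);
    return rc;
}

/* Hypothetical wrappers: each performs the pthread call, then traces it. */
#define TRACED_MUTEX_INIT(m) \
    trace_mutex_op("init", (m), pthread_mutex_init((m), NULL), \
                   __FILE__, __LINE__)
#define TRACED_MUTEX_LOCK(m) \
    trace_mutex_op("lock", (m), pthread_mutex_lock(m), __FILE__, __LINE__)
#define TRACED_MUTEX_UNLOCK(m) \
    trace_mutex_op("unlock", (m), pthread_mutex_unlock(m), __FILE__, __LINE__)
#define TRACED_MUTEX_DESTROY(m) \
    trace_mutex_op("destroy", (m), pthread_mutex_destroy(m), \
                   __FILE__, __LINE__)
```

Searching the resulting log for the failing pointer (0x7f6d74004910 in the report above) should then show whether a destroy was logged before the failing lock. Building with `-rdynamic` generally gives more readable symbol names in the backtrace output.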

cc: @ffilz

rojingeorge avatar Mar 19 '24 14:03 rojingeorge

Thank you! I will check those patches. Unfortunately, the original problem no longer reproduces, even with v5.6-based code.

aepotapov avatar Mar 25 '24 10:03 aepotapov

Hello, I was able to reproduce this problem again. I have attached a log captured with FSAL = FULL_DEBUG; NFS3 = FULL_DEBUG;. Please let me know if anything else is needed. ganesha_copy.log

aepotapov avatar Jun 24 '24 15:06 aepotapov

Could you try V5.9?

I don't see anything in that log that helps.

ffilz avatar Jun 25 '24 18:06 ffilz

It looks like it is fixed in V5.9. Thanks!

aepotapov avatar Jul 02 '24 21:07 aepotapov