PnfsManager crashes with "java.util.ConcurrentModificationException: null"
Dear dCache devs,
Since we upgraded from 8.2 to 9.2, our PnfsManager is unstable. Every 1 to 3 weeks, it crashes: there is a large queue build-up of directory listings (up to 20k), and after a while even normal transfers start failing. When we restart the domain with the PnfsManager, things return to normal.
I've shared the log of the domain containing the PnfsManager here (valid for 5 days):
https://dcacheview.grid.surfsara.nl:2881/pnfs/grid.sara.nl/data/users/onno/disk-shared/namespaceDomain.log?authz=MDA1YWxvY2F0aW9uIE9wdGlvbmFsWy9wbmZzL2dyaWQuc2FyYS5ubC9kYXRhL3VzZXJzL29ubm8vZGlzay1zaGFyZWQvbmFtZXNwYWNlRG9tYWluLmxvZ10KMDAxOGlkZW50aWZpZXIgM2g3WGpmNy8KMDAxNWNpZCBpaWQ6Q1NjY1laTEMKMDAyMmNpZCBpZDozMTAyOTs1ODc1MiwzMTA0MDtvbm5vCjAwMmJjaWQgYmVmb3JlOjIwMjQtMDMtMTZUMTA6MzA6NTcuODkzNTIxWgowMDMyY2lkIGhvbWU6L3BuZnMvZ3JpZC5zYXJhLm5sL2RhdGEva25taS10cm9wb21pCjAwNTBjaWQgcGF0aDovcG5mcy9ncmlkLnNhcmEubmwvZGF0YS91c2Vycy9vbm5vL2Rpc2stc2hhcmVkL25hbWVzcGFjZURvbWFpbi5sb2cKMDAyMGNpZCBhY3Rpdml0eTogRE9XTkxPQUQsTElTVAowMDJmc2lnbmF0dXJlILA-331Jr-iIWaw76B6tQwSPUY9-ySFXLilB19CnEoo9Cg
The first problem seems to be a repeated java.util.ConcurrentModificationException: null. After that, there are loads of other error messages, perhaps as a result of it.
Here is the configuration of the domain running the PnfsManager:
[namespaceDomain]
dcache.java.memory.direct=50g
dcache.java.memory.heap=50g
[namespaceDomain/pnfsmanager]
# pnfsmanager.limits.threads default is 12
pnfsmanager.limits.threads=24
#
# list-threads default 2
pnfsmanager.limits.list-threads = 30
#
# Fix dir listing queueing: https://github.com/dCache/dcache/issues/7252
pnfsmanager.enable.parallel-listing = true
#
# Chunk size default is 100, but that's way too small
# for dirs with 3 million files
pnfsmanager.limits.list-chunk-size = 10000
[namespaceDomain/nfs]
nfs.version = 4.1
#nfs.service.loginbroker=unpublished-LoginBroker
nfs.domain=dcachedomain
This is affecting the stability of our service. Any suggestions?
Kind regards, Onno
Hi Onno,
What Java version is used?
What is happening on the DB end? Is the DB busy? If you, say, enable query time logging, do you see increased execution times?
log_min_duration_statement = 100
log_line_prefix = '< %m %r %p > '
Also, make sure your DB is configured like:
max_connections = 512
shared_buffers = 8192MB
temp_buffers = 1024MB
work_mem = 512MB
maintenance_work_mem = 2048MB
#maintenance_work_mem = 4096MB
max_stack_depth = 2MB
vacuum_cost_delay = 50
autovacuum_vacuum_cost_delay = 5ms
synchronous_commit = off
wal_buffers = 245MB
wal_writer_delay = 10s
checkpoint_completion_target = 0.9
checkpoint_timeout=1800
random_page_cost = 2.0
effective_cache_size = 256GB
enable_seqscan = off
max_wal_size=4GB
min_wal_size=2GB
bgwriter_delay=10ms
(max connections can obviously be different).
Anyhow, at the moment the suspicion is that the DB gets overloaded, causing queueing on the dCache end. The following query shows the current queries and the locks they hold:
select pg_stat_activity.datname,pg_class.relname,pg_locks.transactionid,
pg_locks.mode, pg_locks.granted,pg_stat_activity.usename,
substr(pg_stat_activity.query,1,50),
pg_stat_activity.query_start, age(now(),pg_stat_activity.query_start) as "age",
pg_stat_activity.pid from pg_stat_activity,pg_locks
left outer join pg_class on (pg_locks.relation = pg_class.oid)
where pg_locks.pid=pg_stat_activity.pid order by query_start;
I suggest opening an RT ticket to post the output of the above (just in case, for privacy).
Why am I jumping to the DB? A US T2 at the University of Chicago had a similar issue, and I want to test the avenue we explored there.
Hi Dmitry, thanks.
[root@db4 ~]# java -version
openjdk version "11.0.21" 2023-10-17 LTS
OpenJDK Runtime Environment (Red_Hat-11.0.21.0.9-1) (build 11.0.21+9-LTS)
OpenJDK 64-Bit Server VM (Red_Hat-11.0.21.0.9-1) (build 11.0.21+9-LTS, mixed mode, sharing)
The first error in PnfsManager happened at 00:12:45. I found this long query:
2024-03-09 00:12:46.014 CET [140295] LOG: duration: 15134.349 ms execute <unnamed>: DELETE FROM srmspacefile WHERE state IN (2,3) AND spacereservationid IN (SELECT id FROM srmspace WHERE state IN (2,1) AND expirationtime < $1)
2024-03-09 00:12:46.014 CET [140295] DETAIL: parameters: $1 = '1707347550879'
15 seconds for a delete, that does seem rather long.
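To check whether that DELETE resorts to sequential scans, the statement from the log could be EXPLAINed with the logged parameter substituted (EXPLAIN without ANALYZE only plans the statement, it does not delete anything):
EXPLAIN
DELETE FROM srmspacefile
WHERE state IN (2,3)
  AND spacereservationid IN (SELECT id FROM srmspace
                             WHERE state IN (2,1) AND expirationtime < 1707347550879);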
Here are the Postgres settings:
[root@db4 data]# grep ^[a-z] postgresql.conf
listen_addresses = '*'
max_connections = 1000 # (change requires restart)
shared_buffers = 64GB
temp_buffers = 128MB
work_mem = 1024MB
maintenance_work_mem = 16GB
dynamic_shared_memory_type = posix # the default is usually the first option
effective_io_concurrency = 100
max_worker_processes = 32
max_parallel_maintenance_workers = 16
max_parallel_workers = 64
wal_level = replica
max_wal_size = 16GB
min_wal_size = 1GB
archive_mode = on
archive_command = 'rsync -av %p db3.mgmt.grid.surfsara.nl:wal_archive/%f'
max_standby_archive_delay = -1 # For backup - Onno
effective_cache_size = 400GB
log_destination = 'stderr' # Valid values are combinations of
logging_collector = on # Enable capturing of stderr, jsonlog,
log_directory = 'log' # directory where log files are written,
log_filename = 'postgresql-%a.log' # log file name pattern,
log_rotation_age = 1d # Automatic rotation of logfiles will
log_rotation_size = 0 # Automatic rotation of logfiles will
log_truncate_on_rotation = on # If on, an existing log file with the
log_min_duration_statement = 10000
log_line_prefix = '%m [%p] ' # special values:
log_timezone = 'Europe/Amsterdam'
bytea_output = 'escape' # hex breaks dmfcp.sh - Onno
datestyle = 'iso, mdy'
timezone = 'Europe/Amsterdam'
lc_messages = 'en_US.UTF-8' # locale for system error message
lc_monetary = 'en_US.UTF-8' # locale for monetary formatting
lc_numeric = 'en_US.UTF-8' # locale for number formatting
lc_time = 'en_US.UTF-8' # locale for time formatting
default_text_search_config = 'pg_catalog.english'
The memory settings seem sufficient to me. I'm missing some of the settings that you have (e.g. for vacuuming), but I'm not sure how important they are.
Your query returns this (cut off after the first "page"):
 datname  | relname                           | transactionid | mode            | granted | usename  | substr                                             | query_start                   | age             | pid
----------+-----------------------------------+---------------+-----------------+---------+----------+----------------------------------------------------+-------------------------------+-----------------+---------
 qos      |                                   |               | AccessShareLock | t       | dcache   | SELECT * FROM qos_operation WHERE true ORDER BY ar | 2024-03-11 17:16:53.391955+01 | 00:00:49.608626 | 2543026
 qos      |                                   |               | ExclusiveLock   | t       | dcache   | SELECT * FROM qos_operation WHERE true ORDER BY ar | 2024-03-11 17:16:53.391955+01 | 00:00:49.608626 | 2543027
 qos      |                                   |               | ExclusiveLock   | t       | dcache   | SELECT * FROM qos_operation WHERE true ORDER BY ar | 2024-03-11 17:16:53.391955+01 | 00:00:49.608626 | 2538538
 qos      |                                   |               | AccessShareLock | t       | dcache   | SELECT * FROM qos_operation WHERE true ORDER BY ar | 2024-03-11 17:16:53.391955+01 | 00:00:49.608626 | 2543027
 qos      |                                   |               | AccessShareLock | t       | dcache   | SELECT * FROM qos_operation WHERE true ORDER BY ar | 2024-03-11 17:16:53.391955+01 | 00:00:49.608626 | 2538538
 qos      |                                   |               | AccessShareLock | t       | dcache   | SELECT * FROM qos_operation WHERE true ORDER BY ar | 2024-03-11 17:16:53.391955+01 | 00:00:49.608626 | 2538538
 qos      |                                   |               | AccessShareLock | t       | dcache   | SELECT * FROM qos_operation WHERE true ORDER BY ar | 2024-03-11 17:16:53.391955+01 | 00:00:49.608626 | 2538538
 qos      |                                   |               | ExclusiveLock   | t       | dcache   | SELECT * FROM qos_operation WHERE true ORDER BY ar | 2024-03-11 17:16:53.391955+01 | 00:00:49.608626 | 2543026
 postgres | pg_stat_activity                  |               | AccessShareLock | t       | postgres | select pg_stat_activity.datname,pg_class.relname,p | 2024-03-11 17:17:43.000581+01 | 00:00:00        | 2550000
 postgres | pg_class_tblspc_relfilenode_index |               | AccessShareLock | t       | postgres | select pg_stat_activity.datname,pg_class.relname,p | 2024-03-11 17:17:43.000581+01 | 00:00:00        | 2550000
 postgres | pg_class_relname_nsp_index        |               | AccessShareLock | t       | postgres | select pg_stat_activity.datname,pg_class.relname,p | 2024-03-11 17:17:43.000581+01 | 00:00:00        | 2550000
Kind regards, Onno
Are we perhaps missing an index for qos_operation?
List of relations
Schema | Name | Type | Owner
--------+------------------------+----------+--------
public | databasechangelog | table | dcache
public | databasechangeloglock | table | dcache
public | qos_file_status | table | dcache
public | qos_file_status_id_seq | sequence | dcache
public | qos_operation | table | dcache
(5 rows)
Table "public.qos_operation"
Column | Type | Collation | Nullable | Default
--------------+------------------------+-----------+----------+---------
pnfsid | character varying(64) | | not null |
arrived | bigint | | not null |
msg_type | character varying(64) | | not null |
pool_group | character varying(128) | | |
storage_unit | character varying(128) | | |
parent | character varying(128) | | |
source | character varying(128) | | |
target | character varying(128) | | |
Indexes:
"idx_msg_type_qos" btree (msg_type)
"idx_pnfsid_qos" UNIQUE, btree (pnfsid)
I am not sure about the qos index. Are you using the QoS system? I would stop it, comment it out, and see if the situation is any different. The query I provided needs to be run during a "listing crunch".
To first order, I do not think that the delete from srmspace has any bearing on PnfsManager listing.
Onno
In your DB setup, make sure you use:
enable_seqscan = off
(It is listed in the settings I provided but missing from yours.)
This could be crucial.
Dmitry
It happened again now. So I ran the query:
[root@db4 ~]# psql -U postgres -c 'select pg_stat_activity.datname,pg_class.relname,pg_locks.transactionid, pg_locks.mode, pg_locks.granted,pg_stat_activity.usename, substr(pg_stat_activity.query,1,50), pg_stat_activity.query_start, age(now(),pg_stat_activity.query_start) as "age", pg_stat_activity.pid from pg_stat_activity,pg_locks left outer join pg_class on (pg_locks.relation = pg_class.oid) where pg_locks.pid=pg_stat_activity.pid order by query_start;'
 datname  | relname                           | transactionid | mode            | granted | usename  | substr                                             | query_start                   | age             | pid
----------+-----------------------------------+---------------+-----------------+---------+----------+----------------------------------------------------+-------------------------------+-----------------+---------
 qos      |                                   |               | AccessShareLock | t       | dcache   | SELECT * FROM qos_operation WHERE true ORDER BY ar | 2024-03-19 09:18:21.65278+01  | 00:01:34.758501 | 2650577
 qos      |                                   |               | ExclusiveLock   | t       | dcache   | SELECT * FROM qos_operation WHERE true ORDER BY ar | 2024-03-19 09:18:21.65278+01  | 00:01:34.758501 | 2650577
 qos      |                                   |               | AccessShareLock | t       | dcache   | SELECT * FROM qos_operation WHERE true ORDER BY ar | 2024-03-19 09:18:21.65278+01  | 00:01:34.758501 | 2642976
 qos      |                                   |               | ExclusiveLock   | t       | dcache   | SELECT * FROM qos_operation WHERE true ORDER BY ar | 2024-03-19 09:18:21.65278+01  | 00:01:34.758501 | 2642976
 qos      |                                   |               | AccessShareLock | t       | dcache   | SELECT * FROM qos_operation WHERE true ORDER BY ar | 2024-03-19 09:18:21.65278+01  | 00:01:34.758501 | 2642976
 qos      |                                   |               | AccessShareLock | t       | dcache   | SELECT * FROM qos_operation WHERE true ORDER BY ar | 2024-03-19 09:18:21.65278+01  | 00:01:34.758501 | 2650578
 qos      |                                   |               | ExclusiveLock   | t       | dcache   | SELECT * FROM qos_operation WHERE true ORDER BY ar | 2024-03-19 09:18:21.65278+01  | 00:01:34.758501 | 2650578
 qos      |                                   |               | AccessShareLock | t       | dcache   | SELECT * FROM qos_operation WHERE true ORDER BY ar | 2024-03-19 09:18:21.65278+01  | 00:01:34.758501 | 2642976
 postgres | pg_authid_rolname_index           |               | AccessShareLock | t       | postgres | select pg_stat_activity.datname,pg_class.relname,p | 2024-03-19 09:19:56.411281+01 | 00:00:00        | 2670744
 postgres | pg_authid                         |               | AccessShareLock | t       | postgres | select pg_stat_activity.datname,pg_class.relname,p | 2024-03-19 09:19:56.411281+01 | 00:00:00        | 2670744
 postgres | pg_authid_oid_index               |               | AccessShareLock | t       | postgres | select pg_stat_activity.datname,pg_class.relname,p | 2024-03-19 09:19:56.411281+01 | 00:00:00        | 2670744
 postgres | pg_database_oid_index             |               | AccessShareLock | t       | postgres | select pg_stat_activity.datname,pg_class.relname,p | 2024-03-19 09:19:56.411281+01 | 00:00:00        | 2670744
 postgres | pg_database                       |               | AccessShareLock | t       | postgres | select pg_stat_activity.datname,pg_class.relname,p | 2024-03-19 09:19:56.411281+01 | 00:00:00        | 2670744
 postgres | pg_database_datname_index         |               | AccessShareLock | t       | postgres | select pg_stat_activity.datname,pg_class.relname,p | 2024-03-19 09:19:56.411281+01 | 00:00:00        | 2670744
 postgres | pg_class_tblspc_relfilenode_index |               | AccessShareLock | t       | postgres | select pg_stat_activity.datname,pg_class.relname,p | 2024-03-19 09:19:56.411281+01 | 00:00:00        | 2670744
 postgres | pg_class_relname_nsp_index        |               | AccessShareLock | t       | postgres | select pg_stat_activity.datname,pg_class.relname,p | 2024-03-19 09:19:56.411281+01 | 00:00:00        | 2670744
 postgres | pg_class_oid_index                |               | AccessShareLock | t       | postgres | select pg_stat_activity.datname,pg_class.relname,p | 2024-03-19 09:19:56.411281+01 | 00:00:00        | 2670744
 postgres | pg_class                          |               | AccessShareLock | t       | postgres | select pg_stat_activity.datname,pg_class.relname,p | 2024-03-19 09:19:56.411281+01 | 00:00:00        | 2670744
 postgres | pg_locks                          |               | AccessShareLock | t       | postgres | select pg_stat_activity.datname,pg_class.relname,p | 2024-03-19 09:19:56.411281+01 | 00:00:00        | 2670744
 postgres | pg_stat_activity                  |               | AccessShareLock | t       | postgres | select pg_stat_activity.datname,pg_class.relname,p | 2024-03-19 09:19:56.411281+01 | 00:00:00        | 2670744
 postgres |                                   |               | ExclusiveLock   | t       | postgres | select pg_stat_activity.datname,pg_class.relname,p | 2024-03-19 09:19:56.411281+01 | 00:00:00        | 2670744
(21 rows)
So the oldest activity here is less than 2 minutes old, but the listings started accumulating before midnight (>9 hours ago). So it looks to me like the database activity is not the problem here; would you agree?
Did you apply
enable_seqscan = off
in the DB, just to rule that out?
We didn't have that yet. Now we have.
I've sent you some debugging info like thread dumps via email.
That's interesting: since we applied enable_seqscan = off, the CPU usage on our database node went down. Looks like it does something.
It avoids full table scans for queries.
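For example, you can see the effect by comparing plans for the qos_operation query that shows up in pg_stat_activity above (illustrative; I am assuming the truncated ORDER BY column is "arrived", per the table description above):
SET enable_seqscan = on;
EXPLAIN SELECT * FROM qos_operation ORDER BY arrived;
SET enable_seqscan = off;
EXPLAIN SELECT * FROM qos_operation ORDER BY arrived;
With enable_seqscan = off the planner puts a huge cost penalty on sequential scans, so it picks an index scan whenever a usable index exists; if the plan does not change, the table simply has no usable index for that query.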
Last night after 21:10 we had a queue build-up again, this time slower, but with lots of transfers failing. In the PnfsManager log I see many database connection issues:
20 Mar 2024 21:12:42 (PnfsManager) [door:GFTP-seacow13-AAYUHTTGIZg@ftp-seacow13Domain GFTP-seacow13-AAYUHTTGIZg PnfsGetFileAttributes] Could not get connection to database
java.sql.SQLTransientConnectionException: HikariPool-1 - Connection is not available, request timed out after 30000ms.
at com.zaxxer.hikari.pool.HikariPool.createTimeoutException(HikariPool.java:696)
at com.zaxxer.hikari.pool.HikariPool.getConnection(HikariPool.java:181)
at com.zaxxer.hikari.pool.HikariPool.getConnection(HikariPool.java:146)
at com.zaxxer.hikari.HikariDataSource.getConnection(HikariDataSource.java:100)
at org.dcache.db.AlarmEnabledDataSource.getConnection(AlarmEnabledDataSource.java:146)
at org.springframework.jdbc.datasource.DataSourceTransactionManager.doBegin(DataSourceTransactionManager.java:265)
at org.springframework.transaction.support.AbstractPlatformTransactionManager.startTransaction(AbstractPlatformTransactionManager.java:400)
at org.springframework.transaction.support.AbstractPlatformTransactionManager.getTransaction(AbstractPlatformTransactionManager.java:373)
at org.springframework.transaction.interceptor.TransactionAspectSupport.createTransactionIfNecessary(TransactionAspectSupport.java:595)
at org.springframework.transaction.interceptor.TransactionAspectSupport.invokeWithinTransaction(TransactionAspectSupport.java:382)
at org.springframework.transaction.aspectj.AbstractTransactionAspect.ajc$around$org_springframework_transaction_aspectj_AbstractTransactionAspect$1$2a73e96c(AbstractTransactionAspect.aj:71)
at diskCacheV111.namespace.PnfsManagerV3.processMessageTransactionally(PnfsManagerV3.java:2929)
at diskCacheV111.namespace.PnfsManagerV3.processPnfsMessage(PnfsManagerV3.java:2900)
at diskCacheV111.namespace.PnfsManagerV3$ProcessThread.run(PnfsManagerV3.java:2499)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
But there's nothing wrong with our database; it's just running like normal. After a restart of the PnfsManager, things went back to normal. I didn't need to start or restart Postgres.
I'm very worried now because this happened so soon after the last event.
I collected again a lot of info like thread dumps etc. Let me know if you need them.
Hi Onno,
So we see an inability to get a connection from the connection pool because all connections are busy, presumably waiting for query results.
I do not know what you mean by PostgreSQL being normal. I would expect N long-running queries holding the connections. The DB may still look normal.
If you can make the gzipped PostgreSQL log available somewhere, plus the thread and heap dumps, that would be useful.
The PostgreSQL log needs to cover that event at 21:12.
TBH I am not aware of any particular changes in PnfsManager between 8.2 and 9.2. What changed significantly between 8.2 and 9.2 is how directory tags are treated. And that change seems to have exposed the need to have sequential scans set to off.
When we look at the DB during such an event (dir listings going up and up), we don't see any long-running queries in the DB. So I don't think it is on the DB side.
Could it be that we are hitting this bug? https://github.com/dcache/dcache/commit/da22a29c7335be58ec942022d8f7c316c614b096 We are on 9.2.10 and this was fixed in 9.2.11. We have some users who do a directory listing on every file upload, and the uploads happen in parallel, so this can result in many listings in a short time. We are working to fix this. When we combine this with the 10x retry on the WebDAV door, we get 10 directory listings per 5 seconds for every file upload (500 ms sleep between tries), i.e. roughly 2 listings per second per in-flight upload on top of the normal load. Could that maybe trigger the behavior described above? Curious to hear what you think.
It may! But I was under the impression you already had the patched version. If possible, can you deploy the latest 9.2? This patch is needed on the WebDAV door host(s).
Hi Dmitry, we had patched only a single WebDAV door that we thought was causing problems. The rest is still at 9.2.10. But we will update if you think it might help.
I've sent you an email with links to thread dumps, postgres logs and other info.
@trancilo is my new colleague, as you may have guessed.
Hi Dmitry, we are indeed going to upgrade all our WebDAV doors to 9.2.11. We can reproduce the dir-listing slowness on 9.2.10 and see that it is resolved on 9.2.11.
We have upgraded our WebDAV doors to 9.2.11 today.
If this helps, that would be the best proof that you should never do a "quick and easy" retry loop to fight race conditions. Hopefully we will properly fix the occasional errors when querying checksums on upload (by FTS).
The PnfsManager crashed again. Here's the first error around the time it crashed:
04 Apr 2024 09:37:24 (PnfsManager) [PnfsManager PnfsListDirectory] Could not get connection to database
java.sql.SQLTransientConnectionException: HikariPool-1 - Connection is not available, request timed out after 30001ms.
at com.zaxxer.hikari.pool.HikariPool.createTimeoutException(HikariPool.java:696)
at com.zaxxer.hikari.pool.HikariPool.getConnection(HikariPool.java:181)
at com.zaxxer.hikari.pool.HikariPool.getConnection(HikariPool.java:146)
at com.zaxxer.hikari.HikariDataSource.getConnection(HikariDataSource.java:100)
at org.dcache.db.AlarmEnabledDataSource.getConnection(AlarmEnabledDataSource.java:146)
at org.springframework.jdbc.datasource.DataSourceUtils.fetchConnection(DataSourceUtils.java:159)
at org.springframework.jdbc.datasource.DataSourceUtils.doGetConnection(DataSourceUtils.java:117)
at org.springframework.jdbc.datasource.DataSourceUtils.getConnection(DataSourceUtils.java:80)
at org.springframework.jdbc.core.JdbcTemplate.execute(JdbcTemplate.java:646)
at org.springframework.jdbc.core.JdbcTemplate.query(JdbcTemplate.java:713)
at org.springframework.jdbc.core.JdbcTemplate.query(JdbcTemplate.java:744)
at org.springframework.jdbc.core.JdbcTemplate.query(JdbcTemplate.java:773)
at org.dcache.chimera.FsSqlDriver.getAllTags(FsSqlDriver.java:1071)
at org.dcache.chimera.JdbcFs.getAllTags(JdbcFs.java:1244)
at jdk.internal.reflect.GeneratedMethodAccessor290.invoke(Unknown Source)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:566)
at org.dcache.commons.stats.MonitoringProxy.invoke(MonitoringProxy.java:52)
at com.sun.proxy.$Proxy29.getAllTags(Unknown Source)
at org.dcache.chimera.namespace.ExtendedInode.getTags(ExtendedInode.java:139)
at org.dcache.chimera.namespace.ExtendedInode.getTag(ExtendedInode.java:146)
at org.dcache.chimera.namespace.ChimeraOsmStorageInfoExtractor.getDirStorageInfo(ChimeraOsmStorageInfoExtractor.java:78)
at org.dcache.chimera.namespace.ChimeraOsmStorageInfoExtractor.getFileStorageInfo(ChimeraOsmStorageInfoExtractor.java:44)
at org.dcache.chimera.namespace.ChimeraHsmStorageInfoExtractor.getStorageInfo(ChimeraHsmStorageInfoExtractor.java:250)
at org.dcache.chimera.namespace.ChimeraNameSpaceProvider.getFileAttributes(ChimeraNameSpaceProvider.java:1036)
at org.dcache.chimera.namespace.ChimeraNameSpaceProvider.list(ChimeraNameSpaceProvider.java:1322)
at diskCacheV111.namespace.ForwardingNameSpaceProvider.list(ForwardingNameSpaceProvider.java:155)
at diskCacheV111.namespace.MonitoringNameSpaceProvider.list(MonitoringNameSpaceProvider.java:282)
at diskCacheV111.namespace.PnfsManagerV3.listDirectory(PnfsManagerV3.java:2417)
at diskCacheV111.namespace.PnfsManagerV3$ProcessThread.run(PnfsManagerV3.java:2502)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
I've collected some debugging info. Let me know if you need that.
I still have the old ones, I believe.
So, how about this:
pnfsmanager.db.connections.max = 30
Above is the default value. Can you set it to 100 (or any number greater than 60, based on your settings; IMO 100 is a good number) and restart PnfsManager?
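For reference, in the layout file that would look something like this (the thread counts are the ones from your configuration above; the exact pool size is your choice, as long as it comfortably exceeds the total number of PnfsManager worker threads):
[namespaceDomain/pnfsmanager]
pnfsmanager.limits.threads = 24
pnfsmanager.limits.list-threads = 30
# default is 30 (see above); keep it well above the number of worker threads
pnfsmanager.db.connections.max = 100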
Thanks Dmitry, I've set pnfsmanager.db.connections.max = 100.
The PnfsManager crashed again with "java.util.ConcurrentModificationException: null".
Around that time (14:12), we saw a surge in database connections.
No long Postgres queries around the time of the exception.
Assuming we again reached the limit of database connections, we increased pnfsmanager.db.connections.max from 100 to 200 (the max in Postgres is 1000) and restarted the PnfsManager.
14 Apr 2024 14:12:11 (PnfsManager) [door:webdav2882-penguin13@webdav2882-penguin13Domain:AAYWDWoRECg PnfsManager PnfsListDirectory] processPnfsMessage: proc-33 : {}
java.util.ConcurrentModificationException: null
at java.base/java.util.ArrayList.writeObject(ArrayList.java:901)
at java.base/jdk.internal.reflect.GeneratedMethodAccessor300.invoke(Unknown Source)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:566)
at java.base/java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1016)
at java.base/java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1487)
at java.base/java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1423)
at java.base/java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1169)
at java.base/java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1543)
at java.base/java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1500)
at java.base/java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1423)
at java.base/java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1169)
at java.base/java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:345)
at dmg.cells.nucleus.MsgSerializerJos.encode(MsgSerializerJos.java:45)
at dmg.cells.nucleus.SerializationHandler.encode(SerializationHandler.java:77)
at dmg.cells.nucleus.CellMessage.encodeWith(CellMessage.java:239)
at dmg.cells.nucleus.CellGlue.sendMessage(CellGlue.java:448)
at dmg.cells.nucleus.CellNucleus.sendMessage(CellNucleus.java:417)
at dmg.cells.nucleus.CellAdapter.sendMessage(CellAdapter.java:429)
at dmg.cells.nucleus.AbstractCellComponent.sendMessage(AbstractCellComponent.java:36)
at diskCacheV111.namespace.PnfsManagerV3.access$200(PnfsManagerV3.java:177)
at diskCacheV111.namespace.PnfsManagerV3$ProcessThread.fold(PnfsManagerV3.java:2544)
at diskCacheV111.namespace.PnfsManagerV3$ProcessThread.run(PnfsManagerV3.java:2513)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
I just found out that our secondary PnfsManager was the leader. I had set that up on the advice of Tigran, for troubleshooting.
[root@install-fes ~]# dcache-admin-command PnfsManager-RO-Tropomi 'pnfsmanager ha show participants'
Participant{id='PnfsManager-RO-Tropomi@anonwebdav-tropomiDomain', isLeader=true}
Participant{id='PnfsManager@namespaceDomain', isLeader=false}
That is not efficient because all database changes come in through the main PnfsManager, which forwards them to the secondary PnfsManager-RO-Tropomi, which then sends them back to the node running the main PnfsManager because that's also the node running the databases.
Also, we did the database connections tuning in the main PnfsManager, but I guess it does not have much effect if it is only the follower.
I've now used pnfsmanager ha release leadership to switch roles.
Hi @DmitryLitvintsev, I was looking at the code related to this issue. We have the following setting set to true: 'pnfsmanager.enable.parallel-listing'. That means we only have one list queue instead of one per list thread (PnfsManagerV3.java:463). I was wondering if this could cause the ConcurrentModificationException. Looking at the above stacktrace, something seems to go wrong during the folding of the dir-list response messages. It's like two _listProcessThreads are trying to concurrently write to the same _listQueue. It would make sense that not many dCache instances run into these issues, because they are probably using the new list-thread-per-queue implementation.
May be, but you seem to have this issue after the upgrade to 9.2, whereas this code was there in 8.2 and you ran without problems.
When you raised the DB pool connection size, did it change the frequency of these crashes at all?
BTW, do I get it right that the stacktrace now looks different? The stacktrace above is triggered in message folding:
at diskCacheV111.namespace.PnfsManagerV3$ProcessThread.fold(PnfsManagerV3.java:2544)
Did it also complain about:
20 Mar 2024 21:12:42 (PnfsManager) [door:GFTP-seacow13-AAYUHTTGIZg@ftp-seacow13Domain GFTP-seacow13-AAYUHTTGIZg PnfsGetFileAttributes] Could not get connection to database
java.sql.SQLTransientConnectionException: HikariPool-1 - Connection is not available, request timed out after 30000ms.
IOW, this looks like a new stack trace to me, different from what it was before.
_listQueue contains 1 element, which is a BlockingQueue, which is thread safe.
The fault in that stacktrace happens on:
sendMessage(envelope);
And further down in the serializer. Anyhow, unless I missed this before in the data previously provided by Onno, this looks like a new problem, different from the original one caused by the mismatch between the number of allowed parallel threads and available DB connections.
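For illustration only, here is a standalone JDK-level sketch (not dCache code) of how that exception comes about: serializing a plain ArrayList while another thread is still mutating it typically fails with exactly this ConcurrentModificationException, thrown from ArrayList.writeObject, just as in the stack traces above.
import java.io.ByteArrayOutputStream;
import java.io.ObjectOutputStream;
import java.util.ArrayList;
import java.util.List;

public class ConcurrentSerializationDemo {
    public static void main(String[] args) throws Exception {
        List<Integer> list = new ArrayList<>();
        for (int i = 0; i < 100_000; i++) {
            list.add(i);
        }

        // One thread keeps mutating the list (a stand-in for whatever might
        // still touch a reply message after it has been queued for folding).
        Thread mutator = new Thread(() -> {
            while (!Thread.currentThread().isInterrupted()) {
                list.add(0);
                list.remove(list.size() - 1);
            }
        });
        mutator.setDaemon(true);
        mutator.start();

        // Another thread serializes the same list, as MsgSerializerJos does
        // for a CellMessage. ArrayList.writeObject checks its modCount and
        // typically throws java.util.ConcurrentModificationException here.
        try (ObjectOutputStream oos = new ObjectOutputStream(new ByteArrayOutputStream())) {
            oos.writeObject(list);
        } finally {
            mutator.interrupt();
        }
    }
}
So, if some other thread were still modifying a list inside the directory-listing reply while fold serializes it for sendMessage, this is exactly the failure mode we would expect to see.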
The PnfsManager crashed again with "java.util.ConcurrentModificationException: null".
Around that time (14:12), we saw a surge in database connections.
No long Postgres queries around the time of the exception.
Assuming we again reached the limit of database connections, we increased pnfsmanager.db.connections.max from 100 to 200 (the max in Postgres is 1000) and restarted the PnfsManager.
14 Apr 2024 14:12:11 (PnfsManager) [door:webdav2882-penguin13@webdav2882-penguin13Domain:AAYWDWoRECg PnfsManager PnfsListDirectory] processPnfsMessage: proc-33 : {}
java.util.ConcurrentModificationException: null
[same stack trace as quoted above, ending in PnfsManagerV3$ProcessThread.fold]
@onnozweers
Although the exception is the same:
java.util.ConcurrentModificationException: null
the stacktrace looks different, no? It looks like the fault is happening in the folding method. This seems new to me. I will have a look at this piece.
@onnozweers @trancilo
_listQueues in your case contains a single BlockingQueue element, and BlockingQueue is thread safe.
The fault happens when serializing the message, when it is called by sendMessage(envelope) from the fold call.
At first glance this looks clean, and that code hasn't changed since 2016.
As a way of debugging, you can try to disable folding:
pnfsmanager.enable.folding = false
and see if you still get the same stack.
Also, if you can tolerate logging at INFO level, you can raise logging to INFO in PnfsManager; just before the stack trace we should see a "Folded ..." message in the log, so we will know what message caused the issue.
Also, does PnfsManager actually crash now, or does it report the stack in the log and proceed?
@DmitryLitvintsev I see that indeed we have been posting two different stacktraces. But these are happening right after each other, for example on March 18:
--- there are similar stacktraces before---
18 Mar 2024 23:17:23 (PnfsManager) [door:webdav443-seacow14@webdav443-seacow14Domain:AAYT9rImWKE PnfsManager PnfsListDirectory] processPnfsMessage: proc-44 : {}
java.util.ConcurrentModificationException: null
at java.base/java.util.ArrayList.writeObject(ArrayList.java:901)
at java.base/jdk.internal.reflect.GeneratedMethodAccessor266.invoke(Unknown Source)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:566)
at java.base/java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1016)
at java.base/java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1487)
at java.base/java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1423)
at java.base/java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1169)
at java.base/java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1543)
at java.base/java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1500)
at java.base/java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1423)
at java.base/java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1169)
at java.base/java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:345)
at dmg.cells.nucleus.MsgSerializerJos.encode(MsgSerializerJos.java:45)
at dmg.cells.nucleus.SerializationHandler.encode(SerializationHandler.java:77)
at dmg.cells.nucleus.CellMessage.encodeWith(CellMessage.java:239)
at dmg.cells.nucleus.CellGlue.sendMessage(CellGlue.java:448)
at dmg.cells.nucleus.CellNucleus.sendMessage(CellNucleus.java:417)
at dmg.cells.nucleus.CellAdapter.sendMessage(CellAdapter.java:429)
at dmg.cells.nucleus.AbstractCellComponent.sendMessage(AbstractCellComponent.java:36)
at diskCacheV111.namespace.PnfsManagerV3.access$200(PnfsManagerV3.java:177)
at diskCacheV111.namespace.PnfsManagerV3$ProcessThread.fold(PnfsManagerV3.java:2544)
at diskCacheV111.namespace.PnfsManagerV3$ProcessThread.run(PnfsManagerV3.java:2513)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
18 Mar 2024 23:17:24 (PnfsManager) [door:webdav443-seacow14@webdav443-seacow14Domain:AAYT9rImWKE PnfsManager PnfsListDirectory] processPnfsMessage: proc-24 : {}
java.util.ConcurrentModificationException: null
at java.base/java.util.ArrayList.writeObject(ArrayList.java:901)
at java.base/jdk.internal.reflect.GeneratedMethodAccessor266.invoke(Unknown Source)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:566)
at java.base/java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1016)
at java.base/java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1487)
at java.base/java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1423)
at java.base/java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1169)
at java.base/java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1543)
at java.base/java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1500)
at java.base/java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1423)
at java.base/java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1169)
at java.base/java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:345)
at dmg.cells.nucleus.MsgSerializerJos.encode(MsgSerializerJos.java:45)
at dmg.cells.nucleus.SerializationHandler.encode(SerializationHandler.java:77)
at dmg.cells.nucleus.CellMessage.encodeWith(CellMessage.java:239)
at dmg.cells.nucleus.CellGlue.sendMessage(CellGlue.java:448)
at dmg.cells.nucleus.CellNucleus.sendMessage(CellNucleus.java:417)
at dmg.cells.nucleus.CellAdapter.sendMessage(CellAdapter.java:429)
at dmg.cells.nucleus.AbstractCellComponent.sendMessage(AbstractCellComponent.java:36)
at diskCacheV111.namespace.PnfsManagerV3.postProcessLocationModificationMessage(PnfsManagerV3.java:3073)
at diskCacheV111.namespace.PnfsManagerV3.postProcessMessage(PnfsManagerV3.java:2992)
at diskCacheV111.namespace.PnfsManagerV3$ProcessThread.run(PnfsManagerV3.java:2511)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
--- more stacktraces follow
For the last crash on April 14 I only see the stacktrace with 'ProcessThread.fold(PnfsManagerV3.java:2544)'. Both instances mentioned above have the following in the log after the ConcurrentModificationException:
14 Apr 2024 14:12:13 (PnfsManager) [door:webdav2882-penguin13@webdav2882-penguin13Domain:AAYWDWoRECg PnfsManager PnfsListDirectory] processPnfsMessage: proc-26 : {}
java.lang.IndexOutOfBoundsException: Index 0 out of bounds for length 0
at java.base/jdk.internal.util.Preconditions.outOfBounds(Preconditions.java:64)
at java.base/jdk.internal.util.Preconditions.outOfBoundsCheckIndex(Preconditions.java:70)
at java.base/jdk.internal.util.Preconditions.checkIndex(Preconditions.java:248)
at java.base/java.util.Objects.checkIndex(Objects.java:374)
at java.base/java.util.ArrayList.get(ArrayList.java:459)
at dmg.cells.nucleus.CellPath.getSourceAddress(CellPath.java:157)
at dmg.cells.nucleus.CellMessage.revertDirection(CellMessage.java:189)
at diskCacheV111.namespace.PnfsManagerV3$ProcessThread.fold(PnfsManagerV3.java:2543)
at diskCacheV111.namespace.PnfsManagerV3$ProcessThread.run(PnfsManagerV3.java:2513)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
The PnfsManager does not completely crash immediately: in the beginning, right after the exception, some messages still seem to be processed. But after a while everything grinds to a halt.
May be, but you seem to have this issue after upgrade to 9.2 whereas this code was there in 8.2 and you run without problems.
Yes indeed. But I was thinking maybe it's this change that introduced the issue, because now we have multiple threads going over one queue. But like you said: it's a BlockingQueue. So I'm not sure how this could go wrong.
Also, if you can tolerate logging at INFO level...
The PnfsManager-getting-stuck problem only happens once every two or three weeks, so putting the logging level to INFO will be difficult. We are currently waiting for the next crash. We now store 'PnfsManager monitoring info' from the dCache admin interface every minute. Hopefully this will give us some extra insight into what is happening during the ConcurrentModificationException.
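For example, something like this in cron, using the same dcache-admin-command wrapper shown earlier in this thread (the cell command, path and interval are just an illustration of what we collect):
* * * * * root dcache-admin-command PnfsManager 'info' >> /var/log/dcache/pnfsmanager-info.log 2>&1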
As a way of debugging, you can try to disable folding:...
I will enable that after the next crash and the subsequent restart. Another option: we are looking into whether we can move away from 'pnfsmanager.enable.parallel-listing', which would bring us more in line with other dCache instances.