dcache
dcache copied to clipboard
RemoteTransferManager issues
Hello We have observed many error during an 250k request in the dCache production system. Could you please advise if this should be reported here or in any other system?
dcache-6.2.22-1.noarch
Many errors in
22 Sep 2021 03:30:47 (RemoteTransferManager) [door:WebDAV2-dcdoor08-2@webdav2-dcdoor08_httpsDomain:AAXMkH4k9zA WebDAV2-dcdoor08-2 RemoteTransferManager] javax.jdo.JDODataStoreException: Exception thrown flushing changes to datastore
NestedThrowables:
java.sql.BatchUpdateException: Batch entry 0 INSERT INTO "REMOTETRANSFERS" ("CREATED","CREATIONTIME","CREDENTIALID","LIFETIME","MOOVERID","PNFSID","LOCALPATH","REMOTEURL","STATE","WRITE","ID") VALUES ('FALSE',1632295847274,0,0,NULL,NULL,'/pnfs/usatlas.bnl.gov/BNLT0D1/rucio/data17_13TeV/95/c2/DAOD_TOPQ4.21546471._001400.pool.root.1','https://dcgftp.usatlas.bnl.gov:443/pnfs/usatlas.bnl.gov/lake/rucio/data17_13TeV/95/c2/DAOD_TOPQ4.21546471._001400.pool.root.1?copy_mode=pull',0,'TRUE',1632295847274000) was aborted: ERROR: null value in column "GID" violates not-null constraint
[root@dcsrm03 ~]# grep 'Batch entry 0 INSERT INTO "REMOTETRANSFERS"' /var/log/dcache/srm-dcsrm03Domain.log|wc
275923 5794383 163659732
[root@dcsrm03 ~]# grep 'Exception thrown flushing changes to datastore' /var/log/dcache/srm-dcsrm03Domain.log|wc
2462370 52788034 1107830210
The GID is not sent in the request it is required for the transaction as is set as not NULL.
dcache=# \d "REMOTETRANSFERS";
Table "public.REMOTETRANSFERS"
Column | Type | Collation | Nullable | Default
--------------------+--------------------------+-----------+----------+---------
ID | bigint | | not null |
CREATED | boolean | | not null |
CREATIONTIME | bigint | | not null |
CREDENTIAL_ID | bigint | | |
GID | integer | | not null |
LIFETIME | bigint | | not null |
MOOVERID | bigint | | |
PNFSID | character varying(32672) | | |
LOCALPATH | character varying(32672) | | |
POOLNAME | character varying(32672) | | |
REMOTEURL | character varying(32672) | | |
SPACERESERVATIONID | character varying(256) | | |
STATE | integer | | not null |
WRITE | boolean | | not null |
UID | integer | | not null |
CREDENTIALID | bigint | | |
Indexes:
"REMOTETRANSFERS_pkey" PRIMARY KEY, btree ("ID")
"REMOTETRANSFERS_CREATIONTIME_IDX" btree ("CREATIONTIME")
"REMOTETRANSFERS_CREATION_TIME_IDX" btree ("CREATIONTIME")
"REMOTETRANSFERS_MOOVERID_IDX" btree ("MOOVERID")
"REMOTETRANSFERS_PNFSIDSTRING_IDX" btree ("PNFSID")
"REMOTETRANSFERS_POOLNAME_IDX" btree ("POOLNAME")
"REMOTETRANSFERS_SPACERESERVATIONID_IDX" btree ("SPACERESERVATIONID")
"REMOTETRANSFERS_STATE_IDX" btree ("STATE")
dcache=# \d "REMOTETRANSFERS_B"
Table "public.REMOTETRANSFERS_B"
Column | Type | Collation | Nullable | Default
--------------------+--------------------------+-----------+----------+---------
ID | bigint | | not null |
CREATED | boolean | | not null |
CREATIONTIME | bigint | | not null |
CREDENTIAL_ID | bigint | | |
GID | integer | | not null |
LIFETIME | bigint | | not null |
MOOVERID | bigint | | |
PNFSID | character varying(32672) | | |
LOCALPATH | character varying(32672) | | |
POOLNAME | character varying(32672) | | |
REMOTEURL | character varying(32672) | | |
SPACERESERVATIONID | character varying(256) | | |
STATE | integer | | not null |
WRITE | boolean | | not null |
UID | integer | | not null |
CREDENTIALID | bigint | | |
Indexes:
"REMOTETRANSFERS_B_pkey" PRIMARY KEY, btree ("ID")
"REMOTETRANSFERS_B_CREATIONTIME_IDX" btree ("CREATIONTIME")
"REMOTETRANSFERS_B_PNFSIDSTRING_IDX" btree ("PNFSID")
"REMOTETRANSFERS_B_POOLNAME_IDX" btree ("POOLNAME")
"REMOTETRANSFERS_B_SPACERESERVATIONID_IDX" btree ("SPACERESERVATIONID")
Question, Do you set this variable in dcahe.conf (or layout file):
transfermanagers.enable.db=true
?
Setting it to false may help to avoid errors for now. We will investogate why GID is not filled for HTTPS 3rd party transfers.
Dmitry
Hi Thank you. Yes transfermanagers.enable.db=true
Carlos
OK, I now understand what has happened.
In 2009, dCache was updated so it no longer logs uid
and gid
values in the database (commit aee2fa25
).
Looking at the patch, I believe this was an unintended consequence: a bug. That's just my guess, though.
Also, I don't believe this change in what is logged was communicated in anyway, and the database schema isn't controlled automatically.
Therefore, this change would have triggered the error message for sites using SRM third-party transfers and have enabled storing state in the database. I guess, since nobody is using SRM third-party transfers, this bug went unnoticed.
HTTP-TPC uses the same dCache-internal infrastructure as SRM TPC, so the bug is finally observed and reported.
So, the question is: do we fix this by starting to log uid and gid values (potentially breaking existing sites) or ask older sites to update their schema (i.e., ALTER TABLE DROP ...
) ?
thank you.
For tables REMOTETRANSFER and REMOTETRANSFER_B be if modified as
alter table "REMOTETRANSFERS_B" alter column "GID" drop not null; alter table "REMOTETRANSFERS" alter column "GID" drop not null;
Will lock the table. Is this what you are thinking of?
Yes, you can drop the NOT NULL
requirement: that's a good idea as a short-term solution.
We (dCache team) can decide whether we want to reintroduce the uid/gid logging. That's a decision we can make independently.