dcache icon indicating copy to clipboard operation
dcache copied to clipboard

RemoteTransferManager issues

Open cfgamboa opened this issue 3 years ago • 7 comments

Hello We have observed many error during an 250k request in the dCache production system. Could you please advise if this should be reported here or in any other system?

dcache-6.2.22-1.noarch

Many errors in

22 Sep 2021 03:30:47 (RemoteTransferManager) [door:WebDAV2-dcdoor08-2@webdav2-dcdoor08_httpsDomain:AAXMkH4k9zA WebDAV2-dcdoor08-2 RemoteTransferManager] javax.jdo.JDODataStoreException: Exception thrown flushing changes to datastore
NestedThrowables:
java.sql.BatchUpdateException: Batch entry 0 INSERT INTO "REMOTETRANSFERS" ("CREATED","CREATIONTIME","CREDENTIALID","LIFETIME","MOOVERID","PNFSID","LOCALPATH","REMOTEURL","STATE","WRITE","ID") VALUES ('FALSE',1632295847274,0,0,NULL,NULL,'/pnfs/usatlas.bnl.gov/BNLT0D1/rucio/data17_13TeV/95/c2/DAOD_TOPQ4.21546471._001400.pool.root.1','https://dcgftp.usatlas.bnl.gov:443/pnfs/usatlas.bnl.gov/lake/rucio/data17_13TeV/95/c2/DAOD_TOPQ4.21546471._001400.pool.root.1?copy_mode=pull',0,'TRUE',1632295847274000) was aborted: ERROR: null value in column "GID" violates not-null constraint
[root@dcsrm03 ~]# grep 'Batch entry 0 INSERT INTO "REMOTETRANSFERS"'   /var/log/dcache/srm-dcsrm03Domain.log|wc
 275923 5794383 163659732
[root@dcsrm03 ~]# grep 'Exception thrown flushing changes to datastore'   /var/log/dcache/srm-dcsrm03Domain.log|wc
2462370 52788034 1107830210

cfgamboa avatar Sep 22 '21 14:09 cfgamboa

The GID is not sent in the request it is required for the transaction as is set as not NULL.

dcache=# \d "REMOTETRANSFERS";
                         Table "public.REMOTETRANSFERS"
       Column       |           Type           | Collation | Nullable | Default 
--------------------+--------------------------+-----------+----------+---------
 ID                 | bigint                   |           | not null | 
 CREATED            | boolean                  |           | not null | 
 CREATIONTIME       | bigint                   |           | not null | 
 CREDENTIAL_ID      | bigint                   |           |          | 
 GID                | integer                  |           | not null | 
 LIFETIME           | bigint                   |           | not null | 
 MOOVERID           | bigint                   |           |          | 
 PNFSID             | character varying(32672) |           |          | 
 LOCALPATH          | character varying(32672) |           |          | 
 POOLNAME           | character varying(32672) |           |          | 
 REMOTEURL          | character varying(32672) |           |          | 
 SPACERESERVATIONID | character varying(256)   |           |          | 
 STATE              | integer                  |           | not null | 
 WRITE              | boolean                  |           | not null | 
 UID                | integer                  |           | not null | 
 CREDENTIALID       | bigint                   |           |          | 
Indexes:
    "REMOTETRANSFERS_pkey" PRIMARY KEY, btree ("ID")
    "REMOTETRANSFERS_CREATIONTIME_IDX" btree ("CREATIONTIME")
    "REMOTETRANSFERS_CREATION_TIME_IDX" btree ("CREATIONTIME")
    "REMOTETRANSFERS_MOOVERID_IDX" btree ("MOOVERID")
    "REMOTETRANSFERS_PNFSIDSTRING_IDX" btree ("PNFSID")
    "REMOTETRANSFERS_POOLNAME_IDX" btree ("POOLNAME")
    "REMOTETRANSFERS_SPACERESERVATIONID_IDX" btree ("SPACERESERVATIONID")
    "REMOTETRANSFERS_STATE_IDX" btree ("STATE")
dcache=# \d "REMOTETRANSFERS_B"
                        Table "public.REMOTETRANSFERS_B"
       Column       |           Type           | Collation | Nullable | Default 
--------------------+--------------------------+-----------+----------+---------
 ID                 | bigint                   |           | not null | 
 CREATED            | boolean                  |           | not null | 
 CREATIONTIME       | bigint                   |           | not null | 
 CREDENTIAL_ID      | bigint                   |           |          | 
 GID                | integer                  |           | not null | 
 LIFETIME           | bigint                   |           | not null | 
 MOOVERID           | bigint                   |           |          | 
 PNFSID             | character varying(32672) |           |          | 
 LOCALPATH          | character varying(32672) |           |          | 
 POOLNAME           | character varying(32672) |           |          | 
 REMOTEURL          | character varying(32672) |           |          | 
 SPACERESERVATIONID | character varying(256)   |           |          | 
 STATE              | integer                  |           | not null | 
 WRITE              | boolean                  |           | not null | 
 UID                | integer                  |           | not null | 
 CREDENTIALID       | bigint                   |           |          | 
Indexes:
    "REMOTETRANSFERS_B_pkey" PRIMARY KEY, btree ("ID")
    "REMOTETRANSFERS_B_CREATIONTIME_IDX" btree ("CREATIONTIME")
    "REMOTETRANSFERS_B_PNFSIDSTRING_IDX" btree ("PNFSID")
    "REMOTETRANSFERS_B_POOLNAME_IDX" btree ("POOLNAME")
    "REMOTETRANSFERS_B_SPACERESERVATIONID_IDX" btree ("SPACERESERVATIONID")
    

cfgamboa avatar Sep 22 '21 15:09 cfgamboa

Question, Do you set this variable in dcahe.conf (or layout file):

transfermanagers.enable.db=true

?

Setting it to false may help to avoid errors for now. We will investogate why GID is not filled for HTTPS 3rd party transfers.

Dmitry

DmitryLitvintsev avatar Sep 22 '21 15:09 DmitryLitvintsev

Hi Thank you. Yes transfermanagers.enable.db=true

Carlos

cfgamboa avatar Sep 22 '21 15:09 cfgamboa

OK, I now understand what has happened.

In 2009, dCache was updated so it no longer logs uid and gid values in the database (commit aee2fa25).

Looking at the patch, I believe this was an unintended consequence: a bug. That's just my guess, though.

Also, I don't believe this change in what is logged was communicated in anyway, and the database schema isn't controlled automatically.

Therefore, this change would have triggered the error message for sites using SRM third-party transfers and have enabled storing state in the database. I guess, since nobody is using SRM third-party transfers, this bug went unnoticed.

HTTP-TPC uses the same dCache-internal infrastructure as SRM TPC, so the bug is finally observed and reported.

paulmillar avatar Sep 23 '21 11:09 paulmillar

So, the question is: do we fix this by starting to log uid and gid values (potentially breaking existing sites) or ask older sites to update their schema (i.e., ALTER TABLE DROP ...) ?

paulmillar avatar Sep 23 '21 11:09 paulmillar

thank you.

For tables REMOTETRANSFER and REMOTETRANSFER_B be if modified as

alter table "REMOTETRANSFERS_B" alter column "GID" drop not null; alter table "REMOTETRANSFERS" alter column "GID" drop not null;

Will lock the table. Is this what you are thinking of?

cfgamboa avatar Sep 23 '21 12:09 cfgamboa

Yes, you can drop the NOT NULL requirement: that's a good idea as a short-term solution.

We (dCache team) can decide whether we want to reintroduce the uid/gid logging. That's a decision we can make independently.

paulmillar avatar Sep 23 '21 13:09 paulmillar