
Agents continuously failing to insert blocks into DBS

Open amaltaro opened this issue 10 months ago • 11 comments

Impact of the bug WMAgent

Describe the bug There seems to be an unusual number of blocks that are continuously failing to be inserted into DBS Server, with a variety of errors, as can be seen in [1] and [2].

For [1], the block(s) actually belong to a workflow that went all the way to completed in the system and was then rejected, as can be seen from this ReqMgr2 API.

For [2], that block belongs to a workflow that is currently in running-closed status. The block has been failing injection for about 10 hours.

This is based on vocms0255, I haven't yet checked the other agents.

How to reproduce it Not sure

Expected behavior For the rejected (or aborted) workflow, we should make DBS3Upload aware that the output data is no longer relevant and skip its injection into DBS Server. This might require persisting information in the DBSBuffer tables (like marking the block and relevant files as injected), otherwise the same blocks will come up every time we run a cycle of the DBS3Upload component.
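
A minimal sketch of that bookkeeping, assuming a hypothetical dict layout and status marker (the real DBSBuffer schema and DBS3Upload internals differ):

```python
# Sketch only: the dict layout and the "InDBS" marker are assumptions,
# not the actual DBSBuffer schema used by DBS3Upload.
def skip_blocks_for_dead_workflows(blocks, workflow_status,
                                   dead_statuses=("rejected", "aborted")):
    """Mark blocks of rejected/aborted workflows as injected so the
    DBS3Upload poller stops retrying them on every cycle."""
    skipped = []
    if workflow_status not in dead_statuses:
        return skipped
    for block in blocks:
        block["status"] = "InDBS"  # pretend-injected: the poller skips it
        skipped.append(block["name"])
    return skipped
```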

For the malformed SQL statement error (note the typo "mailformed" in the server message), we probably need to correlate this error with further information from DBS Server. Is it the same error as we have with concurrent HTTP requests, or is something else actually wrong? Maybe @todor-ivanov can shed some light on this. The expected behavior of this fix is to be determined.

Additional context and error message [1]

2024-04-11 15:32:06,562:140685583296256:ERROR:DBSUploadPoller:Error trying to process block /TKCosmics_38T/Run3Winter24Reco-TkAlCosmics0T-AlcaRecoTkAlCosmics0T_cosmics_133X_mcRun3_2024cosmics_realistic_deco_v1-v5/ALCARECO#a5225151-fe56-45b1-b4dc-244b4644c02d through DBS. Details: DBSError code: 0, message: , reason: 

[2]

2024-04-11 14:09:09,438:140685583296256:ERROR:DBSUploadPoller:Error trying to process block /SingleNeutrino_E-10-gun/Run3Winter24Reco-133X_mcRun3_2024_realistic_v10-v2/GEN-SIM-RECO#995c334f-6648-4c55-98a1-44afbed8a57f through DBS. Details: DBSError code: 131, message: 5d0aae4c60a9089bfd22c0602c1bcecffd88106ed1a4578923297eda9e7da9d2 unable to find dataset_id for /SingleNeutrino_E-10-gun/Run3Winter24Digi-133X_mcRun3_2024_realistic_v10-v2/GEN-SIM-RAW, error DBSError Code:103 Description:DBS DB query error, e.g. mailformed SQL statement Function:dbs.GetID Message: Error: sql: no rows in result set, reason: DBSError Code:103 Description:DBS DB query error, e.g. mailformed SQL statement Function:dbs.GetID Message: Error: sql: no rows in result set

amaltaro avatar Apr 11 '24 14:04 amaltaro

@todor-ivanov as discussed in the meeting today - and right now with Andrea as well - let us put this back to ToDo and come back to it at the beginning of October (2 more weeks should not hurt us here).

amaltaro avatar Sep 16 '24 15:09 amaltaro

Following a discussion in the Mattermost wm-ops thread with @amaltaro.

Related to the failures inserting data into DBS, the current T0 production agent is struggling to insert files into blocks. I see the following error message in the DBS3Upload component log:

Error trying to process block /AlCaP0/Run2024H-v1/RAW#983e96f3-5dca-4919-a3b4-fa291f145fb7 through DBS. Details: DBSError code: 110, message: d93d36f53eaf3097db5c9f50851359041c418a18727e6f363e6c18c37d3f25bb unable to insert files, error DBSError Code:110 Description:DBS DB insert record error Function:dbs.bulkblocks.insertFilesViaChunks Message: Error: concurrency error, reason: DBSError Code:110 Description:DBS DB insert record error Function:dbs.bulkblocks.insertFilesViaChunks Message: Error: concurrency error

This is present for the following blocks:

  • /AlCaP0/Run2024H-v1/RAW#3bbaf481-068c-4fda-8656-663fa9a987a4
  • /Muon0/Run2024H-v1/RAW#7369ccdf-3d3a-4d32-bad9-b04b02f279d4
  • /AlCaP0/Run2024H-v1/RAW#92fab5d9-9a27-4a7c-a57e-4b2691c654cd
  • /AlCaP0/Run2024H-v1/RAW#983e96f3-5dca-4919-a3b4-fa291f145fb7

LinaresToine avatar Sep 26 '24 19:09 LinaresToine

I suggest that you review https://github.com/dmwm/WMCore/issues/11106 which describes the actual issue with concurrent data insertion. In short, concurrent injection only works once all the pieces (dataset configuration, etc.) are already in place. To solve this problem, someone must first inject one block with all the necessary information; after that, the concurrent pattern can safely be used to inject the other blocks.
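
The pattern can be sketched as follows; `insert_block` is a placeholder standing in for the actual DBS injection call (e.g. the client's bulk-block insert), not WMCore's implementation:

```python
from concurrent.futures import ThreadPoolExecutor

# Sketch of the "seed first, then go concurrent" pattern described above.
# `insert_block` is a placeholder for the actual DBS injection call.
def inject_blocks(blocks, insert_block, max_workers=4):
    if not blocks:
        return []
    # First block serially: creates dataset configuration, eras, tier, etc.
    results = [insert_block(blocks[0])]
    # Remaining blocks can now be injected concurrently without racing
    # on the dataset-level inserts.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results.extend(pool.map(insert_block, blocks[1:]))
    return results
```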

vkuznet avatar Sep 26 '24 20:09 vkuznet

@vkuznet thank you for jumping into this discussion.

I had a feeling there was another obscure problem with DBS Server. Reviewing the ticket you pointed to (11106), and according to your sentence above, my understanding is: provided that at least one block has been injected into DBS for a given dataset, the "concurrency error" should no longer happen, since all the foundational information is already in the database. Correct?

I picked one of the blocks provided by Antonio and queried DBS Server for its blocks: https://cmsweb.cern.ch/dbs/prod/global/DBSReader/blocks?dataset=/AlCaP0/Run2024H-v1/RAW

as you can see, this dataset already has a bunch of blocks in the database. So, how come we are having a "concurrency error" here?

amaltaro avatar Sep 26 '24 21:09 amaltaro

If you inspect the code [1], in order to insert DBS blocks concurrently we need to have in place:

  • dataset configuration
  • primary dataset info
  • processing era
  • acquisition era
  • data tier
  • physics group
  • dataset access type
  • processed dataset

So, if all of this information is present and consistent across all blocks in DBS, then the answer is yes: the concurrency error (based on database content) should not arise. In other words, the DBS server first acquires or inserts this info into the DBS tables, and if two or more HTTP calls arrive at the same time this can cause a database error, which leads to the concurrency error from the DBS server. Whether that is the case for the discussed blocks, I don't know. But it is possible to not have all the information present in the DB across all blocks if any of the above attributes differ among them.

You may look at the example bulkblocks JSON [2] to see how this information is actually structured and provided to DBS. In particular, the information in dataset_conf_list and file_conf_list is used to look up the aforementioned info, along with primds, processing_era, etc. So, if you inject multiple JSONs they need to have identical info for those attributes, otherwise you may run into the race conditions described in https://github.com/dmwm/WMCore/issues/11106

[1] https://github.com/dmwm/dbs2go/blob/master/dbs/bulkblocks2.go#L478 [2] https://github.com/dmwm/dbs2go/blob/master/test/bulkblocks.json
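
To illustrate, a rough sketch of how those pieces fit into a bulkblocks payload; all values below are placeholders, and field names should be checked against the actual example in [2]:

```python
# Placeholder payload: values are invented, only the field grouping matters.
bulk_block = {
    "dataset_conf_list": [{"app_name": "cmsRun", "release_version": "...",
                           "pset_hash": "...", "global_tag": "..."}],
    "primds": {"primary_ds_name": "...", "primary_ds_type": "..."},
    "processing_era": {"processing_version": 1},
    "acquisition_era": {"acquisition_era_name": "..."},
    "dataset": {"dataset": "...", "data_tier_name": "...",
                "physics_group_name": "...", "dataset_access_type": "...",
                "processed_ds_name": "..."},
    "block": {"block_name": "..."},
    "files": [],           # per-file records
    "file_conf_list": [],
}

# These keys must be identical across every block of the same dataset,
# otherwise concurrent injection can race on the dataset-level inserts.
SHARED_KEYS = ("dataset_conf_list", "primds", "processing_era",
               "acquisition_era", "dataset")

def consistent(block_a, block_b):
    return all(block_a[k] == block_b[k] for k in SHARED_KEYS)
```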

vkuznet avatar Sep 26 '24 21:09 vkuznet

Valentin, unless there is a bug in the (T0) WMAgent, all the blocks for the same dataset should carry exactly the same metadata: same acquisition era, same primary dataset, and so on.

Having said that, if a block exists in DBS Server, we can conclude that all of its metadata is already available as well. If that metadata is already available and we are trying to inject more blocks for the same dataset, hence with the same metadata, there should be NO concurrency error.

Based on your explanation and on the data shared by Antonio, I fail to see how we would hit a "concurrency error". That means either there is more to it than what we have discussed/understood so far, or the error message is misleading...

In any case, I would suggest having @todor-ivanov follow this up next week, comparing things with the DBS Server logs and against the source code.

amaltaro avatar Sep 26 '24 22:09 amaltaro

I further looked into the dbs code and I think I identified the issue. According to the dbs code

  • the concurrency error is defined here https://github.com/dmwm/dbs2go/blob/master/dbs/errors.go#L20
  • it is used in here https://github.com/dmwm/dbs2go/blob/master/dbs/bulkblocks2.go#L1002
  • more precisely, the code fails in the insertFilesChunk function, see https://github.com/dmwm/dbs2go/blob/master/dbs/bulkblocks2.go#L1018
    • and if you look into the body of this function there are a couple of places where an error can occur; one of them checks the file type attribute "file_type", which should be provided by the input, see https://github.com/dmwm/dbs2go/blob/master/dbs/bulkblocks2.go#L1018

Then, I looked at one of the dbs logs and found

[2024-09-24 00:45:17.228980109 +0000 UTC m=+2471302.202098481] fail to insert files chunks, trec &{IsFileValid:1 DatasetID:15071289 BlockID:37951592 CreationDate:1727138717 CreateBy:/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=cmst0/CN=658085/CN=Robot: CMS Tier0 FilesMap:{mu:{state:0 sema:0} read:{_:[] _:{} v:<nil>} dirty:map[] misses:0} NErrors:2}

So, indeed the input file record DOES NOT contain the required file type attribute; see the File structure here: https://github.com/dmwm/dbs2go/blob/master/dbs/bulkblocks.go#L65. The "file_type" must be present in the provided JSON, otherwise it is assigned the default value 0, which is what the file injection then tries to look up in the database; it should be a non-zero value.

To summarize, I suggest checking the JSON records T0 provides and ensuring they carry "file_type" along with the other file attributes (all of them are defined in this struct: https://github.com/dmwm/dbs2go/blob/master/dbs/bulkblocks.go#L65). Without it the DBS code correctly fails, but it would probably be useful to adjust the error message to properly report the error.
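
A minimal pre-flight check along those lines could look like the sketch below; the required-field list is an assumption based on the File struct linked above, not an authoritative schema:

```python
# Sketch: reject a block dump whose file records lack "file_type" (or other
# fields) before POSTing it, instead of letting DBS fail with a misleading
# "concurrency error". The field list is an assumption, not the full schema.
REQUIRED_FILE_FIELDS = ("logical_file_name", "file_type")

def missing_file_fields(block_dump):
    """Return {lfn: [missing fields]} for incomplete file records."""
    problems = {}
    for frec in block_dump.get("files", []):
        missing = [f for f in REQUIRED_FILE_FIELDS if not frec.get(f)]
        if missing:
            problems[frec.get("logical_file_name", "?")] = missing
    return problems
```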

vkuznet avatar Sep 27 '24 12:09 vkuznet

For the record, here is how the DBS error looks in a log:

[2024-09-24 00:45:17.228980109 +0000 UTC m=+2471302.202098481] fail to insert files chunks, trec &{IsFileValid:1 DatasetID:15071289 BlockID:37951592 CreationDate:1727138717 CreateBy:/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=cmst0/CN=658085/CN=Robot: CMS Tier0 FilesMap:{mu:{state:0 sema:0} read:{_:[] _:{} v:<nil>} dirty:map[] misses:0} NErrors:2}
[2024-09-24 00:45:17.229561212 +0000 UTC m=+2471302.202679583] 5ecdc2bdcd03492fd64efc269de332cdcf1c8a53c3e3cc07168b0c741f0270ba unable to insert files, error DBSError Code:110 Description:DBS DB insert record error Function:dbs.bulkblocks.insertFilesViaChunks Message: Error: concurrency error
[2024-09-24 00:45:17.232415539 +0000 UTC m=+2471302.205533911] DBSError Code:110 Description:DBS DB insert record error Function:dbs.bulkblocks.InsertBulkBlocksConcurrently Message:5ecdc2bdcd03492fd64efc269de332cdcf1c8a53c3e3cc07168b0c741f0270ba unable to insert files, error DBSError Code:110 Description:DBS DB insert record error Function:dbs.bulkblocks.insertFilesViaChunks Message: Error: concurrency error Error: nested DBSError Code:110 Description:DBS DB insert record error Function:dbs.bulkblocks.insertFilesViaChunks Message: Error: concurrency error Stacktrace:
goroutine 300475111 [running]:
github.com/dmwm/dbs2go/dbs.Error({0xb054e0?, 0xc0009f2410?}, 0x6e, {0xc0004f60f0, 0xe6}, {0xa3b23e, 0x2b})
        /go/src/github.com/vkuznet/dbs2go/dbs/errors.go:185 +0x99
github.com/dmwm/dbs2go/dbs.(*API).InsertBulkBlocksConcurrently(0xc00036c000)
        /go/src/github.com/vkuznet/dbs2go/dbs/bulkblocks2.go:743 +0x2546
github.com/dmwm/dbs2go/web.DBSPostHandler({0xb08290, 0xc000012cd8}, 0xc000616700, {0xa1d753, 0xa})
        /go/src/github.com/vkuznet/dbs2go/web/handlers.go:544 +0x1374
github.com/dmwm/dbs2go/web.BulkBlocksHandler({0xb08290?, 0xc000012cd8?}, 0xc000a9f460?)
        /go/src/github.com/vkuznet/dbs2go/web/handlers.go:960 +0x3b
net/http.HandlerFunc.ServeHTTP(0xc00055f1a0?, {0xb08290?, 0xc000012cd8?}, 0x95d5a0?)
        /usr/local/go/src/net/http/server.go:2136 +0x29
github.com/dmwm/dbs2go/web.limitMiddleware.func1({0xb08290?, 0xc000012cd8?}, 0xc00055f1a0?)
        /go/src/github.com/vkuznet/dbs2go/web/middlewares.go:110 +0x32
net/http.HandlerFunc.ServeHTTP(0x7f8c001964c0?, {0xb08290?, 0xc000012cd8?}, 0xc0003

So, you have all the pointers to see which lines of code fail by inspecting the stack trace, and that is exactly what I did.

vkuznet avatar Sep 27 '24 12:09 vkuznet

As far as I can tell, it should always be set like:

      "file_type": "EDM",

@LinaresToine can you please change the component configuration and provide one of the block names that is failing to be inserted, in the following line:

config.DBS3Upload.dumpBlockJsonFor = ""

then restart DBS3Upload and you should soon get a JSON dump of the content that the component is POSTing to the DBS Server. Output file should be under the component directory (e.g. install/DBS3Upload/).
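
For example, using one of the failing block names Antonio listed above (pick whichever is currently failing):

```python
config.DBS3Upload.dumpBlockJsonFor = "/AlCaP0/Run2024H-v1/RAW#983e96f3-5dca-4919-a3b4-fa291f145fb7"
```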

amaltaro avatar Sep 27 '24 19:09 amaltaro

Ok, I changed the config as suggested. Waiting on the loadFiles method to complete the cycle. I'll follow up.

LinaresToine avatar Sep 27 '24 20:09 LinaresToine

I have placed the output json file in /eos/home-c/cmst0/public/dbsError/dbsuploader_block.json.

Another error is showing up in the DBS3Upload component for all 4 pending blocks:

Hit a general exception while inserting block /AlCaP0/Run2024H-v1/RAW#92fab5d9-9a27-4a7c-a57e-4b2691c654cd. Error: (52, 'Empty reply from server')
Traceback (most recent call last):
  File "/data/tier0/WMAgent.venv3/lib64/python3.9/site-packages/WMComponent/DBS3Buffer/DBSUploadPoller.py", line 94, in uploadWorker
    dbsApi.insertBulkBlock(blockDump=block)
  File "/data/tier0/WMAgent.venv3/lib64/python3.9/site-packages/dbs/apis/dbsClient.py", line 647, in insertBulkBlock
    result =  self.__callServer("bulkblocks", data=blockDump, callmethod='POST' )
  File "/data/tier0/WMAgent.venv3/lib64/python3.9/site-packages/dbs/apis/dbsClient.py", line 474, in __callServer
    self.http_response = method_func(self.url, method, params, data, request_headers)
  File "/data/tier0/WMAgent.venv3/lib64/python3.9/site-packages/RestClient/RestApi.py", line 42, in post
    return http_request(self._curl)
  File "/data/tier0/WMAgent.venv3/lib64/python3.9/site-packages/RestClient/RequestHandling/HTTPRequest.py", line 56, in __call__
    curl_object.perform()
pycurl.error: (52, 'Empty reply from server')

LinaresToine avatar Sep 27 '24 21:09 LinaresToine