The new DQM GUI file management

Open andrius-k opened this issue 4 years ago • 78 comments

Impact of the new feature
This request affects all systems responsible for uploading harvested DQM data to the DQM GUIs. This includes T0-processed DQM data and RelVal/reprocessing DQM data.

Is your feature request related to a problem? Please describe.
We're deploying a new, upgraded version of the DQM GUI tool. The procedure that notifies the DQM GUI about new DQM files is different in the new version. We would like this new procedure to be used alongside the old, visDQMUpload-based DQM file upload.

Describe the solution you'd like
Currently, the DQM data is uploaded to the DQM GUIs using a tool called visDQMUpload. The new procedure requires this process to be split into two stages:

  • Making a copy of the new DQM file to EOS.
  • Making an HTTP POST request to the new DQM GUI to notify it about the presence of a new file.

If required, we (DQM) could provide a facade with exactly the same interface as visDQMUpload. In that case, we would only ask you to call the facade script (visDQMUpload_new) alongside the old one.

Describe alternatives you've considered
No viable, future-proof alternatives were found.

Additional context
Below is a diagram that represents the current Offline DQM file movement:

[Diagram: current-dqm-diagram]

Below is a diagram that represents the desired Offline DQM file movement, after the changes mentioned in this request:

[Diagram: new-dqm-diagram2]

andrius-k, Feb 15 '21 16:02

For the record, here is the link to the visDQMUpload tool: https://github.com/cms-sw/cmssw/blob/ba6e8604a35283e39e89bc031766843d0afc3240/DQMServices/FileIO/scripts/visDQMUpload.py

jfernan2, Feb 15 '21 17:02

For the record (2): the POST should use the following API endpoint:

https://cmsweb-testbed.cern.ch/dqm/offline-test-new/api/v1/register

HTTP request body:

[{"dataset": "/a/b/c", "run": "123456", "lumi": "0", "file": "/eos/cms/store/group/comm_dqm/DQMGUI_data/location/file.root", "fileformat": 1}]

For more information about this API endpoint (and others), please refer to:

https://github.com/cms-DQM/dqmgui#new-file-registering-endpoint
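
As a rough illustration of the request described above, the sketch below builds the JSON body and a POST request object with Python's standard library. The function name `build_register_request` is hypothetical, the `Content-Type` header is an assumption, and authentication (proxy or certificate handling) is deliberately left out. Only the endpoint URL and the field names come from this thread, and the request is constructed here but not sent.

```python
import json
from urllib import request

REGISTER_URL = "https://cmsweb-testbed.cern.ch/dqm/offline-test-new/api/v1/register"


def build_register_request(dataset, run, lumi, eos_path, fileformat=1, url=REGISTER_URL):
    """Build (but do not send) the POST request that registers one file.

    Hypothetical helper: only the field names and the endpoint come from
    the thread; the header and the function signature are assumptions.
    """
    entries = [{
        "dataset": dataset,
        "run": str(run),      # sent as strings, as in the example body above
        "lumi": str(lumi),
        "file": eos_path,
        "fileformat": fileformat,
    }]
    body = json.dumps(entries).encode("utf-8")
    return request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},  # assumed header
        method="POST",
    )


req = build_register_request(
    "/a/b/c", 123456, 0,
    "/eos/cms/store/group/comm_dqm/DQMGUI_data/location/file.root",
)
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

Sending the request (for example with `urllib.request.urlopen` and an SSL context that carries the grid proxy) is left out on purpose, since the authentication details are not specified here.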

jfernan2, Feb 16 '21 09:02

Hi @andrius-k , I'm very sorry for missing this GH issue.

Has this new DQM Gui server been deployed already?

Your proposal looks feasible to me, and it will make the DQMHarvesting process easier and more robust in the long run. We are going to discuss this issue in the coming weeks and come back to you. Thanks

amaltaro, Jun 18 '21 15:06

Hi @amaltaro. Unfortunately Andrius has left CMS, so we are taking over. The new GUI, reading the new backend in EOS, has been working since January at https://cmsweb-testbed.cern.ch/dqm/offline-test-new/ and reads the following (temporary?) EOS folder: /eos/cms/store/group/comm_dqm/DQMGUI_data

Thanks

jfernan2, Jun 20 '21 08:06

Hi @jfernan2 and @andrius-k, I'm working on this and have a question. Basically, the new replacement for visDQMUpload now requires more input parameters, is that right? Before, it needed only the file location. Now, it needs:

  • datasetName
  • RunNumber
  • Lumi
  • File Location
  • fileformat

I think I know how to get datasetName and runNumber, but I'm not sure about Lumi (I can't easily find that in the Workflow Management object). Is fileformat always 1 or 2 in this case? I'm reading there are 3 options:

  • Legacy DQM TDirectory based ROOT files (1)
  • DQMIO TTree based ROOT files (2)
  • Protobuf based format used in Online live mode (3)

Is there an easy way to tell which one is the right one for a specific root file? Do you have an example of how this information is obtained/used? Also, from what I understand, we want to call both the new method and the old visDQMUpload method for now, correct?

khurtado, Feb 25 '22 22:02

Hi @khurtado. Thanks for looking into this.

For the moment you can take Lumi=0, since this reproduces the current per-run root files. In the future we might have to upload single root files per LS, in which case the lumi will be declared somewhere.

About the file format, you can assume 1, since those are the ones uploaded to the GUI.

Thank you

jfernan2, Feb 26 '22 20:02

All DQM root files produced by Harvest processing are type 1 (plain ROOT); type 2 files are DQMIO datasets in DAS, which are not uploaded to the GUI.

The name of the method could now be changed to visDQMRegister, since it no longer uploads files to any server but copies them to EOS and registers them in the DB instead. Thanks

https://github.com/cms-DQM/dqmgui

jfernan2, Feb 26 '22 20:02

@jfernan2 Thank you! That helps a lot. One more question from the diagrams. Right now we have:

  1. visDQMUpload: Which uploads ROOT files to vocms0738,39,31

and we want:

  2. Upload DQM ROOT files to EOS
  3. Make this new HTTP POST, which does not upload files but only registers new files

Do we want to do 1, 2 and 3 and eventually get rid of 1, but not right now? Is 3 dependent on 2? (E.g.: 1 worked, but upload of DQM root files to EOS failed for some reason, do we abort step 3?) Asking mainly because we are just trying to split this work in 2 pieces since there are 2 stages.

khurtado, Feb 28 '22 14:02

Since we want to keep the old (legacy) DQM GUI (visDQMUpload) working in parallel until we finish commissioning this new DQM GUI (visDQMRegister), I would vote for decoupling both processes as much as possible.

When we are sure that the new DQM GUI, and the FW/WMCore workflow chain that makes it work, is fully accepted by the Collaboration, we can start decommissioning the old visDQMUpload machinery.

On the visDQMRegister side:

  • If the copy of the file to eos does not succeed, step 3 should NOT be performed, since we don't want to register a file in the DB which cannot be accessed by the GUI
  • If step 3 fails for some reason, it should be retried. If you cannot register the file after several retries, we should log this somehow
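
The two rules above could be sketched roughly as follows. This is only an illustration: `copy_to_eos` and `register` are hypothetical callables standing in for the real copy and POST steps, and the retry count and pause are arbitrary defaults, not values agreed in this thread.

```python
import logging
import time

logger = logging.getLogger("visDQMRegister")


def copy_then_register(copy_to_eos, register, retries=3, pause=0.0):
    """Run the two stages in order and return True only if both succeed.

    `copy_to_eos` and `register` are hypothetical callables supplied by the
    caller; each returns True on success. Neither is a real WMCore or DQM API.
    """
    if not copy_to_eos():
        # Never register a file that the GUI would be unable to read.
        logger.error("Copy to EOS failed; skipping registration")
        return False

    for attempt in range(1, retries + 1):
        if register():
            return True
        logger.warning("Registration attempt %d of %d failed", attempt, retries)
        if attempt < retries:
            time.sleep(pause)

    # All retries exhausted: record it so the file can be followed up later.
    logger.error("Could not register the file after %d attempts", retries)
    return False


# Example with stand-in callables: registration succeeds on the second try.
attempts = []


def flaky_register():
    attempts.append(1)
    return len(attempts) >= 2


print(copy_then_register(lambda: True, flaky_register))  # True
```

Keeping the two stages behind plain callables also keeps this path decoupled from the legacy visDQMUpload call, which can succeed or fail independently.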

Thanks

jfernan2, Feb 28 '22 15:02

@jfernan2 : I'm testing some changes with the following workflow test: https://cmsweb-testbed.cern.ch/reqmgr2/fetch?rid=wmagent_DQMHarvesting_LumiMask_khurtado_dqmup_v2_220308_174119_8053

For EOS: It's basically using the WMAgent certificate and trying to write to:

/eos/cms/store/group/comm_dqm/DQMGUI_data/wmagent_DQMHarvesting_LumiMask_khurtado_dqmup_v2_220308_174119_8053/output/DQM_V0001_R000278175__NoBPTX__Run2016F-23Sep2016-v1__DQMIO_0002.root

But it cannot create the parent directory. Is the above the expected path to write to, though? Just double checking. I'm not sure if there is any VOMS DN mapping that needs to be done in order to write to /eos/cms/store/group/comm_dqm/DQMGUI_data. If so, please let me know. The cert used was:

 openssl x509 -in /data/certs/myproxy.pem -noout  -subject
subject= /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=amaltaro/CN=718748/CN=Alan Malta Rodrigues/CN=2073988181/CN=172652124/CN=1063249136/CN=379961315

Log:

2022-03-09 06:38:14,677:INFO:DQMUpload:Writing DQM root files to CERN EOS with retries: 3 and retry pause: 300
2022-03-09 06:38:14,677:INFO:StageOutMgr:==>Working on file: /wmagent_DQMHarvesting_LumiMask_khurtado_dqmup_v2_220308_174119_8053/output/DQM_V0001_R000278175__NoBPTX__Run2016F-23Sep2016-v1__DQMIO_0002.root
2022-03-09 06:38:14,677:INFO:StageOutMgr:===> Attempting 1 Fallback Stage Outs
2022-03-09 06:38:14,677:INFO:StageOutImpl:Creating output directory...
2022-03-09 06:38:14,681:INFO:StageOutImpl:Running the stage out...
2022-03-09 06:38:15,818:INFO:StageOutImpl:Command exited with status: 151
Output message: stdout: Local File Size is: 49421926
Remote File Size is:
ERROR: Size Mismatch between local and SE

stderr: Run: [ERROR] Server responded with an error: [3010] Unable to create parent directory /eos/cms/store/group/comm_dqm/DQMGUI_data/wmagent_DQMHarvesting_LumiMask_khurtado_dqmup_v2_220308_174119_8053/; Operation not permitted^@ (destination)

[ERROR] Server responded with an error: [3011] Unable to stat /eos/cms/store/group/comm_dqm/DQMGUI_data/wmagent_DQMHarvesting_LumiMask_khurtado_dqmup_v2_220308_174119_8053/output/DQM_V0001_R000278175__NoBPTX__Run2016F-23Sep2016-v1__DQMIO_0002.root; No such file or directory^@

[ERROR] Server responded with an error: [3011] Unable to remove /eos/cms/store/group/comm_dqm/DQMGUI_data/wmagent_DQMHarvesting_LumiMask_khurtado_dqmup_v2_220308_174119_8053/output/DQM_V0001_R000278175__NoBPTX__Run2016F-23Sep2016-v1__DQMIO_0002.root; No such file or directory^@

khurtado, Mar 09 '22 14:03

Thanks @khurtado

This is one of the key points of this request: that EOS space was granted to the DQM group in the past as a temporary space for the project [1], but it has a quota (at present) of 66 TB, of which 43.56 TB are in use. So this space may not be a definitive solution once you set this workflow running in view of Run3.

Having said that, for your testing purposes, I believe you should add yourself (or Alan Malta, if you are using his grid certificate) to the following e-group, which controls the write access [2], according to [3]. Or I can add you if you prefer; I am just not sure whether I should add you or Alan.

However, for the long term, WMCore or the computing team will need to provide another EOS space to host all the DQM GUI root files, or this same space with a larger quota.

Thank you very much

[1] https://twiki.cern.ch/twiki/bin/viewauth/CMS/T2CHCERNEosTeams
[2] https://e-groups.cern.ch/e-groups/Egroup.do?egroupName=cms-eos-PPD-DQM&tab=3
[3] eos root://eosproject.cern.ch attr ls /eos/cms/store/group/comm_dqm/
sys.accounting.vos.0="cms"
sys.acl="u:22014:rw,u:31275:rw,u:5410:rw,g:1399:!d,egroup:cms-eos-ppd-dqm:rw!d,egroup:cms-eos-ppd-dqm-cleaners:rw+d"
sys.forced.blockchecksum="crc32c"
sys.forced.blocksize="4k"
sys.forced.checksum="adler"
sys.forced.layout="replica"
sys.forced.nstripes="2"
sys.forced.space="default"
sys.recycle="/eos/cms/proc/recycle/"
user.acl=""

jfernan2, Mar 10 '22 09:03

@jfernan2 Thank you! It would be @amaltaro, to keep consistency with the certs used in the test agent. He will be requesting access to it.

khurtado, Mar 10 '22 20:03

Hi @jfernan2 ,

So, Alan is now part of the e-group, but I still can't copy to the area. Is there anything else missing? I tried this interactively:

[cmst1@vocms0192 ~]$ xrdcp test.txt root://eoscms.cern.ch//eos/cms/store/group/comm_dqm/DQMGUI_data/test.txt
[0B/0B][100%][==================================================][0B/s]
Run: [ERROR] Server responded with an error: [3010] Unable to open file /eos/cms/store/group/comm_dqm/DQMGUI_data/test.txt; Operation not permitted (destination)

[cmst1@vocms0192 ~]$ voms-proxy-info -all
subject   : /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=amaltaro/CN=718748/CN=Alan Malta Rodrigues/CN=2073988181/CN=172652124/CN=2101403252/CN=334491060
issuer    : /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=amaltaro/CN=718748/CN=Alan Malta Rodrigues/CN=2073988181/CN=172652124/CN=2101403252
identity  : /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=amaltaro/CN=718748/CN=Alan Malta Rodrigues
type      : RFC3820 compliant impersonation proxy
strength  : 2048
path      : /data/certs/myproxy.pem
timeleft  : 163:56:40
key usage : Digital Signature, Key Encipherment
=== VO cms extension information ===
VO        : cms
subject   : /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=amaltaro/CN=718748/CN=Alan Malta Rodrigues
issuer    : /DC=ch/DC=cern/OU=computers/CN=voms2.cern.ch
attribute : /cms/Role=production/Capability=NULL
attribute : /cms/Role=NULL/Capability=NULL
attribute : /cms/uscms/Role=NULL/Capability=NULL
timeleft  : 163:56:41
uri       : voms2.cern.ch:15002

khurtado, Mar 15 '22 15:03

Hi @khurtado. That is very strange, and honestly it escapes my knowledge of the system... Could you please try interactively:

xrdcp -v -d 9 test.txt root://eoscms.cern.ch//eos/cms/store/group/comm_dqm/DQMGUI_data/test.txt

I'd like to see the debug messages, to look for a possible mismatch between Alan's username and the one associated with his grid certificate. For us (DQM) the xrdcp command works fine, as does a direct cp to EOS, so it should also work for Alan's account (the one linked to his email registered in the e-group).

Could you try a direct copy from lxplus using Alan's account or your account after registering to the e-group?

Are you running xrdcp from lxplus or somewhere else? In my case, from lxplus and with my account linked to the grid certificate, xrdcp -d 3 gives:

[2022-03-15 17:37:22.053623 +0100][Debug ][XRootDTransport ] [eoscms.cern.ch:1094.0] Sending out kXR_login request, username: jfernan, cgi: ?xrd.cc=ch&xrd.tz=1&xrd.appname=xrdcp&xrd.info=&xrd.hostname=lxplus789.cern.ch&xrd.rn=v5.4.1, dual-stack: true, private IPv4: false, private IPv6: false
[2022-03-15 17:37:22.057027 +0100][Dump ][XRootD ] [eoscms.cern.ch:1094] Got a kXR_ok response to request kXR_stat (path: /eos/cms/store/group/comm_dqm/DQMGUI_data/text.txt

Thanks

jfernan2, Mar 15 '22 16:03

@jfernan2 This is from lxplus (I had to change 9 to 3, since xrdcp complains that's the max debug level): It first tries krb5, then moves to gsi:

env -i X509_USER_PROXY=$PWD/myproxy.pem xrdcp -v -d 3 test.txt root://eoscms.cern.ch//eos/cms/store/group/comm_dqm/DQMGUI_data/test.txt
[2022-03-15 18:01:53.440219 +0100][Debug  ][XRootDTransport   ] [eoscms.cern.ch:1094.0] Logged in, session: 8889fd03deae0200a7420000ca4c5304
[2022-03-15 18:01:53.440224 +0100][Debug  ][XRootDTransport   ] [eoscms.cern.ch:1094.0] Authentication is required: &P=krb5,xrootd/[email protected]&P=gsi,v:10400,c:ssl,ca:5168735f.0|4339b4bc.0&P=sss,0.13:/etc/eos.keytab&P=unix
[2022-03-15 18:01:53.440238 +0100][Debug  ][XRootDTransport   ] [eoscms.cern.ch:1094.0] Sending authentication data
[2022-03-15 18:01:53.442406 +0100][Debug  ][XRootDTransport   ] [eoscms.cern.ch:1094.0] Trying to authenticate using krb5
[2022-03-15 18:01:53.442878 +0100][Debug  ][XRootDTransport   ] [eoscms.cern.ch:1094.0] Cannot get credentials for protocol krb5: Seckrb5: No or invalid credentials; No credentials cache found (p=xrootd/[email protected]).
[2022-03-15 18:01:53.445015 +0100][Debug  ][XRootDTransport   ] [eoscms.cern.ch:1094.0] Trying to authenticate using gsi
[2022-03-15 18:01:53.869336 +0100][Dump   ][AsyncSock         ] [eoscms.cern.ch:1094.0] Wrote a message:  (0x840084d0), 136 bytes
[2022-03-15 18:01:53.928655 +0100][Dump   ][XRootDTransport   ] [msg: 0x84007c80] Expecting 4385 bytes of message body
[2022-03-15 18:01:53.928701 +0100][Dump   ][AsyncSock         ] [eoscms.cern.ch:1094.0] Received message header, size: 8
[2022-03-15 18:01:53.928715 +0100][Dump   ][AsyncSock         ] [eoscms.cern.ch:1094.0] Received a message of 4393 bytes
[2022-03-15 18:01:53.928725 +0100][Debug  ][XRootDTransport   ] [eoscms.cern.ch:1094.0] Sending more authentication data for gsi
[2022-03-15 18:01:53.933586 +0100][Dump   ][AsyncSock         ] [eoscms.cern.ch:1094.0] Wrote a message:  (0x840b6780), 15180 bytes
[2022-03-15 18:01:53.943879 +0100][Dump   ][XRootDTransport   ] [msg: 0x84002190] Expecting 0 bytes of message body
[2022-03-15 18:01:53.943929 +0100][Dump   ][AsyncSock         ] [eoscms.cern.ch:1094.0] Received message header, size: 8
[2022-03-15 18:01:53.943936 +0100][Dump   ][AsyncSock         ] [eoscms.cern.ch:1094.0] Received a message of 8 bytes
[2022-03-15 18:01:53.944078 +0100][Debug  ][XRootDTransport   ] [eoscms.cern.ch:1094.0] Authenticated with gsi.
[2022-03-15 18:01:53.944105 +0100][Debug  ][PostMaster        ] [eoscms.cern.ch:1094] Stream 0 connected.
[2022-03-15 18:01:53.944120 +0100][Debug  ][Utility           ] Monitor library name not set. No monitoring
[2022-03-15 18:01:53.944164 +0100][Dump   ][AsyncSock         ] [eoscms.cern.ch:1094.0] Wrote a message: kXR_stat (path: /eos/cms/store/group/comm_dqm/DQMGUI_data/test.txt, flags: none) (0x2225650), 74 bytes
[2022-03-15 18:01:53.944206 +0100][Dump   ][AsyncSock         ] [eoscms.cern.ch:1094.0] Successfully sent message: kXR_stat (path: /eos/cms/store/group/comm_dqm/DQMGUI_data/test.txt, flags: none) (0x2225650).
[2022-03-15 18:01:53.944222 +0100][Dump   ][XRootD            ] [eoscms.cern.ch:1094] Message kXR_stat (path: /eos/cms/store/group/comm_dqm/DQMGUI_data/test.txt, flags: none) has been successfully sent.
[2022-03-15 18:01:53.944228 +0100][Debug  ][ExDbgMsg          ] [eoscms.cern.ch:1094] Moving MsgHandler: 0x222dcb0 (message: kXR_stat (path: /eos/cms/store/group/comm_dqm/DQMGUI_data/test.txt, flags: none) ) from out-queu to in-queue.
[2022-03-15 18:01:53.944241 +0100][Dump   ][PostMaster        ] [eoscms.cern.ch:1094.0] All messages consumed, disable uplink

However, I get the permission denied:

[2022-03-15 17:59:35.423151 +0100][Dump   ][Utility           ] Path:      /eos/cms/store/group/comm_dqm/DQMGUI_data/test.txt
[2022-03-15 17:59:35.423191 +0100][Debug  ][File              ] [0x1737bc0@root://eoscms.cern.ch:1094//eos/cms/store/group/comm_dqm/DQMGUI_data/test.txt?oss.asize=10&xrdcl.requuid=da4053e1-383e-45e6-8f2c-19731aa88a1f] Sending an open command
[2022-03-15 17:59:35.423213 +0100][Dump   ][XRootD            ] [eoscms.cern.ch:1094] Sending message kXR_open (file: /eos/cms/store/group/comm_dqm/DQMGUI_data/test.txt?oss.asize=10, mode: 0644, flags: kXR_new kXR_open_updt kXR_async kXR_retstat )
[2022-03-15 17:59:35.423231 +0100][Debug  ][ExDbgMsg          ] [eoscms.cern.ch:1094] MsgHandler created: 0x1738c10 (message: kXR_open (file: /eos/cms/store/group/comm_dqm/DQMGUI_data/test.txt?oss.asize=10, mode: 0644, flags: kXR_new kXR_open_updt kXR_async kXR_retstat ) ).
[2022-03-15 17:59:35.423243 +0100][Dump   ][PostMaster        ] [eoscms.cern.ch:1094] Sending message kXR_open (file: /eos/cms/store/group/comm_dqm/DQMGUI_data/test.txt?oss.asize=10, mode: 0644, flags: kXR_new kXR_open_updt kXR_async kXR_retstat ) (0x1738570) through substream 0 expecting answer at 0
[2022-03-15 17:59:35.423283 +0100][Dump   ][AsyncSock         ] [eoscms.cern.ch:1094.0] Wrote a message: kXR_open (file: /eos/cms/store/group/comm_dqm/DQMGUI_data/test.txt?oss.asize=10, mode: 0644, flags: kXR_new kXR_open_updt kXR_async kXR_retstat ) (0x1738570), 87 bytes
[2022-03-15 17:59:35.423308 +0100][Dump   ][AsyncSock         ] [eoscms.cern.ch:1094.0] Successfully sent message: kXR_open (file: /eos/cms/store/group/comm_dqm/DQMGUI_data/test.txt?oss.asize=10, mode: 0644, flags: kXR_new kXR_open_updt kXR_async kXR_retstat ) (0x1738570).
[2022-03-15 17:59:35.423316 +0100][Dump   ][XRootD            ] [eoscms.cern.ch:1094] Message kXR_open (file: /eos/cms/store/group/comm_dqm/DQMGUI_data/test.txt?oss.asize=10, mode: 0644, flags: kXR_new kXR_open_updt kXR_async kXR_retstat ) has been successfully sent.
[2022-03-15 17:59:35.423322 +0100][Debug  ][ExDbgMsg          ] [eoscms.cern.ch:1094] Moving MsgHandler: 0x1738c10 (message: kXR_open (file: /eos/cms/store/group/comm_dqm/DQMGUI_data/test.txt?oss.asize=10, mode: 0644, flags: kXR_new kXR_open_updt kXR_async kXR_retstat ) ) from out-queu to in-queue.
[2022-03-15 17:59:35.423328 +0100][Dump   ][PostMaster        ] [eoscms.cern.ch:1094.0] All messages consumed, disable uplink
[2022-03-15 17:59:35.424149 +0100][Dump   ][XRootDTransport   ] [msg: 0xdc000ac8] Expecting 100 bytes of message body
[2022-03-15 17:59:35.424171 +0100][Dump   ][AsyncSock         ] [eoscms.cern.ch:1094.0] Received message header for 0xdc000ac8 size: 8
[2022-03-15 17:59:35.424185 +0100][Dump   ][AsyncSock         ] [eoscms.cern.ch:1094.0] Received message 0xdc000ac8 of 108 bytes
[2022-03-15 17:59:35.424191 +0100][Dump   ][PostMaster        ] [eoscms.cern.ch:1094] Handling received message: 0xdc000ac8.
[2022-03-15 17:59:35.424254 +0100][Dump   ][XRootD            ] [eoscms.cern.ch:1094] Got a kXR_error response to request kXR_open (file: /eos/cms/store/group/comm_dqm/DQMGUI_data/test.txt?oss.asize=10, mode: 0644, flags: kXR_new kXR_open_updt kXR_async kXR_retstat ) [3010] Unable to open file /eos/cms/store/group/comm_dqm/DQMGUI_data/test.txt; Operation not permitted
[2022-03-15 17:59:35.424284 +0100][Debug  ][XRootD            ] [eoscms.cern.ch:1094] Handling error while processing kXR_open (file: /eos/cms/store/group/comm_dqm/DQMGUI_data/test.txt?oss.asize=10, mode: 0644, flags: kXR_new kXR_open_updt kXR_async kXR_retstat ): [ERROR] Error response: permission denied.
[2022-03-15 17:59:35.424293 +0100][Debug  ][ExDbgMsg          ] [eoscms.cern.ch:1094] Calling MsgHandler: 0x1738c10 (message: kXR_open (file: /eos/cms/store/group/comm_dqm/DQMGUI_data/test.txt?oss.asize=10, mode: 0644, flags: kXR_new kXR_open_updt kXR_async kXR_retstat ) ) with status: [ERROR] Error response: permission denied.

khurtado, Mar 15 '22 17:03

Hi @khurtado. In my case I get authenticated with krb5 successfully. In your log I see that krb5 fails but gsi then succeeds:

[2022-03-15 18:01:53.944078 +0100][Debug ][XRootDTransport ] [eoscms.cern.ch:1094.0] Authenticated with gsi.

I have tried to do the same with my secondary lxplus account, which is not linked to my grid certificate, to decouple: I authenticate successfully with krb5 but then copy fails with the same message as you.

I suspect the issue is linked to the fact that you are using Alan's grid certificate from your lxplus account. Can you do a klist command?

From my (primary) successful account I get:

Valid starting       Expires              Service principal
03/15/2022 18:33:49  03/16/2022 19:33:49  krbtgt/[email protected]
        renew until 03/20/2022 18:33:49
03/15/2022 18:33:49  03/16/2022 19:33:49  afs/[email protected]
        renew until 03/20/2022 18:33:49
03/15/2022 18:33:51  03/16/2022 19:33:49  xrootd/[email protected]
        renew until 03/20/2022 18:33:49
03/15/2022 18:33:56  03/16/2022 19:33:49  xrootd/[email protected]
        renew until 03/20/2022 18:33:49
03/15/2022 18:33:56  03/16/2022 19:33:49  xrootd/[email protected]
        renew until 03/20/2022 18:33:49

From my (secondary) non-successful account I get:

Valid starting       Expires              Service principal
03/15/2022 18:28:48  03/16/2022 19:28:48  krbtgt/[email protected]
        renew until 03/20/2022 18:28:48
03/15/2022 18:28:48  03/16/2022 19:28:48  afs/[email protected]
        renew until 03/20/2022 18:28:48
03/15/2022 18:30:15  03/16/2022 19:28:48  xrootd/[email protected]
        renew until 03/20/2022 18:28:48

I understand gsi authentication is ignored and we need krb5 against eoshome and eosproject too

jfernan2, Mar 15 '22 17:03

@jfernan2 Ah, I see. So the authentication is working only with kerberos. Which means it will probably work from Alan's account itself with his kerberos. I did kdestroy so that it wouldn't try to use my kerberos credentials and only use the GSI/proxy credential. eoshome and eosproject do need krb5, yes, but I thought eoscms could work with gsi.

I suspect the agents, which run jobs using the cmst1 account won't have Alan's kerberos credentials or any kerberos at all from the condor jobs, only the proxy, so I was expecting xrdcp to do the authentication using the grid certificate alone.

@amaltaro Do you know how we write to e.g.: root://eoscms.cern.ch//eos/cms/store/logs/prod/recent/TESTBED ? Is it done only with GSI authentication?

khurtado, Mar 15 '22 17:03

That's even stranger: I have now done a kdestroy from my secondary account and was able to run xrdcp after gsi authentication ONLY :-S So gsi is indeed also working. Could it be that Alan's grid certificate is somehow not mapped to his CERN account, so that the e-group does not take it into account?

jfernan2, Mar 15 '22 18:03

The certificate is giving read access only, for some reason. That means the mapping to the user is working, but the e-group membership is not being recognized for granting write permissions. It's hard to tell what is going on without knowing the EOS configuration. Is this something the CERN Service Desk is supposed to help with?

@amaltaro: Is it okay if I create a directory, e.g.:

/eos/cms/store/logs/prod/recent/TESTBED/DQMGUI

for the tests, since we do have write access to that location? If so, I think we can just ignore the issues with the other path, since it's for temporary tests only.

khurtado, Mar 16 '22 13:03

Hi @khurtado. I am not sure about your statement, since I was able to write using my secondary account through gsi authentication only. But it is true that CERN IT should be able to solve this. Thanks

jfernan2, Mar 16 '22 14:03

@jfernan2 Thank you! I will just create a new directory for the tests, following Alan's suggestion: /eos/cms/store/unmerged/DQMGUI

khurtado, Mar 16 '22 15:03

@jfernan2 I'm getting closer here. I have one question regarding the API. In this format:

[{"dataset": "/a/b/c", "run": "123456", "lumi": "0", "file": "/eos/cms/store/group/comm_dqm/DQMGUI_data/location/file.root", "fileformat": 1}]

How do I pass a range of runs or lumis in the request? This is what I see from WMCore:

'runAndLumis': {278175: [[70, 90]]}}

So, if we have a DQMHarvest workflow set to multiRun harvesting, it can have multiple runs and lumis associated with each run.

For example: This template: https://cmsweb-testbed.cern.ch/reqmgr2/fetch?rid=amaltaro_DQMHarvesting_MultiRun_HG2202_Val_220203_213001_614 Had a job with the following run and lumis:

'runAndLumis': {277981: [[1, 82], [84, 158]], 278017: [[1, 589]], 277932: [[1, 12], [14, 15], [17, 127]], 278193: [[1, 239]]}}

EDIT:

Ah, wait, would it be many dictionaries, 1 for each run and single lumi and the same filename per dictionary (and fileformat and datasetname)? E.g.:

[{run:"X", lumi:"1", file:"filename1"},{run:"X", lumi:"2", file:"filename1"},{run:"Y", lumi:"1",file:"filename1"}]

khurtado, Mar 18 '22 18:03

@khurtado The current DQM GUI is only able to show DQM root files per RUN. In the future, it is expected to be able to handle per-LS files too.

This implies that a single root file uploaded (now copied to EOS and registered in the DB) can only contain a single run or a single LS (of a run). Hence, per-RUN root files should be registered with Lumi=0, since this reproduces the current per-run root files. Once we are able to show per-LS data in the GUI, runs will be registered with a single LS, not several, since plots per LS must be displayed.

Please note that the same RUN (and LS) may be associated with several datasets, but with a single file each, as in: https://github.com/cms-DQM/dqmgui#api-documentation

Multirun harvesting root files are a special case of DQM root files. In this case, all stats are harvested in a single root file and the GUI is not able to display the runs it contains (they are embedded in the dataset name or in the config that produced it), so for the GUI the runNumber is always forced/set to 999999 for data (and to 1 for MC, as for any MC). See: https://github.com/dmwm/WMCore/pull/9746 and https://github.com/dmwm/WMCore/issues/9690

Bottom line: in principle you should not copy and register the same file for more than one RUN or LS.
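
To make the rule above concrete, here is a small sketch of how the run and lumi values for a single registration entry might be chosen. The function name, the `multi_run` and `is_mc` flags, and the error handling are hypothetical; only the values 999999 (data), 1 (MC) and Lumi=0 come from this discussion.

```python
MULTIRUN_DATA_RUN = 999999  # run number forced for multi-run harvesting of data
MULTIRUN_MC_RUN = 1         # run number used for MC


def registration_run_and_lumi(run_and_lumis, multi_run=False, is_mc=False):
    """Return the single (run, lumi) pair to register for one harvested file.

    `run_and_lumis` mimics the WMCore structure quoted earlier, e.g.
    {278175: [[70, 90]]}. The lumi ranges are intentionally ignored:
    per-run files are registered with lumi 0.
    """
    if multi_run:
        return (MULTIRUN_MC_RUN if is_mc else MULTIRUN_DATA_RUN), 0
    if len(run_and_lumis) != 1:
        raise ValueError("a per-run file must map to exactly one run")
    (run,) = run_and_lumis  # unpack the only key of the dictionary
    return int(run), 0


print(registration_run_and_lumi({278175: [[70, 90]]}))
# (278175, 0)
multi = {277981: [[1, 82], [84, 158]], 278017: [[1, 589]]}
print(registration_run_and_lumi(multi, multi_run=True))
# (999999, 0)
```

Each file therefore produces exactly one entry in the register body, regardless of how many lumi ranges the job processed.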

@ahmad3213 @emanueleusai @rvenditti please correct me at any point, since you are the official DQM conveners; I have not been a DQM convener since 31st Dec 2021. Thanks

jfernan2, Mar 19 '22 19:03

@jfernan2 Thank you!

Here is one workflow test with the current changes: https://cmsweb-testbed.cern.ch/reqmgr2/fetch?rid=wmagent_DQMHarvesting_LumiMask_khurtado_dqm_v10_220320_210531_8322

ROOT files were uploaded here (I'm using the unmerged area as a temporary path):

/eos/cms/store/unmerged/DQMGUI/wmagent_DQMHarvesting_LumiMask_khurtado_dqm_v10_220320_210531_8322

And the POST call to the register site looks like this:

2022-03-21 04:03:18,643:INFO:DQMUpload:HTTP Upload is about to start:
 => URL: https://cmsweb-testbed.cern.ch/dqm/offline-test-new/api/v1/register
 => Filename: /eos/cms/store/unmerged/DQMGUI/wmagent_DQMHarvesting_LumiMask_khurtado_dqm_v10_220320_210531_8322/output/DQM_V0001_R000277991__NoBPTX__Run2016F-23Sep2016-v1__DQMIO_0001.root

2022-03-21 04:03:18,643:INFO:DQMUpload:Using proxy file: /srv/myproxy.pem
2022-03-21 04:03:18,643:INFO:DQMUpload:Using CA certificate path: None
2022-03-21 04:03:18,643:INFO:DQMUpload:HTTP Register POST arguments: [{'file': '/eos/cms/store/unmerged/DQMGUI/wmagent_DQMHarvesting_LumiMask_khurtado_dqm_v10_220320_210531_8322/output/DQM_V0001_R000277991__NoBPTX__Run2016F-23Sep2016-v1__DQMIO_0001.root', 'dataset': '/NoBPTX/Run2016F-23Sep2016-v1/DQMIO', 'run': 277991, 'lumi': 0, 'fileformat': 1}]

2022-03-21 04:03:18,657:INFO:HTTPSAuthHandler:Found 149 default trusted CA certificates.
2022-03-21 04:03:18,657:INFO:HTTPSAuthHandler:SSL context manager created with the following settings:
2022-03-21 04:03:18,657:INFO:HTTPSAuthHandler:  check_hostname : True
2022-03-21 04:03:18,657:INFO:HTTPSAuthHandler:  options : Options.OP_ALL|OP_NO_SSLv3|OP_NO_SSLv2|OP_CIPHER_SERVER_PREFERENCE|OP_SINGLE_DH_USE|OP_SINGLE_ECDH_USE|OP_NO_COMPRESSION
2022-03-21 04:03:18,657:INFO:HTTPSAuthHandler:  protocol : _SSLMethod.PROTOCOL_TLS
2022-03-21 04:03:18,657:INFO:HTTPSAuthHandler:  verify_flags : VerifyFlags.VERIFY_X509_TRUSTED_FIRST
2022-03-21 04:03:18,657:INFO:HTTPSAuthHandler:  verify_mode : VerifyMode.CERT_REQUIRED
2022-03-21 04:03:18,886:INFO:DQMUpload:HTTP POST to register url finished succesfully with response:
  Status code: 201

I do see some DQM histograms associated with that Run and Dataset here: https://cmsweb-testbed.cern.ch/dqm/offline-test-new/?folder_path=DQM%2FTimerService&dataset_name=%2FNoBPTX%2FRun2016F-23Sep2016-v1%2FDQMIO&run_number=277991&workspaces=Everything&overlay=overlay&normalize=true&lumi=0

but it's unclear to me how to tell whether they came from what I registered in this test or were there before. Anyway, does this look right to you?

khurtado, Mar 21 '22 10:03

@jfernan2 And this is for multiRun: https://cmsweb-testbed.cern.ch/reqmgr2/fetch?rid=wmagent_DQMHarvesting_MultiRun_khurtado_dqm_v11_220321_111702_196

The run number will be 999999 for current multiRun workflows but should start showing 999999 or 1 accordingly for new workflows once #9746 is applied.

2022-03-21 11:18:57,248:INFO:DQMUpload:HTTP Upload is about to start:
 => URL: https://cmsweb-testbed.cern.ch/dqm/offline-test-new/api/v1/register
 => Filename: /eos/cms/store/unmerged/DQMGUI/wmagent_DQMHarvesting_MultiRun_khurtado_dqm_v11_220321_111702_196/output/DQM_V0001_R000999999__NoBPTX__Run2016F-23Sep2016-v1-277932-278193__DQMIO_0001.root

2022-03-21 11:18:57,248:INFO:DQMUpload:Using proxy file: /srv/myproxy.pem
2022-03-21 11:18:57,248:INFO:DQMUpload:Using CA certificate path: None
2022-03-21 11:18:57,248:INFO:DQMUpload:HTTP Register POST arguments: [{'file': '/eos/cms/store/unmerged/DQMGUI/wmagent_DQMHarvesting_MultiRun_khurtado_dqm_v11_220321_111702_196/output/DQM_V0001_R000999999__NoBPTX__Run2016F-23Sep2016-v1-277932-278193__DQMIO_0001.root', 'dataset': '/NoBPTX/Run2016F-23Sep2016-v1/DQMIO', 'run': 999999, 'lumi': 0, 'fileformat': 1}]

2022-03-21 11:18:57,273:INFO:HTTPSAuthHandler:Found 149 default trusted CA certificates.
2022-03-21 11:18:57,273:INFO:HTTPSAuthHandler:SSL context manager created with the following settings:
2022-03-21 11:18:57,273:INFO:HTTPSAuthHandler:  check_hostname : True
2022-03-21 11:18:57,274:INFO:HTTPSAuthHandler:  options : Options.OP_ALL|OP_NO_SSLv3|OP_NO_SSLv2|OP_CIPHER_SERVER_PREFERENCE|OP_SINGLE_DH_USE|OP_SINGLE_ECDH_USE|OP_NO_COMPRESSION
2022-03-21 11:18:57,274:INFO:HTTPSAuthHandler:  protocol : _SSLMethod.PROTOCOL_TLS
2022-03-21 11:18:57,274:INFO:HTTPSAuthHandler:  verify_flags : VerifyFlags.VERIFY_X509_TRUSTED_FIRST
2022-03-21 11:18:57,274:INFO:HTTPSAuthHandler:  verify_mode : VerifyMode.CERT_REQUIRED
2022-03-21 11:18:57,392:INFO:DQMUpload:HTTP POST to register url finished succesfully with response:
  Status code: 201

khurtado, Mar 21 '22 12:03

Sorry @khurtado but I am not following you, at least not completely:

> Here is one workflow test with the current changes: https://cmsweb-testbed.cern.ch/reqmgr2/fetch?rid=wmagent_DQMHarvesting_LumiMask_khurtado_dqm_v10_220320_210531_8322

What do you mean by current changes? Changes in the script you are creating to accomplish this task?

https://cmsweb-testbed.cern.ch/reqmgr2/fetch?rid=wmagent_DQMHarvesting_LumiMask_khurtado_dqm_v10_220320_210531_8322

This workflow is weird: CMSSW_8_0_20? And it has several runs in it; I understand there is one DQM root file per run.

And the POST call to the register site looks like this:

The post looks OK, in principle

but it's unclear to me how to tell if they came from what I registered in this test or if they were there before. Anyway, does this look good to you?

They look OK, but bear in mind that since you did several tests, the last upload/register is the one that will be displayed.

@jfernan2 And this is for multiRun: https://cmsweb-testbed.cern.ch/reqmgr2/fetch?rid=wmagent_DQMHarvesting_MultiRun_khurtado_dqm_v11_220321_111702_196 The run number will be 999999 for current multiRun workflows but should start showing 999999 or 1 accordingly for new workflows once #9746 is applied.

I am confused here: PR #9746 is almost two years old, not yet merged (plans?), and it keeps saying that run number = 1 will only be assigned to MC, never to data. On the other hand, for data, with run = 999999, the info on the harvested runs is lost. I would have expected to keep it in the dataset name somehow, otherwise another multirun harvesting (MRH) for the same dataset will overwrite this one.
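
To make the overwrite concern concrete, here is a toy sketch. It assumes (an assumption for illustration, not something confirmed in this thread) that registered files are indexed by (dataset, run) and that a later registration replaces an earlier one. `index`, `register` and the two file names are purely hypothetical.

```python
# Toy model of an index keyed by (dataset, run); illustrative only.
index = {}


def register(dataset, run, file_path):
    """Record a file; a later call with the same key replaces the earlier entry."""
    index[(dataset, run)] = file_path


dataset = "/NoBPTX/Run2016F-23Sep2016-v1/DQMIO"

# Two multi-run harvestings over different (hypothetical) run ranges,
# both registered under the placeholder run number 999999.
register(dataset, 999999, "harvest_runs_277932-278193.root")
register(dataset, 999999, "harvest_runs_278200-278300.root")

# Only the second file remains under that key.
print(index[(dataset, 999999)])
print(len(index))
```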

For me, MRH is a nasty task in the sense that it may be subdetector-dependent, since the list of runs for a given dataset may vary from one DPG to another. For that reason, every DPG has been encouraged in the past to make its own MRH. On the other hand, the current DQM GUI will have problems displaying it, since it was designed for single-run uploads.

If you want to register brand new data that is not in the GUI yet, so that you can be sure the plots you see come from your last registration, perhaps you could use a current dataset: /Cosmics/Commissioning2022-PromptReco-v1/DQMIO

Thanks a lot

jfernan2 avatar Mar 21 '22 16:03 jfernan2

@jfernan2

Sorry @khurtado but I am not following you, at least not completely:

Here is one workflow test with the current changes: https://cmsweb-testbed.cern.ch/reqmgr2/fetch?rid=wmagent_DQMHarvesting_LumiMask_khurtado_dqm_v10_220320_210531_8322

What do you mean by current changes? Changes in the script you are creating to accomplish this task?

Yes, in this PR: https://github.com/dmwm/WMCore/pull/11015

https://cmsweb-testbed.cern.ch/reqmgr2/fetch?rid=wmagent_DQMHarvesting_LumiMask_khurtado_dqm_v10_220320_210531_8322

This workflow is weird: CMSSW_8_0_20? And it has several runs in it; I understand there is one DQM root file per run.

Yes, it has 1 DQM root file per run. This workflow is one of the WMCore ReqMgr templates we have for testing: https://github.com/dmwm/WMCore/blob/master/test/data/ReqMgr/requests/Integration/DQMHarvesting_MultiRun.json

And the POST call to the register site looks like this:

The post looks OK, in principle

but it's unclear to me how to tell if they came from what I registered in this test or if they were there before. Anyway, does this look good to you?

They look OK, but bear in mind that since you did several tests, the last upload/register is the one that will be displayed.

That sounds good, thanks!

@jfernan2 And this is for multiRun: https://cmsweb-testbed.cern.ch/reqmgr2/fetch?rid=wmagent_DQMHarvesting_MultiRun_khurtado_dqm_v11_220321_111702_196 The run number will be 999999 for current multiRun workflows but should start showing 999999 or 1 accordingly for new workflows once #9746 is applied.

I am confused here: PR #9746 is almost two years old, not yet merged (plans?), and it keeps saying that run number = 1 will only be assigned to MC, never to data. On the other hand, for data, with run = 999999, the info on the harvested runs is lost. I would have expected to keep it in the dataset name somehow, otherwise another multirun harvesting (MRH) for the same dataset will overwrite this one.

For me, MRH is a nasty task in the sense that it may be subdetector-dependent, since the list of runs for a given dataset may vary from one DPG to another. For that reason, every DPG has been encouraged in the past to make its own MRH. On the other hand, the current DQM GUI will have problems displaying it, since it was designed for single-run uploads.

If you want to register brand new data that is not in the GUI yet, so that you can be sure the plots you see come from your last registration, perhaps you could use a current dataset: /Cosmics/Commissioning2022-PromptReco-v1/DQMIO

Thanks a lot

Ah, okay. So right now this uses the 999999 placeholder, which apparently would be useless. Alternatively, I can just NOT call the new DQM GUI register or copy the file to EOS at all if the harvesting job is multiRun, since it is not supported anyway (so we would just keep the old visDQMGUI call). How does that sound?
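
In code, that option would look roughly like this. It is only a sketch: `plan_upload_steps` and the step names are hypothetical, not the actual WMCore functions, and the ordering of the steps is illustrative.

```python
def plan_upload_steps(is_multi_run):
    """Return the upload steps for a harvesting job (hypothetical step names)."""
    steps = ["legacy_upload"]  # the old upload call is always kept
    if not is_multi_run:
        # New DQM GUI path: copy the file to EOS, then register it via HTTP POST.
        steps += ["copy_to_eos", "register_new_gui"]
    return steps


print(plan_upload_steps(is_multi_run=True))
print(plan_upload_steps(is_multi_run=False))
```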

khurtado avatar Mar 21 '22 16:03 khurtado

I can alternatively just NOT call the new DQM GUI register or copy the file to EOS at all if the harvesting job is multiRun, since it is not supported anyway (so we would just keep the old visDQMGUI call). How does that sound?

How is current visDQMGUI treating MRH files?

jfernan2 avatar Mar 21 '22 17:03 jfernan2

I can alternatively just NOT call the new DQM GUI register or copy the file to EOS at all if the harvesting job is multiRun, since it is not supported anyway (so we would just keep the old visDQMGUI call). How does that sound?

How is current visDQMGUI treating MRH files?

From the WMCore side, the HTTP POST for visDQMGUI only asks for the full file path on the worker node. There is no info on run numbers or lumis, so there is no distinction from ByRun mode files in that sense.

khurtado avatar Mar 21 '22 18:03 khurtado

Then it bases its DB registration on the run number from the DQM root file name, just like the new GUI. No information about the harvested run numbers is kept in the dataset or file name.
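
As an illustration of reading the run number from the file name, here is a minimal sketch. The pattern `DQM_V<version>_R<run>__...` is inferred from the file name in the log earlier in this thread, not taken from the GUI code, and `run_from_dqm_filename` is a hypothetical helper.

```python
import os
import re

# Assumed naming pattern: DQM_V<version>_R<run>__<rest>.root
DQM_NAME_RE = re.compile(r"^DQM_V(\d+)_R(\d+)__")


def run_from_dqm_filename(path):
    """Extract the run number encoded after '_R' in a DQM root file name."""
    match = DQM_NAME_RE.match(os.path.basename(path))
    if match is None:
        raise ValueError("not a DQM root file name: %s" % path)
    return int(match.group(2))


name = "DQM_V0001_R000999999__NoBPTX__Run2016F-23Sep2016-v1-277932-278193__DQMIO_0001.root"
print(run_from_dqm_filename(name))  # 999999
```

For a multi-run file this yields only the placeholder 999999; the run range that appears later in the name is not used by this pattern.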

jfernan2 avatar Mar 22 '22 08:03 jfernan2