WMCore Truncate output module name before sending it over to DBS

Tier0 is failing to inject one specific kind of block into DBS and the reason for that comes from the long (>45 chars) string used for the output_module_label, which in the end responds with a 500 Internal Server Error.

@ericvaandering this patch will truncate the output module as we discussed in the office yesterday. @hufnagel may have a comment on it as well.

Jun 23 '16 13:06 amaltaro

Beginning pylint and unit tests. See https://cmssdt.cern.ch/jenkins/job/DMWM-WMCore-PR-test/2485/ for details

Jun 23 '16 13:06 cmsdmwmbot

@johnhcasallasl FYI

Jun 23 '16 14:06 amaltaro

Summary of pylint changes for pull request 6950:

Abbreviated pylint report for src/python/WMComponent/DBS3Buffer/DBSBufferBlock.py follows:

Line 294 in DBSBufferBlock.setDataset W0612 Unused variable 'junk'
plus 3 total warnings
plus 61 comments on code style

Jun 23 '16 14:06 cmsdmwmbot

Unit test changes for pull request 6950:

WMCore_t.WorkQueue_t.WorkQueue_t.WorkQueueTest:testResubmissionWorkflowSiteWhitelistLocations changed from failure to success

Jun 23 '16 14:06 cmsdmwmbot

This is dangerous. We need some check on the DBS side to see if there is any unique constraint or index on that field. If there is, then just blindly truncating can get you into trouble because two outputs that are nominally different won't be considered as such by DBS.

I would much prefer if this patch had something like:

if conf.get('output_module_label', '') == "PromptCalibProdSiStripGainsAfterAbortGap": conf['output_module_label'] = "PromptCalibProdSiStripGainsAAG

This sets exactly the mapping suggested by Alca and it makes sure it still breaks for any other too long parameter (as it should, we don't want a general and automatic remapping here).

Jun 24 '16 13:06 hufnagel

I'm not sure what kind of uniqueness constraint you think there could be? This field is a property of the dataset in DBS, but of course many datasets can have the same value. Do you mean that if it we try to "create" the same dataset a second time with a different value, we might have issues?

I'm just not sure anyone ever uses this field. As far as I know, we could fill it with "GIBBERISH" for every dataset as we do for the PSET hash and we may never notice.

Jun 24 '16 13:06 ericvaandering

Yes, I worry about the case where two datasets have different output module labels and we truncate to the same value. Maybe a bit paranoid, but this could be a real mess if it happens.

Another consideration: Since we currently want to keep the length limit and you are adding validation checks in another patch, I would keep this workaround patch as restrictive as possible and only fix the one use case we have for it now. Don't add general workaround if we don't really want a general workaround.

Jun 24 '16 14:06 hufnagel

This is not a general way to handle output module labels. We are going to truncate only the existing longer than 45 characters' output module labels as a temperate solution. All new output module labels should be limited to 45 character long when it creates.

Jun 24 '16 14:06 yuyiguo

Answering Dirk's question about output_module_label in DBS: It is not unique by itself. The schema requires a maximum 45 character long and not NULL output_module_label. APP_EXEC_ID, RELEASE_VERSION_ID, PARAMETER_SET_HASH_ID, OUTPUT_MODULE_LABEL and GLOBAL_TAG all together are required to be unique,

Can someone check how many output_module_labels longer than 45 characters out there? What are the other fields associated with them?

Jun 24 '16 15:06 yuyiguo

Dirk, the suggestion proposed in the HN is not valid because that's not exactly the output module name bugging DBS. The correct one is 'ALCARECOStreamPromptCalibProdSiStripGainsAfterAbortGap', so if you want to create a rule only for this one, that's it fine with me. Though I'd prefer to have something more generic in case we have other long names out there, otherwise we need to keep updating the source code for these exceptions.

About your concern, what would be the problem if different datasets with different output module names have the same truncated output name? Nobody is using that information anyways, AFAICT...

About the column length on the DBS side, is there any good reason that we cannot stretch it a bit (besides a possible very small performance penalty)?

Anyways, whatever path you decide to go is fine with me, I just think we should get it in place asap to avoid all these unneeded calls to dbs and pollution of the logs.

Jun 27 '16 11:06 amaltaro

Well, to first order this needs to be tested and put in (or Dirk's alternate) into the Tier0. After that we can put something into WMCore itself, but I don't see the point in putting a "one time only" fix into the WMCore code base, while the general fix that @amaltaro wrote could go in.

Jun 27 '16 12:06 ericvaandering

@hufnagel @johnhcasallasl can you validate it in a T0 replay?

Jun 27 '16 12:06 amaltaro

Problem is, we only know the remapping for the specific use case we know about.

ALCARECOStreamPromptCalibProdSiStripGainsAfterAbortGap => ALCARECOStreamPromptCalibProdSiStripGainsAAG

As far as I am concerned there is no agreement how to shorten any other potential long output module and we don't want any other long output modules anyways (and your other patch will make sure we will never get them).

Jun 27 '16 12:06 hufnagel

Btw, we also need to put this exception into Alans more general validation patch. Either that or tell Alca that they need to change the name of the producer in CMSSW. Otherwise we can't run the Tier0 anymore after Alans more general patch is merged.

Jun 27 '16 12:06 hufnagel

Hello:

I was wondering how to proceed here, it is becoming urgent. IMO, we could solve it for the particular problem we have now and still continue the discussion about a general solution. If you agree, we can setup a replay to test the change as proposed by @hufnagel.

Adding @ebohorqu @blallen @drkovalskyi

Jun 30 '16 11:06 johnhcasallasl

Please run a small functional Tier0 replay immediately with this patch

https://github.com/hufnagel/WMCore/commit/cca04d11488e7fba3bf719f04f5ba8bb4b085bdc

The replay config should contain the PromptCalibProdSiStripGainsAfterAbortGap producer and you only need to inject Express stream data. Replay can be considered successful if the DBS registration of the data from that producer works.

If the replay is successful you can also deploy the patch in Tier0 production and restart the DBS3Buffer component.

After this is done I can make a PR from my patch or Alan can update this one.

PS: I am on vacation at a music festival in Belgium, only have decent internet in the morning or very late at night.

Jul 01 '16 00:07 hufnagel

The replay was successful and the patch applied in Tier0 production.

Jul 05 '16 09:07 johnhcasallasl

@amaltaro , you want to adjust this pull request or close it and I create a new pull request based on my patch ?

Jul 05 '16 14:07 hufnagel

Dirk, my understand is that we're not going to make a one specific fix for WMCore. Instead, we will add validation code while the request is created and fail it in case it does not fit into the DBS column. So you can leave it to me.

Jul 05 '16 18:07 amaltaro

Yes, but you NEED to add this exception to the validation code since we already have existing data that violates the convention. You can't just add validation checks that will fail existing and already used workflows.

Jul 05 '16 23:07 hufnagel

This basically means we are stuck supporting this exception, both with my workaround and in the validation check.

Jul 05 '16 23:07 hufnagel

I thought we were going to have another cmssw release where it would be sorted out, no? What's in the system is already there and won't have issues - provided you keep this patch in the T0 until all the workflows for that release are completely gone.

I dislike very much all these exceptions that we have to keep in the code, at some point we a) don't need them anymore, never b) but we don't know if they are still needed c) the code remains in WMCore forever and d) it complicates any code maintenance.

If the release will be kept longer in T0 with this long output name, then the only thing left is to hack once again the code though. Can you check that out and then we resume the discussion here?

Jul 06 '16 07:07 amaltaro

So far no one pushed for the workflow name (and therefore output module label) to be changed. The only reply on this I found mentioned changing it for next year (which is a bit late).

Effectively, once you merge the validation patch, we won't be able to upgrade the Tier0 to this version of the WMCore code. Not until the PCL workflow has been changed to pass the validation.

I am not planning to update the Tier0 release anytime soon, so this is not a concern in the short term. OTOH, I don't want to rule out that we will want or have to change the Tier0 release before the end of the year.

I asked in HN for the workflow to be changed asap, lets see what kind of feedback I get.

Jul 06 '16 14:07 hufnagel

If you need to keep a patch that does the mapping of the output label for the Tier0 for the rest of the year, you can also keep a patch that reverts the check that Alan is going to insert.

Jul 06 '16 17:07 ericvaandering

In principle correct, but I'd rather not have to write a patch just to partially revert the patch that Alan is still writing/testing :-)

Jul 06 '16 17:07 hufnagel

Well, you don't have to "write" it, you get git to do it for you with git revert.

Eric

Jul 06 '16 17:07 ericvaandering

I actually want the validation so we don't get more of these configs with too long output module labels. Which means I would actually have to patch on top of Alans patch, simple revert is not what we want.

Jul 06 '16 17:07 hufnagel

Ok, I think we need to first fix the workflow configuration on T0 and then we can create the validation patch, just to be on the safe side. Dirk, don't forget updating this issue once you get updates from that HN.

Jul 07 '16 11:07 amaltaro

Oh, create the validation patch please. We just can't use it on the Tier0 side until we fix the workflow. Doesn't mean we shouldn't get the validation in asap though.

Jul 07 '16 11:07 hufnagel

Ok, the workflow is NOT going to be fixed until early 2017. I didn't really expect anything else. The response basically was "it's working as-is, we don't have time to work on this now, we'll look at it once we have time. ie. between the 2016 and 2017 data taking".

Which basically leaves us two options:

We don't put in a validation until early 2017 (which IMO is unacceptable).
We put in a validation check, but leave an exception for this one known use case and also leave the workaround for this one known use case. We can put in a an issue to remove the workaround and the exception once the workflow has been fixed.

Jul 20 '16 22:07 hufnagel