Truncate output module name before sending it over to DBS
Tier0 is failing to inject one specific kind of block into DBS. The cause is the long (>45 characters) string used for the output_module_label, which makes DBS respond with a 500 Internal Server Error.
@ericvaandering this patch will truncate the output module as we discussed in the office yesterday. @hufnagel may have a comment on it as well.
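For reference, the truncation this patch applies can be sketched roughly as follows (the 45-character limit comes from the DBS schema; the function name is hypothetical, not the actual WMCore code):

```python
# Hypothetical sketch of the truncation applied before injection into DBS.
# The 45-character limit matches the DBS OUTPUT_MODULE_LABEL column width.
DBS_MAX_LABEL_LEN = 45

def truncateOutputModuleLabel(label, maxLen=DBS_MAX_LABEL_LEN):
    """Return the label unchanged if it fits, otherwise cut it to maxLen."""
    if len(label) <= maxLen:
        return label
    return label[:maxLen]
```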
Beginning pylint and unit tests. See https://cmssdt.cern.ch/jenkins/job/DMWM-WMCore-PR-test/2485/ for details
@johnhcasallasl FYI
Summary of pylint changes for pull request 6950:
Abbreviated pylint report for src/python/WMComponent/DBS3Buffer/DBSBufferBlock.py follows:
- Line 294 in DBSBufferBlock.setDataset W0612 Unused variable 'junk'
- plus 3 total warnings
- plus 61 comments on code style
Unit test changes for pull request 6950:
- WMCore_t.WorkQueue_t.WorkQueue_t.WorkQueueTest:testResubmissionWorkflowSiteWhitelistLocations changed from failure to success
This is dangerous. We need some check on the DBS side to see if there is any unique constraint or index on that field. If there is, then just blindly truncating can get you into trouble because two outputs that are nominally different won't be considered as such by DBS.
I would much prefer if this patch had something like:
```python
if conf.get('output_module_label', '') == "PromptCalibProdSiStripGainsAfterAbortGap":
    conf['output_module_label'] = "PromptCalibProdSiStripGainsAAG"
```
This sets exactly the mapping suggested by Alca and makes sure it still breaks for any other too-long label (as it should; we don't want a general, automatic remapping here).
I'm not sure what kind of uniqueness constraint you think there could be? This field is a property of the dataset in DBS, but of course many datasets can have the same value. Do you mean that if we try to "create" the same dataset a second time with a different value, we might have issues?
I'm just not sure anyone ever uses this field. As far as I know, we could fill it with "GIBBERISH" for every dataset as we do for the PSET hash and we may never notice.
Yes, I worry about the case where two datasets have different output module labels and we truncate to the same value. Maybe a bit paranoid, but this could be a real mess if it happens.
Another consideration: since we currently want to keep the length limit and you are adding validation checks in another patch, I would keep this workaround patch as restrictive as possible and only fix the one use case we have for it now. Don't add a general workaround if we don't really want a general workaround.
This is not a general way to handle output module labels. As a temporary solution, we are only going to truncate the existing output module labels that are longer than 45 characters. All new output module labels should be limited to 45 characters when they are created.
Answering Dirk's question about output_module_label in DBS: it is not unique by itself. The schema requires output_module_label to be non-NULL and at most 45 characters long. APP_EXEC_ID, RELEASE_VERSION_ID, PARAMETER_SET_HASH_ID, OUTPUT_MODULE_LABEL and GLOBAL_TAG all together are required to be unique.
Can someone check how many output_module_labels longer than 45 characters are out there? What are the other fields associated with them?
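Given a dump of the labels, a simple filter would answer this (the sample list below is made up for illustration, not real DBS data):

```python
# Sketch: count output module labels exceeding the 45-character DBS limit.
# The sample labels below are illustrative, not real DBS data.
labels = [
    "ALCARECOStreamPromptCalibProdSiStripGainsAfterAbortGap",
    "RECOoutput",
    "AODSIMoutput",
]

tooLong = [label for label in labels if len(label) > 45]
print(len(tooLong), tooLong)
```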
Dirk, the suggestion proposed in the HN is not valid because that's not exactly the output module name bugging DBS. The correct one is 'ALCARECOStreamPromptCalibProdSiStripGainsAfterAbortGap', so if you want to create a rule only for this one, that's fine with me. Though I'd prefer to have something more generic in case we have other long names out there; otherwise we need to keep updating the source code for these exceptions.
About your concern, what would be the problem if different datasets with different output module names have the same truncated output name? Nobody is using that information anyways, AFAICT...
About the column length on the DBS side, is there any good reason that we cannot stretch it a bit (besides a possible very small performance penalty)?
Anyways, whatever path you decide to go is fine with me, I just think we should get it in place asap to avoid all these unneeded calls to dbs and pollution of the logs.
Well, to first order this (or Dirk's alternate) needs to be tested and put into the Tier0. After that we can put something into WMCore itself, but I don't see the point in putting a "one time only" fix into the WMCore code base, while the general fix that @amaltaro wrote could go in.
@hufnagel @johnhcasallasl can you validate it in a T0 replay?
Problem is, we only know the remapping for the specific use case we know about.
ALCARECOStreamPromptCalibProdSiStripGainsAfterAbortGap => ALCARECOStreamPromptCalibProdSiStripGainsAAG
As far as I am concerned there is no agreement how to shorten any other potential long output module and we don't want any other long output modules anyways (and your other patch will make sure we will never get them).
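In code form, this single agreed-upon remap (a sketch, not the actual patch) would be:

```python
# The one remapping agreed for the known use case; any other over-long
# label should still fail, so no general truncation is applied here.
LABEL_REMAP = {
    "ALCARECOStreamPromptCalibProdSiStripGainsAfterAbortGap":
        "ALCARECOStreamPromptCalibProdSiStripGainsAAG",
}

def remapLabel(label):
    """Apply the known remapping; leave everything else untouched."""
    return LABEL_REMAP.get(label, label)
```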
Btw, we also need to put this exception into Alan's more general validation patch. Either that or tell Alca that they need to change the name of the producer in CMSSW. Otherwise we can't run the Tier0 anymore after Alan's more general patch is merged.
Hello:
I was wondering how to proceed here; it is becoming urgent. IMO, we could solve the particular problem we have now and still continue the discussion about a general solution. If you agree, we can set up a replay to test the change as proposed by @hufnagel.
Adding @ebohorqu @blallen @drkovalskyi
Please run a small functional Tier0 replay immediately with this patch
https://github.com/hufnagel/WMCore/commit/cca04d11488e7fba3bf719f04f5ba8bb4b085bdc
The replay config should contain the PromptCalibProdSiStripGainsAfterAbortGap producer and you only need to inject Express stream data. Replay can be considered successful if the DBS registration of the data from that producer works.
If the replay is successful you can also deploy the patch in Tier0 production and restart the DBS3Buffer component.
After this is done I can make a PR from my patch or Alan can update this one.
PS: I am on vacation at a music festival in Belgium, only have decent internet in the morning or very late at night.
The replay was successful and the patch applied in Tier0 production.
@amaltaro, do you want to adjust this pull request, or close it so I can create a new pull request based on my patch?
Dirk, my understanding is that we're not going to make one specific fix for WMCore. Instead, we will add validation code while the request is created and fail it in case it does not fit into the DBS column. So you can leave it to me.
Yes, but you NEED to add this exception to the validation code since we already have existing data that violates the convention. You can't just add validation checks that will fail existing and already used workflows.
This basically means we are stuck supporting this exception, both with my workaround and in the validation check.
I thought we were going to have another cmssw release where it would be sorted out, no? What's in the system is already there and won't have issues - provided you keep this patch in the T0 until all the workflows for that release are completely gone.
I dislike very much all these exceptions that we have to keep in the code: a) at some point we don't need them anymore, b) but we never know whether they are still needed, c) the code remains in WMCore forever, and d) it complicates any code maintenance.
If the release will be kept longer in T0 with this long output name, then the only option left is to hack the code once again. Can you check that out and then we resume the discussion here?
So far no one pushed for the workflow name (and therefore output module label) to be changed. The only reply on this I found mentioned changing it for next year (which is a bit late).
Effectively, once you merge the validation patch, we won't be able to upgrade the Tier0 to this version of the WMCore code. Not until the PCL workflow has been changed to pass the validation.
I am not planning to update the Tier0 release anytime soon, so this is not a concern in the short term. OTOH, I don't want to rule out that we will want or have to change the Tier0 release before the end of the year.
I asked in HN for the workflow to be changed asap, lets see what kind of feedback I get.
If you need to keep a patch that does the mapping of the output label for the Tier0 for the rest of the year, you can also keep a patch that reverts the check that Alan is going to insert.
In principle correct, but I'd rather not have to write a patch just to partially revert the patch that Alan is still writing/testing :-)
Well, you don't have to "write" it, you get git to do it for you with git revert.
Eric
I actually want the validation so we don't get more of these configs with too-long output module labels. Which means I would actually have to patch on top of Alan's patch; a simple revert is not what we want.
Ok, I think we need to first fix the workflow configuration on T0 and then we can create the validation patch, just to be on the safe side. Dirk, don't forget to update this issue once you get news from that HN.
Oh, create the validation patch please. We just can't use it on the Tier0 side until we fix the workflow. Doesn't mean we shouldn't get the validation in asap though.
Ok, the workflow is NOT going to be fixed until early 2017. I didn't really expect anything else. The response basically was: "it's working as-is, we don't have time to work on this now, we'll look at it once we have time, i.e. between the 2016 and 2017 data taking."
Which basically leaves us two options:
- We don't put in a validation until early 2017 (which IMO is unacceptable).
- We put in a validation check, but leave an exception for this one known use case and also leave the workaround for it. We can open an issue to remove the workaround and the exception once the workflow has been fixed.
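The second option could look roughly like this in the validation code (a sketch only; the exception set and function name are made up for illustration and are not the actual WMCore API):

```python
# Sketch of a length validation with an exception list for the one known
# legacy label (names here are illustrative, not actual WMCore code).
DBS_MAX_LABEL_LEN = 45
GRANDFATHERED_LABELS = {"ALCARECOStreamPromptCalibProdSiStripGainsAfterAbortGap"}

def validateOutputModuleLabel(label):
    """Raise ValueError for over-long labels, unless grandfathered."""
    if label in GRANDFATHERED_LABELS:
        return True
    if len(label) > DBS_MAX_LABEL_LEN:
        raise ValueError("Output module label longer than %d characters: %s"
                         % (DBS_MAX_LABEL_LEN, label))
    return True
```

Once the PCL workflow is renamed, removing the entry from the exception set would re-enable strict validation for that label as well.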