[lxplus, gridpack generation] W+4jets in MG2.6.1
I got errors during generation of a MG5 2.6.1 gridpack. The process is W+4jets and I used the mg261 branch. It seems related to LHAPDF6, but I am not sure.
Notably, a DY+4jets gridpack is generated without any errors, so it is hard to understand why this happens only for W+jets.
The error log file is on lxplus : /afs/cern.ch/work/j/jhchoi/public/temp/mg_4jet_v261_validation/validation_lv_bwcutoff_wjetlv_htincl/v261/false_pdfwgt/LSFJOB_180852388/STDOUT
======= ERR MESSAGE =======
STDOUT.txt
Hi @soarnsoar,
It is indeed surprising that this LHAPDF problem appears sometimes, but not all the time. Some questions:
- Are you certain that nothing else has changed? Did you use the same state of the genproductions repository for both tests?
- Have you run this only once, or have you tried resubmitting multiple times? The error may just be due to a temporary problem with boost on one of the workers.
- Could you try restoring the previous LHAPDF patch we applied? This was removed in 97cffd79829f170a678e4b381828495e22d9845a. I made a test branch you can pull from to get the patch back.
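A minimal sketch of how such a test branch could be checked out; the fork URL below is a placeholder assumption, and only the branch name mg261_2018-09-05_forJunho appears later in this thread:
  git clone https://github.com/cms-sw/genproductions.git
  cd genproductions
  git remote add andreas https://github.com/AndreasAlbert/genproductions.git   # placeholder fork URL
  git fetch andreas
  git checkout -b mg261_2018-09-05_forJunho andreas/mg261_2018-09-05_forJunho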
Dear Andreas Albert,
Thanks for your comments!
For comment 2: yes, I tried it several times.
For comment 1: I checked all run cards of the gridpacks that I generated.
I find that when I use lhaid=263000 for a v261 gridpack, the gridpack generation fails.
For the v261 DY gridpack, which I generated successfully, I had changed the PDF set, and that is why I could make the gridpack. So, as you mentioned in comment 3, this problem is probably related to LHAPDF (my guess).
I will try the test branch that you gave me. Thanks :)
I'll be back after the test.
I should have checked the run cards much earlier. Anyway, thanks for your great help!
Thanks for the update @soarnsoar.
I am not sure I understand your point about the PDF choice. Could you clarify what happens for exactly which setting? You say that with lhaid = 263000 the generation fails, but for another value it works? In the case that works, did you use pdlabel = lhapdf?
Also, regarding point 1. above: Could you share the log files of the working gridpacks? I would like to check the commit hash to see which state of the git repository you used and if there are any differences in the scripts compared to the failing case above.
Dear Andreas,
Sorry for the unclear post. Let me explain one by one.
1) Choice of PDF
When I selected the PDF in the run card like this:
    lhapdf = pdlabel ! PDF set
    263000 = lhaid   ! if pdlabel=lhapdf, this is the lhapdf number
I got the errors mentioned in my first post (W+4jets gridpack).
When I selected the PDF like this:
    nn23lo1 = pdlabel ! PDF set
    230000 = lhaid    ! if pdlabel=lhapdf, this is the lhapdf number
there were no errors (DY+4jets gridpack).
For both cases I used the batch script "submit_gridpack_generation_local.sh" (all v261 gridpacks).
2) Batch vs. non-batch
I tested another case: W(leptonic)+2jets v261 gridpack production with the PDF setup
    lhapdf = pdlabel ! PDF set
    263000 = lhaid   ! if pdlabel=lhapdf, this is the lhapdf number
(For a fast test, I used the 2-jet process.)
One run was done with the "submit_gridpack_generation_local.sh" script [batch], the other with "gridpack_generation.sh" [non-batch].
I found that the 2-jet production fails when using the batch script (submit_gridpack_generation_local.sh), but the non-batch job finishes without error (gridpack_generation.sh)! See the sketch of the two calls below.
I guess LXPLUS's LSF batch environment has some problem with LHAPDF.
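For reference, a hedged sketch of the two submission modes compared above; the card name and directory are placeholders, and the argument order of submit_gridpack_generation_local.sh is assumed to match the submit_gridpack_generation.sh call quoted later in this thread:
  # non-batch: everything runs in the current lxplus session
  ./gridpack_generation.sh wtoenu2j_5f_LO cards/examples/wtoenu2j_5f_LO/ local
  # batch: the master job itself is sent to the LSF queue (argument order assumed)
  ./submit_gridpack_generation_local.sh 2000 2000 1nh wtoenu2j_5f_LO cards/examples/wtoenu2j_5f_LO/ 1nh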
3) The test branch that you gave me: it also failed. The log files are at
/afs/cern.ch/work/j/jhchoi/public/temp/toAlexander/lhapdf_patch/log_failed_jobs/
on lxplus; there you can see that I used the git branch mg261_2018-09-05_forJunho.
4) The working gridpack log: the log of the DY+4jets v261 gridpack generation is attached.
Thank you !
5) Summary of my tests (v261):
- DY2jet - nn23lo1/230000, non-batch submission [Success]
- W2jet - lhapdf/263000, non-batch submission [Success]
- W2jet - lhapdf/263000, batch submission [Fail]
- DY4jet - nn23lo1/230000, batch submission [Success]
- W4jet - lhapdf/263000, batch submission [Fail]
- W4jet - lhapdf/263000, batch submission + test branch [Fail]
- W4jet - nn23lo1/230000, batch submission [Running]
I hope this is helpful. The W4jet job with the nn23lo1/230000 PDF will give us a useful hint.
Best,
Junho
Dear Andreas,
When I use nn23lo1/230000 instead of lhapdf/263000, the gridpack generation succeeds (with batch submission).
So my conclusion is that the submit_gridpack_generation_local.sh script has an LHAPDF6-related problem.
I will check my submission code for anything missing, and also test the submission script with the 2-jet process (for a much faster job).
In summary, the symptom is that gridpack generation with submit_gridpack_generation_local.sh / v261 / lhapdf-263000 crashes.
Best,
Junho
Please wait :) It could be my own mistake in using the submission script. Let me get back to you in a few hours. Sorry..
Best, Junho
Dear Andreas,
I could not find out why I cannot make the gridpack using the LSF batch (via submit_gridpack_generation_local.sh) with v261 and lhapdf.
I thought it was due to some missing arguments in submit_gridpack_generation_local.sh, but it isn't.
Which script is recommended for submitting gridpack generations?
Hi @soarnsoar:
I can reproduce your problem even with a very simple process, wplustest_4f_LO. This confirms to me that the problem is really with LHAPDF and not some other feature of the process.
Regarding the choice of script: Usually, I would use submit_gridpack_generation.sh, but I confirmed that I see the same error also with submit_gridpack_generation_local.sh, so that is not the cause of the problem. In any case, the difference between the two is only the location of the working directories, so it should not make a difference either way (although I would think that using an AFS directory as a working directory may be more error prone, since AFS does not like heavy file access).
I will try and see if I can fix this. I will report back here.
Thanks Andreas!
I should have reported this much earlier, sorry.
Anyway, do you mean this is related to the working directory and LHAPDF6?
Please let me know if you manage to fix this problem! (Could you give me some details?)
Thank you:)
No, it's not related to the working directory. I was saying that the only difference between submit_gridpack_generation.sh and submit_gridpack_generation_local.sh is the choice of working directory. However, I confirmed that I see the error with both scripts. Therefore, the working directory is not the source of the problem.
With some further tests, it seems that:
- Running everything locally on an lxplus machine works:
  ./gridpack_generation.sh wplustest_4f_LO cards/examples/wplustest_4f_LO/ local
- Running the master job locally on lxplus and having MG submit its sub-jobs to the batch system works:
  ./gridpack_generation.sh wplustest_4f_LO cards/examples/wplustest_4f_LO/ 1nh
- Submitting the master job to the batch system does not work:
  ./submit_gridpack_generation.sh 2000 2000 1nh wplustest_4f_LO cards/examples/wplustest_4f_LO/ 1nh
So there really seems to be some configuration difference between the worker nodes and the lxplus machines for users.
I will try to see if I can find the difference. However, we should probably not spend too many resources trying to debug this issue, as the LSF batch system is on its way out anyway. Is there any reason to use it at this point? It seems inferior to CMSConnect in every way... @kdlong @agrohsje @khurtado may want to comment.
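One generic way to hunt for such a configuration difference would be to dump the environment and the LHAPDF library linkage on both an interactive lxplus node and an LSF worker and compare; the commands below are only an illustrative sketch using standard tools, not something that was run in this thread:
  # on an interactive lxplus node
  env | sort > env_lxplus.txt
  ldd "$(lhapdf-config --libdir)/libLHAPDF.so" > ldd_lxplus.txt
  # the same commands wrapped in an LSF job, so they run on a worker node
  bsub -q 1nh -o worker_env.txt 'env | sort; ldd "$(lhapdf-config --libdir)/libLHAPDF.so"'
  # then diff the two dumps to spot differences in PATH, LD_LIBRARY_PATH, boost, etc.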
Okay, turns out it does not work on CMSConnect, either...
@soarnsoar OK, I think it is solved in this test branch. Could you give it a try to see if it works for you? It's actually the same fix I prepared in the branch I sent you previously, but I had not propagated it right the first time.
@AndreasAlbert thanks a lot! Does that also fix the problem on HTCondor?
I checked that it works for wplustest_4f_LO on the lxplus LSF batch and on CMSConnect. For the CERN HTCondor, I don't think we have had widespread testing yet anyway, right? We would first have to port #1662.
That's great. Is anybody still working on #1662, or should we port and test?
@AndreasAlbert Thanks! I want to submit the jobs on CMSConnect. I ran
submit_cmsconnect_gridpack_generation.sh <card_name> <card_directory>
Is this the right way to submit a job?
(On lxplus, I checked that it works!)
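For concreteness, a minimal sketch of that call using the wplustest_4f_LO example card quoted later in this thread (card name and location taken from that later example):
  ./submit_cmsconnect_gridpack_generation.sh wplustest_4f_LO cards/examples/wplustest_4f_LO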
@agrohsje I think #1662 should be good to go. While there may still be improvements to be made, I think that it cannot break anything else. I can port it to mg261 since we maybe don't want to touch the 26 branch for now.
@soarnsoar yes, that should work. If you confirm that it works now, I will prepare a PR.
@kdlong This problem was a result of https://github.com/cms-sw/genproductions/commit/97cffd79829f170a678e4b381828495e22d9845a, which I simply reverted to fix the issue. Was there any specific reason we have to keep this commit other than general clean-up?
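For reference, the fix described above amounts to reverting that commit; a minimal sketch (assuming the fix branch is already checked out):
  # revert the commit that removed the LHAPDF patch (hash from the link above)
  git revert 97cffd79829f170a678e4b381828495e22d9845a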
but #1662 seems to have conflicts.
@efeyazgan yes, that's why it should be ported to this branch (and the conflicts resolved) rather than merged into mg26x. While we are talking about this: Where are we regarding a merge of mg26x into master? As long as that is not finalized, there's probably no point resolving conflicts in the mg261 branch, as that will have to be merged later anyway.
Maybe we could have a direct chat about that and collect all relevant points. Kenneth and I made a diff a while ago, but I fear it is again obsolete. Maybe we could quickly discuss next week? MEFG + you + L2?
I agree with @agrohsje that we should have a dedicated MG meeting. The sooner we switch to 26X or 261 the better, considering we might change from sl6 to sl7 possibly early next year, and many gridpacks will need to be regenerated.
Possibly MEFG can organize such a meeting. If it is not too late, we can use the Oct 11 GEN meeting slot and room: https://indico.cern.ch/event/746830/
Yes, it would be very good to have that meeting soon.
@AndreasAlbert
I ran a W+0,1,2-jets gridpack generation, but it seems to have failed. When I ran it with submit_cmsconnect_gridpack_generation.sh <card_name> <card_directory>, the prompt terminal showed messages even though it is a batch job, I think.
Also, what is the "CODEGEN" step? Could you give me a brief explanation?
Part of the error messages is below; the full log files are at the end of this post:
WARNING: resubmit job (for the 9 times)
INFO: Idle: 1, Running: 0, Completed: 34 [ 49m 59s ]
INFO: ClusterId 4552745 was held with code 13, subcode 2. Releasing it.
INFO: ClusterId 4552746 was held with code 13, subcode 2. Releasing it.
INFO: ClusterId 4552748 was held with code 13, subcode 2. Releasing it.
INFO: Idle: 0, Running: 8, Completed: 27 [ 52m 0s ]
WARNING: resubmit job (for the 2 times)
WARNING: resubmit job (for the 2 times)
WARNING: resubmit job (for the 2 times)
WARNING: resubmit job (for the 2 times)
WARNING: resubmit job (for the 2 times)
WARNING: resubmit job (for the 2 times)
WARNING: resubmit job (for the 2 times)
WARNING: resubmit job (for the 10 times)
INFO: Idle: 11, Running: 0, Completed: 32 [ 54m 26s ]
INFO: ClusterId 4552748 was held with code 13, subcode 2. Releasing it.
INFO: ClusterId 4552753 was held with code 13, subcode 2. Releasing it.
INFO: ClusterId 4552754 was held with code 13, subcode 2. Releasing it.
INFO: ClusterId 4552755 was held with code 13, subcode 2. Releasing it.
INFO: ClusterId 4552756 was held with code 13, subcode 2. Releasing it.
INFO: ClusterId 4552757 was held with code 13, subcode 2. Releasing it.
INFO: ClusterId 4552758 was held with code 13, subcode 2. Releasing it.
INFO: ClusterId 4552759 was held with code 13, subcode 2. Releasing it.
INFO: Idle: 0, Running: 3, Completed: 40 [ 56m 27s ]
WARNING: resubmit job (for the 2 times)
WARNING: resubmit job (for the 2 times)
CRITICAL: Fail to run correctly job 4552760.
with option: {'log': None, 'stdout': None, 'argument': ['0', '12'], 'nb_submit': 10, 'stderr': None, 'prog': '/home/jhchoi/work/gridpack_validation/wtoenu012j_5f_LO/v261/false_pdfwgt/mg261_2018-09-14_forJunho/genproductions/bin/MadGraph5_aMCatNLO/wtoenu012j_5f_LO_261_false_pdfwgt/wtoenu012j_5f_LO_261_false_pdfwgt_gridpack/work/processtmp/SubProcesses/survey.sh', 'output_files': ['G12'], 'time_check': 1537196199.165137, 'cwd': '/home/jhchoi/work/gridpack_validation/wtoenu012j_5f_LO/v261/false_pdfwgt/mg261_2018-09-14_forJunho/genproductions/bin/MadGraph5_aMCatNLO/wtoenu012j_5f_LO_261_false_pdfwgt/wtoenu012j_5f_LO_261_false_pdfwgt_gridpack/work/processtmp/SubProcesses/P2_qq_lvlqq', 'required_output': ['G12/results.dat'], 'input_files': ['madevent', 'input_app.txt', 'symfact.dat', 'iproc.dat', 'dname.mg', '/home/jhchoi/work/gridpack_validation/wtoenu012j_5f_LO/v261/false_pdfwgt/mg261_2018-09-14_forJunho/genproductions/bin/MadGraph5_aMCatNLO/wtoenu012j_5f_LO_261_false_pdfwgt/wtoenu012j_5f_LO_261_false_pdfwgt_gridpack/work/processtmp/SubProcesses/randinit', '']}
file missing: /home/jhchoi/work/gridpack_validation/wtoenu012j_5f_LO/v261/false_pdfwgt/mg261_2018-09-14_forJunho/genproductions/bin/MadGraph5_aMCatNLO/wtoenu012j_5f_LO_261_false_pdfwgt/wtoenu012j_5f_LO_261_false_pdfwgt_gridpack/work/processtmp/SubProcesses/P2_qq_lvlqq/G12/results.dat
Fails 10 times
No resubmition.
INFO: Idle: 10, Running: 0, Completed: 35 [ 58m 35s ]
INFO: All jobs finished
INFO: Idle: 0, Running: 0, Completed: 0 [ 1h 0m ]
Error when reading /home/jhchoi/work/gridpack_validation/wtoenu012j_5f_LO/v261/false_pdfwgt/mg261_2018-09-14_forJunho/genproductions/bin/MadGraph5_aMCatNLO/wtoenu012j_5f_LO_261_false_pdfwgt/wtoenu012j_5f_LO_261_false_pdfwgt_gridpack/work/processtmp/SubProcesses/P2_gg_lvlqq/G1/results.dat
Command "generate_events pilotrun" interrupted with error:
IOError : [Errno 2] No such file or directory: '/home/jhchoi/work/gridpack_validation/wtoenu012j_5f_LO/v261/false_pdfwgt/mg261_2018-09-14_forJunho/genproductions/bin/MadGraph5_aMCatNLO/wtoenu012j_5f_LO_261_false_pdfwgt/wtoenu012j_5f_LO_261_false_pdfwgt_gridpack/work/processtmp/SubProcesses/P2_gg_lvlqq/G1/results.dat'
Please report this bug on https://bugs.launchpad.net/mg5amcnlo
More information is found in '/home/jhchoi/work/gridpack_validation/wtoenu012j_5f_LO/v261/false_pdfwgt/mg261_2018-09-14_forJunho/genproductions/bin/MadGraph5_aMCatNLO/wtoenu012j_5f_LO_261_false_pdfwgt/wtoenu012j_5f_LO_261_false_pdfwgt_gridpack/work/processtmp/pilotrun_wjets_debug.log'.
Please attach this file to your report.
quit
INFO:
wtoenu012j_5f_LO_261_false_pdfwgt.log wtoenu012j_5f_LO_261_false_pdfwgt_codegen.log
@soarnsoar if you don't want any output on your console, you can pipe everything into a log conveniently with a call like [1]. Just replace folder and name as you wish.
The codegen job is basically just the first half of gridpack_generation.sh; the generation is simply split into two parts, with the first one, the codegen job, exiting at [2]. The codegen job does everything that happens up to and including the "output" statement in MadGraph, i.e. figuring out the relevant diagrams and translating them into calls to HELAS, which is the underlying software that calculates the amplitudes for each diagram, and then writing all of it into a folder.
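Purely as an illustration (the variable name and exact check below are assumptions, not the real contents of gridpack_generation.sh), the early exit referenced in [2] conceptually looks like this:
  # hypothetical sketch of the codegen/integration split; JOBSTEP is a placeholder name
  if [ "${JOBSTEP}" = "CODEGEN" ]; then
      # at this point the MG "output" step has run, so the process directory exists on disk
      echo "codegen done, stopping before integration"
      exit 0
  fi
  # the second job then resumes on that directory and runs "generate_events pilotrun"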
Regarding the error you see: it is hard to diagnose from the logs you sent because they do not really say how the sub-jobs fail. However, seeing how this is a complicated process and you seem to have all sub-jobs failing, I figure it may be related to the available memory. You can request more memory by using [3] (note the "8 Gb" bit; if that does not work, you can also increase this to larger values).
Anyway, could you re-run with the lxplus batch system? Since you used that in the beginning, it would be good to see whether it works now.
[1] folder="cards/examples/"; name="wplustest_4f_LO"; nohup ./submit_cmsconnect_gridpack_generation.sh ${name} ${folder}/${name} > ${name}.debug 2>&1 &
[2] https://github.com/cms-sw/genproductions/blob/mg261/bin/MadGraph5_aMCatNLO/gridpack_generation.sh#L281-L284
[3] folder="cards/examples/"; name="wplustest_4f_LO"; nohup ./submit_cmsconnect_gridpack_generation.sh ${name} ${folder}/${name} > ${name}.debug "" "8 Gb" 2>&1 &
Dear @AndreasAlbert, I checked that it works well with 'submit_gridpack_generation_local.sh' on lxplus.
I'll check the CMSConnect running with a larger memory setting :)
Thank you!!
@AndreasAlbert
Hi Andreas :) I want to ask about CMSConnect.
When I submitted a gridpack generation job, the job submitted other jobs to the condor batch system. But some of those jobs failed and changed their status to Idle or Held, so the actual running time is very small.
Let me show you an example:
-- Schedd: login.uscms.org : <192.170.227.118:9618?... @ 09/25/18 06:22:55
 ID         OWNER   SUBMITTED   RUN_TIME   ST PRI SIZE CMD
4597530.0   jhchoi  9/25 05:53  0+00:00:00 I  0   0.0  proxy_checker.sh
4597554.0   jhchoi  9/25 06:00  0+00:03:17 I  0   0.0  connect_wrapper.sh survey.sh 0 1 3
4597555.0   jhchoi  9/25 06:00  0+00:03:48 I  0   0.0  connect_wrapper.sh survey.sh 0 4 7
4597556.0   jhchoi  9/25 06:00  0+00:05:05 I  0   0.0  connect_wrapper.sh survey.sh 0 1 2
4597557.0   jhchoi  9/25 06:01  0+00:02:40 I  0   0.0  connect_wrapper.sh survey.sh 0 3 4
4597558.0   jhchoi  9/25 06:01  0+00:03:44 I  0   0.0  connect_wrapper.sh survey.sh 0 5 6
4597559.0   jhchoi  9/25 06:01  0+00:03:25 I  0   0.0  connect_wrapper.sh survey.sh 0 7 8
4597563.0   jhchoi  9/25 06:02  0+00:07:56 R  0   0.0  connect_wrapper.sh survey.sh 0 1 2
4597564.0   jhchoi  9/25 06:02  0+00:04:15 I  0   0.0  connect_wrapper.sh survey.sh 0 3 4
4597565.0   jhchoi  9/25 06:02  0+00:03:47 I  0   0.0  connect_wrapper.sh survey.sh 0 9 11
4597567.0   jhchoi  9/25 06:02  0+00:02:20 I  0   3.0  connect_wrapper.sh survey.sh 0 1 2
4597617.0   jhchoi  9/25 06:22  0+00:00:00 I  0   0.0  connect_wrapper.sh survey.sh 0 12

12 jobs; 0 completed, 0 removed, 11 idle, 1 running, 0 held, 0 suspended
[jhchoi@login ~]$ date
Tue Sep 25 06:23:05 CDT 2018
You can see that even though 20 minutes have passed, the actual running time is very small and the jobs keep switching to I/H status!
Is this okay? Overall, my jobs failed. I want to know whether this continuous changing of job status is normal or not.
Best regards,
Junho Choi
@soarnsoar Does it keep failing after multiple tries? Without any error messages, it is hard to say if there's something wrong. Jobs being idle is normal, they should run at some point. Do they? If not, that sounds like more of a system problem rather than MG.
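As a generic aside (standard HTCondor commands, not something run in this thread), the scheduler itself can usually explain why jobs sit idle or get held:
  # analyze why a particular idle job is not matching (job id taken from the listing above)
  condor_q -better-analyze 4597554.0
  # print the hold reason for any held jobs
  condor_q -hold -af ClusterId HoldReason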
Yep. When I submitted a job for a very simple process (taking a few minutes as a local job), it also failed.
Jobs in IDLE status were resubmitted 10 times and then failed.
I'll ask Kenyi about the CMSConnect system.
Thank you!
@soarnsoar @agrohsje I thought about this some more, and I don't think this solution makes sense. As of LHAPDF 6.2, there should be no boost dependency anymore. How can there be a boost-related problem? There must be something else wrong. Maybe one of the other patches?
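To double-check that, one could verify which LHAPDF the jobs actually pick up and whether boost still appears in its linkage; a generic sketch using the standard lhapdf-config tool (not something run in this thread):
  lhapdf-config --version                                        # is it really >= 6.2?
  ldd "$(lhapdf-config --libdir)/libLHAPDF.so" | grep -i boost   # does any boost library show up?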