Stageout failures at sites using SRM with rhel8

Open haozturk opened this issue 2 years ago • 14 comments

Dear experts,

We see that many production workflows have failed during stageout at sites that use SRM. A small local test shows that SRM calls fail inside this container, while they succeed directly on lxplus:

[haozturk@lxplus708 ~]$ singularity shell --bind /cvmfs --bind /afs --contain --ipc --pid /cvmfs/singularity.opensciencegrid.org/cmssw/cms:rhel8
Singularity> gfal-ls -v srm://dcache-se-cms.desy.de/pnfs/desy.de/cms/
gfal-ls error: 70 (Communication error on send) - srm-ifce err: Communication error on send, err: [SE][Ls][] httpg://dcache-se-cms.desy.de:8443/srm/managerv2: CGSI-gSOAP running on lxplus708.cern.ch reports Error reading token data header: Connection closed

[haozturk@lxplus708 ~]$ gfal-ls -v srm://dcache-se-cms.desy.de/pnfs/desy.de/cms/tier2/store/user/
joroemer
tpook
...

This is an example production workflow which failed at T1_FR_CCIN2P3 during stageout

Can you please look into the issue w/ this container?

haozturk avatar Oct 04 '22 10:10 haozturk

A new Issue was created by @haozturk Hasan Öztürk.

@Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

cmsbuild avatar Oct 04 '22 10:10 cmsbuild

assign core

makortel avatar Oct 04 '22 13:10 makortel

New categories assigned: core

@Dr15Jones,@smuzaffar,@makortel you have been requested to review this Pull request/Issue and eventually sign? Thanks

cmsbuild avatar Oct 04 '22 13:10 cmsbuild

@aandvalenzuela @iarspider Do you know what to check, or do we need to wait for @smuzaffar to come back?

makortel avatar Oct 04 '22 13:10 makortel

@makortel we have to wait for @smuzaffar.

iarspider avatar Oct 04 '22 13:10 iarspider

@haozturk, I think the problem is with the OSG software stack: for example, running the same command in the opensciencegrid/osg-wn:3.5-el8 container also hangs and then fails

> singularity shell -B /home -B /tmp  --contain --ipc --pid docker://opensciencegrid/osg-wn:3.5-el8
Apptainer> gfal-ls -v srm://dcache-se-cms.desy.de/pnfs/desy.de/cms/tier2/store/user/
gfal-ls error: 70 (Communication error on send) - srm-ifce err: Communication error on send, err: [SE][Ls][] httpg://dcache-se-cms.desy.de:8443/srm/managerv2: CGSI-gSOAP running on cmsdev32.cern.ch reports Error reading token data header: Connection closed

I have rebuilt cmssw/cms:rhel8 to get the latest versions of the packages, and I noticed that gfal-ls works if /cvmfs is not mounted

> singularity shell -B /home -B /tmp  --contain --ipc --pid docker://cmssw/cms:tmp-rhel8-cms-20221009
INFO:    Using cached SIF image
Apptainer> gfal-ls -v srm://dcache-se-cms.desy.de/pnfs/desy.de/cms/tier2/store/user/
srm://dcache-se-cms.desy.de/pnfs/desy.de/cms/tier2/store/user/

but if I mount /cvmfs then it fails

> singularity shell -B /home -B /tmp  -B /cvmfs --contain --ipc --pid docker://cmssw/cms:tmp-rhel8-cms-20221009
INFO:    Using cached SIF image
Apptainer> gfal-ls -v srm://dcache-se-cms.desy.de/pnfs/desy.de/cms/tier2/store/user/
gfal-ls error: 70 (Communication error on send) - srm-ifce err: Communication error on send, err: [SE][Ls][] httpg://dcache-se-cms.desy.de:8443/srm/managerv2: CGSI-gSOAP running on cmsdev32.cern.ch reports Error reading token data header: Connection closed
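
The two cases above differ only in the `-B /cvmfs` bind mount, so both can be generated from one template. A minimal sketch, assuming `singularity exec` is used in place of the interactive `singularity shell`; the helper name `probe_cmd` is hypothetical, and the image tag and URL are the ones from the test above. It only prints the commands, so they can be inspected before being run:

```shell
IMAGE=docker://cmssw/cms:tmp-rhel8-cms-20221009
URL=srm://dcache-se-cms.desy.de/pnfs/desy.de/cms/tier2/store/user/

# Print the singularity command for one test case.
# Extra bind mounts (beyond /home and /tmp) are passed as arguments.
probe_cmd() {
    local binds="-B /home -B /tmp" mount
    for mount in "$@"; do
        binds="$binds -B $mount"
    done
    echo "singularity exec $binds --contain --ipc --pid $IMAGE gfal-ls -v $URL"
}

probe_cmd            # case 1: /cvmfs not mounted
probe_cmd /cvmfs     # case 2: /cvmfs mounted
```

Piping a printed line to `sh` runs that case; a valid grid proxy is still needed, exactly as in the interactive test.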

smuzaffar avatar Oct 10 '22 12:10 smuzaffar

Note that the cmssw/cms:rhel8 containers are based on opensciencegrid/osg-wn:3.5-el8.

smuzaffar avatar Oct 10 '22 12:10 smuzaffar

Is this a side effect of OSG dropping GSI support (which SRM/gsi/gridftp use), i.e. do we need to decouple from the OSG worker node clients and use the EGI, WLCG, or our own tools for this? (I thought OSG 3.5 had GSI support, but maybe this was removed for CentOS 8? Shall we ask OSG support to comment?)

  • Stephan

stlammel avatar Oct 10 '22 13:10 stlammel

No @stlammel, I think GSI support was dropped only in OSG 3.6. @jblomer pointed out that something might change PATH/LD_LIBRARY_PATH when /cvmfs is mounted. I think the issue is with the https://github.com/cms-sw/cms-docker/blob/master/cms/osg-wn-client-setup.sh script, which sources /cvmfs/oasis.opensciencegrid.org/osg-software/osg-wn-client/3.5/current/el8-x86_64/setup.sh and changes LD_LIBRARY_PATH. Note that this script is sourced when singularity starts. It looks like the software installed in the container and the packages made available via /cvmfs/oasis.opensciencegrid.org/osg-software/osg-wn-client/3.5/current/el8-x86_64/setup.sh are not compatible.
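
One way to check this from inside the container is to list which LD_LIBRARY_PATH (or PATH) entries point into /cvmfs after startup. A minimal sketch; `entries_under` is a hypothetical helper written for illustration, not part of any OSG or CMS tooling:

```shell
# Print the entries of a colon-separated path list that lie under a prefix.
# Usage: entries_under "$LD_LIBRARY_PATH" /cvmfs
entries_under() {
    local rest="$1:" prefix="$2" entry
    while [ -n "$rest" ]; do
        entry=${rest%%:*}      # first entry of the remaining list
        rest=${rest#*:}        # drop that entry and its colon
        case $entry in
            "$prefix"|"$prefix"/*) printf '%s\n' "$entry" ;;
        esac
    done
}

echo "LD_LIBRARY_PATH entries under /cvmfs:"
entries_under "${LD_LIBRARY_PATH:-}" /cvmfs
echo "PATH entries under /cvmfs:"
entries_under "${PATH:-}" /cvmfs
```

Running it once with and once without `-B /cvmfs` should show whether the setup script prepended any /cvmfs locations.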

smuzaffar avatar Oct 10 '22 14:10 smuzaffar

Hello Shahzad, thanks! We had this "mixed" and broken environment before, and that was one of the reasons for using the OSG WN client environment. Looking at the setup script, it puts the OSG location first in PATH/LD_LIBRARY_PATH, as it should. The OSG 3.5 WN environment for CentOS 8 switched to python3, but /cvmfs/oasis.opensciencegrid.org/mis/osg-wn-client/3.5/3.5.62-1/el8-x86_64/usr/bin contains only python2. The python3 is therefore picked up from the OS in the image and seems not to work, as it then mixes the gfal2 1.8 and 1.7 environments. I would ping OSG support. Thanks,

  • Stephan
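
To see which installation each tool is actually picked up from inside the container, a check along these lines may help. This is a sketch only: `origin_of` is a hypothetical helper, and the labels `cvmfs`, `local`, and `missing` are arbitrary.

```shell
# Report where a command resolves from on the current PATH:
# "cvmfs" if under /cvmfs, "local" otherwise, "missing" if not found.
origin_of() {
    local path
    path=$(command -v "$1") || { echo missing; return 0; }
    case $path in
        /cvmfs/*) echo cvmfs ;;
        *)        echo local ;;
    esac
}

for tool in gfal-ls python3 python2; do
    printf '%s: %s\n' "$tool" "$(origin_of "$tool")"
done
```

If `gfal-ls` reports `cvmfs` while `python3` reports `local`, the two come from different installations, which would match the mixed environment described above.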

stlammel avatar Oct 10 '22 15:10 stlammel

Thanks for following this issue. @stlammel did you contact OSG support? If not who should do it?

haozturk avatar Oct 12 '22 13:10 haozturk

Hello Hasan @haozturk, no, I did not open an OSG ticket. I can if we agree this is the direction we want to go.

  • Stephan

stlammel avatar Oct 12 '22 13:10 stlammel

Thanks @stlammel. I rely on your and @smuzaffar's judgement on this, as I'm not an expert on the issue. I just want to highlight that more and more el8 workflows are coming to production, and we're banning more than 30 sites for such workflows. This might increase the delivery time of the requests and lower the utilization of the banned sites. So the sooner we fix it, the less trouble we'll have. I'm happy to do anything I can.

haozturk avatar Oct 14 '22 13:10 haozturk

Are there any downsides to doing this? If not, can we make it happen ASAP?
Thanks, Jen

jenimal avatar Oct 14 '22 14:10 jenimal

@stlammel, any follow-up? As Hasan says above, we are going to trust your expertise on this. We currently do not see a downside to switching, so if there is one, we need to know. Otherwise, let's get moving on it.

Jen

jenimal avatar Oct 18 '22 17:10 jenimal

Yes, we decided to involve OSG. Hasan is in the loop and should get copies of the updates. I can't give you a timeline though; I would expect at least several days. There is also a discussion about the origin of the OSG WN client use. Given that the issue was first encountered about a month ago, I would give it a few days for things to be better understood.

  • Stephan

stlammel avatar Oct 18 '22 19:10 stlammel