Grid CA cert update caused nullPointerException; required gPlazma restart
Dear dCache devs,
At the request of Tigran, here is some info on the issue that was discussed on the mailing list.
We're running dCache 6.0.9. Yesterday we upgraded our Grid CA root cert bundle to support the new Terena/Sectigo certificates. We were on 1.104 and went to:
[root@dcmain /etc/grid-security/certificates]# rpm -q ca-policy-egi-core
ca-policy-egi-core-1.106-1.noarch
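For reference, the bundle update itself was done the usual way (a rough sketch; this assumes the EGI trust anchor yum repository is already configured, which may differ at other sites):
# Pull in the new CA policy bundle from the configured trust-anchor repo
yum update -y ca-policy-egi-core
# Verify the installed bundle version afterwards
rpm -q ca-policy-egi-core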
Since then, we had users from various VOs with authentication problems:
09:56 ui.grid.surfsara.nl:/home/onno
onno$ voms-proxy-init -voms projects.nl:/projects.nl/geomodel
09:58 ui.grid.surfsara.nl:/home/onno
onno$ globus-url-copy gsiftp://gridftp.grid.sara.nl:2811/pnfs/grid.sara.nl/data/projects.nl/geomodel/d_Mesh/meshfile.msh file:///$(pwd)/geomodel-test/test1
error: globus_ftp_client: the server responded with an error
530 Login denied
And:
10:02 ui.grid.surfsara.nl:/home/onno
onno$ voms-proxy-init -voms lsgrid:/lsgrid
10:04 ui.grid.surfsara.nl:/home/onno
onno$ uberftp -ls gsiftp://gridftp.grid.surfsara.nl:2811/pnfs/grid.sara.nl/data/lsgrid
530 Login denied
10:05 ui.grid.surfsara.nl:/home/onno
onno$ srmls srm://srm.grid.sara.nl:8443/pnfs/grid.sara.nl/data/lsgrid
Wed May 13 10:09:12 CEST 2020: Return status:
- Status code: SRM_AUTHORIZATION_FAILURE
- Explanation: login failed
srm ls response path details array is null!
But with dteam, which is hosted by another VOMS server, there was no problem:
09:58 ui.grid.surfsara.nl:/home/onno
onno$ voms-proxy-init -voms dteam
10:02 ui.grid.surfsara.nl:/home/onno
onno$ srmls srm://srm.grid.sara.nl:8443/pnfs/grid.sara.nl/data/dteam
This produced a normal listing.
The problem seemed to happen to users from VOs on voms.grid.surfsara.nl. I contacted the admin of that server, and he told me he had updated the Grid CA root certs on that host yesterday morning, before I updated them on our dCache nodes.
On the dCache server side, we saw this in the gPlazma logs:
13 May 2020 03:41:28 (gPlazma) [door:GFTP-shark12-AAWlfa3CuOA@ftp-shark12Domain GFTP-shark12-AAWlfa3CuOA Login AUTH voms] Bug in plugin:
java.lang.NullPointerException: null
13 May 2020 03:41:28 (gPlazma) [door:GFTP-shark16-AAWlfa3Jqjg@ftp-shark16Domain GFTP-shark16-AAWlfa3Jqjg Login AUTH voms] Bug in plugin:
java.lang.NullPointerException: null
13 May 2020 03:41:31 (gPlazma) [10Y:27003294:srm2:rm SRM-srm Login AUTH voms] Bug in plugin:
java.lang.NullPointerException: null
13 May 2020 03:41:32 (gPlazma) [door:GFTP-guppy4-AAWlfa4Ba6g@ftp-guppy4Domain GFTP-guppy4-AAWlfa4Ba6g Login AUTH voms] Bug in plugin:
java.lang.NullPointerException: null
In yesterday's gPlazma log, we found traces like these (this one was the first):
12 May 2020 16:34:08 (gPlazma) [door:GFTP-guppy4-AAWldFs3z9g@ftp-guppy4Domain GFTP-guppy4-AAWldFs3z9g Login AUTH voms] Bug in plugin:
java.lang.NullPointerException: null
at eu.emi.security.authn.x509.helpers.pkipath.bc.RFC3280CertPathUtilities.processCRLF(RFC3280CertPathUtilities.java:505)
at eu.emi.security.authn.x509.helpers.pkipath.bc.RFC3280CertPathUtilitiesCanl.processCRLF2(RFC3280CertPathUtilitiesCanl.java:564)
at eu.emi.security.authn.x509.helpers.pkipath.bc.RFC3280CertPathUtilitiesCanl.checkCRL(RFC3280CertPathUtilitiesCanl.java:305)
at eu.emi.security.authn.x509.helpers.pkipath.bc.RFC3280CertPathUtilitiesCanl.checkCRLs2(RFC3280CertPathUtilitiesCanl.java:149)
at eu.emi.security.authn.x509.helpers.revocation.CRLRevocationChecker.checkRevocation(CRLRevocationChecker.java:53)
at eu.emi.security.authn.x509.helpers.pkipath.bc.FixedBCPKIXCertPathReviewer.checkRevocation(FixedBCPKIXCertPathReviewer.java:1746)
at eu.emi.security.authn.x509.helpers.pkipath.bc.FixedBCPKIXCertPathReviewer.checkSignatures(FixedBCPKIXCertPathReviewer.java:739)
at eu.emi.security.authn.x509.helpers.pkipath.bc.FixedBCPKIXCertPathReviewer.doChecks(FixedBCPKIXCertPathReviewer.java:218)
at org.bouncycastle.x509.PKIXCertPathReviewer.getErrors(Unknown Source)
at eu.emi.security.authn.x509.helpers.pkipath.BCCertPathValidator.checkNonProxyChain(BCCertPathValidator.java:322)
at eu.emi.security.authn.x509.helpers.pkipath.BCCertPathValidator.validate(BCCertPathValidator.java:138)
at eu.emi.security.authn.x509.helpers.pkipath.AbstractValidator.validate(AbstractValidator.java:134)
at eu.emi.security.authn.x509.impl.OpensslCertChainValidator.validate(OpensslCertChainValidator.java:227)
at org.italiangrid.voms.ac.impl.DefaultVOMSValidationStrategy.validateCertificateChain(DefaultVOMSValidationStrategy.java:358)
at org.italiangrid.voms.ac.impl.DefaultVOMSValidationStrategy.checkLSCSignature(DefaultVOMSValidationStrategy.java:183)
at org.italiangrid.voms.ac.impl.DefaultVOMSValidationStrategy.checkSignature(DefaultVOMSValidationStrategy.java:204)
at org.italiangrid.voms.ac.impl.DefaultVOMSValidationStrategy.validateAC(DefaultVOMSValidationStrategy.java:325)
at org.italiangrid.voms.ac.impl.DefaultVOMSValidator.internalValidate(DefaultVOMSValidator.java:155)
at org.italiangrid.voms.ac.impl.DefaultVOMSValidator.validateWithResult(DefaultVOMSValidator.java:144)
at org.dcache.gplazma.plugins.VomsPlugin.authenticate(VomsPlugin.java:101)
at org.dcache.gplazma.plugins.GPlazmaAuthenticationPlugin.authenticate(GPlazmaAuthenticationPlugin.java:49)
at org.dcache.gplazma.strategies.DefaultAuthenticationStrategy.lambda$authenticate$0(DefaultAuthenticationStrategy.java:67)
at org.dcache.gplazma.strategies.PAMStyleStrategy.callPlugins(PAMStyleStrategy.java:98)
at org.dcache.gplazma.strategies.DefaultAuthenticationStrategy.authenticate(DefaultAuthenticationStrategy.java:57)
at org.dcache.gplazma.GPlazma$Setup.doAuthPhase(GPlazma.java:565)
at org.dcache.gplazma.GPlazma.login(GPlazma.java:276)
at org.dcache.gplazma.GPlazma.login(GPlazma.java:235)
at org.dcache.auth.Gplazma2LoginStrategy.login(Gplazma2LoginStrategy.java:156)
at org.dcache.services.login.MessageHandler.messageArrived(MessageHandler.java:69)
at sun.reflect.GeneratedMethodAccessor19.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.dcache.cells.CellMessageDispatcher$LongReceiver.deliver(CellMessageDispatcher.java:304)
at org.dcache.cells.CellMessageDispatcher.call(CellMessageDispatcher.java:201)
at org.dcache.cells.AbstractCell.messageArrived(AbstractCell.java:326)
at dmg.cells.nucleus.CellAdapter.messageArrived(CellAdapter.java:890)
at dmg.cells.nucleus.CellNucleus$DeliverMessageTask.run(CellNucleus.java:1226)
at org.dcache.util.BoundedExecutor$Worker.run(BoundedExecutor.java:251)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at dmg.cells.nucleus.CellNucleus.lambda$wrapLoggingContext$4(CellNucleus.java:756)
at java.lang.Thread.run(Thread.java:748)
And:
13 May 2020 07:53:50 (gPlazma) [10Y:27150285:srm2:bringOnline SRM-srm Login AUTH voms] Bug in plugin:
java.lang.NullPointerException: null
13 May 2020 07:53:50 (gPlazma) [10Y:27150285:srm2:bringOnline SRM-srm Login] Login attempt failed; detailed explanation follows:
LOGIN FAIL
| in: Origin[195.169.155.172]
| X509 Certificate chain:
| |
| +--CN=proxy,CN=proxy,CN=proxy,CN=Robot - [email protected],O=ASTRON,C=NL,DC=tcs,DC=terena,DC=org [17112265448961627630926914932032212362]
| | |
| | +--Issuer: CN=proxy,CN=proxy,CN=Robot - [email protected],O=ASTRON,C=NL,DC=tcs,DC=terena,DC=org
| | +--Validity: OK for 5 days, 20 hours, 26 minutes and 10.3 seconds
| | +--Algorithm: SHA-256 with RSA
| | +--Public key: RSA 2048 bits
| | +--Key usage: digital signature, key encipherment, data encipherment
| |
| +--CN=proxy,CN=proxy,CN=Robot - [email protected],O=ASTRON,C=NL,DC=tcs,DC=terena,DC=org [17112265448961627630926914932032212362]
| | |
| | +--Issuer: CN=proxy,CN=Robot - [email protected],O=ASTRON,C=NL,DC=tcs,DC=terena,DC=org
| | +--Validity: OK for 5 days, 20 hours, 26 minutes and 10.3 seconds
| | +--Algorithm: SHA-256 with RSA
| | +--Public key: RSA 2048 bits
| | +--Key usage: digital signature, key encipherment, data encipherment
| |
| +--CN=proxy,CN=Robot - [email protected],O=ASTRON,C=NL,DC=tcs,DC=terena,DC=org [17112265448961627630926914932032212362]
| | |
| | +--Issuer: CN=Robot - [email protected],O=ASTRON,C=NL,DC=tcs,DC=terena,DC=org
| | +--Validity: OK for 5 days, 20 hours, 26 minutes and 10.3 seconds
| | +--Algorithm: SHA-256 with RSA
| | +--Public key: RSA 2048 bits
| | +--Attribute certificates:
| | | |
| | | +--CN=voms.grid.sara.nl,OU=sara.nl,O=hosts,O=dutchgrid
| | | +--Validity: OK for 5 days, 20 hours, 26 minutes and 10.3 seconds
| | | +--Extensions:
| | | | |
| | | | +--Issuer: CN=NIKHEF medium-security certification auth,O=NIKHEF,C=NL
| | | | +--No revocation info
| | | | +--Authority key identifier
| | | +--Algorithm: SHA-256 with RSA
| | | +--FQANs: /lofar/ops, /lofar, ...
| | +--Key usage: digital signature, key encipherment, data encipherment
| |
| +--CN=Robot - [email protected],O=ASTRON,C=NL,DC=tcs,DC=terena,DC=org [17112265448961627630926914932032212362]
| |
| +--Issuer: CN=TERENA eScience Personal CA 3,O=TERENA,L=Amsterdam,ST=Noord-Holland,C=NL
| +--Validity: OK for 352 days, 6 hours, 6 minutes and 9.3 seconds
| +--Algorithm: SHA-256 with RSA
| +--Public key: RSA 2048 bits
| +--Subject alternative names: email: [email protected]
| +--Key usage: digital signature, key encipherment, data encipherment, SSL client, email protection
|
| out: EntityDefinitionPrincipal[Robot]
| Origin[195.169.155.172]
| /DC=org/DC=terena/DC=tcs/C=NL/O=ASTRON/CN=Robot - [email protected]
| EmailAddressPrincipal[[email protected]]
| LoAPrincipal[IGTF-AP:Classic]
|
+--AUTH OK
| | added: EntityDefinitionPrincipal[Robot]
| | /DC=org/DC=terena/DC=tcs/C=NL/O=ASTRON/CN=Robot - [email protected]
| | EmailAddressPrincipal[[email protected]]
| | LoAPrincipal[IGTF-AP:Classic]
| |
| +--x509 OPTIONAL:OK => OK
| | added: EntityDefinitionPrincipal[Robot]
| | /DC=org/DC=terena/DC=tcs/C=NL/O=ASTRON/CN=Robot - [email protected]
| | EmailAddressPrincipal[[email protected]]
| | LoAPrincipal[IGTF-AP:Classic]
| |
| +--voms OPTIONAL:FAIL (null) => OK
| |
| +--kpwd OPTIONAL:FAIL (no username and password) => OK
| |
| +--jaas OPTIONAL:FAIL (no login name) => OK
|
+--MAP OK
| |
| +--vorolemap OPTIONAL:FAIL (no record) => OK
| |
| +--gridmap OPTIONAL:FAIL (no mapping) => OK
| |
| +--mutator OPTIONAL:OK => OK
| |
| +--authzdb SUFFICIENT:FAIL (no mappable principal) => OK
| |
| +--kpwd SUFFICIENT:FAIL (no login name) => OK
| |
| +--ldap SUFFICIENT:OK => OK (ends the phase)
|
+--ACCOUNT OK
| |
| +--banfile REQUISITE:OK => OK
| |
| +--kpwd SUFFICIENT:OK => OK (ends the phase)
|
+--SESSION OK
| |
| +--roles REQUIRED:OK => OK
| |
| +--authzdb SUFFICIENT:FAIL (no username principal) => OK
| |
| +--kpwd SUFFICIENT:FAIL (no record found) => OK
| |
| +--ldap SUFFICIENT:OK => OK (ends the phase)
|
+--VALIDATION FAIL (no username, no UID, no primary GID)
We heard from our Nikhef colleagues that they had a similar problem with the Xenon VO, which they fixed by restarting some of their dCache domains. So I restarted our gPlazma domain and, sure enough, that fixed the issue.
I also captured half a minute of gPlazma log at DEBUG level before I restarted it. I can share that if you're interested.
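In case it's useful to others: restarting only the gPlazma domain (rather than the whole instance) can be done with the dcache wrapper script. A rough sketch; the actual domain name depends on your layout:
# Find the domain that hosts the gPlazma cell (name depends on your layout)
dcache status | grep -i gplazma
# Restart just that domain
dcache restart gplazmaDomain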
Cheers, Onno
This sounds very familiar: see eu-emi/canl-java#92.
Previously, the problem was triggered by a CA changing their signing algorithm (from SHA-1 to SHA-256) without changing other aspects. This resulted in CaNL holding both the old and new certificates in different caches, which caused "problems".
At the time, I spent considerable time investigating this problem, but now don't remember all the details.
IIRC, the above bug was triggered by another bug (see eu-emi/canl-java#94), so fixing this second bug should result in the first bug not being triggered.
Although the first bug is closed, IIRC the problem was not actually fixed, so it isn't completely surprising this has reappeared.
Hi Paul, all, what is the state of affairs in dCache regarding these matters? Is the latest CAnL version being used? As of which dCache versions? Does it actually prevent the various issues we have seen in practice?
Hoi Maarten,
I happened to stumble into this commit which seems to answer your question. https://github.com/dCache/dcache/commit/17968a82bee11154cc26c6c93cfd756342fb2810
I don't see this commit yet in any release notes, but by the looks of it, it should appear soon in versions 6.2 and newer. So basically, all currently supported releases.
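If you want to check which CAnL version an installed dCache actually ships, you can look at the bundled jar (a sketch assuming the default RPM layout; the classes directory may live elsewhere for tarball installs):
# List the CAnL jar bundled with the installed dCache
ls /usr/share/dcache/classes/ | grep -i canl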
Groetjes Onno
The fix involves a major version upgrade for CaNL.
We are (naturally) being rather cautious about back-porting this to our stable branches.
The current status is that the upgrade was committed to our "master" branch a few weeks ago. We're watching to make sure it doesn't break anything.
Once we're happy, we will back-port the change to our 7.2 version. This should happen "soon".
This back-port will give us real-world experience with the few sites that have upgraded to 7.2.
Once that 7.2-backport is shown to be "safe" we'll then look at back-porting to the earlier branches.
Hi Paul, thanks for clarifying that sound strategy!
Hi Paul,
I've updated our test server dolphin12.grid.surfsara.nl to the latest snapshot, in case you'd like to run some tests. I may test a root cert downgrade & upgrade if I have time.
Cheers, Onno
That's great @onnozweers, many thanks!
I don't have any specific tests I would run. The main thing is to check it isn't obviously broken before back-porting to 7.2.
So, please run your usual battery of functional tests and/or any manual testing you would normally do.
Having an extra data-point on whether the new version fixes the certificate roll-over problem would also be useful.
Cheers, Paul.
Hi all,
we at DESY-HH are happy to test it as well on our pre-prod cluster, which we've already upgraded to 7.2. We could also run a test similar to Onno's. Should we try to build and deploy the master branch? By now we've received two tickets from ATLAS and one from CMS encouraging us to restart the whole instance. Currently we plan to update the ATLAS dCache at the beginning of December and would appreciate a release by then.
I might come back with some questions on the build, but I remember I was able to build it following Onno's very comprehensive talk from the Madrid workshop.
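For completeness, what I remember of the build is roughly this (a sketch from memory; the exact Maven modules and profiles may differ, so please check the project's build documentation):
# Clone and build dCache from source, skipping the test suite to save time
git clone https://github.com/dCache/dcache.git
cd dcache
mvn clean package -DskipTests
# The packaged artifacts (tarball/RPM) should end up under the packages/ submodules
ls packages/*/target/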
Thanks a lot, Christian
The Dutchgrid CA root cert changed from 1.105 to 1.106, so that put us Dutch users in a good position to test this bug fix. We know this change caused problems in May 2020. https://dist.eugridpma.info/distribution/igtf/1.106/CHANGES
CA root cert downgrade command (Dutchgrid = Nikhef):
rpm -Uvh --oldpackage --nodeps https://dist.eugridpma.info/distribution/igtf/1.105/accredited/RPMS/ca_NIKHEF-1.105-1.noarch.rpm
SRM test commands (with a Dutchgrid user cert):
voms-proxy-init --voms dteam
srmls srm://dolphin12.grid.surfsara.nl:8443/groups/dteam
Verified DN in the SRM access file: user.dn="/O=dutchgrid/O=users/O=sara/CN=<name>"
WebDAV test command:
curl --capath /etc/grid-security/certificates/ --cert $X509_USER_PROXY --cacert $X509_USER_PROXY --fail --location https://dolphin12.grid.surfsara.nl:2884/groups/dteam/
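To make the downgrade/upgrade cycle easy to repeat, the steps above can be wrapped in a small script (a sketch; host name, proxy handling and package versions are the ones from this particular test and will differ elsewhere):
#!/bin/bash
# Rough CA roll-over test cycle: downgrade, test, restart, upgrade, test.
# Aborts at the first failing step so a regression is easy to spot.
set -e

test_access() {
    # SRM and WebDAV checks against the test host
    srmls srm://dolphin12.grid.surfsara.nl:8443/groups/dteam
    curl --capath /etc/grid-security/certificates/ \
         --cert "$X509_USER_PROXY" --cacert "$X509_USER_PROXY" \
         --fail --location https://dolphin12.grid.surfsara.nl:2884/groups/dteam/
}

voms-proxy-init --voms dteam

# Downgrade the Dutchgrid (Nikhef) CA cert and test, with and without a restart
rpm -Uvh --oldpackage --nodeps https://dist.eugridpma.info/distribution/igtf/1.105/accredited/RPMS/ca_NIKHEF-1.105-1.noarch.rpm
test_access
systemctl restart dcache.target
test_access

# Upgrade back to the latest bundle (assumes the IGTF yum repo is configured) and test again
yum update -y ca_NIKHEF
test_access
systemctl restart dcache.target
test_access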
Test steps and results:
Nr | Step | Result |
---|---|---|
1 | Verify newest CA certs are installed | ca_policy_igtf-slcs-1.113-1.noarch |
2 | Test an srmls with Dutchgrid user cert | OK |
3 | Downgrade Dutchgrid CA cert to 1.105 | ca_NIKHEF-1.105-1.noarch |
4 | Test srmls with Dutchgrid cert | OK |
5 | systemctl restart dcache.target | done |
6 | Test srmls with Dutchgrid cert | OK |
7 | Test WebDAV with curl | OK |
8 | Upgrade CA certs to latest version | ca_NIKHEF-1.113-1.noarch |
9 | Test srmls with Dutchgrid cert | OK |
10 | Test WebDAV with curl | OK |
11 | systemctl restart dcache.target | done |
12 | Test srmls with Dutchgrid cert | OK |
13 | Test WebDAV with curl | OK |
Looks perfect!
It looks like we're ready to backport this fix to 7.2.
https://github.com/dCache/dcache/pull/6243
If this doesn't yield any problems, I think it's reasonable to backport the change to the other supported branches in December 2021.
Hi Paul and Onno,
thanks a lot! Due to other burning issues we're unable to build the master branch and deploy it. But Onno seems to confirm that the fix works as expected.
Thanks a lot for your work! Christian
So is it safe to say that dcache can handle a CA update without needing to restart gplazma anymore, in >= v7.2 at least?
Hi @rptaylor, we still have an issue with xroot+TLS, which will be fixed in the next few days.