Grid CA cert update caused nullPointerException; required gPlazma restart

Open onnozweers opened this issue 4 years ago • 13 comments

Dear dCache devs,

At Tigran's request, here is some info on this issue, which was discussed on the mailing list.

We're running dCache 6.0.9. Yesterday we upgraded our Grid CA root cert bundle to support the new Terena/Sectigo certificates. We had 1.104, and we went to:

[root@dcmain /etc/grid-security/certificates]# rpm -q ca-policy-egi-core
ca-policy-egi-core-1.106-1.noarch

Since then, we had users from various VOs with authentication problems:

09:56 ui.grid.surfsara.nl:/home/onno 
onno$ voms-proxy-init -voms projects.nl:/projects.nl/geomodel

09:58 ui.grid.surfsara.nl:/home/onno 
onno$ globus-url-copy gsiftp://gridftp.grid.sara.nl:2811/pnfs/grid.sara.nl/data/projects.nl/geomodel/d_Mesh/meshfile.msh file:///$(pwd)/geomodel-test/test1

error: globus_ftp_client: the server responded with an error
530 Login denied

And:

10:02 ui.grid.surfsara.nl:/home/onno 
onno$ voms-proxy-init -voms lsgrid:/lsgrid

10:04 ui.grid.surfsara.nl:/home/onno 
onno$ uberftp -ls gsiftp://gridftp.grid.surfsara.nl:2811/pnfs/grid.sara.nl/data/lsgrid
530 Login denied

10:05 ui.grid.surfsara.nl:/home/onno 
onno$ srmls srm://srm.grid.sara.nl:8443/pnfs/grid.sara.nl/data/lsgrid
Wed May 13 10:09:12 CEST 2020: Return status:
 - Status code:  SRM_AUTHORIZATION_FAILURE
 - Explanation:  login failed
srm ls response path details array is null!

But with dteam, which is hosted by another VOMS server, there was no problem:

09:58 ui.grid.surfsara.nl:/home/onno 
onno$ voms-proxy-init -voms dteam

10:02 ui.grid.surfsara.nl:/home/onno 
onno$ srmls srm://srm.grid.sara.nl:8443/pnfs/grid.sara.nl/data/dteam

This produced a normal listing.

The problem seemed to happen to users from VOs on voms.grid.surfsara.nl. I contacted the admin of that server, and he told me he had updated the Grid CA root certs on that host yesterday morning, before I updated them on our dCache nodes.

On the dCache server side, we saw this in the gPlazma logs:

13 May 2020 03:41:28 (gPlazma) [door:GFTP-shark12-AAWlfa3CuOA@ftp-shark12Domain GFTP-shark12-AAWlfa3CuOA Login AUTH voms] Bug in plugin: 
java.lang.NullPointerException: null
13 May 2020 03:41:28 (gPlazma) [door:GFTP-shark16-AAWlfa3Jqjg@ftp-shark16Domain GFTP-shark16-AAWlfa3Jqjg Login AUTH voms] Bug in plugin: 
java.lang.NullPointerException: null
13 May 2020 03:41:31 (gPlazma) [10Y:27003294:srm2:rm SRM-srm Login AUTH voms] Bug in plugin: 
java.lang.NullPointerException: null
13 May 2020 03:41:32 (gPlazma) [door:GFTP-guppy4-AAWlfa4Ba6g@ftp-guppy4Domain GFTP-guppy4-AAWlfa4Ba6g Login AUTH voms] Bug in plugin: 
java.lang.NullPointerException: null

Yesterday's gPlazma log contains traces like this one (the first occurrence):

12 May 2020 16:34:08 (gPlazma) [door:GFTP-guppy4-AAWldFs3z9g@ftp-guppy4Domain GFTP-guppy4-AAWldFs3z9g Login AUTH voms] Bug in plugin: 
java.lang.NullPointerException: null
	at eu.emi.security.authn.x509.helpers.pkipath.bc.RFC3280CertPathUtilities.processCRLF(RFC3280CertPathUtilities.java:505)
	at eu.emi.security.authn.x509.helpers.pkipath.bc.RFC3280CertPathUtilitiesCanl.processCRLF2(RFC3280CertPathUtilitiesCanl.java:564)
	at eu.emi.security.authn.x509.helpers.pkipath.bc.RFC3280CertPathUtilitiesCanl.checkCRL(RFC3280CertPathUtilitiesCanl.java:305)
	at eu.emi.security.authn.x509.helpers.pkipath.bc.RFC3280CertPathUtilitiesCanl.checkCRLs2(RFC3280CertPathUtilitiesCanl.java:149)
	at eu.emi.security.authn.x509.helpers.revocation.CRLRevocationChecker.checkRevocation(CRLRevocationChecker.java:53)
	at eu.emi.security.authn.x509.helpers.pkipath.bc.FixedBCPKIXCertPathReviewer.checkRevocation(FixedBCPKIXCertPathReviewer.java:1746)
	at eu.emi.security.authn.x509.helpers.pkipath.bc.FixedBCPKIXCertPathReviewer.checkSignatures(FixedBCPKIXCertPathReviewer.java:739)
	at eu.emi.security.authn.x509.helpers.pkipath.bc.FixedBCPKIXCertPathReviewer.doChecks(FixedBCPKIXCertPathReviewer.java:218)
	at org.bouncycastle.x509.PKIXCertPathReviewer.getErrors(Unknown Source)
	at eu.emi.security.authn.x509.helpers.pkipath.BCCertPathValidator.checkNonProxyChain(BCCertPathValidator.java:322)
	at eu.emi.security.authn.x509.helpers.pkipath.BCCertPathValidator.validate(BCCertPathValidator.java:138)
	at eu.emi.security.authn.x509.helpers.pkipath.AbstractValidator.validate(AbstractValidator.java:134)
	at eu.emi.security.authn.x509.impl.OpensslCertChainValidator.validate(OpensslCertChainValidator.java:227)
	at org.italiangrid.voms.ac.impl.DefaultVOMSValidationStrategy.validateCertificateChain(DefaultVOMSValidationStrategy.java:358)
	at org.italiangrid.voms.ac.impl.DefaultVOMSValidationStrategy.checkLSCSignature(DefaultVOMSValidationStrategy.java:183)
	at org.italiangrid.voms.ac.impl.DefaultVOMSValidationStrategy.checkSignature(DefaultVOMSValidationStrategy.java:204)
	at org.italiangrid.voms.ac.impl.DefaultVOMSValidationStrategy.validateAC(DefaultVOMSValidationStrategy.java:325)
	at org.italiangrid.voms.ac.impl.DefaultVOMSValidator.internalValidate(DefaultVOMSValidator.java:155)
	at org.italiangrid.voms.ac.impl.DefaultVOMSValidator.validateWithResult(DefaultVOMSValidator.java:144)
	at org.dcache.gplazma.plugins.VomsPlugin.authenticate(VomsPlugin.java:101)
	at org.dcache.gplazma.plugins.GPlazmaAuthenticationPlugin.authenticate(GPlazmaAuthenticationPlugin.java:49)
	at org.dcache.gplazma.strategies.DefaultAuthenticationStrategy.lambda$authenticate$0(DefaultAuthenticationStrategy.java:67)
	at org.dcache.gplazma.strategies.PAMStyleStrategy.callPlugins(PAMStyleStrategy.java:98)
	at org.dcache.gplazma.strategies.DefaultAuthenticationStrategy.authenticate(DefaultAuthenticationStrategy.java:57)
	at org.dcache.gplazma.GPlazma$Setup.doAuthPhase(GPlazma.java:565)
	at org.dcache.gplazma.GPlazma.login(GPlazma.java:276)
	at org.dcache.gplazma.GPlazma.login(GPlazma.java:235)
	at org.dcache.auth.Gplazma2LoginStrategy.login(Gplazma2LoginStrategy.java:156)
	at org.dcache.services.login.MessageHandler.messageArrived(MessageHandler.java:69)
	at sun.reflect.GeneratedMethodAccessor19.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.dcache.cells.CellMessageDispatcher$LongReceiver.deliver(CellMessageDispatcher.java:304)
	at org.dcache.cells.CellMessageDispatcher.call(CellMessageDispatcher.java:201)
	at org.dcache.cells.AbstractCell.messageArrived(AbstractCell.java:326)
	at dmg.cells.nucleus.CellAdapter.messageArrived(CellAdapter.java:890)
	at dmg.cells.nucleus.CellNucleus$DeliverMessageTask.run(CellNucleus.java:1226)
	at org.dcache.util.BoundedExecutor$Worker.run(BoundedExecutor.java:251)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at dmg.cells.nucleus.CellNucleus.lambda$wrapLoggingContext$4(CellNucleus.java:756)
	at java.lang.Thread.run(Thread.java:748)

And:

13 May 2020 07:53:50 (gPlazma) [10Y:27150285:srm2:bringOnline SRM-srm Login AUTH voms] Bug in plugin: 
java.lang.NullPointerException: null
13 May 2020 07:53:50 (gPlazma) [10Y:27150285:srm2:bringOnline SRM-srm Login] Login attempt failed; detailed explanation follows:
LOGIN FAIL
 |    in: Origin[195.169.155.172]
 |        X509 Certificate chain:
 |          |
 |          +--CN=proxy,CN=proxy,CN=proxy,CN=Robot - [email protected],O=ASTRON,C=NL,DC=tcs,DC=terena,DC=org [17112265448961627630926914932032212362]
 |          |    |
 |          |    +--Issuer: CN=proxy,CN=proxy,CN=Robot - [email protected],O=ASTRON,C=NL,DC=tcs,DC=terena,DC=org
 |          |    +--Validity: OK for 5 days, 20 hours, 26 minutes and 10.3 seconds
 |          |    +--Algorithm: SHA-256 with RSA
 |          |    +--Public key: RSA 2048 bits
 |          |    +--Key usage: digital signature, key encipherment, data encipherment
 |          |
 |          +--CN=proxy,CN=proxy,CN=Robot - [email protected],O=ASTRON,C=NL,DC=tcs,DC=terena,DC=org [17112265448961627630926914932032212362]
 |          |    |
 |          |    +--Issuer: CN=proxy,CN=Robot - [email protected],O=ASTRON,C=NL,DC=tcs,DC=terena,DC=org
 |          |    +--Validity: OK for 5 days, 20 hours, 26 minutes and 10.3 seconds
 |          |    +--Algorithm: SHA-256 with RSA
 |          |    +--Public key: RSA 2048 bits
 |          |    +--Key usage: digital signature, key encipherment, data encipherment
 |          |
 |          +--CN=proxy,CN=Robot - [email protected],O=ASTRON,C=NL,DC=tcs,DC=terena,DC=org [17112265448961627630926914932032212362]
 |          |    |
 |          |    +--Issuer: CN=Robot - [email protected],O=ASTRON,C=NL,DC=tcs,DC=terena,DC=org
 |          |    +--Validity: OK for 5 days, 20 hours, 26 minutes and 10.3 seconds
 |          |    +--Algorithm: SHA-256 with RSA
 |          |    +--Public key: RSA 2048 bits
 |          |    +--Attribute certificates:
 |          |    |    |
 |          |    |    +--CN=voms.grid.sara.nl,OU=sara.nl,O=hosts,O=dutchgrid
 |          |    |         +--Validity: OK for 5 days, 20 hours, 26 minutes and 10.3 seconds
 |          |    |         +--Extensions:
 |          |    |         |    |
 |          |    |         |    +--Issuer: CN=NIKHEF medium-security certification auth,O=NIKHEF,C=NL
 |          |    |         |    +--No revocation info
 |          |    |         |    +--Authority key identifier
 |          |    |         +--Algorithm: SHA-256 with RSA
 |          |    |         +--FQANs: /lofar/ops, /lofar, ...
 |          |    +--Key usage: digital signature, key encipherment, data encipherment
 |          |
 |          +--CN=Robot - [email protected],O=ASTRON,C=NL,DC=tcs,DC=terena,DC=org [17112265448961627630926914932032212362]
 |               |
 |               +--Issuer: CN=TERENA eScience Personal CA 3,O=TERENA,L=Amsterdam,ST=Noord-Holland,C=NL
 |               +--Validity: OK for 352 days, 6 hours, 6 minutes and 9.3 seconds
 |               +--Algorithm: SHA-256 with RSA
 |               +--Public key: RSA 2048 bits
 |               +--Subject alternative names: email: [email protected]
 |               +--Key usage: digital signature, key encipherment, data encipherment, SSL client, email protection
 |        
 |   out: EntityDefinitionPrincipal[Robot]
 |        Origin[195.169.155.172]
 |        /DC=org/DC=terena/DC=tcs/C=NL/O=ASTRON/CN=Robot - [email protected]
 |        EmailAddressPrincipal[[email protected]]
 |        LoAPrincipal[IGTF-AP:Classic]
 |
 +--AUTH OK
 |   |    added: EntityDefinitionPrincipal[Robot]
 |   |           /DC=org/DC=terena/DC=tcs/C=NL/O=ASTRON/CN=Robot - [email protected]
 |   |           EmailAddressPrincipal[[email protected]]
 |   |           LoAPrincipal[IGTF-AP:Classic]
 |   |
 |   +--x509 OPTIONAL:OK => OK
 |   |      added: EntityDefinitionPrincipal[Robot]
 |   |             /DC=org/DC=terena/DC=tcs/C=NL/O=ASTRON/CN=Robot - [email protected]
 |   |             EmailAddressPrincipal[[email protected]]
 |   |             LoAPrincipal[IGTF-AP:Classic]
 |   |
 |   +--voms OPTIONAL:FAIL (null) => OK
 |   |
 |   +--kpwd OPTIONAL:FAIL (no username and password) => OK
 |   |
 |   +--jaas OPTIONAL:FAIL (no login name) => OK
 |
 +--MAP OK
 |   |
 |   +--vorolemap OPTIONAL:FAIL (no record) => OK
 |   |
 |   +--gridmap OPTIONAL:FAIL (no mapping) => OK
 |   |
 |   +--mutator OPTIONAL:OK => OK
 |   |
 |   +--authzdb SUFFICIENT:FAIL (no mappable principal) => OK
 |   |
 |   +--kpwd SUFFICIENT:FAIL (no login name) => OK
 |   |
 |   +--ldap SUFFICIENT:OK => OK (ends the phase)
 |
 +--ACCOUNT OK
 |   |
 |   +--banfile REQUISITE:OK => OK
 |   |
 |   +--kpwd SUFFICIENT:OK => OK (ends the phase)
 |
 +--SESSION OK
 |   |
 |   +--roles REQUIRED:OK => OK
 |   |
 |   +--authzdb SUFFICIENT:FAIL (no username principal) => OK
 |   |
 |   +--kpwd SUFFICIENT:FAIL (no record found) => OK
 |   |
 |   +--ldap SUFFICIENT:OK => OK (ends the phase)
 |
 +--VALIDATION FAIL (no username, no UID, no primary GID)
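
For anyone trying to spot the same symptom, here is a minimal sketch that counts the "Bug in plugin" entries per gPlazma plugin in a log file. The function name `count_plugin_bugs` is made up for illustration, and it assumes the log lines have the same layout as the excerpts above (plugin name directly before the closing bracket).

```shell
#!/bin/sh
# Count "Bug in plugin" log entries per gPlazma plugin.
# Assumption: lines look like
#   <date> (gPlazma) [<context> Login AUTH <plugin>] Bug in plugin:
# Usage: count_plugin_bugs <logfile>
count_plugin_bugs() {
    # Extract the last word inside the brackets (the plugin name),
    # then tally identical names and print "<plugin> <count>".
    sed -n 's/.* \([A-Za-z0-9_-]*\)\] Bug in plugin:.*/\1/p' "$1" \
        | sort | uniq -c | awk '{print $2, $1}'
}
```

It could be called as `count_plugin_bugs /var/log/dcache/gPlazmaDomain.log`; that path is an assumption and depends on the name of the domain that hosts gPlazma. A steadily growing count for `voms` right after a CA bundle update would point at this issue.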

We heard from our Nikhef colleagues that they had a similar problem for the Xenon VO, which they fixed by restarting some of their dCache domains. So I restarted our gPlazma domain and sure enough that fixed the issue.

I have also captured half a minute of gPlazma log at DEBUG level, before I restarted it. I can share that if you're interested.

Cheers, Onno

onnozweers avatar May 13 '20 12:05 onnozweers

This sounds very familiar: see eu-emi/canl-java#92.

Previously, the problem was triggered by a CA changing their signing algorithm (from SHA-1 to SHA-256) without changing other aspects. This resulted in CaNL holding both the old and new certificates in different caches, which caused "problems".

At the time, I spent considerable time investigating this problem, but now don't remember all the details.

IIRC, the above bug was triggered by another bug (see eu-emi/canl-java#94), so fixing this second bug should result in the first bug not being triggered.

Although the first bug is closed, IIRC the problem was not actually fixed, so it isn't completely surprising this has reappeared.

paulmillar avatar May 13 '20 18:05 paulmillar

Hi Paul, all, what is the state of affairs in dCache regarding these matters? Is the latest CAnL version being used? As of which dCache versions? Does it actually prevent the various issues we have seen in practice?

maarten-litmaath avatar Nov 02 '21 14:11 maarten-litmaath

Hoi Maarten,

I happened to stumble into this commit which seems to answer your question. https://github.com/dCache/dcache/commit/17968a82bee11154cc26c6c93cfd756342fb2810

I don't see this commit yet in any release notes, but by the looks of it it should appear soon in versions 6.2 and newer. So basically, all currently supported releases.

Groetjes Onno

onnozweers avatar Nov 02 '21 15:11 onnozweers

The fix involves a major version upgrade for CaNL.

We are (naturally) being rather cautious about back-porting this to our stable branches.

The current status is that the upgrade was committed to our "master" branch a few weeks ago. We're watching that this doesn't break anything.

Once we're happy, we will then back-port the change to our 7.2 version. This should happen "soon".

This back-port will give us real-world experience, with the few sites that have upgraded to 7.2.

Once that 7.2-backport is shown to be "safe" we'll then look at back-porting to the earlier branches.

paulmillar avatar Nov 02 '21 16:11 paulmillar

Hi Paul, thanks for clarifying that sound strategy!

maarten-litmaath avatar Nov 02 '21 16:11 maarten-litmaath

Hi Paul,

I've updated our test server dolphin12.grid.surfsara.nl to the latest snapshot, in case you'd like to run some tests. I may test a root cert downgrade & upgrade if I have time.

Cheers, Onno

onnozweers avatar Nov 03 '21 16:11 onnozweers

That's great @onnozweers, many thanks!

I don't have any specific tests I would run. The main thing is to check it isn't obviously broken before back-porting to 7.2.

So, please run your usual battery of functional tests and/or any manual testing you would normally do.

Having an extra data-point on whether the new version fixes the certificate roll-over problem would also be useful.

Cheers, Paul.

paulmillar avatar Nov 03 '21 17:11 paulmillar

Hi all,

We at DESY-HH are also happy to test it on our pre-prod cluster, which we have already upgraded to 7.2. We could also run a test similar to Onno's. Should we try to build and deploy the master branch? By now we have two tickets from ATLAS and one from CMS encouraging us to restart the whole instance. We currently plan to update the ATLAS dCache at the beginning of December and would appreciate a release by then.

I might come back with some questions on the build, but I remember I was able to build it following Onno's very comprehensive talk from the Madrid workshop.

Thanks a lot, Christian

christianvoss avatar Nov 09 '21 08:11 christianvoss

The Dutchgrid CA root cert changed from 1.105 to 1.106, so that put us Dutch users in a good position to test this bug fix. We know this change caused problems in May 2020. https://dist.eugridpma.info/distribution/igtf/1.106/CHANGES

CA root cert downgrade command (Dutchgrid = Nikhef):

rpm -Uvh --oldpackage --nodeps https://dist.eugridpma.info/distribution/igtf/1.105/accredited/RPMS/ca_NIKHEF-1.105-1.noarch.rpm

SRM test commands (with a Dutchgrid user cert):

voms-proxy-init --voms dteam
srmls srm://dolphin12.grid.surfsara.nl:8443/groups/dteam

Verified DN in the SRM access file: user.dn="/O=dutchgrid/O=users/O=sara/CN=<name>"

WebDAV test command:

curl --capath /etc/grid-security/certificates/ --cert $X509_USER_PROXY --cacert $X509_USER_PROXY --fail --location https://dolphin12.grid.surfsara.nl:2884/groups/dteam/

Test steps and results:

| Nr | Step | Result |
|----|------|--------|
| 1 | Verify newest CA certs are installed | ca_policy_igtf-slcs-1.113-1.noarch |
| 2 | Test an srmls with Dutchgrid user cert | OK |
| 3 | Downgrade Dutchgrid CA cert to 1.105 | ca_NIKHEF-1.105-1.noarch |
| 4 | Test srmls with Dutchgrid cert | OK |
| 5 | systemctl restart dcache.target | done |
| 6 | Test srmls with Dutchgrid cert | OK |
| 7 | Test WebDAV with curl | OK |
| 8 | Upgrade CA certs to latest version | ca_NIKHEF-1.113-1.noarch |
| 9 | Test srmls with Dutchgrid cert | OK |
| 10 | Test WebDAV with curl | OK |
| 11 | systemctl restart dcache.target | done |
| 12 | Test srmls with Dutchgrid cert | OK |
| 13 | Test WebDAV with curl | OK |

Looks perfect!
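
For anyone who wants to repeat the cycle on their own test instance, the steps above can be sketched roughly as follows. This is an illustration, not the exact procedure that was run: the helper names (`run`, `check_doors`, `rollover_cycle`) are made up, the script runs both the SRM and the WebDAV check at every stage, and the upgrade command (`yum -y update ca_NIKHEF`) is an assumption, since only the downgrade command is given above. By default it only prints the commands.

```shell
#!/bin/sh
# Dry-run sketch of the CA roll-over test cycle.
# DRY_RUN=1 (default) only prints each command; DRY_RUN=0 executes it.
DRY_RUN="${DRY_RUN:-1}"
HOST="${HOST:-dolphin12.grid.surfsara.nl}"
OLD_CA_RPM="https://dist.eugridpma.info/distribution/igtf/1.105/accredited/RPMS/ca_NIKHEF-1.105-1.noarch.rpm"

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# One SRM listing and one WebDAV request against the test host.
check_doors() {
    run srmls "srm://$HOST:8443/groups/dteam" &&
    run curl --capath /etc/grid-security/certificates/ \
        --cert "$X509_USER_PROXY" --cacert "$X509_USER_PROXY" \
        --fail --location "https://$HOST:2884/groups/dteam/"
}

# Baseline, downgrade, restart, upgrade, restart, with a check after each.
# The "yum -y update" step is an assumed way to get back to the latest CA.
rollover_cycle() {
    check_doors &&
    run rpm -Uvh --oldpackage --nodeps "$OLD_CA_RPM" &&
    check_doors &&
    run systemctl restart dcache.target &&
    check_doors &&
    run yum -y update ca_NIKHEF &&
    check_doors &&
    run systemctl restart dcache.target &&
    check_doors
}
```

After sourcing the file, calling `rollover_cycle` prints the planned commands; with `DRY_RUN=0` it executes them and stops at the first failing step. A valid proxy (`voms-proxy-init --voms dteam`) and root privileges for the rpm and systemctl steps are needed in that case.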

onnozweers avatar Nov 10 '21 11:11 onnozweers

It looks like we're ready to backport this fix to 7.2.

https://github.com/dCache/dcache/pull/6243

If this doesn't yield any problems, I think it's reasonable to backport the change to the other supported branches in December 2021.

paulmillar avatar Nov 10 '21 12:11 paulmillar

Hi Paul and Onno,

Thanks a lot! Due to other burning issues we're unable to build and deploy master. But Onno seems to confirm that the fix works as expected.

Thanks a lot for your work! Christian

christianvoss avatar Nov 10 '21 12:11 christianvoss

So is it safe to say that dcache can handle a CA update without needing to restart gplazma anymore, in >= v7.2 at least?

rptaylor avatar Feb 20 '24 19:02 rptaylor

Hi @rptaylor, we still have an issue with xroot+tls, which will be fixed in the next few days.

kofemann avatar Feb 20 '24 19:02 kofemann