[Tape REST API] Stage call failed: HTTP 500
Hello,
The following transfer failure involving the Tape REST API was reported to me:
https://fts3-atlas.cern.ch:8449/fts3/ftsmon/#/job/7c956118-c58c-11ee-9d7e-fa163e5a92fb
Error reason: STAGING [22] [Tape REST API] Stage call failed: HTTP 500 : Unexpected server error: 500 : {"detail":"Internal error: java.lang.NullPointerException","title":"Internal Server Error","status":"500"}
The file at the storage element is available during this transfer; other transfers retrieve the file successfully:
02.07 04:11:36 [door:WebDAV2-dcdoor31-internalipv6@webdav2-dcdoor31_httpsDomain:request] ["usatlas4":6439:69907:2620:0:210:9800:0:0:0:37] [0000D0A4CA2616BA43618DEAD6188496E3C3,2718524222] [/pnfs/usatlas.bnl.gov/MCTAPE/atlasgrouptape/mc12_8TeV/NTUP_JETMET/e1126_s1469_s1470_r3542_r3549_p1344/mc12_8TeV.147913.Pythia8_AU2CT10_jetjet_JZ3W.merge.NTUP_JETMET.e1126_s1469_s1470_r3542_r3549_p1344_tid01115537_00/NTUP_JETMET.01115537._000467.root.1] MCTAPE:MC@osm 1590 0 {0:""}
All the best, Carlos
Hi Carlos, could you please try to find and post the entire stack trace of the thrown NPE? Also, just to make sure: which dCache version are you running? Thanks. Lea
Hi Lea,
dCache 9.2.6, Java 11.
I will post NPE-related traces if any are produced.
All the best, Carlos
With dCache 9.2.11:
#curl -L --capath /etc/grid-security/certificates --cert /tmp/x509up_u0 --cacert /tmp/x509up_u0 --key /tmp/x509up_u0 -X POST https://dcfrontend.usatlas.bnl.gov:3880/api/v1/tape/stage -H "Content-Type: application/json" -d '{"files": [ {"diskLifetime":"P1D","path":"/pnfs/usatlas.bnl.gov/MCTAPE/atlasgrouptape/mc11_7TeV/NTUP_TOP/e2275_a186_s1571_a145_r2993_p937/mc11_7TeV.110198.McAtNloHerwigpp_UEEE4_MRSTMCal_CT10ME_ttbar_lnulnu.merge.NTUP_TOP.e2275_a186_s1571_a145_r2993_p937_tid01350830_00/NTUP_TOP.01350830._000087.root.3"}]}'
{"detail":"Internal error: java.lang.NullPointerException","title":"Internal Server Error","status":"500"}
and the logs show:
==> /var/log/dcache/dcfrontend01FrontGplazDomain.log <==
12 Mar 2024 10:34:56 (Frontend-dcfrontend01) [] Uncaught exception in thread jetty-458571
java.lang.NullPointerException: null
at org.dcache.restful.resources.tape.StageResources.submit(StageResources.java:296)
at jdk.internal.reflect.GeneratedMethodAccessor377.invoke(Unknown Source)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:566)
at org.glassfish.jersey.server.model.internal.ResourceMethodInvocationHandlerFactory.lambda$static$0(ResourceMethodInvocationHandlerFactory.java:52)
at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher$1.run(AbstractJavaResourceMethodDispatcher.java:124)
at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.invoke(AbstractJavaResourceMethodDispatcher.java:167)
at org.glassfish.jersey.server.model.internal.JavaResourceMethodDispatcherProvider$ResponseOutInvoker.doDispatch(JavaResourceMethodDispatcherProvider.java:176)
at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.dispatch(AbstractJavaResourceMethodDispatcher.java:79)
at org.glassfish.jersey.server.model.ResourceMethodInvoker.invoke(ResourceMethodInvoker.java:469)
at org.glassfish.jersey.server.model.ResourceMethodInvoker.apply(ResourceMethodInvoker.java:391)
at org.glassfish.jersey.server.model.ResourceMethodInvoker.apply(ResourceMethodInvoker.java:80)
at org.glassfish.jersey.server.ServerRuntime$1.run(ServerRuntime.java:253)
at org.glassfish.jersey.internal.Errors$1.call(Errors.java:248)
at org.glassfish.jersey.internal.Errors$1.call(Errors.java:244)
at org.glassfish.jersey.internal.Errors.process(Errors.java:292)
at org.glassfish.jersey.internal.Errors.process(Errors.java:274)
at org.glassfish.jersey.internal.Errors.process(Errors.java:244)
at org.glassfish.jersey.process.internal.RequestScope.runInScope(RequestScope.java:265)
at org.glassfish.jersey.server.ServerRuntime.process(ServerRuntime.java:232)
at org.glassfish.jersey.server.ApplicationHandler.handle(ApplicationHandler.java:679)
at org.glassfish.jersey.servlet.WebComponent.serviceImpl(WebComponent.java:392)
at org.glassfish.jersey.servlet.WebComponent.service(WebComponent.java:346)
at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:365)
at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:318)
at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:205)
at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:799)
at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:554)
at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:233)
at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1440)
at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:188)
at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:505)
at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:186)
at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1355)
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
at org.eclipse.jetty.server.handler.HandlerList.handle(HandlerList.java:59)
at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)
at org.dcache.http.AuthenticationHandler.access$001(AuthenticationHandler.java:55)
at org.dcache.http.AuthenticationHandler.lambda$handle$0(AuthenticationHandler.java:157)
at java.base/java.security.AccessController.doPrivileged(Native Method)
at java.base/javax.security.auth.Subject.doAs(Subject.java:361)
at org.dcache.http.AuthenticationHandler.handle(AuthenticationHandler.java:154)
at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)
at org.dcache.http.AbstractLoggingHandler.handle(AbstractLoggingHandler.java:110)
at org.eclipse.jetty.server.handler.HandlerList.handle(HandlerList.java:59)
at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)
at org.eclipse.jetty.rewrite.handler.RewriteHandler.handle(RewriteHandler.java:322)
at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)
at org.eclipse.jetty.server.Server.handle(Server.java:516)
at org.eclipse.jetty.server.HttpChannel.lambda$handle$1(HttpChannel.java:487)
at org.eclipse.jetty.server.HttpChannel.dispatch(HttpChannel.java:732)
at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:479)
at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:277)
at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:311)
at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:105)
at org.eclipse.jetty.io.ssl.SslConnection$DecryptedEndPoint.onFillable(SslConnection.java:555)
at org.eclipse.jetty.io.ssl.SslConnection.onFillable(SslConnection.java:410)
at org.eclipse.jetty.io.ssl.SslConnection$2.succeeded(SslConnection.java:164)
at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:105)
at org.eclipse.jetty.io.ChannelEndPoint$1.run(ChannelEndPoint.java:104)
at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:338)
at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:315)
at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:173)
at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:131)
at org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:409)
at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:883)
at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:1034)
at java.base/java.lang.Thread.run(Thread.java:829)
and
==> /var/log/dcache/dcfrontend02Domain.log <==
12 Mar 2024 10:34:56 (bulk) [] 044134c0-48a3-44a4-9b8c-f543d067917f - fetchAttributes, callback failure for TARGET [15393521, INITIAL, null][null][CREATED: (C 2024-03-12 10:34:56.862)(S null)(U 2024-03-12 10:34:56.862)(ret 0)][null] null : /fs/usatlas.bnl.gov/MCTAPE/atlasgrouptape/mc11_7TeV/NTUP_TOP/e2275_a186_s1571_a145_r2993_p937/mc11_7TeV.110198.McAtNloHerwigpp_UEEE4_MRSTMCal_CT10ME_ttbar_lnulnu.merge.NTUP_TOP.e2275_a186_s1571_a145_r2993_p937_tid01350830_00/NTUP_TOP.01350830._000087.root.3 (err null null).
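As an aside, the request body used in the curl calls above is plain JSON and can be built programmatically. A minimal sketch; the endpoint and field names are taken from the curl command itself, while the path below is just a placeholder:

```python
import json

def stage_request_body(paths, disk_lifetime="P1D"):
    """Build the JSON body for POST /api/v1/tape/stage, using the same
    fields as the curl calls above ("diskLifetime" is an ISO-8601
    duration, here one day)."""
    return json.dumps(
        {"files": [{"diskLifetime": disk_lifetime, "path": p}
                   for p in paths]})

# placeholder path, just to show the shape of the request
body = stage_request_body(
    ["/pnfs/usatlas.bnl.gov/MCTAPE/some/placeholder.root"])
```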
One more observation:
# curl --capath /etc/grid-security/certificates --cert /tmp/x509up_u0 --cacert /tmp/x509up_u0 --key /tmp/x509up_u0 https://dcfrontend.usatlas.bnl.gov:3880/api/v1/tape/stage/7cf32cfb-7c15-42f7-9e9b-0cbb1b08a37c
{
"id" : "7cf32cfb-7c15-42f7-9e9b-0cbb1b08a37c",
"createdAt" : 1710253683266,
"startedAt" : 1710253683278,
"completedAt" : 1710253683300,
"files" : [ {
"path" : "/fs/usatlas.bnl.gov/MCTAPE/atlasgrouptape/mc11_7TeV/NTUP_TOP/e2275_a186_s1571_a145_r2993_p937/mc11_7TeV.110198.McAtNloHerwigpp_UEEE4_MRSTMCal_CT10ME_ttbar_lnulnu.merge.NTUP_TOP.e2275_a186_s1571_a145_r2993_p937_tid01350830_00/NTUP_TOP.01350830._000087.root.3",
"finishedAt" : 1710253683269,
"startedAt" : 1710253683269,
"error" : "CacheException(rc=10001;msg=No such file or directory /fs/usatlas.bnl.gov/MCTAPE/atlasgrouptape/mc11_7TeV/NTUP_TOP/e2275_a186_s1571_a145_r2993_p937/mc11_7TeV.110198.McAtNloHerwigpp_UEEE4_MRSTMCal_CT10ME_ttbar_lnulnu.merge.NTUP_TOP.e2275_a186_s1571_a145_r29",
"state" : "FAILED"
} ]
}
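Side note for anyone reading the status reply: the createdAt/startedAt/completedAt fields are epoch milliseconds. A quick conversion sketch using the values from the reply above:

```python
from datetime import datetime, timezone

created_at = 1710253683266    # "createdAt" from the reply above
completed_at = 1710253683300  # "completedAt"

# epoch milliseconds -> timezone-aware UTC datetime
created = datetime.fromtimestamp(created_at / 1000, tz=timezone.utc)
print(created)                           # falls on 2024-03-12 (UTC)
print(completed_at - created_at, "ms")   # the request failed within 34 ms
```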
Please note the truncated path.
I am pretty sure the path is truncated due to a text width limit set on the DB end.
I will investigate the NPE.
Carlos,
Question: do attempts to call stage on other files in this directory also cause an NPE?
Dmitry
The path also appears truncated in the path column of the request_target table in the bulk DB.
Yes, it is limited, but it should not fail because of that. That's why I asked whether it fails only on some particular files or on any files.
Carlos,
Question: do attempts to call stage on other files in this directory also cause an NPE?
Dmitry
Yes, there is the same error for another file in this directory:
# curl -L --capath /etc/grid-security/certificates --cert /tmp/x509up_u0 --cacert /tmp/x509up_u0 --key /tmp/x509up_u0 -X POST https://dcfrontend.usatlas.bnl.gov:3880/api/v1/tape/stage -H "Content-Type: application/json" -d '{"files": [ {"diskLifetime":"P1D","path":"/pnfs/usatlas.bnl.gov/MCTAPE/atlasgrouptape/mc11_7TeV/NTUP_TOP/e2275_a186_s1571_a145_r2993_p937/mc11_7TeV.110198.McAtNloHerwigpp_UEEE4_MRSTMCal_CT10ME_ttbar_lnulnu.merge.NTUP_TOP.e2275_a186_s1571_a145_r2993_p937_tid01350830_00/NTUP_TOP.01350830._000019.root.2"}]}'
{"detail":"Internal error: java.lang.NullPointerException","title":"Internal Server Error","status":"500"}
OK,
Let's try the low-hanging fruit. Of course Al has tested the patch where truncation is applied to long paths, and it was not supposed to matter. But just in case, I can revert this and update the DB to accept longer path names.
Will you be able to test the RPM I will make for you? Do you have a test / integration system where you can test it independently from production?
Dmitry
P.S.: the rationale for path truncation was to keep the DB smaller.
Will you be able to test the RPM I will make for you? Do you have a test / integration system where you can test it independently from production?
yes, sure
OK, here is the RPM:
https://drive.google.com/file/d/1VoYKz1MCrI5_TCgZ8BXpVh5bmhqZSoVk/view?usp=sharing
This is on top of 9.2. It needs to be deployed only where you run bulk.
Having looked at the code, I see that:
"error" : "CacheException(rc=10001;msg=No such file or directory /fs/usatlas.bnl.gov/MCTAPE/atlasgrouptape/mc11_7TeV/NTUP_TOP/e2275_a186_s1571_a145_r2993_p937/mc11_7TeV.110198.McAtNloHerwigpp_UEEE4_MRSTMCal_CT10ME_ttbar_lnulnu.merge.NTUP_TOP.e2275_a186_s1571_a145_r29"
the error string is simply truncated to 256 characters:
>>> p="CacheException(rc=10001;msg=No such file or directory /fs/usatlas.bnl.gov/MCTAPE/atlasgrouptape/mc11_7TeV/NTUP_TOP/e2275_a186_s1571_a145_r2993_p937/mc11_7TeV.110198.McAtNloHerwigpp_UEEE4_MRSTMCal_CT10ME_ttbar_lnulnu.merge.NTUP_TOP.e2275_a186_s1571_a145_r29"
>>> len(p)
256
>>>
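The cut can also be reproduced from the full (untruncated) message: slicing at 256 characters lands exactly on the observed `..._a145_r29` ending.

```python
# full error message as the bulk service would have produced it,
# before the 256-character limit is applied
path = ("/fs/usatlas.bnl.gov/MCTAPE/atlasgrouptape/mc11_7TeV/NTUP_TOP/"
        "e2275_a186_s1571_a145_r2993_p937/"
        "mc11_7TeV.110198.McAtNloHerwigpp_UEEE4_MRSTMCal_CT10ME_ttbar_lnulnu"
        ".merge.NTUP_TOP.e2275_a186_s1571_a145_r2993_p937_tid01350830_00/"
        "NTUP_TOP.01350830._000087.root.3")
full_msg = "CacheException(rc=10001;msg=No such file or directory " + path
truncated = full_msg[:256]   # what ends up in the "error" field
print(len(truncated))        # 256
print(truncated[-24:])       # ends in ..._a145_r29, as observed
```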
So the truncated path is not a problem.
But could you check whether this path:
/fs/usatlas.bnl.gov/MCTAPE/atlasgrouptape/mc11_7TeV/NTUP_TOP/e2275_a186_s1571_a145_r2993_p937/mc11_7TeV.110198.McAtNloHerwigpp_UEEE4_MRSTMCal_CT10ME_ttbar_lnulnu.merge.NTUP_TOP.e2275_a186_s1571_a145_r2993_p937_tid01350830_00/NTUP_TOP.01350830._000087.root.3
indeed exists.
I suspect the path should start with /pnfs.
What is the absolute path of this file? I think the issue has to do with prefix handling. Prefix handling was fixed some time ago, and the fix is contained in the RPM above, so please test the RPM.
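Not the actual dCache code, but as a general illustration of the failure mode: when a door root prefix is stripped or re-applied without an explicit match check, a mangled path like the /fs/... one above can result silently. A defensive sketch:

```python
from pathlib import PurePosixPath

def strip_root(path: str, root: str) -> str:
    """Remove a root prefix from a path, failing loudly on a
    mismatch instead of silently yielding a mangled path."""
    try:
        return str(PurePosixPath(path).relative_to(PurePosixPath(root)))
    except ValueError:
        raise ValueError(f"{path!r} is not under root {root!r}") from None

# a matching root strips cleanly; a non-matching one raises
print(strip_root("/pnfs/usatlas.bnl.gov/MCTAPE/f.root",
                 "/pnfs/usatlas.bnl.gov"))   # MCTAPE/f.root
```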
What will also help: for the path
/pnfs/usatlas.bnl.gov/MCTAPE/atlasgrouptape/mc11_7TeV/NTUP_TOP/e2275_a186_s1571_a145_r2993_p937/mc11_7TeV.110198.McAtNloHerwigpp_UEEE4_MRSTMCal_CT10ME_ttbar_lnulnu.merge.NTUP_TOP.e2275_a186_s1571_a145_r2993_p937_tid01350830_00/NTUP_TOP.01350830._000087.root.3
find out its PNFSID, and then, with that PNFSID, run
\sn pathfinder <PNFSID>
This will help to understand what is happening.
But could you check whether this path:
/fs/usatlas.bnl.gov/MCTAPE/atlasgrouptape/mc11_7TeV/NTUP_TOP/e2275_a186_s1571_a145_r2993_p937/mc11_7TeV.110198.McAtNloHerwigpp_UEEE4_MRSTMCal_CT10ME_ttbar_lnulnu.merge.NTUP_TOP.e2275_a186_s1571_a145_r2993_p937_tid01350830_00/NTUP_TOP.01350830._000087.root.3
indeed exists.
No, it doesn't.
Yes, the path should indeed start with /pnfs:
> \sn pathfinder 00003BE34500448543C9A3E7CC23149D846C
/pnfs/usatlas.bnl.gov/MCTAPE/atlasgrouptape/mc11_7TeV/NTUP_TOP/e2275_a186_s1571_a145_r2993_p937/mc11_7TeV.110198.McAtNloHerwigpp_UEEE4_MRSTMCal_CT10ME_ttbar_lnulnu.merge.NTUP_TOP.e2275_a186_s1571_a145_r2993_p937_tid01350830_00/NTUP_TOP.01350830._000087.root.3
OK, give this RPM a try please.
Any progress on this?
Hi,
We experienced the same issue yesterday while testing the upgrade to dCache 9.2.15 on our pre-production instance. We opened an RT ticket and @lemora pointed us to this ticket, so that is why I am jumping in ;-) We have tested Dmitry's RPM with two configurations:
- One bulk domain and the RPM installed on the servers running the bulk and frontend domains.
- Two bulk domains and the RPM installed on the servers running the bulk and frontend domains.
With just one bulk domain, the error has disappeared, but with the second configuration, the error persists.
If we can be of any further help, please let us know.
Cheers, Samuel
As suggested by @DmitryLitvintsev, we have also tested a single bulk domain with the standard 9.2.15 RPM, and in that case the issue also disappeared.
Yes, we moved to a single bulk domain as well, and this addressed the issue. Now we're waiting for support for multiple bulk domains ;)