dcache icon indicating copy to clipboard operation
dcache copied to clipboard

[Tape REST API] Stage call failed: HTTP 500

Open cfgamboa opened this issue 1 year ago • 19 comments

Hello,

It was reported to me the following transfer failure involving the TAPE REST API

https://fts3-atlas.cern.ch:8449/fts3/ftsmon/#/job/7c956118-c58c-11ee-9d7e-fa163e5a92fb

Error reason: STAGING [22] [Tape REST API] Stage call failed: HTTP 500 : Unexpected server error: 500 : {"detail":"Internal error: java.lang.NullPointerException","title":"Internal Server Error","status":"500"}

The file at the storage element is available during this transfer, other transfers successfully retrieve the file

02.07 04:11:36 [door:WebDAV2-dcdoor31-internalipv6@webdav2-dcdoor31_httpsDomain:request] ["usatlas4":6439:69907:2620:0:210:9800:0:0:0:37] [0000D0A4CA2616BA43618DEAD6188496E3C3,2718524222] [/pnfs/usatlas.bnl.gov/MCTAPE/atlasgrouptape/mc12_8TeV/NTUP_JETMET/e1126_s1469_s1470_r3542_r3549_p1344/mc12_8TeV.147913.Pythia8_AU2CT10_jetjet_JZ3W.merge.NTUP_JETMET.e1126_s1469_s1470_r3542_r3549_p1344_tid01115537_00/NTUP_JETMET.01115537._000467.root.1] MCTAPE:MC@osm 1590 0 {0:""}

All the best, Carlos

cfgamboa avatar Feb 07 '24 14:02 cfgamboa

Hi Carlos, could you please try to find and post the entire stack trace of the thrown NPE? Also, just to make sure: which dCache version are you running? Thanks. Lea

lemora avatar Feb 07 '24 14:02 lemora

Hi Lea,

9.2.6 dCache java 11

I will post NPE related traces if produced.

All the best, Carlos

cfgamboa avatar Feb 07 '24 15:02 cfgamboa

With dcache 9.2.11:

#curl -L  --capath /etc/grid-security/certificates --cert  /tmp/x509up_u0 --cacert  /tmp/x509up_u0 --key /tmp/x509up_u0 -X POST https://dcfrontend.usatlas.bnl.gov:3880/api/v1/tape/stage -H "Content-Type: application/json" -d '{"files": [ {"diskLifetime":"P1D","path":"/pnfs/usatlas.bnl.gov/MCTAPE/atlasgrouptape/mc11_7TeV/NTUP_TOP/e2275_a186_s1571_a145_r2993_p937/mc11_7TeV.110198.McAtNloHerwigpp_UEEE4_MRSTMCal_CT10ME_ttbar_lnulnu.merge.NTUP_TOP.e2275_a186_s1571_a145_r2993_p937_tid01350830_00/NTUP_TOP.01350830._000087.root.3"}]}'
{"detail":"Internal error: java.lang.NullPointerException","title":"Internal Server Error","status":"500"}

gives:

==> /var/log/dcache/dcfrontend01FrontGplazDomain.log <==
12 Mar 2024 10:34:56 (Frontend-dcfrontend01) [] Uncaught exception in thread jetty-458571
java.lang.NullPointerException: null
	at org.dcache.restful.resources.tape.StageResources.submit(StageResources.java:296)
	at jdk.internal.reflect.GeneratedMethodAccessor377.invoke(Unknown Source)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:566)
	at org.glassfish.jersey.server.model.internal.ResourceMethodInvocationHandlerFactory.lambda$static$0(ResourceMethodInvocationHandlerFactory.java:52)
	at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher$1.run(AbstractJavaResourceMethodDispatcher.java:124)
	at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.invoke(AbstractJavaResourceMethodDispatcher.java:167)
	at org.glassfish.jersey.server.model.internal.JavaResourceMethodDispatcherProvider$ResponseOutInvoker.doDispatch(JavaResourceMethodDispatcherProvider.java:176)
	at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.dispatch(AbstractJavaResourceMethodDispatcher.java:79)
	at org.glassfish.jersey.server.model.ResourceMethodInvoker.invoke(ResourceMethodInvoker.java:469)
	at org.glassfish.jersey.server.model.ResourceMethodInvoker.apply(ResourceMethodInvoker.java:391)
	at org.glassfish.jersey.server.model.ResourceMethodInvoker.apply(ResourceMethodInvoker.java:80)
	at org.glassfish.jersey.server.ServerRuntime$1.run(ServerRuntime.java:253)
	at org.glassfish.jersey.internal.Errors$1.call(Errors.java:248)
	at org.glassfish.jersey.internal.Errors$1.call(Errors.java:244)
	at org.glassfish.jersey.internal.Errors.process(Errors.java:292)
	at org.glassfish.jersey.internal.Errors.process(Errors.java:274)
	at org.glassfish.jersey.internal.Errors.process(Errors.java:244)
	at org.glassfish.jersey.process.internal.RequestScope.runInScope(RequestScope.java:265)
	at org.glassfish.jersey.server.ServerRuntime.process(ServerRuntime.java:232)
	at org.glassfish.jersey.server.ApplicationHandler.handle(ApplicationHandler.java:679)
	at org.glassfish.jersey.servlet.WebComponent.serviceImpl(WebComponent.java:392)
	at org.glassfish.jersey.servlet.WebComponent.service(WebComponent.java:346)
	at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:365)
	at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:318)
	at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:205)
	at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:799)
	at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:554)
	at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:233)
	at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1440)
	at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:188)
	at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:505)
	at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:186)
	at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1355)
	at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
	at org.eclipse.jetty.server.handler.HandlerList.handle(HandlerList.java:59)
	at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)
	at org.dcache.http.AuthenticationHandler.access$001(AuthenticationHandler.java:55)
	at org.dcache.http.AuthenticationHandler.lambda$handle$0(AuthenticationHandler.java:157)
	at java.base/java.security.AccessController.doPrivileged(Native Method)
	at java.base/javax.security.auth.Subject.doAs(Subject.java:361)
	at org.dcache.http.AuthenticationHandler.handle(AuthenticationHandler.java:154)
	at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)
	at org.dcache.http.AbstractLoggingHandler.handle(AbstractLoggingHandler.java:110)
	at org.eclipse.jetty.server.handler.HandlerList.handle(HandlerList.java:59)
	at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)
	at org.eclipse.jetty.rewrite.handler.RewriteHandler.handle(RewriteHandler.java:322)
	at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)
	at org.eclipse.jetty.server.Server.handle(Server.java:516)
	at org.eclipse.jetty.server.HttpChannel.lambda$handle$1(HttpChannel.java:487)
	at org.eclipse.jetty.server.HttpChannel.dispatch(HttpChannel.java:732)
	at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:479)
	at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:277)
	at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:311)
	at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:105)
	at org.eclipse.jetty.io.ssl.SslConnection$DecryptedEndPoint.onFillable(SslConnection.java:555)
	at org.eclipse.jetty.io.ssl.SslConnection.onFillable(SslConnection.java:410)
	at org.eclipse.jetty.io.ssl.SslConnection$2.succeeded(SslConnection.java:164)
	at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:105)
	at org.eclipse.jetty.io.ChannelEndPoint$1.run(ChannelEndPoint.java:104)
	at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:338)
	at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:315)
	at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:173)
	at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:131)
	at org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:409)
	at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:883)
	at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:1034)
	at java.base/java.lang.Thread.run(Thread.java:829)

and

==> /var/log/dcache/dcfrontend02Domain.log <==
12 Mar 2024 10:34:56 (bulk) [] 044134c0-48a3-44a4-9b8c-f543d067917f - fetchAttributes, callback failure for TARGET [15393521, INITIAL, null][null][CREATED: (C 2024-03-12 10:34:56.862)(S null)(U 2024-03-12 10:34:56.862)(ret 0)][null] null : /fs/usatlas.bnl.gov/MCTAPE/atlasgrouptape/mc11_7TeV/NTUP_TOP/e2275_a186_s1571_a145_r2993_p937/mc11_7TeV.110198.McAtNloHerwigpp_UEEE4_MRSTMCal_CT10ME_ttbar_lnulnu.merge.NTUP_TOP.e2275_a186_s1571_a145_r2993_p937_tid01350830_00/NTUP_TOP.01350830._000087.root.3 (err null null).

vingar avatar Mar 12 '24 14:03 vingar

One more observation:

# curl --capath /etc/grid-security/certificates --cert /tmp/x509up_u0 --cacert /tmp/x509up_u0 --key /tmp/x509up_u0 https://dcfrontend.usatlas.bnl.gov:3880/api/v1/tape/stage/7cf32cfb-7c15-42f7-9e9b-0cbb1b08a37c 
{
  "id" : "7cf32cfb-7c15-42f7-9e9b-0cbb1b08a37c",
  "createdAt" : 1710253683266,
  "startedAt" : 1710253683278,
  "completedAt" : 1710253683300,
  "files" : [ {
    "path" : "/fs/usatlas.bnl.gov/MCTAPE/atlasgrouptape/mc11_7TeV/NTUP_TOP/e2275_a186_s1571_a145_r2993_p937/mc11_7TeV.110198.McAtNloHerwigpp_UEEE4_MRSTMCal_CT10ME_ttbar_lnulnu.merge.NTUP_TOP.e2275_a186_s1571_a145_r2993_p937_tid01350830_00/NTUP_TOP.01350830._000087.root.3",
    "finishedAt" : 1710253683269,
    "startedAt" : 1710253683269,
    "error" : "CacheException(rc=10001;msg=No such file or directory /fs/usatlas.bnl.gov/MCTAPE/atlasgrouptape/mc11_7TeV/NTUP_TOP/e2275_a186_s1571_a145_r2993_p937/mc11_7TeV.110198.McAtNloHerwigpp_UEEE4_MRSTMCal_CT10ME_ttbar_lnulnu.merge.NTUP_TOP.e2275_a186_s1571_a145_r29",
    "state" : "FAILED"
  } ]```

please not the truncated path. 

vingar avatar Mar 12 '24 14:03 vingar

I am pretty sure the file earth is truncated due to db text width limit set on DB end.

I will investigate the NPE.

DmitryLitvintsev avatar Mar 12 '24 15:03 DmitryLitvintsev

Carlos,

Question. Attempts to call stage on any other files in this directory also causes NPE or not?

Dmitry

DmitryLitvintsev avatar Mar 12 '24 15:03 DmitryLitvintsev

The path appears also truncated in the path column in the request_target table on the bulk db.

vingar avatar Mar 12 '24 15:03 vingar

yes, It is limited, but it should not fail because of that. That's why I asked if it fails on some particular files on any files

DmitryLitvintsev avatar Mar 12 '24 15:03 DmitryLitvintsev

Carlos,

Question. Attempts to call stage on any other files in this directory also causes NPE or not?

Dmitry

yes, there is a same error for another file in this directory:

# curl -L  --capath /etc/grid-security/certificates --cert  /tmp/x509up_u0 --cacert  /tmp/x509up_u0 --key /tmp/x509up_u0 -X POST https://dcrontend.usatlas.bnl.gov:3880/api/v1/tape/stage -H "Content-Type: application/json" -d '{"files": [ {"diskLifetime":"P1D","path":"/pnfs/usatlas.bnl.gov/MCTAPE/atlasgrouptape/mc11_7TeV/NTUP_TOP/e2275_a186_s1571_a145_r2993_p937/mc11_7TeV.110198.McAtNloHerwigpp_UEEE4_MRSTMCal_CT10ME_ttbar_lnulnu.merge.NTUP_TOP.e2275_a186_s1571_a145_r2993_p937_tid01350830_00/NTUP_TOP.01350830._000019.root.2"}]}'
{"detail":"Internal error: java.lang.NullPointerException","title":"Internal Server Error","status":"500"}

vingar avatar Mar 12 '24 15:03 vingar

OK,

Lets try low hanging fruit. Of course Al has tested the patch where the truncation is used on long paths. AND it was not supposed to matter. But just in case. I can revert this and update DB to accept longer path name.

Will you be able to test the RPM I will make for you? Do you have test / integration system where you can test it independently from production?

Dmitry

P.S.: rationale for path truncation was to keep DB smaller.

DmitryLitvintsev avatar Mar 12 '24 16:03 DmitryLitvintsev

Will you be able to test the RPM I will make for you? Do you have test / integration system where you can test it independently from production?

yes, sure

vingar avatar Mar 12 '24 17:03 vingar

OK, hers is he rpm

https://drive.google.com/file/d/1VoYKz1MCrI5_TCgZ8BXpVh5bmhqZSoVk/view?usp=sharing

This is top of 9.2. It need to be deployed only where you run bulk

DmitryLitvintsev avatar Mar 12 '24 18:03 DmitryLitvintsev

Having looked the code I see that :

"error" : "CacheException(rc=10001;msg=No such file or directory /fs/usatlas.bnl.gov/MCTAPE/atlasgrouptape/mc11_7TeV/NTUP_TOP/e2275_a186_s1571_a145_r2993_p937/mc11_7TeV.110198.McAtNloHerwigpp_UEEE4_MRSTMCal_CT10ME_ttbar_lnulnu.merge.NTUP_TOP.e2275_a186_s1571_a145_r29"

the error string is simply truncated to 256 characters :

>>> p="CacheException(rc=10001;msg=No such file or directory /fs/usatlas.bnl.gov/MCTAPE/atlasgrouptape/mc11_7TeV/NTUP_TOP/e2275_a186_s1571_a145_r2993_p937/mc11_7TeV.110198.McAtNloHerwigpp_UEEE4_MRSTMCal_CT10ME_ttbar_lnulnu.merge.NTUP_TOP.e2275_a186_s1571_a145_r29"
>>> len(p)
256
>>> 

So the truncated path is_not a problem.

BUT , could you check if :

/fs/usatlas.bnl.gov/MCTAPE/atlasgrouptape/mc11_7TeV/NTUP_TOP/e2275_a186_s1571_a145_r2993_p937/mc11_7TeV.110198.McAtNloHerwigpp_UEEE4_MRSTMCal_CT10ME_ttbar_lnulnu.merge.NTUP_TOP.e2275_a186_s1571_a145_r2993_p937_tid01350830_00/NTUP_TOP.01350830._000087.root.3

indeed exists.

I suspect the path should start with /pnfs.

What is absolute path of this file? I think issue has to do with prefix handling. Prefix handling has been fixed some time ago and is contained in RPM above. So test the RPM please.

What will help is this

path

/pnfs/usatlas.bnl.gov/MCTAPE/atlasgrouptape/mc11_7TeV/NTUP_TOP/e2275_a186_s1571_a145_r2993_p937/mc11_7TeV.110198.McAtNloHerwigpp_UEEE4_MRSTMCal_CT10ME_ttbar_lnulnu.merge.NTUP_TOP.e2275_a186_s1571_a145_r2993_p937_tid01350830_00/NTUP_TOP.01350830._000087.root.3

what is its PNSID

and then, with that PNFSID

\sn pathfinder <PNFSID>

This will help to understand what is happening.

DmitryLitvintsev avatar Mar 12 '24 19:03 DmitryLitvintsev

BUT , could you check if :

/fs/usatlas.bnl.gov/MCTAPE/atlasgrouptape/mc11_7TeV/NTUP_TOP/e2275_a186_s1571_a145_r2993_p937/mc11_7TeV.110198.McAtNloHerwigpp_UEEE4_MRSTMCal_CT10ME_ttbar_lnulnu.merge.NTUP_TOP.e2275_a186_s1571_a145_r2993_p937_tid01350830_00/NTUP_TOP.01350830._000087.root.3

indeed exists.

No, it doesn't.

I suspect the path should start with /pnfs. yes

> \sn pathfinder 00003BE34500448543C9A3E7CC23149D846C
/pnfs/usatlas.bnl.gov/MCTAPE/atlasgrouptape/mc11_7TeV/NTUP_TOP/e2275_a186_s1571_a145_r2993_p937/mc11_7TeV.110198.McAtNloHerwigpp_UEEE4_MRSTMCal_CT10ME_ttbar_lnulnu.merge.NTUP_TOP.e2275_a186_s1571_a145_r2993_p937_tid01350830_00/NTUP_TOP.01350830._000087.root.3

vingar avatar Mar 13 '24 08:03 vingar

OK, give this RPM a try please.

DmitryLitvintsev avatar Mar 13 '24 11:03 DmitryLitvintsev

Any progress on this?

DmitryLitvintsev avatar Mar 20 '24 13:03 DmitryLitvintsev

Hi,

We experienced the same issue yesterday while testing the upgrade to dCache 9.2.15 in our pre-production instance. We started an RT ticket and @lemora pointed out to this ticket. So, that is the reason why I jump in ;-) We have tested Dmitry's rpm with two configurations:

  • One bulk domain and the RPM installed on the servers running the bulk and frontend domains.
  • Two bulk domains and the RPM installed on the servers running the bulk and frontend domains.

With just one bulk domain, the error has disappeared, but with the second configuration, the error persists.

If we can be of any further help, please let us know.

Cheers, Samuel

samuambroj avatar Mar 21 '24 08:03 samuambroj

As suggested by @DmitryLitvintsev, we have also tested one single bulk domain, with the standard 9.2.15 rpm and in that case the issue has disappeared.

samuambroj avatar Mar 21 '24 13:03 samuambroj

Yes, we moved to a single bulk domain as well, and this addressed the issue. Now, we're waiting to have support for multiple bulk domains ;)

vingar avatar Apr 17 '24 08:04 vingar