
Failed to download batch from IPFS

Open awrichar opened this issue 2 years ago • 3 comments

During a recent performance run, the job failed to start because the orgs were not registered properly.

Node 1 shows this upload:

{"log":"[2022-08-17T21:44:05.067Z]  INFO IPFS published QmZV1813npCEb8USvUJXHmnLNbtQgjLgZA8TPF83P1pJ68 Size=1415 d=pinned_broadcast ns=default opcache=juVW47Fq p=did:firefly:org/org_1| pid=1 role=batchmgr\n","stream":"stderr","time":"2022-08-17T21:44:05.068084982Z"}
{"log":"[2022-08-17T21:44:05.067Z]  INFO Published batch 'f4b9cba6-e30b-4cf9-a143-e9ac425dc2d1' to shared storage: 'QmZV1813npCEb8USvUJXHmnLNbtQgjLgZA8TPF83P1pJ68' d=pinned_broadcast ns=default opcache=juVW47Fq p=did:firefly:org/org_1| pid=1 role=batchmgr\n","stream":"stderr","time":"2022-08-17T21:44:05.068104641Z"}

Node 0 is repeatedly unable to download:

{"log":"[2022-08-17T21:44:07.861Z] DEBUG ==\u003e GET http://ipfs_0:8080/ipfs/QmZV1813npCEb8USvUJXHmnLNbtQgjLgZA8TPF83P1pJ68 breq=BAGcvc3U pid=1 sharedstorage=ipfs\n","stream":"stderr","time":"2022-08-17T21:44:07.862328371Z"}
{"log":"[2022-08-17T21:44:37.862Z] DEBUG \u003c== GET http://ipfs_0:8080/ipfs/QmZV1813npCEb8USvUJXHmnLNbtQgjLgZA8TPF83P1pJ68 [0] (30001.16ms) breq=BAGcvc3U pid=1 sharedstorage=ipfs\n","stream":"stderr","time":"2022-08-17T21:44:37.862481893Z"}
{"log":"[2022-08-17T21:44:37.862Z] DEBUG ipfs updating operation default:f9ffe35b-28de-446b-a849-177db05d3134 status=Pending error=FF10376: Error downloading data with reference 'QmZV1813npCEb8USvUJXHmnLNbtQgjLgZA8TPF83P1pJ68' from shared storage: FF10136: Error from IPFS: : Get \"http://ipfs_0:8080/ipfs/QmZV1813npCEb8USvUJXHmnLNbtQgjLgZA8TPF83P1pJ68\": context deadline exceeded (Client.Timeout exceeded while awaiting headers) ns=default pid=1\n","stream":"stderr","time":"2022-08-17T21:44:37.862524061Z"}
{"log":"[2022-08-17T21:44:37.862Z] ERROR Download operation sharedstorage_download_batch/f9ffe35b-28de-446b-a849-177db05d3134 attempt=1/100 failed: FF10376: Error downloading data with reference 'QmZV1813npCEb8USvUJXHmnLNbtQgjLgZA8TPF83P1pJ68' from shared storage: FF10136: Error from IPFS: : Get \"http://ipfs_0:8080/ipfs/QmZV1813npCEb8USvUJXHmnLNbtQgjLgZA8TPF83P1pJ68\": context deadline exceeded (Client.Timeout exceeded while awaiting headers) downloadworker=dw_007 ns=default pid=1\n","stream":"stderr","time":"2022-08-17T21:44:37.862698724Z"}
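For anyone reading the Node 0 lines above, here is a minimal sketch for pulling the request duration and status out of the `<==` response trace. It assumes the Docker JSON log format shown in this issue; the function name `parse_response_trace` and the regular expression are illustrative, not part of FireFly.

```python
import json
import re

# Matches a response trace such as "<== GET <url> [0] (30001.16ms)".
# The pattern is an assumption based on the log lines quoted in this issue.
RESPONSE_RE = re.compile(
    r"<== (?P<method>[A-Z]+) (?P<url>\S+) \[(?P<status>\d+)\] \((?P<ms>[\d.]+)ms\)"
)


def parse_response_trace(docker_line: str):
    """Return method, URL, status and duration from one Docker JSON log line.

    Returns None if the line is not a response trace.
    """
    # json.loads decodes the \u003c escape back into "<".
    message = json.loads(docker_line)["log"]
    match = RESPONSE_RE.search(message)
    if match is None:
        return None
    return {
        "method": match["method"],
        "url": match["url"],
        "status": int(match["status"]),
        "duration_ms": float(match["ms"]),
    }


# The response trace from Node 0, copied from the log above.
sample = r'{"log":"[2022-08-17T21:44:37.862Z] DEBUG \u003c== GET http://ipfs_0:8080/ipfs/QmZV1813npCEb8USvUJXHmnLNbtQgjLgZA8TPF83P1pJ68 [0] (30001.16ms) breq=BAGcvc3U pid=1 sharedstorage=ipfs\n","stream":"stderr","time":"2022-08-17T21:44:37.862481893Z"}'

trace = parse_response_trace(sample)
print(trace["status"], trace["duration_ms"])  # prints: 0 30001.16
```

The request took just over 30 seconds and ended with status `[0]`, which fits the `Client.Timeout exceeded while awaiting headers` error: the IPFS gateway never started a response before the client gave up.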

awrichar, Aug 18 '22 15:08

log_firefly_core_0.log.gz log_firefly_core_1.log.gz

Unfortunately, I did not capture the IPFS logs. However, I'm fairly certain IPFS was up and not logging any obvious anomalies.

awrichar, Aug 18 '22 15:08

I've also seen this locally at least once, so it wasn't a totally isolated incident.

awrichar, Aug 18 '22 15:08

On the surface, it looks like the IPFS network is not healthy.

Each time a download request is made against Node 0, it should reach out to its peers to find the data. And Node 1 should have knowledge of that data in its DAG.
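One way to test that expectation is to ask Node 0's IPFS daemon whether it is actually connected to Node 1's. The sketch below assumes both daemons are Kubo (go-ipfs) nodes exposing the RPC API, and uses its `id` and `swarm/peers` commands. The helper names are illustrative, and the API base URLs are assumptions (only the gateway `http://ipfs_0:8080` appears in the logs).

```python
import json
import urllib.parse
import urllib.request


def kubo_rpc(api_base: str, command: str, **params) -> dict:
    """POST to a Kubo RPC command such as 'swarm/peers' and decode the JSON reply."""
    query = urllib.parse.urlencode(params)
    url = f"{api_base}/api/v0/{command}" + (f"?{query}" if query else "")
    # The Kubo RPC API only accepts POST requests.
    request = urllib.request.Request(url, data=b"", method="POST")
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)


def connected_peer_ids(swarm_peers_reply: dict) -> set:
    """Extract peer IDs from a swarm/peers reply; 'Peers' may be null or empty."""
    return {peer["Peer"] for peer in swarm_peers_reply.get("Peers") or []}


def is_connected(swarm_peers_reply: dict, peer_id: str) -> bool:
    """True if peer_id appears among the node's current swarm connections."""
    return peer_id in connected_peer_ids(swarm_peers_reply)


def check_peering(api_0: str, api_1: str) -> bool:
    """Check whether the node behind api_0 is connected to the node behind api_1."""
    node_1_id = kubo_rpc(api_1, "id")["ID"]
    return is_connected(kubo_rpc(api_0, "swarm/peers"), node_1_id)
```

With the assumed hostnames and Kubo's default API port, the call would be `check_peering("http://ipfs_0:5001", "http://ipfs_1:5001")`. A `False` result would suggest the two nodes never peered, which would explain why Node 0 cannot locate a CID that only Node 1 holds.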

peterbroadhurst, Aug 19 '22 12:08