kubo icon indicating copy to clipboard operation
kubo copied to clipboard

[Tracking issue] reframe used in production

Open BigLep opened this issue 3 years ago • 10 comments

Done Criteria

Reframe is enabled on the ipfs.io gateway, hitting Store The Index in parallel to querying the DHT to determine providers for a CID.

Notes

This corresponds with "stage 1" in https://www.notion.so/pl-strflt/Indexer-IPFS-Reframe-Q3-Plan-77d0f3fa193b438784d2d85096e315e7

It doesn't matter if STI gives empty results or if the results are unreachable because they're to Filecoin Storage Providers that don't allow data transfer with Bitswap. Enabling retrieval over Bitswap for Filecoin SP's is a separate effort.

Reframe shipped in Kubo 0.14 as part of https://github.com/ipfs/kubo/issues/8775. That said, it isn't used in production that we know of. The purpose of this tracking issue is to track followups that we have already identified as needing to be done or that come up as a result of attempting production deployment.

2022-08-25: the list below will be updated by EOD.

  • [x] https://github.com/ipfs/kubo/issues/9188 (need to do this first)
  • 2022-08-19: need to update based on the tasks discussed in https://github.com/ipfs/kubo/issues/9150#issuecomment-1220869533
  • [ ] Enable on ipfs.io gateway (production usecase)

2022-08-19: Need to see if these issues still make sense

  • [x] https://github.com/ipfs/kubo/issues/9079
  • [x] https://github.com/ipfs/kubo/issues/9083
  • [x] https://github.com/ipfs/kubo/issues/9157

Other followup issues that should be done, but aren't directly tied to the done criteria.

  • [x] https://github.com/ipfs/go-delegated-routing/issues/27
  • [x] https://github.com/ipfs/go-delegated-routing/issues/30
  • [x] https://github.com/ipfs/ipfs-blog/issues/446

BigLep avatar Jul 25 '22 22:07 BigLep

2022-08-19 conversation:

Tasks involved in finishing this off:

  1. Polish off the design
  • Answer last remaining questions.
  • We didn't estimate this, but it should be low/quick.
  1. New routers (3 dev days)
  • New routers
    • Sequential router
    • Parallel router
    • Router to filter by method
  • May include refractoring
  • Includes unit tests
  • Code will live in https://github.com/libp2p/go-libp2p-routing-helpers
  • Going to use existing routers if possible so we have less code to maintain but if this turns into a headache with Lotus, then may need new version.
  1. Make this work in Kubo (5 dev days)
  • Config parsing
  • Generate Routing instances from config (simple approach)
  • Using FX
  1. Documentation (1 dev days)
  • User-friendly config documentation
  • Move design documentation around
  • We can have a doc like "design decisions for routing routers".
  • Idea: you could create kubo/docs/exploration-reposts dir, and put it there (ipld folks did that in past: https://github.com/ipld/specs/tree/master/design/history/exploration-reports, and then moved content to the proper website doics/specs )
  1. Sharness smoke test (1 dev day)
  • This will be adjusting the existing Sharness test we have.
  1. Operational preparation (1 dev days)
  • Need to answer how much time each of the routers are taking
  • Edelweiss: define metrics we want, RPC metrics are already there
  1. Testing on gateways (2 dev days)

Total estimate: 13 ideal dev days

Assuming a velocity of .7 ideal dev day to actual day, that puts us at 19 calendar days assuming just one person working on this.

Next steps here:

  • [x] @BigLep communicate worst-case calendar date in #reframe
  • [ ] Clean up the existing issues
  • [ ] Create new issues that PRs can be linked against

BigLep avatar Aug 19 '22 16:08 BigLep

@ajnavarro : I moved over the estimates from https://github.com/ipfs/kubo/issues/9188#issuecomment-1220886223 to here.

Can you help with the issue cleanup here? Specifically:

  1. I think we need a new issue in go-libp2p-routing-helpers for the new routers.
  2. I assume we'll update https://github.com/ipfs/kubo/issues/9079 to account for the design changes. This is "Make this work in Kubo" above right or are you going to create another issue?
  3. I assume https://github.com/ipfs/kubo/issues/9083 and https://github.com/ipfs/kubo/issues/9157 can be closed in favor of the work happening in # 2 above, is that right?

BigLep avatar Aug 20 '22 00:08 BigLep

  • New issue for parallel and sequential routers https://github.com/libp2p/go-libp2p-routing-helpers/issues/56
  • Updated issue https://github.com/ipfs/kubo/issues/9079
  • What issue are you referring to? I think it is a typo.

ajnavarro avatar Aug 22 '22 09:08 ajnavarro

@ajnavarro : thanks. I updated "# 3" above. I believe the confusion stemmed from the fact that I wrote "#2" which got translated to issue "#2" in the repo.

BigLep avatar Aug 22 '22 19:08 BigLep

@BigLep yep, we can close #9083

ajnavarro avatar Aug 23 '22 09:08 ajnavarro

As communicated in Slack:

Total estimate is 13 ideal dev days Assuming a velocity of .7 ideal dev day to actual day, that puts us at 19 calendar days assuming just one person working on this. That feels high to me, but I think Bedrock should plan around that for now given we'll likely experience a jolt when Adin heads out, there is ongoing Kubo maintenance efforts that the team gets pulled into, and I don't think we'll be able to have an extra engineer on this beyond helping with code reviews.

BigLep avatar Aug 23 '22 14:08 BigLep

@ajnavarro : should https://github.com/ipfs/kubo/issues/9157 combine into https://github.com/ipfs/kubo/issues/9079 or will the config work be a separate PR. If it will be separate then I agree we should have two issues.

BigLep avatar Aug 23 '22 14:08 BigLep

@BigLep I think we can do them separately.

ajnavarro avatar Aug 23 '22 14:08 ajnavarro

  • Config parser with tests: Done
  • Create Router instances from validated config: Done
  • Integrate with fx: In progress

ajnavarro avatar Sep 12 '22 13:09 ajnavarro

  • First draft PR uploaded: https://github.com/ipfs/kubo/pull/9274.
  • Waiting for a new review on https://github.com/libp2p/go-libp2p-routing-helpers/pull/58.
  • Now updating go-delegated-routing to the latest release. There were some changes in method signatures.

ajnavarro avatar Sep 13 '22 13:09 ajnavarro

I was not sure where "productizing" https://routing.delegate.ipfs.io/reframe should live, so I filled https://github.com/protocol/bifrost-infra/issues/2142

lidel avatar Oct 10 '22 13:10 lidel

@guseggert : I added the important area to the done criteria we were missing last week of "ipfs.io gateway operators are able to assess reframe call health with metrics". Feel free to create a separate tracking issue for this work or do it here.

BigLep avatar Oct 17 '22 18:10 BigLep

I've expanded the done checklist to account for the migration from reframe to the updated HTTP delegated routing API: https://github.com/ipfs/go-delegated-routing/issues/63 . I have updated the title as well.

BigLep avatar Nov 29 '22 00:11 BigLep

Pending verification that clientside metrics are flowing for ipfs.io gateway and cid.contact in https://github.com/protocol/bifrost-infra/issues/2183

BigLep avatar Mar 24 '23 14:03 BigLep

2023-04-25 maintainer conversation: we're still trying to confirm that the ipfs.io gateway is calling cid.contact.
cid.contact is not seeing any requests from ipfs.io gateways per discussion in https://github.com/ipni/storetheindex/pull/1597#issuecomment-1519923480 and https://filecoinproject.slack.com/archives/C03RQFURZLM/p1682348755904599 .
@guseggert SSH'd to a box and confirmed it is using "autoclient" @guseggert confirmed that when he uses the same version of Kubo with "autoclient" that cid.contact is being called @guseggert is leading the investigation here

BigLep avatar Apr 25 '23 17:04 BigLep

I suspect https://github.com/ipfs/boxo/blob/main/routing/http/client/transport.go#L24-L28 could be an issue:

  • if the response body is not fully drained, the http client will not be able to re-use the connection. eventually you'll run out of the pool of clients and I think the logic then is to fail silently.
  • The defaults will give a pretty limited number of default allowed "simultaneously inflight" to a single host - that likely needs to be tuned for gateways. https://cs.opensource.google/go/go/+/refs/tags/go1.20.3:src/net/http/transport.go;l=201-211

willscott avatar Apr 25 '23 17:04 willscott

if the response body is not fully drained, the http client will not be able to re-use the connection. eventually you'll run out of the pool of clients and I think the logic then is to fail silently.

IIUC the body is not fully drained in the case where the limit is reached, which is indeed a bug, but are there actually records in cid.contact with >1 MiB of providers?

guseggert avatar Apr 25 '23 20:04 guseggert

My other theory is that it is related to the accelerated DHT client. When I run locally with FullRT client, no reqs are made to cid.contact.

guseggert avatar Apr 25 '23 20:04 guseggert

  • there may be records with >1MiB of response

bitswap issues a query for 10 providers and the accelerated DHT returns 20 immediately, terminating the query before it even starts on cid.contact.

I don't think this explains things - we have cases of providers that only are publishing to cid.contact. there wouldn't be 20 peers already in the DHT there, and we don't find them on gateways.

Just because you have some peers from FullRT you don't need to close the channel / cancel the query on the indexer side, right?

(in the same way that when indexers gets requests requesting a cascade to the DHT they don't cancel the dht cascade once we've returned responses from the index.)

willscott avatar Apr 25 '23 21:04 willscott

~30% of queries to kubo on gateways end up timing out with no peers, so we should still see those at minimum reaching the indexers, right?

willscott avatar Apr 25 '23 21:04 willscott

Okay there's a plumbing bug in Kubo's rats nest of content routing plumbing that is causing this issue, when the experimental DHT client is turned on then the cid.contact router is not used. Working on a fix, could take me a few hours to untangle that mess.

guseggert avatar Apr 26 '23 17:04 guseggert

Thanks @guseggert for the update.

As you are working on this, friendly reminder that we plan to move "accelerated DHT" out of experimental in the next release. If there are followups that need to occur, fee free to drop them to https://github.com/ipfs/kubo/issues/9703

In terms of releasing this, I assume we'll have it part of 0.20? (We'll talk during 2023-04-27 standup. Alternative is that Bifrost uses a custom branch until this come in a future release.)

A challenge here will be code review given team members being out. I think our best case is that @lidel take a look at it Thursday, 2023-04-27.

BigLep avatar Apr 26 '23 17:04 BigLep

i would push to get an RC onto bifrost and validate things are working as expected before releasing

willscott avatar Apr 26 '23 18:04 willscott

Agreed we should one-box this on the gateways before releasing to make sure it works, otherwise the feedback loop is brutal

guseggert avatar Apr 26 '23 20:04 guseggert

2023-04-27 maintainer conversation: this is still the top priority of @guseggert that he is actively working on. We don't have a PR yet.

BigLep avatar Apr 27 '23 15:04 BigLep

Okay I have working code now and can confirm that the default HTTP routers are being invoked alongside the FullRT client, will open a PR shortly and the plan is for @Jorropo to take a look on Monday

guseggert avatar Apr 28 '23 18:04 guseggert

Thanks @guseggert . Could you please also give directions on what Bifrost would need to test it? I'd like to get their test/verification happening in parallel.

BigLep avatar Apr 28 '23 19:04 BigLep

2023-05-02 update: The fix has deployed to "one box" on ipfs.io gateway. Integration looks good. These are going to get rolled out to the whole fleet over the next day or two.

BigLep avatar May 02 '23 16:05 BigLep

I'm closing this issue because ipfs.io gateway <> cid.contact integration is rolled out (see https://github.com/protocol/bifrost-infra/issues/2183#issuecomment-1533342956) and confirmed to be working.

BigLep avatar May 05 '23 20:05 BigLep