Overall plans for the project

Open njsmith opened this issue 4 years ago • 9 comments

Now that we're transitioning out of "stealth mode" and have some new contributors joining, we should probably make sure we're all on the same page about goals, and start making some plans. Here's some notes to kick that off; discussion is very much welcome.

Goal/overall vision

For me, the goal is to create a Python HTTP client library that works for everyone. Like, nothing against other client libraries, we're not enemies or anything, but we want to eventually reach a point where our quality and features are so great that the maintainers of urllib3/requests/aiohttp/asks/treq/hyper/etc. are all comfortable recommending their users switch to our thing, or rebasing their libraries on top of ours. Of course that will take a lot of work, but it seems like in the long run it will take less work than everyone duplicating effort on maintaining all these HTTP stacks in parallel. Maintaining a production-quality HTTP client is hard.

Scope

We originally started out with a fairly narrow goal of "replace urllib3", possibly even keeping the name (as "urllib3 v2"). Over time, my feelings on this have shifted, due to a few factors:

It turns out that urllib3's public API exposes a lot of internal details, which is a major challenge for maintenance, because it's difficult to change things without accidentally breaking backwards compatibility. (This is a common problem in libraries that grew organically and then got really popular – numpy has similar problems.) And in particular, it's not actually possible to switch from httplib → h11 without breaking public API. And that's a precondition to pretty much everything else we want to do. So, keeping the urllib3 name and full public API simply isn't possible.
As we've gotten more familiar with urllib3's internals, I've found more places that I think could benefit from some cleanup :-)
The more I studied the requests/urllib3 split, the more it seemed to me that the separation doesn't make a lot of technical sense, at least in retrospect. The "high level" features that requests adds are not terribly complicated, they're pretty much always useful to have available, and they can be done in a way that doesn't add much or any overhead/complexity if you don't need them.

So... there's a tricky balance here, because we don't want to go wild trying to fix every frustration we've ever had with any HTTP library. We need to keep our schedule manageable and we want to make it easy for existing code to migrate to our new library. But I do think we should:

take this opportunity to review and pare down the public API and make sure that we're not exposing more internal details than we want to, and
increase our scope to target roughly requests+urllib3's feature set, rather than just urllib3 alone.

Decision-making (aka "governance")

So far we've been doing this all in the python-trio org, which I guess de facto makes me BDFL. I'm not particularly attached to that, but I do have some of the relevant skills and it's a simple solution to keeping things organized, so I'm happy to continue in that role for now if that's what we want. Keeping stuff in the python-trio org also lets us lean on the shared org infrastructure (e.g. code of conduct, contribution guidelines, automatic invitation bot, chat, forum, etc.). It might also make sense to switch to some other org like python-http or similar, but as of today that's more of an aspirational placeholder than a functioning organization. Or there's python-hyper, but that's effectively moribund. So I don't really know of a better place than python-trio to put stuff.

To be explicit though: even if it stays in python-trio for now, that doesn't mean that trio has some special status; we'll still have first-class support for asyncio etc., and the goal is to create something that belongs to the Python community as a whole.

Next steps

So far we've been trying to stay close to upstream urllib3 to make it easy to merge changes. But I think we've taken that about as far as it can go, so it's time to pull the trigger and start diverging for real. So some things we want to start doing soon:

Switch to our new name (which we should hopefully get ownership of any time now...)
Update the README and docs to include the critical information on what the new project is (at least the rough goals and that the public API is very much NOT stable, and how it compares to httpx)
Upload a v0.0.1 to pypi and announce it to the world

That's kind of needs to happen in that order, I guess. There are also some next steps that are more parallelizable:

Start seriously extending the test suite to cover the async code
Start reviewing the public API and making decisions about what it should look like (there are some other issues open for parts of this)
I know @sethmlarson wants to start refactoring the test suite to make it simpler/less fragile, which would be awesome
Probably more stuff I'm not thinking of right now

Again this is just a brain dump to prompt discussion, not the final word on anything, but hopefully it's a good start.

What do you all think? Any additions, concerns you want to raise, tweaks you want to make?

Nov 23 '19 04:11 njsmith

Oh, I guess we also need to rename the repo, which is a bit annoying. I think we'll need to break the github metadata that says that this is a "fork" of urllib3 upstream, because we're splitting off to become an independent project and if we don't do that then github will do things like assume that PRs intended for us should instead be directed back to urllib3/urllib3. And github doesn't actually offer any way to do that except to create a new repo. Transferring over the git history to a new repo is trivial, fortunately; the bigger problem is transferring over the issues. Maybe we should make a new hip repo, and then just go through the ~30 issues on here manually, and for each one click "Transfer issue" and send it to the new repo? I think that might be the best option that github offers.

Regarding orgs: it just occurred to me that the urllib3/ org might also make sense as a possible home? Dunno what y'all think.

Nov 23 '19 04:11 njsmith

Oh, look what @pquentin found:

To detach the fork and turn it into a standalone repository on GitHub, contact GitHub Support or GitHub Premium Support. If the fork has forks of its own, let support know if the forks should move with your repository into a new network or remain in the current network.

https://help.github.com/en/github/setting-up-and-managing-your-github-profile/why-are-my-contributions-not-showing-up-on-my-profile#commit-was-made-in-a-fork

So I guess we'll do that, maybe after we get confirmation on ownership of our new name?

Nov 23 '19 05:11 njsmith

Thank you for doing this. I wholeheartedly agree with 1/ the vision 2/ the governance with you as BDFL 3/ the next steps. But my agreement is also loosely held, in the sense that if someone feels strongly about some specific aspect I'd be happy to discuss to get consensus.

Nov 23 '19 18:11 pquentin

I'd want you as BDFL and for the project to live under python-trio for the foreseeable future. I agree with everything else here in terms of direction. :)

Nov 23 '19 18:11 sethmlarson

new name acquired
sent request to github support to detach us from the urllib3/urllib3 "fork network"

Nov 25 '19 20:11 njsmith

Hey Nathaniel, Thank you for writing in to GitHub Support! We've detached python-trio/hip and made it a standalone repository.

I think this means we all need to make new forks.

Nov 26 '19 08:11 njsmith

Hey folks,

So, we've poured a huge amount of work into getting a "high level" API squared away with httpx. From my point of view we've put in the legwork there, and I'd love to see us working together, rather than replicating that work. 😌🤞

At least from my POV, what would make a huge amount of sense would be for hip to focus squarely on providing a rock solid low-level API, as @njsmith mentions...

class AbstractAsyncHTTPTransport(ABC):
    async def request(self, verb: bytes, url: bytes, headers: List[Tuple[bytes, bytes]], body: AsyncIterator[bytes]):
        ...
        return (status_code, status_message, headers, response_body)

No redirect handling. No cookie cleverness. No fancy HTTP retries. No multipart handling. No authentication or .netrc lookups. No response decoding. No "are we able to rewind the request body on redirection?" etc.etc. Just the raw basic "here's a connection pool, make a request with it" or "here's a proxy, make a request via it".

If hip provided that, we can plug httpx into it directly / potentially switch to it, since it's dispatch interface is exactly the same kind of abstraction layer.

Personally I think it'd probably be a good thing to have the low-level networking and the high-level model abstractions strictly seperated into independent nicely-scoped packages, since it helps keep different concerns strictly and clearly isolated.

Some finer details: I think you'd also want a timeout argument on that interface, to support independently-set-per-request timeouts, but other stuff like requests/httpx's equivelents of verify/cert ought be configured on a per-pool instantiation basis, rather than per-request, since you don't know/care if you'll actually be connecting or reacquiring an existing connection. There's probably also some thinking to do wrt. either making sure both levels use the same URL parsing.

We do still have an open question on httpx wrt. the best way to support both sync and async variants. Version 0.8 just down-scoped to only supporting async, since our bridging approach lead to a profileration of interface types, and sync vs. async classes both subclassing from base classes was leading to less comprehensible code. I think we're likely to consider an unasync approach instead, at a later date.

Nov 27 '19 13:11 tomchristie

@tomchristie Yeah, we definitely need to figure out more about how hip/httpx relate. For me the biggest difference is the visions; as you wrote as you wrote in https://github.com/encode/httpx/issues/522:

The big design goals of HTTPX have been to meet two features lacking in requests...

HTTP/2 (and eventually HTTP/3) support.

Async support.

[...] Given that requests already provides a battle tested HTTP/1.1 client for the threaded concurrency masses, my inclination is that rather than trying to meet all possible use cases we should focus on httpx being a "the right tool for the right job" rather than "one size fits all".

OTOH, for hip, like I wrote above, the goal is explicitly to "meet all possible use cases" and be "one size fits all" :-). So that's a pretty clear difference in focus, and I think it's been somewhat in the background of a lot of conversations we've had, and it's great that we've all managed to articulate it clearly now.

We could talk about the motivation for our approach; for me it's closely related to what @pquentin wrote here:

can explain why I am working on the urllib3 fork since January 2018: I don't want to waste the huge amount of real world experience urllib3 has. It seems wasteful to have to relearn this by waiting for a slow trickle of bug reports.

For me, the number one challenge for building a production quality HTTP library is figuring out how handle all these edge cases that we don't even know about. This motivates us building on top of urllib3 (versus httpx's start-from-scratch strategy), but it also motivates the "one http client for everyone" philosophy, because that way we only have to address each weirdo edge case once, instead of every project having to rediscover them independently.

Of course the downside is that we inherit all of urllib3's technical debt, but by the time httpx is as mature as urllib3 I'm sure it will have plenty of technical debt too, because HTTP is just intrinsically messy :-) And technical debt is something we can tackle incrementally on our own schedule, instead of being stuck in reactive mode waiting for reports to come in.

I think the different visions also lead to different short-term priorities: we've certainly thought some about HTTP/2, but our primary focus for now is 100% on HTTP/1.1, because that's what we need for an MVP to replace the old libraries. OTOH, httpx has focused more on the shiny new features that older libraries lack. Both approaches make sense given the respective visions, but they're different.

So I have absolutely no animosity towards httpx; it just seems focused on solving a different problem than the one I'm interested in. If we can find ways to collaborate though, then that would be excellent; there's no point in reinventing the wheel. So let's think a bit about what that might look like.

Like you say, the high-level interface is probably where we have the most to learn from httpx.

One my concerns before was about httpx's layered design, with redirects/cookies/etc. handled as separate concerns through the "middleware"/dispatcher mechanism. As a design, it's super elegant! But HTTP is not very elegant :-). And in particular, when I sat down to study requests, my conclusion was that the core "high-level" features are redirects, cookies, retries, and auth handling – and that these are all tightly coupled in a way that doesn't fit well with a layered architecture like this. (In fact every CVE in requests's history has been because of trying to de-couple redirects/auth.) So my feeling has been that the "high level" concerns are best handled as a single piece of tightly-integrated code. But it sounds like you're going in the same direction, so I guess that's no longer an issue?

I don't think I agree about separating the high-level and low-level stuff into separate packages; but that doesn't actually affect too much. Regardless of whether hip starts growing higher-level features, we'll still have the lower-level networking stuff that httpx can use if that makes sense. And if hip does start growing its own higher-level features, we'll obviously look at httpx when doing that. So either way, I don't think any work will be wasted.

I do think that simply adopting httpx might be difficult, because I think hip's approach to handling sync+async has a pretty fundamental effect on API design, so we'll need to think those issues through while adopting httpx's ideas. (And fwiw delaying unasync for later seems risky to me, but I guess you know what you're doing...)

Some finer details: I think you'd also want a timeout argument on that interface, to support independently-set-per-request timeouts, but other stuff like requests/httpx's equivelents of verify/cert ought be configured on a per-pool instantiation basis, rather than per-request, since you don't know/care if you'll actually be connecting or reacquiring an existing connection

I think urllib3's idea for handling this is pretty elegant: it makes stuff like verify/cert part of the connection cache key, so you can have a single connection pool, but you won't reuse a connection unless these settings are compatible. (The exact details of urllib3's code have some unnecessary awkwardness here. But I like the core idea.) One consequence is that you can have a global connection pool so top-level requests like hip.get/httpx.get can re-use connections, but can also accept verify=... arguments. And then you can reduce a Session to just being a bundle of preconfigured kwargs, rather than a substantive object in its own right.

Nov 29 '19 08:11 njsmith

From my side I probably don't think that we've got any fundamental differences in goals, and the any failure to collaborate between our two sets of work would probably be down to a failure to communicate, rather than due to any technical blockers.

My opening salvo in #522 isn't actually where I ended up, and was triggered by us having an underlying approach to async+sync that wasn't working for the project, which we've now removed with 0.8. There's a bunch of places where I think we're on either the same, or similar pages:

For sync support, we'll be switching over to the same code-gen style as hip. (We'll end up with exactly the same issue to resolve as your #168 wrt. how we name/scope our Sync vs. Async interfaces)
We'll likely switch our concurrency backends to the same interface & implementation as hip.
Having taken a look at the sync backend that hip uses, I think we're actually pretty likely to be providing a sync interface again sooner, rather than later.
I think our HTTP/2 support ought to be optional, and off by default.

I don't think I agree about separating the high-level and low-level stuff into separate packages; but that doesn't actually affect too much.

Sure, so let me frame that differently...

We could trivally drop everything in httpx's dispatch implementation, and plug it into hip's PoolManager & ProxyManager classes instead. Where I think there's some really productive scope for collaboration would be if the hip team are able to clearly specify what interface they're aiming for at that layer, then I think we've got a huge slice of great work that we can contribute to providing all the model layers that sit above that.

it makes stuff like verify/cert part of the connection cache key, so you can have a single connection pool, but you won't reuse a connection unless these settings are compatible.

That's smart. Worth noting that it is actually functionally identical behavior to "verify/cert should be per-connection-pool settings, not per-request settings". If they're not sharing the same cache keys, then they are essentially just different connection pools.

Nov 30 '19 14:11 tomchristie

hip hip copied to clipboard

Overall plans for the project

Goal/overall vision

Scope

Decision-making (aka "governance")

Next steps

hip
hip copied to clipboard