ALPaCA: Abstract Light-weight Producer and Consumer API
Context
The consumer/producer API model currently supported by Stream Registry follows a factory pattern. This creates enriched clients that are closely coupled to the underlying streaming platform (Kafka, Kinesis, etc.). Benefits of this approach include:
- allowing full access to the capabilities of the target streaming platform
- avoiding a 'lowest common denominator' approach whereby a client can expose only those features supported by all target streaming platforms
- easing adoption of Stream Registry by existing applications as the client interface remains unchanged
- enabling use of tightly integrated technologies such as KStreams, KSQL, and other mature 3rd party integrations such as Apache Spark's Kafka receiver or the Apache Flink Kinesis connectors
However, this approach does little to simplify scenarios where we must integrate multiple diverse streaming platforms (Kafka, Kinesis, etc.) with consuming or producing applications. This capability is of significant value when building core streaming data applications that must act on a majority or all of an organisation's streams. Typical examples include anomaly detection, data quality, and potentially many others.
I propose therefore that it would be beneficial to also include in Stream Registry a set of streaming platform agnostic consumer/producer APIs. These would allow the consumption/production of events in a unified manner so that applications can be built that simply and seamlessly interact with an organisation's streams, irrespective of the underlying streaming platform in which they are persisted.
To facilitate wide adoption and integration, APIs would be extremely light-weight and use only open and massively adopted technologies: HTTP, REST, Json, WebSockets, etc.
To be clear: I suggest that these APIs are provided in addition to the factory approach that is already supported, whose differentiating value was outlined earlier.
Desired Behaviour
As a consumer/producer, I can choose to integrate my application/service with the 'vendor' agnostic streaming API (ALPaCA). This API provides me with an abstraction to produce and consume events while decoupling me from the platform-specific API provided by my streaming platform. It thus allows me to target other streaming platforms in my data ecosystem with no additional effort, and to seamlessly support new platforms as they are adopted into the stream registry ecosystem.
The stream registry already assists me in understanding the events that I consume/produce via integration with schema registries, and it can also provide metadata that describes how I should serialise and deserialise said events. However, it does not provide me with the machinery to do so - the difficult part. ALPaCA would provide simple standardised message encodings (Json, GZIPped), transports (HTTP + WebSockets), and protocols (REST), enabling me to simply and consistently read and write events with any stream, with minimal dependencies and little coupling.
Benefits
- enables the development/deployment of organisation-wide stream-based applications/services/capabilities that can integrate with and operate on any stream in the organisation, irrespective of the underlying streaming platform on which the stream resides
- eliminates the need to integrate each core system with N streaming platforms
- eliminates duplicate development and maintenance of Kinesis/Kafka/etc. integration code across such systems
- eliminates vendor specific capability gaps forming in the data platform: "we can perform anomaly detection on your Kafka streams, but not your Kinesis streams as we haven't yet had a chance to build an integration"
- useful internal Stream Registry plumbing: can act as a bridge between streaming platforms - i.e. push from any one supported platform into any other by using the platform agnostic consumer/producer APIs. This solution then works for any combination of platforms, even those adopted in the future
- the relocation of streams from one platform (example: Kinesis) to another (Kafka) does not impact ALPaCA consumers or producers, thus minimising the overall impact of such a migration. This then allows us to think more freely regarding stream placement, and migrate them between systems based on cost, performance, etc.
While benefits have been described, it is important to underline the cases where the use of ALPaCA is disadvantageous. The sweet spot for ALPaCA is any producing/consuming application intended to be applied to many streams and across multiple streaming platforms - so really core data platform capabilities. It is not a good fit in cases where only one platform is targeted, and where excellent mature integrations already exist.
Comparable technologies
- JDBC: Unified connectivity and interoperability with disparate RDBMSes
- Data Access Layer pattern
Love this idea.
Glad you weighed in on this @teabot.
Something along these lines is where we'd like to go. Kinda like a `stream4j` (except not just java ;-) ).
As you know we are very early on this project and don't even have the clients OSS'ed yet. This is all coming soon. Hopefully we can collaborate to bring ideas like this to fruition ... faster.
Will be interesting how we navigate the tension between "common denominator" and leveraging key features of leading streaming platforms. That being said, it's happened before: e.g. ODBC/JDBC.
Why not an SBC? stream-basic-connector ? ;-)
As an acronym, possibly okay, but said aloud it conflicts with Akka's Alpakka
Noted, new name required.
lol was that where you were going for @teabot ?? Or total coincidence?
At a high level, the pitch sounds similar to Apache Camel's idea of components denoted with URIs.
https://github.com/apache/camel/blob/master/components/readme.adoc
lol was that where you were going for @teabot ?? Or total coincidence?
Total coincidence! I think I possess a nerdy desire to find contrived acronyms.
I'll try and be more concrete. Note that I've selected particular technologies for this example, but other open and vendor agnostic technologies could be used and may be preferable.
Example
One might have (contrived) endpoints like so:
https://stream.registry.host/namespace/<ns>/stream/<name>/produce
https://stream.registry.host/namespace/<ns>/stream/<name>/consume
Already you can perhaps see some similarities with Confluent's REST service. Hitting these opens a WebSocket by which callers can send or receive events onto the named 'stream'. The Stream Registry would perform the resolution and lookup of the target, be it a topic in a particular Kafka cluster, a Kinesis stream, or something else supported by SR. The caller can then produce/consume events in some known format (Json for the sake of argument). I expect that we'll also wish to provide a means to identify the user and apply RBAC in these endpoints.
Producer
I'll start with the producer side as it is simpler; however, I believe that a consumer endpoint delivers the most value (more on this in the summary). When producing, events are validated against the schema (perhaps an optional behaviour), serialised into the format described in the SR for the target stream, and then forwarded to the target topic/stream in whichever system it resides. Upon successful receipt of an event by the target system, an acknowledgment is returned to the caller.
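To make this concrete, here is a minimal producer sketch using Java's built-in WebSocket client. The endpoint URL, the bearer token, the event payload, and the shape of the acknowledgement frame are all illustrative assumptions, not a defined protocol.

```java
// Minimal producer sketch. The wss:// URL, token, event payload and ack format are
// illustrative assumptions only - the real ALPaCA protocol is yet to be designed.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.WebSocket;
import java.util.concurrent.CompletionStage;

public class AlpacaProducerSketch {

  public static void main(String[] args) throws InterruptedException {
    URI produceUri = URI.create(
        "wss://stream.registry.host/namespace/my-ns/stream/bookings/produce");

    WebSocket ws = HttpClient.newHttpClient()
        .newWebSocketBuilder()
        .header("Authorization", "Bearer <token>")   // identity/RBAC could be carried here
        .buildAsync(produceUri, new AckListener())
        .join();

    // A plain Json event; the endpoint would validate it against the registered schema
    // and serialise it into the stream's native format before forwarding it.
    ws.sendText("{\"bookingId\":\"b-123\",\"status\":\"CONFIRMED\"}", true).join();

    Thread.sleep(2_000);                              // give the ack a moment to arrive
    ws.sendClose(WebSocket.NORMAL_CLOSURE, "done").join();
  }

  /** Prints the per-event acknowledgement frames returned by the endpoint. */
  private static class AckListener implements WebSocket.Listener {
    @Override
    public CompletionStage<?> onText(WebSocket webSocket, CharSequence ack, boolean last) {
      System.out.println("ack: " + ack);              // e.g. {"status":"OK"}
      webSocket.request(1);                           // request the next incoming frame
      return null;
    }
  }
}
```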
Consumer
Consuming is slightly more complex as we need to deal with event consumption by groups of consumers, offsets, checkpointing, and rebalances. However, the fundamental principle is the same. A consumer identifies itself while opening a WebSocket to the stream; the SR determines the target topic/stream from which to consume; the consumer is allocated some partitions/shards, optionally requests offsets, and then finally begins to receive events from the target. Here too the endpoint takes the raw event and converts it into a Json payload, which is what the caller will see. Clearly we must design some protocol to orchestrate consumption over the WebSocket - we should aim for something vendor agnostic but with sufficient fidelity to allow integration with major streaming platforms.
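As a rough illustration of the consumer side, the sketch below opens a WebSocket to a hypothetical consume endpoint, identifies a consumer group with a made-up 'subscribe' control frame, and prints the Json events pushed by the endpoint. The control messages, field names, and URL are assumptions about a protocol that has not been designed yet.

```java
// Minimal consumer sketch. The "subscribe" control frame, its fields, the event frame
// shape and the URL are illustrative assumptions about a yet-to-be-designed protocol.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.WebSocket;
import java.util.concurrent.CompletionStage;
import java.util.concurrent.CountDownLatch;

public class AlpacaConsumerSketch {

  public static void main(String[] args) throws InterruptedException {
    URI consumeUri = URI.create(
        "wss://stream.registry.host/namespace/my-ns/stream/bookings/consume");

    WebSocket ws = HttpClient.newHttpClient()
        .newWebSocketBuilder()
        .buildAsync(consumeUri, new EventListener())
        .join();

    // Identify ourselves and our consumer group; the endpoint resolves the backing
    // topic/stream, allocates partitions/shards and starts pushing decoded Json events.
    ws.sendText("{\"type\":\"subscribe\",\"group\":\"anomaly-detector\"}", true).join();

    new CountDownLatch(1).await();   // run until the process is stopped
  }

  private static class EventListener implements WebSocket.Listener {
    @Override
    public CompletionStage<?> onText(WebSocket webSocket, CharSequence frame, boolean last) {
      // Each frame is one event, already converted to Json by the endpoint,
      // e.g. {"partition":0,"offset":42,"payload":{...}}
      System.out.println("event: " + frame);
      // Offset commits/checkpoints would be sent back as control frames over the same
      // socket; that part of the protocol is deliberately left out of this sketch.
      webSocket.request(1);
      return null;
    }
  }
}
```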
Implementation
Clearly something will need to know how to interact with each supported streaming platform, and how to [de]serialise various formats from/to Json. These are the responsibilities of the endpoints delivering this abstract consumer/producer API. Effectively, they adapt each vendor-specific consumer/producer API into a common and abstract API, building on the Stream Registry's capabilities and metadata to achieve this.
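One possible internal shape for this, purely as a sketch: each supported platform contributes an adapter behind which the endpoints hide all vendor-specific detail. The interface name, methods, and coordinate type below are hypothetical.

```java
// Hypothetical SPI that each platform integration (Kafka, Kinesis, ...) would implement;
// the ALPaCA endpoints would delegate to it after resolving the stream via the SR.
import java.util.concurrent.CompletableFuture;
import java.util.function.Consumer;

public interface StreamPlatformAdapter {

  /** The target resolved by the Stream Registry from /namespace/&lt;ns&gt;/stream/&lt;name&gt;. */
  record StreamCoordinates(String namespace, String name) {}

  /** Serialise the Json event into the stream's registered format and write it. */
  CompletableFuture<Void> produce(StreamCoordinates stream, String jsonEvent);

  /** Subscribe on behalf of a consumer group, decoding native records back into Json. */
  AutoCloseable consume(StreamCoordinates stream, String consumerGroup,
                        Consumer<String> jsonEventHandler);
}
```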
Alternative implementations
So far I've envisaged these endpoints operating as Stream Registry services, running on SR infrastructure. Other options considered:
- One alternative approach would be to implement an abstract client library (sketched below). In this event, applications would integrate with an abstract client which would take care of SerDe and connectivity concerns in-process. However, I believe this realises far fewer advantages: it risks pulling many unnecessary transitive dependencies into the application's domain and creates a maintenance coupling with each and every application that uses the library.
- I expect the service(s) providing the endpoints could also be deployed as a sidecar to the applications that use them. This could localise the costs incurred by the endpoints (both resource and financial) to the application that they serve.
Both of these options would also potentially increase networking and security complexity, as each application might need to connect to every streaming platform managed by the SR rather than to a single set of SR endpoints.
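For comparison, the client-library alternative would look roughly like the following from an application's point of view. Again, the names and signatures are hypothetical, not a proposed API.

```java
// Hypothetical in-process client for the library alternative: the application binds to a
// stream by its Stream Registry coordinates (via some factory that consults the SR) and
// the library handles SerDe and connectivity internally. The cost is that every such
// application now carries these dependencies and their maintenance.
import java.util.concurrent.CompletableFuture;
import java.util.function.Consumer;

public interface AlpacaClient extends AutoCloseable {

  /** Asynchronously publish one Json event to the bound stream. */
  CompletableFuture<Void> send(String jsonEvent);

  /** Consume Json events on behalf of the given consumer group. */
  AutoCloseable subscribe(String consumerGroup, Consumer<String> jsonEventHandler);
}
```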
Summary
These APIs will allow seamless vendor and SerDe agnostic consumption and production of events into a range of target streaming platforms. The specific target platform and data format are intentionally opaque to the caller - they don't need to think about it. Use of mature and massively adopted protocols and encodings enables simple light-weight client integrations with few dependencies - you could imagine consuming messages by hitting the endpoint with a curl command, for example.
My belief is that this is primarily important for platform-wide consumer applications and services such as anomaly detection. In these cases we might wish to connect to a diverse range of streams, with various encodings, residing in heterogeneous sets of streaming platforms provided by different vendors. The proposed API would allow these applications to perform this integration once, freeing them of direct couplings to multiple specific clients, and perhaps even multiple versions of those clients.
The case for a standardised and open producer API is perhaps not as strong. Typically we do not have applications that need to send the same events to multiple different streams. However, the producer side does enable interesting stream-to-stream and source-to-stream plumbing applications:
- stream-to-stream - replicate a stream from A to B without having to understand the technicalities of what A and B actually are (e.g. Kinesis to Kafka); a minimal bridge sketch follows this list.
- source-to-stream - common components that might need to push to a variety of different stream platforms in different use-cases. For example, a CDC capture component for an RDBMS - we might wish for events from one table to go onto a Kafka topic, and events from another onto a Kinesis stream. The producer abstraction allows us to build once, for one API, yet target both platforms.
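A stream-to-stream bridge then reduces to wiring an ALPaCA consumer to an ALPaCA producer. The sketch below forwards every event frame from stream A's consume endpoint to stream B's produce endpoint; the URLs and frame handling are, again, illustrative assumptions.

```java
// Stream-to-stream bridge sketch: events from stream A are forwarded verbatim to stream B.
// Neither side of the bridge knows (or cares) whether A or B is Kafka, Kinesis or other.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.WebSocket;
import java.util.concurrent.CompletionStage;
import java.util.concurrent.CountDownLatch;

public class AlpacaBridgeSketch {

  public static void main(String[] args) throws InterruptedException {
    HttpClient http = HttpClient.newHttpClient();

    // Producer side: stream B, wherever it happens to live.
    WebSocket sink = http.newWebSocketBuilder()
        .buildAsync(URI.create("wss://stream.registry.host/namespace/ns-b/stream/b/produce"),
                    new WebSocket.Listener() {})
        .join();

    // Consumer side: stream A; each received frame is forwarded to the sink, and the next
    // frame is only requested once the forward has completed (simple backpressure).
    http.newWebSocketBuilder()
        .buildAsync(URI.create("wss://stream.registry.host/namespace/ns-a/stream/a/consume"),
                    new WebSocket.Listener() {
                      @Override
                      public CompletionStage<?> onText(WebSocket source, CharSequence event,
                                                       boolean last) {
                        return sink.sendText(event.toString(), true)
                                   .thenRun(() -> source.request(1));
                      }
                    })
        .join();

    new CountDownLatch(1).await();   // run until the process is stopped
  }
}
```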
Ok. If we actually did this, would it make sense to make it a separate module that integrates with stream registry? E.g. a platform agnostic producer (and consumer) library that handles certain use cases?
If yes, wouldn't that be separate from stream-registry? I know we are planning on open sourcing a stream-registry-clients repo soon. Would a generic-stream-registry-client as you describe here make sense? (looking for concrete next action on this one)
Would ❤️ to see a top level repo that uses this signature:
https://github.com/HotelsDotCom/data-highway/blob/86704be5d268b8e898959bb1c8fe9cff6ab84fc0/client/onramp/src/main/java/com/hotels/road/client/AsyncRoadClient.java#L28-L30