Introduce a Background or Listener Executor to support Extract, Transform and Load (ETL) pattern
Describe the feature
The ETL pattern is currently popular due to the availability of various platforms that generate data, transform it, and load it into the required database. It is also widely used in large organizations with multiple teams that each specialize in one aspect, leveraging messaging platforms such as Kafka, SQS, RabbitMQ, or JMS. The core of the ETL pattern is to decouple tasks and empower small teams to focus on a subset of problems: the engineering team can focus on the extract and load tasks, while one or more AI teams focus on enriching the data. This supports the cross-modal use case.
Your proposal
The Jina framework currently supports the `push` mechanism, which accepts a DocArray, pushes the data through subsequent Executors, and eventually loads it into the databases. The `pull` client for extracting data currently has to be created outside the Flow. A Background or Listener Executor could support the `pull` mechanism from within the Flow, without having to maintain it outside the Jina ecosystem. The pull or listening mechanism could be a cron schedule, a message queue listener, or any other mechanism.
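To make the current state concrete, here is a minimal, hedged sketch of the pull-then-push loop a user must maintain outside the Flow today. `run_listener`, `fetch_batch`, and `forward` are hypothetical names, not Jina APIs: `fetch_batch` stands in for, e.g., a Kafka consumer poll, and `forward` for a Jina `Client.post` call; the in-memory topic exists only so the example runs without a broker.

```python
# Sketch of the driver-side "pull" loop the proposal wants to move inside the Flow.
import time
from typing import Callable, List

def run_listener(
    fetch_batch: Callable[[], List[dict]],   # extract: e.g. a Kafka consumer poll
    forward: Callable[[List[dict]], None],   # load: e.g. client.post('/index', docs)
    interval: float = 0.0,                   # cron-like polling interval in seconds
    max_polls: int = 3,                      # bounded here so the sketch terminates
) -> int:
    """Poll the source and forward non-empty batches; return batches forwarded."""
    forwarded = 0
    for _ in range(max_polls):
        batch = fetch_batch()
        if batch:                            # skip empty polls
            forward(batch)
            forwarded += 1
        time.sleep(interval)
    return forwarded

# Demo with an in-memory "topic" standing in for a real broker.
topic = [[{'id': 1}], [], [{'id': 2}, {'id': 3}]]
received: List[List[dict]] = []
n = run_listener(lambda: topic.pop(0) if topic else [], received.append)
```

A Listener Executor would, in effect, host this loop inside the Flow instead of in a separate driver process.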
I recently tried to create a Kafka-to-Kafka ETL and ran into some limitations and incompatibilities.
I have the feeling that this does not belong in Jina Core.
My explanation: Jina Core focuses on the serving part, where we have a clear contract (DocArray in, DocArray out), and in the core we did not implement anything event-based or any other such pattern.
However, we believe these things can be provided by the user in the driver program (where the client lives), or perhaps in the Gateway if we expose the custom Gateway concept.
Where I think this could fit is to have static helper functions on `docarray` to listen to topics and the like, so that users can use them in their driver program or Gateway more easily.
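A hedged sketch of what such `docarray` helpers might look like. Note that neither `docs_from_messages` nor `listen` exists in `docarray` today; both names, and the stubbed consumer, are assumptions used only to illustrate the "listen in the driver program" idea.

```python
# Hypothetical docarray-style helpers for consuming messages in a driver program.
from typing import Callable, Iterable, Iterator, List

def docs_from_messages(messages: Iterable[bytes]) -> List[dict]:
    """Turn raw message payloads into minimal document dicts."""
    return [{'text': m.decode('utf-8')} for m in messages]

def listen(consume: Callable[[], List[bytes]], batches: int) -> Iterator[List[dict]]:
    """Yield one document batch per poll of `consume`, for use in a driver loop."""
    for _ in range(batches):
        yield docs_from_messages(consume())

# Driver-program usage with a stubbed consumer standing in for Kafka/SQS/etc.
queue = [[b'hello'], [b'etl', b'pattern']]
batches = list(listen(lambda: queue.pop(0), batches=2))
```

Each yielded batch could then be sent to a Flow with the regular client, keeping the serving contract (DocArray in, DocArray out) untouched.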
I agree with @JoanFM to a degree, but on the other hand, floating Executors already allow for a paradigm that is not too far from what is discussed here.
If we were to enrich floating Executors with the ability to be triggered by external events, then I think we're basically already there. So maybe this would not be such a big step?
But yes, we should keep our focus on serving; it is probably not a good idea to re-invent all these other event-based MLOps frameworks.
I understand the arguments, and after going through the documentation I believe that most of the features required for the Background or Listener Executor exist in the current Flow. Some missing features:
- Without any `@requests` binding, requests from the Gateway will still pass through the Executor. We could provide the ability to create a proper standalone Executor that doesn't participate directly in each request to the Gateway.
- Extend the Executor Kubernetes conversion to support the CronJob workload type.
- With specific Kubernetes labels, the autoscaler can be customised to scale deployments up/down using external metrics/events.
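As a hedged illustration of the CronJob point, a converted pull Executor might be deployed with a manifest along these lines. The image name and schedule are placeholders, not anything Jina's Kubernetes export emits today:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: pull-executor              # hypothetical standalone pull Executor
spec:
  schedule: "*/5 * * * *"          # cron-style pull schedule
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: executor
              image: my-pull-executor:latest   # placeholder image
          restartPolicy: OnFailure
```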
What would this Executor do? I feel that we are abusing the Executor concept if we get away from the basic `@requests` DocArray-in, DocArray-out interface.
As I understand it, the motivation is to have the Gateway or your driver program send DocArrays in response to some event-based framework, right? I do not see how an Executor would help here.
I think that a custom Gateway, together with helper functions in `docarray`, is the best way to handle this.
As for the Kubernetes things, I agree that once we have the custom Gateway, many new doors may open for us to provide such features.
@jina-ai/product This issue is stale because it has been open 90 days with no activity. Remove the stale label or comment, or this will be closed in 14 days.
I think this can be done in the custom Gateway, and we can see if we can super-power `docarray` after v2.