Introduce a Background or Listener Executor to support Extract, Transform and Load (ETL) pattern

Open girishc13 opened this issue 2 years ago • 5 comments

Describe the feature

The ETL pattern is popular today because many platforms generate data, transform it, and load it into the required database. It is also widely used in large organizations where multiple teams each specialize in one stage, connected through messaging platforms such as Kafka, SQS, RabbitMQ, and JMS. The core of the pattern is to decouple tasks so that small teams can each focus on a subset of the problem: the engineering team can focus on the extract and load tasks while one or more AI teams focus on enriching the data. This supports the cross-modal use case.

Your proposal

The Jina framework currently supports a push mechanism: it accepts a DocArray, pushes the data through subsequent Executors, and eventually loads it into the databases. A pull client for extracting data currently has to be created outside the Flow. A Background or Listener Executor could support the pull mechanism from within the Flow, so the client would not have to be maintained outside the Jina ecosystem. The pull or listening mechanism could be a cron schedule, a message queue listener, or any other mechanism.


girishc13 avatar Sep 15 '22 12:09 girishc13

I recently tried to create a Kafka-to-Kafka ETL and ran into some limitations or incompatibilities.

girishc13 avatar Sep 15 '22 13:09 girishc13

I have the feeling that this does not belong to Jina Core.

My explanation: Jina Core focuses on the serving part, where we have a clear contract (DocArray IN, DocArray OUT), and in the core we have not implemented any event-based or other patterns.

However, we believe these things can be provided by the user in the driver program (where the client lives) or perhaps in the Gateway if we expose the custom gateway concept.

Where I think it could fit is as static helper functions in docarray for listening to topics and similar tasks, so that users can handle this more easily in their driver program or gateway.

JoanFM avatar Sep 15 '22 14:09 JoanFM
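
The driver-program approach suggested in the comment above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not Jina or DocArray API: `listen_and_forward`, `fake_topic`, and `send` are hypothetical names, a standard-library `queue.Queue` stands in for a Kafka topic, and `send` stands in for whatever call pushes a batch into the Flow.

```python
import queue
from typing import Any, Callable, List


def listen_and_forward(
    topic: "queue.Queue[Any]",
    send: Callable[[List[Any]], None],
    batch_size: int = 2,
    poll_timeout: float = 0.1,
) -> int:
    """Pull messages from `topic` and push them to `send` in batches.

    Returns the number of messages forwarded. Stops once no message
    arrives within `poll_timeout` seconds; a real listener would keep polling.
    """
    batch: List[Any] = []
    forwarded = 0
    while True:
        try:
            message = topic.get(timeout=poll_timeout)
        except queue.Empty:
            break  # topic drained for this sketch
        batch.append(message)
        if len(batch) >= batch_size:
            send(batch)
            forwarded += len(batch)
            batch = []  # start a fresh list so the sent batch is not mutated
    if batch:  # flush the final partial batch
        send(batch)
        forwarded += len(batch)
    return forwarded


# Hypothetical usage: three messages, forwarded in batches of two.
fake_topic: "queue.Queue[str]" = queue.Queue()
for text in ("a", "b", "c"):
    fake_topic.put(text)

received: List[List[str]] = []  # stand-in for the Flow's input
count = listen_and_forward(fake_topic, received.append)
```

In a real setup the `send` callable would wrap each batch into a DocArray and post it to the Flow, which keeps the Executors on the DocArray IN, DocArray OUT contract while the pull logic lives in the driver program.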

I agree with @JoanFM to a degree, but on the other hand floating Executors already allow for a paradigm that is not too far from what is discussed here.

If we were to enrich floating Executors with the ability to be triggered by external events, then I think we're basically already there. So maybe this would not be such a big step?

But yes, we should keep our focus on the serving - it is probably not a good idea to re-invent all these other event-based MLOps frameworks.

JohannesMessner avatar Sep 15 '22 15:09 JohannesMessner

I understand the arguments, and after going through the documentation I believe that most of the features required for the Background or Listener Executor already exist in the current Flow. Some missing features:

  • Even without any @requests binding, requests from the gateway still pass through the Executor. We could provide a way to create a properly standalone Executor that does not participate directly in each request to the Gateway.
  • Extend the Executor k8s conversion to support CronJob-type workloads.
  • With specific k8s labels, the autoscaler can be customized to scale deployments up or down using external metrics/events.

girishc13 avatar Sep 15 '22 15:09 girishc13
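
To make the CronJob item above concrete, here is a rough sketch of the kind of manifest such a conversion might emit. It is not Jina's Kubernetes export: `cronjob_manifest`, the job name, and the image are hypothetical, and only the standard Kubernetes `batch/v1` CronJob fields are used.

```python
import json
from typing import Any, Dict


def cronjob_manifest(name: str, image: str, schedule: str) -> Dict[str, Any]:
    """Build a minimal Kubernetes CronJob manifest as a plain dict."""
    return {
        "apiVersion": "batch/v1",
        "kind": "CronJob",
        "metadata": {"name": name},
        "spec": {
            "schedule": schedule,  # standard five-field cron expression
            "jobTemplate": {
                "spec": {
                    "template": {
                        "spec": {
                            "containers": [{"name": name, "image": image}],
                            # Job pods must use Never or OnFailure
                            "restartPolicy": "OnFailure",
                        }
                    }
                }
            },
        },
    }


# Hypothetical values: a puller that runs every five minutes.
manifest = cronjob_manifest(
    "etl-puller", "example.invalid/etl-puller:latest", "*/5 * * * *"
)
print(json.dumps(manifest, indent=2))
```

The printed JSON can be applied with `kubectl apply -f`, since Kubernetes accepts JSON manifests as well as YAML.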

What would this Executor do? I feel that we would be abusing the Executor concept if we moved away from the basic DocArray IN, DocArray OUT requests interface.

As I understand it, the motivation is to have the gateway or your driver program send DocArrays in response to some event-based framework, right? I do not see how an Executor would help here.

I think that a Custom Gateway plus helper functions in docarray is the best way to handle this.

As for the K8s points, I agree: once we have CustomGateway, many new doors may open for us to provide such features.

JoanFM avatar Sep 15 '22 15:09 JoanFM

@jina-ai/product This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 14 days

jina-bot avatar Dec 15 '22 00:12 jina-bot

I think this can be done in CustomGateway, and we can see if we can superpower docarray after v2

JoanFM avatar Dec 15 '22 17:12 JoanFM