eventmesh [ISSUE #4824] Add HTTP Sink Connector

Fixes #4824

Motivation

[Feature] Add HTTP Sink Connector #4824

Modifications

Add the relevant code for the HTTP Sink Connector.
Write HttpSinkConnectorTest to test the functionality of the HTTP Sink Connector.
Upgrade and add dependencies to ensure the code runs successfully.

Documentation

Does this pull request introduce a new feature? (yes)
If yes, how is the feature documented? (not applicable / docs / JavaDocs / not documented)
If a feature is not applicable for documentation, explain why?
If a feature is not documented yet in this PR, please create a followup issue for adding the documentation

Apr 14 '24 10:04 cnzakii

Please consider supporting Webhook and HTTPS/SSL, and provide corresponding options for users in yml file.

OK, I will continue to improve the relevant functions. Does a new issue need to be created for enhancement or is it done in this PR ?

Apr 14 '24 10:04 cnzakii

Does a new issue need to be created for enhancement or is it done in this PR ?

You may directly add more commits in this PR.

Apr 14 '24 10:04 Pil0tXia

@Pil0tXia

I understand that a webhook is a callback function in HTTP (where a server sends a request to a client). Why does an Http Sink Connector need this capability? At the moment, I'm also struggling to think of a universal function that can handle different callback data.

Apr 14 '24 12:04 cnzakii

In order to display Webhook callback response body, you may need to store them in a memory queue, and leave a comment indicating its future use.

@Pil0tXia I have another question, the data passed must be different for different usage scenarios, so it seems difficult to specify a generic function that handles the callback data efficiently. Or do we just need to store the callback data for now and leave the rest of the work to other parts?

Apr 14 '24 14:04 cnzakii

@cnzakii

the data passed must be different for different usage scenarios

Are you referring to the ConnectRecord's data field being of type Object? You can serialize the data as String. vertx has type inference to further determine if the text is formatted as json or plain text if you need to.

so it seems difficult to specify a generic function that handles the callback data efficiently

May you please describe in more detail the problems encountered in this part? If you're referring to the "Webhook has a lot of protocols" issue, you may check out the org.apache.eventmesh.webhook.receive.protocol package, considering supporting some default protocols, and providing a template for users to implement their own protocols.

However, does the HTTP Sink Connector, as an HTTP Client, need to implement a special protocol? That should be the job of an HTTP Server (HTTP Source Connector).

Apr 14 '24 14:04 Pil0tXia

The way I envision or understand for a client with webhook functionality is as follows:

The web client sends a request containing a ConnectRecord and a callback URL to the server, then listens on this callback URL. When changes occur on the server, data is passed to the client through this callback URL.

Based on this concept, the data structure passed from the server to the client via the callback URL is unknown, thus it's not feasible to establish a uniform data processing function.

I'm unsure whether my concept aligns with the intended goal.😶‍🌫️

Apr 14 '24 15:04 cnzakii

@cnzakii

Based on this concept, the data structure passed from the server to the client via the callback URL is unknown, thus it's not feasible to establish a uniform data processing function.

May I ask that why do you want to process data HTTP Sink Connector received? Basically we just need to store and display it to users and it is the user who should be the one to understand the response body.

Apr 14 '24 16:04 Pil0tXia

May I ask that why do you want to process data HTTP Sink Connector received? Basically we just need to store and display it to users and it is the user who should be the one to understand the response body.

OK, I get it, do both "storage" and "display" need to be done? Or should we store the data in the memory queue for now, and the "display" function will be implemented in the future or in other parts?

This is the webhook implementation I envisioned screenshot-20240415-113809

Apr 15 '24 01:04 cnzakii

@cnzakii

do both "storage" and "display" need to be done?

You can leave an HTTP endpoint to get all Webhook delivery response results (which is a memory object as I mentioned above).

This is the webhook implementation I envisioned

Regarding NoResponseQueue and ResponseQueue, I don't think it's necessary to read and write to NoResponseQueue frequently. You can put responses of a 200 status code received into a queue called successfulDeliveries and others into failedDeliveries, and retry failed deliveries later.

Apr 15 '24 13:04 Pil0tXia

Regarding NoResponseQueue and ResponseQueue, I don't think it's necessary to read and write to NoResponseQueue frequently. You can put responses of a 200 status code received into a queue called successfulDeliveries and others into failedDeliveries, and retry failed deliveries later.

@Pil0tXia

Regarding NoResponseQueue and ResponseQueue, my original intention was for NoResponseQueue to record requests that have not yet received callback responses, while ResponseQueue is intended to record callback responses. Currently, NoResponseQueue is only used for recording purposes and can likely be removed.

As for retry mechanisms, I believe that requests encountering errors at the network level can be retried, but retrying requests that have already received error status codes doesn't make much sense because servers typically respond identically to the same request.

Apr 15 '24 14:04 cnzakii

@cnzakii

That's a good idea, should we retry any failed request?

I would recommend this as a useful feature, but it is not a necessary task.

Re-initiating a request that failed at the network level may work, but re-initiating a request that returned an error status code will usually not change the result

You've found a good point. In that case, you can put network error requests into the third queue and only retry network errors. This also applies to ordinary HTTP requests.

The number of retry attempts can be set as a YAML parameter, and when its value is set to 0, it means no retries.

Additionally, we can introduce another parameter that allows the user to decide whether to only retry network errors or retry all requests with non-200 status codes (such as 429, 503 indicating that server is busy, while 400, 422, 500 may get the same results). By default, only network errors are retried.

Apr 15 '24 14:04 Pil0tXia

@Pil0tXia

In order to implement the webhook function, we must add the callback url to the request so that the server knows where it should send the callback data, then we need to modify the original request. Now I have several options:

Add request header: Callback-Url: http://example.com/callback
Modification request body, the following are several modification methods:

What do you think about this ?

Apr 16 '24 03:04 cnzakii

@cnzakii

callback url belongs to webhook function, and it isn't part of event data. So the former one is definitely better.

Apr 16 '24 06:04 Pil0tXia

callback url belongs to webhook function, and it isn't part of event data. So the former one is definitely better.

@Pil0tXia

Can you clarify which option it is? Because I proposed three options.

In terms of code implementation, it is simpler to add the callback url to the request header. You only need to use if to determine whether you need to add this request header, without modifying the original function or creating a new function to adapt to different request bodies.

Apr 16 '24 06:04 cnzakii

@cnzakii

You can add callback url both in header and body to achieve best compatibility.

If you believe that only header or body is needed to add, then one of them is enough.

Apr 16 '24 06:04 Pil0tXia

You can add callback url both in header and body to achieve best compatibility.

If you believe that only header or body is needed to add, then one of them is enough.

The header is carried by every HTTP request, so I think passing the callback url through the header is enough. The redundant design of declaring the callback url again in the body does not seem to improve compatibility, unless faced with A server that completely ignores HTTP headers.

Apr 16 '24 06:04 cnzakii

@cnzakii May I ask that what do you want to put in callbackUrl? It is the same concept as the payload server url and should be configured in YAML.

Apr 16 '24 07:04 Pil0tXia

May I ask that what do you want to put in callbackUrl? It is the same concept as the payload server url and should be configured in YAML.

@Pil0tXia

I think these are two different concepts because they have fundamental differences. The request address points to the server, while the callback address points to the client. Typically, the client and server are in a many-to-one relationship, where the client already knows where the server is before making the request. However, the server is waiting for requests from unknown clients. Therefore, when the server needs to pass callback data to an unknown client, it must know where the client is. This is achieved by notifying the server of the client's location via the callback address.

Adding the client's callback address through the server's configuration file can also achieve this purpose. However, we cannot add the callback addresses of all possible clients to the configuration file.

Apr 16 '24 07:04 cnzakii

I think these are two different concepts because they have fundamental differences.

"payload server url" means "the url of the server that is ready to receive the payload data from the sink connector", which is the same concept as callbackUrl. Relevant practice can be found at https://docs.github.com/en/webhooks/using-webhooks/creating-webhooks.

Adding the client's callback address through the server's configuration file can also achieve this purpose. However, we cannot add the callback addresses of all possible clients to the configuration file.

It depends on whether you want to add the callbackUrl when the sink connector is running, or configure the callbackUrl when the sink connector starts. The former way may be like this: https://help.smartling.com/hc/en-us/articles/1260805504109-Webhooks-and-Callbacks.

For the sink connector, EventMesh will use the latter approach (which is the GitHub way, configure callbackUrl in settings). This is because the sink connector can be started at a low cost with multiple nodes, each responsible for delivering messages of one topic. If users want to edit configurations (including callbackUrl), another connector can be deployed in 10s.

For the same sink connector, multiple callbackUrl can be configured as a list in YAML. Events sent to each callbackUrl are treated as one delivery. i.e., if the user configures 10 callbackUrls, for one event received by the sink connector, 10 deliveries will occur.

As a result, the callbackUrl header should not be sent to the event's delivery target, because the server (target) always responds to the client's request address (sink connector's address) in webhook.

Apr 16 '24 08:04 Pil0tXia

As a result, the callbackUrl header should not be sent to the event's delivery target, because the server (target) always responds to the client's request address (sink connector's address) in webhook.

Based on this usage scenario, we indeed don't need to pass the callback URL information. I will proceed to modify the relevant code accordingly. Thank you for your explanation.

Apr 16 '24 08:04 cnzakii

@Pil0tXia

You can do a code review first, and later I'll write unit tests to test it and commit it.

Apr 16 '24 14:04 cnzakii

This is an unexpected error. I encountered the same issue when running CI operations in my own repository. The error occurs consistently in the Linux environment but succeeds in macOS. Moreover, the failure happens specifically in the HttpSourceConnectorTest > testPoll() function, despite no modifications being made to the HttpSourceConnector code.

Apr 17 '24 03:04 cnzakii

@Pil0tXia

I am very confused, what is the wenhook function that you or the eventmesh community need HttpSinkConnector to have? According to your answer, I re-understood that the webhook you expected seems to be no different from the ordinary webClient. It just uses the put operation as the trigger condition of the event. When the event is triggered, the data is sent to the webhook server. This seems to be the function of sending requests to multiple servers. It seems that there is no need for a queue to store data, because HttpSinkConnector always acts as the data sender.

If convenient, you can use a flow chart to assist in presentation to avoid falling into the quagmire caused by inconsistent ideas.

Apr 17 '24 07:04 cnzakii

Webhook Sink Connector_2024-04-07-2134

Webhook is a special implementation of HTTP. As I said in https://github.com/apache/eventmesh/pull/4837#pullrequestreview-1999660259, if you think that it can be combined with HTTP in one implementation class, then you can use a normal HTTP client to do Webhook jobs.

However, Webhook connections often have a longer timeout (e.g. 10s for GitHub Webhook), some even support keep-alive connections to reduce performance overhead. So it might be hard to combine Webhook with normal HTTP requests.

I thought that Webhook already has a lot of mature designs, and excuse me for any confusion.

Apr 17 '24 08:04 Pil0tXia

@Pil0tXia

I believe I've figured it out, I've found a specification for a standard webhook (https://www.standardwebhooks.com/ , https://github.com/standard-webhooks/standard-webhooks/blob/main/spec/standard-webhooks.md) .

I'm not sure if it's really universal, but I think it will help me with my current job. It mentions payload, signature, failure retry, request timeout, etc. Are these all currently required by HttpSinkConnector ?

If we only need failure retry and request timeout, we can indeed share a webClient with ordinary HTTP. We only need to set different timeouts and failure retry mechanisms for different situations, and add a queue to store data (I'm not sure if responses from plain HTTP need to be stored as well), if a more complete webhook is needed, such as different payloads, security signatures, etc., then we should indeed build a separate WebhookSinkConnector.

I thought that Webhook already has a lot of mature designs

Are you referring to the implementation of open source webhook? Or something else related to webhook?

Apr 17 '24 09:04 cnzakii

@cnzakii

Glad I could help.

It mentions payload, signature, failure retry, request timeout, etc. Are these all currently required by HttpSinkConnector ?

[x] payload: Of course. Normal HTTP request needs it too.
[ ] signature: If you are interested in it, you can refer to org.apache.eventmesh.webhook.receive.protocol.GithubProtocol, which is a client-side signature verification implement. The Webhook server (such as GitHub and the HTTP Sink Connector) would define a specification, and the client (such as HTTP Source Connector and the existing eventmesh-webhook module) has to fit it. I personally think that SSL is secure enough.
[x] failure retry: Great feature. Usually a HTTP service should have retry ability.
[x] request timeout: Of course. Normal HTTP request needs it too.

Are you referring to the implementation of open source webhook? Or something else related to webhook?

I ever used many Webhook services, and their user experience are similar and great. I also built a uptime status monitor https://status.pil0txia.com/ with Webhook notifications.

Apr 17 '24 09:04 Pil0tXia

I also want to discuss with you about the structure of the payload, whether the payload structure of webhook requests is consistent with that of ordinary requests. I checked some webhook specifications, and they all add some additional information. The request body of a normal request is as follows:

Should webhook requests be the same as above, or should some additional information be added as below?

What information should be added to the payload if needed.

Apr 17 '24 10:04 cnzakii

Should webhook requests be the same as above, or should some additional information be added as below?

We can use the version below for both normal HTTP requests and Webhook payloads. More information is always better and it keeps data structure unified.

Apr 17 '24 10:04 Pil0tXia

For the retry mechanism, I'm going to use resilience4j-retry to implement it, do you have any better suggestions?

I see that there is also org.apache.eventmesh.retry, but this one seems to be for the Server section

Apr 17 '24 13:04 cnzakii

I'm going to use resilience4j-retry to implement it, do you have any better suggestions?

I think this artifact is easy to use and suitable for a connector.

org.apache.eventmesh.retry is not yet fully implemented. In EventMesh, ideally retries should be handled by the Runtime, not the Connector.You can raise an issue if you're interested and implement this in the next PR.

Apr 17 '24 14:04 Pil0tXia

eventmesh eventmesh copied to clipboard

[ISSUE #4824] Add HTTP Sink Connector

Motivation

Modifications

Documentation

eventmesh
eventmesh copied to clipboard