[Improve][HTTP Connector] Add specified field function for all HTTP connector

Open TaoZex opened this issue 2 years ago • 3 comments

Search before asking

  • [X] I had searched in the feature and found no similar feature requirement.

Description

Currently, some HTTP requests return data that the connector cannot parse, such as nested array<Object> data.

[image]

We need a way to specify a field, such as users in the figure above, so that we can configure the schema for users.

Usage Scenario

No response

Related issues

No response

Are you willing to submit a PR?

  • [X] Yes I am willing to submit a PR!

Code of Conduct

TaoZex avatar Nov 24 '22 17:11 TaoZex

Please describe in detail your requirements.

liugddx avatar Nov 25 '22 01:11 liugddx

How to specify the field? What's the parameter name of it? Can you describe the scope of influence of this design? What modifications do we need to make?

EricJoy2048 avatar Nov 25 '22 03:11 EricJoy2048

I have some questions:

  1. As you have shown, a new parameter would tell the connector to read only part of the upstream data. But this only works for the json format. How should the feature behave for the text format?
  2. If the user wants to read more than one part of the upstream data, how does this feature work?
  3. If the part the user wants is only a string, an integer, a double, etc. rather than a list, or if every item in the list itself contains arrays of objects, how does the schema apply to the data?

TyrantLucifer avatar Nov 25 '22 14:11 TyrantLucifer

I understand the problem you pointed out, but there is no good plan at present. I suggest that we discuss it together at the next weekly meeting.

TaoZex avatar Nov 28 '22 10:11 TaoZex

Replying to @TyrantLucifer's three questions:

  1. I think most SaaS APIs return data in JSON format; text format is rare. If it comes up, we can handle it separately.
  2. In effect, each content field to be read is equivalent to a table. The user can treat it as a table to read, and we can configure a schema for it. SeaTunnel currently supports reading only one table per connector, so we support only one content field per connector. If we later support reading multiple tables in one connector, we can also support defining multiple content fields.
  3. Users can use schema to define the schema of the content field. If the user does not configure a schema, we can assume they only want to read basic data such as string/integer/long, with the column name the same as the content field.

I suggest we only support basic array types in this case. Here is an example.

{
"xxx":"xxx",
"users":[
    {
       "id":1,
       "name":"n1",
       "int_list": [1,2,3],
       "json_list":[{"n1":"v1", "n2":"v2"}]
    }
]
}
schema: {
    id: int,
    name: string,
    int_list: array<int>,
    json_list: array<string>
}
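
To make the proposal above concrete, here is a minimal Python sketch of how a content field could be read as a table and cast with the schema. It is only an illustration under assumptions: the helper names `cast` and `read_content_field` are hypothetical and are not part of SeaTunnel, and keeping nested objects as JSON text is one possible reading of `json_list: array<string>`.

```python
import json

# Response body copied from the example above.
response = json.loads("""
{
  "xxx": "xxx",
  "users": [
    {
      "id": 1,
      "name": "n1",
      "int_list": [1, 2, 3],
      "json_list": [{"n1": "v1", "n2": "v2"}]
    }
  ]
}
""")

# Schema from the example above: column name -> declared type.
schema = {
    "id": "int",
    "name": "string",
    "int_list": "array<int>",
    "json_list": "array<string>",
}


def cast(value, declared_type):
    """Cast one JSON value to a declared basic type (hypothetical helper)."""
    if declared_type == "int":
        return int(value)
    if declared_type == "string":
        # Assumption: nested objects are kept as JSON text, not parsed further.
        return value if isinstance(value, str) else json.dumps(value)
    if declared_type.startswith("array<") and declared_type.endswith(">"):
        element_type = declared_type[len("array<"):-1]
        return [cast(item, element_type) for item in value]
    raise ValueError(f"unsupported type: {declared_type}")


def read_content_field(body, content_field, schema):
    """Treat body[content_field] as a table and return one dict per row."""
    return [
        {column: cast(item[column], declared) for column, declared in schema.items()}
        for item in body[content_field]
    ]


rows = read_content_field(response, "users", schema)
print(rows)
```

Only the `users` array becomes rows; the sibling key `xxx` is ignored, which is the point of specifying a content field.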

EricJoy2048 avatar Nov 28 '22 11:11 EricJoy2048

Hi, I think we should focus on the data we need. json-path can extract the data we need and filter out the rest, which reduces the work of configuring the schema. Please see PR #3510, @TaoZex.

liugddx avatar Nov 28 '22 12:11 liugddx

Thanks.

TaoZex avatar Nov 28 '22 13:11 TaoZex

json-path is a good way to read irregular JSON nodes from a JSON document. However, json-path requires users to understand regular expressions and write a lot of configuration. If there are many JSON nodes to read, this method is not friendly. For those who only need to read the data of one JSON node and its child nodes, the content-field method will be simpler and friendlier.

EricJoy2048 avatar Nov 29 '22 04:11 EricJoy2048

  1. Regarding the difficulty of use, we can rely on publicly available tools such as https://jsonpath.com/ to help write and test expressions.
  2. Regarding reading several JSON nodes, I think this solution can also handle it: we can configure an expression for each node.

[image]

liugddx avatar Nov 29 '22 05:11 liugddx

Regarding the second point, it can be completed in the next step.

liugddx avatar Nov 29 '22 05:11 liugddx

Thanks, jsonpath is a good tool and we can use it. Another question: how does the connector know whether $.phoneNumbers should be read as a string or as a table with the columns type and number?

EricJoy2048 avatar Nov 29 '22 06:11 EricJoy2048

This needs to be resolved through the schema. json-path does not need to care about the returned type; it is only responsible for simplifying the returned data. In addition, we can read the individual fields in phoneNumbers. For example:

source {
  Http {
    url = "http://mockserver:1080/jsonpath/mock"
    method = "GET"
    format = "json"
    json_field = {
      type = "$.phoneNumbers[*].type"
      number = "$.phoneNumbers[*].number"
    }
    schema = {
      fields {
        type = string
        number = string
      }
    }
  }
}
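
As a rough illustration of the json_field idea, the Python sketch below evaluates each column's path separately and then pairs the results by position to form rows. Everything here is an assumption for the sake of the example: the response data is invented (the thread does not show what the mock server returns), `evaluate` and `to_rows` are hypothetical helpers, and `evaluate` supports only a tiny subset of JSONPath rather than the real json-path library.

```python
import re

# Hypothetical response body; not taken from the thread.
response = {
    "phoneNumbers": [
        {"type": "iPhone", "number": "0123-4567-8888"},
        {"type": "home", "number": "0123-4567-8910"},
    ]
}

# Column name -> path, mirroring the json_field block above.
json_field = {
    "type": "$.phoneNumbers[*].type",
    "number": "$.phoneNumbers[*].number",
}


def evaluate(path, document):
    """Evaluate a tiny JSONPath subset: '$', '.name' and '[*]' only."""
    if not path.startswith("$"):
        raise ValueError(f"path must start with '$': {path}")
    matches = [document]
    for name, wildcard in re.findall(r"\.(\w+)|(\[\*\])", path[1:]):
        if wildcard:
            # '[*]' fans out over every element of each matched list.
            matches = [item for match in matches for item in match]
        else:
            matches = [match[name] for match in matches]
    return matches


def to_rows(json_field, document):
    """Evaluate one path per column, then pair the values by position."""
    columns = {name: evaluate(path, document) for name, path in json_field.items()}
    if len({len(values) for values in columns.values()}) > 1:
        raise ValueError("columns returned different numbers of values")
    return [dict(zip(columns, values)) for values in zip(*columns.values())]


rows = to_rows(json_field, response)
print(rows)
```

Pairing by position only makes sense when every path yields the same number of values, so the sketch raises an error otherwise instead of silently misaligning columns.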

liugddx avatar Nov 29 '22 06:11 liugddx

Configure the type by schema.

liugddx avatar Nov 29 '22 06:11 liugddx

$.phoneNumbers

How should we configure it when we want to read $.phoneNumbers as a table?

json_field= {
    $.phoneNumbers
}

EricJoy2048 avatar Nov 29 '22 06:11 EricJoy2048

Hi @liugddx, can you post the result of this discussion here? Sending it to the mailing list as well would be even better.

EricJoy2048 avatar Nov 30 '22 09:11 EricJoy2048

After discussion, there are now two solutions:

  1. Provide a flexible json-path configuration, json_field [Config]. This parameter helps you configure the schema, so it must be used together with schema. If your data looks something like this:
{
  "store": {
    "book": [
      {
        "category": "reference",
        "author": "Nigel Rees",
        "title": "Sayings of the Century",
        "price": 8.95
      },
      {
        "category": "fiction",
        "author": "Evelyn Waugh",
        "title": "Sword of Honour",
        "price": 12.99
      }
    ],
    "bicycle": {
      "color": "red",
      "price": 19.95
    }
  },
  "expensive": 10
}

You can get the contents of 'book' by configuring the task as follows:

source {
  Http {
    url = "http://mockserver:1080/jsonpath/mock"
    method = "GET"
    format = "json"
    json_field = {
      category = "$.store.book[*].category"
      author = "$.store.book[*].author"
      title = "$.store.book[*].title"
      price = "$.store.book[*].price"
    }
    schema = {
      fields {
        category = string
        author = string
        title = string
        price = string
      }
    }
  }
}
  2. Provide the ability to read part of the JSON, content_field [String]. This parameter extracts a portion of the JSON data. If you only need the data in the 'book' section, configure content_field = "$.store.book.*". Suppose the returned data looks something like this:
{
  "store": {
    "book": [
      {
        "category": "reference",
        "author": "Nigel Rees",
        "title": "Sayings of the Century",
        "price": 8.95
      },
      {
        "category": "fiction",
        "author": "Evelyn Waugh",
        "title": "Sword of Honour",
        "price": 12.99
      }
    ],
    "bicycle": {
      "color": "red",
      "price": 19.95
    }
  },
  "expensive": 10
}

You can configure content_field = "$.store.book.*" and the result returned looks like this:

[
  {
    "category": "reference",
    "author": "Nigel Rees",
    "title": "Sayings of the Century",
    "price": 8.95
  },
  {
    "category": "fiction",
    "author": "Evelyn Waugh",
    "title": "Sword of Honour",
    "price": 12.99
  }
]

Then you can get the desired result with a simpler schema, like this:

Http {
  url = "http://mockserver:1080/contentjson/mock"
  method = "GET"
  format = "json"
  content_field = "$.store.book.*"
  schema = {
    fields {
      category = string
      author = string
      title = string
      price = string
    }
  }
}
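
For comparison, here is a small Python sketch of what content_field = "$.store.book.*" is meant to do: select the elements under book first, then apply the flat schema to each element. This is a sketch under assumptions, not the connector's implementation: `select` is a hypothetical helper that understands only '$', '.name' and '.*', and casting every column with `str` simply mirrors the all-string schema above.

```python
# Response body from the example above, written as a Python literal.
response = {
    "store": {
        "book": [
            {"category": "reference", "author": "Nigel Rees",
             "title": "Sayings of the Century", "price": 8.95},
            {"category": "fiction", "author": "Evelyn Waugh",
             "title": "Sword of Honour", "price": 12.99},
        ],
        "bicycle": {"color": "red", "price": 19.95},
    },
    "expensive": 10,
}


def select(path, document):
    """Evaluate a tiny JSONPath subset: '$', '.name' and '.*' only."""
    if not path.startswith("$"):
        raise ValueError(f"path must start with '$': {path}")
    matches = [document]
    for step in path[1:].split(".")[1:]:
        if step == "*":
            # '.*' yields every element of a list or every value of an object.
            matches = [
                child
                for match in matches
                for child in (match.values() if isinstance(match, dict) else match)
            ]
        else:
            matches = [match[step] for match in matches]
    return matches


# Columns from the schema above; every column is declared as string.
fields = ["category", "author", "title", "price"]

books = select("$.store.book.*", response)
rows = [{name: str(book[name]) for name in fields} for book in books]
print(rows)
```

Compared with json_field, a single path narrows the document here, and the column names come from the schema alone.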

liugddx avatar Nov 30 '22 09:11 liugddx

Closed by https://github.com/apache/incubator-seatunnel/issues/3500

TaoZex avatar Dec 06 '22 06:12 TaoZex