[Bug] [Connector-Kudu] Fix Doris auto table from Kudu STRING being created as CHAR(16)

Open yzeng1618 opened this issue 2 weeks ago • 1 comments

https://github.com/apache/seatunnel/issues/10174

Purpose of this pull request

This pull request fixes an incorrect type mapping in the Kudu → Doris pipeline when Doris tables are auto-created from Kudu catalogs.

Previously, connector-kudu used Kudu’s internal typeSize as the logical columnLength for every column. For Type.STRING, typeSize is typically 16. When this value was propagated to Doris, the Doris sink treated these columns as short fixed-length strings and created CHAR(16) columns instead of unbounded STRING columns. This breaks real-world Kudu tables, where STRING columns often contain values much longer than 16 characters.

This PR:

  1. Updates KuduCatalog so that Kudu STRING columns no longer use typeSize as columnLength. Only non-STRING types keep using typeSize.
  2. Adds a unit test for KuduCatalog to ensure STRING columns are reported with columnLength = null.
  3. Adds an e2e assertion for Doris catalog to ensure that an upstream “unbounded string” column is created as Doris STRING (not CHAR(16)), and sourceType is string.
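
The rule in item 1 can be sketched as follows. This is a minimal illustration, not the actual KuduCatalog code: the KuduType enum, its sizes, and the columnLength helper are hypothetical stand-ins for Kudu's Type and the catalog's internal logic.

```java
// Illustrative sketch only: KuduType is a stand-in, not org.apache.kudu.Type.
class KuduColumnLengthSketch {

    enum KuduType {
        INT32(4), INT64(8), DOUBLE(8), STRING(16);

        final int typeSize; // internal storage size, not a logical length

        KuduType(int typeSize) {
            this.typeSize = typeSize;
        }
    }

    // Returns the logical column length passed downstream,
    // or null when the column has no length bound.
    static Long columnLength(KuduType type) {
        if (type == KuduType.STRING) {
            return null; // STRING is unbounded; typeSize must not leak out
        }
        return (long) type.typeSize; // non-STRING types keep using typeSize
    }

    public static void main(String[] args) {
        for (KuduType t : KuduType.values()) {
            System.out.println(t + " -> " + columnLength(t));
        }
    }
}
```

With this rule a STRING column reports no length, so a downstream sink has nothing to mistake for a 16-character limit.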

Does this PR introduce any user-facing change?

Yes.

Previous behavior

  • When using Kudu as the source and Doris as the sink with schema auto-creation (schema_save_mode = "RECREATE_SCHEMA" or "CREATE_SCHEMA_WHEN_NOT_EXIST"), Kudu STRING columns were created in Doris as CHAR(16).
  • This could lead to:
    • Truncation or write failures for values longer than 16 characters.
    • Mismatched schema between Kudu and Doris: developers expect STRING or large VARCHAR, but get fixed-length CHAR(16).

New behavior

  • Kudu STRING columns are now exposed from KuduCatalog with no logical length (columnLength = null).
  • Doris sink maps these columns to Doris STRING type (internally using Doris’ MAX_STRING_LENGTH), not CHAR(16).
  • Existing Doris tables are not modified by this PR; the change only affects how new tables are auto-created from Kudu catalogs.
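
To illustrate why the reported length matters on the sink side, here is a hedged sketch of a length-based type choice. It is not the Doris sink's real converter: the dorisTypeFor helper and the exact branching are assumptions, and the CHAR and VARCHAR limits are the commonly documented Doris bounds.

```java
// Illustrative sketch only: not SeaTunnel's AbstractDorisTypeConverter.
class DorisStringTypeSketch {

    static final long MAX_CHAR_LENGTH = 255;      // assumed Doris CHAR upper bound
    static final long MAX_VARCHAR_LENGTH = 65533; // assumed Doris VARCHAR upper bound

    // Picks a Doris column type for an upstream string column
    // from its logical length (null means "no length reported").
    static String dorisTypeFor(Long columnLength) {
        if (columnLength == null
                || columnLength <= 0
                || columnLength > MAX_VARCHAR_LENGTH) {
            return "STRING"; // unbounded or too long for VARCHAR
        }
        if (columnLength <= MAX_CHAR_LENGTH) {
            return "CHAR(" + columnLength + ")";
        }
        return "VARCHAR(" + columnLength + ")";
    }

    public static void main(String[] args) {
        System.out.println(dorisTypeFor(16L));  // length leaked from typeSize
        System.out.println(dorisTypeFor(null)); // length omitted by the catalog
    }
}
```

Under these assumptions, a length of 16 selects CHAR(16), while a null length selects STRING, which matches the previous and new behavior described above.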

How was this patch tested?

  1. Unit test

    • Added KuduCatalogTest in connector-kudu:

      • Mocks KuduClient and KuduTable with:
        • One INT32 column (id)
        • One STRING column (val_string)
      • Calls KuduCatalog.getTable and verifies:
        • Non-STRING column id keeps a non-null columnLength.
        • STRING column val_string has columnLength == null.
  2. Doris e2e test

    • Extended DorisCatalogIT in connector-doris-e2e with testCreateTableWithUnboundedStringColumn:
      • Builds an upstream CatalogTable with:
        • k1 as INT primary key.
        • k2 as STRING with columnLength = null (simulating KuduCatalog’s behavior).
      • Uses Doris sink schema_save_mode to auto-create test.unbounded_string.
      • Reads the table via DorisCatalog and asserts that:
        • Column name is k2.
        • Logical type is BasicType.STRING_TYPE.
        • sourceType is string (case-insensitive).
        • If columnLength is present, it is greater than 16 (preventing regression to CHAR(16)).
  3. Existing e2e

    • Ran existing connector Kudu/Doris e2e suites locally to ensure no regressions in other scenarios.
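
The checks made on column k2 in the Doris e2e test can be sketched in plain Java. The ColumnInfo class and isUnboundedString helper are hypothetical; the real test reads the column through DorisCatalog and uses the project's own assertion utilities.

```java
// Illustrative sketch only: ColumnInfo stands in for the column metadata
// read back from the Doris catalog.
class UnboundedStringCheckSketch {

    static final class ColumnInfo {
        final String name;
        final String sourceType;
        final Long columnLength; // may be null when the catalog reports no length

        ColumnInfo(String name, String sourceType, Long columnLength) {
            this.name = name;
            this.sourceType = sourceType;
            this.columnLength = columnLength;
        }
    }

    // True when the column looks like an unbounded Doris STRING named k2.
    static boolean isUnboundedString(ColumnInfo c) {
        boolean nameOk = "k2".equals(c.name);
        boolean typeOk = "string".equalsIgnoreCase(c.sourceType);
        // A missing length is fine; a present length must exceed 16
        // so that a regression to CHAR(16) is caught.
        boolean lengthOk = c.columnLength == null || c.columnLength > 16;
        return nameOk && typeOk && lengthOk;
    }
}
```

Treating the length as optional keeps the check valid whether or not the catalog reports a length for STRING columns.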

Check list

  • [ ] If any new Jar binary package is added in your PR, please add a License Notice according to the New License Guide
  • [ ] If necessary, please update the documentation to describe the new feature. https://github.com/apache/seatunnel/tree/dev/docs
  • [ ] If necessary, please update incompatible-changes.md to describe the incompatibility caused by this PR.
  • [x] This PR only touches existing connectors (Kudu, Doris) and does not add new connector jars:
    • No changes needed for plugin-mapping.properties.
    • No changes needed for seatunnel-dist/pom.xml.
    • No new CI label required in .github/workflows/labeler/label-scope-conf.yml.
    • E2E tests have been added/extended under seatunnel-e2e/seatunnel-connector-v2-e2e/.

yzeng1618 • Dec 10 '25 06:12

@yzeng1618 Will other Kudu field types, such as BINARY, also have this issue?

zhangshenghang • Dec 11 '25 14:12

Kudu BINARY is handled differently:

  • KuduTypeMapper maps Type.BINARY to SeaTunnel BYTES (PrimitiveByteArrayType), not to STRING.

  • In AbstractDorisTypeConverter.sampleReconvert, BYTES is always converted directly to Doris STRING and does not go through the length‑based logic that turns things into CHAR(16) / VARCHAR(16).

So BINARY columns will be created as STRING in Doris, not CHAR(16), and they don’t suffer from the same “fixed length 16” issue.
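
The difference between the two paths can be sketched as a type dispatch in which BYTES returns before any length is consulted. This is an assumption-laden illustration, not the code of AbstractDorisTypeConverter.sampleReconvert: the SeaTunnelType enum, the reconvert signature, and the 255 cutoff are hypothetical.

```java
// Illustrative sketch only: a simplified stand-in for the sink's type dispatch.
class DorisReconvertSketch {

    enum SeaTunnelType { STRING, BYTES }

    static String reconvert(SeaTunnelType type, Long columnLength) {
        switch (type) {
            case BYTES:
                // Kudu BINARY arrives as BYTES; the length is never inspected.
                return "STRING";
            case STRING:
                if (columnLength == null) {
                    return "STRING";
                }
                // Only STRING goes through length-based selection (cutoff assumed).
                return (columnLength <= 255 ? "CHAR(" : "VARCHAR(") + columnLength + ")";
            default:
                throw new IllegalArgumentException("Unhandled type: " + type);
        }
    }
}
```

In this sketch a BYTES column maps to STRING even if a length of 16 is supplied, whereas a STRING column with the same length would become CHAR(16).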

yzeng1618 • Dec 12 '25 02:12