[Bug] [Connector-Kudu] Fix Doris auto table from Kudu STRING being created as CHAR(16)
https://github.com/apache/seatunnel/issues/10174
Purpose of this pull request
This pull request fixes an incorrect type mapping in the Kudu → Doris pipeline when Doris tables are auto-created from Kudu catalogs.
Previously, connector-kudu used Kudu’s internal typeSize for all columns as the logical columnLength. For Type.STRING, typeSize is typically 16. When this value was propagated to Doris, Doris sink treated these columns as short fixed-length strings and created CHAR(16) columns instead of STRING (unbounded) columns. This breaks real-world Kudu tables where STRING columns often contain values much longer than 16 characters.
This PR:
- Updates `KuduCatalog` so that Kudu `STRING` columns no longer use `typeSize` as `columnLength`. Only non-`STRING` types keep using `typeSize`.
- Adds a unit test for `KuduCatalog` to ensure `STRING` columns are reported with `columnLength = null`.
- Adds an e2e assertion for Doris catalog to ensure that an upstream “unbounded string” column is created as Doris `STRING` (not `CHAR(16)`), and `sourceType` is `string`.
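For reviewers who want the gist without opening the diff, the Kudu-side rule can be sketched roughly as follows. This is a minimal, self-contained illustration and not the actual `KuduCatalog` code: `SketchType`, its sizes, and `columnLengthFor` are hypothetical names that only mimic the `typeSize` idea described above.

```java
// Illustrative sketch only: SketchType and columnLengthFor are hypothetical
// stand-ins, not the real Kudu client or SeaTunnel classes.
class KuduLengthSketch {

    enum SketchType {
        INT32(4),
        INT64(8),
        STRING(16); // size of the internal descriptor, not a limit on the value length

        final int typeSize;

        SketchType(int typeSize) {
            this.typeSize = typeSize;
        }
    }

    /** Returns the logical column length, or null when the type has no fixed length. */
    static Long columnLengthFor(SketchType type) {
        if (type == SketchType.STRING) {
            // STRING is unbounded, so no length is reported downstream.
            return null;
        }
        // Every other type keeps using its typeSize, as before.
        return (long) type.typeSize;
    }

    public static void main(String[] args) {
        for (SketchType t : SketchType.values()) {
            System.out.println(t + " -> " + columnLengthFor(t));
        }
    }
}
```

The point of returning `null` rather than a large number is that downstream sinks can tell "no length" apart from "short length" and pick their own unbounded type.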
Does this PR introduce any user-facing change?
Yes.
Previous behavior
- When using Kudu as source and Doris as sink with schema auto-creation (`schema_save_mode = "RECREATE_SCHEMA"` or `CREATE_SCHEMA_WHEN_NOT_EXIST`), Kudu `STRING` columns were created in Doris as `CHAR(16)`.
- This could lead to:
  - Truncation or write failures for values longer than 16 characters.
  - Mismatched schema between Kudu and Doris: developers expect `STRING` or large `VARCHAR`, but get fixed-length `CHAR(16)`.
New behavior
- Kudu `STRING` columns are now exposed from `KuduCatalog` with no logical length (`columnLength = null`).
- Doris sink maps these columns to Doris `STRING` type (internally using Doris’ `MAX_STRING_LENGTH`), not `CHAR(16)`.
- Existing Doris tables are not modified by this PR; the change only affects how new tables are auto-created from Kudu catalogs.
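To show why a `null` length changes the outcome on the Doris side, here is a simplified sketch of a length-based string mapping. It is an assumption-laden illustration, not the real Doris type converter: `dorisTypeForString` and the `SKETCH_MAX_CHAR_LENGTH` cutoff are hypothetical, and the actual thresholds in the sink may differ.

```java
// Illustrative sketch only: the method name and the cutoff below are
// hypothetical and do not reproduce the real Doris sink converter.
class DorisStringSketch {

    // Assumed cutoff for this sketch; the real converter's limits may differ.
    static final long SKETCH_MAX_CHAR_LENGTH = 255;

    /** Picks a Doris column type for an upstream STRING column based on its length. */
    static String dorisTypeForString(Long columnLength) {
        if (columnLength == null || columnLength <= 0) {
            // No logical length: treat the column as an unbounded string.
            return "STRING";
        }
        if (columnLength <= SKETCH_MAX_CHAR_LENGTH) {
            // A short length looks like a fixed-length string.
            return "CHAR(" + columnLength + ")";
        }
        return "VARCHAR(" + columnLength + ")";
    }

    public static void main(String[] args) {
        System.out.println("length 16   -> " + dorisTypeForString(16L));
        System.out.println("length null -> " + dorisTypeForString(null));
    }
}
```

In this sketch a length of 16 yields `CHAR(16)`, which mirrors the old behavior, while a `null` length yields `STRING`, which mirrors the new behavior.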
How was this patch tested?
- Unit test
  - Added `KuduCatalogTest` in `connector-kudu`:
    - Mocks `KuduClient` and `KuduTable` with:
      - One `INT32` column (`id`)
      - One `STRING` column (`val_string`)
    - Calls `KuduCatalog.getTable` and verifies:
      - Non-`STRING` column `id` keeps a non-null `columnLength`.
      - `STRING` column `val_string` has `columnLength == null`.
- Doris e2e test
  - Extended `DorisCatalogIT` in `connector-doris-e2e` with `testCreateTableWithUnboundedStringColumn`:
    - Builds an upstream `CatalogTable` with:
      - `k1` as `INT` primary key.
      - `k2` as `STRING` with `columnLength = null` (simulating KuduCatalog’s behavior).
    - Uses Doris sink `schema_save_mode` to auto-create `test.unbounded_string`.
    - Reads the table via `DorisCatalog` and asserts that:
      - Column name is `k2`.
      - Logical type is `BasicType.STRING_TYPE`.
      - `sourceType` is `string` (case-insensitive).
      - If `columnLength` is present, it is greater than 16 (preventing regression to `CHAR(16)`).
- Existing e2e
  - Ran existing connector Kudu/Doris e2e suites locally to ensure no regressions in other scenarios.
Check list
- [ ] If any new Jar binary package is added in your PR, please add a License Notice according to the New License Guide
- [ ] If necessary, please update the documentation to describe the new feature. https://github.com/apache/seatunnel/tree/dev/docs
- [ ] If necessary, please update `incompatible-changes.md` to describe the incompatibility caused by this PR.
- [x] This PR only touches existing connectors (Kudu, Doris) and does not add new connector jars:
  - No changes needed for `plugin-mapping.properties`.
  - No changes needed for `seatunnel-dist/pom.xml`.
  - No new CI label required in `.github/workflows/labeler/label-scope-conf.yml`.
  - E2E tests have been added/extended under `seatunnel-e2e/seatunnel-connector-v2-e2e/`.
@yzeng1618 Will other Kudu field types, such as BINARY, also have this issue?
Kudu BINARY is handled differently:
- KuduTypeMapper maps Type.BINARY to SeaTunnel BYTES (PrimitiveByteArrayType), not to STRING.
- In AbstractDorisTypeConverter.sampleReconvert, BYTES is always converted directly to Doris STRING and does not go through the length-based logic that turns things into CHAR(16) / VARCHAR(16).
So BINARY columns will be created as STRING in Doris, not CHAR(16), and they don’t suffer from the same “fixed length 16” issue.