
What's the root cause of the wiki's recommendation: "Pause consumption when adding a new column"?

Open wirybeaver opened this issue 2 years ago • 4 comments

The ingestion transformation wiki page mentions that:

If a new column is added to table or schema configuration during ingestion, incorrect data may appear in the consuming segment.

To ensure accurate values are reloaded, do the following: Pause consumption (and wait for pause status success)

Is this limited to tables that have an ingestion transform config? If not, it seems like a breaking change, since we would need to pause consumption for every schema update.

Another question: in which scenarios will incorrect data appear?

wirybeaver, Mar 14 '24 16:03

I think the doc might be outdated. When the table config or schema is updated (e.g. a new column or index is added), a reload is required for segments to pick up the new config and generate the new column or index as needed. For a consuming segment, a reload (with the includingConsuming flag set) will try to force-commit the segment and start a new consuming segment so that the config change can be picked up. Do you want to help try this out and revise the documentation? cc @kelseiv

Jackie-Jiang, Mar 18 '24 06:03
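A minimal sketch of the reload flow described in the comment above, for anyone who wants to try it out. It only builds the controller request URLs and sends nothing. The endpoint paths (`/segments/{table}/reload`, `/tables/{table}/forceCommit`), the `type` query parameter, the controller address, and the table name `myTable` are assumptions here and should be checked against your controller's Swagger UI. The consuming-segment flag mentioned above is deliberately left out because its exact parameter name is not confirmed in this thread; the force commit is shown as a separate, explicit step instead.

```python
from urllib.parse import quote, urlencode


def reload_url(controller: str, table: str, table_type: str = "REALTIME") -> str:
    """Build the URL for a segment reload request (assumed endpoint path)."""
    base = controller.rstrip("/")
    query = urlencode({"type": table_type})  # "type" parameter is an assumption
    return f"{base}/segments/{quote(table, safe='')}/reload?{query}"


def force_commit_url(controller: str, table: str) -> str:
    """Build the URL for a force-commit request (assumed endpoint path)."""
    base = controller.rstrip("/")
    return f"{base}/tables/{quote(table, safe='')}/forceCommit"


def schema_change_plan(controller: str, table: str) -> list[tuple[str, str]]:
    """Requests to issue after the schema or table config has been updated.

    1. Reload, so existing segments pick up the new column or index.
    2. Force commit, so a fresh consuming segment starts with the new config.
    """
    return [
        ("POST", reload_url(controller, table)),
        ("POST", force_commit_url(controller, table)),
    ]


if __name__ == "__main__":
    # Hypothetical controller address and table name, for illustration only.
    for method, url in schema_change_plan("http://localhost:9000", "myTable"):
        print(method, url)
```

Keeping URL construction separate from sending makes it easy to review the plan before issuing anything against a live cluster.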

Thanks Jackie! I will revise the doc when I have bandwidth to do the test.

wirybeaver, Mar 20 '24 04:03

@Jackie-Jiang what's the best practice for upsert tables? Is it required to restart servers for both full upsert and partial upsert tables?

For partial upsert tables, I think it's required, since we need to re-initialize the upsert data manager so that the newly added column appears in the upsertHandler.

deemoliu, Mar 29 '24 21:03

For upsert tables, we currently have to restart the server to pick up table config changes, because it is associated with the table data manager instead of the segment data manager.

Jackie-Jiang, Apr 04 '24 23:04