hudi icon indicating copy to clipboard operation
hudi copied to clipboard

[HUDI-9505] Hudi 1.1 blocker code change of index look up

Open Davis-Zhang-Onehouse opened this issue 5 months ago • 16 comments

Change Logs

  1. Make sure the index lookup API is wrapped around HoodieData or HoodiePairData, which is in preparation for scalable batch index lookup where we will switch to RDD based implementation. We only target RLI / secondary index (SI) related look up as of now.
  2. Change the hash partitioning strategy of SI file groups. Previously the storage format is <secondary col value>_<record col value>: null records of SI were hash partitioned by <secondary col value>_<record col value>

Now we change it to hash partitioned by <secondary col value>, the storage format is unchanged.

The motivation is that the index query pattern typically copes with cases where only "secondary col value" are available in the lookup query

select ...
from t1 inner join t2
where t1.secKey = t2.dataColumn

by changing the partitioning strategy, we can do SI file group pruning to ensure we only involve file groups of interest instead of scanning all SI records. The downsides of skewed record distributions of SI records across file groups are well understood.

  1. Because we change the partition strategy of SI, the change cannot go out as it is. The functional requirement is "when the code change is actively functioning, reader and writer path will all function as if SI file groups are partitioned based on <secondary col value>", and we DO NOT want to support old & new modes at the same time. The new strategy does not work with the existing SI data layout if a table already comes with SI and is hash partitioned by <secondary col value>_<record col value>.Without extra protection, SI failed to lookup the records from the file groups it needs due to the file group pruning enforced by the new read pattern, causing data correctness issue. As discussed with hudi PMCs, we agree table version upgrade is indispensable.

https://issues.apache.org/jira/browse/HUDI-9516

Impact

Faster SI lookup with acceptable downsides as explained above.

Risk level (write none, low medium or high below)

None

Documentation Update

Need to work on table version upgrade guide. Tracked by https://issues.apache.org/jira/browse/HUDI-9505

Contributor's checklist

  • [ ] Read through contributor's guide
  • [ ] Change Logs and Impact were stated clearly
  • [ ] Adequate tests were added if applicable
  • [ ] CI passed

Davis-Zhang-Onehouse avatar Jun 09 '25 21:06 Davis-Zhang-Onehouse

I don't think it deserves a new table version based on the fact there is no file format or layout change, maybe just a new metadata config should be enough.

Hi Danny, thanks for chiming in here. I added more details in the PR descriptions. If your concerns persists, let's have a discussion with Ethan and vc regarding the decision making. Thank you!

Davis-Zhang-Onehouse avatar Jun 10 '25 16:06 Davis-Zhang-Onehouse

I don't think it deserves a new table version based on the fact there is no file format or layout change, maybe just a new metadata config should be enough.

There is layout change in the secondary index partition in MDT where the file group mapping from the record key to the file group index is changed. Unless the table version is changed and the mapping or sharding (we can decide a name to use) algorithm is stored in the table config, there is no way to differentiate how secondary index is sharded.

yihua avatar Jun 13 '25 20:06 yihua

In general, I prefer the mapping or sharding (we can decide a name to use) algorithm to be stored in the table config along with the table version upgrade so it's easier to change the mapping if needed and the reader can use that to determine how to look up the index. There are at least two occurrences, this secondary index and partitioned record index, in which having the new table config would help migration and avoid table version upgrade in the future.

yihua avatar Jun 13 '25 20:06 yihua

we DO NOT want to support old & new modes at the same time.

Mulling if we should change this constraint instead. This may be a good point to introduce a index layout version and have that inside the index definition metadata. Similar to log files. This will help readers/writers evolve index storage. for e.g compaction could seamlessly upgrade index partitions.

vinothchandar avatar Jun 16 '25 15:06 vinothchandar

I prefer the mapping or sharding (we can decide a name to use) algorithm to be stored in the table config

Prefer index definition. vs adding a new table config.

vinothchandar avatar Jun 16 '25 15:06 vinothchandar

I prefer the mapping or sharding (we can decide a name to use) algorithm to be stored in the table config

Prefer index definition. vs adding a new table config.

In general I prefer adding the info in the .hoodie folder as part of the table so reader can decide the sharding based on this. Index definition or metadata looks better for having this sharding information stored. Let's do that.

yihua avatar Jun 16 '25 17:06 yihua

@yihua asked for PR owner @Davis-Zhang-Onehouse to work on the following AI:

Index definition looks better. sth to store as part of the table metadata, instead of implicitly inferring it. We should store something in the .index.def.

Proposed change

Introducing a new attribute "partitionStrategy". Possible values:

  • HASH_ON_SECONDARY_KEY: exclusive to SI. It means we can determine SI file group by computing the hash value only using the secondary data column value.
  • HASH_ON_SECONDARY_KEY_RECORD_KEY: exclusive to SI. It means the old behavior.
  • For other indexes we will also come up with proper enum values for their existing strategies.
image

Also from the schema evolution perspective about this json. If we didn't find such field, depends on the index type, we infer the default strategy. For SI, in case we could not find the value, it means using HASH_ON_SECONDARY_KEY_RECORD_KEY.

Implementation details

Read path

Whenever we do lookup using indexes, we will read the index def file and add partitionStrategy value to the context. If it is HASH_ON_SECONDARY_KEY_RECORD_KEY, it will route to prefix lookup code. If it is HASH_ON_SECONDARY_KEY, it is route to the new code

Also for all the other indexes, we need to do the same as they share the same code path. It is an index look up interface change so it impacts code shared by all indexes.

Write path

If it is creating a new interface, we always write partitionStrategy attribute in the strategy. For SI it will be of value HASH_ON_SECONDARY_KEY_RECORD_KEY. For others, we will come up name mapping to their existing behavior.

If it is updating existing indexes due to change of data, follow the "Read path" logic.

Misc

revert the table version related code change.

@yihua please comment on the proposed plan when you get a chance. thanks

Davis-Zhang-Onehouse avatar Jun 16 '25 18:06 Davis-Zhang-Onehouse

@yihua asked for PR owner @Davis-Zhang-Onehouse to work on the following AI:

Index definition looks better. sth to store as part of the table metadata, instead of implicitly inferring it. We should store something in the .index.def.

Proposed change

Introducing a new attribute "partitionStrategy". Possible values:

  • HASH_ON_SECONDARY_KEY: exclusive to SI. It means we can determine SI file group by computing the hash value only using the secondary data column value.
  • HASH_ON_SECONDARY_KEY_RECORD_KEY: exclusive to SI. It means the old behavior.
  • For other indexes we will also come up with proper enum values for their existing strategies.
image Also from the schema evolution perspective about this json. If we didn't find such field, depends on the index type, we infer the default strategy. For SI, in case we could not find the value, it means using `HASH_ON_SECONDARY_KEY_RECORD_KEY`.

Implementation details

Read path

Whenever we do lookup using indexes, we will read the index def file and add partitionStrategy value to the context. If it is HASH_ON_SECONDARY_KEY_RECORD_KEY, it will route to prefix lookup code. If it is HASH_ON_SECONDARY_KEY, it is route to the new code

Also for all the other indexes, we need to do the same as they share the same code path. It is an index look up interface change so it impacts code shared by all indexes.

Write path

If it is creating a new interface, we always write partitionStrategy attribute in the strategy. For SI it will be of value HASH_ON_SECONDARY_KEY_RECORD_KEY. For others, we will come up name mapping to their existing behavior.

If it is updating existing indexes due to change of data, follow the "Read path" logic.

Misc

revert the table version related code change.

@yihua please comment on the proposed plan when you get a chance. thanks

Looks good to me overall.

yihua avatar Jun 16 '25 19:06 yihua

@yihua @Davis-Zhang-Onehouse

I prefer not to introduce a new concept like partitionStrategy, that may or may not be changing/defined per index.

Instead I propose

  • We introduce a layoutVersion field into the index definition. If the field is not present, then value version 1 is assumed.
  • We handled the differences between SI layouts 1 and 2, inside the reader/write code. Inside the code, the layout version is handled specifically for each index type.
  • We document different index storage layouts in the 1.0-tech-specs (needs a docs PR to asf-site branch)
  • We still need to bump up the table version (to handle a downgrade where SI layout 2, can't be read by releases < 1.1)

vinothchandar avatar Jun 16 '25 21:06 vinothchandar

@yihua @Davis-Zhang-Onehouse

I prefer not to introduce a new concept like partitionStrategy, that may or may not be changing/defined per index.

Instead I propose

  • We introduce a layoutVersion field into the index definition. If the field is not present, then value version 1 is assumed.
  • We handled the differences between SI layouts 1 and 2, inside the reader/write code. Inside the code, the layout version is handled specifically for each index type.
  • We document different index storage layouts in the 1.0-tech-specs (needs a docs PR to asf-site branch)
  • We still need to bump up the table version (to handle a downgrade where SI layout 2, can't be read by releases < 1.1)

Actually this proposed index layout version is more general. I prefer this too.

yihua avatar Jun 16 '25 21:06 yihua

@yihua @vinothchandar new items to factor in: backwards and forward compatibility. I spotted major issues and the PR is blocked on your feedback

Compatibility

Forward compatibility

If SI using version 2 (hash partition on data column value only) and hudi is of old binary, what happens is hudi does not has the concept of index version and will treat the new SI version as if it is the old one. As a result,

Read path

it should be fine since it will use prefix lookup which naturally compatible with the new partition strategy.

Write path

Write path is messed up as the old hudi binary will write to new index version with old partition strategy. What make things worse is the hudi index version is not updated as the old binary do not have such logic. So we end up with a corrupted version 2 hoodie index as the old hudi binary do not conform to the version 2 protocol of updating the index.

Backward compatibility

this should be fine as the new hudi binary will properly recognize the version (or the absence of the version) and adapt properly. Details are covered in the previous thread.

Fundamental limitation of the index version design

Old hudi binary only recognize and respect table version. Introducing index version means user must use a version that recognize and honor this.

In industry the standard procedure is

  • introducing a "compatibility patch" which recognize the version and proper back off it is some future version (old hudi binary will choose not to use the index even there is one)
  • User must be aware of all readers/writers that happens to a hudi table. If the hudi table is of SI version 2, user must make sure all hudi versions are at least >= the compatibility patch.
  • This place a burden on the user side, and failed to do so means SI is silently corrupted and causing correctness issue which is not acceptable. We need a place to guide user and this is way cumbersome than a table version upgrade.
  • If all readers writers are managed by some service provider, this might not be a issue - just introduce the compatibiilty patch and we are all good.

Davis-Zhang-Onehouse avatar Jun 17 '25 17:06 Davis-Zhang-Onehouse

An alternative approach - create SI V2 as a new index category

Instead of creating SI with multiple versions, we can do new version of SI as a completely different index type - we write to different metadata partition path, etc. There is no compatibility issue anymore.

Also with the approch, the old binary does not know the new SI version, so it should not do anything with it (not sure if we follow this compatibility best practice).

The new binary can do the following:

  • It only create SI v2
  • It is capable of reading and writing both new and old SI.

This avoids version management of any form. But the old binary must tolerate unrecognized MDT partitions which I'm not sure

Davis-Zhang-Onehouse avatar Jun 17 '25 17:06 Davis-Zhang-Onehouse

@Davis-Zhang-Onehouse at a high level, both the reader and writer paths should follow the new index layout version added in the index definition (i.e., the index layout version is 1 if not set, and the index layout version is 2, which should be explicitly set for the new secondary index layout in this PR). The table version needs to be upgraded and the secondary index layout should be changed as part of the table upgrade and downgrade process.

Here are more details on how it should work: Index definition and layout:

  • Table version 8: no index layout version; secondary index layout based on hard-coded logic
  • Table version 9: new index layout version introduced; secondary index layout version 2 (different from table version 8)

Upgrade and downgrade:

  • Upgrade from 8 -> 9: add the index layout version to index definitions; remove secondary index if exists
  • Downgrade from 9 -> 8: remove the index layout version; remove secondary index if exists

MDT reader and write compatibility:

  • The writer and reader take the index definition and get the index layout version. If not present (meaning table version <= 8), infer the table layout version as 1. Use the table layout version to determine how the record keys are mapped to file groups (i.e., secondary index has different mapping between layout 1 and 2).

yihua avatar Jun 17 '25 17:06 yihua

An alternative approach - create SI V2 as a new index category

Instead of creating SI with multiple versions, we can do new version of SI as a completely different index type - we write to different metadata partition path, etc. There is no compatibility issue anymore.

Also with the approch, the old binary does not know the new SI version, so it should not do anything with it (not sure if we follow this compatibility best practice).

The new binary can do the following:

  • It only create SI v2
  • It is capable of reading and writing both new and old SI.

This avoids version management of any form. But the old binary must tolerate unrecognized MDT partitions which I'm not sure

A new metadata partition path should be avoided for only layout difference. That's why the layout version plays a role.

yihua avatar Jun 17 '25 17:06 yihua

ok, so the solution still requires table version upgrade, sorry I should have read vc's reply.

Davis-Zhang-Onehouse avatar Jun 17 '25 17:06 Davis-Zhang-Onehouse

CI failures are not relevant to the code changes

Davis-Zhang-Onehouse avatar Jun 20 '25 18:06 Davis-Zhang-Onehouse

@Davis-Zhang-Onehouse When you get a chance, please bring the PR description inline with the approach now implemented

vinothchandar avatar Jun 24 '25 12:06 vinothchandar

@Davis-Zhang-Onehouse When you get a chance, please bring the PR description inline with the approach now implemented

done

Davis-Zhang-Onehouse avatar Jun 24 '25 17:06 Davis-Zhang-Onehouse

CI report:

  • 82effe42fa83a8fd031cb5c1d33e9ad37e0ae299 UNKNOWN
  • 7885a6d2ee08e8bbc6f4f4ce246924d953b9230b UNKNOWN
  • c907925d16ce28d56e5bb31f062c0b7c12cdc4bb UNKNOWN
  • 8ba5c4d488ba3cd01f97217baf781c8113742b99 UNKNOWN
  • 510310881ee31f57f624947c6b371fa8e8a801a1 UNKNOWN
  • e3e7001f819d86fc56b5c85ed1e5fcf240b4b0d6 UNKNOWN
  • feff82f4f2d442bc8a772c87a3d288bc8e18a804 UNKNOWN
  • afa2a898493bbce826cc04a353d9e7bee19a5b61 UNKNOWN
  • faef689d7d4204a3348283a1263a2e506ec8cb6c UNKNOWN
  • 9d86873722d159c44318a4dcd94fe67a23105243 Azure: FAILURE
Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

hudi-bot avatar Jun 26 '25 01:06 hudi-bot

https://github.com/apache/hudi/pull/13489 PR comments follow up is in https://github.com/apache/hudi/pull/13489.

Davis-Zhang-Onehouse avatar Jun 30 '25 16:06 Davis-Zhang-Onehouse

@danny0405 we need a versioning schema for index storage as well, just like log blocks.. So this is useful foundation work regardless

vinothchandar avatar Jun 30 '25 16:06 vinothchandar