[SUPPORT] Can't query Redshift rows even after downgrade from 0.10

Open nochimow opened this issue 3 years ago • 22 comments

Describe the problem you faced

After upgrading to Hudi 0.10, I hit https://github.com/apache/hudi/issues/4283 in my environment: my AWS Glue tables worked fine in AWS Athena but returned 0 rows in Redshift Spectrum. I then downgraded the tables to version 2 using Hudi-CLI and went back to Hudi 0.9. I queried the tables again immediately after the downgrade, but the behavior was the same: no errors and no rows returned. However, after doing a new load on a table, some rows appeared in the Redshift Spectrum query, but not all of them. Basically, I'm seeing far fewer rows in Redshift Spectrum than in Athena (which returns the complete set of rows).

The strange thing is: the table where I'm facing this error has lots of upserts across lots of partitions (> 1000 partitions) per batch. The updated partitions are usually days within the last 2 months, but there are some older days as well. The most recent partitions have the correct record counts; the big differences occur in the old ones.

Does Hudi only "fix" the Redshift Spectrum incompatibility for a partition when there is a new load on that specific partition? Is there any quick workaround to fix it?

To Reproduce

Steps to reproduce the behavior:

  1. Upgrade AWS Glue table to Hudi 0.10
  2. Load the table using Hudi 0.10
  3. Downgrade AWS Glue table to Hudi 0.9 with Hudi-CLI
  4. Load the table using Hudi 0.9
  5. The row counts in Athena and Redshift Spectrum no longer match.

Expected behavior

All the rows should be visible in Redshift Spectrum after the downgrade.

Environment Description

  • Hudi version : 0.10/0.9
  • Spark version : 3.1.1
  • Hive version : 2.3.7-amzn-4
  • Hadoop version : 3.2.1
  • Storage (HDFS/S3/GCS..) : S3
  • Running on Docker? (yes/no) : no

nochimow avatar Jan 17 '22 17:01 nochimow

@nochimow this is likely caused by the same issue as https://issues.apache.org/jira/browse/HUDI-3056, where the timeline's time precision is not handled properly. When you write data with 0.9.0, the old time precision is used, so only partial data shows up in Redshift. We can't troubleshoot AWS services ourselves since we're an open-source project, but I'm adding this issue to the ticket to push up the priority.
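To illustrate the kind of precision mismatch described above: Hudi 0.9 timeline instant times are second-precision 14-digit strings, while 0.10 moved the timeline to millisecond precision (17 digits). A reader that hard-codes the old 14-character format rejects the new instants. This is a minimal, hypothetical sketch (the instant values are made up, and this is not how Spectrum actually parses the timeline):

```python
from datetime import datetime

OLD_FMT = "%Y%m%d%H%M%S"    # yyyyMMddHHmmss  (pre-0.10, 14 chars)
NEW_FMT = "%Y%m%d%H%M%S%f"  # millis parsed as padded microseconds

def parse_instant(instant: str) -> datetime:
    """Parse a timeline instant, accepting both precisions."""
    if len(instant) == 14:
        return datetime.strptime(instant, OLD_FMT)
    # pad the 3-digit millisecond suffix to 6 digits for strptime
    return datetime.strptime(instant[:14] + instant[14:].ljust(6, "0"), NEW_FMT)

old_instant = "20220117170000"     # written by 0.9
new_instant = "20220117170000123"  # written by 0.10

# A reader that assumes the old 14-character format fails on new instants:
try:
    datetime.strptime(new_instant, OLD_FMT)
except ValueError:
    print("strict 14-char parser rejects 0.10 instants")

# A precision-aware parser handles both:
print(parse_instant(old_instant) == parse_instant(new_instant).replace(microsecond=0))
```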

xushiyan avatar Jan 17 '22 17:01 xushiyan

@xushiyan I opened a ticket with AWS Support about this as well, to try to raise the priority on their side.

Can you confirm what the Hudi behaviour is in this case? If I write new data into a partition, will all rows in that partition become visible to Redshift again, or just the updated/inserted rows?

nochimow avatar Jan 17 '22 18:01 nochimow

@nochimow I can confirm that, due to the time precision change in 0.10.0, some fix is needed for Redshift. I can't confirm the exact behavior, since it is influenced by the bug reported in that JIRA, and it occurs in Redshift, which I can't verify against. Hope you understand. We just need to prioritize the fix and follow up with AWS support. Please close this if you don't have further questions.

xushiyan avatar Jan 17 '22 19:01 xushiyan

You might have to restore to an older commit. Can you give it a try? Add a savepoint to a commit that was created with 0.9.0, and then trigger a restore to that savepointed commit. It should restore your entire table to the older snapshot. (It's a destructive operation, though, which is something to keep in mind.) Essentially, restore deletes all data files and timeline files from now back to the savepoint you are restoring to.
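As a rough sketch, the steps above could look something like the following hudi-cli session. The table path and commit time are placeholders, and exact command flags can differ between Hudi releases, so please check the CLI's `help` output for your version:

```sh
# Illustrative hudi-cli session (paths and instant times are placeholders)
connect --path s3://my-bucket/path/to/table
commits show                      # pick a commit that was written with 0.9.0
savepoint create --commit 20220110120000 --sparkMaster local[2]
savepoint rollback --savepoint 20220110120000 --sparkMaster local[2]
```

The rollback is destructive, as noted above: everything written after the savepointed commit is removed.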

nsivabalan avatar Jan 17 '22 23:01 nsivabalan

@nochimow : let us know how it went, or if you need the hudi-cli commands to do the restore.

nsivabalan avatar Jan 20 '22 00:01 nsivabalan

Hello, I found a workaround that fixed this issue without restoring my table. Apparently, after downgrading the Hudi table to 0.9, when I do a Hudi upsert into a partition, the rows of that partition become visible to Redshift Spectrum again, so I forced a kind of fake update across all the partitions in my table. I'm still working through correcting all the rows in all my partitions, but 95% of my rows are already back to normal; the remaining 5% may be something on my side that I'm still checking.
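For anyone attempting the same "fake update" workaround, a sketch of a no-op touch upsert in PySpark might look like the following. This assumes a Spark session with the Hudi bundle on the classpath; the table path, key, precombine, and partition field names are all placeholders, not taken from the reporter's setup:

```python
# Read back the partitions that are missing rows in Spectrum, then upsert
# them unchanged so Hudi rewrites those partitions' metadata/files.
affected = (spark.read.format("hudi")
    .load("s3://bucket/path/to/table")
    .where("partition_date < '2021-11-17'"))  # placeholder predicate

(affected.write.format("hudi")
    .option("hoodie.table.name", "my_table")
    .option("hoodie.datasource.write.operation", "upsert")
    .option("hoodie.datasource.write.recordkey.field", "id")
    .option("hoodie.datasource.write.precombine.field", "ts")
    .option("hoodie.datasource.write.partitionpath.field", "partition_date")
    .mode("append")
    .save("s3://bucket/path/to/table"))
```

Note that this rewrites the touched file groups, so it can be expensive on a table with more than 1000 partitions.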

nochimow avatar Jan 26 '22 20:01 nochimow

So for now the status is that Hudi 0.10.x is incompatible with Redshift Spectrum? We are exploring Hudi for our data lake and ran into this very same problem, which confused me a lot. I'll do the downgrade now.

JorgenG avatar Feb 11 '22 15:02 JorgenG

@nsivabalan Considering that we seem to be at AWS's mercy here, could there be a config flag to use the old precision? Or do new features rely on this change being present?

We now have a chicken-and-egg problem with adopting dbt. We want to use Redshift Spectrum for queries and dbt-spark for transforms (support for which was added in 0.10), but that renders Spectrum unusable.

Best regards, Jørgen

JorgenG avatar Feb 12 '22 06:02 JorgenG

@nsivabalan Wouldn't it be useful to list a kind of "Known Issues and Limitations" section for each release? I don't think it's clear to users that this incompatibility with Redshift Spectrum exists from 0.10 onwards. For many users the Redshift integration is a core feature.

nochimow avatar Feb 14 '22 14:02 nochimow

yes, makes sense. I will see where we can document this.

nsivabalan avatar Feb 22 '22 03:02 nsivabalan

@xushiyan Is it possible to escalate this issue with AWS again? Is there still no response from their side?

nochimow avatar Mar 24 '22 21:03 nochimow

@umehrot2 any updates on HUDI-3056 ?

codope avatar Apr 20 '22 11:04 codope

> yes, makes sense. I will see where we can document this.

Can we add it here https://hudi.apache.org/releases/release-0.10.0#writer-side-improvements ?

codope avatar Apr 20 '22 11:04 codope

@umehrot2 : hey, is there any progress on this end. Can we bump this again internally if possible.

nsivabalan avatar May 12 '22 14:05 nsivabalan

Hello Guys, any updates on this? :)

rubenssoto avatar Jun 01 '22 16:06 rubenssoto

Hello, any updates ?

ctlgdanielli avatar Jun 03 '22 21:06 ctlgdanielli

Hey guys, is it possible to have a config option to keep compatibility, for example by using the old time precision?

rubenssoto avatar Jun 14 '22 20:06 rubenssoto

@codope @nsivabalan do you have any updates on this subject? We're stuck on Hudi 0.9 because of this :(

rubenssoto avatar Jun 24 '22 13:06 rubenssoto

Same here. Still waiting for an update to upgrade our Hudi from 0.9

nochimow avatar Jul 08 '22 12:07 nochimow

@nochimow Thought you had opened an AWS Support ticket. If so, any update from AWS Support? Curious to know whether the Amazon Redshift Spectrum team is planning to do something about this issue.

pomaster avatar Aug 10 '22 21:08 pomaster

@pomaster Yes, I did. Their reply was that the product team is aware of this issue, but there is no ETA for a fix yet.

nochimow avatar Aug 10 '22 21:08 nochimow

@nochimow Thanks for the info! When was the last update from AWS Support on your ticket? We are planning to reach out to AWS Redshift engineering team for help. Don't want to waste their time if there has already been a recent status update (e.g. < 1 month). Can you share your AWS Support ticket #? You can e-mail me at: [email protected].

pomaster avatar Aug 11 '22 15:08 pomaster

@nochimow @rubenssoto and others: Looks like Hudi 0.10.0 is supported, per the docs: https://docs.aws.amazon.com/redshift/latest/dg/c-spectrum-external-tables.html#c-spectrum-column-mapping-hudi Can you folks give it a try and let us know if things work well with Hudi 0.10.0?

nsivabalan avatar Aug 16 '22 17:08 nsivabalan

Even though AWS says that only 0.10.0 is "supported", I ran compatibility tests with Hudi 0.10, 0.11, and 0.12. All versions worked fine, unlike before (previously, any table on a Hudi version > 0.9 returned 0 rows in Redshift Spectrum). The only detail is that the Redshift cluster must be on patch >= 169. (I got this requirement from AWS support.)

nochimow avatar Aug 23 '22 20:08 nochimow

thanks @nochimow for the update. appreciate it.

nsivabalan avatar Aug 24 '22 04:08 nsivabalan

Great to know, I will test, thank you so much guys!

rubenssoto avatar Aug 24 '22 11:08 rubenssoto