[HUDI-4850] Incremental Ingestion from GCS
Change Logs
JIRA issue: https://issues.apache.org/jira/browse/HUDI-4850
This PR brings in an incremental pull from a Google Cloud Storage bucket to Hudi, similar to the existing feature on the S3 side of things (https://hudi.apache.org/blog/2021/08/23/s3-events-source/).
This design document contains more details about the feature: Incremental Ingestion from Google Cloud Storage - Design and Implementation
Impact
Adds a new feature that integrates with parts of Google Cloud Storage. Runs as a separate process, and only when invoked explicitly by the user.
Risk level: low
Contributor's checklist
- [x] Read through contributor's guide
- [x] Change Logs and Impact were stated clearly
- [x] Adequate tests were added if applicable
- [ ] CI passed
Hi @pramodbiligiri, could you create a Jira ticket and attach to the PR title?
Thanks @jtarun for the comments! Will go over these today.
@jtarun Done with responding to your comments. Have tested the latest commit before pushing it.
@jtarun - Latest commit includes renames of GCS configs to a more generic name. See the new class CloudStorageIngestionConfig - https://github.com/apache/hudi/pull/6665/files#diff-a3d82f125d84c4e13ebecc5d27d51fc397400d3fd2441112de6091ab5e5f7c64
/cc @codope
Responded to recent comments by @codope (GitHub is not letting me respond inline to some of them):
- Moved getMissingCheckpointStrategy to IncrSourceHelper, changed Config.READ_LATEST etc constants to public
- Removed fs.gs.impl and fs.AbstractFileSystem.gs.impl from the code and moved it back to configs
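For readers unfamiliar with these two keys, here is a minimal sketch of how they might be supplied as configuration rather than hard-coded. It assumes the Google Cloud Storage Hadoop connector is on the classpath and that the settings are passed through Spark's `spark.hadoop.*` prefix. The helper name `gcs_fs_confs` is hypothetical, and this PR may wire the values differently.

```shell
# Hypothetical helper: prints spark-submit flags that set the GCS filesystem
# implementations through Spark's spark.hadoop.* passthrough.
# Assumption: the GCS Hadoop connector jar is available on the classpath.
gcs_fs_confs() {
  printf '%s\n' \
    "--conf spark.hadoop.fs.gs.impl=com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem" \
    "--conf spark.hadoop.fs.AbstractFileSystem.gs.impl=com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS"
}

# Show the flags; in practice they would be appended to a spark-submit command.
gcs_fs_confs
```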
@hudi-bot run azure
Noting down an issue I noticed: each of the two tests I've added only works on one of our Spark profiles, and moreover, each of them works on a different Spark profile :|
- If I run TestGcsEventsSource on spark-3.2 instead of spark2 profile, I see the below error:
$ mvn -Dspark2 -Dscala-2.12 -Dcheckstyle.skip -Drat.skip -Dtest=org.apache.hudi.utilities.sources.TestGcsEventsSource -pl hudi-utilities test
[ERROR] org.apache.hudi.utilities.sources.TestGcsEventsSource Time elapsed: 3.433 s <<< ERROR!
java.lang.InstantiationError: org.apache.hadoop.hdfs.protocol.HdfsFileStatus
at org.apache.hadoop.hdfs.server.namenode.FSDirectory.<clinit>(FSDirectory.java:136)
- If I run TestGcsEventsHoodieIncrSource under spark2 instead of spark3.2, I see the below error:
$ mvn -Dspark3.2 -Dscala-2.12 -Dcheckstyle.skip -Drat.skip -Dtest=org.apache.hudi.utilities.sources.TestGcsEventsHoodieIncrSource -pl hudi-utilities test
[ERROR] shouldNotFindNewDataIfCommitTimeOfWriteAndReadAreEqual Time elapsed: 32.22 s <<< ERROR!
org.apache.hudi.exception.HoodieUpsertException: Failed to upsert for commit time 1
at org.apache.hudi.table.action.commit.BaseWriteHelper.write(BaseWriteHelper.java:64)
at org.apache.hudi.table.action.commit.SparkUpsertCommitActionExecutor.execute(SparkUpsertCommitActionExecutor.java:45)
at org.apache.hudi.table.HoodieSparkCopyOnWriteTable.upsert(HoodieSparkCopyOnWriteTable.java:113)
at org.apache.hudi.table.HoodieSparkCopyOnWriteTable.upsert(HoodieSparkCopyOnWriteTable.java:97)
at org.apache.hudi.client.SparkRDDWriteClient.upsert(SparkRDDWriteClient.java:153)
at org.apache.hudi.utilities.sources.TestGcsEventsHoodieIncrSource.writeGcsMetadataRecords(TestGcsEventsHoodieIncrSource.java:227)
...snipped...
Caused by: java.lang.ClassNotFoundException: org.apache.avro.AvroMissingFieldException
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
CI report:
- 4864b65515d6e9ea5b6ba9d83241cfc310cbf3ee UNKNOWN
- 5ed92a20666863315f41578a905dd6f2681a1363 Azure: SUCCESS
Bot commands
@hudi-bot supports the following commands:
- @hudi-bot run azure: re-run the last Azure build
I ran the tests locally and they work for me. Probably you're not building it right. Note that for Spark 3.2, you need to use the following Maven command:
mvn clean install -DskipTests -Dspark3.2 -Dscala-2.12
and for Spark 2.4, use mvn clean install -DskipTests (spark2 and scala-2.11 are the defaults).
Ack. Did more or less the same locally, just now: I'm not able to repro the error. Could have been because I hadn't cleaned up the build and different versions were conflicting. I just ran the below two commands and it worked:
$ mvn -DskipTests -Dspark2 -Dscala-2.12 -Dcheckstyle.skip -Drat.skip clean install
Followed by:
$ mvn -Dspark2 -Dscala-2.12 -Dcheckstyle.skip -Drat.skip -Dtest=org.apache.hudi.utilities.sources.TestGcsEventsSource -pl hudi-utilities test
[INFO] Tests run: 5, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 41.804 s - in org.apache.hudi.utilities.sources.TestGcsEventsSource
[INFO]
[INFO] Results:
[INFO]
[INFO] Tests run: 5, Failures: 0, Errors: 0, Skipped: 0
[INFO]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 01:07 min
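The clean-build-then-single-test sequence discussed in this thread can be sketched as a small shell helper. The function name `hudi_test_cmds` and its argument order are hypothetical, introduced only for illustration. It prints the two Maven commands for a given profile instead of running them, so they can be reviewed first.

```shell
# Hypothetical helper: prints the clean-install command followed by the
# single-test command for one Spark/Scala profile combination.
# Arguments: <spark profile> <scala version> <fully qualified test class>
hudi_test_cmds() {
  htc_profile="$1"
  htc_scala="$2"
  htc_test="$3"
  htc_common="-D${htc_profile} -Dscala-${htc_scala} -Dcheckstyle.skip -Drat.skip"
  # Step 1: rebuild from scratch so artifacts from another profile cannot conflict.
  echo "mvn -DskipTests ${htc_common} clean install"
  # Step 2: run only the requested test in the hudi-utilities module.
  echo "mvn ${htc_common} -Dtest=${htc_test} -pl hudi-utilities test"
}

hudi_test_cmds spark2 2.12 org.apache.hudi.utilities.sources.TestGcsEventsSource
```

Using the same profile flags for both steps is the point of the sketch: the build and the test run then always refer to the same Spark and Scala versions.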