
[SUPPORT] How to skip some partitions in a table when readStreaming in Spark at the init stage

lei-su-awx opened this issue 1 year ago • 6 comments

Tips before filing an issue

  • Have you gone through our FAQs?

  • Join the mailing list to engage in conversations and get faster support at [email protected].

  • If you have triaged this as a bug, then file an issue directly.

Describe the problem you faced

I have a table partitioned by operation type and ingestion date (like insert/2023-12-11/, update/2023-12-11/, delete/2023-12-11/). When I read this table with Spark readStream, I only want to read data under the update partition. I found the config 'hoodie.datasource.read.incr.path.glob' and set its value to update/202*. But the Spark job was very slow to initialize, and I found it was stuck reading a parquet file (screenshot omitted).

But that parquet file is not under the update partition, it is under the insert partition, which is very confusing. So I want to ask: is there a config that reads only the target partition, skips the others, and also does not read other partitions' data files to get the schema?

Expected behavior

I want to know whether there is a config that reads only the target partition, skips the others, and does not read other partitions' data files to get the schema.

Environment Description

  • Hudi version : 0.14.0

  • Spark version : 3.4.1

  • Hive version :

  • Hadoop version :

  • Storage (HDFS/S3/GCS..) : GCS

  • Running on Docker? (yes/no) : no

Additional context

I use the below configurations to write to the table:

```python
hudi_write_options = {
    'hoodie.table.name': hudi_table_name,
    'hoodie.datasource.write.partitionpath.field': 'operation_type, ingestion_dt',
    'hoodie.datasource.write.operation': 'insert',
    'hoodie.datasource.write.table.type': 'MERGE_ON_READ',
    'hoodie.parquet.compression.codec': 'zstd',
    'hoodie.datasource.write.payload.class': 'org.apache.hudi.common.model.DefaultHoodieRecordPayload',
    'hoodie.datasource.write.reconcile.schema': True,
    'hoodie.metadata.enable': True,
}
```
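For context, a minimal sketch of how such options are typically applied on write (`df` and `base_path` are assumed names, not from the original report):

```python
# Sketch only: df (the DataFrame to write) and base_path (the table's
# GCS base path) are assumed to be defined elsewhere.
(
    df.write.format("hudi")
    .options(**hudi_write_options)
    .mode("append")
    .save(base_path)
)
```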

And I use these configurations to read from the table:

```python
read_streaming_hudi_options = {
    'maxFilesPerTrigger': 5,
    'hoodie.datasource.read.incr.path.glob': 'update/202*',
    'hoodie.read.timeline.holes.resolution.policy': 'BLOCK',
    'hoodie.datasource.read.file.index.listing.partition-path-prefix.analysis.enabled': False,
    'hoodie.file.listing.parallelism': 1000,
    'hoodie.metadata.enable': True,
    'hoodie.datasource.read.schema.use.end.instanttime': True,
    'hoodie.datasource.streaming.startOffset': '20231211000000000',
}
```
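A minimal sketch of the corresponding streaming read (again, `base_path` is an assumed name for the table's base path):

```python
# Sketch only: base_path is assumed to be defined elsewhere.
stream_df = (
    spark.readStream.format("hudi")
    .options(**read_streaming_hudi_options)
    .load(base_path)
)
```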


lei-su-awx avatar Dec 12 '23 06:12 lei-su-awx

Did you try adding a filter condition on the partition fields?

danny0405 avatar Dec 12 '23 07:12 danny0405

Hi @danny0405, do you mean like this? (screenshot omitted) I tried this, but it also reads other partitions' data files to resolve the schema. I think this kind of filter takes effect after the source lists all partitions' files, but I want a config that tells the source to read only the partitions named in my configs, so I do not need to use a filter.
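The screenshot is not preserved; a hedged reconstruction of what the filtered attempt might have looked like (the column name and `base_path` are assumptions):

```python
# Hypothetical reconstruction; the original screenshot is lost.
stream_df = (
    spark.readStream.format("hudi")
    .options(**read_streaming_hudi_options)
    .load(base_path)
    .filter("operation_type = 'update'")  # filter on the partition column
)
```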

lei-su-awx avatar Dec 12 '23 07:12 lei-su-awx

> but I want a config that tells the source to read only the partitions named in my configs, so I do not need to use a filter

That does not follow the common intuition.

danny0405 avatar Dec 12 '23 10:12 danny0405

@lei-su-awx If the table is partitioned, then it should only read the files under that partition. Are you seeing behaviour otherwise, i.e. is it reading all files?

ad1happy2go avatar Dec 12 '23 16:12 ad1happy2go

@ad1happy2go I tried to read only the files under that partition using spark.readStream, but an error was thrown: no .hoodie file exists in the partition path. I found that the .hoodie folder only exists under the table path.

lei-su-awx avatar Dec 13 '23 02:12 lei-su-awx

@lei-su-awx You don't need to read from that partition directory. You should give the parent (table base) directory only. With your .filter("operation_type = 'update'"), partition pruning will take place and it will only process that partition's data.
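In other words, something along these lines (a sketch; the GCS paths are hypothetical placeholders):

```python
# Fails: a partition sub-directory has no .hoodie metadata folder.
# spark.readStream.format("hudi").load("gs://bucket/my_table/update")

# Works: load the table base path (which contains .hoodie) and let the
# filter drive partition pruning.
stream_df = (
    spark.readStream.format("hudi")
    .load("gs://bucket/my_table")
    .filter("operation_type = 'update'")
)
```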

ad1happy2go avatar Dec 19 '23 08:12 ad1happy2go

@lei-su-awx Did you see the behaviour I mentioned? Any more updates here please?

ad1happy2go avatar Dec 29 '23 09:12 ad1happy2go

@lei-su-awx Any updates here? Do you still have this issue?

ad1happy2go avatar Jan 10 '24 09:01 ad1happy2go

@ad1happy2go Sorry for the late reply, I did not test this any more because I ran into new issues. But I think .filter("operation_type = 'update'") can do what I want, so never mind this issue.

lei-su-awx avatar Jan 10 '24 11:01 lei-su-awx

@lei-su-awx Closing this issue then, thanks. Please reopen in case you see the issue again.

ad1happy2go avatar Feb 08 '24 06:02 ad1happy2go