
Destination GCS: add Hive partitioning in paths

Open zbrak opened this issue 3 years ago • 5 comments

Tell us about the problem you're trying to solve

**Edit: the GCS connector's 'Bucket Path' setup has similar (but incomplete) functionality to the S3 connector's, but it is undocumented. Critically, the S3 connector includes an option to terminate the path with a `/`, overriding the existing ${namespace} and ${stream} values in the path. This screenshot shows the error thrown when attempting the same thing in the GCS connector (0.2.10):**

[Screenshot: Screen Shot 2022-08-09 at 12 54 24 PM]

The GCS connector has served us well so far. However, when coupling it with federated queries in BigQuery, query performance suffers because the underlying external table cannot be partitioned.

I'm hoping to show that Airbyte can support robust partitioning with some simple path manipulation on insert.

Describe the solution you’d like

**Edit: we've confirmed in the Airbyte Slack that much of this custom path functionality exists in the S3 connector, and is available but undocumented in the GCS connector.**

I'd like to see an option in the GCS destination configuration, "Enable Hive partitioning", with a field-name value. Hive partitioning is a default option for GCS external tables in BigQuery, and could really enhance a lot of lakehouse-type implementations. Link to the hive partition spec.

The current path layout is:

testing_bucket/data_output_path/public/users/2021_01_01_1609541171643_0.jsonl
↑              ↑                ↑      ↑     ↑          ↑             ↑ ↑
|              |                |      |     |          |             | format extension
|              |                |      |     |          |             partition id
|              |                |      |     |          upload time in millis
|              |                |      |     upload date in YYYY-MM-DD
|              |                |      stream name
|              |                source namespace (if it exists)
|              bucket path
bucket name

With a slight adjustment, we would enable BigQuery to infer a partition (where ${field_name} is the partition field, with inferred type DATE):

testing_bucket/data_output_path/public/users/${field_name}=2021-01-01/1609541171643_0.jsonl
                                             ↑             ↑
                                             |             upload date as partition
                                             Chosen field name as partition field
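To make the proposed change concrete, here is a minimal Python sketch of the path construction described above. The function name and parameters are illustrative only; they do not reflect the connector's actual getOutputFilename implementation:

```python
from datetime import datetime, timezone

def hive_partitioned_path(bucket_path: str, namespace: str, stream: str,
                          field_name: str, upload_time: datetime,
                          part_id: int, ext: str = "jsonl") -> str:
    """Build an object path with a Hive-style `field=value` partition segment.

    Illustrative sketch only: the layout mirrors the proposal above, not the
    GCS connector's actual getOutputFilename implementation.
    """
    date_segment = upload_time.strftime("%Y-%m-%d")      # inferred as DATE by BigQuery
    millis = int(round(upload_time.timestamp() * 1000))  # upload time in millis
    return (f"{bucket_path}/{namespace}/{stream}/"
            f"{field_name}={date_segment}/{millis}_{part_id}.{ext}")

ts = datetime(2021, 1, 1, 22, 46, 11, 643000, tzinfo=timezone.utc)
print(hive_partitioned_path("data_output_path", "public", "users",
                            "upload_date", ts, 0))
# data_output_path/public/users/upload_date=2021-01-01/1609541171643_0.jsonl
```

Because the partition value lives in a `key=value` path segment rather than in the file name, BigQuery's Hive partition detection can prune files outside the queried date range without listing the whole prefix.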

Describe the alternative you’ve considered or used

To avoid reading my entire bucket on every query, the workaround I've been using is to frequently refresh the external table definition, dynamically generating paths that filter down to the day, e.g. testing_bucket/data_output_path/public/users/2021_01_01
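That workaround boils down to regenerating a date-scoped source URI each day and re-pointing the external table definition at it. A minimal sketch of the URI generation (the helper name is hypothetical, and the bq/API call that applies the new definition is omitted):

```python
from datetime import date

def daily_source_uri(bucket: str, bucket_path: str, namespace: str,
                     stream: str, day: date) -> str:
    """Build the date-scoped GCS wildcard URI that a refreshed external
    table definition would point at. Hypothetical helper for illustration;
    it matches the current connector's YYYY_MM_DD file-name prefix."""
    return f"gs://{bucket}/{bucket_path}/{namespace}/{stream}/{day:%Y_%m_%d}*"

print(daily_source_uri("testing_bucket", "data_output_path",
                       "public", "users", date(2021, 1, 1)))
# gs://testing_bucket/data_output_path/public/users/2021_01_01*
```

This limits each query to one day's files, but only by constantly rewriting the table definition, which is exactly the churn that native Hive partitioning would remove.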

Additional context

External table definition link. Hive partitioning spec link.

Link to getOutputFilename in the GCS connector

Are you willing to submit a PR?

We do have a resource we can put on this; however, since this is such a core connector, I suspect our specific implementation may run into problems when trying to establish consensus.

zbrak avatar Aug 09 '22 15:08 zbrak

Have updated issue with new details presented to me:

**Edit: the GCS connector's 'Bucket Path' setup has similar (but incomplete) functionality to the S3 connector's, but it is undocumented. Critically, the S3 connector includes an option to terminate the path with a `/`, overriding the existing {stream} values in the path. This screenshot shows the error thrown when attempting the same thing in the GCS connector (0.2.10):**

zbrak avatar Aug 09 '22 17:08 zbrak

This would be great for all object storage destinations. Hive partitioning is practically a standard: developers are used to reading from object storage that way, and BigQuery, Redshift, and Apache Spark all support it. We can add hive partitioning as an extra step in our data pipeline, but having it done at write time is more elegant.

dis-sid avatar Oct 17 '22 08:10 dis-sid

Not directly linked to this issue, but probably worth including in the same release, since it also impacts the generated output file path on GCS: https://github.com/airbytehq/airbyte/issues/28528 🙏

kev-datams avatar Jul 20 '23 16:07 kev-datams

At Airbyte, we seek to be clear about the project priorities and roadmap. This issue has not had any activity for 180 days, suggesting that it's not as critical as others. It's possible it has already been fixed. It is being marked as stale and will be closed in 20 days if there is no activity. To keep it open, please comment to let us know why it is important to you and if it is still reproducible on recent versions of Airbyte.

octavia-squidington-iii avatar Apr 13 '24 09:04 octavia-squidington-iii

Hello, it looks like this is still open. We need this feature in order to use hive partitioning to load data incrementally, limiting data retrieval costs.

kev-datams avatar Apr 30 '24 08:04 kev-datams