
projection partitions in s3 to_parquet method


Describe the bug

I am using the s3.to_parquet method to create a table in the Glue catalog and store the data in an S3 bucket. I want to maintain partitions in S3, and for them to be dynamic I need partition projection.

Requirements/Expectations:

1. The partition format should be yyyy/MM/dd.
2. Object prefix --> s3-bucket-name/folder1/folder2/yyyy/MM/dd
3. My table should have a partition column called 'datepath' containing the 'yyyy/MM/dd' value, i.e. the partition folder information.
4. The storage.location.template table parameter should be added to my table, with the value s3-bucket-name/folder1/folder2/${datepath}

Observations:

1. The to_parquet method does not accept the partition column format as an input, so I executed an ALTER command on the table to add that table property with the format yyyy/MM/dd.
2. Object prefix --> s3-bucket-name/folder1/folder2/datepath=yyyy/MM/dd. I did not expect the folder name to contain the partition column name.
3. The table has a partition column "datepath" --> as expected.
4. The storage.location.template table property is not added to the table.

Item #1 above could be an improvement: accept the partition format from the user. Please let me know what else I need to do to meet my requirements.
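As a workaround, the missing pieces (the partition date format and `storage.location.template`) can be patched onto the Glue table after `to_parquet` runs, instead of via an ALTER statement. A minimal sketch, assuming AWS credentials are configured; the hypothetical helper names, the bucket path, and the example start date are illustrative only — the `projection.*` and `storage.location.template` keys are the standard Athena partition projection table properties:

```python
def build_projection_parameters(template_uri):
    """Glue table parameters enabling date-typed partition projection with a
    yyyy/MM/dd format plus a custom storage location template."""
    return {
        'projection.enabled': 'true',
        'projection.datepath.type': 'date',
        'projection.datepath.format': 'yyyy/MM/dd',
        'projection.datepath.range': '2022-09-16,NOW+1DAYS',  # hypothetical start date
        'projection.datepath.interval': '1',
        'projection.datepath.interval.unit': 'DAYS',
        'storage.location.template': template_uri,
    }


def patch_glue_table(database, table, template_uri):
    """Merge the projection parameters into an existing Glue table."""
    import boto3  # assumes credentials/region are configured in the environment

    glue = boto3.client('glue')
    current = glue.get_table(DatabaseName=database, Name=table)['Table']
    # update_table's TableInput only accepts a subset of get_table's output keys
    allowed = ('Name', 'Description', 'Owner', 'Retention', 'StorageDescriptor',
               'PartitionKeys', 'TableType', 'Parameters')
    table_input = {k: v for k, v in current.items() if k in allowed}
    table_input['Parameters'] = {**current.get('Parameters', {}),
                                 **build_projection_parameters(template_uri)}
    glue.update_table(DatabaseName=database, TableInput=table_input)
```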

How to Reproduce

I have used the below code:

```python
import awswrangler as wr
import pandas as pd

wr.s3.to_parquet(
    df=pd.DataFrame(data_dict),
    compression='snappy',
    dataset=True,  # boolean, not the string 'True'
    path=f'{s3_path}/{current_date}',
    partition_cols=['datepath'],
    mode='append',
    projection_enabled=True,
    projection_types={'datepath': 'date'},
    projection_ranges={'datepath': f'{current_date},NOW+1DAYS'},
    projection_intervals={'datepath': '1'},
    schema_evolution=True,  # boolean, not the string 'True'
    database=db_name,
    table=table,
    table_type='EXTERNAL_TABLE',
    dtype=cols_dict,
    sanitize_columns=False,
    parameters={
        'classification': 'parquet',
        'compressionType': 'none',
        'typeOfData': 'file'
    }
)
```
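As a side note on `projection_ranges`: the value is a comma-separated `start,end` pair, where the end may be a relative expression like `NOW+1DAYS` that Athena resolves at query time. A sketch of how the string above is presumably built (the `yyyy-MM-dd` start format is an assumption matching the default projection date format):

```python
from datetime import date

# Fixed start date plus a relative end resolved by Athena at query time
current_date = date.today().strftime('%Y-%m-%d')
projection_range = f'{current_date},NOW+1DAYS'
```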

Expected behavior

No response

Your project

No response

Screenshots

No response

OS

Executing the code in AWS glue job

Python version

3.6

AWS SDK for pandas version

1.1.5

Additional context

No response

alekya1024 avatar Sep 16 '22 12:09 alekya1024

Thanks @alekya1024 for opening this!

  1. Yes, it looks like we assume the format - definitely worth making that configurable
  2. This is the widely accepted Hive partitioning style
  3. Yes, it looks like we're not passing the template

I'll work on those
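To illustrate point 2 with the (hypothetical) paths from the issue: Hive-style partitioning embeds the column name in the prefix, whereas a `storage.location.template` lets Athena substitute the projected value directly:

```python
# Hive-style prefix, as written by to_parquet (column name embedded):
hive_prefix = 's3://s3-bucket-name/folder1/folder2/datepath=2022/09/16/'

# Template-style prefix, resolved by Athena when storage.location.template is set:
template = 's3://s3-bucket-name/folder1/folder2/${datepath}'
resolved = template.replace('${datepath}', '2022/09/16')
```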

kukushking avatar Sep 21 '22 10:09 kukushking