
projection partitions in s3 to_parquet method


Describe the bug

I am using the s3.to_parquet method to create a table in the Glue catalog and store the data in an S3 bucket. I want to maintain partitions in S3, and for them to be dynamic I need partition projection.

Requirements/Expectations:

1. The partition format should be yyyy/MM/dd.
2. Object prefix --> s3-bucket-name/folder1/folder2/yyyy/MM/dd
3. My table should have a partition column called 'datepath' containing the 'yyyy/MM/dd' value, i.e. the partition folder information.
4. The storage.location.template table parameter should be added to my table, with the value s3-bucket-name/folder1/folder2/${datepath}

Observations:

1. The to_parquet method does not accept the partition column format as an input, so I executed an ALTER command on the table to add that table property with the format yyyy/MM/dd.
2. Object prefix --> s3-bucket-name/folder1/folder2/datepath=yyyy/MM/dd. I did not expect the folder name to contain the partition column name.
3. The table has a partition column "datepath" --> as expected.
4. The storage.location.template table property is not added to the table.

Item #1 above could be an improvement: accept the partition format from the user. Please let me know what else I need to do to meet my requirements.
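As a workaround, the missing pieces (the partition date format and `storage.location.template`) can be patched onto the Glue table after `to_parquet` runs, instead of via an ALTER statement. A minimal sketch, assuming AWS credentials are configured; the hypothetical helper names, the bucket path, and the example start date are illustrative only — the `projection.*` and `storage.location.template` keys are the standard Athena partition projection table properties:

```python
def build_projection_parameters(template_uri):
    """Glue table parameters enabling date-typed partition projection with a
    yyyy/MM/dd format plus a custom storage location template."""
    return {
        'projection.enabled': 'true',
        'projection.datepath.type': 'date',
        'projection.datepath.format': 'yyyy/MM/dd',
        'projection.datepath.range': '2022-09-16,NOW+1DAYS',  # hypothetical start date
        'projection.datepath.interval': '1',
        'projection.datepath.interval.unit': 'DAYS',
        'storage.location.template': template_uri,
    }


def patch_glue_table(database, table, template_uri):
    """Merge the projection parameters into an existing Glue table."""
    import boto3  # assumes credentials/region are configured in the environment

    glue = boto3.client('glue')
    current = glue.get_table(DatabaseName=database, Name=table)['Table']
    # update_table's TableInput only accepts a subset of get_table's output keys
    allowed = ('Name', 'Description', 'Owner', 'Retention', 'StorageDescriptor',
               'PartitionKeys', 'TableType', 'Parameters')
    table_input = {k: v for k, v in current.items() if k in allowed}
    table_input['Parameters'] = {**current.get('Parameters', {}),
                                 **build_projection_parameters(template_uri)}
    glue.update_table(DatabaseName=database, TableInput=table_input)
```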

How to Reproduce

I have used the below code:

```python
import awswrangler as wr
import pandas as pd

wr.s3.to_parquet(
    df=pd.DataFrame(data_dict),
    compression='snappy',
    dataset=True,  # boolean, not the string 'True'
    path=f'{s3_path}/{current_date}',
    partition_cols=['datepath'],
    mode='append',
    projection_enabled=True,
    projection_types={'datepath': 'date'},
    projection_ranges={'datepath': f'{current_date},NOW+1DAYS'},
    projection_intervals={'datepath': '1'},
    schema_evolution=True,  # boolean, not the string 'True'
    database=db_name,
    table=table,
    table_type='EXTERNAL_TABLE',
    dtype=cols_dict,
    sanitize_columns=False,
    parameters={
        'classification': 'parquet',
        'compressionType': 'none',
        'typeOfData': 'file'
    }
)
```
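As a side note on `projection_ranges`: the value is a comma-separated `start,end` pair, where the end may be a relative expression like `NOW+1DAYS` that Athena resolves at query time. A sketch of how the string above is presumably built (the `yyyy-MM-dd` start format is an assumption matching the default projection date format):

```python
from datetime import date

# Fixed start date plus a relative end resolved by Athena at query time
current_date = date.today().strftime('%Y-%m-%d')
projection_range = f'{current_date},NOW+1DAYS'
```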

Expected behavior

No response

Your project

No response

Screenshots

No response

OS

Executing the code in AWS glue job

Python version

3.6

AWS SDK for pandas version

1.1.5

Additional context

No response

alekya1024 avatar Sep 16 '22 12:09 alekya1024

Thanks @alekya1024 for opening this!

  1. Yes, it looks like we assume the format - definitely worth making that configurable
  2. This is the widely accepted Hive partitioning style
  3. Yes, it looks like we're not passing the template

I'll work on those
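To illustrate point 2 with the (hypothetical) paths from the issue: Hive-style partitioning embeds the column name in the prefix, whereas a `storage.location.template` lets Athena substitute the projected value directly:

```python
# Hive-style prefix, as written by to_parquet (column name embedded):
hive_prefix = 's3://s3-bucket-name/folder1/folder2/datepath=2022/09/16/'

# Template-style prefix, resolved by Athena when storage.location.template is set:
template = 's3://s3-bucket-name/folder1/folder2/${datepath}'
resolved = template.replace('${datepath}', '2022/09/16')
```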

kukushking avatar Sep 21 '22 10:09 kukushking