s3-sqs-connector

S3-SQS source does not populate partition columns in the dataframe

Open DipeshV opened this issue 5 years ago • 6 comments

Hi, I am using this "s3-sqs" connector with Spark Structured Streaming and Delta Lake to process incoming data in partitioned S3 buckets. The problem I am facing with the "s3-sqs" source is that files are read directly and returned as a dataframe/dataset without the partition columns. Hence, when we merge the source and target dataframes, all the partition columns come out as HIVE_DEFAULT_PARTITION.
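For context, the setup looks roughly like this (a simplified sketch; the queue URL, schema, and option names are placeholders based on my reading of the connector's README, so treat them as assumptions):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder().appName("s3-sqs-demo").getOrCreate()

// Schema of the data files themselves. It cannot include the partition
// columns, because those exist only in the S3 object key
// (e.g. s3://my-bucket/events/date=2020-06-18/part-0000.json).
val eventSchema = new StructType()
  .add("id", StringType)
  .add("value", DoubleType)

val events = spark.readStream
  .format("s3-sqs")
  .schema(eventSchema)
  .option("sqsUrl", "https://sqs.us-east-1.amazonaws.com/123456789012/my-queue") // placeholder
  .option("fileFormat", "json")
  .load()

// `events` contains only id and value -- the "date" partition column is
// missing, so a downstream Delta merge writes every row to
// __HIVE_DEFAULT_PARTITION__.
```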

Do you have any solution/workaround to add the partition columns as part of the dataframe?

Thanks and regards, Dipesh Vora

DipeshV avatar Jun 18 '20 13:06 DipeshV

@DipeshV seems like a bug. Thanks for pointing this out. I will work on the fix.

abhishekd0907 avatar Jun 23 '20 12:06 abhishekd0907

Hi Abhishek,

I am currently adding the partitions manually, which makes my code a bit messy and means it cannot be reused as-is when adding new integrations. Do we have any fix for this?

Thanks, Dipesh

DipeshV avatar Jun 30 '20 07:06 DipeshV

@DipeshV yeah, I'll raise a PR for the fix today.

abhishekd0907 avatar Jul 01 '20 06:07 abhishekd0907

@DipeshV I've created a pull request. Can you build a jar from the new branch and try it out?

abhishekd0907 avatar Jul 01 '20 10:07 abhishekd0907

@DipeshV Did you get a chance to try out the new code? Does it solve your use case?

abhishekd0907 avatar Jul 17 '20 13:07 abhishekd0907

@abhishekd0907 - I haven't checked the new code yet, since I had already added the partitions manually from input_file_name(). I will test it with the new code, though.
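For reference, the manual workaround looks roughly like this (column names and the path layout are illustrative):

```scala
import org.apache.spark.sql.functions.{col, input_file_name, regexp_extract}

// Recover the partition value from the S3 object key, e.g.
// s3://my-bucket/events/date=2020-06-18/part-0000.json
val withPartitions = events
  .withColumn("source_file", input_file_name())
  .withColumn("date", regexp_extract(col("source_file"), "date=([^/]+)", 1))
  .drop("source_file")
```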

DipeshV avatar Jul 27 '20 07:07 DipeshV