
adding support for partitioned s3 source


Background

The S3-SQS source doesn't support reading partition columns from the S3 bucket path. As a result, the dataset formed using the S3-SQS source doesn't contain the partition columns, leading to issue #2.

How this PR Handles the Problem

With the new changes, the user can specify partition columns in the schema by setting `isPartitioned` to `"true"` in the column metadata.

Example:

import org.apache.spark.sql.types._

// Mark the column as a partition column via the "isPartitioned" metadata key.
val metaData = new MetadataBuilder().putString("isPartitioned", "true").build()

// col1 is read from the partition directories (e.g. .../col1=10/...) rather than from the file contents.
val partitionedSchema = new StructType().add(StructField("col1", IntegerType, true, metaData))

The user also needs to specify the `basePath` option if the schema contains partition columns. Specifying partition columns without specifying the `basePath` will throw an error.

Example:

A file at `s3://bucket/basedDir/part1=10/part2=20/file.json` will have `basePath` as `s3://bucket/basedDir/`.
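For context, here is a minimal sketch of what a streaming read with a partitioned schema and `basePath` might look like. The `basePath` option and the `isPartitioned` metadata flag are the ones described above; the format name (`s3-sqs`) and the `fileFormat`/`sqsUrl` option names are illustrative assumptions only, so check the connector's documentation for the exact names.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder().appName("s3-sqs-partitioned-read").getOrCreate()

// Partition columns carry the "isPartitioned" metadata flag, as in the schema example above.
val metaData = new MetadataBuilder().putString("isPartitioned", "true").build()
val partitionedSchema = new StructType()
  .add(StructField("value", StringType))                   // regular data column read from file contents
  .add(StructField("part1", IntegerType, true, metaData))  // partition column read from the directory name
  .add(StructField("part2", IntegerType, true, metaData))

// "s3-sqs", "fileFormat" and "sqsUrl" below are assumed names for illustration only.
val df = spark.readStream
  .format("s3-sqs")
  .schema(partitionedSchema)
  .option("fileFormat", "json")
  .option("sqsUrl", "https://sqs.us-east-1.amazonaws.com/123456789012/my-queue")
  .option("basePath", "s3://bucket/basedDir/")  // directory above the part1=/part2= partition dirs
  .load()
```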
