s3-sqs-connector
Adding support for partitioned S3 source
Background
The S3-SQS source doesn't support reading partition columns from the S3 bucket path. As a result, datasets created with the S3-SQS source are missing the partition columns, leading to issue #2.
How this PR Handles the Problem
With the new changes, the user can specify partition columns in the schema by setting `isPartitioned` to `true` in the column metadata.
Example:
```scala
import org.apache.spark.sql.types._

// Mark the column as a partition column via its metadata.
val metaData = new MetadataBuilder().putString("isPartitioned", "true").build()
val partitionedSchema = new StructType().add(StructField("col1", IntegerType, true, metaData))
```
The user also needs to specify the `basePath` option if the schema contains partition columns. Specifying partition columns without specifying the `basePath` will throw an error.
Example:
For `s3://bucket/basedDir/part1=10/part2=20/file.json`, the `basePath` is `s3://bucket/basedDir/`.
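To illustrate the convention above, here is a minimal sketch of how partition column values can be derived from an object key relative to the `basePath`. This is only an illustration of the Hive-style `key=value` directory layout, not the connector's actual implementation; `PartitionPathParser` is a hypothetical name.

```scala
// Illustrative sketch (not the connector's code): extract Hive-style
// partition values (e.g. part1=10/part2=20) from an object key,
// given the basePath that prefixes it.
object PartitionPathParser {
  def partitionValues(path: String, basePath: String): Map[String, String] = {
    // Drop the basePath prefix, then any leading slash.
    val relative = path.stripPrefix(basePath).stripPrefix("/")
    relative
      .split("/")
      .filter(_.contains("="))        // keep only key=value directory segments
      .map { seg =>
        val Array(k, v) = seg.split("=", 2)
        k -> v
      }
      .toMap
  }
}
```

For the example path above, `PartitionPathParser.partitionValues("s3://bucket/basedDir/part1=10/part2=20/file.json", "s3://bucket/basedDir/")` yields `Map("part1" -> "10", "part2" -> "20")`, which is how the partition columns end up in the dataset.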