Problem with read query from redshift
Hi,
We use spark-redshift to read data from Redshift, and we encountered the following error:
```
Error: Amazon Invalid operation: syntax error at or near "" Position: 403;
java.sql.SQLException: Amazon Invalid operation: syntax error at or near "" Position: 403;
```
(full stack trace is attached)
What happens is that we read the data using the SQL query below, and then filtered the resulting DataFrame for a specific id and counted it, like so:

Count:

```scala
df.filter($"iid" === "rzfNKjeX9iYxrvfWosCH").count
```
Generic read method:

```scala
def readDF(preAction: String, sql: String): DataFrame = {
  logger.info(s"*** readDF for sql: $sql")
  val sqlContext = sparkSession.sqlContext
  val dataWaresHouseConfig = configuration.databases.dataWarehouse
  val redshiftConfig = dataWaresHouseConfig.redshift
  val redshiftDBName = redshiftConfig.dbName
  val redshiftUserId = redshiftConfig.user
  val redshiftPassword = redshiftConfig.password
  val redshiftUrl = redshiftConfig.url
  val schemaName = redshiftConfig.schemaName
  val jdbcURL = s"jdbc:redshift://$redshiftUrl/$redshiftDBName?user=$redshiftUserId&password=$redshiftPassword"
  // temporary S3 bucket with an object lifecycle configuration
  // to ensure that temporary files are automatically deleted after a specified expiration period
  val now = DateTimeUtils.dateFormatS3DestinationFolders.print(DateTime.now())
  val tempS3Dir = dataWaresHouseConfig.s3.bucketUrlS3nUnload + s"/unload_from_dwh/$now"
  val df: DataFrame = sqlContext.read
    .format("com.databricks.spark.redshift")
    .option("url", jdbcURL)
    .option("query", sql)
    .option("tempdir", tempS3Dir)
    .option("preactions", preAction)
    .load()
  df
}
```
Read the DataFrame:

```scala
val df: DataFrame = dataWarehouseWriterService.readDF("",
  s"select DISTINCT iid,normalized_domain,is_new_visit,impression_timestamp,impression_date,visitor_key,url" +
    ",uri,ip,page_type_id,site_id,segment_id,vertical_id,visit_id,instance_id,referrer,is_poc " +
    s"from ${schemaName}.${tableName} where ${dateFieldName} >= '${start.toString()}' and ${dateFieldName} < '${end.toString()}'")
```
Because of Spark's lazy evaluation, the count caused Spark to generate the following query against Redshift:
```sql
SELECT count(*) FROM (select DISTINCT iid,normalized_domain,is_new_visit,impression_timestamp,impression_date,visitor_key,url,uri,ip,page_type_id,site_id,segment_id,vertical_id,visit_id,instance_id,referrer,is_poc from xspark_dev.fact_events where impression_timestamp >= '2017-09-18T08:00:00.000Z' and impression_timestamp <'2017-09-19T08:00:00.000Z') WHERE "iid" IS NOT NULL AND "iid" = \'rzfNKjeX9iYxrvfWosCH\'
```
and Spark decided to read the result over JDBC rather than UNLOAD to S3 (which I believe is the correct decision in this case). However, it escaped the quotes around the iid value (as it should do only for UNLOAD), which resulted in the error.
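My guess at the mechanics (a sketch under assumptions, not spark-redshift's actual code; `sqlLiteral` and `escapeForUnload` are hypothetical helpers I made up): the string literal in the pushed-down predicate appears to be escaped as if it were going to be embedded inside an `UNLOAD ('...')` statement, where every single quote must be escaped, even though the statement is sent over plain JDBC:

```scala
object EscapeDemo {
  // Render a value as a SQL string literal (the form a plain JDBC statement needs);
  // standard SQL escaping doubles any embedded single quotes.
  def sqlLiteral(value: String): String =
    "'" + value.replace("'", "''") + "'"

  // Escape a SQL fragment for embedding inside UNLOAD ('...'), where every
  // single quote -- including the literal's own delimiters -- gets a backslash.
  def escapeForUnload(fragment: String): String =
    fragment.replace("'", "\\'")

  def main(args: Array[String]): Unit = {
    val literal = sqlLiteral("rzfNKjeX9iYxrvfWosCH")
    println(literal)                  // what a JDBC query needs
    println(escapeForUnload(literal)) // the backslash-escaped form we actually got
  }
}
```

The second output reproduces the exact `\'rzfNKjeX9iYxrvfWosCH\'` form in the failing query, which Redshift rejects with the `syntax error at or near ""` message.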
Is this known?
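Until there is a fix, a possible workaround is to inline the id predicate into the SQL passed to readDF, so Spark never generates the pushed-down filter itself. A minimal sketch, assuming the same schema/table/date-field parameters as above (`buildQuery` and `lit` are hypothetical helpers, and the column list is shortened for brevity):

```scala
object QueryBuilder {
  // Double embedded single quotes -- the escaping Redshift accepts over JDBC.
  private def lit(value: String): String =
    "'" + value.replace("'", "''") + "'"

  // Build the read query with the iid predicate already inlined, so no
  // DataFrame-level filter (and no pushdown escaping) is needed.
  def buildQuery(schemaName: String, tableName: String, dateFieldName: String,
                 start: String, end: String, iid: String): String =
    s"select distinct iid, normalized_domain, impression_timestamp " +
      s"from $schemaName.$tableName " +
      s"where $dateFieldName >= ${lit(start)} and $dateFieldName < ${lit(end)} " +
      s"and iid = ${lit(iid)}"
}
```

Then the count becomes `dataWarehouseWriterService.readDF("", QueryBuilder.buildQuery(...)).count`, with no `.filter` on the DataFrame.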
Thanks,
Nir

Attachment: stack_redshift.txt
6 months and this issue is still open. Any update?