
Problem with read query from redshift

Open nirby82 opened this issue 8 years ago • 1 comment

Hi,

We use spark-redshift to read data from redshift and we encountered the following error:

```
java.sql.SQLException: Amazon Invalid operation: syntax error at or near "" Position: 403;
```

(full stack trace is attached)

What happens is that we read the data using the SQL query below, and then filtered the resulting df for a specific id and counted, like so:

```scala
df.filter($"iid" === "rzfNKjeX9iYxrvfWosCH").count
```

Generic method for reading:

```scala
def readDF(preAction: String, sql: String): DataFrame = {
  logger.info(s"*** readDF for sql: $sql")
  val sqlContext = sparkSession.sqlContext

  val dataWaresHouseConfig = configuration.databases.dataWarehouse
  val redshiftConfig = dataWaresHouseConfig.redshift

  val redshiftDBName = redshiftConfig.dbName
  val redshiftUserId = redshiftConfig.user
  val redshiftPassword = redshiftConfig.password
  val redshiftUrl = redshiftConfig.url
  val schemaName = redshiftConfig.schemaName
  val jdbcURL = s"jdbc:redshift://$redshiftUrl/$redshiftDBName?user=$redshiftUserId&password=$redshiftPassword"

  // Temporary S3 bucket with an object-lifecycle configuration
  // to ensure that temporary files are automatically deleted after a specified expiration period.
  val now = DateTimeUtils.dateFormatS3DestinationFolders.print(DateTime.now())
  val tempS3Dir = dataWaresHouseConfig.s3.bucketUrlS3nUnload + s"/unload_from_dwh/$now"

  val df: DataFrame = sqlContext.read
    .format("com.databricks.spark.redshift")
    .option("url", jdbcURL)
    .option("query", sql)
    .option("tempdir", tempS3Dir)
    .option("preactions", preAction)
    .load()

  df
}
```

Reading the df:

```scala
val df: DataFrame = dataWarehouseWriterService.readDF(
  "",
  s"select DISTINCT iid, normalized_domain, is_new_visit, impression_timestamp, impression_date, visitor_key, url, " +
    "uri, ip, page_type_id, site_id, segment_id, vertical_id, visit_id, instance_id, referrer, is_poc " +
    s"from ${schemaName}.${tableName} where ${dateFieldName} >= '${start.toString()}' and ${dateFieldName} < '${end.toString()}'"
)
```

Due to Spark's lazy evaluation, the count triggered the following query, which spark-redshift sent directly to Redshift:

```sql
SELECT count(*) FROM (select DISTINCT iid,normalized_domain ,is_new_visit ,impression_timestamp,impression_date,visitor_key,url,uri,ip,page_type_id,site_id,segment_id,vertical_id,visit_id,instance_id ,referrer, is_poc from xspark_dev.fact_events where impression_timestamp >= '2017-09-18T08:00:00.000Z' and impression_timestamp <'2017-09-19T08:00:00.000Z') WHERE "iid" IS NOT NULL AND "iid" = \'rzfNKjeX9iYxrvfWosCH\'
```

Spark decided to read the result over JDBC rather than unloading to S3 (which I believe is the correct decision in this case). However, it backslash-escaped the quotes around the iid literal (as it should for UNLOAD), which resulted in the syntax error.
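If it helps confirm the diagnosis, here is a minimal sketch of the two quoting conventions at play. The helper names are mine, not spark-redshift's; this only illustrates why a backslash-escaped literal is valid inside an `UNLOAD ('...')` wrapper but not in a query sent straight over JDBC:

```scala
// Hypothetical helpers, for illustration only.

// Standard SQL quoting for a literal sent directly over JDBC:
// the value is wrapped in single quotes, and any embedded single
// quote is doubled ('') rather than backslash-escaped.
def quoteForJdbc(value: String): String =
  "'" + value.replace("'", "''") + "'"

// Quoting for a query that will be embedded inside UNLOAD ('...'):
// the surrounding quotes are backslash-escaped so they survive the
// outer single-quoted UNLOAD string.
def quoteForUnload(value: String): String =
  "\\'" + value + "\\'"

val id = "rzfNKjeX9iYxrvfWosCH"

// Valid in a plain JDBC query:
println(quoteForJdbc(id))   // 'rzfNKjeX9iYxrvfWosCH'

// What the generated count query actually contained; Redshift rejects
// the \' sequence when the query is not wrapped in UNLOAD ('...'):
println(quoteForUnload(id)) // \'rzfNKjeX9iYxrvfWosCH\'
```

So the bug appears to be that the pushed-down filter is always escaped UNLOAD-style, even when the plan (a simple count) is executed over JDBC.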

Is this known?

Thanks,
Nir

stack_redshift.txt

nirby82 · Nov 27 '17 13:11

6 months and this issue still open. Any update?

ahharu · May 10 '18 09:05