spark-redshift
How to use a DataFrame in a query
I have the following DataFrame:

val data = Seq(
  ("1", "A", "P"),
  ("2", "B", "Q"),
  ("3", "C", "R"),
  ("4", "D", "S"),
  ("5", "E", "T")
).toDF("id", "key", "key_val")
In Redshift we have data in the same format, which looks like this:
2 | A | P
3 | C | R
3 | D | S
Now I want to delete the records that match the DataFrame, using a query like this:
val deleteQuery = " delete from table_name where concat(concat(key_val, ':'), key) in (" + data.select("key_val","key") +")";
data.write.format("com.databricks.spark.redshift")
.option("url", jdbcURL)
.option("dbtable", "data")
.option("tempdir", "s3n://tempDir")
.option("temporary_aws_access_key_id", awsAccessKey)
.option("temporary_aws_secret_access_key", awsSecretKey)
.option("temporary_aws_session_token", token)
.option("preactions", deleteQuery)
.mode(SaveMode.Append)
.save()
But the above delete query does not work: in the condition, data.select("key_val", "key") returns a DataFrame, not something the query can understand. And I don't want to collect the keys and pass a formatted string into the query, like this:
val collectKey = data.select("key", "key_val").collect.map(x => x(0) + ":" + x(1)).mkString("'", "', '", "'")
The DataFrame has millions of key/value pairs, and I don't want to collect every pair on the driver, build a string, and pass it into the query.
Is there a way to pass DataFrame values into the query?
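
One approach that might avoid collecting anything on the driver (a rough, untested sketch; the staging table name stage_delete_keys is a placeholder, and table_name stands for the target table): load the keys into a staging table in Redshift and let Redshift perform the delete with a join from a postactions statement.

```scala
// Sketch: stage the keys to delete in Redshift, then delete via a join (no driver-side collect).
// "stage_delete_keys" is a placeholder staging table; "table_name" is the target table.
val deleteViaJoin =
  """DELETE FROM table_name
    |USING stage_delete_keys
    |WHERE table_name.key = stage_delete_keys.key
    |  AND table_name.key_val = stage_delete_keys.key_val;
    |DROP TABLE stage_delete_keys""".stripMargin

data.select("key", "key_val")
  .write.format("com.databricks.spark.redshift")
  .option("url", jdbcURL)
  .option("dbtable", "stage_delete_keys")       // staging table holding the keys to delete
  .option("tempdir", "s3n://tempDir")
  .option("temporary_aws_access_key_id", awsAccessKey)
  .option("temporary_aws_secret_access_key", awsSecretKey)
  .option("temporary_aws_session_token", token)
  .option("postactions", deleteViaJoin)         // executed after the staging table is loaded
  .mode(SaveMode.Overwrite)                     // recreate the staging table on each run
  .save()
```

As far as I can tell, preactions and postactions accept a semicolon-separated list of SQL statements, so the DELETE and the DROP of the staging table should run right after the staging load; the original append to the main table could then be done as a separate write.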