spark-redshift
How to use a DataFrame in a query
I have the following DataFrame:

val data = Seq(
  ("1", "A", "P"),
  ("2", "B", "Q"),
  ("3", "C", "R"),
  ("4", "D", "S"),
  ("5", "E", "T")
).toDF("id", "key", "key_val")
In Redshift we have data in the same format, which looks like this:
2 | A | P
3 | C | R
3 | D | S
Now I want to delete the records that match the DataFrame, using a query like this:
val deleteQuery = " delete from table_name where concat(concat(key_val, ':'), key) in (" + data.select("key_val","key") +")";
data.write.format("com.databricks.spark.redshift")
.option("url", jdbcURL)
.option("dbtable", "data")
.option("tempdir", "s3n://tempDir")
.option("temporary_aws_access_key_id", awsAccessKey)
.option("temporary_aws_secret_access_key", awsSecretKey)
.option("temporary_aws_session_token", token)
.option("preactions", deleteQuery)
.mode(SaveMode.Append)
.save()
But the above delete query does not work: in the condition, data.select("key_val", "key") returns a DataFrame, not something the query can understand. And I don't want to collect the keys and pass a formatted string into the query, like this:
val collectKey = data.select("key", "key_val").collect.map(x => x(0) + ":" + x(1)).mkString("'", "', '", "'")
The DataFrame has millions of key/value pairs, and I don't want to collect every pair on the driver, build a string, and pass it into the query.
Is there a way to pass DataFrame values into the query?
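
One approach that might avoid collecting anything on the driver (a rough, untested sketch; the staging table name stage_delete_keys is a placeholder, and table_name stands for the target table): load the keys into a staging table in Redshift and let Redshift perform the delete with a join from a postactions statement.

```scala
// Sketch: stage the keys to delete in Redshift, then delete via a join (no driver-side collect).
// "stage_delete_keys" is a placeholder staging table; "table_name" is the target table.
val deleteViaJoin =
  """DELETE FROM table_name
    |USING stage_delete_keys
    |WHERE table_name.key = stage_delete_keys.key
    |  AND table_name.key_val = stage_delete_keys.key_val;
    |DROP TABLE stage_delete_keys""".stripMargin

data.select("key", "key_val")
  .write.format("com.databricks.spark.redshift")
  .option("url", jdbcURL)
  .option("dbtable", "stage_delete_keys")       // staging table holding the keys to delete
  .option("tempdir", "s3n://tempDir")
  .option("temporary_aws_access_key_id", awsAccessKey)
  .option("temporary_aws_secret_access_key", awsSecretKey)
  .option("temporary_aws_session_token", token)
  .option("postactions", deleteViaJoin)         // executed after the staging table is loaded
  .mode(SaveMode.Overwrite)                     // recreate the staging table on each run
  .save()
```

As far as I can tell, preactions and postactions accept a semicolon-separated list of SQL statements, so the DELETE and the DROP of the staging table should run right after the staging load; the original append to the main table could then be done as a separate write.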