hyperspace
[FEATURE REQUEST]: Add support for derived columns in the source dataset
Feature requested
This request comes from an issue we hit when calling .createIndex from Scala. Our source dataset has a few fields that hold date-like values but are stored as strings, and we would like to expose those fields as dates.
Currently, when we ingest data from a remote storage account with spark.read, we chain a few column transformations onto the call, similar to the following: .withColumn("BillingMonthdt",to_date(to_timestamp(col("BillingMonthdt"))))
However, a few lines later we call .createIndex on the resulting dataset, and it fails with an exception. The message reads: com.microsoft.hyperspace.HyperspaceException: Only creating index over HDFS file based scan nodes is supported
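For reference, the sequence described above can be sketched roughly as follows. This is a minimal reconstruction, not the exact job: the Parquet format, the path, the index name, and the Amount column are all placeholders assumed for illustration. Only the withColumn expression and the exception message come from the report.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, to_date, to_timestamp}
import com.microsoft.hyperspace.Hyperspace
import com.microsoft.hyperspace.index.IndexConfig

object DerivedColumnRepro {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("derived-column-repro").getOrCreate()

    // Placeholder path and format: the real source is a remote storage account.
    val df = spark.read
      .parquet("/path/to/billing")
      // Re-expose the string field as a date (expression taken from the report).
      .withColumn("BillingMonthdt", to_date(to_timestamp(col("BillingMonthdt"))))

    val hs = new Hyperspace(spark)

    // "billingIdx" and "Amount" are hypothetical names.
    // As reported, this call fails with:
    //   HyperspaceException: Only creating index over HDFS file based scan nodes is supported
    // because the plan is no longer a plain file scan once withColumn is applied.
    hs.createIndex(df, IndexConfig("billingIdx", Seq("BillingMonthdt"), Seq("Amount")))
  }
}
```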
Acceptance criteria
It would be very helpful if derived columns like these were supported when creating an index. Otherwise, we are forced to persist the transformed dataset to attached storage, which creates two copies of the data, before we can pass it to .createIndex, which we really need.
Success criteria
Same as acceptance criteria
Additional context
N/A
This is essentially creating an index on a view. Related issue: https://github.com/microsoft/hyperspace/issues/186
@SynapsePOC, if your index doesn't include those added columns, you can try to create an index on the base relation. Just to be sure: in your case, do you want to include those added columns in indexed columns or included columns? Or both?
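A rough sketch of the base-relation suggestion, assuming the index only needs columns that already exist in the source files. The path, the index name, and the CustomerId and Amount columns are hypothetical; whether the optimizer then uses the index for a query over the derived column depends on the query and is not confirmed here.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, to_date, to_timestamp}
import com.microsoft.hyperspace._
import com.microsoft.hyperspace.index.IndexConfig

object BaseRelationIndex {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("base-relation-index").getOrCreate()

    // Read the source files without any added columns, so the plan stays a plain file scan.
    val base = spark.read.parquet("/path/to/billing") // placeholder path and format

    val hs = new Hyperspace(spark)

    // Index only columns that exist in the files (hypothetical names).
    hs.createIndex(base, IndexConfig("billingBaseIdx", Seq("CustomerId"), Seq("Amount")))

    // Allow Spark to consider Hyperspace indexes during query optimization.
    spark.enableHyperspace()

    // Apply the derived column after the index exists; it is not part of the index.
    val derived = base.withColumn("BillingMonthdt", to_date(to_timestamp(col("BillingMonthdt"))))
    derived.filter(col("CustomerId") === "some-id").select("CustomerId", "Amount").show()
  }
}
```

This does not help if the derived date column itself has to be an indexed or included column, which is the case the question above is trying to pin down.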