hyperspace [FEATURE REQUEST]: please do add support for derived columns in source dataset

Open SynapsePOC opened this issue 4 years ago • 2 comments

Feature requested

This feature request is based on an issue we hit when calling the .createIndex method in Scala. Our source dataset has a few fields whose values look like dates but are stored as strings; we would like to expose those same fields as dates.

Currently, when we ingest data from a remote storage account with spark.read, we chain a few column-expansion calls onto it that look similar to the following: .withColumn("BillingMonthdt",to_date(to_timestamp(col("BillingMonthdt"))))

However, a few lines after this call we run .createIndex, and it fails with an exception. The message reads: com.microsoft.hyperspace.HyperspaceException: Only creating index over HDFS file based scan nodes is supported
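
For reference, the sequence described above can be sketched roughly as follows. This is a minimal sketch meant for spark-shell or a notebook, not our exact code: the storage path, the Parquet format, the index name, and the Amount column are placeholders, and only the withColumn expression and the exception message come from our actual run.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, to_date, to_timestamp}
import com.microsoft.hyperspace.Hyperspace
import com.microsoft.hyperspace.index.IndexConfig

val spark = SparkSession.builder().appName("derived-column-index").getOrCreate()

// Placeholder path and format (assumption): substitute the real source location.
val sourcePath = "/data/billing"
val raw = spark.read.parquet(sourcePath)

// Derived column from this issue: re-expose the string field as a date.
val derived = raw.withColumn(
  "BillingMonthdt",
  to_date(to_timestamp(col("BillingMonthdt"))))

val hs = new Hyperspace(spark)

// Placeholder index name and included column (assumptions).
val config = IndexConfig(
  "billingMonthIdx",
  indexedColumns = Seq("BillingMonthdt"),
  includedColumns = Seq("Amount"))

// This is the call that fails for us with
// "Only creating index over HDFS file based scan nodes is supported",
// presumably because the DataFrame is no longer a plain file scan
// once withColumn has been applied.
hs.createIndex(derived, config)
```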

Acceptance criteria

It would be very helpful if such derived columns were supported when creating an index. Otherwise, we are forced to persist the dataset to attached storage, creating two copies of the data, before we can pass it to the .createIndex command, which we really need.

Success criteria

Same as acceptance criteria

Additional context

N/A

Feb 25 '21 20:02 SynapsePOC

This is essentially creating an index on a view. Related issue: https://github.com/microsoft/hyperspace/issues/186

Apr 06 '21 03:04 clee704

@SynapsePOC, if your index doesn't include those added columns, you can try to create an index on the base relation. Just to be sure: in your case, do you want to include those added columns in indexed columns or included columns? Or both?
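
A rough sketch of what indexing the base relation could look like, assuming a spark-shell or notebook session where `spark` already exists. The path, the Parquet format, the index name, and the CustomerId and Amount columns are hypothetical; the point is only that createIndex receives the DataFrame before any withColumn call is applied.

```scala
import org.apache.spark.sql.functions.{col, to_date, to_timestamp}
import com.microsoft.hyperspace.Hyperspace
import com.microsoft.hyperspace.index.IndexConfig

// Hypothetical path and format: substitute the real source location.
val sourcePath = "/data/billing"

// Base relation: a plain file-based scan with no derived columns.
val base = spark.read.parquet(sourcePath)

val hs = new Hyperspace(spark)

// Index only columns that exist in the stored files
// (CustomerId and Amount are hypothetical names).
hs.createIndex(
  base,
  IndexConfig("billingBaseIdx", indexedColumns = Seq("CustomerId"), includedColumns = Seq("Amount")))

// Derived columns are added afterwards, on top of the same base relation.
val derived = base.withColumn(
  "BillingMonthdt",
  to_date(to_timestamp(col("BillingMonthdt"))))
```

This only helps when neither the indexed nor the included columns are derived ones, which is why the question above matters.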

Apr 06 '21 03:04 clee704
