hyperspace
[FEATURE REQUEST]: Ability to specify a path to create indexes outside the system path.
Feature requested
Currently, Hyperspace creates indexes under the system path specified by spark.hyperspace.system.path. The user should be able to specify a different path to create/search the indexes. Note that #242 removes spark.hyperspace.index.creation.path and spark.hyperspace.index.search.paths since they are not used, but they can be brought back when this feature is implemented.
Acceptance criteria
- [ ] The user can specify a different path to create an index outside spark.hyperspace.system.path. FYI, this is more or less possible by temporarily setting spark.hyperspace.system.path to a different location, but it needs to be discussed whether that is a good solution.
- [ ] The user can specify multiple paths to "search" for indexes to apply.
Success criteria N/A
Additional context N/A
Would it be possible to store indexes outside of the primary ADLS path (defined by spark.sql.warehouse.dir)? Let's say we have one single Synapse instance which processes the files (Delta lakes) into different storage accounts (by tenant).
Hi @Neoix. I think you can specify spark.hyperspace.system.path with the storage account that will be used.
Hi @paryoja, unfortunately this is not working. spark.hyperspace.system.path depends on spark.sql.warehouse.dir as per the docs.
For reference: https://microsoft.github.io/hyperspace/docs/ug-configuration/
@Neoix, it depends on it only for the default value. You can change the path to any other storage.
EDIT: Note that the config should be set before creating the Hyperspace object.
Example:
spark.conf.set("spark.hyperspace.system.path", "abfss://<containerName>@<accountName>.dfs.core.windows.net/path/to/indexes")
val hs = new Hyperspace(spark)
hs.createIndex(..)
I'm getting this error when trying to create indexes on top of my Delta Lake DataFrame.
com.microsoft.hyperspace.HyperspaceException: Only creating index over HDFS file based scan nodes is supported.
Please note that my ADLS Gen2 is in another Azure tenant and Synapse accesses it using a service principal.
It looks like when using Delta Lake, we should explicitly set new configs as described in the docs!
spark.Conf().Set(
    "spark.hyperspace.index.sources.fileBasedBuilders",
    "com.microsoft.hyperspace.index.sources.delta.DeltaLakeFileBasedSourceBuilder," +
    "com.microsoft.hyperspace.index.sources.default.DefaultFileBasedSourceBuilder");
It works fine after adding this line. Thank you!
Hi, I'm having the same problem. I'm trying to create indexes on some delta tables that I have in Synapse.
The creation of indexes is fine if I read the table as a traditional Parquet table. If instead I pass a DataFrame to the createIndex function as in the following instructions, the exception that is generated is com.microsoft.hyperspace.HyperspaceException: Only creating index over HDFS file based scan nodes is supported.
idx = IndexConfig(index_name, list_of_index_columns, list_of_included_columns)
df = spark.read.format('delta').load(path_to_delta)
hyperspace.createIndex(df, idx)
I set the spark conf correctly.
spark.conf.set("spark.hyperspace.index.sources.fileBasedBuilders", "com.microsoft.hyperspace.index.sources.delta.DeltaLakeFileBasedSourceBuilder," + "com.microsoft.hyperspace.index.sources.default.DefaultFileBasedSourceBuilder")
The Spark version of the notebook is 3.2, and I don't mind if the path for index creation is the default one; no need to modify it. Has anyone experienced the same?