
[FEATURE REQUEST]: Ability to specify path to create index outside system path.

imback82 opened this issue 4 years ago • 7 comments

Feature requested Currently, Hyperspace creates indexes under the system path specified by spark.hyperspace.system.path. The user should be able to specify a different path to create and search the indexes. Note that #242 removes spark.hyperspace.index.creation.path and spark.hyperspace.index.search.paths since they are not used, but they can be brought back when this feature is implemented.
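For illustration only, a possible usage sketch of the removed configs (hypothetical: both keys are removed by #242 and may look different if reintroduced; the abfss paths are placeholders):

// Hypothetical usage of the removed keys, shown only to illustrate the requested behavior.
spark.conf.set("spark.hyperspace.index.creation.path",
  "abfss://<containerName>@<accountName>.dfs.core.windows.net/hyperspace/creation")
spark.conf.set("spark.hyperspace.index.search.paths",
  "abfss://<containerName>@<accountName>.dfs.core.windows.net/hyperspace/creation," +
    "abfss://<containerName>@<accountName>.dfs.core.windows.net/hyperspace/shared")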

Acceptance criteria

  • [ ] The user can specify a different path to create an index outside spark.hyperspace.system.path. FYI, this is more or less possible by temporarily setting spark.hyperspace.system.path to a different location (see the sketch after this list), but whether that is a good solution still needs to be discussed.
  • [ ] The user can specify multiple paths to "search" indexes to apply.
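
A minimal sketch of the temporary-override workaround mentioned above, assuming the Scala API; the abfss path, source data, and index columns are placeholders:

import com.microsoft.hyperspace.Hyperspace
import com.microsoft.hyperspace.index.IndexConfig

// Remember the current value (if any) so it can be restored afterwards.
val previousSystemPath = spark.conf.getOption("spark.hyperspace.system.path")

// Temporarily point the system path at the alternate location (placeholder path).
spark.conf.set("spark.hyperspace.system.path",
  "abfss://<containerName>@<accountName>.dfs.core.windows.net/alternate/indexes")

val hs = new Hyperspace(spark)
val df = spark.read.parquet("/path/to/source/data")                   // placeholder data
hs.createIndex(df, IndexConfig("myIndex", Seq("col1"), Seq("col2")))  // placeholder columns

// Restore the previous setting.
previousSystemPath match {
  case Some(p) => spark.conf.set("spark.hyperspace.system.path", p)
  case None    => spark.conf.unset("spark.hyperspace.system.path")
}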

Success criteria N/A

Additional context N/A

imback82 avatar Nov 02 '20 18:11 imback82

Would it be possible to store indexes outside of the primary ADLS path (defined by spark.sql.warehouse.dir)? Let's say we have a single Synapse instance that processes the files (Delta lakes) into different storage accounts (by tenant).

neoix avatar Jan 23 '22 15:01 neoix

Hi @Neoix. I think you can set spark.hyperspace.system.path to a location in the storage account that will be used.

paryoja avatar Jan 25 '22 03:01 paryoja

Hi @paryoja, unfortunately this is not working. spark.hyperspace.system.path depends on spark.sql.warehouse.dir as per the docs. For reference: https://microsoft.github.io/hyperspace/docs/ug-configuration/

neoix avatar Jan 25 '22 09:01 neoix

@Neoix, it depends on spark.sql.warehouse.dir only for the default value. You can point the path at any other storage.

EDIT: Note that the config should be set before creating the Hyperspace object.

Example:

spark.conf.set("spark.hyperspace.system.path", "abfss://<containerName>@<accountName>.dfs.core.windows.net/path/to/indexes")
val hs = new Hyperspace(spark)
hs.createIndex(..)

sezruby avatar Jan 26 '22 01:01 sezruby

I'm getting this error when trying to create indexes on top of my Delta Lake DataFrame.

com.microsoft.hyperspace.HyperspaceException: Only creating index over HDFS file based scan nodes is supported.

Please note that my ADLS Gen2 account is in another Azure tenant and Synapse is accessing it using a service principal.

neoix avatar Jan 29 '22 11:01 neoix

It looks like, when using Delta Lake, we should explicitly set new configs as described in the docs!

spark.Conf().Set(
    "spark.hyperspace.index.sources.fileBasedBuilders",
    "com.microsoft.hyperspace.index.sources.delta.DeltaLakeFileBasedSourceBuilder," +
    "com.microsoft.hyperspace.index.sources.default.DefaultFileBasedSourceBuilder");

It works fine after adding this line. Thank you!
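
For reference, a rough Scala equivalent of this setup (a sketch only; the Delta path and index columns are placeholders):

import com.microsoft.hyperspace.Hyperspace
import com.microsoft.hyperspace.index.IndexConfig

// Register the Delta Lake source builder alongside the default one,
// before instantiating Hyperspace.
spark.conf.set(
  "spark.hyperspace.index.sources.fileBasedBuilders",
  "com.microsoft.hyperspace.index.sources.delta.DeltaLakeFileBasedSourceBuilder," +
    "com.microsoft.hyperspace.index.sources.default.DefaultFileBasedSourceBuilder")

val df = spark.read.format("delta")
  .load("abfss://<containerName>@<accountName>.dfs.core.windows.net/path/to/delta") // placeholder path
val hs = new Hyperspace(spark)
hs.createIndex(df, IndexConfig("myDeltaIndex", Seq("indexedCol"), Seq("includedCol"))) // placeholder columns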

neoix avatar Jan 29 '22 11:01 neoix

Hi, I'm having the same problem. I'm trying to create indexes on some delta tables that I have in Synapse.

The creation of indexes works fine if I read the table as a traditional parquet table. If I instead pass the createIndex function a DataFrame loaded as in the following instruction, the generated exception is com.microsoft.hyperspace.HyperspaceException: Only creating index over HDFS file based scan nodes is supported.

idx = IndexConfig(index_name, list_of_index_columns, list_of_included_columns)
df = spark.read.format('delta').load(path_to_delta)
hyperspace.createIndex(df, idx)

I set the spark conf correctly.

spark.conf.set("spark.hyperspace.index.sources.fileBasedBuilders", "com.microsoft.hyperspace.index.sources.delta.DeltaLakeFileBasedSourceBuilder," + "com.microsoft.hyperspace.index.sources.default.DefaultFileBasedSourceBuilder")

The Spark version of the notebook is 3.2, and I don't mind if the index creation path is the default one; no need to modify it. Did any of you experience the same?

S-G-dg avatar Aug 25 '23 10:08 S-G-dg