hyperspace Use indexes subdirectory for custom index system path

What is the context for this pull request?

Tracking Issue: n/a
Parent Issue: n/a
Dependencies: n/a

What changes were proposed in this pull request?

Use subdirectory for custom index system path for consistency & maintenance. e.g. with indexDirName = "indexes"

current:
- /custom/system/path/index1
- /custom/system/path/index2

with this change:
- /custom/system/path/indexes/index1
- /custom/system/path/indexes/index2

Added new config for the subdir name

// Config used for subdirectory name under the system path.
  val INDEX_DIR_NAME = "spark.hyperspace.system.indexDirName"
  val INDEX_DIR_NAME_DEFAULT = "indexes"

Does this PR introduce any user-facing change?

Yes, this PR changes the system directory path. This is a breaking change.

How was this patch tested?

Need to fix tests

Feb 26 '21 20:02 sezruby

@imback82 @rapoth @apoorvedave1 @thugsatbay

Though this change will be a breaking change, I think we need this sub root dir to avoid any kind of interference. We shouldn't rely on user's input. WDYT?

Feb 26 '21 20:02 sezruby

One advantage I see is that user index are more organized.
Will we be keeping some files in the indexes folder that denote some global state of all the index created so far ?
What kind of interference are we expecting ?
Will user ever come to the hyperspace index dir and play or move files ?

Feb 26 '21 21:02 thugsatbay

we might have some metadata later
if there's other directories in there, index manager tries to find or build IndexLogEntry on it and causes some exception, or directory size estimation or something like that
rarely but possible

The default behavior is creating indexes under the spark warehouse dir. But if the system path config is set, we don't create indexes under it.

@imback82 could you add some comment?

Feb 27 '21 00:02 sezruby

I guess the only concern I have is the breaking change as you pointed out. @rapoth?

Feb 27 '21 00:02 imback82

I'll give the example of using custom indexing path with Iceberg.

Iceberg has its own metadata folder where it's keeping its own information. That is the default value but can be changed if needed through some table or custom define Spark parameters. This is named metadata_location and is a relative path starting from the dataset path. In the case of Iceberg it is set in stone that the main use case is keeping the metadata next to the data.

In the case of Hyperspace, the current possibility of setting the path to whatever value it is needed is the best option although the user needs to know not to use a folder used for other purposes (ie: metadata in Iceberg case).

In my opinion, adding the hardcoded indexes value inside the path adds some limitations to the current freedom.

What should some one do if they want to keep the indexes next to their data using custom indexes path but the indexes folder inside the dataset is already used by other indexing mechanism?

Mar 03 '21 10:03 andrei-ionescu

@andrei-ionescu That's a fair point but currently Hyperspace doesn't have "metadata" dir and rely on the directory structure under the system path; so I suggest this sub-root dir. We don't support different index data locations yet (#243), but I think Hyperspace also needs "metadata" dir to maintain indexes located in different paths.

Mar 04 '21 04:03 sezruby

@sezruby I'm not against having a default folder name for the custom index system path, but I'm against of having it hardcoded (burned in code) as indexes.

I'm for making the sub-folder a setting at all levels: either if the user chooses the Spark warehouse (default) or dataset path (custom path) or other path if a new suffix is added to the path, the suffix should be a setting.

I see that in HyperspaceConf are these two constants: INDEX_SYSTEM_PATH and INDEXES_DIR. I see that INDEXES_DIR is not used in code and I would suggest to make use of it too, or rename it for a better understanding.

Code wise, this what I suggest:

def systemPath: Path = {
  val indexesSuffix = conf.getConfString(IndexConstants.INDEXES_DIR_SUFFIX, IndexConstants.INDEXES_DIR)
  val defaultIndexesPath = conf.getConfString("spark.sql.warehouse.dir")
  new Path(conf.getConfString(IndexConstants.INDEX_SYSTEM_PATH, defaultIndexesPath), indexesSuffix)
}

And add the new INDEXES_DIR_SUFFIX into IndexConstants

val INDEXES_DIR_SUFFIX = "spark.hyperspace.indexes.dir.suffix"

Just to conclude, I'm not against having this subdirectory but 1) let's make it settable 2) let's make it consistent over all possible pathing options in Hyperspace.

Mar 04 '21 10:03 andrei-ionescu

@andrei-ionescu Yep sounds good to me. Thanks for the good suggestion :)

@rapoth any opinion on this PR?

Mar 05 '21 07:03 sezruby

@imback82 Now at least we have a workaround ( indexDirName = "" ) for the breaking change. WDYT?

Mar 11 '21 02:03 sezruby

I think this PR is good in general, but we need some clarification and write something about how and where data is stored in Hyperspace. Maybe there is one already?

Apr 09 '21 05:04 clee704

Maybe there is one already?

Maybe this one? https://microsoft.github.io/hyperspace/docs/toh-indexes-on-the-lake/ I think it's good to add this link to quick-start guide for reference.

Apr 09 '21 05:04 sezruby