hyperspace
hyperspace copied to clipboard
[FEATURE REQUEST]: Scalable Refresh Index for Large Data Sources
Feature requested
Imagine a scenario where users have a folder /data/
where they ingest data into partitions (hive-style or not) and each partition contains ~10 files. Then we'd have something like the following:
-
/data/year=2020/month=10/day=20/hour=00/<30 files>
-
/data/year=2020/month=10/day=20/hour=01/<30 files>
-
/data/year=2020/month=10/day=20/hour=02/<30 files>
- ...
That is, we would have 30 files/hour * 24 hours = 720 files/day
, 21,600 files/month
, 259,200 files/year
. There are users who want to create a Hyperspace index on top of this data. Since data comes in at hourly granularity, users would want to call refreshIndex
(in incremental mode) to continue maintaining the index.
Currently, Hyperspace relies on a full enumeration on the underlying file system to determine what data has been appended/deleted by comparing it against what is stored in the _hyperspace_log
. However, this process could be prohibitively expensive in scenarios like the one above, especially if the user wants to call refreshIndex
more frequently.
There is a potential scope for optimization if the user provides us with information that indicates which portion of the data has changed. This can be done potentially by allowing users to specify the mutable portion of the underlying source data. For instance:
hyperspace.refreshIndex("indexName", scanPattern="<scan-pattern>")
Therefore, the user would use this API in one of the following ways:
- Data change in a particular partition:
refreshIndex("idx1", scanPattern="/data/year=2020/month=10/day=29/hour=03/*")
- Data change in a particular hierarchy:
refreshIndex("idx1", scanPattern="/data/year=2020/month=07/*")
In both cases, Hyperspace can avoid doing a naive enumeration on the underlying data source and instead look for changes only in the given scanPattern
.
Acceptance criteria
- [ ] New
refreshIndex
API that can take in thescanPattern
parameter
Success criteria
- [ ] Measure the time difference for invoking the current implementation of
refreshIndex
which does a full enumeration on a data source containing ~300k files with a single partition change and compare against the new implementation which scans for changes only for a given partition(s)