deleteOldVersions API

Open sezruby opened this issue 5 years ago • 5 comments

Describe the problem

After refreshing an index, the old version of the index remains on the storage. We should keep the old versions to support consistency & isolation of index data, but at some point in time, they're no longer needed. So it would be good if there's an API to clean up the old versions.

Describe your proposed solution

API design

def deleteOldVersions(indexName: String)

hs.deleteOldVersions("indexName")

~~But there's no API to show the list of versions. I think it would be great to provide an API for statistics of an index so that a user can check [ size of index / existing versions / creation time / last used time(from event log).. etc]~~

Now hs.index("indexName") returns an "indexContentPaths" column that shows the paths referred to by the latest index version. Based on that info, we could validate the given versions and determine which versions we should delete.

Additional context

sezruby avatar Aug 07 '20 01:08 sezruby

But there's no API to show the list of versions. I think it would be great to provide an API for statistics of an index so that a user can check [ size of index / existing versions / creation time / last used time(from event log).. etc]

Why are these sentences struck out? Was an API to show the list of versions added after you suggested this feature? Or is it just another feature suggestion?

paryoja avatar Jun 30 '21 07:06 paryoja

Yes, hs.index("indexName") was added after those sentences were written.

sezruby avatar Jun 30 '21 08:06 sezruby

@sezruby Is there any API that shows the full history of a given index? hs.index("indexName") seems to show only the latest stable log.

paryoja avatar Jul 02 '21 03:07 paryoja

No, but there's an internal API, getIndexLogEntry, and you can use getIndexContentDirectoryPaths.

BTW I realized that the following description is invalid now.

hs.deleteOldVersions("indexName", Seq(0, 1, 3)) // remove v__=0, v__=1, v__=3 dir

There are several problems here:

  • "version" doesn't mean the index data directory number (v__*)
  • one index log entry can refer to multiple "v__*" directories, because of incremental refresh
  • a Delta Lake time travel query can refer to an old version of the index data.
  • Hyperspace currently doesn't check the existence of index files at query time, for some reason.

So, I would suggest the following to make things simple:

  • API:
def deleteOldVersions(indexName: String)
hs.deleteOldIndexData("indexName")
  • remove all the directories that the latest version doesn't refer to
  • for a Delta Lake source, update the history property in IndexLogEntry, "deltaVersions"
    • reference PR: https://github.com/microsoft/hyperspace/pull/272
    • e.g. if the history value is 1:1,3:5,5:7,7:10,9:15, we can update the value to 9:15 for the latest log version.
    • this requires creating a newer version of the index log entry
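
The two steps above can be sketched in plain Scala. This is only an illustration, not Hyperspace code: the object and both function names are hypothetical, directories are modeled as plain strings, and the history value is assumed to be a comma-separated list of pairs ordered from oldest to newest.

```scala
// Hypothetical helpers for illustration only; none of these names exist in Hyperspace.
object OldIndexDataSketch {

  // Returns the directories that the latest index version does not refer to.
  // allDirs: every "v__=N" directory found under the index path.
  // latestDirs: directories referred to by the latest index log entry
  // (there can be more than one because of incremental refresh).
  def unreferencedDirs(allDirs: Seq[String], latestDirs: Set[String]): Seq[String] =
    allDirs.filterNot(latestDirs.contains)

  // Keeps only the last pair of the "deltaVersions" history value.
  // Assumption: pairs are comma-separated and ordered oldest to newest.
  // An empty history yields an empty string.
  def trimDeltaHistory(history: String): String =
    history.split(",").map(_.trim).filter(_.nonEmpty).lastOption.getOrElse("")
}
```

With the history value from the example, OldIndexDataSketch.trimDeltaHistory("1:1,3:5,5:7,7:10,9:15") returns "9:15". Actually deleting the directories and writing the new index log entry are left out of the sketch.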

sezruby avatar Jul 02 '21 07:07 sezruby

@sezruby OK, I will work on this. Since I am not familiar with Delta Lake time travel queries, I will do the simple implementation first and then ask you how time travel queries work.

As for the naming convention, delete index doesn't remove the actual index files, but vacuum index does.

Since the new API actually removes the index files (except the latest ones), I think a name like vacuumOldVersions or vacuumOld fits better.

WDYT?

paryoja avatar Jul 08 '21 05:07 paryoja