Re-index OpenSearch catalog

Open TejasRGitHub opened this issue 1 year ago • 4 comments

Is your idea related to a problem? Please describe. In data.all, whenever a change happens on a dataset, the indexer runs and automatically updates the OpenSearch index.

There can be situations in which a data.all admin manually edits, updates, or deletes records in RDS that pertain to datasets. In these cases, the catalog index is not updated.

Moreover, when a dataset is being mutated and an unhandled error occurs, some things on the dataset might get updated while others are not. This might happen during testing or when certain bugs are present. In this case, the data.all admin has to update the OpenSearch index manually.

Describe the solution you'd like A way in data.all for the data.all admin to easily re-sync/update the OpenSearch index. This could be a UI button that is displayed only to data.all admins, or a config option in cdk.json that causes the OpenSearch index to be updated and re-synced during deployment.

TejasRGitHub avatar Feb 26 '24 19:02 TejasRGitHub

Hi @TejasRGitHub - could this idea be an enhancement to the logic already in place in the ECS Scheduled Catalog Indexer Task which runs every 6 hours (/backend/dataall/modules/catalog/tasks/catalog_indexer_task.py)?

This ECS task should be able to handle updates to data objects. Maybe we can extend the logic to also handle deletes when some data objects no longer exist, or add a similar type of logic if required?

noah-paige avatar Feb 27 '24 22:02 noah-paige

Hi @noah-paige, thanks for pointing that out. I think we can use the existing Catalog Indexer ECS task and extend it to dataset objects. Currently I see that tables, folders and dashboards are indexed, and we could potentially just extend it to the dataset objects.

This by itself would solve the problem of indexing the dataset objects and re-indexing the catalog. However, I was also thinking it would be helpful to be able to start the re-indexing process manually from the UI. The button would be visible only to data.all admins, who could start the indexer if needed.

@noah-paige, @zsaltys, @anushka-singh, @rbernotas any thoughts on the above?

TejasRGitHub avatar Feb 28 '24 17:02 TejasRGitHub

Hi @TejasRGitHub, I agree with using the ECS catalog indexer task and extending it to datasets. A data.all admin can trigger the ECS task on demand directly in ECS (with ECS API commands). Do you think that is enough? Or should we add UI functionality? Curious to hear other people's thoughts.
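For reference, triggering the task through the ECS API could look roughly like the sketch below. It uses the boto3 ECS client's `run_task` call; the cluster name, task definition name, subnet and security group IDs are placeholders, and the Fargate launch type is an assumption rather than something confirmed in this thread.

```python
from __future__ import annotations

from typing import Any


def build_run_task_request(
    cluster: str,
    task_definition: str,
    subnets: list[str],
    security_groups: list[str],
) -> dict[str, Any]:
    """Build the keyword arguments for the boto3 ECS client's run_task call."""
    return {
        "cluster": cluster,
        "taskDefinition": task_definition,
        "launchType": "FARGATE",  # assumption: the indexer task runs on Fargate
        "count": 1,
        "networkConfiguration": {
            "awsvpcConfiguration": {
                "subnets": subnets,
                "securityGroups": security_groups,
                "assignPublicIp": "DISABLED",
            }
        },
    }


def trigger_catalog_indexer(request: dict[str, Any]) -> str:
    """Start one run of the task and return its ARN (needs AWS credentials)."""
    import boto3  # imported lazily so the request builder has no AWS dependency

    ecs = boto3.client("ecs")
    response = ecs.run_task(**request)
    if response.get("failures"):
        raise RuntimeError(f"ECS could not start the task: {response['failures']}")
    return response["tasks"][0]["taskArn"]


if __name__ == "__main__":
    # All names and IDs below are hypothetical placeholders.
    request = build_run_task_request(
        cluster="my-dataall-cluster",
        task_definition="my-catalog-indexer-task",
        subnets=["subnet-0123456789abcdef0"],
        security_groups=["sg-0123456789abcdef0"],
    )
    print(request)
```

Calling `trigger_catalog_indexer(request)` with real values would start a single run of the task outside its regular schedule.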

dlpzx avatar Mar 20 '24 22:03 dlpzx

Hi @dlpzx, I think we should have a UI for triggering this functionality on the fly. A separate UI to delete and update indexes would also be good to have. Currently, deleting an index on serverless OpenSearch is a tedious process that involves setting up an EC2 instance to reach the OpenSearch cluster. A UI that is visible only to admins would be very helpful here, I think.

TejasRGitHub avatar Apr 19 '24 13:04 TejasRGitHub

This feature to allow Admins to re-index the data.all Catalog has been implemented in PR #1365

It allows Admins to run re-index catalog tasks to sync catalog objects with data.all DB and optionally delete any orphaned resources on-demand
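Conceptually, the sync and optional orphan clean-up can be pictured as a set comparison between what is in the data.all DB and what is in the index. The snippet below is only an illustrative sketch of that idea and is not the code from the PR; `ReindexPlan`, `plan_reindex` and the URI values are hypothetical names.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class ReindexPlan:
    to_upsert: list[str]  # objects present in the DB: (re)write them to the index
    to_delete: list[str]  # orphans: indexed documents with no matching DB record


def plan_reindex(
    db_uris: Iterable[str],
    indexed_uris: Iterable[str],
    delete_orphans: bool = False,
) -> ReindexPlan:
    """Compare DB object identifiers with indexed document identifiers."""
    db = set(db_uris)
    indexed = set(indexed_uris)
    # Orphans are only scheduled for deletion when explicitly requested.
    orphans = sorted(indexed - db) if delete_orphans else []
    return ReindexPlan(to_upsert=sorted(db), to_delete=orphans)


if __name__ == "__main__":
    plan = plan_reindex(
        db_uris=["dataset-1", "table-2"],
        indexed_uris=["table-2", "table-9"],
        delete_orphans=True,
    )
    print(plan.to_upsert)  # ['dataset-1', 'table-2']
    print(plan.to_delete)  # ['table-9']
```

Every object in the DB is re-written so that manual edits in RDS are picked up, while deletion stays opt-in so a re-index does not remove anything unless asked to.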

Closing this issue by EOD today. Please do let us know if there are any additional follow-ups or concerns.

noah-paige avatar Jul 01 '24 18:07 noah-paige