dataall
dataall copied to clipboard
Use postgreSQL Full Text Search instead of ElasticSearch
Is your feature request related to a problem? Please describe. Use and maintain Opensearch cluster is hard and costly to maintain (the AWS serverless version cost 700.80$/month).
Describe the solution you'd like Use Full text search in postgresql
https://www.postgresql.org/docs/current/textsearch.html For example : https://amitosh.medium.com/full-text-search-fts-with-postgresql-and-sqlalchemy-edc436330a0c https://www.crunchydata.com/blog/postgres-full-text-search-a-search-engine-in-a-database
Additional context I can help to add this functionality, I'm very interested by the product but Opensearch/ES is clearly a blocking point.
P.S. Please Don't attach files. Add code snippets directly in the message body instead.
Curious if the Opensearch serverless version that was announced a re:invent would be more cost effective if that is the blocking point. what maintenance activities make it difficult to maintain?
Sadly, the serverless version is not cost effective. The maintenance activities makes opensearch hard to maintain : the reindex, the mapping change, upgrading cluster... I know AWS had done a great jobs to make opensearch better but for little organisation, it is hard to operate.
Hi @blackrez, I agree that OpenSearch indexes and mappings are hard to update in OpenSearch. Also, OpenSearch is the service that takes the bigger portion of data.all costs
You are more than welcome to work on implementing the functionality :) Fork the code and open a PR when the code is ready. Keep in mind that to make it mergeable, you will need to make the "underlying catalog database" configurable, something like adding a parameter in the cdk.json that defines whether we use RDS or Opensearch.
Feel free to add here your questions, comments or designs and we can discuss them
Hi @blackrez we are exploring the possibility to implement OpenSearch serverless into data.all (@kukushking). For the case of data.all, where we hardly ever update mappings or have to re-index, do you see any limitations into using OpenSearch serverless as a cost-effectiveness solution? Could you give more detail about other alternatives? Your opinion is highly appreciated :)
Hello @dlpzx,
From my point of view, OpenSearch Serverless is not at all cost-effective.
You can configure a maximum number of OCUs for your account in order to control costs. You're billed for a minimum of 4 OCUs allocated for your workloads when you create a collection.
It will cost 0,24$4OCU730h = 700$
My main alternative is the full text search from PostGreSQL. The creation of index will be simpler (https://www.postgresql.org/docs/current/textsearch-tables.html#TEXTSEARCH-TABLES-INDEX). I don't see any technical limitations to achieve this. Most text functions are present in and it works out of the box.
I found a lot of code about the index mapping and creation but not a lot about the search itself.
@InasK6, I believe your customer conducted a cost analysis on this component of data.all. Would it be possible to share the results?
hello @dlpzx @blackrez
Context
Indeed, the main findings from my cost analysis are that
- OpenSearch drives the cost of the platform
- from the documentation of Opensearch serverless here, we can see that:
- When you create your first Opensearch serverless collection, 4OCUs are instantiated. All subsequent collections can share these OCUs. But one collection can, at most, scale down to 2 OCUs for indexing and 2 OCUs for search. architecture reference
- 1 OCU = 6GB + 1vCPU + data transfer to S3
- At the moment, the estimated cost of using these 4 OCUs is around 660$/Month
Solutions
I have considered two options to reduce these cost:
- Replacing the catalog search capability by RDS filters
- Consider a managed Opensearch cluster with smaller instances
Implementing RDS search capabilites
Advantages
- If we look at the filters currently performed from the catalog UI, we can see that they can be translated queries and do not necessitate heavy searches on indexed text
- We already have the data stored in RDS, no extra cost would be incurred for storage
- This would lead to a cost reduction of 100% by eliminating the Opensearch component
Challenges
- The catalog UI implementation is heavily linked to OpenSearch. It uses UI components with appbaseio/reactivesearch library which generate Elasyic search queries from frontend -> to API Gateway -> Search handler
Integrating RDS search capability means implementing the capability end to end from frontend to graphQL queries to RDS queries and integration with the api_handler. Therefore, extra developer time
- Some features were implemented on top of the current implementation of the Catalog using the current Catalog API calls structure. Therefore, that would mean, needing to adapt every extra feature implemented. Example: Any feature that have been developped modifying the frontend, happened based on the appbaseio/reactivesearch library components syntax
Managed OpenSearch
Advantages
-
Developer effort: You only need to switch to deploy data.all with Managed OpenSearch by setting up
enable_opensearch_serverless
flag incdk.json
file tofalse
+ change two lines of code to specify the instance type of your master nodes and worker nodes Note You cannot chose M3, R3, T2 types. You will get an error saying they do not support encryption of data at rest for your Managed OpenSearch cluster - Cost reduction: We have considered multiple configurations for our workload (460 KB of data stored in our cluster)
- 2 master 2 worker with T3.medium : $660/month to $145/month → 78% cost reduction
- 3 master 3 worker with T3.medium : $660/month to $218/month → 67% cost reduction
- 3 masters and 2 workers with T3.medium : $660/month to $194/month → 71% cost reduction
Challenges
- You might need to perform load testing to estimate the right instance size
- Operational Overhead: As you mentionned, since it is not serverless, you handle the reindex, the mapping change, upgrading cluster... It worth considering at what frequency do you need to do this
- Rollback: If you have already deployed your platform with OpenSearch Serverless, you will need extra effort to migrate back to Managed OpenSearch (ex: copying the data etc...)
Conclusion
Tradeoffs need be considered regarding this choice. If you want to go for the RDS implementation, That would need more developer effort. I have already started implementing db functions and GraphQL objects to test on one of the filter options of the Catalog. So, I could help starting on that. If you consider the Managed OpenSearch option, extra operational or testing overhead can be needed to estimate the right size for your need, but it only needs two lines of code changes
We will close this issue as the moment as we are not planning to implement it. If interest grows we can reopen it and work on it.