[SIP-166] Proposal for AI Assistant
Motivation
An accurate text-to-SQL translator (AI Assistant) can greatly enhance the SQLLab user experience by increasing productivity, supporting users with limited SQL knowledge, and making it easier to discover and access data in SQLLab.
Proposed Change
We propose implementing a text-to-SQL translator that is intentionally simple, avoiding RAG, vector databases, and agentic LLM frameworks. This approach maximizes compatibility across diverse database types and sizes, provides flexible configuration options, and leverages user-supplied context filtering when available. The system degrades gracefully when a database engine exposes only partial metadata, so it remains usable even when some functionality is unavailable.
We believe that by intentionally keeping this solution simple and avoiding complex dependencies, it will be easier for the community to reach consensus and approve its inclusion. This practical and accessible first implementation of the AI Assistant is designed to accelerate its adoption and help it materialize sooner as an official Superset OSS release.
The AI Assistant was developed in alignment with the guiding principles outlined above, within a dedicated fork of the Superset repository, based on the 5.0.0rc3 tag. For a comprehensive overview of its features and configuration, refer to the AI Assistant documentation.
New or Changed Public Interfaces
- React Components:
  - AI Assistant Editor: Introduced in SQLLab as a text input bar for interacting with the AI Assistant.
  - Table Selector: Enhanced to allow multi-selection of schemas.
  - SQL Editor: Updated to support schema multi-selection.
  - AI Assistant Options: Added as a tab in the Database modal for configuring AI Assistant settings per database.
  - Table View: Added a SQL comment icon next to each column name.
- REST Endpoints (a request sketch appears at the end of this section):
  - sqllab/generate_db_context: Initiates a rebuild of the database metadata LLM context.
  - sqllab/generate_sql: Sends user prompts to the LLM provider to generate SQL queries.
  - sqllab/db_context_status: Retrieves the status of the database metadata context and the context builder worker.
  - database/{db_id}/schema_tables: Returns all schemas and tables for a specified database.
- Dashboards or Visualizations: No changes.
- Superset CLI: No changes.
- Deployment: No changes.
To simplify the setup of a custom Docker Compose deployment (e.g. deploying this fork), we have provided a shell script and configuration files. Detailed instructions and resources can be found here.
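To make the proposed endpoints concrete, here is a minimal client-side sketch of calling generate_sql, assuming a local deployment and the standard Superset security API. The path prefix, payload fields, and response shape are illustrative assumptions; refer to the fork's AI Assistant documentation for the actual contract.

```python
# Hypothetical sketch of driving the proposed generate_sql endpoint.
# Payload fields and the response shape are assumptions for illustration;
# CSRF handling is omitted for brevity.
import requests

SUPERSET_URL = "http://localhost:8088"  # assumed local deployment
session = requests.Session()

# Authenticate via the standard Superset security API to get a bearer token.
token = session.post(
    f"{SUPERSET_URL}/api/v1/security/login",
    json={"username": "admin", "password": "admin", "provider": "db", "refresh": True},
).json()["access_token"]
session.headers["Authorization"] = f"Bearer {token}"

# Ask the AI Assistant to translate a natural-language prompt into SQL.
resp = session.post(
    f"{SUPERSET_URL}/api/v1/sqllab/generate_sql",  # path prefix is an assumption
    json={
        "database_id": 1,                  # target database connection
        "schemas": ["public"],             # user-supplied context filtering
        "prompt": "Total sales per region for 2024",
    },
)
print(resp.json())  # e.g. {"sql": "SELECT region, SUM(amount) ..."}
```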
New dependencies
The new dependencies introduced are primarily related to integration with supported LLM API providers and data structure validation for building the database metadata context JSON file:
- google-genai: Python SDK for Google Generative AI.
- openai: Python SDK for OpenAI models.
- anthropic: Python SDK for Anthropic models.
- pydantic: Used for robust data validation and serialization.
These dependencies are required to enable AI Assistant functionality and ensure reliable handling of LLM-related data.
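As a rough illustration of pydantic's role described above, a minimal sketch of a validated context schema could look like the following; all field names here are hypothetical, not the fork's actual models.

```python
# Minimal sketch of validating the database metadata context before it is
# serialized to JSON. Field names are illustrative assumptions.
from typing import Optional
from pydantic import BaseModel

class ColumnContext(BaseModel):
    name: str
    data_type: str
    comment: Optional[str] = None      # surfaced via the new SQL comment icon

class TableContext(BaseModel):
    schema_name: str
    table_name: str
    columns: list[ColumnContext]

class DatabaseContext(BaseModel):
    database_id: int
    tables: list[TableContext]

ctx = DatabaseContext(
    database_id=1,
    tables=[TableContext(
        schema_name="public",
        table_name="sales",
        columns=[ColumnContext(name="region", data_type="VARCHAR")],
    )],
)
json_blob = ctx.model_dump_json()  # pydantic v2 API; cached and fed into the LLM prompt
```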
Migration Plan and Compatibility
Since these are additive changes, migration should be straightforward.
Changes to metadata database tables:
- llm_connection: New table.
- llm_context_options: New table.
- context_builder_task: New table.
No breaking changes are expected, and existing deployments can be upgraded without data loss. Standard database migration procedures apply.
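For illustration, adding the new tables would follow the standard Alembic pattern Superset already uses for metadata migrations. Here is a minimal sketch for llm_connection, with columns guessed from the table's purpose rather than taken from the fork:

```python
# Illustrative Alembic migration for one of the proposed tables.
# Column names are assumptions based on the table's purpose (per-database
# LLM provider settings); the fork's real migration may differ.
import sqlalchemy as sa
from alembic import op

revision = "ai_assistant_0001"  # placeholder revision id
down_revision = None            # would chain to the current head in practice

def upgrade():
    op.create_table(
        "llm_connection",
        sa.Column("id", sa.Integer, primary_key=True),
        sa.Column("database_id", sa.Integer, sa.ForeignKey("dbs.id"), nullable=False),
        sa.Column("provider", sa.String(64), nullable=False),  # e.g. "openai", "anthropic"
        sa.Column("model_name", sa.String(256), nullable=False),
        sa.Column("api_key", sa.String(1024), nullable=True),  # typically stored encrypted
    )

def downgrade():
    op.drop_table("llm_connection")
```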
I think this is fantastic, but it has a direct correlation with how we plan to build extensions as part of SIP-151. We plan to build something like VS Code does (see docs) so that any/all extensions can interact natively with the host app (Superset)'s configured LLM(s).
All of this is still being sorted out, so I wouldn't recommend voting on this until it fits into that other plan.
@rusackas Thanks for the update.
Hi all, I see you've updated a lot. Can I now run and test with Docker Compose?
@bachtdx Yes, we added an example set of configuration files for Docker Compose deployment here. Be sure to modify env_variables.sh with:
export SUPERSET_GITHUB_URL=https://github.com/tenstorrent/apache-superset-tt/tree/awilliamsTT/ai-assistant
export TAG="awilliamsTT/ai-assistant"
Merged the newly released v5.0.0.
I'm admittedly over my skis a bit here, but it seems like MCP may be helpful in how this all works. Most DBs have or are in the process of releasing an MCP server (e.g. https://github.com/awslabs/mcp/tree/main/src/redshift-mcp-server for Redshift). Perhaps this could be architected such that Superset, via a variety of LLMs, talks to those MCP servers (maybe configure the MCP server at the DB connection level..?) to get all necessary metadata on the fly, avoiding the need for the Redis cache and also avoiding the LLM caching limitations...? Either way, super excited to see where this goes!
@jpdeloitte that's certainly a good idea as part of the architecture of a Superset AI agent, which isn't really what we are aiming to achieve with this proposal. Our core idea is to build a very simple text-to-SQL functionality with the objective of using it as soon as possible. We employed plenty of already available Superset functions to collect schema metadata. Those functions power the SQLLab database explorer in the left-side bar (schemas, tables, columns, data types, SQL comments, constraints, etc.). Our code simply packages this info in a JSON file that is then added to the context of the LLM prompt; the file is built on a schedule and cached for efficiency (a rough sketch of the idea follows this comment).
Integrating an MCP client will surely be a great follow-up proposal, and it will open the door to more than just text-to-SQL. For instance, you could envision a Superset chat with agent properties that you can talk to, whose answers include information retrieved from the connected databases.
It does seem like overkill, though, for simple schema-aware text-to-SQL, which doesn't include automated execution of the query or a sequence of follow-up LLM interactions to blend the query results into an LLM response or agent action.
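For readers skimming the thread, here is a rough sketch of the "package schema metadata into the prompt context" idea described above; the function and field names are hypothetical, not the fork's actual code.

```python
# Hypothetical sketch: prepend the cached database metadata context
# (built on a schedule by the context builder worker) to the user's prompt.
import json

def build_prompt(user_question: str, db_context_json: str) -> str:
    """Combine the cached metadata context with the user's request."""
    context = json.loads(db_context_json)
    return (
        "You are a text-to-SQL assistant. Use only these schemas/tables:\n"
        f"{json.dumps(context, indent=2)}\n\n"
        f"Write a SQL query answering: {user_question}"
    )
```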
@diegoscarabelli Hi Diego, thanks for your effort.
Super idea! 🚀 AI is becoming a must-have in analytics platforms, and bringing it into Superset would be a game-changer. I was about to propose something similar, so it's great to see this request already here. We could even speed things up by leveraging the open-source library Vanna for natural-language-to-SQL and AI-powered insights. Excited to see where this goes! I would like to contribute as well. @mistercrunch @diegoscarabelli @michael-s-molina @sadpandajoe
Curious if there's been any movement on this initiative?
The priority [at least for us at Preset] is shipping an MCP companion service as part of Superset, enabling people everywhere to connect their LLMs and agents to their Superset instances. See the SIP here: https://github.com/apache/superset/issues/33870, and ongoing active development here: https://github.com/apache/superset/pull/33976. The goal is to ultimately enable agents, through MCP, to perform pretty much any action or fetch context the same way a user can in Superset.
In terms of RAG-supporting infrastructure and more specific use cases, we're currently pointing towards the fast-evolving Superset extension framework, keeping agents out of core Superset and in extension(s) (see the SIP here: https://github.com/apache/superset/issues/31932). Note that these RAG use cases and extensions will be able to build on top of the MCP service as it launches.
Personally, I feel like design patterns around leveraging AI inside products are still largely settling. Constraints and restrictions on how to bring an LLM-in-the-loop into products remain varied and unsettled across organizations: policies around which models to bring and through which provider, and considerations around what agents should be able to see or act upon, are largely still semi-charted territory. So from my perspective it makes sense to let this mature within extensions, at least for a time.
There are no blockers to building things: anyone can ship an extension today. We're just keeping it out of the main repo for now, and outside the scope of direct maintenance for core maintainers. If you look at the VS Code ecosystem for inspiration, they have done a fantastic job of delivering amazing AI-powered extensions, and the fact that core VS Code is focused on core things enables a lot of varied options in the ecosystem: different preferences and approaches can cohabit without muddying or slowing down the core project. Separation of concerns is great here.
Loosely related, but here's a vision of how MCP services are coming together to serve cross-tool use cases: https://preset.io/blog/the-promise-of-mcp-powered-data-workflows/. This shares some of my personal views on how things may evolve: LLMs can achieve so much more when they are connected to a set of MCPs that align with your organization's tools and personal workflows (superset, airflow, dbt, datahub, cube.dev, snowflake, ...) and break the boundaries of context that exist when confined to a single app.