
[SIP-182] Semantic Layer Support in Apache Superset

Open betodealmeida opened this issue 6 months ago • 7 comments

SIP: Semantic Layer Support in Apache Superset

Abstract

This proposal introduces changes to Apache Superset to better support semantic layers and external data modeling systems. The changes include (1) the definition of an “Explorable” interface (a Python protocol), and (2) the introduction of a new class of connections for semantic layers and similar systems.

Motivation

Semantic layers are a powerful way to expose well-curated metrics and their related dimensions to users, allowing for an improved user experience when presenting data: instead of focusing on datasets, semantic layers generally operate on a higher level of abstraction, exposing curated metrics as first-class citizens. Once a user has selected one or more metrics that they're interested in, most modern semantic layers allow them to slice and dice those metrics by presenting associated dimensions, automatically performing joins between the underlying data sources. This workflow, where metrics are curated and their dimensions are freely available, allows users to focus on the metrics that matter to them, while providing confidence that the underlying data is correct and well defined.

Because Superset is fundamentally dataset-centric, integrations with semantic layers have been timid so far. When they exist, they usually represent the semantic layer as a pseudo-database, exposing one or more pseudo-datasets that represent models in the semantic layer. Cube, for example, uses the Postgres wire protocol to expose cubes as pseudo-tables that can be queried in Superset. Minerva, an in-house semantic layer from Airbnb, uses a different approach, exposing all metrics and dimensions as a single dataset, with custom overrides to indicate to users which metrics and dimensions are compatible. Other experimental integrations (with MetricFlow, Snowflake, and DataJunction) used a similar approach, though they haven't been contributed to OSS yet.

There are a couple of limitations in Superset's architecture that create friction when integrating with semantic layers and modern data platforms:

Datasets are the Superset Semantic Layer

In order to explore data in Superset we need some kind of semantic layer that informs us which columns are available, which ones can be filtered, which ones are groupable, etc. In Superset, that semantic layer is the dataset, a thin abstraction that allows users to define metrics, declare derived columns, and add metadata that informs the UI and prevents expensive queries from running (group bys on high cardinality columns, for example).

Because the dataset is the native semantic layer, adding an external source from a semantic layer as a dataset to Superset is unlikely to work as expected, since we’re adding a second semantic layer on top of the first one. For example, the external semantic layer might not allow for adhoc metrics or computed columns, making it incompatible with the Superset dataset editor, as well as making the experience in Explore unintuitive and error-prone.

For these “semantic datasets” (or “semantic models”) we likely want to disable the Superset semantic layer, since the metadata is already defined externally:

  1. Users shouldn't be able to add metrics or calculated columns when editing them, unless the semantic layer supports adhoc expressions.
  2. Similar to the existing "Sync columns from source" button, there should be a "Sync metrics from source" button to fetch new or updated metrics from the semantic layer.
  3. When explored, users should not be allowed to add adhoc metrics, derived columns, or use "Custom SQL", unless the semantic layer supports adhoc expressions.
  4. When explored, not all metrics and dimensions may be compatible, requiring a mechanism for disabling metrics/columns as the user selects them, similar to the in-house approach used by Minerva.

This suggests that these semantic models should not be represented as standard datasets in Superset, given that in order to make it work we need to remove all the value that datasets provide — the semantics.
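
One way to express the rules above is a set of capability flags on the external source that the dataset editor and Explore consult before enabling a control. This is only a sketch of the idea; the class and flag names (e.g. supports_adhoc_expressions) are hypothetical and not part of the SIP:

```python
from dataclasses import dataclass


@dataclass
class SemanticModelCapabilities:
    """Hypothetical capability flags reported by an external semantic model."""

    supports_adhoc_expressions: bool = False
    supports_metric_sync: bool = True


def allowed_explore_features(caps: SemanticModelCapabilities) -> dict[str, bool]:
    """Derive which editing/Explore controls should be enabled for the model."""
    return {
        "adhoc_metrics": caps.supports_adhoc_expressions,
        "calculated_columns": caps.supports_adhoc_expressions,
        "custom_sql": caps.supports_adhoc_expressions,
        "sync_metrics_from_source": caps.supports_metric_sync,
    }
```

A layer that does allow adhoc expressions would simply report supports_adhoc_expressions=True and keep the full Explore experience.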

Query generation

The current flow for chart creation is tightly coupled not only with SQL but with SQLAlchemy. The frontend sends a request to the backend, via the QueryObject, indicating which columns, metrics, filters, and other parameters the user has selected. The backend then generates a SQLAlchemy query from this payload. The query generation is highly dependent on the specific database engine, so the get_sqla_query method inspects several attributes of the DB engine spec. Once this query is generated, it is transpiled to the target dialect using SQLAlchemy, and passed to the DB engine spec for execution.

This approach works reasonably well for traditional databases, but it creates friction when integrating with semantic layers that do not use SQL or SQLAlchemy. For example, if a semantic layer uses GraphQL or REST APIs, the current flow requires parsing the generated SQL and building a custom request. This is the case for the experimental MetricFlow DB engine spec, which has this flow:

Explore controls
       ↓
  QueryObject
       ↓
SQLAlchemy query
       ↓
   pseudo-SQL
       ↓
Shillelagh handler
       ↓
 GraphQL request

For the new Snowflake semantic layer, even though it exposes a SQL interface to semantic views, the flow looks like this:

Explore controls
       ↓
  QueryObject
       ↓
SQLAlchemy query
       ↓
   pseudo-SQL
       ↓
  sqlglot AST
       ↓
   actual SQL
       ↓
    execute

The SQL parsing step is necessary for Snowflake because Explore builds the query on top of either a table or a subquery, but in Snowflake it should be a UDTF (user defined table function):

-- SQL generated by Explore
SELECT "Item.Brand", "Store.State", "StoreSales.TotalSalesQuantity"
FROM pseudo_table
GROUP BY "Item.Brand", "Store.State";

-- final SQL after parsing and manipulating the AST
SELECT * FROM SEMANTIC_VIEW(
    TPCDS_SEMANTIC_VIEW_SM
    DIMENSIONS Item.Brand, Store.State
    METRICS StoreSales.TotalSalesQuantity
);

This is not only inefficient and brittle but also limits the flexibility of how queries can be executed against different data sources.
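
If query generation were decoupled from SQLAlchemy, the SEMANTIC_VIEW statement could be rendered directly from the selected metrics and dimensions instead of being recovered from pseudo-SQL. A rough sketch of that idea, with an illustrative function name and signature (not actual Superset code):

```python
def build_semantic_view_sql(
    view: str,
    dimensions: list[str],
    metrics: list[str],
) -> str:
    """Render a Snowflake SEMANTIC_VIEW query straight from the query payload,
    skipping the SQL-parsing round trip described above."""
    clauses = [view]
    if dimensions:
        clauses.append(f"DIMENSIONS {', '.join(dimensions)}")
    if metrics:
        clauses.append(f"METRICS {', '.join(metrics)}")
    body = "\n    ".join(clauses)
    return f"SELECT * FROM SEMANTIC_VIEW(\n    {body}\n);"


sql = build_semantic_view_sql(
    "TPCDS_SEMANTIC_VIEW_SM",
    ["Item.Brand", "Store.State"],
    ["StoreSales.TotalSalesQuantity"],
)
```

This produces the same final SQL shown above without ever materializing the pseudo-table query.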

Proposed Change

In order to properly support semantic layers we need to move away from the current solutions based on pseudo-databases and custom DB engine specs. Instead, we should implement first class support for semantic layers, bypassing the need for using a dataset when exploring data. This will allow users to choose between the semantics provided by Superset datasets, or an external system.

The first change will be the introduction of an Explorable interface, defined as a Python protocol. This interface is in some ways similar to the existing ExploreMixin, which was added when we decided to support Query objects in Explore. The problem with the current ExploreMixin is that it's too tightly coupled with SQLAlchemy and datasets, and has no clear separation between the functionality needed for data exploration vs. query generation.

The new Explorable, on the other hand, is concerned only with chart building:

from datetime import datetime
from typing import Any, Hashable, Protocol, runtime_checkable

# QueryObject, QueryResult, QueryObjectDict, ExplorableData, and
# TimeGrainDict are existing Superset types.


@runtime_checkable
class Explorable(Protocol):

    # =========================================================================
    # Core Query Interface
    # =========================================================================

    def get_query_result(self, query_object: QueryObject) -> QueryResult: ...

    def get_query_str(self, query_obj: QueryObjectDict) -> str: ...

    # =========================================================================
    # Identity & Metadata
    # =========================================================================

    @property
    def uid(self) -> str: ...

    @property
    def type(self) -> str: ...

    @property
    def metrics(self) -> list[Any]: ...

    @property
    def columns(self) -> list[Any]: ...

    @property
    def column_names(self) -> list[str]: ...

    @property
    def data(self) -> ExplorableData: ...

    # =========================================================================
    # Caching
    # =========================================================================

    @property
    def cache_timeout(self) -> int | None: ...

    @property
    def changed_on(self) -> datetime | None: ...

    def get_extra_cache_keys(self, query_obj: QueryObjectDict) -> list[Hashable]: ...

    # =========================================================================
    # Security
    # =========================================================================

    @property
    def perm(self) -> str: ...

    # =========================================================================
    # Time/Date Handling
    # =========================================================================

    @property
    def offset(self) -> int: ...

    # =========================================================================
    # Time Granularity
    # =========================================================================

    def get_time_grains(self) -> list[TimeGrainDict]: ...

    # =========================================================================
    # Drilling
    # =========================================================================

    def has_drill_by_columns(self, column_names: list[str]) -> bool: ...

    # =========================================================================
    # Optional Properties
    # =========================================================================

    @property
    def is_rls_supported(self) -> bool: ...

    @property
    def query_language(self) -> str | None: ...

Note the get_query_result method, which essentially returns a Pandas DataFrame from a QueryObject. This allows exploring data from sources that are not SQL based, as well as decoupling query generation from SQLAlchemy. This gives a much simpler flow:

Explore controls
       ↓
  QueryObject
       ↓
  Explorable
       ↓
   DataFrame
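
As a toy illustration of that flow, here is a minimal in-memory source that answers a simplified query payload with a pandas DataFrame. The class and the dict-shaped payload are hypothetical stand-ins for an Explorable implementation and a QueryObject, not actual Superset code:

```python
import pandas as pd


class InMemoryExplorable:
    """Toy Explorable-style source: serves a simplified query payload
    (dimensions + metrics) from an in-memory DataFrame instead of SQL."""

    def __init__(self, df: pd.DataFrame) -> None:
        self.df = df

    def get_query_result(self, query_object: dict) -> pd.DataFrame:
        dims = query_object["dimensions"]
        metrics = query_object["metrics"]
        # Group by the requested dimensions and aggregate the requested metrics.
        return self.df.groupby(dims, as_index=False)[metrics].sum()

    def get_query_str(self, query_object: dict) -> str:
        return f"groupby({query_object['dimensions']}).sum({query_object['metrics']})"


source = InMemoryExplorable(pd.DataFrame({
    "state": ["CA", "CA", "NY"],
    "sales": [10, 5, 7],
}))
result = source.get_query_result({"dimensions": ["state"], "metrics": ["sales"]})
```

A GraphQL- or REST-backed semantic layer would implement the same two methods, building an API request instead of a groupby.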

In addition to the Explorable we would also add models for a new class of connections for semantic layers. This would be similar to existing database connections, but with a few key differences:

  1. There will be no concept of a SQLAlchemy URI. The connection info should have a well defined schema, which could vary wildly between different semantic layers. For example, Malloy could point to a GitHub repository and a database; Snowflake would require parameters to build a SQLAlchemy engine, just like today; and MetricFlow would require an API key and an optional customer URL. This is similar to how some DB engine specs today use the BasicParametersMixin for a better experience when defining the connection parameters.
  2. The association between a given instance of a semantic layer and its implementation will be explicit. Today, there’s only an implicit mapping between a Database instance and the engine spec that should be used with it; we infer the DB engine spec based on the scheme of the SQLAlchemy URI, which has caused problems in the past, especially because early implementations used only the backend name, and not the driver.

An important note is that this change would not be very different from the initial versions of Superset, where we had different connectors for Druid (via REST and JSON, before its SQL API was introduced) and SQLAlchemy. Having different connectors offered some challenges, mostly when defining adhoc metrics (since for Druid the user would have to provide JSON). I hope that we can avoid these problems by ensuring a consistent interface in Explore that works across all types of Explorable, while still allowing for flexibility in how queries are executed.

Finally, we would also need a lightweight model for explorables, containing metadata about them: UUID, name, parent connection, timeout, default metric when exploring.

The introduction of semantic layers should open up interesting new workflows. A deployment of Superset could have a traditional connection to Snowflake, allowing power users to run SQL directly, as well as to define semantic views in SQL Lab. These semantic views could then be exposed to other users via the semantic layer connection, providing a curated collection of metrics and dimensions that non-power users would have access to. Both connections to Snowflake, via DB engine spec and via semantic layer, would have their purposes and target audiences.

Implementation Plan

Phase 0: implement the Explorable protocol

  • This will be done regardless of this SIP, since it adds value to Superset independently of semantic layers.

Phase 1: introduce semantic layers

  • Add feature flag for enabling semantic layer support.
  • Add new models.
  • CRUDIE (create, read, update, delete, import, export) for semantic layers.
  • DAR for semantic layers and semantic models.
  • Allow users to explore semantic models.

Phase 2: UI/UX Updates

  • Update Explore to support semantic models:
    • No adhoc metrics, derived columns, or custom SQL when not supported.
    • Implement reactive UI for metric/dimension compatibility matrix.

All of these phases require considerable work, and should be done with constant feedback from the community when it comes to terminology, UI, and UX.

Alternative Approaches Considered

We considered using the plugin architecture for semantic layers. While potentially valuable, the proposed approach provides the necessary flexibility without the complexity of a full plugin system, especially when taking into consideration the tight coupling between Superset and SQL/SQLAlchemy.

betodealmeida avatar Sep 03 '25 19:09 betodealmeida

Thank you for this SIP @betodealmeida .

We considered using the plugin architecture for semantic layers. While potentially valuable, the proposed approach provides the necessary flexibility without the complexity of a full plugin system, especially when taking into consideration the tight coupling between Superset and SQL/SQLAlchemy.

One thing we recently noticed is that solutions tend to vary significantly when you consider extensions up front vs introducing them later. This was the case with Client/Server MCP tools and more recently with @villebro's work on the Global Async Task Framework.

Would you be willing to join our Extensions meeting on Thursdays so we can discuss this in more detail? It's a public meeting available on Superset's calendar.

michael-s-molina avatar Jan 06 '26 16:01 michael-s-molina

Yeah, I'll be there!

betodealmeida avatar Jan 06 '26 18:01 betodealmeida

👋 Hello!

I work on Minerva, Airbnb's in-house semantic layer.

IIUC this proposal would make it easier to use Superset as a thin, read-only layer on top of a "headless" semantic layer like Minerva, Cube, etc. I think this approach is very practical given that Superset is "dataset-centric".

We've been using Superset like this for many years. It works. But it has its limitations - mainly, analytics development lifecycle velocity.

As an example, let's look at the workflow you mention above:

  1. Power user explores raw data with SQL
  2. Power user defines the semantic layer in an external system (e.g. semantic view DDL for snowflake, or git-based config files in Minerva, Cube, MetricFlow, etc).
  3. Power user registers the semantic layer in Superset
  4. Non-power users build charts on top of the semantic layer

However, there are some complications, including:

  • Slow iteration - The semantic layer might have concepts that can't easily be tested with raw SQL in step 1, so simply developing a metric spans steps 1-4, which involves multiple different systems.
  • Fragile version control - If a particular explorable/metric/etc is widely used, you have to be careful when changing its definition. Without first-class version control in Superset, it can be very hard to test how a change might affect existing charts and dashboards.

So you end up with a heavyweight, waterfall-style lifecycle that necessitates power users. Many non-power users end up bypassing the semantic layer, modeling data directly in Superset's thin semantic layer. This works well for isolated, quick analyses but lacks many of the benefits of a more powerful semantic layer (collaborative data modeling, preaggregation, integration with other applications, etc.).

So I love the direction of this proposal, and I think it's worthwhile, but I think it's still fundamentally limited compared to tools like Looker and Tableau, which have more sophisticated semantic layers.

I'm new to the Superset OSS community, and this might be tangential to other goals, overly-ambitious, or already discussed, but instead of keeping Superset's semantic layer thin, maybe we should be making it fatter? My dream is building Minerva directly into Superset...

barakalon avatar Jan 06 '26 19:01 barakalon

@barakalon thanks for your comment!

I do think we should standardize the Superset semantic layer and make it more powerful. Here's what I've been thinking for the roadmap (this is just my personal opinion, and would need discussion with the community):

Phase 1 (this SIP). We add support for external semantic layers

  • In this stage we have two paths for chart building. Explore generates a QueryObject, which gets passed to either a SqlaTable (dataset), a Query, or a SemanticView.
  • The QueryObject is really complex today. It's built from 22 arguments + kwargs, and some of those arguments are dictionaries with multiple keys. I didn't want to bring this complexity to semantic layers — I want it to be easy to add new semantic layers — so I built an interface that converts the QueryObject to an isomorphic SemanticRequest object with a limited number of fields: metrics, dimensions, filters, order, limit, offset.

Phase 2 (we probably want this). We add write support for external semantic layers

  • Malloy has some cool demos exploring this — and they don't even use SQL! This could increase the iteration speed. A user in Explore would be able to define a metric, chart it, and save it back to the semantic layer.

Phase 3 (this is just in my head right now). We formalize the Superset semantic layer

  • We have the simplified SemanticRequest, so why not use it for Superset as well? Given a SqlaTable and a SemanticRequest we can easily build the SQL necessary to fetch data, and get rid of the current mess we have (get_sqla_query, which generates SQL from the QueryObject, is a function with 700+ lines of code, most of it not covered by tests).
  • We would implement a new path for charts: users would still create Superset datasets and explore them. The changes would be just in how the SQL is generated and fetched. There would be the current path (complex, via get_sqla_query) and the new path (simpler, using the SemanticRequest).

Phase 4. We get rid of the old path and QueryObject

  • Here we get rid of the old path. Charts would generate the much simpler SemanticRequest object, and would pass it directly to the semantic views. Code would be smaller and easier to understand.
  • More importantly, the Superset semantic layer would be just one next to others (Minerva, DJ, Snowflake, Cube, Malloy, OSI). It would have clear interfaces, and it would be easier to extend: we could add a way of defining relationships with dimensional tables, for example.
  • It could also lead to Superset having more than one native semantic layer. Instead of replacing the current semantic layer with Minerva, like you want, you could add Minerva and have it side-by-side with the original one.

betodealmeida avatar Jan 06 '26 21:01 betodealmeida

I like the idea of generalizing from SQLAlchemy to something more generic. I feel SQL, and in particular SQLAlchemy, has given us a lot over the years, and Superset wouldn't have become as successful as it has without it. However, I feel with a little architectural wizardry, we should be able to open up Superset to other sorts of integrations, like

  • Non-SQL based datasources, like Prometheus
  • Non-tabular datasources, like graph databases, REST APIs

The long-standing request to support PromQL and other TSDBs (the discussion is still accumulating +1s) is a good indication that there is demand for Superset outside pure SQL. Making this generic isn't necessarily easy, but at least considering these will help us isolate the boundaries we want to work with in the future. For instance, if we do decide to only support tabular datasources, then that should dictate the structure of the Explorable and ExplorableData interfaces. But if we want to go more generic, we may need deeper abstractions, like TabularExplorableData which in turn extends ExplorableData. The same would then need to apply to charts, controls etc, as SQL controls will likely behave very differently from non-tabular datasource controls.

villebro avatar Jan 06 '26 22:01 villebro

@villebro @michael-s-molina

Here's the schema I'm using for a semantic query (the branch has code to map a QueryObject to the SemanticQuery):

https://github.com/apache/superset/blob/e253bd2fb3f8d2dbb7398b955d71e3ca11ecfe83/superset/semantic_layers/types.py#L321-L333

And this is the core of the protocol that individual semantic layers would have to implement (Minerva, eg):

https://github.com/apache/superset/blob/e253bd2fb3f8d2dbb7398b955d71e3ca11ecfe83/superset/semantic_layers/types.py#L437-L497

betodealmeida avatar Jan 08 '26 18:01 betodealmeida

This is the initial work; in this case the signatures for get_dataframe and get_row_count could be simplified to:

def get_dataframe(query: SemanticQuery) -> SemanticResult:
    pass

def get_row_count(query: SemanticQuery) -> SemanticResult:
    pass

betodealmeida avatar Jan 08 '26 18:01 betodealmeida

Thanks for the references @betodealmeida. Could you update the SIP description with the results of our discussion? Especially the Rejected Alternatives section.

michael-s-molina avatar Jan 09 '26 17:01 michael-s-molina

@michael-s-molina edited to add:

In this work, new semantic layers will be added as extensions, following the design from https://github.com/apache/superset/issues/31932.

betodealmeida avatar Feb 04 '26 19:02 betodealmeida

More importantly, the Superset semantic layer would be just one next to others (Minerva, DJ, Snowflake, Cube, Malloy, OSI). It would have clear interfaces, and it would be easier to extend...

To me, this is one of the "juicy bits" of the proposal, and something we should be doing as part of the Extensions effort in general. It's easy to fall for the temptation of adding Extensions alongside (or overriding) the existing Superset functionality as an alternative. But migrating the existing Superset feature to be a built-in Extension serves a few key goals:

  1. Defines the interface in a way that for sure supports Superset's current needs and functionality
  2. Makes sure that our tests and day-to-day users are effectively hardening that interface
  3. Shrinks the "core" codebase, making sure that the feature can be maintained/improved/released independently, reducing maintenance burden on the core codebase.

Over time, I would expect many of our core competencies will become built-in extensions, and a world of alternative options will proliferate around them. Then Superset becomes a Data Access Platform limited only by its API / MCP / Components more so than the built-in features that come with it in the box.

rusackas avatar Feb 04 '26 20:02 rusackas

Thanks for the great discussion in the Extensions meeting @betodealmeida! I'm pasting here what we aligned to make semantic layers extendable:

Semantic Layers: Extension System Integration

Goal

Transition from Python entry points to Superset's extension system with extension.json metadata, enabling:

  • Declarative contribution registration
  • Lazy loading / activation events
  • Marketplace discovery without code execution

Architecture Changes

1. Move Interfaces to superset-core

Protocols and types move to superset-core (the shared pip package):

superset-core/
└── semantic_layers/
     ├── types.py            # Dimension, Metric, Filter, SemanticResult, etc.
     ├── semantic_view.py
     └── semantic_layer.py

Implementations remain in superset or extensions.

2. Add Backend Contribution Types

Current state: Frontend has structured contributions (views, commands, menus, editors). The backend only has entry points.

Proposed: Add contributions to the backend. This would also formalize existing superset-core APIs (REST APIs, MCP tools) that are registered via Python calls.

// extension.json
{
  "backend": {
    "contributions": {
      "semanticLayers": [
        {
          "id": "snowflake",
          "name": "Snowflake Semantic Layer",
          "description": "Connect to Snowflake's semantic layer",
          "module": "my_extension.snowflake.SnowflakeSemanticLayer"
        }
      ],
      "restApis": [
        {
          "id": "my_api",
          "name": "My Extension API",
          "module": "my_extension.api.MyExtensionAPI"
        }
      ],
      "mcpTools": [
        {
          "id": "query_database",
          "name": "Query Database",
          "description": "Execute a SQL query against a database",
          "module": "my_extension.mcp.QueryDatabaseTool"
        }
      ],
      "mcpPrompts": [
        {
          "id": "analyze_data",
          "name": "Analyze Data",
          "description": "Generate analysis for a dataset",
          "module": "my_extension.mcp.AnalyzeDataPrompt"
        }
      ]
    }
  }
}

Benefits of declarative metadata:

  • Activation events: Like VSCode - load extensions lazily when their contribution type is needed
  • Marketplace/Discovery: Display available semantic layers without importing code
  • Dependency resolution: Understand what extensions provide before loading
  • Security review: Admins can review contributions before enabling

This pattern could also be applied to other existing Superset features like database engines, auth providers, cache backends, and alert handlers - allowing them to be provided as extensions with the same benefits.

3. Replace Registry with Extension Manager

Current: Standalone registry.py with register_semantic_layer() / get_semantic_layer().

Proposed: Use extension manager as the registry:

# Instead of:
from superset.semantic_layers.registry import get_semantic_layer
layer_cls = get_semantic_layer("snowflake")

# Use:
from superset.extensions import extension_manager
layer_cls = extension_manager.get_contribution("semanticLayers", "snowflake")

This mirrors the frontend pattern:

// Frontend (TypeScript)
import { extensionManager } from "@superset-ui/core";
const editor = extensionManager.getContribution("editors", "monaco_sql");

Benefits:

  • Consistency between frontend and backend
  • Single source of truth for all contributions
  • Lazy loading - defer until contribution is requested
  • Metadata-first - query available contributions without loading code

4. Mapper Location

The mapper stays in superset (not superset-core) since it depends on:

  • Superset's QueryObject
  • BaseDatasource and other internal types
  • Superset-specific query handling (time comparisons, series limits)

Open Questions

1. Interface Pattern

Option A: Protocol pattern (current branch approach)

Uses Python's Protocol from typing module with @runtime_checkable decorator.

Option B: Stub replacement pattern (consistent with superset-core)

Uses concrete base class with methods that raise NotImplementedError.

We decided for Option B.
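
A minimal sketch of the Option B pattern, assuming hypothetical class and method names (the shared package ships a concrete base class whose methods raise until an implementation replaces them):

```python
class SemanticLayerBase:
    """Stub base class shipped in the shared package; the host application
    or an extension supplies the real implementation."""

    def get_metrics(self) -> list[str]:
        raise NotImplementedError("Provided by the host application or extension")

    def get_dimensions(self) -> list[str]:
        raise NotImplementedError("Provided by the host application or extension")


class SnowflakeSemanticLayer(SemanticLayerBase):
    """Example concrete implementation living in an extension."""

    def get_metrics(self) -> list[str]:
        return ["StoreSales.TotalSalesQuantity"]

    def get_dimensions(self) -> list[str]:
        return ["Item.Brand", "Store.State"]
```

Unlike a Protocol, the base class participates in normal inheritance, so it fits the stub-replacement approach already used by superset-core.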

Gap Analysis: Dimension/Metric Compatibility

Bidirectional Compatibility Filtering

Some semantic layers have constraints where not all metric/dimension combinations are valid. The available metrics may depend on which dimensions are selected, and vice versa. This requires the ability to filter dimensions and metrics based on each other.

Use cases:

  • Metrics tied to specific dimension sets - selecting a metric limits available dimensions
  • Dimensions tied to specific data sources - selecting dimensions limits available metrics
  • The UI should dynamically filter available options as users make selections

Current API Gap

The proposed SemanticViewImplementation assumes dimensions and metrics are globally available:

# All dimensions (static)
get_dimensions()

# All metrics (static)
get_metrics()

This doesn't support semantic layers where compatibility depends on what's already selected.

Proposed API Extension

Add optional methods for compatibility filtering:

# Returns metrics compatible with the selected dimensions
get_compatible_metrics(selected_dimensions)

# Returns dimensions compatible with the selected metrics
get_compatible_dimensions(selected_metrics, selected_dimensions)

These would be optional - semantic layers without this constraint would return all metrics/dimensions. The frontend would call these methods as users make selections to filter the available options.
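
A simplified sketch of how a layer with a static compatibility map could implement these methods (the map and names are illustrative, and get_compatible_dimensions drops the selected_dimensions argument for brevity; layers without constraints would just return everything):

```python
# Hypothetical static compatibility map: metric -> dimensions it supports.
COMPATIBILITY: dict[str, set[str]] = {
    "total_bookings": {"market", "date"},
    "avg_latency_ms": {"service", "date"},
}


def get_compatible_dimensions(selected_metrics: list[str]) -> set[str]:
    """Dimensions valid for every selected metric (intersection)."""
    if not selected_metrics:
        return set().union(*COMPATIBILITY.values())
    return set.intersection(*(COMPATIBILITY[m] for m in selected_metrics))


def get_compatible_metrics(selected_dimensions: list[str]) -> set[str]:
    """Metrics whose dimension set covers all selected dimensions."""
    wanted = set(selected_dimensions)
    return {m for m, dims in COMPATIBILITY.items() if wanted <= dims}
```

Selecting both metrics here leaves only "date" as a usable dimension, which is exactly the reactive narrowing the UI would surface.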

Semantic Layers With This Requirement

  • dbt Semantic Layer (MetricFlow): The GraphQL API provides dimensionsPaginated(metrics: [MetricInput!]!) which returns only dimensions compatible with selected metrics. This exists because metrics can span multiple semantic models.
  • Minerva (Airbnb): Has validation endpoints (/minerva/valid_metrics, /minerva/valid_columns) that implement this filtering. Metrics are tied to event sources and dimension sets, so selecting certain metrics excludes certain dimensions and vice versa.
  • Cube.js: Handles compatibility structurally: all dimensions within a cube are compatible with all measures in that cube. No explicit filtering API needed.

michael-s-molina avatar Feb 05 '26 19:02 michael-s-molina

Adding a note that SIP-199 is proposing related work that aligns with the proposed SemanticQuery model. CC: @shivamgoel

villebro avatar Feb 05 '26 19:02 villebro

Hi @betodealmeida and the Superset community 👋

We wanted to share some context from a fork we've been developing in case it's useful to anyone looking for DataFrame, MCP (Model Context Protocol), or AI-agent-driven chart/dashboard generation capabilities today while SIP-182 matures.

The fork: https://github.com/PromptExecution/superset-datafusion-mcp

Why the fork exists

@PromptExecution is a consulting org working with a client that operates a data platform currently in testing, expected to reach ~1,000 daily users within the next few months. Many of those users are Python-proficient analysts who needed capabilities that go beyond what standard BI tools offer today:

  • In-session DataFrames as chart sources — ingest an Arrow/Parquet table via an AI agent, immediately generate a Superset chart against it, no database required
  • MCP tool surface — expose chart creation, dashboard assembly, and DataFrame querying as first-class tools that LLM agents can call
  • "Better than Grafana" diagram and dashboard generation — including Mermaid diagram output and composite dashboard assembly from agent conversations

The delivery timeline made a clean upstream contribution path impractical for this cycle. Rather than wait, we made a hard fork to ship the MCP service layer on top of Superset's existing chart infrastructure.

What we built (relevant to SIP-182)

The fork adds a VirtualDatasetRegistry backed by Apache Arrow (in-memory tables, TTL-scoped, session-isolated) and Apache DataFusion / DuckDB for query execution. We think this is the natural internal engine choice for Apache Superset — Arrow and DataFusion are both Apache-family projects with strong columnar performance characteristics, and Arrow in particular is already the lingua franca for DataFrame interchange across the Python ecosystem.

An AI agent can:

  1. Ingest a DataFrame → register as a virtual dataset (Arrow table in memory)
  2. Call generate_chart(dataset_id="virtual:{uuid}", config={...}) → DataFusion/DuckDB executes the query → Superset renders the chart
  3. Query the virtual dataset with arbitrary SQL via the MCP tool surface

The bridge between virtual datasets and chart rendering lives entirely outside Superset's get_sqla_query() path, which means it is structurally aligned with the decoupling SIP-182 proposes — the Explorable protocol would give our bridge a proper first-class home.

How we're planning to harmonize

This fork is also serving as a live test of gh-aw (GitHub Copilot Agent Workflows) for CI/CD automation. We've wired up a breaking-change checker agent that watches specifically for SIP-182 milestones:

  • Explorable protocol introduction (Phase 0 / PR #36245)
  • form_data key renames (Phases 2/3) — our bridge centralises all form_data reads into accessor functions so they're a single-file update
  • get_sqla_query() removal (Phase 4) — low direct risk since we already bypass it, but we'll do a full audit when it lands

When Phase 0 merges, our plan is to implement Explorable for the VirtualDatasetRegistry so virtual datasets work natively through Superset's chart pipeline. At that point we'd love to discuss upstreaming the registry, the MCP tool surface, and potentially the Prometheus query tool (which has no upstream equivalent proposed yet).

Cherry-pick contributions

In the meantime we're tracking upstream closely and tagging anything that looks like a clean upstream contribution candidate. If any of the patterns we've built — session-scoped in-memory datasets, TTL lifecycle management, Arrow-native query results, or the MCP agentic tool layer — would be useful reference material as Phases 1–3 land, we're happy to share specifics or open draft PRs for discussion.

Thanks for the thoughtful design work here — SIP-182 is exactly the right abstraction boundary and we're genuinely excited to see it mature.

@PromptExecution

elasticdotventures avatar Feb 22 '26 06:02 elasticdotventures

@betodealmeida feel free to close the vote, this one has plenty of support!

rusackas avatar Feb 25 '26 17:02 rusackas