
Register Data Sources

riyavsinha opened this issue 1 year ago · 2 comments

Description

Right now, only global-scope dataframes are saved by marimo for reference. However, I'd love to be able to have dataframes saved to marimo without explicitly defining them as variables.

For example, when I run a LangChain agent, it fetches data and streams dataframes and text to the cell output (see the attached screenshot).

I'd like to register these dataframes in the global scope so that:

  1. Data an agent has already fetched is summarized in the Explore data sources panel
  2. Fetched data can be referenced in future AI cells

In my case, I'd like to be able to programmatically generate the variable symbol name.

Suggested solution

def run_agent(input):
    for event in agent.run(input):
        # if the event content is tabular
        df = pd.DataFrame(event.content)
        # <--- saves df to the variable "nearest_gene_tool_chr15_88569444_df"
        mo.editor.register_datasource(df, event.name + event.tool_args + "_df")
        mo.output.append(df)
        ...
    ...
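
For illustration, a later cell could then refer to the registered name like any other dataframe (hypothetical, since register_datasource does not exist; the variable name is the generated one from the comment above):

import marimo as mo

mo.sql("SELECT * FROM nearest_gene_tool_chr15_88569444_df LIMIT 5")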

Alternative

No response

Additional context

No response

riyavsinha · Nov 01 '24 17:11

Thinking out loud - I wonder if this should be explicit, like mo.editor.register_datasource, or just "happen" when outputting an unnamed (no variable) dataframe.

Right now we auto-discover dataframes/datasources (sketched after the list below):

  1. Variables declared
  2. In SQL for CREATE TABLE
  3. In SQL for ATTACH db
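
Roughly, those three discovery paths look like this in a notebook (a sketch; the table and file names are made up):

import marimo as mo
import pandas as pd

# 1. A dataframe assigned to a global variable is discovered automatically
sales_df = pd.DataFrame({"region": ["us", "eu"], "total": [10, 20]})

# 2. A SQL CREATE TABLE statement registers a table
mo.sql("CREATE OR REPLACE TABLE sales_copy AS SELECT * FROM sales_df")

# 3. ATTACH makes an external database's tables discoverable
mo.sql("ATTACH 'analytics.db' AS analytics")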

mscolnick · Nov 01 '24 17:11

I think having it just happen could be cool too, for sure, but for the purpose of referencing data in later cells, it could get confusing if a user has multiple data tables with the same schema (the result of fetching data from the same API multiple times with different params). So I'd still like the option to assign a meaningful name, ideally!

riyavsinha · Nov 01 '24 17:11

Hi! I created a draft PR for what the mo.editor.register_datasource method could look like.

I'm not sure if it's adding the variable in the proper way though.

I initially considered using an AST approach, like how visit_Call is already used for the SQL tables, but that prevents dynamic variable-name registration, such as mo.editor.register_datasource(df, construct_varname(api_call, api_call_params)).
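
To illustrate that limitation (my own minimal sketch, not code from the PR): an ast.NodeVisitor only sees the syntax of the name argument, never the string it evaluates to at runtime, so there is nothing static to register:

import ast

source = "mo.editor.register_datasource(df, construct_varname(api_call, api_call_params))"

class NameFinder(ast.NodeVisitor):
    def visit_Call(self, node: ast.Call) -> None:
        # The second argument is an ast.Call / ast.Name node, not a string,
        # so the eventual variable name is unknowable at parse time.
        print(ast.dump(node.args[1]))
        self.generic_visit(node)

NameFinder().visit(ast.parse(source))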

A specific example I tested with was:

my_module.py:

import marimo as mo
import pandas as pd

def register_dfs_in_background():
    df = pd.DataFrame({'a': [1], 'b': [2]})
    x = 'myvarname'
    mo.editor.register_datasource(df, x)

marimo_notebook.py:

import marimo

__generated_with = "0.9.14"
app = marimo.App()


@app.cell
def __():
    import my_module
    return (my_module,)


@app.cell
def __(my_module):
    my_module.register_dfs_in_background()
    return


if __name__ == "__main__":
    app.run()

riyavsinha · Nov 03 '24 23:11

I did not attempt to register "outputted" dataframes yet, because I'd like to understand first how you would want variable naming to look for that.

riyavsinha · Nov 03 '24 23:11

Hi @riyavsinha - I'll take a look at the PR. I don't think we want to support dynamic variable registration - it can lead to bugs and confusing behavior. And I don't know if register_df should hook into reactivity at all.

I did think of a better solution since you mentioned SQL. (Sorry, I'm on my phone.) You can potentially create the df as a local variable and do mo.sql("""CREATE OR REPLACE TABLE table_name AS SELECT * FROM df_name"""). This will store the df in memory via DuckDB, which will give you the data sources features you are looking for.
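
For reference, that workaround applied to the agent example from the issue description might look roughly like this (a sketch; agent, event.name, and event.tool_args are placeholders carried over from the issue, and the table name is interpolated directly into the SQL string):

import marimo as mo
import pandas as pd

def run_agent(input):
    for event in agent.run(input):
        df = pd.DataFrame(event.content)
        table_name = event.name + event.tool_args + "_df"
        # Materialize the result as an in-memory DuckDB table so it shows up
        # in the data sources panel and can be queried from later SQL cells.
        mo.sql(f"CREATE OR REPLACE TABLE {table_name} AS SELECT * FROM df")
        mo.output.append(df)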

mscolnick · Nov 04 '24 01:11

Yeah, that makes sense. I can make do with the workaround.

riyavsinha · Nov 04 '24 02:11