
app.function: top-level functions in notebook files

akshayka opened this issue

This issue is for making it possible to import functions directly from marimo notebooks:

from notebook import my_function

Only "pure" functions will be importable, ie functions that have no refs besides imported modules. Some care will need to be taken to handle modules.

A cell that includes a single function def, and nothing more, will be saved in the notebook file as:

@app.function
def my_function():
  ...

Discussed on:

  • Discord Discussion: https://discord.com/channels/1059888774789730424/1059891311190229082/1253813629317288108
  • GitHub Discussion: https://github.com/marimo-team/marimo/discussions/2288

akshayka avatar Sep 10 '24 20:09 akshayka

I worry that privileging just single-cell functions is going to be a huge UX footgun. There are many other things I might reasonably want the same behaviour from. (Classes are the obvious example.) Moreover I will almost always want my function to depend upon other values:

SOME_CONSTANT = 42

def the_answer_to_life_the_universe_and_everything() -> int:
    return SOME_CONSTANT

Or -- I may simply write code that depends upon being able to import a function like this... and now find that I am unable to adjust this function's behaviour without breaking that import resolution.


Can I make an alternate suggestion? Marimo supports adding meaningful names for cells, so it could be arranged for the syntax import my_notebook; my_notebook.my_cell.my_function to work.

In particular this would:

  • Automatically evaluate all cells that my_cell depends upon.
  • Set __module__ and __qualname__ appropriately, so that even dynamic resolution via importlib.import_module works correctly (see the sketch after this list).
  • ... and as such, basically support the full generality of things that we may wish to do in Python.
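
A small, self-contained illustration of the second point, assuming nothing about marimo's internals; the stand-in module and cell class are purely hypothetical:

import importlib
import sys
import types

# Hypothetical stand-ins for a notebook module and a named cell.
my_notebook = types.ModuleType("my_notebook")

class _MyCell:
    @staticmethod
    def my_function() -> int:
        return 42

my_notebook.my_cell = _MyCell
_MyCell.my_function.__module__ = "my_notebook"
_MyCell.my_function.__qualname__ = "my_cell.my_function"
sys.modules["my_notebook"] = my_notebook

# Dynamic resolution via importlib now round-trips correctly.
fn = importlib.import_module("my_notebook")
for part in "my_cell.my_function".split("."):
    fn = getattr(fn, part)
assert fn() == 42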

patrick-kidger avatar Sep 13 '24 22:09 patrick-kidger

Or -- I may simply write code that depends upon being able to import a function like this... and now find that I am unable to adjust this function's behaviour without breaking that import resolution.

Oh, that's a very good point; that kind of action at a distance would be sad. I hadn't considered this.

Regarding

so it could be arranged for the syntax import my_notebook; my_notebook.my_cell.my_function to work

It's definitely an interesting idea, and would be readily done by abstracting over the low-level API we have for programmatically running cells: https://docs.marimo.io/api/cell.html. AFAIK I'm currently the only consumer of that API; it's quite low level.

I've considered what you've suggested before, but the downside is that you won't get code completion, type information, etc. when typing out

my_notebook.my_cell.my_function

whereas top-level functions would play well with an LSP and other static analysis tools.

Given these tradeoffs, would you still prefer a cell-based API? We'd have to make a breaking change to the cell API to support this (what if a user had a function called run?), or perhaps we could expose it under a namespace, such as my_cell.functions.my_function ...

akshayka avatar Sep 13 '24 22:09 akshayka

Okay, refining this idea a little bit: it should be possible to define a module-level __getattr__ that runs the appropriate cells up to the point of providing that name. That would enable the simpler my_module.my_value based lookup, without having to name an individual cell.

This would then offer your originally suggested API, whilst supporting all kinds of Python types. (And neatly sidesteps questions about the Cell API.)

Autocompletion could be done via an if TYPE_CHECKING: block somewhere in the file.
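
A minimal sketch of that idea (a PEP 562 module-level __getattr__ plus a typing-only stub for autocompletion); the _PROVIDERS table stands in for marimo's real cell graph and is purely illustrative:

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Static stub so editors and type checkers can resolve names without running cells.
    SOME_CONSTANT: int
    def the_answer() -> int: ...

def _cell_constants() -> dict:
    return {"SOME_CONSTANT": 42}

def _cell_answer() -> dict:
    ns = _cell_constants()  # run the cell this one depends on
    def the_answer() -> int:
        return ns["SOME_CONSTANT"]
    return {**ns, "the_answer": the_answer}

# name -> cell that defines it (marimo would derive this from its dependency graph)
_PROVIDERS = {"SOME_CONSTANT": _cell_constants, "the_answer": _cell_answer}

def __getattr__(name: str):
    # Called only when `name` is not already a module attribute (PEP 562).
    try:
        provider = _PROVIDERS[name]
    except KeyError:
        raise AttributeError(f"module {__name__!r} has no attribute {name!r}") from None
    value = provider()[name]
    globals()[name] = value  # cache so later lookups skip __getattr__
    return value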

patrick-kidger avatar Sep 14 '24 08:09 patrick-kidger

Relevant: https://github.com/marimo-team/marimo/issues/1954

So in previous discussions, @app.function would primarily be used for serializing and creating cells with a name, and for denoting that the cell is a pure function. Importing a function from a cell is misleading when it can rely on additional, cell-specific references that are not constants. For instance:

result = expensive_computation()
mutable_arr_in_scope = []

# Not clear how exposed_fn should be exposed,
# or what the behavior should be on state change.
# Even if convention decided, should be intuitive API
# Not worth complexity IMO
def exposed_fn() -> tuple:
    mutable_arr_in_scope.append(state_in_diff_cell())
    return result, len(mutable_arr_in_scope)

One fix might be to move constants (all-caps and 'primitive') to the top level of the script. There has been discussion about doing this for import-only cells (https://github.com/marimo-team/marimo/discussions/1379); it might be feasible to do the same for constant-only cells:

import marimo

# cell 2 <-  (Auto generated comment? Use name if given)
import numpy as np

# cell 3
SCALE = 1

__generated_with = "0.0.0"
app = marimo.App()

@app.function
def lengthify(v, scale=SCALE, axis=-1):
   return scale * v / np.linalg.norm(v, axis=axis)

The previous discussion of @app.function made the UI very simple.

  1. If a cell with only a function definition meets the scope requirements, then it is automatically turned into an @app.function, and the cell name shows that (similar to a SQL cell showing the df name).
  2. If a cell with only a function definition does not meet the requirements (cell name conflicts with the function name, scope refs are not modules or constants), then a "?" icon appears where the cell name would be, explaining why it was not turned into an @app.function.

dmadisetti avatar Sep 14 '24 16:09 dmadisetti

Another implementation realization: an @app.function should be able to reference other @app.functions:

EPSILON = 1e-12

@app.function
def probs_ortho(v, u):
    return np.abs(np.dot(lengthify(v), lengthify(u))) <= EPSILON

dmadisetti avatar Sep 14 '24 16:09 dmadisetti

Had another idea about delimiting top level declarations in the notebook.

As opposed to using a comment, the lines of the notebook (or a statement count) could be registered:

import numpy as np # line 1, import count 1
import pandas as pd # line 2, import  count 2
# line 3
import jax # line 4, import count 3
# line 5
SCALE = 1 # line 6, constant count 1
# line 7
__generated_with = "0.0.0"
import marimo
app = marimo.App()

@app.cell
def _():
   return mo.md("This cell is displayed first")

@app.cell
def _():
   return mo.md("## Imports")

# The imports referenced by line number are displayed in the editor here
app.declare_imports(count=2)
# or app.declare_imports(lines=(1, 3)) # np, pd

app.declare_imports(count=1)
# or app.declare_imports(lines=[4]) # jax

@app.cell
def _():
   return mo.md("## Constants")

app.declare_constants(lines=[6]) # primitive values only
# app.declare_constants(count=1)

This removes the need to statically parse the file; you just run it instead.

A couple of functionality thoughts:

  • Imports not paired with a declare_x can be merged with the closest declaration, or automatically added as a block at the end of the notebook.
  • If using lines, lines after __generated_with cannot be indexed.
  • Invalid top-level blocks (e.g. a = custom_obj, or arbitrary logic) should be stripped (as is currently the case).

"import only" blocks are already detected, so implementing this change is

  1. building declare_imports to build a cell from the notebook content and then verify it is import-only (a rough sketch follows this list)
  2. changing the serialization of the notebook
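
A rough sketch of what step 1 could look like, assuming a hypothetical App class that reads its own notebook file; none of these names are marimo's actual API:

import ast
import linecache

class App:
    def __init__(self, filename: str):
        self._filename = filename
        self._cells = []

    def declare_imports(self, lines):
        # Pull the referenced lines out of the notebook file itself.
        source = "\n".join(
            linecache.getline(self._filename, n).rstrip() for n in lines
        )
        # Verify the block really is import-only before registering it as a cell.
        tree = ast.parse(source)
        if not all(isinstance(s, (ast.Import, ast.ImportFrom)) for s in tree.body):
            raise ValueError("declare_imports lines must contain only import statements")
        self._cells.append(source)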

"primitive only" blocks, would be easy enough to detect (no refs, assign statements only)- maybe enforce defs in UPPER- but might be better suited as followup feature once the lessons of modules is learned.

dmadisetti avatar Dec 20 '24 00:12 dmadisetti

Which removes the need for statically parsing the file opposed to just running it.

But it adds a burden on those who edit the files: if I add a line somewhere near the top of the file, suddenly something below it becomes invalid. This seems way too cumbersome for "normal" development a la #2288.

There is already kind of a standard for cell-delimiting comments, # %%: https://docs.spyder-ide.org/4/panes/editor.html#defining-code-cells, supported by VS Code and IntelliJ as well.

"Normal development" should be like:

  • Just normal top-level functions and variables
  • Classes are "split up" such that each method is a separate top-level block (which permits preceding it with # %%) a la fastcore.patch.
  • Tests and examples are in mo.cells.

Objective: dogfooding, develop marimo in marimo!

leventov avatar Feb 10 '25 17:02 leventov

Yeah, I agree with the editing issue. I don't think there's a problem in just duplicating:

import numpy as np # Import only blocks, just duplicated
import pandas as pd

import jax

__generated_with = "0.0.0"
import marimo
app = marimo.App()

@app.cell
def _():
    return mo.md("## Imports")

@app.cell
def _():
    import numpy as np  # Import only blocks, just duplicated
    import pandas as pd

@app.cell
def _():
    import jax

With manual editing, the top-level imports would override the cell-level ones (and a last cell would be created if no corresponding block exists).

dmadisetti avatar Feb 10 '25 22:02 dmadisetti

I think the simplest solution for pure top-level functions proposed so far is to (1) serialize pure functions as app.function decorated functions, and (2) duplicate imports to the top of the notebook when an imported module is used in an otherwise pure function. This is readable, adds just minimal noise and works well with manual editing of notebook files.

The only downside I see is the footgun of accidentally making functions impure, as Patrick brought up, but perhaps we could address this in the editor user interface, as @dmadisetti suggested in a previous comment.

@leventov's suggestion of enabling normal development is interesting. In fact Pluto.jl has a #%% style file format which enables development of Pluto in Pluto, although Pluto.jl does not have the equivalent of our mo.cells.

I think there are two issues that have been brought up in this discussion.

(1) When a user writes a cell containing only a pure function in the marimo editor, how should we serialize it as a "top-level function" in the notebook file, so it can be easily used by other Python modules? That is the topic of this GitHub issue.

(2) When a user types functions, classes, variables — arbitrary Python code — in the file itself, should we (a) preserve them on save from the marimo editor and (b) make them visible in the marimo editor?

akshayka avatar Feb 11 '25 04:02 akshayka

(2) When a user types functions, classes, variables — arbitrary Python code — in the file itself, should we (a) preserve them on save from the marimo editor and (b) make them visible in the marimo editor?

I would add to (2): when a user types functions, classes, variables — arbitrary Python code — in the file itself, should we [...] (c) present their outputs in the marimo editor? And to (1): whether to permit caching the results of top-level functions, and if so, how. Prior relevant discussions: https://github.com/marimo-team/marimo/discussions/2653, https://github.com/marimo-team/marimo/discussions/3270.

Doing AI evals, or something a la https://www.braintrust.dev/docs/guides/playground, would be a breeze in Marimo if caching were optionally supported for top-level functions.

I'm now much less sure about the "XTDB" suggestion that I made in the first message of https://github.com/marimo-team/marimo/issues/3176. On the other hand, the arguments to these functions could be pickled and thought of as "cell inputs" for module_hash computation. This approach probably won't "scale" (for example, the local persistence subsystem as currently envisioned won't store last_accessed and thus cannot do LRU eviction; there is no insight into the parameters, so it's not possible to prune the cache "intelligently" along them), but for AI evals it can often be "good enough", so why not enable it, such as via App(default_cache_top_level_fns=True).
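
A minimal sketch of the pickled-arguments idea: hash the pickled call arguments and persist results on disk. The App(default_cache_top_level_fns=True) flag above is hypothetical, and so is everything here; the sketch also shows why LRU pruning needs extra bookkeeping:

import functools
import hashlib
import pickle
from pathlib import Path

def disk_cached(fn, cache_dir: Path = Path(".fn_cache")):
    cache_dir.mkdir(exist_ok=True)

    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        # The pickled arguments act as the "cell inputs" for the cache key.
        key = hashlib.sha256(pickle.dumps((fn.__qualname__, args, kwargs))).hexdigest()
        path = cache_dir / key
        if path.exists():
            return pickle.loads(path.read_bytes())
        result = fn(*args, **kwargs)
        # No last_accessed metadata is stored, so LRU-style pruning isn't possible here.
        path.write_bytes(pickle.dumps(result))
        return result

    return wrapper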

leventov avatar Feb 12 '25 13:02 leventov

Closed by https://github.com/marimo-team/marimo/pull/3755 and https://github.com/marimo-team/marimo/pull/3779

mscolnick avatar Apr 09 '25 16:04 mscolnick