marimo icon indicating copy to clipboard operation
marimo copied to clipboard

Optional stable cell IDs

Open leventov opened this issue 11 months ago • 6 comments

Description

Stable (and per-workspace-unique) cell IDs are absolutely needed for scalable, on-by-default cell results persistent cache, see https://github.com/marimo-team/marimo/issues/3176. Stable cell IDs can even enable "moving the cell" to another notebook while preserving its cache.

Stable cell IDs would also be very helpful at the interaction of Marimo and Git. As a minimum, to better track changes per cell when Git itself fumbles with its regular line tracking feature (e.g., when files are renamed or cell order is changed).

Suggested solution

The only solution that I can think of at the moment is as follows:

  1. Marimo app uses uuids as cell-ids internally instead of what it's using at the moment (four letters whose uniqueness doesn't seem to be enforced even within the App?)
  2. Optional id parameter in cell decorator: @cell(id="uuid"). Marimo tries to make sure they are not duplicated on the code level (if the user copy-pastes cell code and doesn't change the id, for example).
  3. If id is not specified, it's generated by marimo app internally (and thus, will not be stable), and this means the cell will be excluded from the on-by-default caching, and writing an explicit with persistent_cache() block within such a cell will fail.
  4. When editing through Marimo UI, and when cell results cache is on-by-default, Marimo app will generate uuid for the cell itself and will add it to the code (thankfully the user will not see this code most of the time in the app).

Alternative

Two main problems with the solution above:

  • UUIDs in the code are ugly and clutter. Not as much accidental complexity as requiring users to think about caching themselves, but still not ideal.
  • When the user manipulates the code of the notebook outside the Marimo UI, and they need to create a new cell, they need to generate UUID themselves and insert it in code, which is a speed/flow bump.

However, I don't see alternatives. Using hashes of cell's code + hashes of all its code dependencies will lead to a lot of trashing of cell_ids and thus will subvert the proposed solution for persistent caching of results at https://github.com/marimo-team/marimo/issues/3176.

Additional context

No response

leventov avatar Dec 15 '24 18:12 leventov

FWIW, https://github.com/dbos-inc/dbos-transact-py doesn't seem to be any smarter than this, they also require putting stable workflow ID (equivalent of cell_id for Marimo) right in the code: see https://docs.dbos.dev/python/tutorials/idempotency-tutorial.

By the way, for potential integration with DBOS which might be helpful in productionising Marimo notebooks, the fact that DBOS's workflow IDs are UUIDs (by default) is another reason to make cell IDs uuids: as they could be matched: cell = workflow, cell ID = workflow ID. Although, another alternative would be to identify cells with a DBOS's steps.

leventov avatar Dec 16 '24 07:12 leventov

Stable IDs are also required for UI-Elements to be properly re-used

@mscolnick did something towards this here: https://github.com/marimo-team/marimo/pull/3061 which could be made more deterministic, but seems OK for the time.

but aside from that, not sure why changing the IDs system is needed (esp, on a user exposed level)

The ID is an internal mechanism for runtime, but isn't saved anywhere. The current ID system does have an issue: #1962 but there's trivial fixes for this


In cached execution, the cell is hashed just like it is wrapped in a persistent_cache block. There's 2 steps to hashing:

  1. Execution path- hashing the relevant executing cells into a given path
  2. Content Addressing- additional hashing content is added to the cell based on certain inputs, helping mitigate state and UI based changes.

This makes it robust to:

  1. Ordering (hashing occurs on a graph level
  2. Accidental state changes producing an inconsistent caching state

If the issue is tracking changes, cells can be given names. This was tangentially suggested as a solution for #3129 (which could have used execution based hashing).

But hashes and IDs aren't super readable. Might be good for detecting downstream notebook changes, but limits marimo notebooks from being written in any editor.


Maybe HTML exports could be tagged with an execution hash on a per cell level? Thoughts on this?

dmadisetti avatar Dec 16 '24 10:12 dmadisetti

Without stable cell IDs, there is no way to know if some module_hashes in the cache dir correspond to some stale or entirely deleted cells. And therefore, cache cleaning couldn't be made automatic and relatively fast, as it would require computing the entire module_hash graph for the entire workspace, which can in general require arbitrarily expensive (both in terms of time and $$$, due to LLM calls) computations (for example, if the user removed parts of the cache manually, for whatever reason, or if they off-loaded it with git-annex to another device and didn't re-download no the machine where the hypothetical marimo cache tidy command is executed).

Whereas, with .py-level cell ids, this is a very straightforward python code scanning excercise. The same complexity as pytest does when it scans directories for test files and functions, for example.

And, the alternative to "targeted cleanup" would be removing the entire cache only, which again, is problematic because it will potentially force very expensive re-computations.

I really see the DX/UX gap that may grow here between Marimo and Jupyter here (because with all its problems, Jupyter generally doesn't lose your cell results), and a potential deal-breaker for scaling Marimo workspaces beyond "I work alone by myself or within a small team" scenarios.

Cf. also this comment by Leo Meyerov, I think it's a relevant bit of practical insight about notebook DX/UX.

leventov avatar Dec 16 '24 10:12 leventov

Re: editing notebooks in any editors,

  1. IDs are still optional. The user may write @app.cell() def __(): ... by hand. In many cases it will be fine because actually in some cases caching cell results is superfluous (e.g., in unit tests). If the "cache cell results by default" setting is ON, next time marimo runtime "lays its hands" on the cell code (either via UI opening or command-line processing), it may supplant missing UUIDs itself unless the cell is also tagged with "test" or some other tag which excludes the cells from automatic result caching (which could also be configured).
  2. In popular IDEs, generating UUID with a shortcut is not that hard: VS Code, IntelliJ.

When the user edits the notebook in the most bare bones editor (vi, nano, Notepad, whatever), and they know they really want to cache this cell, opening https://www.uuidgenerator.net/ or running uuidgen from command line and copying the UUID over from there is also not that hard.

To increase the legibility of the fact that these IDs are required for caching, probably the optional param should be called cache_id, i.e., @app.cell(cache_id=...). In this way, the users who are used to seeing bare cells code will learn quickly and remember firmly that if they want the cell to be auto-cached, they need to add cache_id.

leventov avatar Dec 16 '24 10:12 leventov

entire module_hash graph for the entire workspace, which can in general require arbitrarily expensive (both in terms of time and $$$, due to LLM calls) computations

The execution path hashing is done statically and is very cheap. So this doesn't really follow either:

Without stable cell IDs, there is no way to know if some module_hashes in the cache dir correspond to some stale or entirely deleted cells.

since the diff of what's in the notebook and what's in the cache is easy to find

If it is useful, then maybe #3129 with the execution path hashing might be worth pursuing.

Opt in or opt out cache should be as easy as cache = True | False, with the reference mechanism hidden, but cell name or persistent_cache name should be leveraged where possible


edit: woops, tagged the wrong issue initally, meant the unique hash name one

dmadisetti avatar Dec 16 '24 10:12 dmadisetti

See relevant to this feature request part of this comment:

Note important thing: whereas you probably most think about two scenarios:

  1. Editing cells through Marimo Web UI (and VS Code extension provides the same, or its closer to scenario two?) where the JS side can "remember" the cell_id the user is editing to ensure "short history" continuity
  2. Editing cells as code in simple text editors, by a human, where UUID proposal is ugly and makes somewhat bumpy user experience with need to generate UUIDs by hand or figure out how to do it in IDE,

They main scenario as I see using Marimo is a third one: the majority of cells are created and edited by AIs, e.g. within Co-STORM. The wrapper code can add UUIDs trivially, so the clunkiness of (2) above is not an issue. But also unless that wrapping logic is also knee-deep in Marimo internals (calls to module_code_hash directly before it instructs the AI to re-write the cell, for instance, and then manually tells Marimo somehow to look up potentially duplicate results there), the recent history dedupliication wouldn't work.

The abiliity to provide explicit cell_ids (which my wrapper logic can generate easily) would enable much better separation of concerns between AI workflow wrappers and Marimo layer. And also, those UUIDs needed to "correlate cells across Git branches" and moving cells from one notebook to another without "losing the context" is also directly motivated by those AI-workflow-ish scenarios.

leventov avatar Dec 16 '24 18:12 leventov

Closing as with notebook key scoping of persistence keys cell names would suffice. Corner case:

# apps.py
a = mo.App(persistent_execution=True)
b = mo.App(persistent_execution=True)
# a.py
from apps import a
@a.cell
def foo():
    pass
# b.py
from apps import b
@b.cell
def foo():
    pass

Here, a and b would use the same notebook key because they originate from the same file. But cell names foo are clashing.

Solution: for multiple apps created in a single python file (as apps.py), if their cells include persistent_cache() blocks, or cells persisted themselves (as sketched above), Marimo should require that such apps have explicit save_key params passed and they are different. That is:

# apps.py
a = mo.App(persistent_execution=True, save_key="...")
b = mo.App(persistent_execution=True, save_key="...")

leventov avatar Jan 22 '25 12:01 leventov