
Reactively respond to non-Python file changes (e.g. parquet, csv, json)

Open leventov opened this issue 11 months ago • 3 comments

Description

Part of Roadmap: https://github.com/marimo-team/marimo/discussions/2899

Related to refresh of the app itself on file changes: https://github.com/marimo-team/marimo/issues/1511, https://github.com/marimo-team/marimo/issues/2675

(Creating this issue for now so that I can refer and link to it from another issue.)

Suggested solution

Perhaps not a universal solution, but in the cases where these files are tracked in Git, piggy-backing on Git's own file-change tracking (which is itself highly optimised) would be nice, e.g., via pygit2.Diff.

Alternative

No response

Additional context

No response

leventov avatar Dec 20 '24 11:12 leventov

Initial thoughts, high uncertainty:

I think this should be a context-manager API rather than something like automatically picking up builtin open() or other "known" file-opening library calls in the cell code. For example:

with mo.watched_file("mymodel.pt") as f:
    # Note: we don't try to "pick up" torch.load() calls automatically;
    # we rely on the user to be explicit via this context block.
    m = torch.load(f)
    ...

The file name should probably(?) be a string literal like "mymodel.pt" rather than dynamic (e.g., watched_file(filename_coming_from_other_cell_as_variable)), so that the need to re-run the cell can be determined in a static parsing pass through the code.

Note: these watched_file() contexts could sit inside a persistent_cache context manager, optionally in the same with statement (with mo.persistent_cache(...) as _cache, mo.watched_file(...) as f:), or within a cached cell (persistently cached or not); changes to the file should then feed into cache invalidation: #3271.

It seems that making these files writable within the same context block, even if theoretically possible, makes the whole thing way too hard to reason about, even from the user's perspective. So it should probably be prohibited, and not just at the syntactic level: the implementation should probably use https://github.com/samuelcolvin/watchfiles and raise an exception on exiting the mo.watched_file() context if the file changed between context entrance and exit.
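To make the intended semantics concrete, here is a minimal, self-contained sketch of such a context manager. It re-hashes the file on exit instead of using the watchfiles library (which a real implementation would likely prefer for efficiency); the name watched_file and the exact behaviour are my assumptions, not an existing marimo API.

```python
import hashlib
from contextlib import contextmanager


def _digest(path):
    """SHA-256 of the file's current content."""
    with open(path, "rb") as fh:
        return hashlib.sha256(fh.read()).hexdigest()


@contextmanager
def watched_file(path):
    """Hypothetical sketch of the proposed mo.watched_file():
    yields the path for reading, and raises on exit if the file
    was modified while the context block was open."""
    before = _digest(path)
    yield path
    if _digest(path) != before:
        raise RuntimeError(f"{path} changed inside watched_file() context")
```

Usage would mirror the torch.load() example above: reads are fine, but any write to the watched file inside the block surfaces as an error at context exit.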

Not sure: what to do about file writes from other cells? Should they also be explicitly marked with a context block? Perhaps that would be necessary to build a static DAG in which "cells connected via file writes/reads" semantically coincides with "cells connected the normal way via variables" (otherwise we risk circular DAGs). But user omissions of such blocks would be way too easy, so to guard against them there should perhaps be a global watch on all the watched files within the notebook, raising an error whenever a change comes from the notebook process itself. How to reliably detect all file openings within the process, though, I don't know yet.

To explore: whether using https://github.com/samuelcolvin/watchfiles within the context manager alone is sufficient to prevent races like https://git-scm.com/docs/racy-git, or whether "extra tricks" along the lines of racy-git's own would also be needed (at least when we care not just about the watched file's mtime but about its contents as well).
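The racy-git problem boils down to this: two different file contents can end up with the same mtime, so an mtime-only check can miss an update that a content hash catches. A small deterministic demo (os.utime is used to force equal mtimes, simulating the race that in the wild happens when two writes land within the filesystem's timestamp granularity):

```python
import hashlib
import os
import tempfile


def sha256_of(path):
    with open(path, "rb") as fh:
        return hashlib.sha256(fh.read()).hexdigest()


def racy_mtime_demo():
    """Returns (mtimes_equal, hashes_differ) after two same-size writes
    whose timestamps were forced to coincide."""
    path = os.path.join(tempfile.mkdtemp(), "data.bin")
    with open(path, "wb") as fh:
        fh.write(b"version-1")
    stat1 = os.stat(path)
    hash1 = sha256_of(path)

    with open(path, "wb") as fh:
        fh.write(b"version-2")  # same size, different content
    # Simulate the race: force the second write's mtime back to the first's.
    os.utime(path, ns=(stat1.st_atime_ns, stat1.st_mtime_ns))

    stat2 = os.stat(path)
    hash2 = sha256_of(path)
    return stat1.st_mtime_ns == stat2.st_mtime_ns, hash1 != hash2
```

Since size and mtime both match here, a Git-style stat check would call the file unchanged; only the content hash notices the update.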

Content-addressable

The "cached watch" may need to care about the contents rather than the mtime alone (the API for indicating that in the user's cell code is TBD): e.g., when a noisy external process that the user doesn't control and cannot easily patch frequently "touches" the file without changing its contents. FWIW: when another cell within the notebook writes to the file, we may provide an extra wrapper that avoids touching the file, by first writing to a temp file and replacing the target only if it has a different content_hash, along the lines of how it's already done in cache or in Git itself.
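A minimal sketch of that wrapper, assuming the temp-file-then-atomic-replace scheme described above (the name write_if_changed is hypothetical):

```python
import hashlib
import os
import tempfile


def write_if_changed(path: str, data: bytes) -> bool:
    """Write data to path, but only if the content actually differs,
    so a byte-identical write never touches the target's mtime.
    Writes go to a temp file in the same directory first, then an
    atomic os.replace() swaps it in. Returns True if path was replaced."""
    new_hash = hashlib.sha256(data).hexdigest()
    if os.path.exists(path):
        with open(path, "rb") as fh:
            if hashlib.sha256(fh.read()).hexdigest() == new_hash:
                return False  # identical content: leave the file untouched
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    try:
        with os.fdopen(fd, "wb") as fh:
            fh.write(data)
        os.replace(tmp, path)  # atomic rename within the same filesystem
    except BaseException:
        os.unlink(tmp)
        raise
    return True
```

Putting the temp file in the target's directory matters: os.replace() is only atomic within a single filesystem.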

Note: this is orthogonal to #3271: the cache may or may not be fine with mtimes alone; a "simple runtime watch", with the cell/block simply @cache-d (not persistently) or not cached at all, may likewise be fine (or not) with mtimes alone.

I don't currently see a way to avoid races without, upon entering the watched_file context manager, doing one of the following:

  • Read the file into memory in full, compute its hash, and then serve the file "from memory", i.e., as a single bytearray.
  • Add the file to a Git index: either a "native" one, if we discover that the user already tracks this file in a Git repo of their own, or, if not, a special-purpose submodule that marimo maintains specifically for this purpose.

The second option is needed when the file is too big to read into memory in full at once. However, it has the downside that the file content is copied in full. Also, if the file is bigger than core.bigFileThreshold (default 512 MiB), my understanding is (not 100% sure, need to double-check) that Git itself will basically ignore its own stored hash and degrade to "if the mtime is newer, the file is considered updated", which renders the whole exercise pointless.
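The first option (read-into-memory) can be sketched in a few lines: snapshot the bytes and the hash at context entry, then serve all reads from the frozen in-memory copy so later on-disk writes cannot race with the cell. The function name is illustrative, not an existing API.

```python
import hashlib
import io


def snapshot_file(path):
    """Read the whole file into memory once, returning (content_hash,
    readable_buffer). Reads from the buffer are immune to concurrent
    on-disk modification; only viable when the file fits in memory."""
    with open(path, "rb") as fh:
        data = fh.read()
    return hashlib.sha256(data).hexdigest(), io.BytesIO(data)
```

The io.BytesIO buffer is file-like, so it can be handed to consumers such as torch.load() that accept an open binary stream.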

leventov avatar Dec 23 '24 12:12 leventov

TODO: at least check that the chosen approach will be forward-compatible with the following systems for data sharing/management, in my order of priority:

  • https://kedro.org/
  • https://dvc.org/
  • https://www.datalad.org/, and transitively git-annex
  • https://github.com/datonic/datadex (?) Don't clearly understand how it works. cc @davidgasquez
  • https://pachyderm.com/ - but not sure, since its usage seems to be dwindling(?)

Note: I tried to choose data management systems that are local-files-first (even if only for metadata, with the actual beefy files downloaded later). For the plethora of non-files-first systems, such as data catalogs, Hugging Face, etc., the alternatives are:

  1. Write a dedicated cell that fetches the latest data(set) version via the API. This cell would necessarily be non-deterministic, but the user should still be able to "force rerun" it even if it is cached.
  2. Create a separate mechanism/function within marimo that is familiar with certain data catalog / remote dataset versioning metadata APIs and, if configured a la with mo.watched_hf_dataset("https://huggingface.co/datasets/datonic/world_development_indicators") as _x: ..., does a metadata fetch and triggers a re-run on update if needed. But that would be a separate feature.
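Option 2 could be factored so that only the metadata fetch is catalog-specific. A rough sketch, assuming a user-supplied fetcher since every catalog's versioning API differs (the class name, the URL, and should_rerun() are all hypothetical, not a marimo API):

```python
from typing import Callable, Optional


class RemoteDatasetWatch:
    """Poll a catalog's versioning metadata via a pluggable fetcher and
    report whether the dependent cell should be re-run."""

    def __init__(self, url: str, fetch_version: Callable[[str], str]):
        self.url = url
        self._fetch_version = fetch_version
        self._last: Optional[str] = None

    def should_rerun(self) -> bool:
        """True when the remote version token changed since the last
        check (including the very first check, which triggers the
        initial run)."""
        version = self._fetch_version(self.url)
        changed = version != self._last
        self._last = version
        return changed
```

Only the cheap version token crosses the network on each poll; the beefy dataset download happens inside the cell, and only when a re-run is actually triggered.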

leventov avatar Dec 24 '24 09:12 leventov

Heya @leventov! Sharing a bit of context around Datadex in case it helps. Datadex itself is not a data management system, but an "opinionated glue" around some of the tools used in data these days. I don't think it fits with the DVCs or Git Annexes/LFSs around.

Over the years I compiled a list of "Data Package Managers" which might be helpful in your search!

PS: Making a small (not very informed) suggestion.

Create a separate mechanism/function within Marimo that would be familiar with certain data catalog / remote dataset versioning metadata APIs, and if configured a la with mo.watched_hf_dataset("https://huggingface.co/datasets/datonic/world_development_indicators") as _x: ... and would do a metadata fetch and trigger the run on update if needed. But, that would be a separate feature.

Since hf has a working file system URI in Python, it would be great to make the API something like watched_dataset("hf://...")

davidgasquez avatar Dec 24 '24 09:12 davidgasquez

Support for watching data files and automatically refreshing cells that depend on them is not yet supported. Follow along at https://github.com/marimo-team/marimo/issues/3258 and let us know if it is important to you.

Following the docs advice, this is important for me 😀

I work with a DSL with editor integrations (language server, syntax highlighting, etc). I use marimo to create small applets for this DSL that output graphics and other artefacts.

The ideal scenario for me would be to edit a source file in my editor, benefiting from all its tooling, and have marimo watch it and refresh its output, giving me immediate feedback. This will let me keep the marimo-based stack I have developed.

lvignoli avatar May 06 '25 08:05 lvignoli

Documentation for this can be found here: https://docs.marimo.io/api/watch/

The API follows Path, so you'll likely want read_text or read_bytes to pass into your respective applications. It works similarly to state. Cache busting is also based on the content hash.

+1 for external DSL. I tried this out with writing any widgets with JS in a separate file, and it has been nice.

Polling time is 1 second and not publicly configurable for the moment. Happy to receive feedback on it!
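For anyone curious what content-hash-based polling amounts to, here is a minimal standalone sketch of the per-poll check (a hypothetical helper, not marimo's actual implementation; the caller supplies the 1-second loop):

```python
import hashlib


class ContentHashPoller:
    """Detect file changes by content hash rather than mtime, so a
    touch that doesn't alter the bytes never busts the cache."""

    def __init__(self, path: str):
        self.path = path
        self._last = self._digest()

    def _digest(self) -> str:
        with open(self.path, "rb") as fh:
            return hashlib.sha256(fh.read()).hexdigest()

    def poll(self) -> bool:
        """Return True iff the content changed since the last poll."""
        current = self._digest()
        changed = current != self._last
        self._last = current
        return changed
```

In a real watcher this poll() would run on a timer (e.g. every second) and trigger re-execution of the dependent cells whenever it returns True.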

dmadisetti avatar May 13 '25 18:05 dmadisetti

This is really powerful! Awesome addition! 🚀

AH-Merii avatar May 13 '25 21:05 AH-Merii

Woaw thanks @dmadisetti for this! I'll test it asap 😀

lvignoli avatar May 15 '25 19:05 lvignoli