
Save vs Sync


Latest state of this issue

  • [x] PR #855 implements most of the functionality
  • [x] PR #860 implements merging data for objects with the same ID, similar to an identity map
  • [ ] Fetch links and multi links of newly inserted objects (relatively low pri)
  • [ ] Add SYNC_NEW_THRESHOLD and SYNC_REFETCH_THRESHOLD constants and warn the user that client.save() may be the better choice when sync() would do significantly more work than needed (see the sketch after this list). Add a keyword-only argument to disable the warning.
  • [ ] Optimize client.sync()
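
A possible shape for the warning mentioned in the checklist above, as a rough sketch (the constant values, the logger name gel.sync, and the warn keyword-only argument are placeholders, not the final API):

```python
import logging

logger = logging.getLogger("gel.sync")

# Placeholder values; the real thresholds would need tuning.
SYNC_NEW_THRESHOLD = 100
SYNC_REFETCH_THRESHOLD = 1000


def maybe_warn_about_save(n_new: int, n_refetch: int, *, warn: bool = True) -> None:
    """Suggest client.save() when sync() would do much more work than needed."""
    if not warn:
        return
    if n_new > SYNC_NEW_THRESHOLD or n_refetch > SYNC_REFETCH_THRESHOLD:
        logger.warning(
            "sync() is about to re-fetch %d objects (%d of them new); if you "
            "don't need refreshed computeds/defaults, client.save() may be "
            "the cheaper choice.",
            n_refetch,
            n_new,
        )
```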

Save vs Sync

There are two fundamentally different ways of persisting ORM model changes:

  1. One-way save: Push changes to the database with minimal feedback (only new object IDs). Useful when you don't need updated computed values, defaults, or triggers.

  2. Bidirectional sync: Push changes to the database, then refetch objects to get current state including computed values, defaults, and trigger effects.

Currently, save(refetch=False) handles case 1, while save(refetch=True) handles case 2. However, using the same method name for these two different operations creates a poor developer experience, and the refetch mechanism is poorly defined.

The proposal is to separate these cases into 2 client methods: save() and sync().
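
For illustration, the split could look like this from the user's side (a hypothetical usage sketch, not runnable as-is: User, Post, client, and passing objects positionally all stand in for whatever the final API looks like):

```python
user = User(name="Alice")                 # new, unsaved model instance
post = Post(title="Hello", author=user)   # links to the new user

# One-way save: pushes the changes; only new object IDs come back.
client.save(user, post)
assert user.id is not None                # new ID is now populated
# defaults, computeds and trigger effects may still be missing/stale locally

# Bidirectional sync: pushes the changes, then re-fetches current state.
client.sync(user, post)
# computeds, defaults and trigger-written fields are now up to date
```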

Save

The save() method is the simpler of the two. The current implementation of save(refetch=False) already does what we need, and it will become the default (and only) behavior of save().

Sync

The sync() method has two stages: save and re-fetch. The save stage works essentially the same as save(): we apply the changes to the database and keep track of new IDs. In addition to new IDs, we also record the IDs of all Gel objects that were updated. We will refer to the combined set of new and updated object IDs as the delta.

The next step is to refetch data from Gel. We need to define which Python objects and fields to refetch.
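
A minimal sketch of the bookkeeping the save stage could do to produce the delta (the SyncDelta name and its shape are illustrative, not the actual implementation):

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class SyncDelta:
    """IDs touched by the save stage of sync()."""

    new_ids: set[uuid.UUID] = field(default_factory=set)
    updated_ids: set[uuid.UUID] = field(default_factory=set)

    def record_insert(self, obj_id: uuid.UUID) -> None:
        self.new_ids.add(obj_id)

    def record_update(self, obj_id: uuid.UUID) -> None:
        self.updated_ids.add(obj_id)

    @property
    def all_ids(self) -> set[uuid.UUID]:
        # Everything inserted or updated in this sync batch.
        return self.new_ids | self.updated_ids
```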

Which Python objects need re-fetching

Any object that is either passed directly to sync() or reachable via links from those root objects will be re-fetched. The reasoning is that these are the same objects that get scanned for changes to be saved in the first place. We re-fetch objects even if they had no direct changes to save, because they may be affected by changes to other objects in the sync batch (e.g. through backlink computeds).
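
A sketch of that traversal; get_links abstracts over "give me the currently loaded link targets of this object", which the real implementation would derive from the generated model metadata:

```python
from typing import Callable, Iterable


def collect_refetch_set(
    roots: Iterable[object],
    get_links: Callable[[object], Iterable[object]],
) -> list[object]:
    """Return every object reachable from the roots via links (roots included)."""
    seen: dict[int, object] = {}      # id() -> object, avoids revisiting
    stack = list(roots)
    while stack:
        obj = stack.pop()
        if id(obj) in seen:
            continue
        seen[id(obj)] = obj
        stack.extend(get_links(obj))  # follow single and multi links
    return list(seen.values())
```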

Which data needs re-fetching

Deciding which data to re-fetch for each object is tricky, because we want to avoid accidentally re-fetching large amounts of data that is never used.

The rules are slightly different for new and existing objects; a sketch combining both cases follows the two lists below.

New objects:

  • Refetch all properties (single, multi, regular, computed - equivalent to *-splat)
  • Refetch single links if they were explicitly defined (not "unset")
  • Multi-links follow existing object strategy

Existing objects:

  • Skip "unset" fields (never fetched originally)
  • Refetch previously fetched properties and single links
  • Reconcile multi-links with delta (see strategy below)
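
Put together, the per-object decision could look roughly like this (obj_state, its attributes, and the plan labels are all made up for the sketch):

```python
def refetch_plan(obj_state) -> dict[str, str]:
    """Decide how each field of one object should be re-fetched.

    Assumes obj_state exposes: is_new, properties, single_links, multi_links
    (field names grouped by kind) and set_fields -- the fields that were
    fetched originally (existing objects) or explicitly set (new objects).
    """
    plan: dict[str, str] = {}

    for name in obj_state.properties:
        # New objects refetch all properties (*-splat-like); existing objects
        # only refetch the properties that were fetched before.
        if obj_state.is_new or name in obj_state.set_fields:
            plan[name] = "refetch"

    for name in obj_state.single_links:
        # In both cases, only single links that aren't "unset" are refetched.
        if name in obj_state.set_fields:
            plan[name] = "refetch"

    for name in obj_state.multi_links:
        # Multi-links are reconciled against the delta (see the next section).
        if name in obj_state.set_fields:
            plan[name] = "reconcile-with-delta"

    return plan
```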

Multi-link Refetch Strategy

To avoid performance issues, multi-links are not refetched in their entirety. Instead, we reconcile the existing data with the delta (new and updated object IDs) using a filter.

The refetch filter includes:

  • All existing link target IDs from the Python field
  • All IDs from the delta

This captures both additions to and removals from the multi-link (see the sketch below).

Note: For partially-fetched multi-links, original filtering criteria (filter, offset, limit) may no longer apply after reconciliation.
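
A sketch of how the refetch filter for one multi-link could be assembled; the function only computes the ID set to pass as a query parameter, and the EdgeQL fragment in the docstring is just an illustration of where it would be used:

```python
import uuid


def multi_link_refetch_ids(
    current_target_ids: set[uuid.UUID],
    delta_ids: set[uuid.UUID],
) -> list[uuid.UUID]:
    """IDs to use when re-fetching one multi-link.

    The link shape would then be fetched with something along the lines of
    `friends: { ... } filter .id in array_unpack(<array<uuid>>$ids)` (EdgeQL
    sketch).  Targets present in the result stay in (or get added to) the
    Python field; targets in the filter but missing from the result were
    removed from the link.
    """
    # Existing targets catch removals, delta IDs catch additions.
    return sorted(current_target_ids | delta_ids)
```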

Data integrity

save() has minimal impact on the Python objects. After saving, the changes are applied to the database and the objects remain unchanged (apart from new objects receiving their IDs), but they may now contain stale data (e.g., last_modified timestamps).

On the other hand, sync() guarantees that the Python objects contain no stale data (i.e. computeds, backlinks, last_modified, etc. are up-to-date). The downside is that if any of that data was originally fetched only in part, the ordering or filtering criteria are no longer necessarily valid. If those are important, the user should either re-validate them in Python or explicitly re-fetch the object in question.

Additional sync option

We may want to add another way to "sync" a specific object by using get(). We could have a version of get(some_obj, query, **kwargs) that, instead of creating a new object to hold the results, updates the fields of the existing object some_obj with whatever the query fetched. Any nested objects should also be updated (matching them to the corresponding query results by id). New objects would be created only if the existing fields don't already contain a match. In this case, the overall structure and order of the data would be dictated solely by the query results.
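
A rough sketch of those in-place semantics; fields(obj) stands in for "the loaded field values of obj as a name-to-value mapping", which the real implementation would get from the generated model metadata:

```python
def update_in_place(existing, fetched, fields):
    """Merge freshly fetched data into an existing object graph, in place."""
    for name, new_value in fields(fetched).items():
        old_value = getattr(existing, name, None)

        if isinstance(new_value, list):                        # multi-link
            by_id = {o.id: o for o in (old_value or []) if getattr(o, "id", None)}
            merged = []
            for item in new_value:
                match = by_id.get(item.id)
                if match is not None:
                    update_in_place(match, item, fields)       # reuse existing object
                    merged.append(match)
                else:
                    merged.append(item)                        # no match: new object
            # Structure and order come from the query results.
            setattr(existing, name, merged)

        elif (
            getattr(new_value, "id", None) is not None
            and getattr(old_value, "id", None) == new_value.id
        ):
            update_in_place(old_value, new_value, fields)      # same single-link target

        else:
            setattr(existing, name, new_value)                 # property or replaced link

    return existing
```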

This mechanism can complement sync(), giving users more control over what gets re-fetched.

vpetrovykh · Aug 08 '25 10:08