Q: recommended way to replace all records / generations
Hi, I have a local cache that represents the latest batch of records to, well, keep cached 😄
The flow is such that an external worker needs to batch-upsert cached records periodically, at intervals defined by that worker.
The cache needs to keep the records until the next batch comes in, whenever that may be.
There are multiple cache instances, one per internal category.
Records are time-based too, so the key is composite: {subcategory, timestamp}.
Below is the current idea.
Is this the correct way of using generations?
Initially I was executing new_generation before put_all, but that did not expire old records.
Using all(:unexpired, return: :value) also did not help (I can't even use it because of the additional key filtering; I just experimented with expiry).
Then I played with a spec that would return non-expired records only, something akin to:
defp unexpired_spec(opts) do
  key = Keyword.get(opts, :key) || :"$1"
  value = Keyword.get(opts, :value) || :"$2"
  touched = Keyword.get(opts, :touched) || :"$3"
  ttl = Keyword.get(opts, :ttl) || :"$4"

  [
    {
      Local.entry(key: key, value: value, touched: touched, ttl: ttl),
      [
        {
          :orelse,
          {:==, :"$4", :infinity},
          {:<, {:-, Nebulex.Time.now(), :"$3"}, :"$4"}
        }
      ],
      [:"$2"]
    }
  ]
end
Unfortunately, nothing apart from the current solution helped with this: old-generation data would still be accessible. Is there a better way? Note that the aim is not only to replace records with the same key, but also to remove obsolete records when a new batch arrives. The test below shows the idea.
test "replaces all events" do
start_supervised!({MyCache, name: MyCache.id(:sports)})
events = [
%{subcategory: :basketball, timestamp: @yesterday, title: "Basket A"},
%{subcategory: :rugby, timestamp: @today, title: "Rugby A"},
%{subcategory: :football, timestamp: @tomorrow, title: "Football A"}
]
:ok = MyCache.replace_all(:sports, events)
new_events = [
%{subcategory: :rugby, timestamp: @tomorrow, title: "Rugby B"},
%{subcategory: :football, timestamp: @tomorrow, title: "Football A"}
]
:ok = MyCache.replace_all(:sports, events)
cached_events = MyCache.get_matching(:sports)
assert length(cached_events) == 2
refute Enum.find(cached_events, &(&1.subcategory == :basketball))
assert %{title: "Rugby B"} = Enum.find(cached_events, &(&1.subcategory == :rugby))
assert %{title: "Football A"} = Enum.find(cached_events, &(&1.subcategory == :football))
end
defmodule MyCache do
  alias Nebulex.Adapters.Local
  require Local

  use Nebulex.Cache,
    otp_app: :pulsar,
    adapter: Local

  def id(category), do: Module.concat(__MODULE__, category)

  def get_matching(category, subcategory \\ nil) do
    key =
      case subcategory do
        # return all
        nil -> nil
        # return only those for the subcategory
        subcategory -> {subcategory, :_}
      end

    with_dynamic_cache(id(category), __MODULE__, :all, [matching(key)])
  end

  def replace_all(category, events) do
    events_with_keys =
      Enum.map(
        events,
        &{{&1.subcategory, DateTime.to_unix(&1.timestamp)}, &1}
      )

    with_dynamic_cache(id(category), __MODULE__, :put_all, [events_with_keys])

    # setting the new generation after put_all, to prepare for the next iteration;
    # on the next iteration, old data will automatically expire
    with_dynamic_cache(id(category), __MODULE__, :new_generation, [])
  end

  # see https://hexdocs.pm/nebulex/Nebulex.Adapters.Local.html#module-queryable-api
  defp matching(key) do
    [
      {
        # definition of a cache entry, with values or placeholders for each property;
        # :"$num" defines a placeholder to be used when filtering
        # and when defining the shape of the returned result;
        # :_ means "match anything"
        Local.entry(key: key || :"$1", value: :"$2", touched: :_, ttl: :_),
        # no need to filter by other properties, including value;
        # the key already filters what we need
        [],
        # defines the shape of the returned result, the value in this case
        [:"$2"]
      }
    ]
  end
end
Hi. Ok, so let me first clarify some misconceptions about the cache generations.
Old generation data would still be accessible.
Of course, when you create a new generation, the current generation becomes the older one, and the created one becomes the latest generation. Therefore, the data is still in the cache. The next time you create another generation, the older generation is removed (its cycle is over), the created one becomes the new one, and the current one becomes the older. That is the idea behind a generational cache.
I think you shouldn't rely on the generational process to achieve what you're looking for. Instead, following the approach you mentioned, you could use delete_all to remove the entries you need (using a matching query), and then put_all. Also, it seems you are handling the eviction yourself, so maybe you don't need generations at all; the generational process may remove data at some point that you might still need later. You can set the :gc_interval option to nil (or simply omit it), which tells the cache not to create new generations. That means you keep a single generation (table), and your process is basically handling the eviction. Let me know if that makes sense.
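Just to illustrate, a minimal sketch of that second approach, reusing the replace_all/2 shape and the id/1 helper from your snippet (and assuming the dynamic caches are started without :gc_interval, so no generations are created automatically):

  def replace_all(category, events) do
    events_with_keys =
      Enum.map(events, &{{&1.subcategory, DateTime.to_unix(&1.timestamp)}, &1})

    # remove everything currently cached in this instance...
    with_dynamic_cache(id(category), __MODULE__, :delete_all, [])

    # ...then insert the new batch; note there is a small window between the
    # two calls where the cache is empty (more on that below)
    with_dynamic_cache(id(category), __MODULE__, :put_all, [events_with_keys])
  end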
Now, if you really want to use generations, you will have to call new_generation twice to ensure both the old and the new generations are removed, and then you start with an empty cache. But according to what I understand, that's not what you want? So I'd maybe follow the other approach above.
If I understand correctly, these are the viable options you mentioned. Both seem to achieve a complete replacement of the data in the cache.
This one ensures older generations are removed.
Q: will latest generation expire automatically? I would think not, given :infinity default TTL, but not sure if there is another param that might override it.
with_dynamic_cache(id(range), __MODULE__, :new_generation, [])
with_dynamic_cache(id(range), __MODULE__, :new_generation, [])
with_dynamic_cache(id(range), __MODULE__, :put_all, [buckets_with_keys])
This one deletes all and inserts new data.
with_dynamic_cache(id(range), __MODULE__, :delete_all, [])
with_dynamic_cache(id(range), __MODULE__, :put_all, [buckets_with_keys])
Q: What about the effect of each approach on serving existing data? The data in the cache will be requested frequently, and I expect around ~100K records to be batch-inserted every time we refresh the cache. Ideally there would be no "gap" with no data while the records are being replaced. Hence the generation idea, with the working assumption that replacing generations ensures we don't experience such a gap.
With delete_all/put_all, I am not sure if there would be a small window where the cache would be basically empty and unable to serve any data. Q: would wrapping that solution in a transaction help?
I would like to provide an example to ensure we have a common understanding of how generations work (a short code sketch of the same flow follows the list). First of all, think about generations like tables or buckets, so when I say an old generation is removed, the entire table is deleted, and so are all of its entries. Imagine you start your app with the cache; initially there is only one generation G1 (which is an ETS table). Like:
-> App starts with a cache
-> G1 is created
-> ...
-> Imagine you put the keys `k1` and `k2`, this will be stored in `G1`
-> ...
-> The cache GC creates a new generation `G2`, so `G1` becomes the old generation and `G2` the new one (but at this point `G2` has no data)
-> ...
-> You fetch the key `k1`. Then, `k1` is moved from `G1` to `G2` (this happens automatically, ensuring the most frequently accessed keys stay available)
-> ...
-> A new generation `G3` is created. Then `G1` is deleted (all its keys are deleted, in this case `k2` is gone), `G2` becomes the older generation, and `G3` becomes the new one.
-> ... and that happens every time a new generation is created
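To make it concrete, here is the same flow with the dynamic-cache calls used in this thread (a sketch; it assumes the MyCache module from above with its :sports instance already started, and the same :new_generation helper used elsewhere here):

  cache = MyCache.id(:sports)

  # k1 and k2 land in G1 (the only generation so far)
  MyCache.with_dynamic_cache(cache, MyCache, :put, [:k1, 1])
  MyCache.with_dynamic_cache(cache, MyCache, :put, [:k2, 2])

  # G2 is created; G1 becomes the old generation, but its data is still readable
  MyCache.with_dynamic_cache(cache, MyCache, :new_generation, [])

  # reading k1 moves it from G1 into G2
  MyCache.with_dynamic_cache(cache, MyCache, :get, [:k1]) #=> 1

  # G3 is created; G1 is deleted and k2 goes with it, while k1 survives in G2
  MyCache.with_dynamic_cache(cache, MyCache, :new_generation, [])
  MyCache.with_dynamic_cache(cache, MyCache, :get, [:k2]) #=> nil
  MyCache.with_dynamic_cache(cache, MyCache, :get, [:k1]) #=> 1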
Q: will latest generation expire automatically? I would think not, given :infinity default TTL, but not sure if there is another param that might override it.
I'm not sure what you mean by the "latest generation", but assuming you have G1 as the old generation and G2 as the new one: when a new generation G3 is created, then yes, all the entries in G1 are removed automatically. On the other hand, if by "latest generation" you mean G2, then no, it is not automatically removed or expired; it becomes the old generation, but it still exists. Now, you can also use the :ttl option, which will evict expired keys.
Q: What about the effect of each approach on serving existing data? The data in the cache will be requested frequently, and I expect around ~100K records to be batch-inserted every time we refresh the cache. Ideally there would be no "gap" with no data while the records are being replaced. Hence the generation idea, with the working assumption that replacing generations ensures we don't experience such a gap. With delete_all/put_all, I am not sure if there would be a small window where the cache would be basically empty and unable to serve any data.
I see. And yes, with both approaches there is a chance of a tiny time window where there is no data. Let's take a closer look at each approach. First:
with_dynamic_cache(id(range), __MODULE__, :new_generation, [])
with_dynamic_cache(id(range), __MODULE__, :new_generation, [])
with_dynamic_cache(id(range), __MODULE__, :put_all, [buckets_with_keys])
After the second new_generation call, the cache will be empty until the put_all completes. So yes, there is a chance of a tiny time window where the cache is empty.
The second option:
with_dynamic_cache(id(range), __MODULE__, :delete_all, [])
with_dynamic_cache(id(range), __MODULE__, :put_all, [buckets_with_keys])
It is pretty much the same situation: after the delete_all, the cache will be empty until the put_all completes.
BUT, I can maybe give you another option (just changing the order of the commands). For example:
with_dynamic_cache(id(range), __MODULE__, :new_generation, [])
with_dynamic_cache(id(range), __MODULE__, :put_all, [buckets_with_keys])
with_dynamic_cache(id(range), __MODULE__, :new_generation, [])
In this case, the first new_generation call creates a new generation and the current latest one becomes the older one, but the data is still there, so the cache can still serve data (it is not empty). Then you store the new entries with put_all, and that happens in the new generation, so from that moment on the keys are served from there (unless a key doesn't exist in the new generation, in which case it is looked up in the older generation and moved back into the new one, which is ok). Finally, the last new_generation removes the older generation, and the previous latest one (the one with the new entries) becomes the older one. BUT the data is still there, and eventually all those keys are moved to the new generation as they are accessed. Does it make sense? Please let me know if that works for you.
Q: would wrapping that solution in transaction help?
I'd try the previously suggested option before adding transactions, because transactions will affect performance. So, if there is a way to make it work without transactions, I'd go with that.
unless a key doesn't exist in the new generation, in which case it is looked up in the older generation and moved back into the new one, which is ok
eventually all those keys are moved to the new generation as they are accessed
Let's go through an example. Imagine this is the first batch of cached records (each tuple is {key, value}).
[
  {{:basketball, ~U[2025-06-06 12:00:00Z]}, "NBA 1"},
  {{:tennis, ~U[2025-06-06 13:00:00Z]}, "Wimbledon 1"}
]
Fetching all basketball games would return ["NBA 1"], like this:
spec = [
  {
    Local.entry(key: {:basketball, :_}, value: :"$2", touched: :_, ttl: :_),
    [],
    [:"$2"]
  }
]

with_dynamic_cache(id(range), __MODULE__, :all, [spec])
Fetching tennis, in similar fashion, would return ["Wimbledon 1"].
Then a new batch comes in, e.g.:
[
  {{:basketball, ~U[2025-06-07 12:00:00Z]}, "NBA 2"},
  {{:swimming, ~U[2025-06-06 13:00:00Z]}, "Olympics"}
]
What I am aiming to achieve is:
- fetching basketball games: ["NBA 2"]
- fetching tennis games: []
- fetching swimming games: ["Olympics"]
- fetching all games: ["NBA 2", "Olympics"]
So basically, when a new batch comes in, the old values have to be completely gone.
I am using the timestamp in the key since we may have more than one event for the same category.
Maybe I can get around that and just use :basketball as the key; it would have to be some other type of ETS table then, but that's likely doable too. That should help with existing keys, right? But I'm still not sure what happens to keys that are not present in the new batch.
My initial attempt, before even using Nebulex, was to have two ETS tables and do this when a new batch comes in:
- keep reading from old table
- create a new table
- write data into that new table
- switch reading to new table
- clear/delete old table
But, instead of orchestrating all that, the generations approach seemed to map exactly onto this idea. Maybe that is not true, and a generation is to be thought of in the context of each key, separately from the rest of the data (each key has its own lifecycle)?
But, instead of orchestrating all that, the generations approach seemed to map exactly onto this idea. Maybe that is not true, and a generation is to be thought of in the context of each key, separately from the rest of the data (each key has its own lifecycle)?
Right. I'd say that, although the generation approach may seem compelling for this use case, it is not necessarily the best fit, because it doesn't behave exactly that way. It might work, but maybe not exactly as you expect. When a generation is created, the cached entries are not all removed; only the ones in the older generation are. But the current latest generation that becomes the older one may also still have keys. So, it doesn't follow the exact behavior you describe: when a new batch comes in, the old values have to be completely gone.
Based on what you describe in the example, perhaps the best option using the Nebulex local adapter is:
with_dynamic_cache(id(range), __MODULE__, :new_generation, [])
with_dynamic_cache(id(range), __MODULE__, :new_generation, [])
with_dynamic_cache(id(range), __MODULE__, :put_all, [buckets_with_keys])
Only by pushing the generation twice do you flush all the previous data. However, like you said, there is still a chance that the cache is empty while the new entries are being populated.
So, I think you would have to write something on top of the cache to make it work as you expect. For example, one easy thing you could do is to keep some sort of versioning in your data. Right now your entry is defined like this:
{{:basketball, ~U[2025-06-06 12:00:00Z]}, "NBA 1"}
You can add another field, a tag or vsn, like:
{{:basketball, ~U[2025-06-06 12:00:00Z]}, "NBA 1", 0}
NOTE: Maybe, I'd consider using a struct here to make the code more readable (but the tuple is also fine).
Now, you can change your match query to also consider the tag or vsn. That way, when a new batch comes, you first put the new data with the new tag or vsn and then remove the old data with the previous one.
with_dynamic_cache(id(range), __MODULE__, :put_all, [buckets_with_keys])
with_dynamic_cache(id(range), __MODULE__, :delete_all, [match_query_with_prev_tag])
I think that way you ensure the cache always serves data, and the previous data is removed only after the new one has been inserted. The thing is, you will have to keep the tag/vsn somewhere, so every time a batch comes in, you update it. In other words, it is like tagging your entries and using a query to remove the old data based on the tag.
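For example, a minimal sketch of that tagging idea, assuming the value becomes a {event, vsn} tuple and the caller tracks the current vsn (replace_all/3, delete_old_spec, and new_vsn are hypothetical names, not from your code; the [true] body assumes the spec is handed straight to :ets.select_delete):

  def replace_all(category, events, new_vsn) do
    events_with_keys =
      Enum.map(
        events,
        &{{&1.subcategory, DateTime.to_unix(&1.timestamp)}, {&1, new_vsn}}
      )

    # insert the new batch first, tagged with the new version...
    with_dynamic_cache(id(category), __MODULE__, :put_all, [events_with_keys])

    # ...then delete every entry whose value carries a different (older) version
    delete_old_spec = [
      {
        Local.entry(key: :_, value: :"$1", touched: :_, ttl: :_),
        [{:"=/=", {:element, 2, :"$1"}, new_vsn}],
        [true]
      }
    ]

    with_dynamic_cache(id(category), __MODULE__, :delete_all, [delete_old_spec])
  end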
There may be another alternative, but it is a bit more complicated; I mean, it is still straightforward, but not as simple as the previous one. Based on how the generation process works under the hood, one idea is this: imagine you have two caches (using the same dynamic-cache approach you already have), like :cache1 and :cache2. They act as your application's generations, so you switch between them every time a new batch comes in. You store the new data in cache2, and then you remove the data from cache1 (maybe scheduling the removal a few seconds later, to avoid serving empty data while the switch takes place). Let me try to give you a better idea. Imagine you have an extra module that looks like this:
defmodule CacheManager do
  use GenServer

  defstruct cache1: nil, cache2: nil, cache_to_flush: nil

  ## API

  def start_link(opts) do
    GenServer.start_link(__MODULE__, opts, name: __MODULE__)
  end

  def switch_cache do
    GenServer.call(__MODULE__, :switch_cache)
  end

  @compile {:inline, get_current_cache: 0}
  def get_current_cache do
    :persistent_term.get(current_cache_key())
  end

  ## Callbacks

  @impl true
  def init(_opts) do
    cache1 = Module.concat([MyCache, Cache1])
    cache2 = Module.concat([MyCache, Cache2])

    :ok = put_current_cache(cache1)

    {:ok, %__MODULE__{cache1: cache1, cache2: cache2}}
  end

  @impl true
  def handle_call(:switch_cache, _from, %__MODULE__{cache1: cache1, cache2: cache2} = state) do
    current_cache = get_current_cache()

    # Point readers at the other cache; the one we were using becomes
    # the cache to flush once the timeout below fires.
    new_current_cache =
      case current_cache do
        ^cache1 ->
          :ok = put_current_cache(cache2)
          cache2

        ^cache2 ->
          :ok = put_current_cache(cache1)
          cache1
      end

    # Perhaps you can use an async task here? (instead of a timer)
    {:reply, new_current_cache, %{state | cache_to_flush: current_cache}, :timer.seconds(10)}
  end

  @impl true
  def handle_info(:timeout, %__MODULE__{cache_to_flush: _cache_to_flush} = state) do
    # TODO: flush the cache in `state.cache_to_flush` (e.g. run delete_all against it)
    {:noreply, %{state | cache_to_flush: nil}}
  end

  ## Private functions

  @compile {:inline, current_cache_key: 0}
  defp current_cache_key do
    {__MODULE__, :current_cache}
  end

  defp put_current_cache(cache_name) do
    # An atom is a single word, so this does not trigger a global GC
    :persistent_term.put(current_cache_key(), cache_name)
  end
end
So your replace function may look like so:
cache = CacheManager.switch_cache()
with_dynamic_cache(cache, __MODULE__, :put_all, [events_with_keys])
Anyway, it is another idea; of course, you will have to fix and improve some things, but maybe it helps?
By the way, I was checking, and the initial idea you had about using a TTL (via the :ttl option) may also work. As long as you ensure the TTL you set for a batch expires by the time a new batch comes in, it should also do the job. For example, a first batch comes in and you set a TTL of, let's say, one hour. Then, when a second batch comes in, if the current time exceeds the previously configured TTL, those entries will have expired. Anyway, you can also try it out and let me know. Thx!
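For instance, a minimal sketch of that TTL variant, assuming the batch interval is known upfront (one hour is just a placeholder) and reusing the replace_all shape from your snippet:

  def replace_all(category, events, batch_interval \\ :timer.hours(1)) do
    events_with_keys =
      Enum.map(events, &{{&1.subcategory, DateTime.to_unix(&1.timestamp)}, &1})

    # each entry gets a TTL equal to the batch interval, so by the time the
    # next batch is due the previous entries have already expired
    with_dynamic_cache(id(category), __MODULE__, :put_all, [
      events_with_keys,
      [ttl: batch_interval]
    ])
  end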
Hmm, nice idea. The cache manager maps onto the idea of having two separate ETS tables and redirecting reads to whichever is the latest. That might still be the cleanest option by far.
I tried playing with the TTL idea, and there is an interesting finding: it seems that :touched changes only when data is inserted, and it is identical across the entire set of data added via a single put_all, regardless of new_generation. So it got me thinking about this idea:
- write normally, into a new generation (the generation approach is still kept so we can expire stuff / remove it from ETS; otherwise things would just keep piling up)
- find latest touched when reading
- use it to return only the latest batch
defmodule Demo do
  # alias/require needed for Local.entry/1; new_generation/0, put_all/1,
  # generations/0 and all/1 are the cache helpers used elsewhere in this thread
  alias Nebulex.Adapters.Local
  require Local

  def replace_all(games) do
    new_generation()
    put_all(Enum.map(games, &{{&1.type, &1.when}, &1}))
  end

  def get_last_batch do
    last_touched =
      generations()
      |> Enum.flat_map(&:ets.lookup(&1, :ets.first(&1)))
      |> Enum.map(fn {:entry, _key, _value, touched, _ttl} -> touched end)
      |> Enum.max()

    spec = [
      {
        Local.entry(key: :_, value: :"$2", touched: :"$3", ttl: :_),
        [{:==, last_touched, :"$3"}],
        [:"$2"]
      }
    ]

    all(spec)
  end
end
It exploits the fact that touched changes only on insert, and then adds a matching filter to the ETS match spec so that only those records are used. As mentioned above, generations would still be needed to evict obsolete data.
I guess it all depends on how atomic put_all is. E.g. if records inserted via put_all were matchable even before all of them have been inserted, then this approach would not work, since it would start serving data before it is completely inserted.
Or, reverse the idea and serve the oldest batch, while ensuring we push the latest batch to be the only generation once inserting is complete:
def replace_all(games) do
  new_generation()
  put_all(Enum.map(games, &{{&1.type, &1.when}, &1}))
  new_generation()
end

def get_last_batch do
  last_touched =
    ...
    |> Enum.min()

  ...
end
As far as I understand this would:
- create new generation
- serving data would still return the older batch
- start inserting data
- still serving older batch
- create yet another new generation
- old data is evicted since that generation is now removed
- latest batch becomes the oldest generation
- and is served until new batch comes in
Does it make sense? Is it OK to depend on internals for touched? Alternatively, I can add a unique timestamp to record keys and use the same idea.
A small update. Using new_generation two times like above is a mistake, since it immediately evicts the old generation (when the 1st new_generation is executed). So this works:
- insert new data
- old data is still being served
- new generation
- old data is evicted
- new data becomes old data
- and continues to be served
defmodule Demo do
  # alias/require needed for Local.entry/1; new_generation/0, put_all/1,
  # generations/0 and all/1 are the cache helpers used elsewhere in this thread
  alias Nebulex.Adapters.Local
  require Local

  def replace_all(games) do
    put_all(Enum.map(games, &{{&1.type, &1.when}, &1}))
    new_generation()
  end

  def get_last_batch do
    oldest_touched =
      generations()
      |> Enum.flat_map(&:ets.lookup(&1, :ets.first(&1)))
      |> Enum.map(fn {:entry, _key, _value, touched, _ttl} -> touched end)
      |> Enum.min()

    spec = [
      {
        Local.entry(key: :_, value: :"$2", touched: :"$3", ttl: :_),
        [{:==, oldest_touched, :"$3"}],
        [:"$2"]
      }
    ]

    all(spec)
  end
end
P.S. I also figured that it is easy to just generate my own "inserted_at" and put it in the key, so I don't have to rely on touched internals.
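A rough sketch of that variant, assuming the key grows to a {type, when, inserted_at} 3-tuple and staying in the same cache-module context as the Demo above (the names and the System.system_time/1 source are illustrative, not from the code above):

  def replace_all(games) do
    # one marker per batch
    inserted_at = System.system_time(:millisecond)

    put_all(Enum.map(games, &{{&1.type, &1.when, inserted_at}, &1}))
    new_generation()
  end

  def get_last_batch do
    # serve the oldest batch still present: during replace_all that is the
    # previous complete batch, and once new_generation runs only the new
    # batch remains. Note this scans every key, unlike the :ets.first trick.
    current_batch =
      all(nil, return: :key)
      |> Enum.map(fn {_type, _when, inserted_at} -> inserted_at end)
      |> Enum.min()

    spec = [
      {
        Local.entry(key: {:_, :_, :"$1"}, value: :"$2", touched: :_, ttl: :_),
        [{:==, :"$1", current_batch}],
        [:"$2"]
      }
    ]

    all(spec)
  end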
Assuming the above idea is legit, here are a few last questions:
- I am using default values for the cache setup, records have ttl set to :infinity; is there anything else that might mess with eviction? (I need to keep serving cached data forever, until a new generation comes in)
- any issues with the piece that finds the oldest touched? (the ets lookup first idea)
Hey. Apologies for the late response.
I am using default values for cache setup, records have ttl set to :infinity, is there anything else that might mess up with eviction? (need to keep serving cached data forever, until new generation comes in)
The approach you mentioned above looks OK to me; I don't think you're missing anything.
any issues with the piece that finds the oldest touched? (ets lookup first idea)
Well, let's see: it is not ideal, because you are relying on the very internals of the implementation. BUT there may be special cases where you need to, like this one. If it gives you the behavior you are looking for, then it's fine. Again, ideally we don't want to go to that low level, but there are special cases. Another alternative (maybe) is, instead of finding the oldest touched (where you are traversing the two generations), you could do a delete_all with a match spec that deletes all the entries you don't need, or run all with a match spec that returns the entries with the latest timestamp, similar to what you are doing, but replacing the table traversal with a match query. Anyway, I'm not sure if it's possible for your use case, but it's another alternative to put on the table.
Does it make sense? Is it OK to depend on internals for touched? Alternatively, I can add a unique timestamp to record keys and use the same idea
I don't see any issue with using the touched field; it is a legit alternative too. Overall, I think using either touched or your own inserted_at should be fine.
Any feedback here? Let me know if we can close this issue or if you need more assistance with something. Thx!
Hey @cabol, apologies for the late reply (I was indisposed, sorry). All good, all clear, and I certainly have a number of ways to handle this. I will likely go with a custom timestamp, to avoid touching internals, and a custom read that fetches the set I am interested in at the time. Thank you for your time and this super cool lib, much appreciated 🙇🏼