
Implement a file-based database for simulation results

Open nelimee opened this issue 1 year ago • 5 comments

Is your feature request related to a problem? Please describe.

The goal of our initiative is to generate graphs such as

[Image: example graph]

In the above graph, each point is:

  • one quantum circuit, generated by tqec for a given value of k, that can be represented as a stim file,
  • a large number of simulations performed by stim.

One problem is that Stim simulations are not free, and computing one point from the above graph can take minutes to hours of computational time.

Currently, we have no clever way of storing such data, meaning that the stim simulations have to be re-done each time we want to generate a new graph.

Describe the solution you'd like

We should have a database-like way of storing simulation data. There are multiple requirements:

  • we should be able to retrieve easily already existing results,
  • data should be written on disk,
  • we should be able to add new results to existing ones (typically: start a simulation with 1000 shots to check the overall look of the plot and that there is no obvious mistake, and once such mistakes have been corrected, be able to launch 999000 more shots to reduce the error bars),
  • we should be able to remove existing results, but this should be hard to do (i.e., be wary of accidental data loss)

Note that simulation results might be quite heavy in terms of memory, so an optimised storage would be a plus.

nelimee avatar Jul 25 '24 09:07 nelimee

We can think about utilizing an existing sampling tool like sinter. But as far as I know, sinter currently provides no API to store the intermediate sampled detectors/observables to files.

inmzhang avatar Jul 26 '24 08:07 inmzhang

We can think about utilizing an existing sampling tool like sinter. But as far as I know, sinter currently provides no API to store the intermediate sampled detectors/observables to files.

Yep, the goal of this issue is not the generation (which will very likely be handled by sinter as you note) but rather the storage of generated results.

Also, even if sinter had the possibility to store to files, we would need to have a clear organisation to allow easy retrieval, modification and deletion, so in any case we will need at least helper methods to do that.

Note that it looks a lot like the work done by a database, that might be a path to the solution.

nelimee avatar Jul 26 '24 08:07 nelimee

Craig: can you comment on how Stim/sinter simulation results can be systematically stored so that one could later gather additional data for a plot to improve its statistics or explore a wider range of code distances and error rates?


afowler avatar Jul 26 '24 13:07 afowler

Craig: can you comment on how Stim/sinter simulation results can be systematically stored so that one could later gather additional data for a plot to improve its statistics or explore a wider range of code distances and error rates?

Whenever I have a task like that, I really follow the database point of view:

  1. I try to find a set of small data points that uniquely identify an "experiment" (in database terms, the primary key),
  2. I try to store in the "experiment" (i.e., the value associated to the primary key) whatever I may need in the future.

In this specific case, I think that the primary key will be composed of:

  1. an algorithmically generated (hash-like) key representing the experiment being benchmarked. For the moment, with the limited use-cases we explicitly target, I guess that we can compute such a hash (or a unique value if we really want to avoid any collision) by only considering:

    • each block identifier ("xzx", "zxz", "xozh", ...),
    • each block position (i.e., the position of its origin, that is uniquely defined for each block).

    These can be directly obtained from the SketchUp file representing the computation and should be:

    1. robust enough in the sense that if the computation does not change, the value should not change,
    2. sensitive enough to avoid representing 2 different computations by the same value.
  2. the value of k (determining the size of our logical qubits, and code distance),

  3. the noise level. This one might be tricky because of floating-point representation, but there are ways around it that I think should be satisfactory for this use case, e.g., representing the noise level e = powerOfTenMantissa * 10**(-negativePowerOfTen) as a tuple (powerOfTenMantissa, negativePowerOfTen), where 0 <= powerOfTenMantissa <= 1 can be represented as a fraction.

The data stored will have to include the outputs of stim simulations (depending on what we need, direct measurements or detection events), and I think some metadata could be added to such a value such as:

  • date of data generation,
  • library versions used to generate the data,
  • custom annotations/tags provided by the user (e.g., "confidential", "internal use only", "public") to be able to filter out some data,
  • ...

In terms of format, and because the main data we will store is binary anyway, I do not have any preferences and it can be anything (a real database, a file/folder-based storage, ...).
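As a rough illustration, the key computation described above could be sketched like this (the function name, the 3D origin encoding, and the use of SHA-256 are hypothetical choices, not existing tqec APIs):

```python
import hashlib
from fractions import Fraction


def experiment_key(
    blocks: list[tuple[str, tuple[int, int, int]]],
    k: int,
    noise: tuple[Fraction, int],
) -> str:
    """Compute a primary key for an experiment.

    `blocks` lists (block identifier, origin position) pairs, `k` determines
    the code distance, and `noise` is (powerOfTenMantissa, negativePowerOfTen)
    so that e = powerOfTenMantissa * 10**(-negativePowerOfTen).
    """
    # Sort the blocks so the key is robust to iteration order while staying
    # sensitive to any change in the computation itself.
    canonical = repr((sorted(blocks), k, noise))
    return hashlib.sha256(canonical.encode()).hexdigest()
```

Two runs over the same computation then produce the same key, while changing any block, the value of k, or the noise level changes it.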

nelimee avatar Jul 26 '24 14:07 nelimee

Sinter always hashes the circuit it was asked to simulate and the decoder it was asked to use, producing a cryptographically strong id. This id is stored alongside any statistics. When you merge multiple files, you match up statistics by this id when deciding whether or not to combine two entries into one entry.

I don't think "how to store stats" is particularly important to the goal of input-skeleton-output-circuit. That's later.


afowler avatar Jul 26 '24 22:07 afowler

Since this is not regarded as an important issue, it would make sense to restrict it to some minimal effort. Here is my suggestion:

I would like to mention one obstacle first: when finally viewing the results, you want to associate the right observable with your experiment. Following the user guide, one obtains a nice plot which also shows the observable (correlation surface) as an inset.

Right now, the user is required to drag around the list of observables and must know which experiment belongs to which observable. Currently this is done by comparing list indices, which is error-prone once results are stored on disk: imagine you store the results, then update tqec, and for some reason the correlation surfaces are generated in a different order.

A possible solution would be to also store a (persistent) hash of the observable (similar to the strong id of the experiment itself), or maybe better of the correlation surface, in the json_metadata of each sinter task. I did not check, though, whether one can reliably do this for the correlation surface.

So before implementing this here are a few questions and suggestions (@nelimee):

  • What do you think about including a hash of the correlation surface into each sinter task?
  • I would store everything into a single csv file (also for more than one observable).
  • I would implement storing data as a standalone function which just takes a (flat) list of sinter tasks. Each task corresponds to a line in the csv file. If the file exists and an added line matches an already existing line by strong id, then I think one can merge them in some sensible way (adding the shots, errors, ...).
  • For parsing the csv file one could make life easy and use the dataframe library polars. The question is whether adding a dependency would be OK. One could also do it without, but it would be good to know this before estimating how much work that is.

This is how such a csv file would look (except that the json_metadata could also need a strong id for the correlation surface):

shots,errors,discards,seconds,decoder,strong_id,json_metadata,custom_counts
  10000000,      4412,         0,    11.6,pymatching,9ab7fe2f490b24f06a5ca3e56f4d76fdbf6229555c681277f6d6d945e91057bc,"{""d"":3,""p"":0.00021544346900318845,""r"":3}",
  10000000,       964,         0,    10.1,pymatching,206c18ed87c385578694915fb409e5a84644a1e27767d6702c50c75fe7f35ccd,"{""d"":3,""p"":9.999999999999999e-05,""r"":3}",
     14609,      5336,         0,   0.322,pymatching,5172c07caf3e752e8d85edd4274b76d84bc2fe1157d1a5ffe1e85b31b08fd34f,"{""d"":3,""p"":0.01,""r"":3}",


rainij avatar Jan 13 '25 13:01 rainij

Since this is not regarded as an important issue

This is not the current focus, as other issues are deemed more important. But I agree that this is an important issue :)

  • What do you think about including a hash of the correlation surface into each sinter task?

Seems like a very good idea. Observables are simply unordered lists of measurements, i.e., they can ultimately be re-phrased as a list of elements, each containing a timestep (an int) and a qubit location (a tuple[int, int]). Computing a reliable hash should be quite easy: sort the list (whatever order is picked, as long as it is deterministic) and hash the integers.
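Following that recipe, a sketch could look like this (the function name and the exact canonical string format are arbitrary, illustrative choices):

```python
import hashlib


def observable_hash(measurements: list[tuple[int, tuple[int, int]]]) -> str:
    """Hash an observable given as an unordered list of
    (timestep, (x, y)) measurement records."""
    # Sorting first makes the hash independent of the order in which the
    # correlation surfaces happen to be generated.
    canonical = ",".join(f"{t}:{x}:{y}" for t, (x, y) in sorted(measurements))
    return hashlib.sha256(canonical.encode()).hexdigest()
```

The resulting hex digest could then be stored in the json_metadata of each sinter task, as suggested above.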

Note that another solution could be to only include one observable per stim file. In this case, the strong hash already stored by sinter would also include the observable. I do not think this is the best idea, as it would lead to generating nearly the same circuit multiple times and would remove the possibility to fix #364, but it is an alternative.

  • I would store everything into a single csv file (also for more than one observable).

I have no strong feelings for or against that.

  • I would implement storing data as a standalone function which just takes a (flat) list of sinter tasks. Each task corresponds to a line in the csv file. If the file exists and an added line matches an already existing line by strong id, then I think one can merge them in some sensible way (adding the shots, errors, ...).

Seems reasonable. I think that sinter does not merge lines directly but rather just appends new results to the existing file, the merging being done when reading.
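The merge-on-read behaviour can be sketched with plain dicts (with sinter one would work with sinter.TaskStats objects instead; the field list below is illustrative):

```python
def merge_stats(rows: list[dict]) -> list[dict]:
    """Merge result lines that share the same strong_id by summing
    their per-run counters (shots, errors, ...)."""
    merged: dict[str, dict] = {}
    for row in rows:
        key = row["strong_id"]
        if key in merged:
            # Same circuit + decoder: accumulate the statistics.
            for field in ("shots", "errors", "discards", "seconds"):
                merged[key][field] += row[field]
        else:
            merged[key] = dict(row)
    return list(merged.values())
```

This matches the workflow from the issue description: run 1000 shots first, then append 999000 more shots for the same strong_id and have them combined on read.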

  • For parsing the csv file one could make life easy and use the dataframe library polars. The question is whether adding a dependency would be OK. One could also do it without, but it would be good to know this before estimating how much work that is.

Honestly, adding a dependency just to parse a CSV file that is not even expected to be very large seems overkill. I would first try to parse it with the built-in csv module, or even by writing our own functions.
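A minimal parser using only the built-in csv and json modules might look like this (the function name and which fields get converted are illustrative):

```python
import csv
import json
from pathlib import Path


def read_stats_csv(path: Path) -> list[dict]:
    """Read a sinter-style stats CSV into a list of plain dicts."""
    rows = []
    with open(path, newline="") as f:
        # skipinitialspace tolerates space-padded columns like those in the
        # example file above.
        for row in csv.DictReader(f, skipinitialspace=True):
            row["shots"] = int(row["shots"])
            row["errors"] = int(row["errors"])
            row["json_metadata"] = json.loads(row["json_metadata"])
            rows.append(row)
    return rows
```

The quoted json_metadata column (with doubled inner quotes) is handled by the csv module's standard quoting rules, so no custom parsing is needed.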

This is how such a csv file would look (except that the json_metadata could also need a strong id for the correlation surface):

If everything can be done by using sinter.TaskStats data format and the json_metadata field that would be perfect.

nelimee avatar Jan 13 '25 16:01 nelimee

OK, sounds like a good mini-task to get into the workflow of this project. I plan to open at least a draft PR by Sunday (2025-01-19). If I do not, everybody is encouraged to remind me of this.

I will basically do what I proposed, except that I will not use an additional dependency (like polars).

@nelimee you mentioned that sinter can export and import csv. Can you point me to the API reference (if possible)? I know that a sinter task can print itself as a csv line, but I did not see a full read/write routine. It could be that I will not use it, but it might still be good to know what already exists.

rainij avatar Jan 13 '25 18:01 rainij

@nelimee would you assign me to the issue? It seems I do not have the rights to do that.

rainij avatar Jan 13 '25 18:01 rainij

OK, sounds like a good mini-task to get into the workflow of this project. I plan to open at least a draft PR by Sunday (2025-01-19). If I do not, everybody is encouraged to remind me of this.

I will basically do what I proposed, except that I will not use an additional dependency (like polars).

@nelimee you mentioned that sinter can export and import csv. Can you point me to the API reference (if possible)? I know that a sinter task can print itself as a csv line, but I did not see a full read/write routine. It could be that I will not use it, but it might still be good to know what already exists.

You can have a look at the documentation of sinter.read_stats_from_csv_files. Basically:

from pathlib import Path
import sinter

def save_csv(tasks: list[sinter.TaskStats], path: Path) -> None:
    with open(path, "w") as f:
        # Write the standard sinter CSV header, then one line per task
        # (a sinter.TaskStats renders itself as a CSV line via str()).
        f.write(sinter.CSV_HEADER + "\n")
        for task_stats in tasks:
            f.write(str(task_stats) + "\n")

and you can read back using

stats = sinter.read_stats_from_csv_files("./stat1.csv", "./stat2.csv")

nelimee avatar Jan 13 '25 19:01 nelimee