
🗺️ Persistence

Open • mikeldking opened this issue 2 years ago • 6 comments

As a user of Phoenix, I would like a persistent backend, notably a way to:

  • Resume Phoenix on a previous collection of data
  • Keep track of evaluation results

Spikes

  • [x] #2158
  • [x] #2335
  • [x] #2704
  • [x] #2742
  • [x] #2793

Server

  • [x] #2336
  • [x] #2512
  • [x] #2546
  • [x] #2687
  • [x] #2547
  • [x] #2548
  • [x] #2559
  • [x] #2560
  • [x] #2628
  • [x] #2721
  • [x] #2743
  • [x] #2753
  • [x] #2776
  • [x] #2777
  • [x] #2778
  • [x] #2806
  • [x] #2781
  • [x] #2838
  • [x] #2807
  • [x] #2779
  • [x] #2780
  • [x] #2782
  • [x] #2786
  • [x] #2787
  • [x] #2810
  • [x] #2539
  • [x] #2877
  • [x] #2878
  • [x] #2879
  • [x] #2957
  • [x] #2961
  • [x] #2963
  • [x] #2930
  • [x] #2931
  • [x] #2932
  • [x] [persistence] remove defaulting to memory
  • [x] [persistence] promote db config out of experimental
  • [x] #2894
  • [x] #2901
  • [x] #2909
  • [ ] #2906
  • [x] #2913
  • [x] #2925
  • [x] #2926
  • [x] #2922
  • [x] #2813
  • [x] #2843
  • [x] #2889
  • [x] #2929
  • [x] #2969
  • [x] #2975

UI

  • [x] #2882
  • [x] #2883
  • [x] #2540

Metrics / Observability

  • [x] #2615
  • [ ] #2712
  • [x] #2927
  • [x] #2970

Infra

  • [x] #2676
  • [x] #2783
  • [x] #2794
  • [x] #2795
  • [x] #2811
  • [x] #2815
  • [x] #2921

Remote Session Management

  • [x] #2136
  • [x] #2137
  • [x] #2091
  • [x] #2138

Performance

  • [ ] #3003
  • [x] #3016
  • [x] #3017
  • [ ] #3018
  • [ ] #3019
  • [ ] #3021
  • [ ] #3027
  • [x] #3028
  • [x] #2123
  • [x] #3033
  • [ ] #3049

Notebook-Side Persistence

  • [x] #2814

Docs

  • [ ] #2935
  • [ ] #2936

Breaking Changes

  • [x] #2933
  • [x] #2934

Testing

  • [ ] #2880
  • [ ] #2881
  • [x] #3040
  • [ ] Make sure delete cascades work for data retention
  • [ ] Make sure delete cascades work for data retention on evals @axiomofjoy
  • [ ] Test run some migrations on named constraints
  • [ ] Test pandas and parquet endpoints and interactions with the client
  • [ ] Filters and Sorting on the UI
  • [ ] SpanQuery compatibility
  • [ ] Metadata filtering
  • [ ] Data volume
  • [ ] Data size (large context windows, large IO)
  • [ ] Embedding querying

Open Questions

  • Storage of embeddings
  • Controlling the SQLite version explicitly
  • Is trace retention / management needed?
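
On the SQLite version question, one input is which SQLite build Python itself links against, since that caps the available SQL features; a quick illustrative check (not from the original issue):

import sqlite3

# Version of the SQLite C library the stdlib sqlite3 module is linked
# against; this, not the Python version, determines available features.
print(sqlite3.sqlite_version)       # e.g. "3.45.1"
print(sqlite3.sqlite_version_info)  # e.g. (3, 45, 1)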

mikeldking • Oct 31 '23 21:10

I think at least for the second need (keeping track of evaluation results), you can see here: https://docs.arize.com/phoenix/integrations/llamaindex#traces

aazizisoufiane • Nov 15 '23 13:11

I am investigating using Parquet as the file format. Here's a snippet to add custom metadata to a Parquet file:

"""
Snippet to write custom metadata to a single parquet file.

NB: "Pyarrow maps the file-wide metadata to a field in the table's schema
named metadata. Regrettably there is not (yet) documentation on this."

From https://stackoverflow.com/questions/52122674/how-to-write-parquet-metadata-with-pyarrow
"""

import json

import pandas as pd
import pyarrow
from pyarrow import parquet

dataframe = pd.DataFrame(
    {
        "field0": [1, 2, 3],
        "eval0": ["a", "b", "c"],
    }
)
OPENINFERENCE_METADATA_KEY = b"openinference"
openinference_metadata = {
    "version": "v0",
    "evaluation_ids": ["eval0", "eval1"],
}

# from_pandas attaches pandas' own conversion metadata to the schema.
original_table = pyarrow.Table.from_pandas(dataframe)
print("Metadata:")
print("=========")
print(original_table.schema.metadata)
print()

# Tables are immutable, so replace_schema_metadata returns a new table.
# Merging in the existing metadata avoids clobbering the pandas entries;
# keys and values are stored as bytes, and str values are encoded on write.
updated_write_table = original_table.replace_schema_metadata(
    {
        OPENINFERENCE_METADATA_KEY: json.dumps(openinference_metadata),
        **original_table.schema.metadata,
    }
)
parquet.write_table(updated_write_table, "test.parquet")
updated_read_table = parquet.read_table("test.parquet")
print("Metadata:")
print("=========")
print(updated_read_table.schema.metadata)
print()

# Round trip: after dropping the custom key, the metadata read back from
# disk matches the original table's metadata.
updated_metadata = updated_read_table.schema.metadata
updated_metadata.pop(OPENINFERENCE_METADATA_KEY)
assert updated_metadata == original_table.schema.metadata

axiomofjoy • Dec 22 '23 04:12

Notes on Parquet and PyArrow:

  • Large "row groups" are suggested for quicker analytical queries (on the order of a GB). Many of our datasets will be far smaller than this. There are potentially performance consequences at query time for writing small Parquet files frequently.
  • Parquet files are immutable. As far as I can tell, there is no notion of updating just the file metadata.
  • It's possible to augment the metadata of individual Parquet files (see above). Another pattern used by Spark and Dask actually is to write a separate metadata file at _common_metadata to describe all the Parquet files in an Arrow dataset (a single metadata file describing multiple Parquet files).
  • Arrow supports directory partitioning. It looks straightforward to partition, for example, on date.
  • Arrow also provides nice file system interfaces to the various cloud storage providers.
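
To make the last two bullets concrete, here is a minimal sketch of date-partitioned writes with pyarrow.dataset; the directory layout, column names, and values are illustrative assumptions, not from Phoenix:

import pyarrow as pa
import pyarrow.dataset as ds

# Hypothetical trace data; the "date" column drives the partitioning.
table = pa.table(
    {
        "date": ["2023-12-01", "2023-12-01", "2023-12-02"],
        "span_id": ["a", "b", "c"],
        "latency_ms": [12.5, 40.0, 7.1],
    }
)

date_partitioning = ds.partitioning(
    pa.schema([("date", pa.string())]), flavor="hive"
)

# Writes traces/date=2023-12-01/... and traces/date=2023-12-02/...
ds.write_dataset(
    table,
    base_dir="traces",
    format="parquet",
    partitioning=date_partitioning,
    existing_data_behavior="overwrite_or_ignore",
)

# Filtering on the partition column only scans the matching directory
# (partition pruning).
dataset = ds.dataset("traces", format="parquet", partitioning=date_partitioning)
print(dataset.to_table(filter=ds.field("date") == "2023-12-01"))

For cloud storage, the same write_dataset call accepts a filesystem argument (e.g. pyarrow.fs.S3FileSystem), and the _common_metadata pattern from the third bullet can be layered on with pyarrow.parquet.write_metadata.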

axiomofjoy • Dec 22 '23 04:12

@axiomofjoy what kind of backends will you target? I see some code related to file backends, but why not SQL databases (given .to_sql for dataframes, and probably other/better methods) or the native JSON support in most solutions? Given Phoenix's coupling with RAG, most people will already have a vector DB that should work.

stdweird • Mar 14 '24 15:03
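
For reference, the .to_sql route mentioned above is indeed a one-liner with pandas; a minimal sketch, assuming SQLite and an illustrative spans table:

import sqlite3

import pandas as pd

# Hypothetical frame standing in for exported span data.
df = pd.DataFrame({"span_id": ["a", "b"], "latency_ms": [12.5, 40.0]})

conn = sqlite3.connect("phoenix.db")
# Write the frame to a table, then read it back to verify the round trip.
df.to_sql("spans", conn, if_exists="replace", index=False)
print(pd.read_sql("SELECT * FROM spans", conn))
conn.close()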

@mikeldking Can you provide an update for @stdweird?

axiomofjoy • Mar 21 '24 16:03

@axiomofjoy what kind of backends will you target? I see some code related to file backends, but why not SQL databases (given .to_sql for dataframes, and probably other/better methods) or the native JSON support in most solutions? Given Phoenix's coupling with RAG, most people will already have a vector DB that should work.

@stdweird - good point. I think we see some limitations with SQL backends, so we are currently benchmarking different backends. In general we will probably have a storage interface so you will be able to choose your storage mechanism, but for now we are keeping the backend pretty lean and figuring out the interface as we go.

mikeldking • Mar 21 '24 16:03

🥳

mikeldking • May 13 '24 15:05