RFC: Streamlit hosting - data persistence support
Situation/Problem
Each time Docq is deployed to Streamlit Cloud it wipes all the data, because Streamlit Cloud uses ephemeral storage. So this hosting option can only be used for a throwaway demo mode; it cannot be used for any real customer scenarios. Streamlit hosting sits at the low-cost, easy end of the hosting options, and such an option has a place in a customer's journey to adopting Docq.
If we can persist data, we can use this hosting option for a real, usable version for customers, suitable for serious trials/pilots.
Requirements
Components with disk persistence that need to be altered:
- SQLite - uses the standard disk-based persistence, requiring a mount point
- Datasource document list tracking - uses the Python standard lib `json` module, which requires a disk mount point
- LlamaIndex index - uses the standard disk-based persistence, requiring a mount point
- Manual file upload - `st.file_uploader` returns a byte array which is written to disk using a standard file handler

We should have the ability to configure the deployment to be S3-backed or filesystem-mount-backed for persistence.
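One way to meet the configurability requirement is a single switch that selects the `fsspec` protocol for the backing store. A minimal sketch; the env var name `DOCQ_PERSISTENCE_PROTOCOL` and the default are assumptions for illustration, not existing Docq config:

```python
import os

import fsspec


def get_persistence_fs() -> fsspec.AbstractFileSystem:
    """Return the filesystem selected by deployment config.

    DOCQ_PERSISTENCE_PROTOCOL is a hypothetical env var:
    "s3" for an S3 bucket (via s3fs), "file" for a local or
    mounted filesystem. Defaults to "file".
    """
    protocol = os.environ.get("DOCQ_PERSISTENCE_PROTOCOL", "file")
    return fsspec.filesystem(protocol)
```

Because every backend implements the same `AbstractFileSystem` interface, the rest of the code can open, read, and write paths without knowing which store was configured.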
Solution
The high-level approach is to use an S3 bucket as the backing store. This is not a drop-in approach; it is proposed because there doesn't seem to be a drop-in solution, i.e. Streamlit Cloud doesn't appear to support persistent filesystem mounts.
Each of the components we use that persist data will need some sort of support for S3 as a backing store. Below is the S3 backing solution for each component.
- SQLite - https://github.com/uktrade/sqlite-s3vfs. Does the concurrency model change? sqlite-s3vfs makes a point that it doesn't handle concurrent writes; that needs to be handled in the app.
- LlamaIndex - `StorageContext` can take an `fsspec` instance for persistence via the `fs` argument (e.g. `s3fs`).
- Document list - does the `json` module support a byte array/stream interface? If so, use that together with an S3 interface module like `s3fs`.
- Manual file uploads - switch to using `fsspec`/[s3fs](https://s3fs.readthedocs.io/en/latest/) rather than the standard file handler.

`fsspec` supports several backing stores like S3, local filesystem, GCS, etc.
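To illustrate the last two items, a minimal sketch of routing the document-list JSON and uploaded bytes through `fsspec` instead of a standard file handler. The `memory://` backend stands in for `s3://` so the sketch runs without AWS credentials; the paths and file names are illustrative, not Docq's actual layout:

```python
import json

import fsspec

# In production this would be fsspec.filesystem("s3") (via s3fs);
# the in-memory backend exercises the same interface locally.
fs = fsspec.filesystem("memory")

# Datasource document list: json.dump works with any file-like
# object, so an fsspec-opened file is a drop-in replacement.
doc_list = [{"name": "report.pdf", "indexed": True}]
with fs.open("/docq/doc_list.json", "w") as f:
    json.dump(doc_list, f)

# Manual file upload: st.file_uploader returns bytes, which can be
# written through the same interface.
uploaded_bytes = b"%PDF-1.4 ..."  # stand-in for st.file_uploader output
with fs.open("/docq/uploads/report.pdf", "wb") as f:
    f.write(uploaded_bytes)

# Reading back goes through the same abstraction.
with fs.open("/docq/doc_list.json", "r") as f:
    restored = json.load(f)
```

Swapping the backend is then a one-line change (`"memory"` to `"s3"`), which is what makes the S3-or-mount requirement above tractable.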
Alternatives
Simple persistent filesystem mount
This would be the simplest solution and therefore ideal: it would require no code changes. However, there doesn't seem to be an option for this in Streamlit Cloud.
Streamlit file connections
This is unlikely to work given that none of the components we use that persist data support this interface out of the box.
The Streamlit data connections feature abstracts over `s3fs`, and hence `fsspec`, specifically when using S3 via the Streamlit file connection and `s3fs`.
@cwang looked into the whole Streamlit file persistence question. I can't see an easy drop-in option; basically, a persistent filesystem mount doesn't seem to be supported. The above is the best alternative I can think of.
SQLite is the biggest unknown and risk, given the sqlite-s3vfs comment:
Python virtual filesystem for SQLite to read from and write to S3.
No locking is performed, so client code must ensure that writes do not overlap with other writes or reads. If multiple writes happen at the same time, the database will probably become corrupt and data be lost.
Should be able to run it with render.com, utilising the mounted shared disk.
Yes. But it would be free on Azure.