RFC: Streamlit hosting - data persistence support
Situation/Problem
Each time Docq is deployed to Streamlit Cloud it wipes all the data, because Streamlit Cloud uses ephemeral storage. So this hosting option can only be used for a throwaway demo mode; it cannot be used for any real customer scenarios. Streamlit hosting sits at the low-cost, easy end of the hosting options, and such an option has a place in a customer's journey to adopting Docq.
If we can persist data, we can use this hosting option for a real, usable version for customers, suitable for serious trials/pilots.
Requirements
Components with disk persistence that need to be altered:
- SQLite - uses the standard disk-based persistence, requiring a mount point
- Datasource document list tracking - uses the Python standard lib `json` module, which requires a disk mount point
- LlamaIndex index - uses the standard disk-based persistence, requiring a mount point
- Manual file upload - `st.file_uploader` returns a byte array which is written to disk using a standard file handler

We should have the ability to configure the deployment to be S3-backed or filesystem-mount-backed for persistence.
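One way to meet the configurability requirement is a single switch that selects the `fsspec` protocol for the backing store. A minimal sketch; the env var name `DOCQ_PERSISTENCE_PROTOCOL` and the default are assumptions for illustration, not existing Docq config:

```python
import os

import fsspec


def get_persistence_fs() -> fsspec.AbstractFileSystem:
    """Return the filesystem selected by deployment config.

    DOCQ_PERSISTENCE_PROTOCOL is a hypothetical env var:
    "s3" for an S3 bucket (via s3fs), "file" for a local or
    mounted filesystem. Defaults to "file".
    """
    protocol = os.environ.get("DOCQ_PERSISTENCE_PROTOCOL", "file")
    return fsspec.filesystem(protocol)
```

Because every backend implements the same `AbstractFileSystem` interface, the rest of the code can open, read, and write paths without knowing which store was configured.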
Solution
The high-level approach is to use an S3 bucket as the backing store. This is not a drop-in approach; it is proposed because there doesn't seem to be a drop-in solution, i.e. Streamlit Cloud doesn't appear to support persistent filesystem mounts.
Each of the components we use that persist data will need some sort of support for S3 as a backing store. Below is the S3 backing solution for each component.
- SQLite - https://github.com/uktrade/sqlite-s3vfs. Does the concurrency model change? sqlite-s3vfs makes a point that it doesn't handle concurrent writes; that needs to be handled in the app.
- LlamaIndex - `StorageContext` can take an `fsspec` instance for persistence via the `fs` argument (e.g. `s3fs`).
- Document list - does the `json` module support a byte array/stream interface? If so, use that together with an S3 interface module like `s3fs`.
- Manual file uploads - switch to using `fsspec`/[s3fs](https://s3fs.readthedocs.io/en/latest/) rather than the standard file handler.

`fsspec` supports several backing stores like S3, local filesystem, GCS, etc.
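To illustrate the last two items, a minimal sketch of routing the document-list JSON and uploaded bytes through `fsspec` instead of a standard file handler. The `memory://` backend stands in for `s3://` so the sketch runs without AWS credentials; the paths and file names are illustrative, not Docq's actual layout:

```python
import json

import fsspec

# In production this would be fsspec.filesystem("s3") (via s3fs);
# the in-memory backend exercises the same interface locally.
fs = fsspec.filesystem("memory")

# Datasource document list: json.dump works with any file-like
# object, so an fsspec-opened file is a drop-in replacement.
doc_list = [{"name": "report.pdf", "indexed": True}]
with fs.open("/docq/doc_list.json", "w") as f:
    json.dump(doc_list, f)

# Manual file upload: st.file_uploader returns bytes, which can be
# written through the same interface.
uploaded_bytes = b"%PDF-1.4 ..."  # stand-in for st.file_uploader output
with fs.open("/docq/uploads/report.pdf", "wb") as f:
    f.write(uploaded_bytes)

# Reading back goes through the same abstraction.
with fs.open("/docq/doc_list.json", "r") as f:
    restored = json.load(f)
```

Swapping the backend is then a one-line change (`"memory"` to `"s3"`), which is what makes the S3-or-mount requirement above tractable.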
Alternatives
Simple persistent filesystem mount
This would be the simplest solution and therefore ideal: it would require no code changes. However, there doesn't seem to be an option for this in Streamlit Cloud.
Streamlit file connections
This is unlikely to work given that none of the components we use that persist data support this interface out of the box.
The Streamlit data connections feature abstracts over `s3fs`, and hence `fsspec`, specifically when using S3 via the Streamlit file connection and `s3fs`.
@cwang looked into the whole Streamlit file persistence question. I can't see an easy drop-in option; basically, a persistent filesystem mount doesn't seem to be supported. The above is the best alternative I can think of.
SQLite is the biggest unknown and risk, given the sqlite-s3vfs comment:
Python virtual filesystem for SQLite to read from and write to S3.
No locking is performed, so client code must ensure that writes do not overlap with other writes or reads. If multiple writes happen at the same time, the database will probably become corrupt and data be lost.
Should be able to run it with render.com, utilising the mounted shared disk.
Yes. But it would be free on Azure.