RFC: [WIP] Change persistence folder structure
Situation
Data persistence on disk isn't consistently separated by scope of ownership.
Current filesystem structure:
- /index/PERSONAL/{user_id}/ - index files for the Ask Your Docs feature
- /index/SHARED/{org_id}/{space_id}/ - index files for Spaces
- /sqlite/PERSONAL/{user_id}/usage.db - retrieval and LLM request and response data (chat history etc.) for all interactions
- /sqlite/SHARED/system.db - system data and metadata (orgs, users, user_groups, spaces, and space_groups)
- /upload/PERSONAL/{user_id}/ - the Ask Your Docs feature is hard-coded to MANUAL_UPLOAD documents. Those files are persisted here.
- /upload/SHARED/{org_id}/{space_id} - file uploads for any spaces with datasource = MANUAL_UPLOAD are persisted here.
Database table to file mapping:
- usage.db: settings (user scoped), history_{feature_name}, history_thread_{feature_name}
- system.db: orgs, org_members, users, settings (non-user scoped), space_groups, space_group_members, spaces, space_access, user_groups, user_group_members
Tables with joins:
- orgs <> org_members
- org_members <> users
- spaces <> space_access <> users
- spaces <> space_group_members
- users <> user_group_members
This structure isn't ideal given the addition of Orgs and upcoming features such as public chatbots and restructuring the Ask Your Docs functionality as a personal org.
- The semantics of space type (PERSONAL or SHARED) no longer hold for determining persistence location.
- Using FeatureType to pass a user_id context around, and then using that to decide on a persistence location, is also not great.
Goals and Requirements
- Reduce the risk of org-owned content data (e.g. confidential docs that are indexed, and the chat history against them) leaking across org boundaries.
- Make it easier to migrate an entire org from a multi-tenant instance to a dedicated instance.
- Usage data (chat history etc): always strictly scoped to an org and user, hence personal.
- Org System data (user_groups, spaces metadata etc.) - can be shared across multiple users but always strictly scoped to an org
- Global System data - users and org_members are the only system-wide shared data i.e. accessible across orgs.
- Presentation layer and domain concepts (such as features and space types) should not be directly coupled to the persistence layer system logic
Proposal
Structure folders based on a name for the persistence system followed by one or more keys that identify the unique owner of the data. The filename describes the data.
Pattern:
/{persistence_system_name}/{owner_scope_key_1}/.../{owner_scope_key_n}/{filename}
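The pattern above can be sketched as a small path builder. This is a minimal illustration, not the project's implementation: the function name `build_persistence_path` is hypothetical, and it assumes that the scope folder (for example `orgs`) is passed as the first owner scope key and that path segments must not contain `/` or `..`.

```python
from pathlib import PurePosixPath


def build_persistence_path(system: str, owner_keys: list[str], filename: str = "") -> str:
    """Join a persistence system name, owner scope keys, and an optional filename.

    Hypothetical helper: the name and the validation rules are assumptions.
    """
    parts = [system, *owner_keys]
    if filename:
        parts.append(filename)
    for part in parts:
        # Reject empty segments and segments that could escape the owner's folder.
        if not part or part in {".", ".."} or "/" in part:
            raise ValueError(f"invalid path segment: {part!r}")
    return str(PurePosixPath("/", *parts))


# Example ids "1234" and "5678" are made up for illustration.
print(build_persistence_path("sqlite", ["orgs", "1234"], "system.db"))
print(build_persistence_path("index", ["orgs", "1234", "5678"]))
```

Rejecting unsafe segments up front is one way to keep a malformed key from resolving into another owner's folder, which lines up with the goal of preventing leaks across org boundaries.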
Concrete changes:
- /index/SHARED/{org_id}/{space_id}/ --> /index/orgs/{org_id}/{space_id}/
- [x] /index/PERSONAL/{user_id}/ --> Same as above. The Ask Your Docs changes are to be achieved by giving every user a personal org. A space that isn't shared with any other users is private.
- /index/THREAD/{org_id}/{space_id}/ --> /index/personal/{user_id}/{space_id}
- /sqlite/PERSONAL/{user_id}/usage.db --> /sqlite/personal/{user_id}/usage.db - authenticated user usage
- /sqlite/SHARED/system.db --> /sqlite/global/system.db - global system
- new --> /sqlite/orgs/{org_id}/system.db - org system
- settings, org and user scope - both (?) should be stored in the same settings table in /sqlite/orgs/{org_id}/system.db
- settings, global scope - /sqlite/global/system.db
- /upload/SHARED/{org_id}/{space_id} --> /upload/org/{org_id}/{space_id}
- /upload/PERSONAL/{user_id}/ --> Same as above, because of the changes to Ask Your Docs.
- /upload/THREAD/{org_id}/{space_id} --> /upload/personal/{user_id}/{space_id}
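Some of the renames listed above are purely mechanical, and a sketch can make them concrete. The code below is illustrative only and is not a migration tool: `translate_legacy_path` is a hypothetical name, and it covers only the rewrites that need no extra context. The PERSONAL index and upload folders (which need the user's personal org and space ids) and the THREAD folders (which need a user_id) cannot be derived from the old path alone, so they are left unmapped.

```python
import re
from typing import Optional

# (legacy pattern, replacement) pairs taken from the list of concrete changes.
# Only rewrites that need no information beyond the old path are included.
_MECHANICAL_REWRITES = [
    (re.compile(r"^/index/SHARED/(?P<org>[^/]+)/(?P<space>[^/]+)/?$"),
     r"/index/orgs/\g<org>/\g<space>/"),
    (re.compile(r"^/sqlite/PERSONAL/(?P<user>[^/]+)/usage\.db$"),
     r"/sqlite/personal/\g<user>/usage.db"),
    (re.compile(r"^/sqlite/SHARED/system\.db$"),
     "/sqlite/global/system.db"),
    (re.compile(r"^/upload/SHARED/(?P<org>[^/]+)/(?P<space>[^/]+)/?$"),
     r"/upload/org/\g<org>/\g<space>"),
]


def translate_legacy_path(path: str) -> Optional[str]:
    """Return the new-style path, or None when the rename needs more context."""
    for pattern, replacement in _MECHANICAL_REWRITES:
        if pattern.match(path):
            return pattern.sub(replacement, path)
    return None


# Example ids are made up for illustration.
print(translate_legacy_path("/index/SHARED/7/42/"))
print(translate_legacy_path("/index/PERSONAL/99/"))
```

Returning `None` for the context-dependent cases keeps the sketch honest about which renames would need a lookup against system data.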
New use cases:
- public chat bot --> /sqlite/personal/anon-{user_id}/usage.db - anonymous user usage. user_id is a generated GUID.
- experiment projects (ideally we just want to be able to prefix the root folder; this needs more thought)
  - usage:
  - index:
  - settings:
  - spaces: /sqlite/orgs/{org_id}/experiments/system.db
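For the public chat bot case, the anonymous usage path could be produced along these lines. This is a minimal sketch: the function name is hypothetical, and using a random version-4 UUID as the generated GUID is an assumption, since the RFC does not say how the id is generated.

```python
import uuid


def anonymous_usage_db_path() -> str:
    """Build the usage.db path for an anonymous public chat bot user."""
    # Assumption: the generated GUID is a random (version 4) UUID.
    user_id = uuid.uuid4()
    return f"/sqlite/personal/anon-{user_id}/usage.db"


print(anonymous_usage_db_path())
```

Keeping the `anon-` prefix inside the same `/sqlite/personal/` scope means anonymous usage data follows the same layout as authenticated usage data while remaining easy to tell apart.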
This is being partially implemented as part of #207, as that work involves org-scoped data. It is partial because migrating existing data in deployed systems to the new structure will not be handled. A new DataScope enum has been introduced with backwards-compatible mappings where needed.