
RFC: [WIP] Change persistence folder structure

Open janaka opened this issue 1 year ago • 1 comments

Situation

Data persistence on disk isn't consistently separated by scope of ownership.

Current filesystem structure:

  • /index/PERSONAL/{user_id}/ - index files for Ask Your Docs feature

  • /index/SHARED/{org_id}/{space_id}/ - index files for Spaces

  • /sqlite/PERSONAL/{user_id}/usage.db - retrieval and LLM request and response data (chat history etc.) for all interactions

  • /sqlite/SHARED/system.db - system data and metadata (orgs, users, user_groups, spaces, and space_groups)

  • /upload/PERSONAL/{user_id}/ - the Ask Your Docs feature is hard-coded to datasource = MANUAL_UPLOAD; the uploaded files are persisted here.

  • /upload/SHARED/{org_id}/{space_id} - file uploads for any spaces with datasource = MANUAL_UPLOAD are persisted here.

Database table to file mapping:

  • usage.db : settings (user scoped), history_{feature_name}, history_thread_{feature_name}
  • system.db : orgs, org_members, users, settings (non-user scoped), space_groups, space_group_members, spaces, space_access, user_groups, user_group_members

Tables with joins:

  • orgs <> org_members
  • org_members <> users
  • spaces <> space_access <> users
  • spaces <> space_group_members
  • users <> user_group_members

This structure isn't ideal now that Orgs have been added, and given upcoming features such as public chatbots and restructuring the Ask Your Docs functionality as a personal org.

  • The semantics of space type (PERSONAL or SHARED) don't hold any longer for determining persistence location
  • Using FeatureType to pass a user_id context around, and then using that context to decide on a persistence location, is also not great.

Goals and Requirements

  • Reduce the risk of org-owned content data (e.g. confidential docs that are indexed, and the chat history against them) leaking across org boundaries
  • Make it easier to migrate an entire org from a multi-tenant instance to a dedicated instance.
  • Usage data (chat history etc): always strictly scoped to an org and user, hence personal.
  • Org System data (user_groups, spaces metadata etc.) - can be shared across multiple users but always strictly scoped to an org
  • Global System data - users and org_members are the only system-wide shared data i.e. accessible across orgs.
  • Presentation layer and domain concepts (such as features and space types) should not be directly coupled to the persistence layer system logic

Proposal

Structure folders based on a name for the persistence system followed by one or more keys that identify the unique owner of the data. The filename describes the data.

Pattern:

/{persistence_system_name}/{owner_scope_key_1}/.../{owner_scope_key_n}/{filename}
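
To make the pattern concrete, here is a minimal sketch of a path builder. The function name `persistence_path` and its signature are hypothetical and not taken from the docq codebase; the sketch assumes the scope name (e.g. `orgs`, `personal`, `global`) is passed as the first owner scope key.

```python
from pathlib import PurePosixPath


def persistence_path(system: str, *owner_keys: str, filename: str = "") -> PurePosixPath:
    """Build /{system}/{key_1}/.../{key_n}/{filename} (hypothetical helper)."""
    parts = [system, *owner_keys]
    if filename:
        parts.append(filename)
    for part in parts:
        # Reject empty segments, separators, and dot segments so that an ID
        # can never point outside its owner's folder.
        if not part or "/" in part or part in (".", ".."):
            raise ValueError(f"invalid path segment: {part!r}")
    return PurePosixPath("/", *parts)


# Example: the org-scoped system database from this proposal.
org_db = persistence_path("sqlite", "orgs", "1234", filename="system.db")
```

Because callers pass only a persistence system name and owner keys, nothing in this helper needs to know about features or space types, which is the decoupling the goals ask for.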

Concrete changes:

  • /index/SHARED/{org_id}/{space_id}/ --> /index/orgs/{org_id}/{space_id}/

  • [x] /index/PERSONAL/{user_id}/ --> Same as above. Ask Your Docs changes to be achieved by providing every user a personal org. A space that isn't shared with any other users is private.

  • /index/THREAD/{org_id}/{space_id}/ --> /index/personal/{user_id}/{space_id}

  • /sqlite/PERSONAL/{user_id}/usage.db --> /sqlite/personal/{user_id}/usage.db - authenticated user usage

  • /sqlite/SHARED/system.db --> /sqlite/global/system.db - global system

  • new --> /sqlite/orgs/{org_id}/system.db - org system

  • settings org and user scope - both (?) should be stored in the same settings table in /sqlite/orgs/{org_id}/system.db

  • settings global scope - /sqlite/global/system.db

  • /upload/SHARED/{org_id}/{space_id} --> /upload/orgs/{org_id}/{space_id}

  • /upload/PERSONAL/{user_id}/ --> same as above because of the changes to Ask Your Docs.

  • /upload/THREAD/{org_id}/{space_id} --> /upload/personal/{user_id}/{space_id}
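
Some of the renames above are purely mechanical prefix changes, and a sketch of that subset might look like the following. The names `_PREFIX_RENAMES` and `to_new_path` are hypothetical. The sketch covers only the index and sqlite renames that need no extra lookup. The PERSONAL index/upload paths (which need the user's personal org ID) and the THREAD paths (which need a user_id that is not in the old path) are deliberately left out and return `None`.

```python
from typing import Optional

# Legacy prefix -> new prefix, for renames that need no extra context.
_PREFIX_RENAMES = {
    "/index/SHARED/": "/index/orgs/",
    "/sqlite/PERSONAL/": "/sqlite/personal/",
    "/sqlite/SHARED/": "/sqlite/global/",
}


def to_new_path(legacy_path: str) -> Optional[str]:
    """Return the new location for a legacy path, or None if a plain
    prefix swap is not enough to decide it (hypothetical helper)."""
    for old_prefix, new_prefix in _PREFIX_RENAMES.items():
        if legacy_path.startswith(old_prefix):
            return new_prefix + legacy_path[len(old_prefix):]
    return None
```

Returning `None` rather than guessing keeps the cases that need an org or user lookup visible to whoever writes the eventual migration.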

New use cases:

  • public chatbot --> /sqlite/personal/anon-{user_id}/usage.db - anonymous user usage. user_id is a generated GUID.
  • experiment projects (ideally we just want to be able to prefix the root folder. this needs more thought)
    • usage:
    • index:
    • settings:
    • spaces: /sqlite/orgs/{org_id}/experiments/system.db
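
For the public chatbot case, the anonymous usage path could be derived as in the sketch below. The function name is hypothetical, and using a random version 4 UUID as the generated GUID is an assumption; the proposal only says the user_id is a generated GUID.

```python
import uuid


def anonymous_usage_db_path() -> str:
    """Return a usage.db path for a new anonymous user (hypothetical helper)."""
    # Assumption: a random UUID (version 4) serves as the generated GUID.
    anon_user_id = uuid.uuid4()
    return f"/sqlite/personal/anon-{anon_user_id}/usage.db"
```

The `anon-` prefix keeps anonymous usage folders distinguishable from authenticated users' folders under the same `/sqlite/personal/` root.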

janaka avatar Feb 04 '24 19:02 janaka

Partially implementing this as part of #207, as it involves org-scoped data. Partial because migrating the existing data structure to the new one in deployed systems will not be handled. A new DataScope enum has been introduced with backwards-compatible mappings where needed.

janaka avatar Feb 04 '24 19:02 janaka