Support export and import knowledge base
Description
Provide a knowledge base import and export feature so that users can reuse chunks and knowledge graphs across multiple Autoflow instances, avoiding the repeated cost of embedding and knowledge graph extraction.
Design
What kind of files are used to transmit the knowledge base data?
- Export the KB data to CSV files?
- Export the related upload files into a folder named uploads.
migration_kb_data
- kb.{kb_id}.uploads.csv
- kb.{kb_id}.documents.csv
- kb.{kb_id}.chunks.csv
- kb.{kb_id}.entities.csv
- kb.{kb_id}.relationships.csv
- uploads
  - xxxx.md
  - xxxx.pdf
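
The layout above could be produced along these lines. This is only a minimal sketch: `export_kb`, its parameters, and the in-memory `tables` / `upload_files` inputs are hypothetical and stand in for whatever Autoflow would actually read from its database and file storage. Only the folder and file names come from the proposal.

```python
import csv
from pathlib import Path

# Table names taken from the file layout in the proposal.
KB_TABLES = ("uploads", "documents", "chunks", "entities", "relationships")


def export_kb(kb_id, tables, upload_files, out_dir="migration_kb_data"):
    """Write one CSV per table plus an uploads/ folder of raw files.

    tables: hypothetical dict mapping table name -> list of row dicts.
    upload_files: hypothetical dict mapping file name -> file content (bytes).
    """
    root = Path(out_dir)
    uploads_dir = root / "uploads"
    uploads_dir.mkdir(parents=True, exist_ok=True)

    for name in KB_TABLES:
        rows = tables.get(name, [])
        path = root / f"kb.{kb_id}.{name}.csv"
        with path.open("w", newline="", encoding="utf-8") as f:
            # An empty table still gets an (empty) file so the layout is stable.
            if rows:
                writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
                writer.writeheader()
                writer.writerows(rows)

    # Copy the original uploaded files next to the CSVs.
    for file_name, content in upload_files.items():
        (uploads_dir / file_name).write_bytes(content)

    return root
```

Writing a file for every table, even when it is empty, keeps the import side simple because it can rely on a fixed set of file names.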
Consideration
- Whether to support importing into an existing knowledge base.
  - The upload, document, and user IDs may change.
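
One way to handle changing IDs on import is to assign fresh IDs and rewrite every reference through an old-to-new map. The sketch below is illustrative only: `remap_ids` and the column names `id` and `document_id` are assumptions, not the actual Autoflow schema.

```python
def remap_ids(documents, chunks, start_id):
    """Assign fresh document IDs and rewrite chunk references to match.

    The "id" and "document_id" column names are assumptions for illustration.
    Returns new row lists and the old -> new ID map; inputs are not mutated.
    """
    id_map = {}
    new_documents = []
    for offset, doc in enumerate(documents):
        new_id = start_id + offset
        id_map[doc["id"]] = new_id
        new_documents.append({**doc, "id": new_id})

    # Every chunk must point at the new ID of the document it came from.
    new_chunks = [
        {**chunk, "document_id": id_map[chunk["document_id"]]}
        for chunk in chunks
    ]
    return new_documents, new_chunks, id_map
```

The same pattern would apply to uploads and users; importing into an existing knowledge base would additionally need `start_id` (or an equivalent ID allocator) to avoid collisions with rows that are already there.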
TODO
- [ ] Support exporting and importing a knowledge base via the CLI
Related https://github.com/pingcap/autoflow/issues/398
Do we really have such a scenario?
> Do we really have such a scenario?

Yes. Users in a local or private network environment often have difficulty fetching docs.pingcap.com or other online docs. This feature would help them to download an existing knowledge base and import it to their own self-hosted autoflow easily.
> help them to download an existing knowledge base

What would the existing knowledge base be: an internal website, or a folder containing a lot of local files? Please provide a detailed description in the issue description.
If the data source is not a common one, we should implement this with a custom script.
> import it to their own self-hosted autoflow

Why not use the local file upload data source? Do we have to use the CLI to upload?
> What would the existing knowledge base

For example, a TiDB knowledge base, a Redis KB, or a MongoDB KB.
> Why not use the local file upload data source

- Cost: If we add TiDB knowledge by crawling docs.pingcap.com, users have to pay for the LLM again to extract knowledge graphs from about a thousand pages. If we instead upload a roughly 100 MB tidb-user-guide.pdf, the LLM still has to extract the whole knowledge graph from that PDF, which would probably cost between $50 and $100.
- LLM performance: Users may not have the most capable LLM for knowledge graph extraction. For example, many users run llama3.* 32B or a self-hosted model, and these LLMs do not perform well at extracting and building graphs.
> Do we have to use the CLI to upload?

The ultimate solution might be a UI-based export/import experience, I think.