[Summary] GraphGen Roadmap

Open ChenZiHong-Gavin opened this issue 3 months ago • 0 comments

Background

To establish GraphGen as an essential tool for synthesizing training and evaluation data, the roadmap focuses on two core pillars: (1) a robust, multi-dimensional data quality assessment and filtering system that ensures the reliability of generated knowledge graphs, and (2) an expanded architecture that supports multi-modal and multi-omics data inputs.

If you'd like to work on one of these tasks, please comment below to claim it and create an issue for the feature you'll be implementing.

Features

1 GraphGen Framework

[x] ♻️ Refactor the pipeline around base classes: Reader → KG_Builder → Partitioner → Generator. Data flow: Raw corpus → Reader → Splitter → KG_Builder → Partitioner → Generator → training / evaluation data: https://github.com/open-sciencelab/GraphGen/pull/59, https://github.com/open-sciencelab/GraphGen/pull/58, https://github.com/open-sciencelab/GraphGen/pull/52
[x] 🔧 Operator-orchestration backbone: https://github.com/open-sciencelab/GraphGen/pull/85
[ ] 🔍 Data provenance: ensure every record in the final training/evaluation set can be traced back to its original raw corpus through the full pipeline.
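
As a rough illustration of how the stage-based data flow and the provenance item could fit together, here is a minimal sketch. The `Record` and `Stage` classes, the `derive` helper, and the toy `Reader` and `Splitter` bodies are all hypothetical names made up for this example and are not GraphGen's actual classes; the idea is only that each stage passes the source identifier along and appends its own name to a trail.

```python
from dataclasses import dataclass, field
from typing import Iterable, List


@dataclass
class Record:
    """Hypothetical unit of data that carries its own provenance trail."""
    content: str
    source_id: str                                   # id of the raw document
    trail: List[str] = field(default_factory=list)   # stages applied so far

    def derive(self, content: str, stage: str) -> "Record":
        # A derived record keeps the source id and extends the trail.
        return Record(content, self.source_id, self.trail + [stage])


class Stage:
    """Hypothetical base class: every stage maps records to records."""
    name = "stage"

    def run(self, records: Iterable[Record]) -> List[Record]:
        raise NotImplementedError


class Reader(Stage):
    name = "reader"

    def run(self, records: Iterable[Record]) -> List[Record]:
        # Toy behavior: normalize surrounding whitespace.
        return [r.derive(r.content.strip(), self.name) for r in records]


class Splitter(Stage):
    name = "splitter"

    def run(self, records: Iterable[Record]) -> List[Record]:
        # Toy behavior: split on ". " and keep non-empty pieces.
        out = []
        for r in records:
            for piece in r.content.split(". "):
                if piece:
                    out.append(r.derive(piece, self.name))
        return out


def run_pipeline(stages: List[Stage], records: List[Record]) -> List[Record]:
    # Feed the output of each stage into the next one.
    for stage in stages:
        records = stage.run(records)
    return records


raw = [Record("Alpha binds Beta. Beta inhibits Gamma", source_id="doc-1")]
result = run_pipeline([Reader(), Splitter()], raw)
```

With this shape, any record in `result` can be traced back through `source_id` and `trail`; the later stages (KG_Builder, Partitioner, Generator) would follow the same pattern.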

2 Multi-Modal & Multi-Omics

[x] 🧬 Define ImageNode, AudioNode, ProteinNode, etc.
[x] 👁️‍🗨️ Vision–language fusion extraction: use open VLMs to generate "image–caption–entity" triples and write them into the graph: https://github.com/open-sciencelab/GraphGen/pull/69
[ ] 🧪 Multi-omics extraction: process genomics/transcriptomics/proteomics with automatic node-property alignment

3 Data Quality & Curation

[ ] 📊 Multi-dimensional quality metrics with a unified scoring API
[ ] 💓 Graph-quality assessment similar to KGHeartBeat
[ ] 🎯 One-click export of high-quality sub-graphs and high-quality data
[ ] ⚙️ Configurable pipeline: entity disambiguation, fact verification, redundancy removal, schema validation
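
One possible shape for the unified scoring API and the filtering step is sketched below. The metric functions, the weighting scheme, and the `curate` helper are assumptions chosen for illustration; the real metrics (fact verification, schema validation, and so on) would be far richer than these toy checks.

```python
from typing import Callable, Dict, List, Tuple

Triple = Tuple[str, str, str]          # (head, relation, tail)
Metric = Callable[[Triple], float]     # each metric returns a value in [0, 1]


def completeness(t: Triple) -> float:
    # Fraction of the three fields that are non-empty.
    return sum(1 for part in t if part.strip()) / 3


def non_self_loop(t: Triple) -> float:
    # Penalize triples whose head and tail are the same entity.
    return 0.0 if t[0] == t[2] else 1.0


def score(t: Triple, metrics: Dict[str, Metric], weights: Dict[str, float]) -> float:
    # Weighted average over all registered metrics.
    total = sum(weights[name] for name in metrics)
    return sum(weights[name] * fn(t) for name, fn in metrics.items()) / total


def curate(triples: List[Triple], metrics: Dict[str, Metric],
           weights: Dict[str, float], threshold: float) -> List[Triple]:
    # Remove exact duplicates, then keep triples scoring at or above the threshold.
    seen = set()
    kept = []
    for t in triples:
        if t in seen:
            continue
        seen.add(t)
        if score(t, metrics, weights) >= threshold:
            kept.append(t)
    return kept


metrics = {"completeness": completeness, "non_self_loop": non_self_loop}
weights = {"completeness": 1.0, "non_self_loop": 1.0}
triples = [("A", "binds", "B"), ("A", "binds", "B"), ("A", "is", "A"), ("C", "", "D")]
kept = curate(triples, metrics, weights, threshold=0.9)
```

Keeping metrics as plain callables in a dictionary means new dimensions can be registered, reweighted, or switched off through configuration without touching the filtering loop.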

4 Graph Construction

[ ] 🚀 Incremental & resumable construction
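
A minimal sketch of what "incremental and resumable" could mean in practice: persist the set of processed chunk ids next to the extracted edges, and skip those ids on the next run. The checkpoint layout, the `build_incrementally` function, and the `toy_extract` placeholder are assumptions for illustration only, not a description of how GraphGen stores its graph.

```python
import json
import os
import tempfile


def build_incrementally(chunks, checkpoint_path, extract):
    """Process only chunks missing from the checkpoint; save after each chunk."""
    state = {"done": [], "edges": []}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path, encoding="utf-8") as f:
            state = json.load(f)

    done = set(state["done"])
    processed_now = 0
    for chunk_id, text in chunks.items():
        if chunk_id in done:
            continue  # already handled in a previous run
        state["edges"].extend(extract(text))
        state["done"].append(chunk_id)
        done.add(chunk_id)
        # Write to a temp file and swap it in, so an interrupted run
        # never leaves a half-written checkpoint behind.
        tmp_path = checkpoint_path + ".tmp"
        with open(tmp_path, "w", encoding="utf-8") as f:
            json.dump(state, f)
        os.replace(tmp_path, checkpoint_path)
        processed_now += 1
    return state, processed_now


def toy_extract(text):
    # Placeholder extractor: turns "A->B" into one [head, tail] edge.
    head, tail = text.split("->")
    return [[head, tail]]


with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "checkpoint.json")
    state1, n1 = build_incrementally({"c1": "A->B"}, path, toy_extract)
    # Second run sees one old chunk and one new chunk.
    state2, n2 = build_incrementally({"c1": "A->B", "c2": "B->C"}, path, toy_extract)
```

The second call only processes `c2`, which is the behavior wanted both when new documents are appended and when a crashed run is restarted.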

5 Engineering

[x] 📂 Support csv, json, jsonl, txt, pickle, parquet, pdf, and various triple input formats: https://github.com/open-sciencelab/GraphGen/pull/50, https://github.com/open-sciencelab/GraphGen/pull/65, https://github.com/open-sciencelab/GraphGen/pull/82, https://github.com/open-sciencelab/GraphGen/pull/81
[ ] 🗄️ Unified hybrid storage layer: graph DB + object store + Redis cache, switchable with one click
[ ] 💨 Optional RedisGraph or hash storage for real-time read/write on large graphs
[ ] ✅ Data validation powered by Pydantic
[ ] 👓 More inference servers and clients such as Azure and Ollama: https://github.com/open-sciencelab/GraphGen/pull/74
[ ] 🔍 Test cases

6 Community Detection & Data Synthesis

[ ] 🔎 Apply multiple community-detection algorithms; generate data from communities and provide typical samples plus visualizations
[ ] 🧠 Community summary → CoT data: use community summaries as few-shot examples to synthesize high-quality chain-of-thought data
[ ] 💬 Multi-turn dialogue synthesis: random-walk sampling → multi-turn Q&A while maintaining context consistency
[ ] 📈 Complexity grading for curriculum learning
[ ] 🕵️‍♂️ Support comparison with baselines
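
The random-walk item can be sketched as follows. This is a simplified illustration under stated assumptions: the graph is a plain adjacency dictionary, the walk never revisits a node, and the question template stands in for what would presumably be an LLM call. The `random_walk` and `walk_to_dialogue` names are hypothetical.

```python
import random
from typing import Dict, List, Tuple

# node -> list of (relation, neighbor) pairs
Graph = Dict[str, List[Tuple[str, str]]]


def random_walk(graph: Graph, start: str, steps: int,
                rng: random.Random) -> List[Tuple[str, str, str]]:
    """Sample a path of (head, relation, tail) edges without revisiting nodes."""
    path = []
    node = start
    visited = {start}
    for _ in range(steps):
        options = [(rel, nxt) for rel, nxt in graph.get(node, [])
                   if nxt not in visited]
        if not options:
            break  # dead end: stop the walk early
        rel, nxt = rng.choice(options)
        path.append((node, rel, nxt))
        visited.add(nxt)
        node = nxt
    return path


def walk_to_dialogue(path: List[Tuple[str, str, str]]) -> List[Dict[str, str]]:
    # Each edge becomes one turn; the answer of turn i is the subject of turn i+1,
    # which is one simple way to keep the context consistent across turns.
    return [{"question": f"What does {head} {rel}?", "answer": tail}
            for head, rel, tail in path]


graph = {
    "A": [("activates", "B")],
    "B": [("inhibits", "C")],
    "C": [("binds", "A")],
}
path = random_walk(graph, "A", steps=5, rng=random.Random(0))
dialogue = walk_to_dialogue(path)
```

Passing an explicit `random.Random` instance keeps sampling reproducible, which helps when comparing synthesized dialogues against baselines.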

7 UX, Docs & Community

[ ] 📦 Streamlined pip install and usage
[ ] 📓 Jupyter tutorial suite
[ ] 📚 Comprehensive documentation
[ ] 🗃️ Data & use case library
[x] 🤝 Contributor guide & roadmap: clear labels, branching strategy, PR template, code of conduct
[ ] 🌐 More user-friendly web interface

8 Others

[ ] 📝 More standardized prompt & post-processing management; post-processing should be bound to prompts
[ ] 🌐 Improve online connectivity
[ ] 🔗 Enhanced coreference resolution during chunking

Further feature ideas are welcome. Feel free to suggest them and join the plan!

Sep 22 '25 14:09 ChenZiHong-Gavin