GraphGen
[Summary] GraphGen Roadmap
Background
To establish GraphGen as an essential tool for synthesizing training and evaluation data, the roadmap focuses on two core pillars: (1) a robust, multi-dimensional data quality assessment and filtering system that ensures the reliability of generated knowledge graphs, and (2) an architecture extended to support multi-modal and multi-omics data inputs.
If you'd like to work on one of these tasks, please comment below to claim it and create an issue for the feature you'll be implementing.
Features
1 GraphGen Framework
- [x] ♻️ Refactor the pipeline around base classes: Reader → KG_Builder → Partitioner → Generator. Data flow: Raw corpus → Reader → Splitter → KG_Builder → Partitioner → Generator → training / evaluation data: https://github.com/open-sciencelab/GraphGen/pull/59, https://github.com/open-sciencelab/GraphGen/pull/58, https://github.com/open-sciencelab/GraphGen/pull/52
- [x] 🔧 Operator-orchestration backbone: https://github.com/open-sciencelab/GraphGen/pull/85
- [ ] 🔍 Data provenance: ensure every record in the final training/evaluation set can be traced back to its original raw corpus through the full pipeline.
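The data flow and the provenance goal in this section can be sketched as a chain of small stage classes, each of which extends a provenance trail on the records it emits. This is a minimal illustration, not GraphGen's actual API: `Record`, `Stage`, `ListReader`, `FixedSplitter`, and `run_pipeline` are hypothetical names, and only the Reader and Splitter stages are shown.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Iterable, Iterator, List, Tuple


@dataclass(frozen=True)
class Record:
    """A piece of data plus the trail of sources it came from (hypothetical)."""
    content: str
    provenance: Tuple[str, ...]  # e.g. ("doc-0", "chunk-1")


class Stage(ABC):
    """Hypothetical base class: every stage maps records to records."""

    @abstractmethod
    def process(self, records: Iterable[Record]) -> Iterator[Record]:
        ...


class ListReader(Stage):
    """Turns raw strings into records tagged with a document id."""

    def __init__(self, docs: List[str]):
        self.docs = docs

    def process(self, records: Iterable[Record] = ()) -> Iterator[Record]:
        for i, text in enumerate(self.docs):
            yield Record(text, (f"doc-{i}",))


class FixedSplitter(Stage):
    """Splits each record into fixed-size chunks and extends the trail."""

    def __init__(self, size: int):
        self.size = size

    def process(self, records: Iterable[Record]) -> Iterator[Record]:
        for rec in records:
            for j in range(0, len(rec.content), self.size):
                chunk = rec.content[j:j + self.size]
                yield Record(chunk, rec.provenance + (f"chunk-{j // self.size}",))


def run_pipeline(stages: List[Stage]) -> List[Record]:
    """Feed the output of each stage into the next one."""
    records: Iterable[Record] = ()
    for stage in stages:
        records = stage.process(records)
    return list(records)


chunks = run_pipeline([ListReader(["abcdef", "xyz"]), FixedSplitter(4)])
for rec in chunks:
    print(rec.content, rec.provenance)
```

Because each stage only appends to the trail, any record in the final output can be walked back to the document it originated from, which is the property the provenance item asks for.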
2 Multi-Modal & Multi-Omics
- [x] 🧬 Define ImageNode, AudioNode, ProteinNode, etc.
- [x] 👁️‍🗨️ Vision–language fusion extraction: use open VLMs to generate "image–caption–entity" triples and write them into the graph: https://github.com/open-sciencelab/GraphGen/pull/69
- [ ] 🧪 Multi-omics extraction: process genomics/transcriptomics/proteomics with automatic node-property alignment
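One way to picture the typed nodes and the "image–caption–entity" triples is with plain dataclasses. The sketch below is an assumption for illustration only: the field names (`uri`, `caption`, `sequence`, `properties`), the `BaseNode` parent, and the `caption_triples` helper are hypothetical and do not reflect how GraphGen defines its node classes.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class BaseNode:
    """Hypothetical common parent for all node types."""
    node_id: str
    modality: str = "text"
    properties: Dict[str, str] = field(default_factory=dict)


@dataclass
class ImageNode(BaseNode):
    modality: str = "image"
    uri: str = ""
    caption: Optional[str] = None


@dataclass
class ProteinNode(BaseNode):
    modality: str = "protein"
    sequence: str = ""


def caption_triples(image: ImageNode, entities: List[str]) -> List[Tuple[str, Optional[str], str]]:
    """Build one (image id, caption, entity) triple per extracted entity."""
    return [(image.node_id, image.caption, entity) for entity in entities]


img = ImageNode(node_id="img-1", uri="cell.png", caption="A dividing cell")
triples = caption_triples(img, ["cell", "mitosis"])
protein = ProteinNode(node_id="p-1", sequence="MKT")
```

In a real extractor the caption and entity list would come from a VLM; here they are hard-coded so the example stays self-contained.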
3 Data Quality & Curation
- [ ] 📊 Multi-dimensional quality metrics with a unified scoring API
- [ ] 💓 Graph-quality assessment similar to KGHeartBeat
- [ ] 🎯 One-click export of high-quality sub-graphs and high-quality data
- [ ] ⚙️ Configurable pipeline: entity disambiguation, fact verification, redundancy removal, schema validation
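A configurable curation pipeline could be modeled as an ordered list of named steps, each taking and returning a list of triples, with a small report of how many items each step removed. The following is a rough sketch under that assumption; `remove_redundancy`, `validate_schema`, and `run_curation` are hypothetical names, and only two of the four listed steps are shown.

```python
from typing import Callable, Dict, List, Set, Tuple

Triple = Tuple[str, str, str]
Step = Callable[[List[Triple]], List[Triple]]


def remove_redundancy(triples: List[Triple]) -> List[Triple]:
    """Drop exact duplicates while keeping first-seen order."""
    return list(dict.fromkeys(triples))


def validate_schema(allowed_relations: Set[str]) -> Step:
    """Return a step that keeps only triples whose relation is allowed."""
    def step(triples: List[Triple]) -> List[Triple]:
        return [t for t in triples if t[1] in allowed_relations]
    return step


def run_curation(
    triples: List[Triple], steps: List[Tuple[str, Step]]
) -> Tuple[List[Triple], Dict[str, int]]:
    """Apply steps in order and record how many triples each one removed."""
    report: Dict[str, int] = {}
    for name, step in steps:
        before = len(triples)
        triples = step(triples)
        report[name] = before - len(triples)
    return triples, report


raw = [
    ("aspirin", "treats", "pain"),
    ("aspirin", "treats", "pain"),   # duplicate
    ("aspirin", "likes", "pain"),    # relation not in the schema
]
steps = [
    ("redundancy_removal", remove_redundancy),
    ("schema_validation", validate_schema({"treats"})),
]
curated, report = run_curation(raw, steps)
print(curated, report)
```

Keeping each step behind the same signature means the step list can be driven from a config file, and the per-step removal counts give a crude starting point for the quality metrics above.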
4 Graph Construction
- [ ] 🚀 Incremental & resumable construction
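Incremental and resumable construction can be approximated by keying each chunk on a content hash and persisting results to a checkpoint after every chunk, so a re-run skips work already done. This is a minimal sketch, not the planned design: `chunk_key`, `build_incrementally`, the JSON checkpoint layout, and the stand-in `extract` function are all assumptions.

```python
import hashlib
import json
import os
import tempfile
from typing import Callable, Dict, List, Tuple


def chunk_key(chunk: str) -> str:
    """Stable identifier for a chunk, derived from its content."""
    return hashlib.sha256(chunk.encode("utf-8")).hexdigest()


def build_incrementally(
    chunks: List[str],
    extract: Callable[[str], List[str]],
    checkpoint_path: str,
) -> Tuple[Dict[str, List[str]], int]:
    """Process only chunks missing from the checkpoint.

    Returns the full result map and the number of chunks newly processed.
    """
    done: Dict[str, List[str]] = {}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path, encoding="utf-8") as f:
            done = json.load(f)

    processed = 0
    for chunk in chunks:
        key = chunk_key(chunk)
        if key in done:
            continue  # already handled in an earlier run
        done[key] = extract(chunk)
        processed += 1
        # Write to a temp file and rename, so an interruption never leaves
        # a half-written checkpoint behind.
        tmp_path = checkpoint_path + ".tmp"
        with open(tmp_path, "w", encoding="utf-8") as f:
            json.dump(done, f)
        os.replace(tmp_path, checkpoint_path)
    return done, processed


def extract(text: str) -> List[str]:
    """Stand-in for the real (LLM-based) extraction step."""
    return text.split()


path = os.path.join(tempfile.mkdtemp(), "kg_checkpoint.json")
_, first_run = build_incrementally(["a b", "c d"], extract, path)
results, second_run = build_incrementally(["a b", "c d", "e f"], extract, path)
print(first_run, second_run)
```

The second call only processes the one new chunk, which is the behavior both "incremental" (new data) and "resumable" (interrupted run) rely on.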
5 Engineering
- [x] 📂 Support csv, json, jsonl, txt, pickle, parquet, pdf, and various triple input formats: https://github.com/open-sciencelab/GraphGen/pull/50, https://github.com/open-sciencelab/GraphGen/pull/65, https://github.com/open-sciencelab/GraphGen/pull/82, https://github.com/open-sciencelab/GraphGen/pull/81
- [ ] 🗄️ Unified hybrid storage layer: graph DB + object store + Redis cache, switchable with one click
- [ ] 💨 Optional RedisGraph or hash storage for real-time read/write on large graphs
- [ ] ✅ Data validation powered by Pydantic
- [ ] 👓 More inference servers and clients such as Azure and Ollama: https://github.com/open-sciencelab/GraphGen/pull/74
- [ ] 🔍 Add test cases
6 Community Detection & Data Synthesis
- [ ] 🔎 Apply multiple community-detection algorithms; generate data from communities and provide typical samples plus visualizations
- [ ] 🧠 Community summary → CoT data: use community summaries as few-shot examples to synthesize high-quality chain-of-thought data
- [ ] 💬 Multi-turn dialogue synthesis: random-walk sampling → multi-turn Q&A while maintaining context consistency
- [ ] 📈 Complexity grading for curriculum learning
- [ ] 🕵️‍♂️ Support comparison with baselines
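The random-walk idea behind multi-turn dialogue synthesis can be sketched as follows: sample a non-repeating path through the graph, then turn each edge on the path into one turn whose context is the set of nodes already visited. The graph representation (an adjacency dict), `random_walk`, and `walk_to_turns` are hypothetical, and the actual question generation by an LLM is left out.

```python
import random
from typing import Dict, List


def random_walk(graph: Dict[str, List[str]], start: str, length: int, seed: int = 0) -> List[str]:
    """Sample a path of up to `length` nodes without revisiting any node."""
    rng = random.Random(seed)
    path = [start]
    while len(path) < length:
        neighbors = [n for n in graph.get(path[-1], []) if n not in path]
        if not neighbors:
            break  # dead end: stop early with a shorter path
        path.append(rng.choice(neighbors))
    return path


def walk_to_turns(path: List[str]) -> List[dict]:
    """One turn per edge; earlier nodes are carried along as context."""
    return [
        {"context": path[:i], "ask_about": (path[i - 1], path[i])}
        for i in range(1, len(path))
    ]


# A simple chain graph, so the walk from "A" has only one possible route.
graph = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
path = random_walk(graph, "A", length=4)
turns = walk_to_turns(path)
for turn in turns:
    print(turn)
```

Carrying the visited nodes forward in `context` is one simple way to keep later questions consistent with earlier ones; a real implementation would likely pass the previous question-answer pairs as well.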
7 UX, Docs & Community
- [ ] 📦 Streamlined pip install and usage
- [ ] 📓 Jupyter tutorial suite
- [ ] 📚 Comprehensive documentation
- [ ] 🗃️ Data & use-case library
- [x] 🤝 Contributor guide & roadmap: clear labels, branching strategy, PR template, code of conduct
- [ ] 🌐 More user-friendly web interface
8 Others
- [ ] 📝 More standardized prompt & post-processing management; post-processing should be bound to prompts
- [ ] 🌐 Improve online connectivity
- [ ] 🔗 Enhanced coreference resolution during chunking
Further feature ideas are welcome. Feel free to suggest them and join the plan!