[Summary] GraphGen Roadmap

Open • ChenZiHong-Gavin opened this issue 3 months ago • 0 comments

Background

To establish GraphGen as an essential tool for synthesizing training and evaluation data, this roadmap focuses on two core pillars: (1) a robust, multi-dimensional data quality assessment and filtering system that ensures the reliability of generated knowledge graphs, and (2) an expanded architecture that supports multi-modal and multi-omics data inputs.

If you'd like to work on one of these tasks, please comment below to claim it and create an issue for the feature you'll be implementing.

Features

1 GraphGen Framework

  • [x] ♻️ Refactor the pipeline around base classes: Reader, KG_Builder, Partitioner, Generator. Data flow: Raw corpus → Reader → Splitter → KG_Builder → Partitioner → Generator → training / evaluation data: https://github.com/open-sciencelab/GraphGen/pull/59, https://github.com/open-sciencelab/GraphGen/pull/58, https://github.com/open-sciencelab/GraphGen/pull/52
  • [x] 🔧 Operator-orchestration backbone: https://github.com/open-sciencelab/GraphGen/pull/85
  • [ ] 🔍 Data provenance: ensure every record in the final training/evaluation set can be traced back to its original raw corpus through the full pipeline.
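
As a rough illustration of how the staged data flow and the provenance goal could fit together, here is a minimal sketch. It is not GraphGen's actual code: `Record`, `Stage`, `derive`, `run_pipeline`, and the toy `Splitter` and `UppercaseGenerator` stages are hypothetical names chosen for this example. The idea is that every stage derives new records from old ones, so each output keeps the ID of its raw document and the list of stages it passed through.

```python
from __future__ import annotations

from abc import ABC, abstractmethod
from dataclasses import dataclass, field


@dataclass
class Record:
    """A piece of data plus the trail that produced it (hypothetical type)."""
    content: str
    source_id: str                                # ID of the raw document
    lineage: list[str] = field(default_factory=list)  # stage names, in order

    def derive(self, content: str, stage: str) -> Record:
        # The child keeps source_id and appends the producing stage.
        return Record(content, self.source_id, self.lineage + [stage])


class Stage(ABC):
    """Assumed common base for Reader, Splitter, KG_Builder, and so on."""

    @abstractmethod
    def run(self, records: list[Record]) -> list[Record]:
        ...


class Splitter(Stage):
    """Toy splitter: fixed-size character chunks."""

    def __init__(self, chunk_size: int) -> None:
        self.chunk_size = chunk_size

    def run(self, records: list[Record]) -> list[Record]:
        out = []
        for rec in records:
            for i in range(0, len(rec.content), self.chunk_size):
                chunk = rec.content[i:i + self.chunk_size]
                out.append(rec.derive(chunk, "Splitter"))
        return out


class UppercaseGenerator(Stage):
    """Stand-in for a real generator; it only upper-cases the text."""

    def run(self, records: list[Record]) -> list[Record]:
        return [rec.derive(rec.content.upper(), "Generator") for rec in records]


def run_pipeline(stages: list[Stage], records: list[Record]) -> list[Record]:
    # Feed the output of each stage into the next one.
    for stage in stages:
        records = stage.run(records)
    return records
```

With this shape, tracing a final sample back to its origin only requires reading `source_id` and `lineage` from the record. A real implementation would likely also need chunk offsets and graph node or edge IDs.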

2 Multi-Modal & Multi-Omics

  • [x] 🧬 Define ImageNode, AudioNode, ProteinNode, etc.
  • [x] 👁️‍🗨️ Vision–language fusion extraction: use open VLMs to generate "image–caption–entity" triples and write them into the graph: https://github.com/open-sciencelab/GraphGen/pull/69
  • [ ] 🧪 Multi-omics extraction: process genomics/transcriptomics/proteomics with automatic node-property alignment

3 Data Quality & Curation

  • [ ] 📊 Multi-dimensional quality metrics with a unified scoring API
  • [ ] 💓 Graph-quality assessment similar to KGHeartBeat
  • [ ] 🎯 One-click export of high-quality sub-graphs and high-quality data
  • [ ] ⚙️ Configurable pipeline: entity disambiguation, fact verification, redundancy removal, schema validation

4 Graph Construction

  • [ ] 🚀 Incremental & resumable construction

5 Engineering

  • [x] 📂 Support csv, json, jsonl, txt, pickle, parquet, pdf, and various triple input formats: https://github.com/open-sciencelab/GraphGen/pull/50, https://github.com/open-sciencelab/GraphGen/pull/65, https://github.com/open-sciencelab/GraphGen/pull/82, https://github.com/open-sciencelab/GraphGen/pull/81
  • [ ] 🗄️ Unified hybrid storage layer: graph DB + object store + Redis cache, switchable with one click
  • [ ] 💨 Optional RedisGraph or hash storage for real-time read/write on large graphs
  • [ ] ✅ Data validation powered by Pydantic
  • [ ] 👓 More inference servers and clients such as Azure and Ollama: https://github.com/open-sciencelab/GraphGen/pull/74
  • [ ] 🔍 Test cases

6 Community Detection & Data Synthesis

  • [ ] 🔎 Apply multiple community-detection algorithms; generate data from communities and provide typical samples plus visualizations
  • [ ] 🧠 Community summary → CoT data: use community summaries as few-shot examples to synthesize high-quality chain-of-thought data
  • [ ] 💬 Multi-turn dialogue synthesis: random-walk sampling → multi-turn Q&A while maintaining context consistency
  • [ ] 📈 Complexity grading for curriculum learning
  • [ ] 🕵️‍♂️ Support comparison with baselines
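
The random-walk idea for multi-turn dialogue synthesis can be sketched as follows. This is a simplified illustration under stated assumptions, not the planned implementation: the graph is a plain adjacency dictionary, and `random_walk` and `walk_to_turns` are hypothetical helper names. Each step of the walk becomes one dialogue turn, and each turn carries the nodes visited so far as its context, which is one simple way to keep later questions consistent with earlier ones.

```python
import random


def random_walk(
    adjacency: dict[str, list[str]],
    start: str,
    steps: int,
    rng: random.Random,
) -> list[str]:
    """Walk up to `steps` edges from `start`, stopping early at a dead end."""
    path = [start]
    for _ in range(steps):
        neighbors = adjacency.get(path[-1], [])
        if not neighbors:
            break
        path.append(rng.choice(neighbors))
    return path


def walk_to_turns(path: list[str]) -> list[dict]:
    """Turn each edge of the walk into one dialogue turn.

    `context` holds every node visited before the turn's target, so a
    prompt for turn k can include everything discussed in turns 1..k-1.
    """
    return [
        {"turn": i + 1, "context": path[: i + 1], "ask_about": path[i + 1]}
        for i in range(len(path) - 1)
    ]


# Example with a tiny, made-up graph.
graph = {
    "protein": ["gene", "pathway"],
    "gene": ["protein"],
    "pathway": ["protein", "disease"],
    "disease": [],
}
walk = random_walk(graph, "protein", steps=3, rng=random.Random(0))
turns = walk_to_turns(walk)
```

Passing an explicit `random.Random` instance keeps sampling reproducible. An actual generator would then hand each turn's `context` and `ask_about` to an LLM prompt.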

7 UX, Docs & Community

  • [ ] 📦 Streamlined pip install and usage
  • [ ] 📓 Jupyter tutorial suite
  • [ ] 📚 Comprehensive documentation
  • [ ] 🗃️ Data & use case library
  • [x] 🤝 Contributor guide & roadmap: clear labels, branching strategy, PR template, code of conduct
  • [ ] 🌐 More user-friendly web interface

8 Others

  • [ ] 📝 More standardized prompt & post-processing management; post-processing should be bound to prompts
  • [ ] 🌐 Improve online connectivity
  • [ ] 🔗 Enhanced coreference resolution during chunking

Further feature ideas are welcome. Feel free to suggest them and join the plan!

ChenZiHong-Gavin, Sep 22 '25 14:09