[Question]: No keyword or question found in the dataset after files are loaded by a customized ingestion pipeline
Self Checks
- [x] I have searched for existing issues, including closed ones.
- [x] I confirm that I am using English to submit this report (Language Policy).
- [x] Non-English title submissions will be closed directly (Language Policy).
- [x] Please do not modify this template :) and fill in all the required fields.
Describe your problem
There is no keyword or question data in the chunks, but I found them in the logs.
Hi @leoliu2031! I'm Dosu and I’m helping the ragflow team.
When using a custom ingestion pipeline, keywords and questions are generated and shown in logs as raw fields ('keywords', 'questions'), but for them to appear in the final chunk data, they must be normalized and stored as 'important_kwd' and 'question_kwd' lists. This normalization is handled by the Tokenizer component or by explicit post-processing in the pipeline. If these steps are missing or misconfigured, the chunk will not include the expected fields, even though logs show the extraction succeeded.
To resolve this, make sure your custom pipeline includes a Tokenizer step (or equivalent logic) before storing chunks, and that it converts 'keywords'/'questions' to 'important_kwd'/'question_kwd' and removes the raw fields. Also, check that the auto_keywords and auto_questions flags are set correctly in your parser config. If the pipeline skips this normalization, the UI and APIs will not display the keywords or questions, even though they were generated upstream. See the relevant code and pipeline behavior for details:
- Tokenizer logic
- Chunk normalization and storage
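For illustration, a minimal sketch of that normalization step, assuming the chunk is a plain dict; the helper name and the string-splitting rules are assumptions for this example, only the target field names ('important_kwd', 'question_kwd') come from the behavior described above:

```python
# Hypothetical post-processing step: move raw extractor output into the
# fields that get stored and displayed, and drop the raw fields.
def normalize_chunk(chunk: dict) -> dict:
    keywords = chunk.pop("keywords", None)
    questions = chunk.pop("questions", None)

    if keywords:
        # Accept either a comma-separated string or an existing list.
        if isinstance(keywords, str):
            keywords = [k.strip() for k in keywords.split(",") if k.strip()]
        chunk["important_kwd"] = keywords

    if questions:
        # Accept either newline-separated text or an existing list.
        if isinstance(questions, str):
            questions = [q.strip() for q in questions.split("\n") if q.strip()]
        chunk["question_kwd"] = questions

    return chunk
```

If your pipeline already has a Tokenizer node, it should perform this conversion for you; the sketch only shows what has to happen for the fields to survive into the stored chunk.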
I configured the dataset with a customized ingestion pipeline.
Your pipeline design—with parallel LLM nodes for summarization, keywords, and questions—matches the advanced template, but there’s a known limitation in RAGFlow custom pipelines: parallel Extractor nodes don’t merge their outputs automatically. Instead, only the last Extractor’s output may be preserved, causing fields like keywords or questions to be lost before reaching the Tokenizer/Indexer. The Tokenizer node is responsible for normalizing raw 'keywords' and 'questions' into 'important_kwd' and 'question_kwd' fields for chunk storage and UI/API visibility, but it only works if those fields survive upstream merging.
Currently, the UI and backend force serial flows, and Extractor nodes are destructive—they overwrite the chunk instead of merging fields. This means your keywords/questions may be generated (and shown in logs) but not included in the final chunk data unless you manually merge outputs before the Tokenizer. There’s no built-in workaround in v0.22.0; fixing this requires both UI and backend changes to support true parallel flows and merging outputs from multiple Extractors before tokenization and indexing. This limitation and a feature request for built-in extraction after custom chunking are tracked here: issue details.
For now, to ensure keywords and questions appear in your chunks, use a serial pipeline where each Extractor passes all previous fields forward, or manually merge outputs before the Tokenizer. If you need both custom chunking and reliable built-in extraction, this is not yet supported out of the box.
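As a rough illustration of the manual-merge workaround, here is a sketch that runs extractor steps one after another and folds every produced field into a single chunk before it reaches the Tokenizer. The extractor callables and the dict-based chunk shape are assumptions for this example, not the actual RAGFlow pipeline API:

```python
from typing import Callable, Dict, List

# Hypothetical serial merge: each extractor sees the accumulated fields and
# its output is merged back in, so a later extractor cannot silently drop
# the keywords or questions produced by an earlier one.
def run_extractors_serially(chunk: Dict, extractors: List[Callable[[Dict], Dict]]) -> Dict:
    merged = dict(chunk)
    for extract in extractors:
        result = extract(dict(merged))       # pass a copy so extractors stay side-effect free
        for key, value in result.items():
            if key not in merged or not merged[key]:
                merged[key] = value          # keep earlier values, add new ones
    return merged
```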
Can you share the pipeline JSON with us via [email protected]?
I have run some tests and would suggest trying a higher vector similarity weight, e.g. 0.5 or above.
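For reference, a hedged example of what raising that weight might look like through the RAGFlow Python SDK; the base URL, API key, and dataset ID are placeholders, and parameter names may differ slightly between SDK versions, so check the docs for the release you run:

```python
# Sketch of a retrieval call that weights vector similarity more heavily
# than keyword similarity. Placeholders must be replaced with real values.
from ragflow_sdk import RAGFlow

rag = RAGFlow(api_key="<YOUR_API_KEY>", base_url="http://localhost:9380")

chunks = rag.retrieve(
    question="your query here",
    dataset_ids=["<YOUR_DATASET_ID>"],
    vector_similarity_weight=0.5,   # try 0.5 or above, as suggested
    similarity_threshold=0.2,
)
for c in chunks:
    print(c.content)
```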