[Question]: MinerU integration issue
Self Checks
- [x] I have searched for existing issues, including closed ones.
- [x] I confirm that I am using English to submit this report (Language Policy).
- [x] Non-English title submissions will be closed directly (Language Policy).
- [x] Please do not modify this template :) and fill in all the required fields.
Describe your problem
The results returned by the pipeline and vlm-http-client backends in MinerU are not the same, and the code does not handle the differences. The parsed output is returned as a zip archive, and if a file name contains special characters, they are forcibly converted to _.
The discrepancies between pipeline and vlm-http-client results in the MinerU integration, especially around zipped outputs and filename handling, stem from how outputs are packaged and how filenames are sanitized. Both backends use the same API and zipping logic, controlled by the response_format_zip flag, which returns results as a zip archive that RAGFlow then unpacks. Filename normalization on the RAGFlow side only removes spaces, not all special characters, so if every special character is converted to an underscore, that most likely happens inside MinerU itself or on the server side, not in RAGFlow's code. RAGFlow currently has no user-facing option to configure or override this normalization or the zipped output behavior. If you need more control for now, you can post-process the outputs or adjust filenames before parsing. For more details, see the relevant code in MinerUParser (source).
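As a rough illustration of the "adjust filenames before parsing" workaround, here is a minimal Python sketch. The helpers `safe_name` and `build_rename_map` are hypothetical and are not part of RAGFlow or MinerU, and the set of characters treated as unsafe is an assumption rather than MinerU's actual rule. The idea is to pick the underscore-only name yourself and keep a mapping back to the original, so the result stays predictable whatever the backend does.

```python
import re
from pathlib import Path

# Assumption: anything outside this conservative set may be rewritten downstream.
_UNSAFE = re.compile(r"[^A-Za-z0-9._-]+")


def safe_name(filename: str) -> str:
    """Return a conservative ASCII-only file name, keeping the extension."""
    path = Path(filename)
    # Collapse each run of unsafe characters into one underscore.
    stem = _UNSAFE.sub("_", path.stem).strip("_") or "file"
    return f"{stem}{path.suffix}"


def build_rename_map(filenames):
    """Map each sanitized name back to its original, disambiguating collisions."""
    mapping = {}
    for original in filenames:
        candidate = safe_name(original)
        stem, suffix = Path(candidate).stem, Path(candidate).suffix
        counter = 1
        # Two originals can sanitize to the same name; add a numeric suffix.
        while candidate in mapping:
            candidate = f"{stem}_{counter}{suffix}"
            counter += 1
        mapping[candidate] = original
    return mapping
```

Uploading under the sanitized name and looking up the original in the returned mapping afterwards keeps the parsed output traceable to the source file.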
If a file name contains special characters, they are forcibly converted to _
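This behavior occurs in the MinerU backend and is not caused by RAGFlow's code. RAGFlow only removes spaces from file names; it does not replace all special characters with underscores. The logic that forcibly converts special characters to underscores is implemented on the MinerU server side, and RAGFlow currently has no parameter to configure or change it. To avoid this, you can rename the file manually before uploading, or post-process the output after parsing. For the relevant code and behavior, see the MinerUParser implementation.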
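For the post-processing route, a minimal Python sketch of reading the returned zip archive without depending on the folder layout or the sanitized file name might look like the following. The function `find_outputs` is hypothetical, and the `.md` and `_content_list.json` suffixes are assumptions about what the archive contains; check them against an actual archive from your MinerU deployment.

```python
import io
import zipfile


def find_outputs(zip_bytes: bytes) -> dict:
    """Locate Markdown and content-list JSON entries anywhere in the archive."""
    found = {"markdown": [], "content_list": []}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():
            if name.endswith("/"):
                continue  # skip directory entries
            lower = name.lower()
            # Match on suffix only, so backend-specific subfolders and
            # rewritten file stems do not matter.
            if lower.endswith(".md"):
                found["markdown"].append(name)
            elif lower.endswith("_content_list.json"):
                found["content_list"].append(name)
    return found
```

Matching on suffix rather than on an exact path means the same code handles archives from either backend, at the cost of having to pick one entry if several match.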