Multi-column PDF cannot be segmented

Open: Lxx-c opened this issue 1 year ago • 1 comment

Self Checks

  • [X] I have searched for existing issues, including closed ones.
  • [X] I confirm that I am using English to submit this report (I have read and agree to the Language Policy).
  • [X] Please be sure to submit issues in English, otherwise they will be closed. Thank you! :)
  • [X] Please do not modify this template :) and fill in all the required fields.

1. Is this request related to a challenge you're experiencing? Tell me about your story.

Regarding the knowledge base's document segmentation feature: when I upload documents in PDF format, multi-column PDFs often cannot be recognized, while single-column PDFs work without problems. Does Dify have any plans to support the recognition and processing of multi-column PDFs?

2. Additional context or comments

When the knowledge base segments a multi-column PDF, the resulting content is often blank and invalid, and no content is recognized at all. However, many files really are in multi-column format, and they are mostly documents from authoritative institutions or governments. I need the segmentation feature to support multi-column PDF documents.

3. Can you help us with this feature?

  • [ ] I am interested in contributing to this feature.

Lxx-c, Jul 15 '24 07:07

@JohnJyong Hello, I would like to ask about the recognition of multi-column PDF files. Are there any plans for this on your end? My PDFs do not contain images or tables, though of course it would be even better if those could be supported as well.

Lxx-c, Jul 31 '24 09:07