
PDF import/indexing

Open helson22 opened this issue 3 years ago • 4 comments

Describe the feature you'd like

As said here https://github.com/BookStackApp/BookStack/issues/1270#issuecomment-463523175

You don't want to encourage users to make link lists leading to documents. Instead, content should be created in or copied into the editor. OK. Nevertheless, indexing PDF files would be a great feature on top of that.

Additionally, importing e.g. Word or PDF files directly into the editor would also be great (optionally with removal of unwanted HTML code).

Describe the benefits this would bring to existing BookStack users

Better experience / less work.

Can the goal of this request already be achieved via other means?

No and not satisfying enough :-)

Have you searched for an existing open/closed issue?

  • [X] I have searched for existing issues and none cover my fundamental request

How long have you been using BookStack?

1 to 5 years

Additional context

No response

helson22 avatar Oct 05 '22 16:10 helson22

Thanks for the request, although this is not something I'd be keen to include support for since:

  • It widens the scope of, and lessens the focus on, what we'd consider documentation content within the platform.
  • Support could vary depending on the format and structure of a specific PDF document, making it unpredictable whether the feature works for a given file.
  • Support would be added only for certain formats, introducing inconsistency in how different attachments/formats are treated.
  • There are likely cases where this would not be desired, requiring additional levels of control to be exposed, which can themselves be a burden.

ssddanbrown avatar Oct 08 '22 12:10 ssddanbrown

This sounds like it can be done with the API. Use some kind of PDF-to-HTML library, pass the result through an HTML-to-Markdown converter, and then use the BookStack API to import it.
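The last step of that pipeline could look roughly like the sketch below. It assumes the page-create endpoint is `POST /api/pages`, that it accepts `book_id`, `name` and `markdown` fields, and that authentication uses an `Authorization: Token <id>:<secret>` header; check all of that against the API docs served by your own instance before relying on it. The function name `build_page_request` and every value passed to it are made up for illustration, and the PDF-to-Markdown conversion is assumed to have happened already.

```python
import json
import urllib.request


def build_page_request(base_url, token_id, token_secret, book_id, name, markdown):
    """Build (but do not send) a request that would create a page.

    Assumption: the endpoint path, field names and token header format
    match your BookStack version. Verify them in your instance's API docs.
    """
    payload = {"book_id": book_id, "name": name, "markdown": markdown}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/api/pages",
        data=json.dumps(payload).encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"Token {token_id}:{token_secret}",
            "Content-Type": "application/json",
        },
    )


# Hypothetical values; replace with your own instance URL, token and book ID.
req = build_page_request(
    "https://bookstack.example.com/",
    "my-token-id",
    "my-token-secret",
    book_id=1,
    name="Imported manual",
    markdown="# Imported manual\n\nConverted text goes here.",
)
```

Sending it is then a matter of `urllib.request.urlopen(req)`. Building the request separately from sending it makes it easy to inspect what would be posted before touching a live instance.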

IceWreck avatar Oct 14 '22 06:10 IceWreck

Can anyone please help document the process of using the API to import content, or point me to the documentation that I am struggling to find?

Please and thank you!

manicmarvin avatar Oct 12 '23 02:10 manicmarvin

Some PDFs can be parsed directly; others need to be run through OCR. This is a big ask. OCR isn't fully automatic, since its output requires human review. For this reason I would agree that you need to normalize your dataset before you import. The best open source document conversion library is Pandoc, and it doesn't support PDF as an input format.

How can I convert PDFs to other formats using pandoc? You can’t. You can try opening the PDF in Word or Google Docs and saving in a format from which pandoc can convert directly.
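Once the PDF has been saved as something pandoc can read (for example `.docx`), the conversion step might be scripted like this. It is only a sketch: it assumes pandoc is installed and on `PATH`, it lets pandoc infer the input format from the file extension, and the file names and the helper names `pandoc_command` and `convert` are made up for illustration.

```python
import shutil
import subprocess


def pandoc_command(src, dest, to_format="gfm"):
    """Return the pandoc argument list for converting src to dest."""
    # -t selects the output format, -o the output file.
    return ["pandoc", src, "-t", to_format, "-o", dest]


def convert(src, dest, to_format="gfm"):
    """Run pandoc, failing loudly if it is missing or exits with an error."""
    if shutil.which("pandoc") is None:
        raise RuntimeError("pandoc is not installed or not on PATH")
    subprocess.run(pandoc_command(src, dest, to_format), check=True)


# Hypothetical file names, shown without actually running pandoc.
cmd = pandoc_command("manual.docx", "manual.md")
```

Calling `convert("manual.docx", "manual.md")` would then write Markdown that can be reviewed by hand before it is imported.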

A9G-Data-Droid avatar Apr 15 '24 23:04 A9G-Data-Droid