BookStack
PDF import/indexing
Describe the feature you'd like
As said here: https://github.com/BookStackApp/BookStack/issues/1270#issuecomment-463523175
You don't want to encourage users to build lists of links to documents; instead, content should be created in or copied into the editor. OK. Nevertheless, indexing PDF files would be a great feature on top.
Additionally, importing e.g. Word or PDF files directly into the editor would also be great (optionally with removal of unwanted HTML code).
Describe the benefits this would bring to existing BookStack users
Better experience / less work.
Can the goal of this request already be achieved via other means?
No, or at least not in a satisfying way :-)
Have you searched for an existing open/closed issue?
- [X] I have searched for existing issues and none cover my fundamental request
How long have you been using BookStack?
1 to 5 years
Additional context
No response
Thanks for the request, although this is not something I'd be keen to include support for since:
- It widens the scope, and lessens the focus, of what we'd consider documentation content within the platform.
- Support could vary depending on the format and structure of a specific PDF document, adding variability to such a feature working.
- Support would be added for certain formats, introducing variability to how different attachments/formats are treated.
- There are likely cases where this would not be desired, requiring additional levels of control to be exposed, which themselves can be a burden.
This sounds like it can be done with the API. Use some kind of PDF to HTML library, pass it to HTML to Markdown and then use the bookstack API to import.
Can anyone please help document the process of using the API to import content? or point me to the documentation that I am struggling to find?
Please and thank you!
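The last step of the pipeline suggested above (PDF to HTML, HTML to Markdown, then import through the API) can be sketched as follows. This is a minimal illustration, not an official client: it assumes the PDF has already been converted to Markdown by some other tool, that the instance exposes the REST endpoint `POST /api/pages`, and that an API token has been created for a user with permission to create pages. The helper names `build_page_request` and `create_page`, the example URL, and the token values are all hypothetical. Each BookStack instance should list its actual endpoints and fields under `/api/docs`, so check there before relying on this.

```python
import json
import urllib.request


def build_page_request(base_url, token_id, token_secret, book_id, name, markdown):
    """Build (but do not send) a request that would create a page in a book.

    Assumes the pages endpoint accepts `book_id`, `name` and `markdown`
    fields and token authentication via the Authorization header.
    """
    payload = {"book_id": book_id, "name": name, "markdown": markdown}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/api/pages",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Token {token_id}:{token_secret}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def create_page(request):
    """Send the prepared request and return the decoded JSON response."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)


if __name__ == "__main__":
    # Hypothetical values; replace with your own instance URL and token.
    req = build_page_request(
        "https://bookstack.example.com",
        "my-token-id",
        "my-token-secret",
        book_id=1,
        name="Imported document",
        markdown="# Imported document\n\nText converted from the PDF.",
    )
    print(req.get_method(), req.full_url)
    # page = create_page(req)  # uncomment to actually send the request
```

Keeping request construction separate from sending makes it easy to inspect what would be posted before touching a live instance. Converted HTML could presumably be sent in an `html` field instead of `markdown`, but that is also an assumption to verify against the instance's API docs.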
Some PDFs can be parsed; others need to be run through OCR. This is a big ask. OCR isn't fully automatic; it requires human review. For this reason, I would agree that you need to normalize your dataset before you import. The best open source document conversion library is Pandoc, and it doesn't support PDF as an input format.
From the Pandoc FAQ: "How can I convert PDFs to other formats using pandoc? You can’t. You can try opening the PDF in Word or Google Docs and saving in a format from which pandoc can convert directly."