Upload any document
Is your feature request related to a problem? Please describe.
At the moment, Uwazi only supports direct upload of PDF files. Other file types need to be uploaded as attachments and can only be consumed in media and image fields.
Describe the solution you'd like
To be able to upload any file to Uwazi. Text files should be converted to PDF so they can be consumed like every other document in Uwazi (full text search, text referencing, machine learning, etc.).
As far as I have been able to research, the best open source tool for document conversion is LibreOffice. It supports hundreds of formats and has a headless mode for command line usage. I tested several file formats and the results so far are of very good quality. The supported formats are documented here: https://help.libreoffice.org/latest/en-US/text/shared/guide/convertfilters.html?DbPAR=SHARED#bm_id541554406270299
The basic command is soffice --headless --convert-to pdf file.extension. LibreOffice does a great job at inferring the file mime type based on the extension.
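The command above could be wrapped from Node roughly as follows. This is a minimal sketch, not Uwazi code: the helper names (buildConvertArgs, expectedPdfPath, convertToPdf) are hypothetical, it assumes soffice is available on the PATH, and it adds the --outdir option so the PDF is written to a known directory. The 60 second timeout is an arbitrary choice.

```typescript
import { execFile } from "node:child_process";
import * as path from "node:path";
import { promisify } from "node:util";

const execFileAsync = promisify(execFile);

// Builds the argument list for the command quoted in the issue, plus --outdir
// so the PDF lands in a known directory instead of the working directory.
function buildConvertArgs(inputPath: string, outDir: string): string[] {
  return ["--headless", "--convert-to", "pdf", "--outdir", outDir, inputPath];
}

// Assumption: soffice names the output after the input file, replacing the
// last extension with .pdf (report.docx becomes report.pdf).
function expectedPdfPath(inputPath: string, outDir: string): string {
  const baseName = path.basename(inputPath, path.extname(inputPath));
  return path.join(outDir, `${baseName}.pdf`);
}

// Hypothetical helper: runs the conversion and returns the expected PDF path.
// execFile passes arguments as an array, so no shell quoting is involved.
async function convertToPdf(inputPath: string, outDir: string): Promise<string> {
  await execFileAsync("soffice", buildConvertArgs(inputPath, outDir), { timeout: 60000 });
  return expectedPdfPath(inputPath, outDir);
}
```

Using execFile rather than a shell string keeps user-supplied file names from being interpreted by a shell, which matters when the input is an arbitrary upload.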
While it supports some magic for XLS, CSV and other tabular data formats, I would limit conversion support to popular text formats for now:
- .txt
- .html
- .doc
- .docx
- .rtf
- .wps
- .dot
- .wpt
- .wri
- .odt
- .docm
- .pdb
- .pages
- .epub
- .eml
Also some presentation formats:
- .ppt
- .pps
- .key
- .odp
- .pptx
- .ppsx
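The two lists above can be expressed as a simple lookup that decides how an upload is handled. This is only a sketch: the names (CONVERTIBLE, UploadHandling, classifyUpload) are hypothetical, and matching on the extension alone is an assumption made for brevity.

```typescript
// Extensions copied from the text and presentation lists in this issue.
const TEXT_EXTENSIONS = [
  ".txt", ".html", ".doc", ".docx", ".rtf", ".wps", ".dot", ".wpt",
  ".wri", ".odt", ".docm", ".pdb", ".pages", ".epub", ".eml",
];
const PRESENTATION_EXTENSIONS = [".ppt", ".pps", ".key", ".odp", ".pptx", ".ppsx"];
const CONVERTIBLE = new Set<string>([...TEXT_EXTENSIONS, ...PRESENTATION_EXTENSIONS]);

// "pdf": already a PDF, "convert": convert to PDF, "attachment": store as-is.
type UploadHandling = "pdf" | "convert" | "attachment";

function classifyUpload(filename: string): UploadHandling {
  const dot = filename.lastIndexOf(".");
  // A leading dot (index 0) or no dot at all means there is no extension.
  const ext = dot > 0 ? filename.slice(dot).toLowerCase() : "";
  if (ext === ".pdf") return "pdf";
  if (CONVERTIBLE.has(ext)) return "convert";
  return "attachment";
}
```

For example, classifyUpload("Report.DOCX") yields "convert", while classifyUpload("data.xlsx") yields "attachment" because tabular formats are deliberately left out of the conversion list.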
Other file formats should also be allowed to be uploaded directly to Uwazi, but only as an attachment, without any attempt to convert them.
The workflow should keep the original file as an attachment and add the PDF version as the main file. Needless to say, that PDF file then needs to be sent to our regular PDF processing pipeline (text extraction and indexing, language detection).
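The resulting file layout for each case could look like the sketch below. The types and function names (StoredFile, UploadPlan, planUpload) are hypothetical and do not reflect Uwazi's actual data model; the handling value is assumed to have been decided beforehand from the file extension.

```typescript
interface StoredFile {
  filename: string;
  role: "main" | "attachment";
}

interface UploadPlan {
  files: StoredFile[];
  // True when the main file must go through the regular PDF pipeline
  // (text extraction and indexing, language detection).
  processMainAsPdf: boolean;
}

// Replaces the last extension with .pdf, e.g. notes.odt becomes notes.pdf.
function pdfNameFor(filename: string): string {
  const dot = filename.lastIndexOf(".");
  const stem = dot > 0 ? filename.slice(0, dot) : filename;
  return `${stem}.pdf`;
}

function planUpload(filename: string, handling: "pdf" | "convert" | "attachment"): UploadPlan {
  if (handling === "pdf") {
    // Current behaviour: the uploaded PDF is the main file.
    return { files: [{ filename, role: "main" }], processMainAsPdf: true };
  }
  if (handling === "convert") {
    // Converted PDF becomes the main file; the original is kept untouched.
    return {
      files: [
        { filename: pdfNameFor(filename), role: "main" },
        { filename, role: "attachment" },
      ],
      processMainAsPdf: true,
    };
  }
  // Unsupported formats are stored as attachments with no conversion attempt.
  return { files: [{ filename, role: "attachment" }], processMainAsPdf: false };
}
```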
TBD. A technical question that remains open is whether Uwazi should access LibreOffice's CLI directly, or whether these conversions should be deployed as external services coordinated via the distributed jobs. Currently, we use a similar tool (poppler) for PDF to text conversion. It is installed as a direct and necessary dependency of Uwazi and accessed directly from the app code.
A nice-to-have that can be kept out of this MVP would be proper support for direct upload of media files.
Another potentially great feature is a viewer for database/tabular data. We could develop a table view for when the main file has such a format, or use some conversion trick to transform these files into PDFs as well.