dedalo
new feature: apply OCR to uploaded pdfs
Hi,
We have developed a feature that applies OCR (image-to-text conversion) to PDFs uploaded to Dedalo. When a PDF is uploaded, the form shows an option to apply OCR and a drop-down menu to select the OCR language.
It relies on the ocrmypdf tool, which must be installed on the same server as Dedalo because Dedalo invokes it through the shell_exec function.
We would like you to review these changes so that they can be incorporated into the official Dedalo code and used in any Dedalo installation; we think the feature could be useful in many fields.
Above all, we are interested in a review of the tool_upload file class.tool_upload.php. We think two things in particular can be greatly improved: the way the path of the uploaded file is obtained, and the conversion between Dedalo languages and OCR languages, which we currently do manually. Any other suggestions for improvement or optimization are welcome.
We are happy to discuss anything you need to know about it.
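As a rough illustration of the language conversion mentioned above, the manual mapping could be kept in a single lookup table. This is only a sketch in JavaScript, not the code from the pull request (which calls ocrmypdf from PHP). The Dédalo codes such as lg-spa, the table contents, and the function names to_ocr_lang and build_ocr_args are assumptions made for the example; the `-l` option is the ocrmypdf flag that selects the OCR language.

```javascript
// Hypothetical lookup table: Dédalo language code -> Tesseract language code.
// The entries are illustrative assumptions; extend them to match the installation.
const OCR_LANG_BY_DEDALO_LANG = new Map([
	['lg-eng', 'eng'],
	['lg-spa', 'spa'],
	['lg-cat', 'cat'],
	['lg-fra', 'fra']
])

// Resolve the OCR language, falling back to English for unmapped codes.
function to_ocr_lang(dedalo_lang, fallback = 'eng') {
	return OCR_LANG_BY_DEDALO_LANG.get(dedalo_lang) ?? fallback
}

// Build the argument list for `ocrmypdf -l <lang> <input> <output>`.
// Returning an array (rather than one concatenated string) keeps each
// argument separate, so the caller can quote or escape them individually.
function build_ocr_args(input_path, output_path, dedalo_lang) {
	return ['-l', to_ocr_lang(dedalo_lang), input_path, output_path]
}

const args = build_ocr_args('/tmp/in.pdf', '/tmp/out.pdf', 'lg-spa')
console.log(['ocrmypdf', ...args].join(' ')) // ocrmypdf -l spa /tmp/in.pdf /tmp/out.pdf
```

When the command is run through a shell, as with shell_exec, each path should be escaped before it is concatenated into the command string (in PHP, escapeshellarg serves that purpose).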
We hope this is of interest to the community and can be incorporated into the official Dedalo distribution.
Best
Fantastic tool! I’ll try it. Many thanks.
On Dec 7, 2023, at 6:53 AM, pricua @.***> wrote:
You can view, comment on, or merge this pull request online at:
https://github.com/renderpci/dedalo/pull/64
Commit Summary
- 3a88328 new feature: apply OCR to uploaded pdfs (https://github.com/renderpci/dedalo/pull/64/commits/3a8832890febe7c64dd0f31469f4e8eac99c1a5c)
File Changes (5 files, https://github.com/renderpci/dedalo/pull/64/files)
- M core/services/service_upload/js/render_edit_service_upload.js (72) https://github.com/renderpci/dedalo/pull/64/files#diff-1186898516378b074a473ae66b2abb3039ed00ce6c1507731ce2e11f36eb8eb5
- M core/services/service_upload/js/service_upload.js (951) https://github.com/renderpci/dedalo/pull/64/files#diff-845c6621f548c389d9edf0295c5eb174188b594f9de2895d5398160d5271e34e
- M tools/tool_upload/class.tool_upload.php (35) https://github.com/renderpci/dedalo/pull/64/files#diff-6a05c6f18f203a7afc7d4d635e261b47dd81f61347a1589f9d093ff9ab98b309
- M tools/tool_upload/js/render_tool_upload.js (776) https://github.com/renderpci/dedalo/pull/64/files#diff-0b905fa028898f3fd0c1a3c27c13d37914ea4fdeb0080a8ce6f17d30820fcf98
- M tools/tool_upload/js/tool_upload.js (186) https://github.com/renderpci/dedalo/pull/64/files#diff-bb207a54de409fa5399693ddd06cab1addfd7ba43e5a60a2aa8a376a7c9f8357
Patch Links:
- https://github.com/renderpci/dedalo/pull/64.patch
- https://github.com/renderpci/dedalo/pull/64.diff
Hi @pricua
Thanks a lot for this new feature. It will be useful to all projects, and I think it is possible to integrate it into the main code.
We are reviewing your code, and some comments have come up.
- The labels have to be translatable. Dédalo is used in different countries and languages, so every label has to be made translatable in this way:
From:
const combobox_label = ui.create_dom_element({
	element_type : 'label',
	class_name   : 'label',
	inner_html   : '<label>Lenguaje</label>',
	parent       : form
})
to:
const combobox_label = ui.create_dom_element({
	element_type : 'label',
	class_name   : 'label',
	inner_html   : get_label.language || 'Language',
	parent       : form
})
English is the main fallback for labels. If you need to add labels that are not in the ontology, please tell me and I will open it.
Take into account that if the label is inside the tool, you will need to call the tool method:
self.get_tool_label('language') || 'Language'
And please don't add more HTML tags than necessary:
<label>Lenguaje</label>
is not necessary; ui.create_dom_element() will create the label node, so with this tag the result will be:
<label><label>Lenguaje</label></label>
Try to keep it simple.
Finally, we need time to test it. Thanks again, I will be back with more.
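To make the two points above concrete, here is a minimal sketch that can be run outside the browser. The function create_dom_element_stub is a hypothetical stand-in that returns an HTML string; it is not Dédalo's ui.create_dom_element(), and the empty get_label object only simulates a label that has not been translated yet.

```javascript
// Hypothetical stand-in for ui.create_dom_element(): it returns an HTML string
// so the example runs without a DOM. Not Dédalo's actual implementation.
function create_dom_element_stub({ element_type, class_name, inner_html }) {
	return `<${element_type} class="${class_name}">${inner_html}</${element_type}>`
}

// Stand-in for the translated labels object; 'language' is deliberately missing.
const get_label = {}

// Hard-coded markup: nests a second <label> and cannot be translated.
const nested = create_dom_element_stub({
	element_type : 'label',
	class_name   : 'label',
	inner_html   : '<label>Lenguaje</label>'
})

// Translatable text with an English fallback and no extra tag.
const translated = create_dom_element_stub({
	element_type : 'label',
	class_name   : 'label',
	inner_html   : get_label.language || 'Language'
})

console.log(nested)     // <label class="label"><label>Lenguaje</label></label>
console.log(translated) // <label class="label">Language</label>
```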
Hi @pricua
Well, full integration has been done!
I just want to point out a few things about the final integration:
- Never use var to define global variables in Dédalo. You can create an object in the instance and change or retrieve it at any time, which is easier to maintain and to move between instances.
- Don't include specific processes in general classes. The OCR process applies only to PDFs, so use the component_pdf class to keep the process specific. If you put the exec() call in tool_upload.php, every uploaded file will be checked for the property, and anyone looking for the process later will not expect a specific process to be defined in a general class. Using the specific component is clearer and easier to find, and the scope is obvious: everything about PDFs lives in component_pdf.
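The first point can be sketched as follows. The factory create_upload_instance and the ocr property are hypothetical names chosen for this example, not the ones used in the integrated code; the idea is only that each instance owns its OCR options instead of sharing a global var.

```javascript
// Hypothetical factory: each upload instance carries its own OCR options
// instead of reading them from a global `var`.
function create_upload_instance(options = {}) {
	return {
		ocr : {
			enabled : options.ocr_enabled === true,
			lang    : options.ocr_lang || 'lg-eng'
		}
	}
}

const pdf_upload   = create_upload_instance({ ocr_enabled: true, ocr_lang: 'lg-spa' })
const image_upload = create_upload_instance()

// Changing one instance does not affect the other.
pdf_upload.ocr.lang = 'lg-cat'

console.log(pdf_upload.ocr.lang, image_upload.ocr.lang) // lg-cat lg-eng
```

Because the state travels with the instance, it can be passed to another instance or inspected while debugging without tracking down where a global was last modified.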
Please review the current code and compare it with your commit.
The code was integrated into the pricua-v6_developer branch.
Feel free to comment or suggest something else. We will merge into the master branch at the end of this week (Friday 7 June 2024).
And thanks for improving Dédalo's features... :)
Best
Hi,
Thank you for incorporating my contribution to the official version of Dedalo. It is a pleasure to be able to contribute to the community.
And thanks also for the clarifications, I will keep them in mind for future developments.
Best!