
OCR with Google Vision and Tesseract


The Programming Historian has received the following tutorial on 'OCR with Google Vision API and Tesseract' by Isabelle Gribomont @isag91. This lesson is now under review and can be read at:

http://programminghistorian.github.io/ph-submissions/en/drafts/originals/ocr-with-google-vision-and-tesseract

Please feel free to use the line numbers provided on the preview if that helps with anchoring your comments, although you can structure your review as you see fit.

I will act as editor for the review process. My role is to solicit two reviews from the community and to manage the discussions, which should be held here on this forum. I have already read through the lesson and provided feedback, to which the author has responded.

Members of the wider community are also invited to offer constructive feedback, which should be posted to this message thread, but they are asked to first read our Reviewer Guidelines (http://programminghistorian.org/reviewer-guidelines) and to adhere to our anti-harassment policy (below). We ask that all reviews stop after the second formal review has been submitted so that the author can focus on any revisions. I will make an announcement on this thread when that has occurred.

I will endeavor to keep the conversation open here on Github. If anyone feels the need to discuss anything privately, you are welcome to email me.

Our dedicated Ombudsperson is Ian Milligan (http://programminghistorian.org/en/project-team). Please feel free to contact him at any time if you have concerns that you would like addressed by an impartial observer. Contacting the ombudsperson will have no impact on the outcome of any peer review.

Anti-Harassment Policy

This is a statement of the Programming Historian's principles and sets expectations for the tone and style of all correspondence between reviewers, authors, editors, and contributors to our public forums.

The Programming Historian is dedicated to providing an open scholarly environment that offers community participants the freedom to thoroughly scrutinize ideas, ask questions, make suggestions, or request clarification, but also provides a harassment-free space for all contributors to the project, regardless of gender, gender identity and expression, sexual orientation, disability, physical appearance, body size, race, age or religion, or technical experience. We do not tolerate harassment or ad hominem attacks on community participants in any form. Participants violating these rules may be expelled from the community at the discretion of the editorial board. Thank you for helping us to create a safe space.


[Permission to Publish]

@isag91: Would you please post the following statement in this thread?

I the author hereby grant a non-exclusive license to ProgHist Ltd to allow The Programming Historian English to publish the tutorial in this ticket (including abstract, tables, figures, data, and supplemental material) under a CC-BY license.

lizfischer avatar Feb 16 '22 16:02 lizfischer

(I have co-assigned myself here so that I can shadow the editorial process).

anisa-hawes avatar Feb 18 '22 11:02 anisa-hawes

Many thanks, Liz!

I the author hereby grant a non-exclusive license to ProgHist Ltd to allow The Programming Historian English to publish the tutorial in this ticket (including abstract, tables, figures, data, and supplemental material) under a CC-BY license.

isag91 avatar Feb 18 '22 12:02 isag91

Great, thanks @isag91! Here is my initial feedback.

I didn't have any issues getting the code to work or getting set up. I was able to follow the steps & use Google Cloud Vision to OCR some of my own PDFs and the ones you provided. What would be helpful is more explanation of the "why" that goes along with the "how" at each step of this process, most importantly when to use Google Vision, when to use just Tesseract, and when to combine the two. I think rooting the lesson more in an example could help on that front!

I wonder whether shifting the framing slightly would help with this. Instead of framing it as a lesson about how to use Vision, maybe it is more about how and when to combine Vision and Tesseract. Right now, the Tesseract part feels out of scope, since the title and first half of the lesson suggest it is only about Vision. But your comparisons of the two tools are what I think most people will find interesting and helpful, so I suggest leaning into them!

I've written up more specific feedback below, but I would be more than happy to talk through it sometime if you like.

Minor suggestions

  • [x] The first half of the lesson is very heavy on Google Cloud UI explanations. Since UIs are subject to frequent change, I suggest instead linking to Google Cloud documentation pages that explain how to do the tasks. For example, for the "create a Google Cloud project" step, you could link to this page: https://cloud.google.com/resource-manager/docs/creating-managing-projects#console
  • [x] Link to the documentation for the required Google Cloud libraries.
  • [ ] In the image captions, describe not just the source but the function of the image. Explain, for example, that select lines are highlighted (and what that highlight means)
  • [x] Change title to something like "OCR with the Google Cloud Vision API and Tesseract" to reflect the full name of the Google tool & the full scope of the lesson.

Bigger Suggestions

  • [x] Move the Comparisons section and the first paragraph of the "Combining Tesseract's..." section up into the introduction. Combine and expand both those and the "pros and cons" section to give a clearer picture of why someone would want to take this combined (or Vision-only) approach, when Tesseract alone is free.
  • [x] Related to that, Google Cloud products are tricky because they're "free" but with caveats. In the "Cons" section you talk about the number of pages per month you can send through Vision for free, and later on, you talk about the cost of using Cloud Storage. It sounds like the storage is not always free? A short dedicated section upfront on how to stay within the free tier would be really helpful.
  • [x] One potential way you could address these first two points is by expanding the introduction and breaking it into sub-sections that address:
    • The kinds of problems this lesson is useful for (maybe root this lesson more in whatever project the sample data comes from?)
    • Why a single tool by itself is not a good solution (the things each is good and bad at, which you do talk about some with regards to certain fonts & ligatures/diacritics)
    • Why combining these two tools is a good solution
    • The downside of Google Cloud (& how to keep yourself on the free tier)
  • [x] The section currently called "The code" could also use more sub-headings that break this problem into smaller steps. Along with that, break up the large code chunks to correspond with those steps-- some of the functions are very long, and it would be nice if those were walked through more slowly so it's easier to understand what each piece is doing. Right now, it seems to encourage just copy/pasting the batch_OCR_local_dir function.

Some questions I have

  • [x] How did you pick the batch size in the code block after paragraph 20? Is it important to have a small batch size? For me, this resulted in a lot of individual output files, which isn't what I was expecting. (See the sketch after this list for what I understand the batch size to control.)
  • [x] Why/when do we want to download the plaintext vs JSON? What might someone use the confidence value for? Are there other things in the JSON that might be useful?
  • [ ] What are some potential next steps after we've sent our documents through? Where does this fit into a project pipeline?
  • [x] When is combining the two tools a good thing to do? It wasn't clear to me from the example you gave that the combined Tesseract + Google Vision results were better than either tool on its own. For example, in the JHS Henry & Joan document, the Vision-only process gives:
    A BRIEF ACCOUNT OF THE EXAMINATION OF THE TOMB OF
    ING BENRY IV. IN THE CATHEDRAL OF CANTERBURY, AUGUST
    1, 1882.
    
    which is fairly faithful both to the text and its layout on the page. But for that same section, the Vision + Tesseract process gives:
     A BRIET
    RIEF ACCOUNT OF
    HE 1
    XAMINATION
    T THD 0
    OMB
    F
    ING EENRY IV. IN THE CATHEDRAL OF CANTERBURY, AUGUST
    1, 1832.
    
    Both have some transcription errors (neither got "HENRY"), but the combination output seems much worse-- it duplicated "BRIEF" and didn't get "EXAMINATION OF THE" correct, where the Vision-only output did. Can you walk through the outputs more, explain how to evaluate output quality and decide when to use a combination of tools, and talk about when it might not be useful to combine the two tools?
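To clarify the batch-size question above, here is a rough sketch (not necessarily the lesson's exact code, and with placeholder bucket paths) of where batch_size sits in a Vision async request. As far as I understand the API, it sets how many pages' results go into each output JSON file, which would explain the many small files I saw:

```python
from google.cloud import vision

client = vision.ImageAnnotatorClient()

# batch_size = pages per output JSON file; batch_size=2 on a 40-page PDF
# yields 20 separate JSON files in the bucket.
output_config = vision.OutputConfig(
    gcs_destination=vision.GcsDestination(uri="gs://YOUR-BUCKET/output/"),
    batch_size=2,
)

request = vision.AsyncAnnotateFileRequest(
    features=[vision.Feature(type_=vision.Feature.Type.DOCUMENT_TEXT_DETECTION)],
    input_config=vision.InputConfig(
        gcs_source=vision.GcsSource(uri="gs://YOUR-BUCKET/input.pdf"),
        mime_type="application/pdf",
    ),
    output_config=output_config,
)

operation = client.async_batch_annotate_files(requests=[request])
operation.result(timeout=300)  # block until the OCR job finishes
```

If a single output file per document is the goal, a larger batch_size (e.g. 100) should do it for most PDFs.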

Typos & small clarifications

Introduction
  • [x] first sentence, should read "confronted with"
  • [x] first sentence, remove "the" before analysis
  • [x] third sentence, remove "the" before Optical
  • [x] second paragraph, what does "optimal results" mean in this context?
  • [x] second paragraph, link to those other Google services?
Cons
  • [x] third bullet, "Although"
Upload data to Google Cloud Storage
  • [x] "Google storage" wasn't in the sidebar for me, it was called "Cloud Storage" and I had to click Browse instead of Create Bucket.
Google Vision API vs Tesseract
  • [x] I didn't understand the line number part of the second example: "The line numbers do not appear at the end of their respective lines, but are grouped together in the middle of the text." I don't see this in the example; can you clarify what you mean?
Combining
  • [x] Par 45 correct spelling of "Vision"
  • [x] Par 46, name of package "pdf2image"

lizfischer avatar Feb 21 '22 22:02 lizfischer

@isag91 and I have been in touch over email, but for the other folks watching this ticket: we're hoping to have a revised version by the end of the month 🙂

lizfischer avatar Mar 17 '22 19:03 lizfischer

@lizfischer Following the errors introduced by the combined method in one of the examples, I have been exploring another option to combine Tesseract and Google Vision. It's taking me some time to integrate it fully in the current workflow and I'm debating whether to keep the 'new' method only or introduce both. In any case, I will provide an updated version by the end of next week and we can discuss it then. Many thanks for your patience :).

isag91 avatar Mar 31 '22 17:03 isag91

Hi Liz,

Many thanks again for your great feedback, it is much appreciated. I have made substantial changes to the lesson as a result. I answer some of your comments below:

Minor suggestions

  • I have explained the function of the images in the body of the text instead of the caption, so that the same thing is not repeated three times.

Bigger suggestions

  • I have moved the comparisons into the introduction and included the output of the two combined methods in the comparison tables.
  • I have expanded the pros and cons lists and added more details about Storage pricing there.
  • I haven't rooted the tutorial in a specific project, since good-quality OCR is relevant for any type of text-based analysis, but I have tried to highlight the relevance of the tutorial more clearly.
  • As suggested, I have broken up the code into smaller steps.

Questions

  • Regarding the mistakes of the combined method, I have now modified it to solve this issue (and introduced a second way to combine the two tools).

Let me know if anything needs clarifying and if you have any further suggestions for improvement.

isag91 avatar Apr 14 '22 16:04 isag91

Sorry for the slow response on this @isag91! It's the very end of the semester here. A quick update, I've looked through this version and like the changes you've made! I tested the code and all seemed well on that front. I'm going to go through and make a couple of small edits for clarity and then we should be set to start peer review.

lizfischer avatar May 05 '22 17:05 lizfischer

Our first reviewer for this lesson will be Laura Mandell @mandellc320 !

As a reminder, our reviewer guidelines can be found here, and you can view the lesson here.

You can post your feedback as a comment in this thread. You'll find paragraph numbers on the right edge of the lesson preview, which can be helpful for referring to specific parts of the lesson. Looking forward to your review, and thank you for your time!

lizfischer avatar May 12 '22 20:05 lizfischer

Our second reviewer for this lesson will be Ryan Cordell @rccordell -- see the comment above for some helpful links!

lizfischer avatar May 31 '22 14:05 lizfischer

Hello all,

It was really wonderful reviewing this tutorial—I am always excited to learn about combined methods for OCR. With some caveats, which I will describe below, I was able to follow the tutorial to OCR PDFs of historical newspapers (the primary documents I study) with very promising results.

Lesson Framing

I think the introduction does a wonderful job of framing the need for combining OCR methods by outlining the strengths and drawbacks of Google Vision and Tesseract. Users will get a very clear sense of applications for this pipeline and the examples are very useful for grounding the technical discussions in a historical use case.

Stylistic Suggestions

  • [ ] In the Google Vision section, I found the rapid-fire "install this then that then that" really overwhelming, even though I do this kind of stuff a lot. I suspect that for a less experienced user the feeling would be even more acute. I think what's needed is some explanation of what each step is accomplishing. What are all these different services I'm activating? How might I use them beyond this single tutorial?
  • [ ] I am not usually a Python user—I do most of my programming in R—so take this with a grain of salt, but I found the "Python setup" section much more confusing than it should be. Trying to import the libraries kept throwing errors in the terminal until I decided to use Python through Anaconda, which in my case required doing lots of updates as I hadn't used either in a while. I was happy to doggedly persist and search Stack Overflow when I encountered errors, but a less experienced user might bail here. Perhaps there are some existing PH lessons that would get users right to where they need to be in order to jump in here? If so, those could be linked to at the beginning of this section.
  • [ ] The section on "JSON file outputs" is super interesting but could be introduced more clearly, e.g. "in addition to the OCR text files added to your local output folder, this process creates additional metadata in JSON files added to the online storage bucket." I might recommend including a screenshot showing what this would look like.
  • [ ] In the section after the packages are imported for the combined method, a series of functions is created. The lesson does a pretty good job of explaining what each function does, but with the rapid-fire "first create this function then create this function then create this function" it's easy for the narrative of what's happening to get a bit lost, particularly as the functions invoke each other—it's a bit of a nesting doll of functions! I think the lesson needs a bit more signposting, particularly calling out the way that functions call each other, so that the more complex operations are clear.
  • [ ] The explanation in the "Second combined method" section (beginning "Then, we can create a function that will use the JSON response files…") is a great model to adapt in other sections, like those where I've asked for more details in the above comments. It's detailed and explains what each step is doing and why. This kind of explanation helps each step feel motivated and transferable to other tasks, rather than seeming like code to simply copy and paste for this one lesson.

Technical Issues/Suggestions

  • [ ] In the Tesseract section, the lesson does not explain the eng+lat or eng+enm arguments, which can throw errors, particularly if users are using their own data. The former gave me the error OCR engine does not have language data for the following requested languages: lat (see the sketch after this list for how I checked which language packs were installed).
  • [ ] The beginning of the "First combined method" section encourages readers to install several packages and recommends using conda to do so, but the actual examples might confuse readers, particularly the line from PIL import Image—because while the package invoked is PIL, users have to install the pillow package. I know the prose says "pillow", but this section was still a bit confusing because the code doesn't match the prose. I was maybe working too fast, but I started to get frustrated when pip install PIL only threw errors—until I realized the installation should be pip install pillow. I would recommend clarifying this and perhaps suggesting specific installation methods.
  • [ ] Trying to run from tesserocr import PyTessBaseAPI after the tesseract package was installed via pip returned the error symbol not found in flat namespace '__ZN9tesseract11TessBaseAPID1Ev'. I was unable to fix this error for a while, though it seems related to running Tesseract on an M1-processor Mac. There is a bug report about this precise issue in the tesserocr Github repository, but the guidance is unhelpful & even aggressive: https://github.com/tesseract-ocr/tesseract/issues/3826. I finally uninstalled tesserocr via pip and installed it again via conda install -c conda-forge tesserocr and it seems to have worked, but I could imagine many people giving up at this point in the lesson—it took me a while to figure out what was causing the error because all the packages seemed to be installed fine.
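For what it's worth, here is roughly how I ended up checking which Tesseract language packs were installed before retrying (a sketch from my own troubleshooting, not code from the lesson):

```python
import subprocess

# Ask Tesseract which language models are installed. The first line of the
# output is a header ("List of available languages (N):"), so skip it; note
# that some older Tesseract versions print this list to stderr instead.
result = subprocess.run(["tesseract", "--list-langs"],
                        capture_output=True, text=True)
installed = (result.stdout or result.stderr).splitlines()[1:]

requested = ["eng", "lat"]  # what the lesson's eng+lat argument requires
missing = [lang for lang in requested if lang not in installed]

if missing:
    # e.g. Debian/Ubuntu: apt install tesseract-ocr-lat
    # macOS with Homebrew: brew install tesseract-lang (installs all packs)
    print("Missing language data:", missing)
else:
    subprocess.run(["ocrmypdf", "-l", "+".join(requested),
                    "input.pdf", "output.pdf"])
```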

External Issues/Suggestions

I wasn't sure precisely how to frame these comments, because neither is really tied to the specifics of this tutorial, but both are issues I encountered while working through it that others might also. I don't think these are errors for the authors of this tutorial to correct, but they might merit a "keep an eye out for" or something along those lines?

  • [ ] There were some details in the linked Working with batches of PDF files lesson that were unclear—I can clarify if useful. I know PH prefers not to duplicate instruction across lessons, but I did have to find some workarounds to the initial setup in that other lesson.
  • [ ] This is not really related to the quality of the lesson, but FWIW combined method 1 struggles if you're using historical newspapers as input—in most cases the OCR is a severely truncated version of the full page. I think the stacking method is getting confused by the tight columns—if I look at the sequence PDFs they include many places where one section is actually spread across multiple columns, rather than a single column down the full length of the file. I attached a photo just in case that doesn't make sense. Again, that's not really the fault of the lesson so much as an artifact of the data I chose to use while testing. Combined method 2 seems to work very well with newspapers, though.

rccordell avatar Jun 09 '22 19:06 rccordell

Thank you for your thoughtful review, @rccordell!

@isag91, don't feel you have to respond to this feedback right away. Once the second review has come in, I'll summarize the takeaways from both and we'll work from there.

lizfischer avatar Jun 10 '22 00:06 lizfischer

Hello @rccordell. My name is Anisa, I'm the PH Publishing Assistant. I note that you mention experiencing some difficulties with the lesson Working with batches of PDF files, and I'd be interested to hear more so that I can help to solve or correct the issue. If you have time, please send me a note: admin [at] programminghistorian.org. Thank you + very best wishes, Anisa

anisa-hawes avatar Jun 10 '22 13:06 anisa-hawes

@rccordell, many thanks for your great comments! I'm looking forward to working with your suggestions to improve the tutorial!

isag91 avatar Jun 22 '22 16:06 isag91

Our second reviewer will be @cneud 😃

As a reminder, our reviewer guidelines can be found here, and you can view the lesson here.

You can post your feedback as a comment in this thread. You'll find paragraph numbers on the right edge of the lesson preview, which can be helpful for referring to specific parts of the lesson. Looking forward to your review, and thank you for your time!

lizfischer avatar Aug 03 '22 15:08 lizfischer

Hi @isag91, hi @lizfischer, here is my review for this lesson:

General remarks

OCR for historical documents remains a complex task, and imo the lesson makes a very important point in that tailoring and tinkering with combinations of different tools is often required to determine suitable workflows and obtain useful results. E.g. the combination of different methods/tools for document layout analysis (detecting the text - or other - regions, lines, etc.) and for the actual text recognition is a very common practice to improve the quality of the OCR output.

However, this is also where I struggle a bit with the content of the lesson. In principle, there are alternatives to both Tesseract/Google Vision for layout analysis/segmentation that are more suitable for historical documents. E.g. OCR-D, or similar OCR frameworks that also have a GUI, like Transkribus, OCR4all or eScriptorium/Kraken, also offer the option to combine different methods and pretrained models for layout analysis and text recognition, but from within an integrated environment and thus requiring much less custom code to be written. On the other hand, this is the Programming Historian, and imo the lesson does a very good job in introducing highly relevant Python techniques for working with document images and data, or the Google APIs, on the basis of typical data for Digital Humanists. The code samples are all very well documented and really serve to illustrate core coding concepts. Overall, I found the lesson clearly written, well structured and straightforward to follow.

Edit: Being actively involved in some of these alternatives (OCR-D), I am definitely heavily biased here. So please feel free to ignore any remarks here! My suggestion would be to perhaps just mention that there are alternatives, but which have a higher learning curve, or are less robust/still in development.

I continue with comments and remarks per section.

Introduction

I very much like how the introduction frames the need for OCR across diverse use cases.

Being very picky, the statements

Google Cloud Vision is probably the best out-of-the-box tool when it comes to character recognition

and

method that will create high-quality OCR outputs for most documents

seem a bit bold to me, especially since there is no further reference or justification given. I believe it may be more helpful to follow the line of argument "there is no one size fits all solution" here as well, and rather stress the fact that different materials and use cases will typically require different approaches.

Pros and Cons

This comparison is super helpful and I really like how the benefits and drawbacks also include aspects such as sustainability and versatility.

Again,

[Google Vision] is probably the most accurate ‘out-of-the-box’ tool when it comes to character recognition

seems a bit too positive (also from my experience using it), perhaps one could also just say

[Google Vision] is one of the most accurate ‘out-of-the-box’ tools when it comes to character recognition

Also,

There is usually no need to develop and train your own model

applies equally to Google Vision, Tesseract, and most other OCR software. What may be more valuable to mention at least briefly here instead is whether the tools allow training or fine-tuning a recognition model. This is the case with most other tools, but not Google Vision. In a similar sense, the community will benefit more from examples and contributions for other, open-source, OCR tools and methods.

Something that I encounter a lot is the notion that Tesseract is somehow a "Google" product as in

Although it is also developed by Google, Tesseract is open source

While Ray Smith initially developed Tesseract at HP and now works at Google, since its release as open source, Tesseract's maintenance and development have shifted largely to voluntary community contributors, who I believe should get some recognition for their considerable efforts. E.g. the last commit to Tesseract by Ray Smith/Google dates back to 2018, and only 563 commits are from Google compared to 4964 by community contributors.

Edit: Being actively involved in some of these alternatives (OCR-D), I am definitely heavily biased here. So please feel free to ignore any remarks here! My suggestion would be to add something like "...is open source and actively maintained by a community." or similar.

Tesseract does not perform as well with complex characters

Here (and for the following paragraph) it might be illustrative to briefly explain what is considered a "complex" character in this scope - I assume e.g. ligatures or historical characters?

Combined methods

~~Before going into the comparison of examples, I would have liked to know what version and recognition model(s) were used to derive the results shown here. E.g. for Tesseract, it is possible to combine multiple recognition models in one run like eng+isl which in my test led to correct recognition of þ and æ with Tesseract, whereas umlauts are correctly recognized by addition of ger like eng+isl+ger.~~

Edit: I somehow completely missed that this usage is covered, due to skipping the "ocrmypdf" part of the process. Maybe the usage of this important language parameter could also be mentioned in the text below.

Also, two of the three examples are title pages, which may be a bit misleading, as these are generally the most difficult for Tesseract to recognize: it struggles more than Google Vision does with high variation of font size on a single page.

Lastly, I wonder if it is possible to highlight the differences in the outputs of both tools by coloring them. This would make it much easier to quickly spot the errors.
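For instance, difflib in the Python standard library can already locate the diverging tokens, which a few lines of HTML generation could then colour (a quick sketch, with sample strings borrowed from the earlier comment in this thread):

```python
import difflib

# Two OCR outputs of the same passage (samples from the comparison above)
vision_out = "A BRIEF ACCOUNT OF THE EXAMINATION OF THE TOMB"
combined_out = "A BRIET RIEF ACCOUNT OF HE 1 XAMINATION T THD 0 OMB"

# ndiff marks tokens unique to the first output with '-' and tokens unique
# to the second with '+'; these markers could map to coloured spans.
for token in difflib.ndiff(vision_out.split(), combined_out.split()):
    if token.startswith(("-", "+")):
        print(token)
```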

... columns always result in a completely erroneous output from Google Vision, since the tool rarely takes vertical text separations into account and reads across columns

This is a very important observation that demonstrates the impact of (in)correct layout analysis on the OCR result!

Preparation

Minor remark

However, too large a batch size could cause python to “crash” if your computer’s short-term memory gets overwhelmed

Python should be capitalized here.

Considering vocab: what is a computer's "short-term memory" - would "memory" or "RAM" not be more fitting?

Combining Tesseract’s layout recognition and Google Vision’s character recognition

This part is where the "juice" is, but it was also the most difficult to follow for me. If it were possible to somehow illustrate some of the steps in the process visually, I think it could help readers a lot to understand better what is being done, and why.

cutting up the text regions to re-arrange them vertically

Personally, I would at least mention here that one of the drawbacks of this method is that any mapping from the source facsimile/PDF to the resulting text is lost.

The method to combine the Google Vision output with the (normalised) coordinates from Tesseract is quite dense and sophisticated, with a lot of very important information packed into a few paragraphs. Especially the part on the coordinate normalisation could perhaps be broken down more, or moved into its own separate (sub)section, which would make it a bit easier to follow the process imo.

I believe the coordinate normalisation is also one of the major advantages of the second method and can have applications beyond the ones shown here (e.g. for further processing or visualization of results), which may be worthwhile mentioning.
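To make the general idea concrete for readers (this is only an illustration of the concept, not the lesson's actual code): normalisation expresses a pixel bounding box as fractions of the page size, so that boxes from differently sized renderings of the same page become directly comparable:

```python
def normalise_box(left, top, width, height, page_width, page_height):
    """Convert a pixel bounding box to page-relative (0-1) coordinates."""
    return (left / page_width, top / page_height,
            (left + width) / page_width, (top + height) / page_height)

# The same region from a 2000x3000 px scan and from a 1000x1500 px rendering
# normalises to identical values, so the two boxes can be matched.
print(normalise_box(200, 300, 400, 150, 2000, 3000))  # (0.1, 0.1, 0.3, 0.15)
print(normalise_box(100, 150, 200, 75, 1000, 1500))   # (0.1, 0.1, 0.3, 0.15)
```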

In the final code block before the conclusion, there is a typo:

output_dir_cm2='/PAHT/LOCAL/DIRECTORY/TO/combined_II_txt/'

should be

output_dir_cm2='/PATH/LOCAL/DIRECTORY/TO/combined_II_txt/'

Conclusion

As mentioned on the top, I slightly disagree with

when dealing with historical sources, it is rare to encounter tools that were designed with your material in mind

as there are indeed suitable alternatives (mentioned above) that - contrary to Tesseract and Google Vision - are developed specifically for historical documents. What are other reasons for preferring the Tesseract+Google Vision approach over these? I think it would be good to briefly reflect on this again after going through the full process. Is it that these alternatives are too difficult or complex to set up? Or is it due to other reasons?

Edit: Being actively involved in some of these alternatives, I am definitely heavily biased here. So please feel free to ignore any remarks here! I just think it is helpful to mention that there are community efforts (where people can e.g. participate). And I can also think of many good reasons to follow this lesson rather than, e.g. installing OCR-D ;)

However,

Therefore, it is often useful to consider how different tools can be made interoperable to create novel solutions.

this I wholeheartedly agree with!

Overall, I really liked the lesson and how deep it goes into the details of the process. It has extensive but very clear code examples, which really demonstrate how to create novel solutions by combining different methods and tools. While the lesson is quite ambitious wrt the coding, this is also what I see as the most valuable, as many of the concepts introduced here can be applied in various other applications and use cases that Digital Humanists will frequently encounter. Kudos!

Thank you for providing me with the option to review this lesson. I found it really interesting, and learned a lot from it. I hope my comments and remarks are helpful, and am at your disposal for any comments and questions you might have.


Edited: I have edited and added some remarks regarding my suggestions towards alternatives, some of which I am involved with, after realizing how biased this was. I am just too grateful for the many contributions and advances in OCR for historical documents that especially open and community driven projects have brought in recent years. But whether these are mentioned or not has certainly nothing to do with the excellent quality of this lesson. Sincere apologies for this mistake!

cneud avatar Aug 11 '22 13:08 cneud

Thank you to @cneud for your review! @isag91, I'll put together a summary of all feedback early next week :)

lizfischer avatar Aug 12 '22 17:08 lizfischer

@cneud Many thanks for your thoughtful comments, they are much appreciated!

isag91 avatar Aug 16 '22 15:08 isag91

@rccordell -- Do you know what version of Tesseract you were using? I'm wondering about the language support error you encountered.

lizfischer avatar Aug 24 '22 17:08 lizfischer

Sorry for the delay on this, @isag91! I took care of some of the typos our reviewers pointed out, so don't worry about those. In terms of what else to address, I've combined & grouped comments here:

Intro

  • [x] ¶2: Instead of "one-size-fits-all", perhaps "one-size-fits-most" is more accurate to your goal, since a big takeaway of the lesson is that there are many different approaches to try out.
  • [x] ¶2-4: Soften claims about Google Cloud Vision's efficacy-- instead of "the best" and "most accurate", maybe "one of the best", or more explanation of what it is best at. Also clarify what being the "most accurate" at character recognition means (I think you mean most accurate at recognizing individual characters, as opposed to being the most accurate at the entire task of OCR, which includes layout detection).
  • [x] ¶2-4: Clarify Tesseract's status as an open-source project, not really a Google product. (thank you @cneud for that clarification! I have wondered about this bit of history, myself!)
  • [x] ¶2-4: Explain what is meant by a "complex character"

Python Setup Section

  • [x] Consider changing the setup section to have all installs done via conda. Installing packages via plain ol' pip seems to cause problems (if I recall correctly, I also encountered this issue, though I don't think I flagged it before). It may be best to say that you recommend installing via conda, and to include those commands in the code blocks.
  • [x] Ensure all required packages are included in the install section-- currently, PIL is not. If PIL is only necessary for the combined method, adding a "setup" subsection there may make more sense.

Tesseract Section

  • [x] ¶20-23: Clarify the ocrmypdf language arguments-- maybe link to a relevant documentation page & mention briefly how users can change them to suit their own data.

Google Cloud Section

  • [x] ¶25: More explanation of the Google Cloud services that are being activated-- what are they, what function do they serve, why are we activating them?

JSON Outputs section

  • [x] ¶51: Clarify file locations in the JSON Output section.
  • [x] ¶52: Include a code block with a snippet of JSON between ¶51 and ¶52-- you do a good job of explaining the contents of the JSON in ¶52-55, but having an example in the body of the lesson will be helpful (see the sketch just below this list for the kind of thing I mean).
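Something along these lines could work (the file name is a hypothetical example; the structure follows the Vision batch output as documented by Google):

```python
import json

# Each output file in the bucket holds the responses for batch_size pages;
# "output-1-to-2.json" is a hypothetical example name.
with open("output-1-to-2.json") as f:
    data = json.load(f)

for response in data["responses"]:
    print("page:", response["context"]["pageNumber"])
    annotation = response["fullTextAnnotation"]
    print(annotation["text"][:100])  # the plain-text OCR for that page
    for page in annotation["pages"]:
        for block in page["blocks"]:
            # each block carries a confidence score between 0 and 1
            print("block confidence:", round(block["confidence"], 3))
```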

Combined Methods section

  • [x] ¶60 or 61: Briefly discuss drawbacks of rearranging the text vertically, especially the loss of mapping from source PDF to the text output
  • [x] ¶61-68: In the first combined methods section, explain the relationship between functions more thoroughly, which you do so well in the second combined method section.
  • [x] ¶71-74: Break coordinate normalization into its own little sub-section, or into smaller paragraphs.

lizfischer avatar Aug 24 '22 17:08 lizfischer

Many thanks for this very useful summary @lizfischer.

isag91 avatar Sep 03 '22 09:09 isag91

Hi Liz, Thanks for your patience. It's all done. In addition to your comments, I have reworded bits of the introduction to add some nuance about the purpose of this method. I'm also wondering if we shouldn't add "Python" to the title, since it is a main component of the tutorial. Maybe "OCR with Python, Google Vision and Tesseract" would be more accurate. Many thanks!

isag91 avatar Oct 06 '22 22:10 isag91

Thanks @isag91! I'll look over the changes this weekend and start this on the next stage of the process. I see your point about having Python in the title-- I'm going to ponder that for a minute, but I think you may be right.

lizfischer avatar Oct 14 '22 20:10 lizfischer

  • [x] Difficulty: Intermediate
  • [x] Activity: Transform
  • [x] Topics: APIs, Python, Data Manipulation
  • [x] Abstract: Google Vision and Tesseract are both popular and powerful OCR tools, but they each have their weaknesses. In this lesson, you will learn how to combine the two to make the most of their individual strengths and achieve even more accurate OCR results.
  • [x] Bio: Isabelle Gribomont is a Digital Humanities Researcher at the Royal Library of Belgium, and the Center for Natural Language Processing at UCLouvain.
  • [ ] ORCiD: https://orcid.org/0000-0001-7443-5849
  • [x] ~~Permission to Publish~~ [Now replaced by authorial copyright declaration form]

@anisa-hawes

lizfischer avatar Nov 07 '22 21:11 lizfischer

Thank you, @lizfischer!

I've updated the YAML header with the difficulty:, activity:, topics:, and abstract: you've supplied. I've also added in alt-text for the lesson image you've chosen, see avatar_alt:.

The next step of the workflow is copyediting. I'll also double-check the lesson's typesetting and replace any external links to the live web with perma.cc archival links. I will post a comment here to update you and @isag91 when the copyedits are complete and ready for review 🙂

When the copyedits have been agreed, I'll send Isabelle an authorial copyright declaration form to sign (this form replaces the Permission to Publish statement we've used in the past).

I'll then tag Alex to let him know the lesson is ready for his final read-through and checks.

As @isag91 is already a member of the Project Team, we don't need to add a new entry in the ph_authors.yml file this time, but I will ask Alex to add in Isabelle's orcid: 0000-0001-7443-5849 which isn't included at the moment.

@isag91, I note that your existing bio is slightly different to the one provided above. It currently reads: Isabelle Gribomont is a researcher in Digital Humanities at the University of Louvain and the Royal Library of Belgium. so just let us know if you'd like us to adjust it to include the extra details?

anisa-hawes avatar Nov 10 '22 13:11 anisa-hawes

Many thanks @anisa-hawes. The previous bio is fine too, there is no need to change it.

isag91 avatar Nov 21 '22 14:11 isag91

@anisa-hawes just checking in on this lesson. Is it ready for my review?

hawc2 avatar Dec 08 '22 15:12 hawc2

@anisa-hawes just checking in on this lesson. Is it ready for my review?

Thank you, @hawc2! Almost!

--

Hello @isag91 and @lizfischer ,

I've prepared the copyedits + generated perma.cc links for this lesson. You can review the changes I've made in the rich diff. Please let me know if you're happy with the adjustments? I have left a few comments, indicating where any additions or clarifications are needed.

You might have seen the Issue in Jekyll where we have recently agreed to commit to providing alt text for figure images. I've inserted a template alt="Visual description of figure image" where you can slot the descriptive alt-text in.

Next step is our authorial copyright declaration form which is an opportunity to acknowledge copyright and grant us permission to publish. I'll email you the form now @isag91.

Anisa

--

  • [x] Alt-text complete
  • [x] Authorial copyright declaration received

anisa-hawes avatar Dec 08 '22 17:12 anisa-hawes

Happy New Year, @hawc2!

Just a note to confirm that everything is complete from my side.

Sustainability + accessibility actions all done:

  • [x] Copyediting
  • [x] Typesetting
  • [x] Addition of Perma.cc links
  • [x] Addition of alt-text for all figures
  • [x] Receipt of authorial copyright agreement

Liz has already selected and uploaded an image for the lesson (incl. avatar_alt:).

  • [ ] @lizfischer Please could you prepare x2 Tweets for our Twitter Bot?

As Isabelle is already a member of the Project Team, we don't need to add an entry to ph_authors.yml. However, we would be grateful if you could add in orcid: 0000-0001-7443-5849.

Meanwhile, I'm aware that Liz and Isabelle had some thoughts about adjusting the title of this lesson. Have you all three discussed this/come to a decision elsewhere?

anisa-hawes avatar Jan 05 '23 16:01 anisa-hawes

Thanks @anisa-hawes. I'll take a look at this shortly and get back to everyone with any last thoughts/questions

hawc2 avatar Jan 09 '23 16:01 hawc2

@lizfischer will the title for this lesson be changed? If so we should figure that out now, as it could affect other things like names of images.

One thought I had while reading this lesson through is that the section headings are deeply nested, and it might be easier to read if there are more than two main sections ("Intro" and "OCR" as of right now). At the very least, it might help in a few places to give sections more descriptive headings. "OCR" strikes me as too cryptic, and would be more accessible to a general audience if it was at least listed as "Optical Character Recognition." Many sections could be more descriptively titled than that, so it's clear what different sections about Google Vision discuss and how they progress the lesson. Does that make sense?

hawc2 avatar Jan 10 '23 22:01 hawc2