tesseract.js
Version 4 Development and Changes
Overview
While bug fixes continue to be released for Version 3, all breaking changes will be released in Version 4, which is currently under development in the branch named dev/v4. This branch should already be usable by anyone eager to try the new features; however, there is no guarantee that additional breaking changes will not be introduced. Note that using this branch also requires using the Tesseract.js-core branch dev/v4.
Summary
Breaking Changes
- createWorker is now async
  - In most code this means worker = Tesseract.createWorker() should be replaced with worker = await Tesseract.createWorker()
  - Calling with invalid workerPath or corePath now produces error/rejected promise (#654)
- load is no longer needed (createWorker now returns worker pre-loaded)
- detect returns null values when OS detection fails rather than throwing error (#526)
- getPDF function replaced by pdf recognize option (#488)
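The changes to createWorker and load can be sketched as a small migration example. This is a minimal sketch rather than an official example: the helper name recognizeText is hypothetical, the Tesseract module is passed in as a parameter so the snippet does not depend on how the library is imported, and it assumes that loadLanguage, initialize, and terminate still behave as they do in Version 3.

```javascript
// Hypothetical helper; `Tesseract` is the imported tesseract.js module (dev/v4).
async function recognizeText(Tesseract, image, lang = 'eng') {
  // v4: createWorker returns a promise, so it must be awaited.
  // An invalid workerPath or corePath would reject here.
  const worker = await Tesseract.createWorker();
  try {
    // v4: no worker.load() call; the worker arrives pre-loaded.
    // Assumption: loadLanguage and initialize are still required, as in v3.
    await worker.loadLanguage(lang);
    await worker.initialize(lang);
    const { data } = await worker.recognize(image);
    return data.text;
  } finally {
    // Always release the worker, even if recognition fails.
    await worker.terminate();
  }
}
```

In an application this would be called as recognizeText(Tesseract, image) from inside an async function, with Tesseract being the imported library.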
Major New Features
- Processed images created by Tesseract can be retrieved using imageColor, imageGrey, and imageBinary options (#588)
  - See image-processing.html example for usage
- Image rotation options rotateAuto and rotateRadians have been added, which significantly improve accuracy on certain documents
  - See Issue #648 for an example of how auto-rotation improves accuracy
  - See image-processing.html example for usage of rotateAuto option
- Tesseract parameters (usually set using worker.setParameters) can now be set for single jobs using worker.recognize options (#665)
  - For example, a single job can be set to recognize only numbers using worker.recognize(image, {tessedit_char_whitelist: "0123456789"})
  - As these settings are reverted after the job, this allows for using different parameters for specific jobs when working with schedulers
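The per-job parameters and image options listed above might be used roughly as follows. This is a sketch under stated assumptions, not the project's own example: the helper names are hypothetical, worker is assumed to be an already initialized v4 worker, rotateAuto is assumed to go in the same options argument as other per-job parameters, and the processed image is assumed to come back on data.imageBinary.

```javascript
// Hypothetical helpers; `worker` is assumed to be an initialized v4 worker.

// Restrict a single job to digits without changing the worker's settings.
// The whitelist is reverted after the job, so later jobs are unaffected.
async function recognizeDigits(worker, image) {
  const { data } = await worker.recognize(image, {
    tessedit_char_whitelist: '0123456789',
  });
  return data.text;
}

// Request auto-rotation and the binarized image alongside the text.
// Assumption: rotateAuto is passed in the options (2nd) argument, and the
// processed image is returned on data.imageBinary.
async function recognizeRotated(worker, image) {
  const { data } = await worker.recognize(
    image,
    { rotateAuto: true },
    { imageBinary: true }
  );
  return { text: data.text, imageBinary: data.imageBinary };
}
```

Because each call carries its own options, the same two helpers could be pointed at any worker in a pool without first resetting parameters with worker.setParameters.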
Detail
New Output Format Interface
A single, unified interface has been added for specifying all output formats. output is now the 3rd argument to recognize (see example below). This replaces the separate getPDF function, as well as various setParameters options (tessjs_create_box, tessjs_create_hocr, tessjs_create_osd, tessjs_create_tsv, and tessjs_create_unlv).
const outputOpts = {
text: true,
blocks: true,
hocr: true,
tsv: true,
box: false,
unlv: false,
osd: false,
pdf: false,
imageColor: false,
imageGrey: false,
imageBinary: false
};
const res = await worker.recognize(files[0], undefined, outputOpts);
Note: the default output formats (text, blocks, hocr, and tsv) are not changing between v3 and v4, so this change only impacts users who want non-default options. This also means that users who want text and pdf outputs only need to specify {pdf: true}, as text is already a default.
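The text-plus-PDF case from the note can be sketched as a short helper. Treat it as illustrative only: the name recognizeToPdf is hypothetical, worker is assumed to be an already initialized v4 worker, and the PDF is assumed to be returned on data.pdf as an array of byte values.

```javascript
// Hypothetical helper; `worker` is assumed to be an initialized v4 worker.
async function recognizeToPdf(worker, image) {
  // text is a default output, so only pdf needs to be requested.
  // The 2nd argument (per-job options) is left undefined.
  const { data } = await worker.recognize(image, undefined, { pdf: true });
  // Assumption: data.pdf is an array of byte values; wrap it in a
  // Uint8Array so it can be written to a file or turned into a Blob.
  return { text: data.text, pdfBytes: Uint8Array.from(data.pdf) };
}
```

The returned pdfBytes can then be saved or offered for download in whatever way suits the environment, while text remains available without having been requested explicitly.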