tesseract
How to create a searchable PDF from an image?
Is it possible to create a searchable PDF from an image in code?
Check out #193. In short, you need to install the latest prerelease, since the current stable release doesn't have this functionality. Here's a usage example taken from the test cases:
TesseractEngine _engine; // preconfigured engine instance

public void CanRenderResultsIntoPdfFile()
{
    var resultPath = TestResultRunFile(@"ResultRenderers\PDF\phototest");
    using (var renderer = ResultRenderer.CreatePdfRenderer(resultPath, DataPath)) {
        var examplePixPath = this.TestFilePath("Ocr/phototest.tif");
        ProcessFile(renderer, examplePixPath);
    }

    // The output file is expected at resultPath with a "pdf" extension.
    var expectedOutputFilename = Path.ChangeExtension(resultPath, "pdf");
    Assert.That(File.Exists(expectedOutputFilename), $"Expected a PDF file \"{expectedOutputFilename}\" to have been created; but none was found.");
}

private void ProcessFile(IResultRenderer renderer, string filename)
{
    var imageName = Path.GetFileNameWithoutExtension(filename);
    using (var pix = Pix.LoadFromFile(filename)) {
        using (renderer.BeginDocument(imageName)) {
            // No page has been added yet.
            Assert.AreEqual(renderer.PageNumber, -1);
            using (var page = _engine.Process(pix, imageName)) {
                var addedPage = renderer.AddPage(page);
                Assert.That(addedPage, Is.True);
                Assert.That(renderer.PageNumber, Is.EqualTo(0));
            }
        }
        Assert.AreEqual(renderer.PageNumber, 0);
    }
}
Hello, first of all, thank you very much for your hard work. I tested this and it works fine.
Regards
I must misunderstand the API. An incomplete PDF is generated by the following example. Where did I go wrong?
using (IResultRenderer renderer = Tesseract.ResultRenderer.CreatePdfRenderer(@"Output", @"Data\tessdata\"))
{
    using (renderer.BeginDocument("PDF Test"))
    {
        string configurationFilePath = "Data";
        using (TesseractEngine engine = new TesseractEngine(configurationFilePath, "eng", EngineMode.TesseractAndCube))
        {
            string tifFile = @"C:\example.tif";
            using (Pix img = Pix.LoadFromFile(tifFile))
            {
                using (Page page = engine.Process(img))
                {
                    renderer.AddPage(page);
                }
            }
        }
    }
}
What apparently kept my original example from working was that it did not pass a second parameter to the engine's Process method:
using (Pix img = Pix.LoadFromFile(tifFile))
{
    using (Page page = engine.Process(img, "PDF Test")) // NEEDS SECOND PARAMETER.
    {
        renderer.AddPage(page);
    }
}
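Putting the pieces from this thread together, a multi-page version might look roughly like the sketch below. The class name, method name, and parameters are hypothetical and only for illustration. It assumes the two-argument CreatePdfRenderer overload used in the examples above (other releases may have a different signature) and that the fonts live in the same tessdata directory as the language data.

```csharp
using System;
using System.Collections.Generic;
using Tesseract;

public static class SearchablePdfSketch
{
    // Hypothetical helper: OCRs each image and adds it as one page of a single PDF.
    // outputBase is the output path without an extension; in the test case above
    // the resulting file is expected at that path with a "pdf" extension.
    public static void Render(string tessdataDir, string outputBase, string title, IEnumerable<string> imagePaths)
    {
        using (var engine = new TesseractEngine(tessdataDir, "eng", EngineMode.Default))
        using (var renderer = ResultRenderer.CreatePdfRenderer(outputBase, tessdataDir))
        using (renderer.BeginDocument(title))
        {
            foreach (var imagePath in imagePaths)
            {
                using (var pix = Pix.LoadFromFile(imagePath))
                using (var page = engine.Process(pix, title)) // pass the second argument, as noted above
                {
                    if (!renderer.AddPage(page))
                    {
                        throw new InvalidOperationException("Could not add page for " + imagePath);
                    }
                }
            }
        }
    }
}
```

Each page is disposed before the next image is processed, and the document is closed before the renderer and engine are disposed, which mirrors the nesting in the examples above.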
Hi Charles, I can't get this function to work. It throws an InvalidOperationException because the native call in this check inside the BeginDocument method returns 0: if (Interop.TessApi.Native.ResultRendererBeginDocument(Handle, titlePtr) == 0)
Code I'm testing is as follows.
public void ConvertToPDF()
{
    using (IResultRenderer renderer = Tesseract.ResultRenderer.CreatePdfRenderer(@"E:\Convert\output", @"./tessdata"))
    {
        using (renderer.BeginDocument("test"))
        {
            string configurationFilePath = @"./tessdata";
            using (TesseractEngine engine = new TesseractEngine(configurationFilePath, "eng", EngineMode.TesseractAndCube))
            {
                using (var tifFile = new Bitmap(@"E:\Convert\Page-1.tif"))
                {
                    using (var img = PixConverter.ToPix(tifFile))
                    {
                        using (var page = engine.Process(img, "test"))
                        {
                            renderer.AddPage(page);
                        }
                    }
                }
            }
        }
    }
}
Hello Doug,
I was experiencing the same problem you're describing. The PDF renderer needs some font files; the documentation says these are probably in your tessdata folder. That wasn't the case for me: I didn't have those files. We fixed it by copying the PDF font files from the unit test directory in the develop branch of this repo. It works like a charm now.
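A small pre-flight check can turn this failure into a clearer error message. The sketch below is hypothetical (the class and method names are made up) and assumes the font file is named pdf.ttf and sits directly in the font directory passed to CreatePdfRenderer; adjust the file name if your setup differs.

```csharp
using System.IO;

public static class PdfFontCheck
{
    // Assumption: the renderer looks for "pdf.ttf" directly inside fontDirectory.
    public static bool HasPdfFont(string fontDirectory)
    {
        return File.Exists(Path.Combine(fontDirectory, "pdf.ttf"));
    }
}
```

Calling PdfFontCheck.HasPdfFont(@"./tessdata") before creating the renderer lets you report a missing font file yourself rather than hitting the failure inside BeginDocument.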
I tried generating a searchable PDF with Tesseract OCR, but the generated PDF contains only the text; the images are hidden.
I included the pdf config file and pdf.ttf in the tessdata folder.
Please help me with this: why is the engine rendering a searchable PDF with text and an invisible image?
using (IResultRenderer renderer = Tesseract.PdfResultRenderer.CreatePdfRenderer(@"D:\out18", @"C:\tessdata\"))
{
    using (renderer.BeginDocument("Serachablepdftest"))
    {
        string configurationFilePath = @"C:\tessdata";
        string configfile = Path.Combine(@"C:\tessdata", "pdf");
        using (TesseractEngine engine = new TesseractEngine(configurationFilePath, "eng", EngineMode.TesseractAndLstm, configfile))
        {
            using (var imagefile = new Bitmap(@"C:\file-page1.jpg"))
            {
                using (var img = PixConverter.ToPix(imagefile))
                {
                    using (var page = engine.Process(img, "Serachablepdftest"))
                    {
                        renderer.AddPage(page);
                    }
                }
            }
        }
    }
}
I'm unable to load the IResultRenderer class, it seems unable to find it. Can anyone assist?
Are you using the latest prerelease?
Hi, make sure you've included the PDF font file in your tessdata directory and that you're using the latest prerelease of this project.
On Tue., 13 Mar. 2018, 03:50, mechachi wrote:
I'm trying to test this functionality but keep running into a 'Failed to begin document' error on renderer.BeginDocument. I'm sure it's something I'm missing or doing wrong, but I'm going in circles at this point. If anyone could chime in, it would be helpful. I'm using charlesw's sample code from Dec 5 in this thread.
- @charlesw, maybe this should be closed.
- I'd also like to be able to write the PDF to a stream rather than to a file location.
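Until something like that exists, one possible workaround is to render to a temporary file and copy it into the stream. This is only a sketch, not part of the library: the class and method names are hypothetical, and it assumes the rendering callback writes its output to the given base path plus ".pdf", as the renderer examples in this thread do.

```csharp
using System;
using System.IO;

public static class PdfStreamSketch
{
    // renderToFile receives an output base path (no extension) and is
    // assumed to produce "<base>.pdf", e.g. via CreatePdfRenderer(base, fontDir).
    public static void RenderToStream(Action<string> renderToFile, Stream destination)
    {
        string outputBase = Path.Combine(Path.GetTempPath(), Guid.NewGuid().ToString("N"));
        string pdfPath = outputBase + ".pdf";
        try
        {
            renderToFile(outputBase);
            using (var source = File.OpenRead(pdfPath))
            {
                source.CopyTo(destination);
            }
        }
        finally
        {
            // Always remove the temporary file, even if rendering failed.
            if (File.Exists(pdfPath)) File.Delete(pdfPath);
        }
    }
}
```

The callback would wrap the usual renderer code, so the caller only deals with a Stream (for example a MemoryStream or an HTTP response body) while the temporary file stays an implementation detail.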
