How to create a searchable PDF from an image?

Open ALVERNEPAIVA opened this issue 8 years ago • 14 comments

Is it possible to create a searchable PDF from an image in code?

ALVERNEPAIVA avatar Dec 05 '16 01:12 ALVERNEPAIVA

Check out #193. In short, you need to install the latest prerelease, since the current stable release doesn't have this functionality. Here's a usage example taken from the test cases:

TesseractEngine _engine; // preconfigured engine instance

public void CanRenderResultsIntoPdfFile()
{
	var resultPath = TestResultRunFile(@"ResultRenderers\PDF\phototest");
	using (var renderer = ResultRenderer.CreatePdfRenderer(resultPath, DataPath)) {
		var examplePixPath = this.TestFilePath("Ocr/phototest.tif");
		ProcessFile(renderer, examplePixPath);
	}

	var expectedOutputFilename = Path.ChangeExtension(resultPath, "pdf");
	Assert.That(File.Exists(expectedOutputFilename), $"Expected a PDF file \"{expectedOutputFilename}\" to have been created, but none was found.");
}

private void ProcessFile(IResultRenderer renderer, string filename)
{
	var imageName = Path.GetFileNameWithoutExtension(filename);
	using (var pix = Pix.LoadFromFile(filename)) {
		using (renderer.BeginDocument(imageName)) {
			Assert.AreEqual(-1, renderer.PageNumber); // expected value goes first
			using (var page = _engine.Process(pix, imageName)) {
				var addedPage = renderer.AddPage(page);

				Assert.That(addedPage, Is.True);
				Assert.That(renderer.PageNumber, Is.EqualTo(0));
			}
		}

		Assert.AreEqual(0, renderer.PageNumber); // expected value goes first
	}
}

charlesw avatar Dec 05 '16 21:12 charlesw

Hello, first of all, thank you very much for your hard work. I tested this and it works fine.

Regards

cesarDevOp avatar Jan 21 '17 20:01 cesarDevOp

I must be misunderstanding the API. The following example generates an incomplete PDF. Where did I go wrong?

using (IResultRenderer renderer = Tesseract.ResultRenderer.CreatePdfRenderer(@"Output", @"Data\tessdata\"))
{
    using (renderer.BeginDocument("PDF Test"))
    {
        string configurationFilePath = "Data";
        using (TesseractEngine engine = new TesseractEngine(configurationFilePath, "eng", EngineMode.TesseractAndCube))
        {
            string tifFile = @"C:\example.tif";
            using (Pix img = Pix.LoadFromFile(tifFile))
            {
                using (Page page = engine.Process(img))
                {
                    renderer.AddPage(page);
                }
            }
        }
    }
}

tdhintz avatar Mar 02 '17 17:03 tdhintz

Apparently my original example failed because it did not pass a second parameter to the engine's Process method:

using (Pix img = Pix.LoadFromFile(tifFile))
{
    using (Page page = engine.Process(img, "PDF Test")) // NEEDS SECOND PARAMETER.
    {
        renderer.AddPage(page);
    }
}

tdhintz avatar Mar 03 '17 17:03 tdhintz

Hi Charles, I can't get this function to work. It throws an InvalidOperationException because the native call in this check inside the BeginDocument method returns 0: if (Interop.TessApi.Native.ResultRendererBeginDocument(Handle, titlePtr) == 0)

The code I'm testing is as follows:

public void ConvertToPDF()
{
    using (IResultRenderer renderer = Tesseract.ResultRenderer.CreatePdfRenderer(@"E:\Convert\output", @"./tessdata"))
    {
        using (renderer.BeginDocument("test"))
        {
            string configurationFilePath = @"./tessdata";
            using (TesseractEngine engine = new TesseractEngine(configurationFilePath, "eng", EngineMode.TesseractAndCube))
            {
                using (var tifFile = new Bitmap(@"E:\Convert\Page-1.tif"))
                {
                    using (var img = PixConverter.ToPix(tifFile))
                    {
                        using (var page = engine.Process(img, "test"))
                        {
                            renderer.AddPage(page);
                        }
                    }
                }
            }
        }
    }
}

DougHardy avatar Mar 24 '17 04:03 DougHardy

Hello Doug,

I was experiencing the same problem you're describing. The PDF renderer needs some font files; the documentation says these are probably in your tessdata folder. That wasn't the case for me: I didn't have those files. We fixed it by copying the PDF font files from the unit test directory in the develop branch of this repo. Works like a charm now.
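
A quick pre-flight check can surface the missing font file before BeginDocument fails. This is a minimal sketch, assuming the renderer looks for pdf.ttf (the font file named later in this thread) in the directory passed as the second argument to CreatePdfRenderer. The PdfFontCheck class and HasPdfFont method are hypothetical names, not part of the Tesseract wrapper.

```csharp
using System;
using System.IO;

// Hypothetical helper, not part of the Tesseract wrapper.
static class PdfFontCheck
{
    // Returns true when pdf.ttf exists directly inside the given tessdata directory.
    // Assumption: this is the directory later handed to CreatePdfRenderer.
    public static bool HasPdfFont(string tessdataDir)
    {
        return File.Exists(Path.Combine(tessdataDir, "pdf.ttf"));
    }
}
```

Calling PdfFontCheck.HasPdfFont(@"./tessdata") before creating the renderer lets you report a clear "font file missing" message instead of the generic failure.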

pasmanh avatar Mar 30 '17 07:03 pasmanh

out18.pdf file-page1

I tried generating a searchable PDF with Tesseract OCR, but the generated PDF contains the text while the image is hidden.

daddy1989 avatar Apr 18 '17 06:04 daddy1989

I included the pdf config file and pdf.ttf in the tessdata folder.

daddy1989 avatar Apr 18 '17 06:04 daddy1989

Please help me with this: why is the engine rendering a searchable PDF with text but an invisible image?

daddy1989 avatar Apr 18 '17 06:04 daddy1989

using (IResultRenderer renderer = Tesseract.PdfResultRenderer.CreatePdfRenderer(@"D:\out18", @"C:\tessdata\"))
{
    using (renderer.BeginDocument("Serachablepdftest"))
    {
        string configurationFilePath = @"C:\tessdata";
        string configfile = Path.Combine(@"C:\tessdata", "pdf");
        using (TesseractEngine engine = new TesseractEngine(configurationFilePath, "eng", EngineMode.TesseractAndLstm, configfile))
        {
            using (var imagefile = new Bitmap(@"C:\file-page1.jpg"))
            {
                using (var img = PixConverter.ToPix(imagefile))
                {
                    using (var page = engine.Process(img, "Serachablepdftest"))
                    {
                        renderer.AddPage(page);
                    }
                }
            }
        }
    }
}

daddy1989 avatar Apr 18 '17 06:04 daddy1989

I'm unable to load the IResultRenderer type; it seems it can't be found. Can anyone assist?

kmaher9 avatar Feb 02 '18 23:02 kmaher9

Are you using the latest prerelease?

charlesw avatar Feb 03 '18 02:02 charlesw

Hi, ensure you've included the PDF font file in your tessdata directory and that you're using the latest prerelease of this project.

On Tue., 13 Mar. 2018, 03:50 mechachi wrote:

I'm trying to test this functionality but keep running into a 'Failed to begin document' error on renderer.BeginDocument. I'm sure it's something I'm missing or doing wrong, but I'm going in circles at this point. If anyone could chime in, it would be helpful. I'm using charlesw's sample code from Dec 5 in this thread.

charlesw avatar Mar 12 '18 20:03 charlesw

  • @charlesw, maybe this should be closed
  • I'd also like to be able to write the PDF to a stream rather than a file location
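
Until the renderer can write to a stream directly, one possible workaround is to render to a temporary file and load it into memory. The sketch below assumes, as in the test case earlier in this thread, that the renderer writes to the given base path with a .pdf extension appended. PdfToStream, RenderToStream, and the renderToBasePath callback are hypothetical names; the callback stands in for the CreatePdfRenderer, BeginDocument, and AddPage code shown earlier in this thread.

```csharp
using System;
using System.IO;

// Hypothetical helper, not part of the Tesseract wrapper.
static class PdfToStream
{
    // Invokes renderToBasePath with a temporary base path, then loads
    // "<basePath>.pdf" into memory and deletes the temporary file.
    public static MemoryStream RenderToStream(Action<string> renderToBasePath)
    {
        string basePath = Path.Combine(Path.GetTempPath(), Guid.NewGuid().ToString("N"));
        string pdfPath = basePath + ".pdf"; // assumption: renderer appends ".pdf"
        try
        {
            renderToBasePath(basePath);
            return new MemoryStream(File.ReadAllBytes(pdfPath));
        }
        finally
        {
            // Clean up whether or not rendering succeeded.
            if (File.Exists(pdfPath)) File.Delete(pdfPath);
        }
    }
}
```

The whole PDF is held in memory, so this suits small documents; the renderer must be disposed inside the callback so the file is complete before it is read.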

lastlink avatar Jul 06 '21 13:07 lastlink