pdfsearch

A full text search library for PDFs.

Pure Go Full Text Search of PDF Files

This library implements full text search for PDFs.

  • The public APIs are in index_search.go.

There are some command-line programs that demonstrate the library's functionality.

  • examples/pdf_search_demo.go demonstrates the main APIs.
  • examples/index.go builds an index over a set of PDFs.
  • examples/search.go searches the index built by examples/index.go.

Binary versions (executables) of these three programs are available in releases. There are 64-bit binaries for Windows, Mac and Linux. The binaries do not require a UniDoc license.

Installation

git clone https://github.com/PaperCutSoftware/pdfsearch

Replace uniDocLicenseKey and companyName in unidoc_glue.go with valid UniDoc license fields.

cd pdfsearch/examples
go build pdf_search_demo.go
go build index.go
go build search.go

examples/pdf_search_demo.go

Usage: ./pdf_search_demo -f <PDF path> <search term>

Example: ./pdf_search_demo -f PDF32000_2008.pdf cubic Bézier curve

The example will search PDF32000_2008.pdf for cubic Bézier curve.
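A command line of this shape could be parsed with Go's standard flag package, as in the minimal sketch below. It is an illustration only, not the code in pdf_search_demo.go: the parseArgs helper is hypothetical, and joining the remaining arguments into one search term with spaces is an assumption.

```go
package main

import (
	"errors"
	"flag"
	"fmt"
	"os"
	"strings"
)

// parseArgs is a hypothetical helper. It reads the PDF path from -f and
// joins all remaining arguments into a single search term.
func parseArgs(args []string) (pdfPath, term string, err error) {
	flags := flag.NewFlagSet("pdf_search_demo", flag.ContinueOnError)
	flags.StringVar(&pdfPath, "f", "", "path of the PDF to search")
	if err := flags.Parse(args); err != nil {
		return "", "", err
	}
	// Assumption: words after the flags form one multi-word search term.
	term = strings.Join(flags.Args(), " ")
	if pdfPath == "" || term == "" {
		return "", "", errors.New("usage: pdf_search_demo -f <PDF path> <search term>")
	}
	return pdfPath, term, nil
}

func main() {
	pdfPath, term, err := parseArgs(os.Args[1:])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(2)
	}
	fmt.Printf("search %q for %q\n", pdfPath, term)
}
```

With this approach the search term does not need to be quoted in the shell, which matches the unquoted form used in the example above.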

pdf_search_demo.go shows how to use the APIs in index_search.go to

  • create indexes over PDFs,
  • search those indexes using full-text search, and
  • mark up PDFs with the locations of the search matches on pages.
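The first two steps can be illustrated with a toy in-memory inverted index over page text. This sketch uses only the standard library and is not the API in index_search.go: the PageRef and Index types are hypothetical, text extraction from PDFs is assumed to have happened already, and a page matches only if it contains every query term, which is a simplification of real full-text search. Marking up match locations on pages is omitted.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
	"unicode"
)

// PageRef identifies one page of one document (hypothetical type).
type PageRef struct {
	Path string
	Page int // 1-based page number
}

// Index maps each lower-cased term to the set of pages that contain it.
type Index map[string]map[PageRef]bool

// tokenize lower-cases text and splits it on runes that are neither
// letters nor digits.
func tokenize(text string) []string {
	return strings.FieldsFunc(strings.ToLower(text), func(r rune) bool {
		return !unicode.IsLetter(r) && !unicode.IsDigit(r)
	})
}

// Add indexes the already-extracted text of one page.
func (ix Index) Add(ref PageRef, text string) {
	for _, term := range tokenize(text) {
		if ix[term] == nil {
			ix[term] = map[PageRef]bool{}
		}
		ix[term][ref] = true
	}
}

// Search returns the pages that contain every query term, sorted by
// path and then by page number.
func (ix Index) Search(query string) []PageRef {
	terms := tokenize(query)
	if len(terms) == 0 {
		return nil
	}
	var hits []PageRef
	for ref := range ix[terms[0]] {
		all := true
		for _, t := range terms[1:] {
			if !ix[t][ref] {
				all = false
				break
			}
		}
		if all {
			hits = append(hits, ref)
		}
	}
	sort.Slice(hits, func(i, j int) bool {
		if hits[i].Path != hits[j].Path {
			return hits[i].Path < hits[j].Path
		}
		return hits[i].Page < hits[j].Page
	})
	return hits
}

func main() {
	ix := Index{}
	ix.Add(PageRef{"a.pdf", 1}, "A cubic Bézier curve has four control points.")
	ix.Add(PageRef{"a.pdf", 2}, "Line segments are simpler than any curve.")
	ix.Add(PageRef{"b.pdf", 1}, "Cubic polynomials.")
	for _, hit := range ix.Search("cubic Bézier curve") {
		fmt.Printf("%s page %d\n", hit.Path, hit.Page) // prints "a.pdf page 1"
	}
}
```

Keying the index by page rather than by document is what would let a later step report which page of a PDF a match falls on.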

examples/index.go

Usage: ./index <file pattern>

Example: ./index ~/climate/**/*.pdf

The example creates an on-disk index over the PDFs in ~/climate/ and its subdirectories.
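Note that recursive `**` expansion depends on the shell: zsh supports it by default, bash only with the globstar option enabled, and Go's filepath.Glob treats `**` like a single `*`. Where recursive expansion is not available, a directory walk can collect the PDFs instead. The sketch below is one way to do that with the standard library; the isPDF and findPDFs helpers are hypothetical and not part of examples/index.go.

```go
package main

import (
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
	"strings"
)

// isPDF reports whether path has a .pdf extension, ignoring case.
func isPDF(path string) bool {
	return strings.EqualFold(filepath.Ext(path), ".pdf")
}

// findPDFs is a hypothetical helper that walks root recursively and
// returns the paths of all regular entries with a .pdf extension.
func findPDFs(root string) ([]string, error) {
	var paths []string
	err := filepath.WalkDir(root, func(path string, d fs.DirEntry, err error) error {
		if err != nil {
			return err // stop on the first unreadable entry
		}
		if !d.IsDir() && isPDF(path) {
			paths = append(paths, path)
		}
		return nil
	})
	return paths, err
}

func main() {
	root := "."
	if len(os.Args) > 1 {
		root = os.Args[1]
	}
	paths, err := findPDFs(root)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	for _, p := range paths {
		fmt.Println(p)
	}
}
```

The printed paths could then be passed to the indexer as explicit arguments instead of a pattern.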

examples/search.go

Usage: ./search <search term>

Example: ./search integrated assessment model

The example searches the on-disk index created by examples/index.go for integrated assessment model.

Libraries

index_search.go uses UniDoc for PDF parsing and bleve for search.

Talks about this library

GopherCon AU 2019