🔎 📖 ✨ Custom, private search engine for text documents built with NextJS/React/ES6/ES7

Aspen

Aspen lets you search a large corpus of plain text files via the browser.

  • Powerful search query support through Elasticsearch query string syntax (see the examples after this list)
  • Performs some basic cleanup of plaintext data and can extract document titles
  • Responsive UI that works on mobile
  • Runs in Docker
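
Queries use Elasticsearch's query string syntax, so standard operators like these should work:

  • dracula AND harker (boolean operators)
  • "dead undead"~3 (proximity search)
  • frank* (wildcards)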

Getting Started using Docker Compose

1. Collect your documents

Put all your files in one place, like ~/ebooks/:

$ tree ~/ebooks
/Users/ian/ebooks
└── Project\ Gutenberg/
    ├── Beowulf.txt
    ├── Dracula.txt
    ├── Frankenstein.txt

2. Run Aspen & Elasticsearch

$ docker-compose up -d
Creating network "aspen_default" with the default driver
Creating elasticsearch ... done
Creating aspen         ... done
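
To confirm Elasticsearch came up, you can hit its HTTP API directly (this assumes the compose file publishes Elasticsearch on its default port, 9200):

$ curl http://localhost:9200/

A JSON response ending with Elasticsearch's "You Know, for Search" tagline means it's ready.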

3. Convert any non-plaintext (PDFs, MS Word) documents to plaintext

Use the included convert utility, which wraps Apache Tika, to convert them to plaintext. Pass it a filename relative to your data directory:

$ ls ~/ebooks
Project Gutenberg Test.docx

$ docker-compose run aspen convert Test.docx
Starting elasticsearch ... done
Test.docx doesn't exist, trying /data/Test.docx
Creating /data/Test.txt...
...
OK

$ ls ~/ebooks
Project Gutenberg Test.docx         Test.txt
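
convert handles one file at a time, so a small shell loop helps with a directory full of documents (a sketch; run it from your data directory so paths stay relative, and adjust the extensions to match your files):

$ cd ~/ebooks
$ for f in *.docx *.pdf; do docker-compose run aspen convert "$f"; done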

4. Import content into Elasticsearch

Start by resetting Elasticsearch to make sure everything is working:

$ docker-compose run aspen es-reset
Starting elasticsearch ... done
Results from DELETE: { acknowledged: true }
✓ Done.

Now import all .txt documents. The import script will try to figure out the title of each document automatically:

$ docker-compose run aspen import
Starting elasticsearch ... done
→ Base directory is /app/public/data
▲ Ignoring non-text path: Test.docx
→ Test.txt → Test Document
→ Project Gutenberg/Beowulf.txt → The Project Gutenberg EBook of Beowulf
→ Project Gutenberg/Dracula.txt → The Project Gutenberg EBook of Dracula, by Bram Stoker
→ Project Gutenberg/Frankenstein.txt → Project Gutenberg's Frankenstein, by Mary Wollstonecraft (Godwin) Shelley
✓ Done!

You can also run import with a directory or file name relative to the data directory, e.g. import Project\ Gutenberg or import Project\ Gutenberg/Dracula.txt. For instance, to import just the Project Gutenberg folder:
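
$ docker-compose run aspen import Project\ Gutenberg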

Sometimes plaintext documents act strangely. Maybe bin/import can't extract a title or maybe the search highlights are off. The file might have the wrong line endings or one of those annoying UTF-8 BOM headers. Try running dos2unix on your text files to fix them.
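
For example, to normalize line endings on every text file in place (a sketch; point find at your own data directory):

$ find ~/ebooks -name '*.txt' -exec dos2unix {} +

...then re-run the es-reset and import steps above.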

5. Done!

Go to http://localhost:3000/ and start searching!

Development Setup

1. Install dependencies

It's easiest to use Elasticsearch via Docker.

You can get Node and Yarn via Homebrew on macOS, or you can download Node.js v8.5 or later and run npm install -g yarn to get Yarn.

For document conversion (bin/convert) you'll want:

  1. Apache Tika
  2. UnRTF
  3. Par

On macOS you can brew install node tika unrtf par.

2. Clone the repo

$ git clone [email protected]:statico/aspen.git
$ cd aspen
$ yarn install

3. Set up Elasticsearch and import your data

See steps 1-4 of the "Getting Started using Docker Compose" section above. In short, get your text files together in one place, set up Elasticsearch, and import them with the bin/import command.
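
With Elasticsearch already running locally, that boils down to something like this (a rough sketch: bin/es-reset is assumed to be the local counterpart of the es-reset command used above, and the data directory is assumed to live at public/data as in the Docker setup):

$ ./bin/es-reset   # assumed local equivalent of `docker-compose run aspen es-reset`
$ ./bin/import     # walks the data directory and indexes every .txt file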

4. Start the web app

Aspen is built using Next.js, which is Node + ES6 + Express + React + hot reloading + lots more. Simply run:

$ yarn run dev

...and go to http://localhost:3000

If you are working on server.js and want automatic server restarting, do:

$ yarn global add nodemon
$ nodemon -w server.js -w lib -x yarn -- run dev

Development Notes

  • This started as an Angular 1 + CoffeeScript example. I recently migrated it to use Next.js, ES6 and React. You can view a full diff here.
  • I'm still using Elasticsearch 1.7 because I haven't bothered to learn the newer versions.

Links