sparse_ngrams
Search index algorithm for GitHub code search
sparse_ngrams: GitHub code search indexing
Work in progress
sparse_ngrams is a C++ library of substring and regexp search
algorithms that scale to code search indexing and are used in GitHub Codesearch. It is intended to
reduce indexing and query response times compared to zoekt (which is used by Sourcegraph) and Russ Cox's trigram search. The solution is meant to scale to billions of lines of code with <100ms latency. More on the code search project is TBD.
- Easy: A first-class, easy-to-use dependency with carefully documented APIs.
- Fast: We care about the speed of the algorithms and provide reasonable implementations.
- Well tested: We test all algorithms with a unified framework, under sanitizers and fuzzing.
- Benchmarked: We gather benchmarks for all implementations to better understand where each performs well or poorly.
Table of Contents
- Quick Start
- Testing
- Documentation
- License
Quick Start
You can use CMake with add_subdirectory. Headers are in the include folder and
sources are in the src folder.
We support all C++17 compliant modern compilers (GCC, Clang, MSVC).
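As a minimal sketch of the add_subdirectory setup described above, a consuming project's CMakeLists.txt might look like the following. The checkout path `third_party/sparse_ngrams`, the project name `my_search_tool`, the source file `main.cpp`, and the library target name `sparse_ngrams` are all assumptions made for illustration; check the library's own CMakeLists.txt for the actual target name.

```cmake
cmake_minimum_required(VERSION 3.14)
project(my_search_tool CXX)

# The library requires a C++17 compliant compiler.
set(CMAKE_CXX_STANDARD 17)
set(CMAKE_CXX_STANDARD_REQUIRED ON)

# Assumed location of the sparse_ngrams checkout (for example, a git submodule).
add_subdirectory(third_party/sparse_ngrams)

# Hypothetical application that uses the library.
add_executable(my_search_tool main.cpp)

# "sparse_ngrams" is an assumed target name, not confirmed by this README.
target_link_libraries(my_search_tool PRIVATE sparse_ngrams)
```

If the library target does not export its include directory, adding `target_include_directories(my_search_tool PRIVATE third_party/sparse_ngrams/include)` should make the headers visible.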
Testing
We use the Google Benchmark and GoogleTest libraries for benchmarks and tests. From the repository root, run:
# Check out the libraries.
$ git clone https://github.com/google/benchmark.git
$ git clone https://github.com/google/googletest.git
$ mkdir build && cd build
$ cmake -DCMAKE_BUILD_TYPE=Release -DSPARSE_NGRAMS_TESTING=on -DBENCHMARK_ENABLE_GTEST_TESTS=off -DBENCHMARK_ENABLE_TESTING=off ..
$ make -j
$ ctest -j4 --output-on-failure
Documentation
TBD.
License
The code is made available under the Boost Software License 1.0.