nlp
Selected machine learning algorithms for natural language processing and semantic analysis in Go
Natural Language Processing
Implementations of selected machine learning algorithms for natural language processing in Go. The primary focus of the package is the statistical semantics of plain-text documents, supporting semantic analysis and retrieval of semantically similar documents.
Built upon the Gonum package for linear algebra and scientific computing with some inspiration taken from Python's scikit-learn and Gensim.
Check out the companion blog post or the Go documentation page for full usage and examples.
Features
- LSA (Latent Semantic Analysis aka Latent Semantic Indexing (LSI)) implementation using truncated SVD (Singular Value Decomposition) for dimensionality reduction.
- Fast comparison and retrieval of semantically similar documents using the SimHash (random hyperplanes/sign random projection) algorithm, with multi-index and Forest schemes for LSH (Locality Sensitive Hashing), supporting fast, approximate cosine similarity/angular distance comparisons and approximate nearest-neighbour search using significantly less memory and processing time.
- Random Indexing (RI) and Reflective Random Indexing (RRI) (which extends RI to support indirect inference) for scalable Latent Semantic Analysis (LSA) over large, web-scale corpora.
- Latent Dirichlet Allocation (LDA) using a parallelised implementation of the fast SCVB0 (Stochastic Collapsed Variational Bayesian inference) algorithm for unsupervised topic extraction.
- PCA (Principal Component Analysis)
- TF-IDF weighting to account for frequently occurring words
- Sparse matrix implementations used for more efficient memory usage and processing over large document corpora.
- Stop word removal to remove frequently occurring English words e.g. "the", "and"
- Feature hashing ('the hashing trick') implementation (using MurmurHash3) for reduced memory requirements and reduced reliance on training data
- Similarity/distance measures to calculate the similarity/distance between feature vectors.
Planned
- Expanded persistence support
- Stemming to treat words with a common root as the same e.g. "go" and "going"
- Clustering algorithms e.g. hierarchical, K-means, etc.
- Classification algorithms e.g. SVM, KNN, random forest, etc.
References
- Rosario, Barbara. Latent Semantic Indexing: An overview. INFOSYS 240 Spring 2000
- Latent Semantic Analysis, a scholarpedia article on LSA written by Tom Landauer, one of the creators of LSA.
- Thomo, Alex. Latent Semantic Analysis (Tutorial).
- Latent Semantic Indexing. Stanford NLP Course
- Charikar, Moses S. "Similarity Estimation Techniques from Rounding Algorithms" in Proceedings of the thirty-fourth annual ACM symposium on Theory of computing - STOC '02, 2002, p. 380.
- M. Bawa, T. Condie, and P. Ganesan, “LSH forest: self-tuning indexes for similarity search,” Proc. 14th Int. Conf. World Wide Web - WWW ’05, p. 651, 2005.
- A. Gionis, P. Indyk, and R. Motwani, “Similarity Search in High Dimensions via Hashing,” VLDB ’99 Proc. 25th Int. Conf. Very Large Data Bases, vol. 99, no. 1, pp. 518–529, 1999.
- Kanerva, Pentti, Kristoferson, Jan and Holst, Anders (2000). Random Indexing of Text Samples for Latent Semantic Analysis
- Rangan, Venkat. Discovery of Related Terms in a corpus using Reflective Random Indexing
- Vasuki, Vidya and Cohen, Trevor. Reflective random indexing for semi-automatic indexing of the biomedical literature
- QasemiZadeh, Behrang and Handschuh, Siegfried. Random Indexing Explained with High Probability
- Foulds, James; Boyles, Levi; Dubois, Christopher; Smyth, Padhraic; Welling, Max (2013). Stochastic Collapsed Variational Bayesian Inference for Latent Dirichlet Allocation