awesome-text-ml
Awesome software for Text ML
A curated list of awesome ML frameworks and text embeddings. It focuses on state-of-the-art (SOTA) libraries that are actively maintained on GitHub.
Frameworks and libraries
:snake: Python
Text processing
- HanLP - Natural Language Processing for the next decade. Tokenization, Part-of-Speech Tagging, Named Entity Recognition, Syntactic & Semantic Dependency Parsing, Document Classification via one unified interface. https://bbs.hankcs.com/
- flair - A powerful library for state-of-the-art natural language processing (NLP) models, covering named entity recognition (NER), part-of-speech (PoS) tagging, sense disambiguation and classification, with special support for biomedical data.
- sentencepiece - Unsupervised text tokenizer for Neural Network-based text generation.
- stanza - Official Stanford NLP Python Library for Many Human Languages. https://stanfordnlp.github.io/stanza/
Pipelines / block-programming
- texthero - Text preprocessing, representation and visualization from zero to hero. https://texthero.org/
Distributed computing
- spark-nlp - Spark NLP is a state-of-the-art Natural Language Processing library built on top of Apache Spark. https://nlp.johnsnowlabs.com/
Machine Learning
- sklearn - Scikit-learn is a Python module for machine learning built on top of SciPy, including tools for text vectorization and vector space compression. https://scikit-learn.org/stable/
- gensim - Gensim is a Python library for topic modelling, document indexing and similarity retrieval with large corpora. Target audience is the natural language processing (NLP) and information retrieval (IR) community. https://radimrehurek.com/gensim/
- nlpaug - Data augmentation for NLP in your machine learning projects.
- AugLy - A data augmentation library from Facebook Research for audio, image, text, and video.
Deep Learning
- Transformers - State-of-the-art Natural Language Processing for TensorFlow 2.0 and PyTorch. https://huggingface.co/transformers
- fairseq - Facebook AI Research Sequence-to-Sequence Toolkit written in Python. https://fairseq.readthedocs.io/en/latest/
- bert-as-service - Mapping a variable-length sentence to a fixed-length vector using a BERT model. https://bert-as-service.readthedocs.io
- Kashgari - Kashgari is a production-ready NLP transfer learning framework for text labeling and text classification. It includes Word2Vec, BERT, and GPT2 language embeddings.
Natural Language Understanding
- Snips NLU - Snips Python library to extract meaning from text. https://snips-nlu.readthedocs.io
- IKY - A Python chatbot framework with Natural Language Understanding and Artificial Intelligence.
- rasa - Framework to automate text- and voice-based conversations: NLU, dialogue management, chatbots. https://rasa.com/docs/rasa/
- ParlAI - A framework for training and evaluating AI models on a variety of openly available dialogue datasets. https://parl.ai/
- DeepPavlov - An open source library for deep learning end-to-end dialog systems and chatbots. https://deeppavlov.ai/
- Rhino - On-device speech-to-intent engine powered by deep learning. https://picovoice.ai/
- langchain - Building applications with LLMs (large language models) through composability. https://langchain.readthedocs.io
- NeMo - A toolkit for conversational AI. https://nvidia.github.io/NeMo/
Text mining
- dedupe - A Python library for accurate and scalable fuzzy matching, record deduplication and entity resolution.
Visualizations
- Scattertext - Beautiful visualizations of how language differs among document types.
Big language models
- BIG-bench - Beyond the Imitation Game collaborative benchmark for measuring and extrapolating the capabilities of language models.
C++
Text processing
Currently empty 🪹
Knowledge
Learning 101
- Virgilio - Virgilio is an open-source initiative that aims to mentor and guide anyone in the world of Data Science.
Multiple languages
- Awesome Sentiment Analysis - Repository with everything necessary for sentiment analysis and related areas.
Python (and Python Notebooks)
- practicalAI - A practical approach to machine learning to enable everyone to learn, explore and build. https://practicalai.me
- nlp-recipes - Comprehensive set of tools and examples that leverage recent advances in NLP algorithms, neural architectures, and distributed machine learning systems.
No longer maintained
- NeuronBlocks - NLP DNN Toolkit - Building Your NLP DNN Models Like Playing Lego.
- artificial-adversary - Tool to generate adversarial text examples and test machine learning models against them.
- DELTA - DELTA is a deep learning based natural language and speech processing platform. https://delta-didi.readthedocs.io/
- EventForecast - Time series prediction and text analysis using Keras LSTM, plus clustering, association rules mining.
- lazynlp - Library to scrape and clean web pages to create massive datasets.
- MeTA: ModErn Text Analysis - A Modern C++ Data Sciences Toolkit. https://meta-toolkit.org