awesome-text-ml

A curated list of awesome ML frameworks & libraries for text data

Awesome software for Text ML

A curated list of awesome ML frameworks and text embeddings, focused on state-of-the-art (SOTA) libraries that are actively maintained on GitHub.

Frameworks and libraries

🐍 Python

Text processing

  • HanLP - Natural Language Processing for the next decade. Tokenization, Part-of-Speech Tagging, Named Entity Recognition, Syntactic & Semantic Dependency Parsing, Document Classification via one unified interface. https://bbs.hankcs.com/

  • flair - A powerful library for state-of-the-art natural language processing (NLP) models, covering named entity recognition (NER), part-of-speech tagging (PoS), sense disambiguation and classification, with special support for biomedical data.

  • sentencepiece - Unsupervised text tokenizer for Neural Network-based text generation.

  • stanza - Official Stanford NLP Python Library for Many Human Languages. https://stanfordnlp.github.io/stanza/

Pipelines / block-programming

  • texthero - Text preprocessing, representation and visualization from zero to hero. https://texthero.org/

Distributed computing

  • spark-nlp - Spark NLP is a state-of-the-art Natural Language Processing library built on top of Apache Spark. https://nlp.johnsnowlabs.com/

Machine Learning

  • sklearn - Scikit-learn is a Python module for machine learning built on top of SciPy, including tools for text vectorization and vector space compression. https://scikit-learn.org/stable/

  • gensim - Gensim is a Python library for topic modelling, document indexing and similarity retrieval with large corpora. Target audience is the natural language processing (NLP) and information retrieval (IR) community. https://radimrehurek.com/gensim/

  • nlpaug - Data augmentation for NLP in your machine learning projects.

  • AugLy - A data augmentation library from Facebook Research for audio, image, text, and video.

Deep Learning

  • Transformers - State-of-the-art Natural Language Processing for TensorFlow 2.0 and PyTorch. https://huggingface.co/transformers

  • fairseq - Facebook AI Research Sequence-to-Sequence Toolkit written in Python. https://fairseq.readthedocs.io/en/latest/

  • bert-as-service - Mapping a variable-length sentence to a fixed-length vector using BERT model. https://bert-as-service.readthedocs.io

  • Kashgari - Kashgari is a production-ready NLP transfer learning framework for text labeling and text classification. It includes Word2Vec, BERT, and GPT-2 language embeddings.

Natural Language Understanding

  • Snips NLU - Snips Python library to extract meaning from text. https://snips-nlu.readthedocs.io

  • IKY - A Python chatbot framework with Natural Language Understanding and Artificial Intelligence.

  • rasa - Framework to automate text- and voice-based conversations: NLU, dialogue management, chatbots. https://rasa.com/docs/rasa/

  • ParlAI - A framework for training and evaluating AI models on a variety of openly available dialogue datasets. https://parl.ai/

  • DeepPavlov - An open source library for deep learning end-to-end dialog systems and chatbots. https://deeppavlov.ai/

  • Rhino - On-device speech-to-intent engine powered by deep learning. https://picovoice.ai/

  • langchain - Building applications with LLMs (large language models) through composability. https://langchain.readthedocs.io

  • NeMo - A toolkit for conversational AI. https://nvidia.github.io/NeMo/

Text mining

  • dedupe - A Python library for accurate and scalable fuzzy matching, record deduplication and entity resolution.

Visualizations

  • Scattertext - Beautiful visualizations of how language differs among document types.

Big language models

  • BIG-bench - Beyond the Imitation Game collaborative benchmark for measuring and extrapolating the capabilities of language models.

C++

Text processing

Currently empty 🪹

Knowledge 📚

Learning 101

  • Virgilio - Virgilio is an open-source initiative that aims to mentor and guide anyone in the world of Data Science.

Multiple languages

Python (and Python Notebooks)

  • practicalAI - A practical approach to machine learning to enable everyone to learn, explore and build. https://practicalai.me

  • nlp-recipes - Comprehensive set of tools and examples that leverage recent advances in NLP algorithms, neural architectures, and distributed machine learning systems.

No longer maintained

  • NeuronBlocks - NLP DNN Toolkit - Building Your NLP DNN Models Like Playing Lego.

  • artificial-adversary - Tool to generate adversarial text examples and test machine learning models against them.

  • DELTA - DELTA is a deep learning based natural language and speech processing platform. https://delta-didi.readthedocs.io/

  • EventForecast - Time series prediction and text analysis using Keras LSTM, plus clustering and association rule mining.

  • lazynlp - Library to scrape and clean web pages to create massive datasets.

  • MeTA: ModErn Text Analysis - A Modern C++ Data Sciences Toolkit. https://meta-toolkit.org