
Collection of papers on text mining for materials science

Awesome text mining ⛏️ for materials science

A collection of papers on text mining for materials science. Note: this is a work in progress and will be updated regularly.

If you find an interesting paper and would like to add it here, please open a pull request.

Tools and codes

Plain text

  • spaCy: Fast NLP toolkit with pre-built deep learning models for tokenization, NER, POS tagging, dependency parsing, word vectors, etc.
  • textacy: Pre- and post-processing of text, used in conjunction with spaCy: text normalization, cleaning of garbage text, extraction of n-grams and entities, etc.
  • ChemDataExtractor: A full-fledged toolkit for sentence segmentation, tokenization, chemical NER, and extracting chemical information.
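
As a rough illustration of the clean-then-tokenize workflow these tools support, here is a minimal sketch. The function names `normalize_whitespace` and `tokenize` are hypothetical, and the whitespace cleanup is a standard-library stand-in for the kind of normalization textacy offers, not textacy's own API. It assumes spaCy is installed; `spacy.blank("en")` builds a tokenizer-only English pipeline, so no trained model needs to be downloaded.

```python
from __future__ import annotations

import re


def normalize_whitespace(text: str) -> str:
    # Collapse runs of spaces, tabs, and newlines into single spaces.
    # (Hypothetical helper; a stand-in for library-based normalization.)
    return re.sub(r"\s+", " ", text).strip()


def tokenize(text: str) -> list[str]:
    # Imported lazily so the helper above can be used without spaCy installed.
    import spacy

    # Blank English pipeline: rule-based tokenizer only, no tagger or NER.
    nlp = spacy.blank("en")
    return [token.text for token in nlp(text)]
```

Calling `tokenize(normalize_whitespace(raw_text))` returns a list of token strings. For NER or dependency parsing, a trained pipeline would have to be loaded instead of the blank one.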

PDF files

  • PDFMiner: A pure-Python PDF parser.
  • textract: A bundle of converters from various document formats, including PDF, to plain text.
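
A minimal sketch of pulling plain text out of a PDF, assuming the maintained `pdfminer.six` fork is installed (its `pdfminer.high_level.extract_text` function is used here; the original PDFMiner package may differ). The function names `pdf_to_text` and `join_hyphenated_line_breaks` are hypothetical, and the de-hyphenation step is an illustrative clean-up, not part of PDFMiner.

```python
from __future__ import annotations

import re


def pdf_to_text(path: str) -> str:
    # Imported lazily so the clean-up helper below works without pdfminer.six.
    from pdfminer.high_level import extract_text

    # Returns the text of all pages as one string.
    return extract_text(path)


def join_hyphenated_line_breaks(text: str) -> str:
    # Rejoin words that were split by a hyphen at the end of a line,
    # which is common in two-column journal layouts.
    return re.sub(r"(\w)-\n(\w)", r"\1\2", text)
```

Note that this simple rule also merges genuine hyphenated compounds that happen to break at a line end (for example "solid-" followed by "state"), so a dictionary check may be needed for careful work.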

OCR tools

  • tesseract: An open-source C++ OCR tool based on LSTM that supports many languages.
  • Google Cloud OCR: Highly accurate for books, but recognition accuracy may be poor for chemical/materials science symbols and equations.
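
A minimal sketch of running OCR on a scanned page by calling the `tesseract` command-line program from Python. It assumes the `tesseract` binary and the requested language data are installed and on the `PATH`; the function names `tesseract_command` and `ocr_image` are hypothetical.

```python
from __future__ import annotations

import subprocess


def tesseract_command(image_path: str, lang: str = "eng") -> list[str]:
    # Passing "stdout" as the output base makes tesseract print the
    # recognized text instead of writing it to a .txt file.
    return ["tesseract", image_path, "stdout", "-l", lang]


def ocr_image(image_path: str, lang: str = "eng") -> str:
    # check=True raises CalledProcessError if tesseract exits with an error.
    result = subprocess.run(
        tesseract_command(image_path, lang),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

Output quality on formulas, subscripts, and special symbols should be spot-checked by hand before the text is fed into any downstream pipeline.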

Image data extraction

Datasets/databases

On synthesis

NLP annotations

NLP pipelines

Named Entity Recognition

Text classification/categorization

Data analysis

Synthesis data analysis/planning

Chemical knowledge base/graph