MatText: A framework for text-based materials modeling
MatText is a framework for text-based materials modeling. It supports
- conversion of crystal structures into text representations
- transformations of crystal structures for sensitivity analyses
- decoding of text representations back into crystal structures
- tokenization of text representations of crystal structures
- pre-training, fine-tuning, and testing of language models on text representations of crystal structures
- analysis of language models trained on text representations of crystal structures
Local Installation
We recommend creating a virtual conda environment in which to install the dependencies for this package. To do so, head over to Miniconda and follow the installation instructions there.
Install development version
Clone this repository (you need git for this; if you get a "command not found" error for git, you can install it with sudo apt-get install git)
git clone https://github.com/lamalab-org/mattext.git
cd mattext
pip install -e .
If you want to use the Local Env representation, you will also need to install OpenBabel, e.g. using
conda install openbabel -c conda-forge
Getting started
Converting crystals into text
from mattext.representations import TextRep
from pymatgen.core import Structure
# Load structure from a CIF file
from_file = "InCuS2_p1.cif"
structure = Structure.from_file(from_file, "cif")
# Initialize TextRep Class
text_rep = TextRep.from_input(structure)
requested_reps = [
"cif_p1",
"slices",
"atom_sequences",
"atom_sequences_plusplus",
"crystal_text_llm",
"zmatrix"
]
# Get the requested text representations
requested_text_reps = text_rep.get_requested_text_reps(requested_reps)
Pretrain
python main.py -cn=pretrain model=pretrain_example +model.representation=composition +model.dataset_type=pretrain30k +model.context_length=32
Running a benchmark
python main.py -cn=benchmark model=benchmark_example +model.dataset_type=filtered +model.representation=composition +model.dataset=perovskites +model.checkpoint=path/to/checkpoint
The + symbol before a configuration key indicates that you are adding a new key-value pair to the configuration. This is useful when you want to specify parameters that are not part of the default configuration.
To override an existing default configuration value, use ++, e.g. ++model.pretrain.training_arguments.per_device_train_batch_size=32. Refer to the docs for more examples and advanced ways to use the configs with config groups.
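The + / ++ distinction can be sketched in plain Python. The snippet below is a toy illustration of the semantics described above, not Hydra's actual parser, and the apply_override helper is hypothetical:

```python
# Toy illustration of Hydra's "+" (add new key) vs "++" (add or override)
# override flags. Hydra's real override grammar is far more general.

def apply_override(config: dict, flag: str) -> None:
    """Apply a '+key.path=value' (add new) or '++key.path=value'
    (add or override) flag to a nested config dict."""
    force = flag.startswith("++")
    path, _, value = flag.lstrip("+").partition("=")
    *parents, leaf = path.split(".")
    node = config
    for key in parents:
        node = node.setdefault(key, {})
    if not force and leaf in node:
        raise KeyError(f"'{path}' already exists; use '++' to override it")
    node[leaf] = value

# A default config with one pre-existing training argument:
config = {
    "model": {
        "pretrain": {"training_arguments": {"per_device_train_batch_size": "16"}}
    }
}

# "+" adds a key that is not part of the default configuration:
apply_override(config, "+model.representation=composition")

# "++" overrides a value that already exists in the defaults:
apply_override(
    config,
    "++model.pretrain.training_arguments.per_device_train_batch_size=32",
)
```

Attempting a plain + on a key that already exists fails, which is why overriding defaults needs ++.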
Using data
The MatText datasets can be obtained directly from Hugging Face, for example:
from datasets import load_dataset
dataset = load_dataset("n0w0f/MatText", "pretrain300k")
👐 Contributing
Contributions, whether filing an issue, making a pull request, or forking, are appreciated. See CONTRIBUTING.md for more information on getting involved.
👋 Attribution
Citation
If you use MatText in your work, please cite
@misc{alampara2024mattextlanguagemodelsneed,
  title={MatText: Do Language Models Need More than Text \& Scale for Materials Modeling?},
  author={Nawaf Alampara and Santiago Miret and Kevin Maik Jablonka},
  year={2024},
  eprint={2406.17295},
  archivePrefix={arXiv},
  primaryClass={cond-mat.mtrl-sci},
  url={https://arxiv.org/abs/2406.17295},
}
⚖️ License
The code in this package is licensed under the MIT License.
💰 Funding
This project has been supported by the Carl Zeiss Foundation as well as Intel and Merck.