Add LSTMDoubleFit model for low-dimensional perovskite design
Description
Overview
This PR contributes the Feature-Guided Inverse Design (LSTMDoubleFit) model for the inverse design of organic A-site cations in low-dimensional perovskites.
The project integrates descriptor calculation, LSTM-based generative learning, and feature-constrained molecular optimization into a unified Paddle-based workflow.
This work reproduces and extends the study:
Feature-Guided Inverse Design of Organic A-Site Cations for Perovskite Dimensional Engineering, Wei-jie Wu et al., 2025.
Model Workflow
- Descriptor Calculation (Cal.py)
  - Calculates molecular descriptors (e.g., ATSC1pe, MATS2c, SlogP_VSA2) from input SMILES.
  - Results are stored in CSV files under Modeldata/.
- Dataset Preparation
  - Before training, merge all CSV files under the Modeldata/ directory into a single dataset:
    cat Modeldata/*.csv > Modeldata.csv
  - The merged file Modeldata.csv serves as the unified training dataset.
- Model Training and Generation (Best_Seq2seq.py)
  - Implements an LSTM-based sequence-to-sequence model for SMILES reconstruction and generation.
  - Inputs: one-hot encoded SMILES sequences + three physicochemical descriptors.
  - Outputs: property-conditioned SMILES sequences (new organic cations).
- Feature-Guided DoubleFit Model (MolecularDoubleFitting.py)
  - Performs a secondary regression to enforce property-structure consistency.
  - Refines generated molecules according to target perovskite dimensional features.
- Postprocessing
  - Generated molecules are filtered, ranked, and optionally validated through structural optimization workflows.
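The model input described above (one-hot encoded SMILES plus three descriptors) can be illustrated with a small sketch. This is not the code in Best_Seq2seq.py: the function names build_char_vocab and encode_smiles are hypothetical, character-level tokenization is an assumption (it splits two-letter atoms such as Cl and Br), and appending the descriptor values to every time step is only one possible way to condition the sequence.

```python
def build_char_vocab(smiles_list):
    """Map every character seen in the SMILES strings to an index.

    Index 0 is reserved for padding (an assumption of this sketch).
    """
    chars = sorted({ch for smiles in smiles_list for ch in smiles})
    return {ch: i + 1 for i, ch in enumerate(chars)}


def encode_smiles(smiles, descriptors, vocab, max_len):
    """Return max_len rows: a one-hot character vector followed by the descriptors."""
    if len(smiles) > max_len:
        raise ValueError("SMILES string is longer than max_len")
    width = len(vocab) + 1  # +1 for the padding slot at index 0
    rows = []
    for pos in range(max_len):
        one_hot = [0.0] * width
        # Positions past the end of the string are marked as padding.
        index = vocab[smiles[pos]] if pos < len(smiles) else 0
        one_hot[index] = 1.0
        # Hypothetical conditioning: repeat the descriptor values at each step.
        rows.append(one_hot + list(descriptors))
    return rows
```

With a vocabulary of V characters and three descriptors, each molecule becomes a max_len x (V + 1 + 3) matrix, which could then be converted to a tensor for an LSTM encoder.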
Directory Structure
project/
└── Feature-Guided Inverse Design of LDPs/
    ├── Best_Seq2seq.py             # Main LSTM model: training & molecular generation
    ├── Cal_ATSC1pe_MATS2c.py       # Descriptor calculator (ATSC1pe, MATS2c)
    ├── Cal_SlogP_VSA2.py           # Descriptor calculator (SlogP_VSA2)
    ├── MolecularDoubleFitting.py   # Feature-guided molecular fitting model
    ├── MSEcalculation.py           # Evaluation metrics
    ├── ModelandDataAnalysis.py     # Dataset statistics & analysis
    ├── Modeldata/                  # Folder containing split CSV datasets
    ├── GreatMolecular.xlsx         # High-quality generated molecules
    ├── NewMolecules.xlsx           # Newly generated candidates
    ├── README.md                   # Project documentation
    └── data_parts/                 # (Optional) Split dataset parts (<100 MB each)
How to Run
1. Environment
pip install paddlepaddle scikit-learn pandas numpy tqdm rdkit
2. Prepare dataset
Merge CSV files in Modeldata/ into a single file:
cat Modeldata/*.csv > Modeldata.csv
3. Train and generate molecules
python Best_Seq2seq.py
4. Feature-guided molecular refinement
python MolecularDoubleFitting.py
Dataset Note
The full dataset (~200 MB) is split into smaller CSV files under Modeldata/
to comply with GitHub's 100 MB per-file limit.
The parts must be merged before training, as described above.
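One caveat with plain cat: it copies every file verbatim, so if each part begins with its own header row, that header is repeated in the middle of the merged file. Whether the parts carry headers is not stated here, so the following is only a sketch of a merge that tolerates both cases. The function name merge_csv_parts is hypothetical; it uses only the Python standard library and assumes that sorted filename order is the intended part order.

```python
import csv
import glob
import os


def merge_csv_parts(parts_dir, out_path):
    """Concatenate the CSV parts in parts_dir, writing the header row once.

    Returns the number of data rows written (the header is not counted).
    """
    part_paths = sorted(glob.glob(os.path.join(parts_dir, "*.csv")))
    if not part_paths:
        raise FileNotFoundError(f"no CSV files found in {parts_dir!r}")

    header = None
    data_rows = 0
    with open(out_path, "w", newline="", encoding="utf-8") as out_file:
        writer = csv.writer(out_file)
        for path in part_paths:
            with open(path, newline="", encoding="utf-8") as part_file:
                reader = csv.reader(part_file)
                first_row = next(reader, None)
                if first_row is None:
                    continue  # empty part, nothing to copy
                if header is None:
                    # The first row of the first part is taken as the header.
                    header = first_row
                    writer.writerow(header)
                elif first_row != header:
                    # This part has no repeated header: its first row is data.
                    writer.writerow(first_row)
                    data_rows += 1
                for row in reader:
                    writer.writerow(row)
                    data_rows += 1
    return data_rows
```

Calling merge_csv_parts("Modeldata", "Modeldata.csv") from the project directory would play the same role as the cat command; the output file sits outside Modeldata/, so it is not picked up as an input part on a later run.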
Results
LSTM reconstruction accuracy: >95%
Enhanced novelty and property diversity in generated cations
Generated organic A-site cations exhibit favorable dimensional preferences for RP- and DJ-type perovskites.
Key Contributions
DoubleFit Learning Mechanism: Joint optimization of molecular structure and descriptor features.
Feature-Constrained Generation: Enables directionally controlled molecular design.
Descriptor-Integrated Workflow: Fully compatible with PaddlePaddle for training and inference.
Author
Weijie Wu
South China Normal University
Thanks for your contribution! Please fetch the latest version of the repository code and update your pull request against it. We recommend using the ppmat architecture to fit your model. If you run into any problems with the adaptation, please contact us!