PaddleMaterials Add LSTMDoubleFit model for low-dimensional perovskite design

✅ Description

📘 Overview

This PR contributes the Feature-Guided Inverse Design (LSTMDoubleFit) model for the inverse design of organic A-site cations in low-dimensional perovskites.
The project integrates descriptor calculation, LSTM-based generative learning, and feature-constrained molecular optimization into a unified Paddle-based workflow.

This work reproduces and extends the study:

Feature-Guided Inverse Design of Organic A-Site Cations for Perovskite Dimensional Engineering, Wei-jie Wu et al., 2025.

🧠 Model Workflow

Descriptor Calculation (Cal_ATSC1pe_MATS2c.py, Cal_SlogP_VSA2.py)
- Calculates molecular descriptors (e.g., ATSC1pe, MATS2c, SlogP_VSA2) from input SMILES.
- Results are stored in CSV files under Modeldata/.
Dataset Preparation
- Before training, merge all CSV files under the Modeldata/ directory into a single dataset:
```
cat Modeldata/*.csv > Modeldata.csv
```
  The merged file Modeldata.csv will serve as the unified training dataset.
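
A note on the merge step above: if each split file carries its own header row (an assumption, since the PR does not say how the parts were written), a plain `cat` copies every header into the merged file as a data row. The sketch below is one possible alternative that keeps a single header. The function name `merge_csv_parts` is hypothetical and not part of this PR.

```python
import csv
from pathlib import Path


def merge_csv_parts(parts_dir, out_path):
    """Concatenate all CSV parts in parts_dir into out_path, keeping one header.

    Assumes every part starts with the same header row. Returns the number
    of data rows written.
    """
    # Sorted lexicographically, which matches the order of the shell glob.
    parts = sorted(Path(parts_dir).glob("*.csv"))
    header = None
    rows_written = 0
    with open(out_path, "w", newline="", encoding="utf-8") as out_file:
        writer = csv.writer(out_file)
        for part in parts:
            with open(part, newline="", encoding="utf-8") as in_file:
                reader = csv.reader(in_file)
                part_header = next(reader, None)
                if part_header is None:
                    continue  # skip empty files
                if header is None:
                    header = part_header
                    writer.writerow(header)  # write the header only once
                elif part_header != header:
                    raise ValueError(f"{part} has a different header: {part_header}")
                for row in reader:
                    writer.writerow(row)
                    rows_written += 1
    return rows_written
```

Calling `merge_csv_parts("Modeldata", "Modeldata.csv")` from the project directory would then produce the same file name as the `cat` command. If the parts have no header rows at all, the `cat` command is already sufficient.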
Model Training and Generation (Best_Seq2seq.py)
- Implements an LSTM-based sequence-to-sequence model for SMILES reconstruction and generation.
- Inputs: one-hot encoded SMILES sequences + three physicochemical descriptors.
- Outputs: property-conditioned SMILES sequences (new organic cations).
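
The input format described above can be illustrated with a small, dependency-free sketch. It is not taken from Best_Seq2seq.py: character-level tokenization, the padding symbol, and appending the three descriptors to every time step are all assumptions made for illustration, and the names `PAD`, `build_vocab`, and `encode` are hypothetical.

```python
PAD = " "  # hypothetical padding symbol; assumed not to occur in any SMILES


def build_vocab(smiles_list):
    """Map every character seen in the data (plus PAD) to an integer index."""
    chars = sorted({ch for smi in smiles_list for ch in smi})
    return {ch: i for i, ch in enumerate([PAD] + chars)}


def encode(smiles, descriptors, vocab, max_len):
    """Return a max_len x (len(vocab) + len(descriptors)) matrix as nested lists.

    Each row is the one-hot vector of one character followed by the
    descriptor values, so the same condition is visible at every step.
    """
    if len(smiles) > max_len:
        raise ValueError("SMILES longer than max_len")
    padded = smiles.ljust(max_len, PAD)
    matrix = []
    for ch in padded:
        one_hot = [0.0] * len(vocab)
        one_hot[vocab[ch]] = 1.0
        matrix.append(one_hot + list(descriptors))
    return matrix
```

For example, `encode("CO", [0.1, -0.2, 3.0], build_vocab(["CCN", "CO"]), 4)` yields a 4 x 7 matrix. Character-level splitting treats two-letter atoms such as `Cl` as two tokens, so a real tokenizer may need to handle them separately.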
Feature-Guided DoubleFit Model (MolecularDoubleFitting.py)
- Performs secondary regression to enforce property–structure consistency.
- Refines generated molecules according to target perovskite dimensional features.
Postprocessing
- Generated molecules are filtered, ranked, and optionally validated through structural optimization workflows.
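
As a rough illustration of the filtering and ranking step, the sketch below drops duplicates and molecules already present in the training set, then orders the rest by how closely their descriptors match a target. The criteria used in this PR are not specified, so the mean-squared-error score and the names `descriptor_mse` and `filter_and_rank` are assumptions.

```python
def descriptor_mse(values, targets):
    """Mean squared error between a candidate's descriptors and the targets."""
    return sum((v - t) ** 2 for v, t in zip(values, targets)) / len(targets)


def filter_and_rank(candidates, training_smiles, targets, top_k=10):
    """Filter and rank generated molecules.

    candidates: iterable of (smiles, descriptor_tuple) pairs.
    Drops duplicates and molecules already in the training set, then
    returns up to top_k (smiles, mse) pairs sorted by ascending MSE.
    """
    seen = set(training_smiles)
    scored = []
    for smiles, values in candidates:
        if smiles in seen:
            continue  # duplicate or not novel
        seen.add(smiles)
        scored.append((descriptor_mse(values, targets), smiles))
    scored.sort()  # ascending MSE; ties broken by SMILES string
    return [(smiles, mse) for mse, smiles in scored[:top_k]]
```

The comparison here is on raw strings. Canonicalizing the SMILES first (for example with RDKit) would also catch different spellings of the same molecule.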

📁 Directory Structure

project/
└── Feature-Guided Inverse Design of LDPs/
    ├── Best_Seq2seq.py            # Main LSTM model: training & molecular generation
    ├── Cal_ATSC1pe_MATS2c.py      # Descriptor calculator (ATSC1pe, MATS2c)
    ├── Cal_SlogP_VSA2.py          # Descriptor calculator (SlogP_VSA2)
    ├── MolecularDoubleFitting.py  # Feature-guided molecular fitting model
    ├── MSEcalculation.py          # Evaluation metrics
    ├── ModelandDataAnalysis.py    # Dataset statistics & analysis
    ├── Modeldata/                 # Folder containing split CSV datasets
    ├── GreatMolecular.xlsx        # High-quality generated molecules
    ├── NewMolecules.xlsx          # Newly generated candidates
    ├── README.md                  # Project documentation
    └── data_parts/                # (Optional) Split dataset parts (<100 MB each)

⚙️ How to Run

1. Environment

    pip install paddlepaddle scikit-learn pandas numpy tqdm rdkit

2. Prepare dataset

Merge CSV files in Modeldata/ into a single file:

    cat Modeldata/*.csv > Modeldata.csv

3. Train and generate molecules

    python Best_Seq2seq.py

4. Feature-guided molecular refinement

    python MolecularDoubleFitting.py

📊 Dataset Note

The full dataset (~200 MB) was split into smaller CSV files under Modeldata/ to comply with GitHub's 100 MB per-file limit.
The parts must be merged into Modeldata.csv before training, as described in the "Prepare dataset" step.

🚀 Results

- LSTM reconstruction accuracy: >95%
- Enhanced novelty and property diversity in generated cations
- Generated organic A-site cations exhibit favorable dimensional preferences for Ruddlesden-Popper (RP) and Dion-Jacobson (DJ) type perovskites.

💡 Key Contributions

- DoubleFit Learning Mechanism: Joint optimization of molecular structure and descriptor features.
- Feature-Constrained Generation: Enables directionally controlled molecular design.
- Descriptor-Integrated Workflow: Fully compatible with PaddlePaddle for training and inference.

🧑‍💻 Author

Weijie Wu
South China Normal University

Nov 13 '25 12:11 Wei-jie-Wu

Thanks for your contribution!

Nov 13 '25 12:11 paddle-bot[bot]

All committers have signed the CLA.

Nov 13 '25 12:11 CLAassistant

Thanks for your contribution! Please fetch the latest version of the repository code and update your pull request against it. We recommend using the ppmat architecture to fit your model. If you run into any problems with the adaptation, please contact us!

Nov 18 '25 02:11 leeleolay