table_extractor
Extracts tables from HTML/XML files into JSON format.
Code and data used in the paper "A Machine Learning Approach to Zeolite Synthesis Enabled by Automatic Literature Data Extraction".
There are two main components to this repository:
- table_extractor code
- zeolite synthesis data
1. Table Extraction Code
This code extracts tables from HTML/XML files into JSON format. The HTML/XML files must be supplied by the researcher. The code is written in Python 3. To run the code:
- Fork this repository
- Download the Olivetti group materials science FastText word embeddings
  - Available here:
  - Download all 4 files and place them in the tableextractor/bin folder
- Install all dependencies
  - json, pandas, spacy, bs4, gensim, numpy, unidecode, sklearn, scipy, traceback (json and traceback are part of the Python standard library and need no installation)
- Place all HTML/XML files to be processed in tableextractor/data
- Use Jupyter (Table Extractor Tutorial) to run the code
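Before opening the notebook, it can help to confirm that the dependencies are importable. The following is a minimal sketch, not part of the repository: the helper name `find_missing` is hypothetical, and the module list simply mirrors the import names given above. Note that on PyPI the `bs4` module is distributed as `beautifulsoup4` and `sklearn` as `scikit-learn`.

```python
import importlib.util

# Import names taken from the dependency list above. json and traceback ship
# with Python; the rest are third-party packages.
REQUIRED_MODULES = [
    "json", "pandas", "spacy", "bs4", "gensim", "numpy",
    "unidecode", "sklearn", "scipy", "traceback",
]


def find_missing(modules):
    """Return the module names that cannot be located in this environment."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = find_missing(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All dependencies found.")
```

The check only looks the modules up without importing them, so it runs quickly even when large packages such as spacy or gensim are installed.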
The code takes in a list of files and corresponding DOIs and returns a list of all tables extracted from the files as JSON objects. Currently, the code supports files from ACS, APS, Elsevier, Wiley, Springer, and RSC.
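To illustrate the general idea of turning an HTML table into a JSON object tagged with a DOI, here is a simplified sketch. It is not the repository's implementation or API: it uses Python's standard-library `html.parser` rather than bs4, the names `SimpleTableParser` and `extract_tables` are hypothetical, the DOI is a placeholder, and nested tables and publisher-specific markup are not handled.

```python
import json
from html.parser import HTMLParser


class SimpleTableParser(HTMLParser):
    """Collect the cell text of every <table> in a document (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.tables = []    # finished tables, each a list of rows
        self._rows = None   # rows of the table being read, or None
        self._row = None    # cells of the row being read, or None
        self._cell = None   # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._rows = []
        elif tag == "tr" and self._rows is not None:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Keep text only while inside a cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self._rows.append(self._row)
            self._row = None
            self._cell = None
        elif tag == "table" and self._rows is not None:
            self.tables.append(self._rows)
            self._rows = None
            self._row = None
            self._cell = None


def extract_tables(html, doi):
    """Return one dict per table, tagged with the DOI of the source paper."""
    parser = SimpleTableParser()
    parser.feed(html)
    parser.close()
    return [{"doi": doi, "rows": rows} for rows in parser.tables]


# Placeholder input; the DOI below is not a real paper.
sample = """
<table>
  <tr><th>Sample</th><th>Temp</th></tr>
  <tr><td>A</td><td>175</td></tr>
</table>
"""
tables = extract_tables(sample, doi="10.0000/example")
print(json.dumps(tables, indent=2))
```

Keeping the DOI alongside each table makes it possible to trace every extracted value back to its source paper.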
2. Zeolite Synthesis Data
The germanium-containing zeolite data set used in the paper is publicly available in both Excel and CSV formats. Here is a description of each feature:
doi- DOI of the paper the synthesis route comes from
Si:B- molar amount of each element/compound/molecule used in the synthesis. Amounts are normalized so that Si=1, or Ge=1 if Si=0
Time- crystallization time in hours
Temp- crystallization temperature in °C
SDA Type- name given to the organic structure directing agent (OSDA) molecule in the paper
SMILES- the SMILES representation of the OSDA molecule
SDA_Vol- the DFT calculated molar volume of the OSDA molecule in bohr^3
SDA_SA- the DFT calculated surface area of the OSDA molecule in bohr^2
SDA_KFI- the DFT calculated Kier flexibility index of the OSDA molecule
From?- the location within the paper from which the compositional information is extracted. Either Table, Text, or Supplemental
Extracted- Products of the synthesis as they appear in the paper
Zeo1- the primary zeolite (zeotype) material made in the synthesis
Zeo2- the secondary zeolite (zeotype) material made in the synthesis
Dense1- the primary dense phase made in the synthesis
Dense2- the secondary dense phase made in the synthesis
Am- whether an amorphous phase is made in (or remains after) the synthesis
Other- any other unidentified phases made in the synthesis
ITQ- whether the synthesis made a zeolite in the ITQ series
FD1- the framework density of Zeo1
MR1- the maximum ring size of Zeo1
FD2- the framework density of Zeo2
MR2- the maximum ring size of Zeo2
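The normalization rule described for the molar amount columns (Si=1, or Ge=1 if Si=0) can be sketched as follows. This is an illustration under stated assumptions rather than the script used to build the data set: the function name `normalize_amounts`, the dictionary layout, and the example composition are all hypothetical.

```python
def normalize_amounts(amounts):
    """Scale molar amounts so that Si = 1, or Ge = 1 when Si is zero."""
    si = amounts.get("Si", 0.0)
    ge = amounts.get("Ge", 0.0)
    # Use Si as the reference unless it is absent, in which case fall back to Ge.
    reference = si if si != 0 else ge
    if reference == 0:
        raise ValueError("need a nonzero Si or Ge amount to normalize")
    return {species: value / reference for species, value in amounts.items()}


# Hypothetical gel composition, not taken from the data set.
gel = {"Si": 0.5, "Ge": 0.25, "H2O": 5.0}
print(normalize_amounts(gel))
```

For a pure germanate gel such as `{"Si": 0.0, "Ge": 2.0, "H2O": 20.0}`, the same function divides by the Ge amount instead.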
If you use this code or data, please cite the following as appropriate.
Zach Jensen, Edward Kim, Soonhyoung Kwon, Terry Z. H. Gani, Yuriy Román-Leshkov, Manuel Moliner, Avelino Corma, and Elsa Olivetti. "A Machine Learning Approach to Zeolite Synthesis Enabled by Automatic Literature Data Extraction." ACS Central Science, Article ASAP. DOI: 10.1021/acscentsci.9b00193