OrigamiNet
Public implementation of our CVPR 2020 paper:
"OrigamiNet: Weakly-Supervised, Segmentation-Free, One-Step, Full Page TextRecognition by learning to unfold"
Getting Started
OrigamiNet has been implemented and tested with Python 3.6 and PyTorch 1.3. All project configuration is handled using Gin.
First clone the repo:
git clone https://github.com/IntuitionMachines/OrigamiNet.git
Then install the dependencies with:
pip install -r requirements.txt
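As background, Gin binds the parameters of functions marked configurable from a .gin file at startup. A minimal, self-contained sketch of the mechanism (illustrative only; the function and parameter names here are hypothetical, not this repo's code):

```python
# Minimal Gin sketch (hypothetical names, not this repo's actual code).
import gin

@gin.configurable
def train(batch_size=1, num_iter=1000):
    print(f'batch_size={batch_size}, num_iter={num_iter}')

# A .gin file such as iam/iam.gin holds bindings of the form:
#   train.batch_size = 8
#   train.num_iter = 500000
gin.parse_config(['train.batch_size = 8', 'train.num_iter = 500000'])
train()  # both arguments are now supplied by Gin
```

Running `train.py --gin iam/iam.gin` works the same way: the bindings in the file are parsed before the configured functions are called.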
Replicating Experiments
IAM
- Register at the FKI webpage here.
- After obtaining a username and password, use the provided script to download and set up the dataset, crop the paragraph images, and generate the corresponding paragraph transcriptions by concatenating each line's transcription (a rough sketch of this step follows this subsection). Run:
bash iam/iam.sh $IAM_USER $IAM_PASS $IAM_DEST
where $IAM_USER and $IAM_PASS are the username and password from the FKI website, and $IAM_DEST is the destination folder where the dataset will be saved (the script will create the folder if it doesn't exist).
- Run the training script with the provided configuration:
python train.py --gin iam/iam.gin
Note: if you want to use horovod, run as follows:
horovodrun -n $N_GPU -H localhost:$N_GPU python train.py --gin iam/iam.gin
where $N_GPU is the number of GPUs to use (visible GPUs can be controlled by setting CUDA_VISIBLE_DEVICES).
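For intuition, here is a rough sketch of the transcription step mentioned above (the actual logic lives in iam/iam.sh; the file layout and line separator here are assumptions):

```python
# Hypothetical sketch: build a paragraph-level ground-truth string by
# concatenating per-line transcriptions (iam/iam.sh does the real work;
# the file layout and the newline separator are assumptions).
from pathlib import Path

def paragraph_transcription(line_dir: str) -> str:
    # One .txt transcription per text line, named so that lexicographic
    # sorting yields reading order.
    line_files = sorted(Path(line_dir).glob('*.txt'))
    return '\n'.join(p.read_text().strip() for p in line_files)
```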
ICDAR2017 HTR
- Download and set up the dataset using the provided script:
bash ich17/ich.sh $ICH_DEST
where $ICH_DEST is the destination folder where the dataset will be saved (the script will create the folder if it doesn't exist).
- Run the training script with the provided configuration:
python train.py --gin ich17/ich.gin
Results
In the following tables, CER and nCER are the micro- and macro-averaged Character Error Rate, respectively. BLEU is the macro-averaged character-level BLEU score.
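Concretely, the two CER flavours differ only in where the averaging happens; a small sketch under the definitions above:

```python
# Sketch of the two CER flavours described above:
#   CER  (micro): total edit distance over all pages / total reference length.
#   nCER (macro): mean of the per-page CER values.
def edit_distance(a: str, b: str) -> int:
    # Classic Levenshtein distance via dynamic programming.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,               # delete ca
                           cur[j - 1] + 1,            # insert cb
                           prev[j - 1] + (ca != cb))) # substitute
        prev = cur
    return prev[-1]

def cer_scores(references, hypotheses):
    dists = [edit_distance(h, r) for r, h in zip(references, hypotheses)]
    micro = sum(dists) / sum(len(r) for r in references)
    macro = sum(d / len(r) for d, r in zip(dists, references)) / len(references)
    return micro, macro
```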
Paper results
| Dataset | wmul | Size      | CER (%) | nCER (%) | BLEU  |
|---------|------|-----------|---------|----------|-------|
| IAM     | 1.5  | 750x750   | 4.7     | 4.84     | 91.15 |
| ICDAR   | 1.8  | 1400x1000 | 6.80    | 5.87     | 92.67 |
Additional results
| Dataset | wmul | Size      | CER (%) | nCER (%) | BLEU  |
|---------|------|-----------|---------|----------|-------|
| IAM     | 1.0  | 750x750   | 4.85    | 4.95     | 90.87 |
| IAM     | 2.0  | 750x750   | 4.41    | 4.54     | 91.25 |
| IAM     | 3.0  | 750x750   | 4.29    | 4.41     | 91.84 |
| IAM     | 4.0  | 750x750   | 4.07    | 4.18     | 92.21 |
| ICDAR   | 2.4  | 1400x1000 | 6.01    | 5.30     | 93.64 |
These experiments were run with a `batch_size` of 8. We also obtained promising results with a `batch_size` of 4, as the proposed architecture does not use BatchNorm operations.
Synthetic hard-to-segment IAM variants
In the paper, two IAM variants with hard-to-segment text-lines were presented. These results can be replicated as follows:
Compact lines
- Make a copy of the `pargs` folder, which contains the extracted paragraph images:
cp -r iam/pargs/ iam/pargsCL
- To generate IAM with touching lines, use ImageMagick to resize the images to half their height using seam carving. The following command runs the conversion in parallel to speed up the process:
find iam/pargsCL -iname "*.png" -type f -print0 | parallel --progress -0 -j +0 "mogrify -liquid-rescale 100x50%\! {}"
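Note that `-liquid-rescale` requires an ImageMagick build with the liblqr delegate, and the command above uses GNU `parallel`; running `mogrify` on each file sequentially works as well, just more slowly.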
Rotated and warped
- Make a copy of the `pargs` folder, which contains the extracted paragraph images:
cp -r iam/pargs/ iam/pargsPW
- To generate IAM with a random projective transform and wavy text-lines (a rough sketch of this distortion follows this list), run:
find iam/pargsPW -iname "*.png" -type f -print0 | parallel --progress -0 -j +0 "python dist.py {}"
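For a rough idea of what this step produces, here is a sketch of the kind of distortion involved (an assumption about dist.py's behaviour, with made-up parameters, not the actual script):

```python
# Sketch of a random projective transform plus a sinusoidal ("wavy")
# warp (assumed behaviour of dist.py; parameters are made up).
import sys
import cv2
import numpy as np

def distort(path, max_shift=0.03, amp=5.0, period=150.0):
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    h, w = img.shape

    # Random projection: jitter the four image corners slightly.
    src = np.float32([[0, 0], [w, 0], [w, h], [0, h]])
    jitter = ((np.random.rand(4, 2) - 0.5) * 2 * max_shift
              * np.float32([w, h])).astype(np.float32)
    M = cv2.getPerspectiveTransform(src, src + jitter)
    img = cv2.warpPerspective(img, M, (w, h), borderValue=255)

    # Wavy lines: displace each column vertically along a sine wave.
    xs, ys = np.meshgrid(np.arange(w, dtype=np.float32),
                         np.arange(h, dtype=np.float32))
    ys += amp * np.sin(2 * np.pi * xs / period)
    img = cv2.remap(img, xs, ys, cv2.INTER_LINEAR, borderValue=255)

    cv2.imwrite(path, img)

if __name__ == '__main__':
    distort(sys.argv[1])  # overwrites the image in place, like the command above
```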
Results
| Dataset            | wmul | Size    | CER (%) |
|--------------------|------|---------|---------|
| Compact lines      | 1.0  | 750x750 | 6.0     |
| Rotated and warped | 1.0  | 750x750 | 5.6     |
Single line results
To make the code as useful as possible, we also show how to perform single-line recognition with it; this setup essentially resembles the GTR model. Assuming the IAM line images and their transcriptions are stored in `iam/lines/`, run:
python train.py --gin iam/iam_ln.gin
Results
Results on the IAM single-line test set
| Dataset   | nlyrs | Size   | CER (%) |
|-----------|-------|--------|---------|
| IAM lines | 12    | 32x600 | 5.26    |
| IAM lines | 18    | 32x600 | 4.84    |
| IAM lines | 24    | 32x600 | 4.76    |
Gin Options
This is a brief list of the most important Gin options. For the full config files, see `iam/iam.gin` or `ich17/ich.gin`.
- `dist`: The parallel training method. Three values are currently supported:
  - `DP` uses DataParallel
  - `DDP` uses DistributedDataParallel
  - `HVD` uses horovod
- `n_channels`: Number of channels per image
- `o_classes`: The size of the target vocabulary (i.e. the number of symbols in the alphabet)
- `GradCheck`: Whether to use gradient checkpointing:
  - `0`: disabled
  - `1`: enabled; a light version that gives good memory savings with a small slowdown
  - `2`: enabled; higher memory savings, but noticeably slower than `1`
- `get_images.max_h` and `get_images.max_w`: Target height and width for each image; images are resized to these dimensions while maintaining aspect ratio, with padding
- `train.AMP`: Whether Automatic Mixed Precision (via Nvidia apex) is enabled
- `train.WdB`: Whether Wandb logging is enabled
- `train.train_data_list` and `train.test_data_list`: Path to a file containing the list of training or testing images
- `train.train_data_path` and `train.test_data_path`: Path to the folder containing the training or testing images
- `train.train_batch_size` and `train.val_batch_size`: The batch sizes used during training and validation; the interpretation of these values varies with the `dist` option:
  - `DP`: `train.batch_size` is the total batch size
  - `DDP` or `HVD`: `train.batch_size` is the batch size per process (the total batch size is `train.batch_size * #processes`)
- `train.workers`: Number of workers for the PyTorch `DataLoader`
- `train.continue_model`: Path to a checkpoint to continue training from
- `train.num_iter`: Total number of training iterations
- `train.valInterval`: Number of batches between validation runs
- `OrigamiNet.nlyrs`: Number of layers in the GTR model
- `OrigamiNet.reduceAxis`: Final axis of reduction
- `OrigamiNet.wmul`: Channel multiplier; the number of channels in each layer is multiplied by this value
- `OrigamiNet.lszs`: Number of channels for each layer, a dictionary of the form `layer_id:channels`; unspecified layers are assumed constant
- `s1/Upsample.size`: Size of the penultimate layer
- `s2/Upsample.size`: Size of the last layer
- `OrigamiNet.lreszs`: Resampling stages in the model
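To illustrate the binding syntax for options like the ones above, here is a self-contained toy (the class below is a stand-in, not the project's actual OrigamiNet model; the `wmul` value is taken from the IAM row of the results table):

```python
# Toy illustration of Gin's binding syntax; the class is a stand-in,
# not the project's actual OrigamiNet model.
import gin

@gin.configurable('OrigamiNet')
class Net:
    def __init__(self, nlyrs=12, wmul=1.0, reduceAxis=3):
        self.nlyrs, self.wmul, self.reduceAxis = nlyrs, wmul, reduceAxis

gin.parse_config("""
OrigamiNet.nlyrs = 12
OrigamiNet.wmul = 1.5
""")
print(Net().wmul)  # -> 1.5
```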
Acknowledgements
Some code is borrowed from the deep-text-recognition-benchmark, which is under the Apache 2.0 license.
Network architecture was visualized using PlotNeuralNet.
This work was sponsored by Intuition Machines, Inc.
Citation
@inproceedings{yousef2020origaminet,
title={OrigamiNet: Weakly-Supervised, Segmentation-Free, One-Step, Full Page Text Recognition by learning to unfold},
author={Yousef, Mohamed and Bishop, Tom E.},
booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {June},
year = {2020}
}