sightseq
A vanilla CNN downsampling architecture cannot recover the spatial information of an image.
The convolutional part of the architecture acts as an encoder: it captures the image's contextual information. The architecture should therefore include a decoder part (a deconvolution layer or an RNN layer) to recover the image's spatial information.
The CNN architectures currently implemented here fall into two categories according to how much they downsample the image's width: densenet121 compresses the width to 1/8 of the original, while densenet_cifar compresses it to 1/4. As a result, the current network architectures cannot handle text images of varying width.
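To make the width compression concrete, here is a minimal PyTorch sketch (not the actual sightseq model): an encoder with three stride-2 poolings, which downsamples the input width by a factor of 8, the same factor as the densenet121 configuration described above.

```python
import torch
import torch.nn as nn

# Illustrative encoder (hypothetical, for demonstration only):
# each stride-2 max-pooling halves the feature-map width,
# so three poolings compress the width to 1/8 of the original.
encoder_8x = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),   # width / 2
    nn.Conv2d(16, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),   # width / 4
    nn.Conv2d(32, 64, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),   # width / 8
)

x = torch.randn(1, 1, 32, 128)   # (batch, channels, height, width)
features = encoder_8x(x)
print(tuple(features.shape))     # width 128 -> 16
```

Because the pooling factor is fixed at build time, a text image of a different width produces a different number of feature columns, which is why a fixed-width assumption breaks on variable-width text; a densenet_cifar-style encoder would use two poolings instead of three, giving a 1/4 factor.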