Scalable Mask Annotation for Video Text Spotting
This is the official repository of the paper Scalable Mask Annotation for Video Text Spotting.
Haibin He, Jing Zhang, Mengyang Xu, Juhua Liu, Bo Du, Dacheng Tao
News | Abstract | Method | Usage | Results | Statement
News
02/05/2023
- The paper is posted on arXiv! The code will be made publicly available once it is cleaned up.
- Relevant projects:
DPText-DETR: Towards Better Scene Text Detection with Dynamic Points in Transformer | Code
Maoyuan Ye, Jing Zhang, Shanshan Zhao, Juhua Liu, Bo Du, Dacheng Tao
DeepSolo: Let Transformer Decoder with Explicit Points Solo for Text Spotting | Code
Maoyuan Ye, Jing Zhang, Shanshan Zhao, Juhua Liu, Tongliang Liu, Bo Du, Dacheng Tao
I3CL: Intra- and Inter-Instance Collaborative Learning for Arbitrary-shaped Scene Text Detection | Code
Bo Du, Jian Ye, Jing Zhang, Juhua Liu, Dacheng Tao
Other applications of ViTAE include: ViTPose | Remote Sensing | Matting | VSA | Video Object Segmentation
Abstract
Video text spotting refers to localizing, recognizing, and tracking textual elements such as captions, logos, license plates, signs, and other forms of text across consecutive video frames. However, current datasets for this task rely on quadrilateral ground-truth annotations, which may include excessive background content and yield inaccurate text boundaries. Furthermore, methods trained on these datasets typically produce predictions in the form of quadrilateral boxes, which limits their ability to handle complex scenarios such as dense or curved text. To address these issues, we propose a scalable mask annotation pipeline called SAMText for video text spotting. SAMText leverages the SAM model to generate mask annotations for scene text images or video frames at scale. Using SAMText, we have created a large-scale dataset, SAMText-9M, that contains over 2,400 video clips sourced from existing datasets and over 9 million mask annotations. We have also conducted a thorough statistical analysis of the generated masks and their quality, identifying several research topics that could be further explored based on this dataset.
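As a rough illustration of the box-prompting idea described above, the sketch below uses the public `segment-anything` package to reduce a quadrilateral ground-truth annotation to a box prompt and obtain a text mask. It is a minimal sketch under our own assumptions (the file names, checkpoint, and example quadrilateral are illustrative), not the released SAMText code.

```python
# Minimal sketch: box-prompted mask generation with the public
# segment-anything package. Paths and the example quadrilateral are
# illustrative assumptions, not values from the SAMText release.
import cv2
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

# Load a SAM checkpoint (ViT-H here; any released checkpoint works).
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

# SAM expects an RGB uint8 image.
frame = cv2.cvtColor(cv2.imread("frame_0001.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(frame)

# A quadrilateral annotation (four x/y points) reduced to an XYXY box prompt.
quad = np.array([[120, 40], [300, 45], [298, 90], [118, 86]], dtype=np.float32)
box = np.array([quad[:, 0].min(), quad[:, 1].min(),
                quad[:, 0].max(), quad[:, 1].max()])

# Prompt SAM with the box; keep the single best-scoring mask.
masks, scores, _ = predictor.predict(box=box, multimask_output=False)
text_mask = masks[0]  # boolean HxW array covering the text instance
```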
Method
Usage
The code and dataset will be released soon.
Results
The Quality of Generated Masks
Visualization of Generated Masks
In Figure 2, we show visualization results of masks generated by the SAMText pipeline on five datasets. The top row shows the scene text frames, while the bottom row shows the generated masks. As can be seen, the generated masks contain less background and align more precisely with the text boundaries than the bounding boxes. As a result, the generated mask annotations enable more comprehensive research on this dataset, e.g., video text segmentation and video text spotting with mask annotations.
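For readers who want to reproduce this kind of overlay locally, a small hypothetical helper is sketched below; the function name, color, and blending weight are our own choices, not from the paper's code.

```python
# Hypothetical overlay helper for inspecting generated masks.
import numpy as np

def overlay_mask(frame: np.ndarray, mask: np.ndarray,
                 color=(255, 0, 0), alpha=0.5) -> np.ndarray:
    """Blend a boolean HxW mask onto an RGB uint8 frame for visualization."""
    out = frame.copy()
    out[mask] = (alpha * np.array(color) + (1 - alpha) * out[mask]).astype(np.uint8)
    return out
```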
Dataset Statistics and Analysis
- The size distribution.
- The IoU and COV distribution.
- The spatial distribution.
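The sketch below shows one plausible way to compute such per-mask statistics. The paper's exact definitions of IoU and COV are not reproduced here; we assume IoU is the mask-vs-box overlap and COV is the fraction of the box covered by the mask, both of which are assumptions.

```python
# Hedged sketch of per-mask statistics; IoU and COV definitions here
# are assumptions, not the paper's exact formulation.
import numpy as np

def mask_box_stats(mask: np.ndarray, box: np.ndarray):
    """mask: boolean HxW; box: XYXY array. Returns (area, iou, cov)."""
    x0, y0, x1, y1 = box.astype(int)
    box_mask = np.zeros_like(mask)
    box_mask[y0:y1, x0:x1] = True
    inter = np.logical_and(mask, box_mask).sum()
    union = np.logical_or(mask, box_mask).sum()
    area = int(mask.sum())                               # mask size
    iou = inter / union if union else 0.0                # mask-vs-box IoU
    cov = inter / box_mask.sum() if box_mask.sum() else 0.0  # box coverage
    return area, iou, cov
```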
Statement
This project is for research purposes only. For any other questions, please contact [email protected].
Citation
If you find SAMText helpful, please consider giving this repo a star :star: and citing:
@article{SAMText,
  title={Scalable Mask Annotation for Video Text Spotting},
  author={Haibin He and Jing Zhang and Mengyang Xu and Juhua Liu and Bo Du and Dacheng Tao},
  journal={arXiv preprint arXiv:2305.01443},
  year={2023}
}