arxiv-daily
arxiv-daily copied to clipboard
New submissions for Fri, 16 Sep 22
Keyword: human object interaction
There is no result
Keyword: visual relation detection
There is no result
Keyword: object detection
A novel illumination condition varied image dataset-Food Vision Dataset (FVD) for fair and reliable consumer acceptability predictions from food
- Authors: Swarna Sethu (1), Dongyi Wang (1 and 2) ((1) Department of Biological & Agricultural engineering, University of Arkansas, Fayetteville, (2) Department of Food & Science and Department of Biological & Agricultural engineering, University of Arkansas, Fayetteville)
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2209.06967
- Pdf link: https://arxiv.org/pdf/2209.06967
- Abstract Recent advances in artificial intelligence promote a wide range of computer vision applications in many different domains. Digital cameras, acting as human eyes, can perceive fundamental object properties, such as shapes and colors, and can be further used for conducting high-level tasks, such as image classification, and object detections. Human perceptions have been widely recognized as the ground truth for training and evaluating computer vision models. However, in some cases, humans can be deceived by what they have seen. Well-functioned human vision relies on stable external lighting while unnatural illumination would influence human perception of essential characteristics of goods. To evaluate the illumination effects on human and computer perceptions, the group presents a novel dataset, the Food Vision Dataset (FVD), to create an evaluation benchmark to quantify illumination effects, and to push forward developments of illumination estimation methods for fair and reliable consumer acceptability prediction from food appearances. FVD consists of 675 images captured under 3 different power and 5 different temperature settings every alternate day for five such days.
Efficient Perception, Planning, and Control Algorithms for Vision-Based Automated Vehicles
- Authors: Der-Hau Lee
- Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2209.07042
- Pdf link: https://arxiv.org/pdf/2209.07042
- Abstract Owing to resource limitations, efficient computation systems have long been a critical demand for those designing autonomous vehicles. Additionally, sensor cost and size restrict the development of self-driving cars. This paper presents an efficient framework for the operation of vision-based automatic vehicles; a front-facing camera and a few inexpensive radars are the required sensors for driving environment perception. The proposed algorithm comprises a multi-task UNet (MTUNet) network for extracting image features and constrained iterative linear quadratic regulator (CILQR) modules for rapid lateral and longitudinal motion planning. The MTUNet is designed to simultaneously solve lane line segmentation, ego vehicle heading angle regression, road type classification, and traffic object detection tasks at an approximate speed of 40 FPS when an RGB image of size 228 x 228 is fed into it. The CILQR algorithms then take processed MTUNet outputs and radar data as their input to produce driving commands for lateral and longitudinal vehicle automation guidance; both optimal control problems can be solved within 1 ms. The proposed CILQR controllers are shown to be more efficient than the sequential quadratic programming (SQP) methods and can collaborate with the MTUNet to drive a car autonomously in unseen simulation environments for lane-keeping and car-following maneuvers. Our experiments demonstrate that the proposed autonomous driving system is applicable to modern automobiles.
PROB-SLAM: Real-time Visual SLAM Based on Probabilistic Graph Optimization
- Authors: Xianwei Meng, Bonian Li
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2209.07061
- Pdf link: https://arxiv.org/pdf/2209.07061
- Abstract Traditional SLAM algorithms are typically based on artificial features, which lack high-level information. By introducing semantic information, SLAM can own higher stability and robustness rather than purely hand-crafted features. However, the high uncertainty of semantic detection networks prohibits the practical functionality of high-level information. To solve the uncertainty property introduced by semantics, this paper proposed a novel probability map based on the Gaussian distribution assumption. This map transforms the semantic binary object detection into probability results, which help establish a probabilistic data association between artificial features and semantic info. Through our algorithm, the higher confidence will be given higher weights in each update step while the edge of the detection area will be endowed with lower confidence. Then the uncertainty is undermined and has less effect on nonlinear optimization. The experiments are carried out in the TUM RGBD dataset, results show that our system improves ORB-SLAM2 by about 15% in indoor environments' errors. We have demonstrated that the method can be successfully applied to environments containing dynamic objects.
FFPA-Net: Efficient Feature Fusion with Projection Awareness for 3D Object Detection
- Authors: Chaokang Jiang, Guangming Wang, Jinxing Wu, Yanzi Miao, Hesheng Wang
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2209.07419
- Pdf link: https://arxiv.org/pdf/2209.07419
- Abstract Promising complementarity exists between the texture features of color images and the geometric information of LiDAR point clouds. However, there still present many challenges for efficient and robust feature fusion in the field of 3D object detection. In this paper, first, unstructured 3D point clouds are filled in the 2D plane and 3D point cloud features are extracted faster using projection-aware convolution layers. Further, the corresponding indexes between different sensor signals are established in advance in the data preprocessing, which enables faster cross-modal feature fusion. To address LiDAR points and image pixels misalignment problems, two new plug-and-play fusion modules, LiCamFuse and BiLiCamFuse, are proposed. In LiCamFuse, soft query weights with perceiving the Euclidean distance of bimodal features are proposed. In BiLiCamFuse, the fusion module with dual attention is proposed to deeply correlate the geometric and textural features of the scene. The quantitative results on the KITTI dataset demonstrate that the proposed method achieves better feature-level fusion. In addition, the proposed network shows a shorter running time compared to existing methods.
Keyword: transformer
On the interplay of adversarial robustness and architecture components: patches, convolution and attention
- Authors: Francesco Croce, Matthias Hein
- Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2209.06953
- Pdf link: https://arxiv.org/pdf/2209.06953
- Abstract In recent years novel architecture components for image classification have been developed, starting with attention and patches used in transformers. While prior works have analyzed the influence of some aspects of architecture components on the robustness to adversarial attacks, in particular for vision transformers, the understanding of the main factors is still limited. We compare several (non)-robust classifiers with different architectures and study their properties, including the effect of adversarial training on the interpretability of the learnt features and robustness to unseen threat models. An ablation from ResNet to ConvNeXt reveals key architectural changes leading to almost $10%$ higher $\ell_\infty$-robustness.
Efficient Quantized Sparse Matrix Operations on Tensor Cores
- Authors: Shigang Li, Kazuki Osawa, Torsten Hoefler
- Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2209.06979
- Pdf link: https://arxiv.org/pdf/2209.06979
- Abstract The exponentially growing model size drives the continued success of deep learning, but it brings prohibitive computation and memory cost. From the algorithm perspective, model sparsification and quantization have been studied to alleviate the problem. From the architecture perspective, hardware vendors provide Tensor cores for acceleration. However, it is very challenging to gain practical speedups from sparse, low-precision matrix operations on Tensor cores, because of the strict requirements for data layout and lack of support for efficiently manipulating the low-precision integers. We propose Magicube, a high-performance sparse-matrix library for low-precision integers on Tensor cores. Magicube supports SpMM and SDDMM, two major sparse operations in deep learning with mixed precision. Experimental results on an NVIDIA A100 GPU show that Magicube achieves on average 1.44x (up to 2.37x) speedup over the vendor-optimized library for sparse kernels, and 1.43x speedup over the state-of-the-art with a comparable accuracy for end-to-end sparse Transformer inference.
PriorLane: A Prior Knowledge Enhanced Lane Detection Approach Based on Transformer
- Authors: Qibo Qiu, Haiming Gao, Wei Hua, Gang Huang, Xiaofei He
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2209.06994
- Pdf link: https://arxiv.org/pdf/2209.06994
- Abstract Lane detection is one of the fundamental modules in self-driving. In this paper we employ a transformer-only method for lane detection, thus it could benefit from the blooming development of fully vision transformer and achieves the state-of-the-art (SOTA) performance on both CULane and TuSimple benchmarks, by fine-tuning the weight fully pre-trained on large datasets. More importantly, this paper proposes a novel and general framework called PriorLane, which is used to enhance the segmentation performance of the fully vision transformer by introducing the low-cost local prior knowledge. PriorLane utilizes an encoder-only transformer to fuse the feature extracted by a pre-trained segmentation model with prior knowledge embeddings. Note that a Knowledge Embedding Alignment (KEA) module is adapted to enhance the fusion performance by aligning the knowledge embedding. Extensive experiments on our Zjlab dataset show that Prior-Lane outperforms SOTA lane detection methods by a 2.82% mIoU, and the code will be released at: https://github. com/vincentqqb/PriorLane.
Can We Solve 3D Vision Tasks Starting from A 2D Vision Transformer?
- Authors: Yi Wang, Zhiwen Fan, Tianlong Chen, Hehe Fan, Zhangyang Wang
- Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2209.07026
- Pdf link: https://arxiv.org/pdf/2209.07026
- Abstract Vision Transformers (ViTs) have proven to be effective, in solving 2D image understanding tasks by training over large-scale image datasets; and meanwhile as a somehow separate track, in modeling the 3D visual world too such as voxels or point clouds. However, with the growing hope that transformers can become the "universal" modeling tool for heterogeneous data, ViTs for 2D and 3D tasks have so far adopted vastly different architecture designs that are hardly transferable. That invites an (over-)ambitious question: can we close the gap between the 2D and 3D ViT architectures? As a piloting study, this paper demonstrates the appealing promise to understand the 3D visual world, using a standard 2D ViT architecture, with only minimal customization at the input and output levels without redesigning the pipeline. To build a 3D ViT from its 2D sibling, we "inflate" the patch embedding and token sequence, accompanied with new positional encoding mechanisms designed to match the 3D data geometry. The resultant "minimalist" 3D ViT, named Simple3D-Former, performs surprisingly robustly on popular 3D tasks such as object classification, point cloud segmentation and indoor scene detection, compared to highly customized 3D-specific designs. It can hence act as a strong baseline for new 3D ViTs. Moreover, we note that pursing a unified 2D-3D ViT design has practical relevance besides just scientific curiosity. Specifically, we demonstrate that Simple3D-Former naturally enables to exploit the wealth of pre-trained weights from large-scale realistic 2D images (e.g., ImageNet), which can be plugged in to enhancing the 3D task performance "for free".
AutoUpdate: Automatically Recommend Code Updates for Android Apps
- Authors: Yue Liu, Chakkrit Tantithamthavorn, Yonghui Liu, Patanamon Thongtanunam, Li Li
- Subjects: Software Engineering (cs.SE)
- Arxiv link: https://arxiv.org/abs/2209.07048
- Pdf link: https://arxiv.org/pdf/2209.07048
- Abstract Android developers frequently update source code to improve the performance, security, or maintainability of Android apps. Such Android code updating activities are intuitively repetitive, manual, and time-consuming. In this paper, we propose AutoUpdate, a Transformer-based automated code update recommendation approach for Android Apps, which takes advantage of code abstraction (Abs) and Byte-Pair Encoding (BPE) techniques to represent source code. Since this is the first work to automatically update code in Android apps, we collect a history of 209,346 updated method pairs from 3,195 real-world Android applications available on Google Play stores that span 14 years (2008-2022). Through an extensive experiment on our curated datasets, the results show that AutoUpdate(1) achieves a perfect prediction of 25% based on the realistic time-wise evaluation scenario, which outperforms the two baseline approaches; (2) gains benefits at least 17% of improvement by using both Abs and BPE; (3) is able to recommend code updates for various purposes (e.g., fixing bugs, adding new feature, refactoring methods). On the other hand, the models (4) could produce optimistically high accuracy due to the unrealistic evaluation scenario (i.e., random splits), suggesting that researchers should consider time-wise evaluation scenarios in the future; (5) are less accurate for a larger size of methods with a larger number of changed tokens, providing a research opportunity for future work. Our findings demonstrate the significant advancement of NMT-based code update recommendation approaches for Android apps.
Multi-Modal Masked Autoencoders for Medical Vision-and-Language Pre-Training
- Authors: Zhihong Chen, Yuhao Du, Jinpeng Hu, Yang Liu, Guanbin Li, Xiang Wan, Tsung-Hui Chang
- Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
- Arxiv link: https://arxiv.org/abs/2209.07098
- Pdf link: https://arxiv.org/pdf/2209.07098
- Abstract Medical vision-and-language pre-training provides a feasible solution to extract effective vision-and-language representations from medical images and texts. However, few studies have been dedicated to this field to facilitate medical vision-and-language understanding. In this paper, we propose a self-supervised learning paradigm with multi-modal masked autoencoders (M$^3$AE), which learn cross-modal domain knowledge by reconstructing missing pixels and tokens from randomly masked images and texts. There are three key designs to make this simple approach work. First, considering the different information densities of vision and language, we adopt different masking ratios for the input image and text, where a considerably larger masking ratio is used for images. Second, we use visual and textual features from different layers to perform the reconstruction to deal with different levels of abstraction in visual and language. Third, we develop different designs for vision and language decoders (i.e., a Transformer for vision and a multi-layer perceptron for language). To perform a comprehensive evaluation and facilitate further research, we construct a medical vision-and-language benchmark including three tasks. Experimental results demonstrate the effectiveness of our approach, where state-of-the-art results are achieved on all downstream tasks. Besides, we conduct further analysis to better verify the effectiveness of different components of our approach and various settings of pre-training. The source code is available at~\url{https://github.com/zhjohnchan/M3AE}.
BadRes: Reveal the Backdoors through Residual Connection
- Authors: Mingrui He, Tianyu Chen, Haoyi Zhou, Shanghang Zhang, Jianxin Li
- Subjects: Cryptography and Security (cs.CR)
- Arxiv link: https://arxiv.org/abs/2209.07125
- Pdf link: https://arxiv.org/pdf/2209.07125
- Abstract Generally, residual connections are indispensable network components in building CNNs and Transformers for various downstream tasks in CV and VL, which encourages skip shortcuts between network blocks. However, the layer-by-layer loopback residual connections may also hurt the model's robustness by allowing unsuspecting input. In this paper, we proposed a simple yet strong backdoor attack method - BadRes, where the residual connections play as a turnstile to be deterministic on clean inputs while unpredictable on poisoned ones. We have performed empirical evaluations on four datasets with ViT and BEiT models, and the BadRes achieves 97% attack success rate while receiving zero performance degradation on clean data. Moreover, we analyze BadRes with state-of-the-art defense methods and reveal the fundamental weakness lying in residual connections.
Beat Transformer: Demixed Beat and Downbeat Tracking with Dilated Self-Attention
- Authors: Jingwei Zhao, Gus Xia, Ye Wang
- Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
- Arxiv link: https://arxiv.org/abs/2209.07140
- Pdf link: https://arxiv.org/pdf/2209.07140
- Abstract We propose Beat Transformer, a novel Transformer encoder architecture for joint beat and downbeat tracking. Different from previous models that track beats solely based on the spectrogram of an audio mixture, our model deals with demixed spectrograms with multiple instrument channels. This is inspired by the fact that humans perceive metrical structures from richer musical contexts, such as chord progression and instrumentation. To this end, we develop a Transformer model with both time-wise attention and instrument-wise attention to capture deep-buried metrical cues. Moreover, our model adopts a novel dilated self-attention mechanism, which achieves powerful hierarchical modelling with only linear complexity. Experiments demonstrate a significant improvement in demixed beat tracking over the non-demixed version. Also, Beat Transformer achieves up to 4% point improvement in downbeat tracking accuracy over the TCN architectures. We further discover an interpretable attention pattern that mirrors our understanding of hierarchical metrical structures.
HARP: Autoregressive Latent Video Prediction with High-Fidelity Image Generator
- Authors: Younggyo Seo, Kimin Lee, Fangchen Liu, Stephen James, Pieter Abbeel
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2209.07143
- Pdf link: https://arxiv.org/pdf/2209.07143
- Abstract Video prediction is an important yet challenging problem; burdened with the tasks of generating future frames and learning environment dynamics. Recently, autoregressive latent video models have proved to be a powerful video prediction tool, by separating the video prediction into two sub-problems: pre-training an image generator model, followed by learning an autoregressive prediction model in the latent space of the image generator. However, successfully generating high-fidelity and high-resolution videos has yet to be seen. In this work, we investigate how to train an autoregressive latent video prediction model capable of predicting high-fidelity future frames with minimal modification to existing models, and produce high-resolution (256x256) videos. Specifically, we scale up prior models by employing a high-fidelity image generator (VQ-GAN) with a causal transformer model, and introduce additional techniques of top-k sampling and data augmentation to further improve video prediction quality. Despite the simplicity, the proposed method achieves competitive performance to state-of-the-art approaches on standard video prediction benchmarks with fewer parameters, and enables high-resolution video prediction on complex and large-scale datasets. Videos are available at https://sites.google.com/view/harp-videos/home.
Number of Attention Heads vs Number of Transformer-Encoders in Computer Vision
- Authors: Tomas Hrycej, Bernhard Bermeitinger, Siegfried Handschuh
- Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2209.07221
- Pdf link: https://arxiv.org/pdf/2209.07221
- Abstract Determining an appropriate number of attention heads on one hand and the number of transformer-encoders, on the other hand, is an important choice for Computer Vision (CV) tasks using the Transformer architecture. Computing experiments confirmed the expectation that the total number of parameters has to satisfy the condition of overdetermination (i.e., number of constraints significantly exceeding the number of parameters). Then, good generalization performance can be expected. This sets the boundaries within which the number of heads and the number of transformers can be chosen. If the role of context in images to be classified can be assumed to be small, it is favorable to use multiple transformers with a low number of heads (such as one or two). In classifying objects whose class may heavily depend on the context within the image (i.e., the meaning of a patch being dependent on other patches), the number of heads is equally important as that of transformers.
ÚFAL CorPipe at CRAC 2022: Effectivity of Multilingual Models for Coreference Resolution
- Authors: Milan Straka, Jana Straková
- Subjects: Computation and Language (cs.CL)
- Arxiv link: https://arxiv.org/abs/2209.07278
- Pdf link: https://arxiv.org/pdf/2209.07278
- Abstract We describe the winning submission to the CRAC 2022 Shared Task on Multilingual Coreference Resolution. Our system first solves mention detection and then coreference linking on the retrieved spans with an antecedent-maximization approach, and both tasks are fine-tuned jointly with shared Transformer weights. We report results of fine-tuning a wide range of pretrained models. The center of this contribution are fine-tuned multilingual models. We found one large multilingual model with sufficiently large encoder to increase performance on all datasets across the board, with the benefit not limited only to the underrepresented languages or groups of typologically relative languages. The source code is available at https://github.com/ufal/crac2022-corpipe.
A Light Recipe to Train Robust Vision Transformers
- Authors: Edoardo Debenedetti, Vikash Sehwag, Prateek Mittal
- Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2209.07399
- Pdf link: https://arxiv.org/pdf/2209.07399
- Abstract In this paper, we ask whether Vision Transformers (ViTs) can serve as an underlying architecture for improving the adversarial robustness of machine learning models against evasion attacks. While earlier works have focused on improving Convolutional Neural Networks, we show that also ViTs are highly suitable for adversarial training to achieve competitive performance. We achieve this objective using a custom adversarial training recipe, discovered using rigorous ablation studies on a subset of the ImageNet dataset. The canonical training recipe for ViTs recommends strong data augmentation, in part to compensate for the lack of vision inductive bias of attention modules, when compared to convolutions. We show that this recipe achieves suboptimal performance when used for adversarial training. In contrast, we find that omitting all heavy data augmentation, and adding some additional bag-of-tricks ($\varepsilon$-warmup and larger weight decay), significantly boosts the performance of robust ViTs. We show that our recipe generalizes to different classes of ViT architectures and large-scale models on full ImageNet-1k. Additionally, investigating the reasons for the robustness of our models, we show that it is easier to generate strong attacks during training when using our recipe and that this leads to better robustness at test time. Finally, we further study one consequence of adversarial training by proposing a way to quantify the semantic nature of adversarial perturbations and highlight its correlation with the robustness of the model. Overall, we recommend that the community should avoid translating the canonical training recipes in ViTs to robust training and rethink common training choices in the context of adversarial training.
Examining Large Pre-Trained Language Models for Machine Translation: What You Don't Know About It
- Authors: Lifeng Han, Gleb Erofeev, Irina Sorokina, Serge Gladkoff, Goran Nenadic
- Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2209.07417
- Pdf link: https://arxiv.org/pdf/2209.07417
- Abstract Pre-trained language models (PLMs) often take advantage of the monolingual and multilingual dataset that is freely available online to acquire general or mixed domain knowledge before deployment into specific tasks. Extra-large PLMs (xLPLMs) are proposed very recently to claim supreme performances over smaller-sized PLMs such as in machine translation (MT) tasks. These xLPLMs include Meta-AI's wmt21-dense-24-wide-en-X and NLLB. \textit{In this work, we examine if xLPLMs are absolutely superior to smaller-sized PLMs in fine-tuning toward domain-specific MTs.} We use two different in-domain data of different sizes: commercial automotive in-house data and \textbf{clinical} shared task data from the ClinSpEn2022 challenge at WMT2022. We choose popular Marian Helsinki as smaller sized PLM and two massive-sized Mega-Transformers from Meta-AI as xLPLMs. Our experimental investigation shows that 1) on smaller sized in-domain commercial automotive data, xLPLM wmt21-dense-24-wide-en-X indeed shows much better evaluation scores using S\textsc{acre}BLEU and hLEPOR metrics than smaller-sized Marian, even though its score increase rate is lower than Marian after fine-tuning; 2) on relatively larger-size well prepared clinical data fine-tuning, the xLPLM NLLB \textbf{tends to lose} its advantage over smaller-sized Marian on two sub-tasks (clinical terms and ontology concepts) using ClinSpEn offered metrics METEOR, COMET, and ROUGE-L, and totally lost to Marian on Task-1 (clinical cases) on all metrics including S\textsc{acre}BLEU and BLEU; 3) \textbf{metrics do not always agree} with each other on the same tasks using the same model outputs.
On the Surprising Effectiveness of Transformers in Low-Labeled Video Recognition
- Authors: Farrukh Rahman, Ömer Mubarek, Zsolt Kira
- Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2209.07474
- Pdf link: https://arxiv.org/pdf/2209.07474
- Abstract Recently vision transformers have been shown to be competitive with convolution-based methods (CNNs) broadly across multiple vision tasks. The less restrictive inductive bias of transformers endows greater representational capacity in comparison with CNNs. However, in the image classification setting this flexibility comes with a trade-off with respect to sample efficiency, where transformers require ImageNet-scale training. This notion has carried over to video where transformers have not yet been explored for video classification in the low-labeled or semi-supervised settings. Our work empirically explores the low data regime for video classification and discovers that, surprisingly, transformers perform extremely well in the low-labeled video setting compared to CNNs. We specifically evaluate video vision transformers across two contrasting video datasets (Kinetics-400 and SomethingSomething-V2) and perform thorough analysis and ablation studies to explain this observation using the predominant features of video transformer architectures. We even show that using just the labeled data, transformers significantly outperform complex semi-supervised CNN methods that leverage large-scale unlabeled data as well. Our experiments inform our recommendation that semi-supervised learning video work should consider the use of video transformers in the future.
Hydra Attention: Efficient Attention with Many Heads
- Authors: Daniel Bolya, Cheng-Yang Fu, Xiaoliang Dai, Peizhao Zhang, Judy Hoffman
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2209.07484
- Pdf link: https://arxiv.org/pdf/2209.07484
- Abstract While transformers have begun to dominate many tasks in vision, applying them to large images is still computationally difficult. A large reason for this is that self-attention scales quadratically with the number of tokens, which in turn, scales quadratically with the image size. On larger images (e.g., 1080p), over 60% of the total computation in the network is spent solely on creating and applying attention matrices. We take a step toward solving this issue by introducing Hydra Attention, an extremely efficient attention operation for Vision Transformers (ViTs). Paradoxically, this efficiency comes from taking multi-head attention to its extreme: by using as many attention heads as there are features, Hydra Attention is computationally linear in both tokens and features with no hidden constants, making it significantly faster than standard self-attention in an off-the-shelf ViT-B/16 by a factor of the token count. Moreover, Hydra Attention retains high accuracy on ImageNet and, in some cases, actually improves it.
Medical Image Segmentation using LeViT-UNet++: A Case Study on GI Tract Data
- Authors: Praneeth Nemani, Satyanarayana Vollala
- Subjects: Neural and Evolutionary Computing (cs.NE)
- Arxiv link: https://arxiv.org/abs/2209.07515
- Pdf link: https://arxiv.org/pdf/2209.07515
- Abstract Gastro-Intestinal Tract cancer is considered a fatal malignant condition of the organs in the GI tract. Due to its fatality, there is an urgent need for medical image segmentation techniques to segment organs to reduce the treatment time and enhance the treatment. Traditional segmentation techniques rely upon handcrafted features and are computationally expensive and inefficient. Vision Transformers have gained immense popularity in many image classification and segmentation tasks. To address this problem from a transformers' perspective, we introduced a hybrid CNN-transformer architecture to segment the different organs from an image. The proposed solution is robust, scalable, and computationally efficient, with a Dice and Jaccard coefficient of 0.79 and 0.72, respectively. The proposed solution also depicts the essence of deep learning-based automation to improve the effectiveness of the treatment
OmniVL:One Foundation Model for Image-Language and Video-Language Tasks
- Authors: Junke Wang, Dongdong Chen, Zuxuan Wu, Chong Luo, Luowei Zhou, Yucheng Zhao, Yujia Xie, Ce Liu, Yu-Gang Jiang, Lu Yuan
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2209.07526
- Pdf link: https://arxiv.org/pdf/2209.07526
- Abstract This paper presents OmniVL, a new foundation model to support both image-language and video-language tasks using one universal architecture. It adopts a unified transformer-based visual encoder for both image and video inputs, and thus can perform joint image-language and video-language pretraining. We demonstrate, for the first time, such a paradigm benefits both image and video tasks, as opposed to the conventional one-directional transfer (e.g., use image-language to help video-language). To this end, we propose a decoupled joint pretraining of image-language and video-language to effectively decompose the vision-language modeling into spatial and temporal dimensions and obtain performance boost on both image and video tasks. Moreover, we introduce a novel unified vision-language contrastive (UniVLC) loss to leverage image-text, video-text, image-label (e.g., image classification), video-label (e.g., video action recognition) data together, so that both supervised and noisily supervised pretraining data are utilized as much as possible. Without incurring extra task-specific adaptors, OmniVL can simultaneously support visual only tasks (e.g., image classification, video action recognition), cross-modal alignment tasks (e.g., image/video-text retrieval), and multi-modal understanding and generation tasks (e.g., image/video question answering, captioning). We evaluate OmniVL on a wide range of downstream tasks and achieve state-of-the-art or competitive results with similar model size and data scale.
Keyword: scene understanding
Dynamic X-Ray Vision in Mixed Reality
- Authors: Hung-Jui Guo, Jonathan Z. Bakdash, Laura R. Marusich, Balakrishnan Prabhakaran
- Subjects: Human-Computer Interaction (cs.HC)
- Arxiv link: https://arxiv.org/abs/2209.07025
- Pdf link: https://arxiv.org/pdf/2209.07025
- Abstract X-ray vision, a technique that allows users to see through walls and other obstacles, is a popular technique for Augmented Reality (AR) and Mixed Reality (MR). In this paper, we demonstrate a dynamic X-ray vision window that is rendered in real-time based on the user's current position and changes with movement in the physical environment. Moreover, the location and transparency of the window are also dynamically rendered based on the user's eye gaze. We build this X-ray vision window for a current state-of-the-art MR Head-Mounted Device (HMD) -- HoloLens 2 by integrating several different features: scene understanding, eye tracking, and clipping primitive.
Keyword: visual reasoning
VIPHY: Probing "Visible" Physical Commonsense Knowledge
- Authors: Shikhar Singh, Ehsan Qasemi, Muhao Chen
- Subjects: Computation and Language (cs.CL)
- Arxiv link: https://arxiv.org/abs/2209.07000
- Pdf link: https://arxiv.org/pdf/2209.07000
- Abstract In recent years, vision-language models (VLMs) have shown remarkable performance on visual reasoning tasks (e.g. attributes, location). While such tasks measure the requisite knowledge to ground and reason over a given visual instance, they do not, however, measure the ability of VLMs to retain and generalize such knowledge. In this work, we evaluate their ability to acquire "visible" physical knowledge -- the information that is easily accessible from images of static scenes, particularly across the dimensions of object color, size and space. We build an automatic pipeline to derive a comprehensive knowledge resource for calibrating and probing these models. Our results indicate a severe gap between model and human performance across all three tasks. Furthermore, our caption pretrained baseline (CapBERT) significantly outperforms VLMs on both size and spatial tasks -- highlighting that despite sufficient access to ground language with visual modality, they struggle to retain such knowledge. The dataset and code are available at https://github.com/Axe--/ViPhy .