arxiv-daily icon indicating copy to clipboard operation
arxiv-daily copied to clipboard

New submissions for Mon, 12 Sep 22

Open DongZhouGu opened this issue 2 years ago • 0 comments

Keyword: human object interaction

There is no result

Keyword: visual relation detection

There is no result

Keyword: object detection

There is no result

Keyword: transformer

Q-learning Decision Transformer: Leveraging Dynamic Programming for Conditional Sequence Modelling in Offline RL

  • Authors: Taku Yamagata, Ahmed Khalil, Raul Santos-Rodriguez (Intelligent System Laboratory, University of Bristol)
  • Subjects: Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2209.03993
  • Pdf link: https://arxiv.org/pdf/2209.03993
  • Abstract Recent works have shown that tackling offline reinforcement learning (RL) with a conditional policy produces promising results by converting the RL task to a supervised learning task. Decision Transformer (DT) combines the conditional policy approach and Transformer architecture to show competitive performance against several benchmarks. However, DT lacks stitching ability -- one of the critical abilities for offline RL that learns the optimal policy from sub-optimal trajectories. The issue becomes significant when the offline dataset only contains sub-optimal trajectories. On the other hand, the conventional RL approaches based on Dynamic Programming (such as Q-learning) do not suffer the same issue; however, they suffer from unstable learning behaviours, especially when it employs function approximation in an off-policy learning setting. In this paper, we propose Q-learning Decision Transformer (QDT) that addresses the shortcomings of DT by leveraging the benefit of Dynamic Programming (Q-learning). QDT utilises the Dynamic Programming (Q-learning) results to relabel the return-to-go in the training data. We then train the DT with the relabelled data. Our approach efficiently exploits the benefits of these two approaches and compensates for each other's shortcomings to achieve better performance. We demonstrate the issue of DT and the advantage of QDT in a simple environment. We also evaluate QDT in the more complex D4RL benchmark showing good performance gains.

Multilingual Transformer Language Model for Speech Recognition in Low-resource Languages

  • Authors: Li Miao, Jian Wu, Piyush Behre, Shuangyu Chang, Sarangarajan Parthasarathy
  • Subjects: Computation and Language (cs.CL)
  • Arxiv link: https://arxiv.org/abs/2209.04041
  • Pdf link: https://arxiv.org/pdf/2209.04041
  • Abstract It is challenging to train and deploy Transformer LMs for hybrid speech recognition 2nd pass re-ranking in low-resource languages due to (1) data scarcity in low-resource languages, (2) expensive computing costs for training and refreshing 100+ monolingual models, and (3) hosting inefficiency considering sparse traffic. In this study, we present a new way to group multiple low-resource locales together and optimize the performance of Multilingual Transformer LMs in ASR. Our Locale-group Multilingual Transformer LMs outperform traditional multilingual LMs along with reducing maintenance costs and operating expenses. Further, for low-resource but high-traffic locales where deploying monolingual models is feasible, we show that fine-tuning our locale-group multilingual LMs produces better monolingual LM candidates than baseline monolingual LMs.

TEACH: Temporal Action Composition for 3D Humans

  • Authors: Nikos Athanasiou, Mathis Petrovich, Michael J. Black, Gül Varol
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2209.04066
  • Pdf link: https://arxiv.org/pdf/2209.04066
  • Abstract Given a series of natural language descriptions, our task is to generate 3D human motions that correspond semantically to the text, and follow the temporal order of the instructions. In particular, our goal is to enable the synthesis of a series of actions, which we refer to as temporal action composition. The current state of the art in text-conditioned motion synthesis only takes a single action or a single sentence as input. This is partially due to lack of suitable training data containing action sequences, but also due to the computational complexity of their non-autoregressive model formulation, which does not scale well to long sequences. In this work, we address both issues. First, we exploit the recent BABEL motion-text collection, which has a wide range of labeled actions, many of which occur in a sequence with transitions between them. Next, we design a Transformer-based approach that operates non-autoregressively within an action, but autoregressively within the sequence of actions. This hierarchical formulation proves effective in our experiments when compared with multiple baselines. Our approach, called TEACH for "TEmporal Action Compositions for Human motions", produces realistic human motions for a wide variety of actions and temporal compositions from language descriptions. To encourage work on this new task, we make our code available for research purposes at $\href{teach.is.tue.mpg.de}{\textrm{our website}}$.

Adapting to Non-Centered Languages for Zero-shot Multilingual Translation

  • Authors: Zhi Qu, Taro Watanabe
  • Subjects: Computation and Language (cs.CL)
  • Arxiv link: https://arxiv.org/abs/2209.04138
  • Pdf link: https://arxiv.org/pdf/2209.04138
  • Abstract Multilingual neural machine translation can translate unseen language pairs during training, i.e. zero-shot translation. However, the zero-shot translation is always unstable. Although prior works attributed the instability to the domination of central language, e.g. English, we supplement this viewpoint with the strict dependence of non-centered languages. In this work, we propose a simple, lightweight yet effective language-specific modeling method by adapting to non-centered languages and combining the shared information and the language-specific information to counteract the instability of zero-shot translation. Experiments with Transformer on IWSLT17, Europarl, TED talks, and OPUS-100 datasets show that our method not only performs better than strong baselines in centered data conditions but also can easily fit non-centered data conditions. By further investigating the layer attribution, we show that our proposed method can disentangle the coupled representation in the correct direction.

EchoCoTr: Estimation of the Left Ventricular Ejection Fraction from Spatiotemporal Echocardiography

  • Authors: Rand Muhtaseb, Mohammad Yaqub
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2209.04242
  • Pdf link: https://arxiv.org/pdf/2209.04242
  • Abstract Learning spatiotemporal features is an important task for efficient video understanding especially in medical images such as echocardiograms. Convolutional neural networks (CNNs) and more recent vision transformers (ViTs) are the most commonly used methods with limitations per each. CNNs are good at capturing local context but fail to learn global information across video frames. On the other hand, vision transformers can incorporate global details and long sequences but are computationally expensive and typically require more data to train. In this paper, we propose a method that addresses the limitations we typically face when training on medical video data such as echocardiographic scans. The algorithm we propose (EchoCoTr) utilizes the strength of vision transformers and CNNs to tackle the problem of estimating the left ventricular ejection fraction (LVEF) on ultrasound videos. We demonstrate how the proposed method outperforms state-of-the-art work to-date on the EchoNet-Dynamic dataset with MAE of 3.95 and $R^2$ of 0.82. These results show noticeable improvement compared to all published research. In addition, we show extensive ablations and comparisons with several algorithms, including ViT and BERT. The code is available at https://github.com/BioMedIA-MBZUAI/EchoCoTr.

Pre-training image-language transformers for open-vocabulary tasks

  • Authors: AJ Piergiovanni, Weicheng Kuo, Anelia Angelova
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2209.04372
  • Pdf link: https://arxiv.org/pdf/2209.04372
  • Abstract We present a pre-training approach for vision and language transformer models, which is based on a mixture of diverse tasks. We explore both the use of image-text captioning data in pre-training, which does not need additional supervision, as well as object-aware strategies to pre-train the model. We evaluate the method on a number of textgenerative vision+language tasks, such as Visual Question Answering, visual entailment and captioning, and demonstrate large gains over standard pre-training methods.

Improved Masked Image Generation with Token-Critic

  • Authors: José Lezama, Huiwen Chang, Lu Jiang, Irfan Essa
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2209.04439
  • Pdf link: https://arxiv.org/pdf/2209.04439
  • Abstract Non-autoregressive generative transformers recently demonstrated impressive image generation performance, and orders of magnitude faster sampling than their autoregressive counterparts. However, optimal parallel sampling from the true joint distribution of visual tokens remains an open challenge. In this paper we introduce Token-Critic, an auxiliary model to guide the sampling of a non-autoregressive generative transformer. Given a masked-and-reconstructed real image, the Token-Critic model is trained to distinguish which visual tokens belong to the original image and which were sampled by the generative transformer. During non-autoregressive iterative sampling, Token-Critic is used to select which tokens to accept and which to reject and resample. Coupled with Token-Critic, a state-of-the-art generative transformer significantly improves its performance, and outperforms recent diffusion models and GANs in terms of the trade-off between generated image quality and diversity, in the challenging class-conditional ImageNet generation.

Keyword: scene understanding

MassMIND: Massachusetts Maritime INfrared Dataset

  • Authors: Shailesh Nirgudkar, Michael DeFilippo, Michael Sacarny, Michael Benjamin, Paul Robinette
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
  • Arxiv link: https://arxiv.org/abs/2209.04097
  • Pdf link: https://arxiv.org/pdf/2209.04097
  • Abstract Recent advances in deep learning technology have triggered radical progress in the autonomy of ground vehicles. Marine coastal Autonomous Surface Vehicles (ASVs) that are regularly used for surveillance, monitoring and other routine tasks can benefit from this autonomy. Long haul deep sea transportation activities are additional opportunities. These two use cases present very different terrains -- the first being coastal waters -- with many obstacles, structures and human presence while the latter is mostly devoid of such obstacles. Variations in environmental conditions are common to both terrains. Robust labeled datasets mapping such terrains are crucial in improving the situational awareness that can drive autonomy. However, there are only limited such maritime datasets available and these primarily consist of optical images. Although, Long Wave Infrared (LWIR) is a strong complement to the optical spectrum that helps in extreme light conditions, a labeled public dataset with LWIR images does not currently exist. In this paper, we fill this gap by presenting a labeled dataset of over 2,900 LWIR segmented images captured in coastal maritime environment under diverse conditions. The images are labeled using instance segmentation and classified in seven categories -- sky, water, obstacle, living obstacle, bridge, self and background. We also evaluate this dataset across three deep learning architectures (UNet, PSPNet, DeepLabv3) and provide detailed analysis of its efficacy. While the dataset focuses on the coastal terrain it can equally help deep sea use cases. Such terrain would have less traffic, and the classifier trained on cluttered environment would be able to handle sparse scenes effectively. We share this dataset with the research community with the hope that it spurs new scene understanding capabilities in the maritime environment.

Keyword: visual reasoning

There is no result

DongZhouGu avatar Sep 12 '22 04:09 DongZhouGu