arxiv-daily
arxiv-daily copied to clipboard
New submissions for Mon, 18 Dec 23
Keyword: human object interaction
There is no result
Keyword: visual relation detection
There is no result
Keyword: object detection
Integration of Robotics, Computer Vision, and Algorithm Design: A Chinese Poker Self-Playing Robot
- Authors: Kuan-Huang Yu
- Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)
- Arxiv link: https://arxiv.org/abs/2312.09455
- Pdf link: https://arxiv.org/pdf/2312.09455
- Abstract This paper presents Chinese Poker Self-Playing Robot, an integrated system enabling a TM5-900 robotic arm to independently play the four-person card game Chinese poker. The robot uses a custom sucker mechanism to pick up and play cards. An object detection model based on YOLOv5 is utilized to recognize the suit and number of 13 cards dealt to the robot. A greedy algorithm is developed to divide the 13 cards into optimal hands of 3, 5, and 5 cards to play. Experiments demonstrate that the robot can successfully obtain the cards, identify them using computer vision, strategically select hands to play using the algorithm, and physically play the selected cards in the game. The system showcases effective integration of mechanical design, computer vision, algorithm design, and robotic control to accomplish the complex task of independently playing cards.
SlowTrack: Increasing the Latency of Camera-based Perception in Autonomous Driving Using Adversarial Examples
- Authors: Chen Ma, Ningfei Wang, Qi Alfred Chen, Chao Shen
- Subjects: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
- Arxiv link: https://arxiv.org/abs/2312.09520
- Pdf link: https://arxiv.org/pdf/2312.09520
- Abstract In Autonomous Driving (AD), real-time perception is a critical component responsible for detecting surrounding objects to ensure safe driving. While researchers have extensively explored the integrity of AD perception due to its safety and security implications, the aspect of availability (real-time performance) or latency has received limited attention. Existing works on latency-based attack have focused mainly on object detection, i.e., a component in camera-based AD perception, overlooking the entire camera-based AD perception, which hinders them to achieve effective system-level effects, such as vehicle crashes. In this paper, we propose SlowTrack, a novel framework for generating adversarial attacks to increase the execution time of camera-based AD perception. We propose a novel two-stage attack strategy along with the three new loss function designs. Our evaluation is conducted on four popular camera-based AD perception pipelines, and the results demonstrate that SlowTrack significantly outperforms existing latency-based attacks while maintaining comparable imperceptibility levels. Furthermore, we perform the evaluation on Baidu Apollo, an industry-grade full-stack AD system, and LGSVL, a production-grade AD simulator, with two scenarios to compare the system-level effects of SlowTrack and existing attacks. Our evaluation results show that the system-level effects can be significantly improved, i.e., the vehicle crash rate of SlowTrack is around 95% on average while existing works only have around 30%.
Semantic-Aware Transformation-Invariant RoI Align
- Authors: Guo-Ye Yang, George Kiyohiro Nakayama, Zi-Kai Xiao, Tai-Jiang Mu, Xiaolei Huang, Shi-Min Hu
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2312.09609
- Pdf link: https://arxiv.org/pdf/2312.09609
- Abstract Great progress has been made in learning-based object detection methods in the last decade. Two-stage detectors often have higher detection accuracy than one-stage detectors, due to the use of region of interest (RoI) feature extractors which extract transformation-invariant RoI features for different RoI proposals, making refinement of bounding boxes and prediction of object categories more robust and accurate. However, previous RoI feature extractors can only extract invariant features under limited transformations. In this paper, we propose a novel RoI feature extractor, termed Semantic RoI Align (SRA), which is capable of extracting invariant RoI features under a variety of transformations for two-stage detectors. Specifically, we propose a semantic attention module to adaptively determine different sampling areas by leveraging the global and local semantic relationship within the RoI. We also propose a Dynamic Feature Sampler which dynamically samples features based on the RoI aspect ratio to enhance the efficiency of SRA, and a new position embedding, \ie Area Embedding, to provide more accurate position information for SRA through an improved sampling area representation. Experiments show that our model significantly outperforms baseline models with slight computational overhead. In addition, it shows excellent generalization ability and can be used to improve performance with various state-of-the-art backbones and detection methods.
3DAxiesPrompts: Unleashing the 3D Spatial Task Capabilities of GPT-4V
- Authors: Dingning Liu, Xiaomeng Dong, Renrui Zhang, Xu Luo, Peng Gao, Xiaoshui Huang, Yongshun Gong, Zhihui Wang
- Subjects: Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2312.09738
- Pdf link: https://arxiv.org/pdf/2312.09738
- Abstract In this work, we present a new visual prompting method called 3DAxiesPrompts (3DAP) to unleash the capabilities of GPT-4V in performing 3D spatial tasks. Our investigation reveals that while GPT-4V exhibits proficiency in discerning the position and interrelations of 2D entities through current visual prompting techniques, its abilities in handling 3D spatial tasks have yet to be explored. In our approach, we create a 3D coordinate system tailored to 3D imagery, complete with annotated scale information. By presenting images infused with the 3DAP visual prompt as inputs, we empower GPT-4V to ascertain the spatial positioning information of the given 3D target image with a high degree of precision. Through experiments, We identified three tasks that could be stably completed using the 3DAP method, namely, 2D to 3D Point Reconstruction, 2D to 3D point matching, and 3D Object Detection. We perform experiments on our proposed dataset 3DAP-Data, the results from these experiments validate the efficacy of 3DAP-enhanced GPT-4V inputs, marking a significant stride in 3D spatial task execution.
End-to-End Training of Neural Networks for Automotive Radar Interference Mitigation
- Authors: Christian Oswald, Mate Toth, Paul Meissner, Franz Pernkopf
- Subjects: Machine Learning (cs.LG); Signal Processing (eess.SP)
- Arxiv link: https://arxiv.org/abs/2312.09790
- Pdf link: https://arxiv.org/pdf/2312.09790
- Abstract In this paper we propose a new method for training neural networks (NNs) for frequency modulated continuous wave (FMCW) radar mutual interference mitigation. Instead of training NNs to regress from interfered to clean radar signals as in previous work, we train NNs directly on object detection maps. We do so by performing a continuous relaxation of the cell-averaging constant false alarm rate (CA-CFAR) peak detector, which is a well-established algorithm for object detection using radar. With this new training objective we are able to increase object detection performance by a large margin. Furthermore, we introduce separable convolution kernels to strongly reduce the number of parameters and computational complexity of convolutional NN architectures for radar applications. We validate our contributions with experiments on real-world measurement data and compare them against signal processing interference mitigation methods.
SeeThruFinger: See and Grasp Anything with a Soft Touch
- Authors: Fang Wan, Chaoyang Song
- Subjects: Robotics (cs.RO)
- Arxiv link: https://arxiv.org/abs/2312.09822
- Pdf link: https://arxiv.org/pdf/2312.09822
- Abstract We present SeeThruFinger, a soft robotic finger with an in-finger vision for multi-modal perception, including visual perception and tactile sensing, for geometrically adaptive and real-time reactive grasping. Multi-modal perception of intrinsic and extrinsic interactions is critical in building intelligent robots that learn. Instead of adding various sensors for different modalities, a preferred solution is to integrate them into one elegant and coherent design, which is a challenging task. This study leverages the Soft Polyhedral Network design as a robotic finger, capable of omni-directional adaptation with an unobstructed view of the finger's spatial deformation from the inside. By embedding a miniature camera underneath, we achieve the visual perception of the external environment by inpainting the finger mask using E2FGV, which can be used for object detection in the downstream tasks for grasping. After contacting the objects, we use real-time object segmentation algorithms, such as XMem, to track the soft finger's spatial deformations. We also learned a Supervised Variational Autoencoder to enable tactile sensing of 6D forces and torques for reactive grasp. As a result, we achieved multi-modal perception, including visual perception and tactile sensing, and soft, adaptive object grasping within a single vision-based soft finger design compatible with multi-fingered robotic grippers.
Keyword: transformer
Acoustic models of Brazilian Portuguese Speech based on Neural Transformers
- Authors: Marcelo Matheus Gauy, Marcelo Finger
- Subjects: Sound (cs.SD); Computation and Language (cs.CL); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
- Arxiv link: https://arxiv.org/abs/2312.09265
- Pdf link: https://arxiv.org/pdf/2312.09265
- Abstract An acoustic model, trained on a significant amount of unlabeled data, consists of a self-supervised learned speech representation useful for solving downstream tasks, perhaps after a fine-tuning of the model in the respective downstream task. In this work, we build an acoustic model of Brazilian Portuguese Speech through a Transformer neural network. This model was pretrained on more than $800$ hours of Brazilian Portuguese Speech, using a combination of pretraining techniques. Using a labeled dataset collected for the detection of respiratory insufficiency in Brazilian Portuguese speakers, we fine-tune the pretrained Transformer neural network on the following tasks: respiratory insufficiency detection, gender recognition and age group classification. We compare the performance of pretrained Transformers on these tasks with that of Transformers without previous pretraining, noting a significant improvement. In particular, the performance of respiratory insufficiency detection obtains the best reported results so far, indicating this kind of acoustic model as a promising tool for speech-as-biomarker approach. Moreover, the performance of gender recognition is comparable to the state of the art models in English.
Weight subcloning: direct initialization of transformers using larger pretrained ones
- Authors: Mohammad Samragh, Mehrdad Farajtabar, Sachin Mehta, Raviteja Vemulapalli, Fartash Faghri, Devang Naik, Oncel Tuzel, Mohammad Rastegari
- Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2312.09299
- Pdf link: https://arxiv.org/pdf/2312.09299
- Abstract Training large transformer models from scratch for a target task requires lots of data and is computationally demanding. The usual practice of transfer learning overcomes this challenge by initializing the model with weights of a pretrained model of the same size and specification to increase the convergence and training speed. However, what if no pretrained model of the required size is available? In this paper, we introduce a simple yet effective technique to transfer the knowledge of a pretrained model to smaller variants. Our approach called weight subcloning expedites the training of scaled-down transformers by initializing their weights from larger pretrained models. Weight subcloning involves an operation on the pretrained model to obtain the equivalent initialized scaled-down model. It consists of two key steps: first, we introduce neuron importance ranking to decrease the embedding dimension per layer in the pretrained model. Then, we remove blocks from the transformer model to match the number of layers in the scaled-down network. The result is a network ready to undergo training, which gains significant improvements in training speed compared to random initialization. For instance, we achieve 4x faster training for vision transformers in image classification and language models designed for next token prediction.
LLM-MARS: Large Language Model for Behavior Tree Generation and NLP-enhanced Dialogue in Multi-Agent Robot Systems
- Authors: Artem Lykov, Maria Dronova, Nikolay Naglov, Mikhail Litvinov, Sergei Satsevich, Artem Bazhenov, Vladimir Berman, Aleksei Shcherbak, Dzmitry Tsetserukou
- Subjects: Robotics (cs.RO)
- Arxiv link: https://arxiv.org/abs/2312.09348
- Pdf link: https://arxiv.org/pdf/2312.09348
- Abstract This paper introduces LLM-MARS, first technology that utilizes a Large Language Model based Artificial Intelligence for Multi-Agent Robot Systems. LLM-MARS enables dynamic dialogues between humans and robots, allowing the latter to generate behavior based on operator commands and provide informative answers to questions about their actions. LLM-MARS is built on a transformer-based Large Language Model, fine-tuned from the Falcon 7B model. We employ a multimodal approach using LoRa adapters for different tasks. The first LoRa adapter was developed by fine-tuning the base model on examples of Behavior Trees and their corresponding commands. The second LoRa adapter was developed by fine-tuning on question-answering examples. Practical trials on a multi-agent system of two robots within the Eurobot 2023 game rules demonstrate promising results. The robots achieve an average task execution accuracy of 79.28% in compound commands. With commands containing up to two tasks accuracy exceeded 90%. Evaluation confirms the system's answers on operators questions exhibit high accuracy, relevance, and informativeness. LLM-MARS and similar multi-agent robotic systems hold significant potential to revolutionize logistics, enabling autonomous exploration missions and advancing Industry 5.0.
Quantifying the Uncertainty of Sensitivity Coefficients Computed from Uncertain Compound Admittance Matrix and Noisy Grid Measurements
- Authors: Rahul Gupta
- Subjects: Systems and Control (eess.SY)
- Arxiv link: https://arxiv.org/abs/2312.09351
- Pdf link: https://arxiv.org/pdf/2312.09351
- Abstract The power-flow sensitivity coefficients (PFSCs) are widely used in the power system for expressing linearized dependencies between the controlled (i.e., the nodal voltages, lines currents) and control variables (e.g., active and reactive power injections, transformer tap positions, etc.). The PFSCs are often computed by knowing the compound admittance matrix of a given network and the grid states. However, when the branch parameters (or admittance matrix) are inaccurate or known with limited accuracy, the computed PFSCs can not be relied upon. Uncertain PFSCs, when used in control, can lead to infeasible control set-points. In this context, this paper presents a method to quantify the uncertainty of the PFSCs from uncertain branch parameters and noisy grid-state measurements that can be used for formulating safe control schemes. We derive an analytical expression using the error-propagation principle. The developed tool is numerically validated using Monte Carlo simulations.
MANTIS at #SMM4H 2023: Leveraging Hybrid and Ensemble Models for Detection of Social Anxiety Disorder on Reddit
- Authors: Sourabh Zanwar, Daniel Wiechmann, Yu Qiao, Elma Kerz
- Subjects: Computation and Language (cs.CL)
- Arxiv link: https://arxiv.org/abs/2312.09451
- Pdf link: https://arxiv.org/pdf/2312.09451
- Abstract This paper presents our system employed for the Social Media Mining for Health 2023 Shared Task 4: Binary classification of English Reddit posts self-reporting a social anxiety disorder diagnosis. We systematically investigate and contrast the efficacy of hybrid and ensemble models that harness specialized medical domain-adapted transformers in conjunction with BiLSTM neural networks. The evaluation results outline that our best performing model obtained 89.31% F1 on the validation set and 83.76% F1 on the test set.
Picking the Underused Heads: A Network Pruning Perspective of Attention Head Selection for Fusing Dialogue Coreference Information
- Authors: Zhengyuan Liu, Nancy F. Chen
- Subjects: Computation and Language (cs.CL)
- Arxiv link: https://arxiv.org/abs/2312.09541
- Pdf link: https://arxiv.org/pdf/2312.09541
- Abstract The Transformer-based models with the multi-head self-attention mechanism are widely used in natural language processing, and provide state-of-the-art results. While the pre-trained language backbones are shown to implicitly capture certain linguistic knowledge, explicitly incorporating structure-aware features can bring about further improvement on the downstream tasks. However, such enhancement often requires additional neural components and increases training parameter size. In this work, we investigate the attention head selection and manipulation strategy for feature injection from a network pruning perspective, and conduct a case study on dialogue summarization. We first rank attention heads in a Transformer-based summarizer with layer-wise importance. We then select the underused heads through extensive analysis, and inject structure-aware features by manipulating the selected heads. Experimental results show that the importance-based head selection is effective for feature injection, and dialogue summarization can be improved by incorporating coreference information via head manipulation.
Extending Context Window of Large Language Models via Semantic Compression
- Authors: Weizhi Fei, Xueyan Niu, Pingyi Zhou, Lu Hou, Bo Bai, Lei Deng, Wei Han
- Subjects: Computation and Language (cs.CL); Information Theory (cs.IT)
- Arxiv link: https://arxiv.org/abs/2312.09571
- Pdf link: https://arxiv.org/pdf/2312.09571
- Abstract Transformer-based Large Language Models (LLMs) often impose limitations on the length of the text input to ensure the generation of fluent and relevant responses. This constraint restricts their applicability in scenarios involving long texts. We propose a novel semantic compression method that enables generalization to texts that are 6-8 times longer, without incurring significant computational costs or requiring fine-tuning. Our proposed framework draws inspiration from source coding in information theory and employs a pre-trained model to reduce the semantic redundancy of long inputs before passing them to the LLMs for downstream tasks. Experimental results demonstrate that our method effectively extends the context window of LLMs across a range of tasks including question answering, summarization, few-shot learning, and information retrieval. Furthermore, the proposed semantic compression method exhibits consistent fluency in text generation while reducing the associated computational overhead.
Multiscale Vision Transformer With Deep Clustering-Guided Refinement for Weakly Supervised Object Localization
- Authors: David Kim, Sinhae Cha, Byeongkeun Kang
- Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2312.09584
- Pdf link: https://arxiv.org/pdf/2312.09584
- Abstract This work addresses the task of weakly-supervised object localization. The goal is to learn object localization using only image-level class labels, which are much easier to obtain compared to bounding box annotations. This task is important because it reduces the need for labor-intensive ground-truth annotations. However, methods for object localization trained using weak supervision often suffer from limited accuracy in localization. To address this challenge and enhance localization accuracy, we propose a multiscale object localization transformer (MOLT). It comprises multiple object localization transformers that extract patch embeddings across various scales. Moreover, we introduce a deep clustering-guided refinement method that further enhances localization accuracy by utilizing separately extracted image segments. These segments are obtained by clustering pixels using convolutional neural networks. Finally, we demonstrate the effectiveness of our proposed method by conducting experiments on the publicly available ILSVRC-2012 dataset.
TOP-ReID: Multi-spectral Object Re-Identification with Token Permutation
- Authors: Yuhao Wang, Xuehu Liu, Pingping Zhang, Hu Lu, Zhengzheng Tu, Huchuan Lu
- Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
- Arxiv link: https://arxiv.org/abs/2312.09612
- Pdf link: https://arxiv.org/pdf/2312.09612
- Abstract Multi-spectral object Re-identification (ReID) aims to retrieve specific objects by leveraging complementary information from different image spectra. It delivers great advantages over traditional single-spectral ReID in complex visual environment. However, the significant distribution gap among different image spectra poses great challenges for effective multi-spectral feature representations. In addition, most of current Transformer-based ReID methods only utilize the global feature of class tokens to achieve the holistic retrieval, ignoring the local discriminative ones. To address the above issues, we step further to utilize all the tokens of Transformers and propose a cyclic token permutation framework for multi-spectral object ReID, dubbled TOP-ReID. More specifically, we first deploy a multi-stream deep network based on vision Transformers to preserve distinct information from different image spectra. Then, we propose a Token Permutation Module (TPM) for cyclic multi-spectral feature aggregation. It not only facilitates the spatial feature alignment across different image spectra, but also allows the class token of each spectrum to perceive the local details of other spectra. Meanwhile, we propose a Complementary Reconstruction Module (CRM), which introduces dense token-level reconstruction constraints to reduce the distribution gap across different image spectra. With the above modules, our proposed framework can generate more discriminative multi-spectral features for robust object ReID. Extensive experiments on three ReID benchmarks (i.e., RGBNT201, RGBNT100 and MSVR310) verify the effectiveness of our methods. The code is available at https://github.com/924973292/TOP-ReID.
Algorithms for automatic intents extraction and utterances classification for goal-oriented dialogue systems
- Authors: Leonid Legashev, Alexander Shukhman, Arthur Zhigalov
- Subjects: Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2312.09658
- Pdf link: https://arxiv.org/pdf/2312.09658
- Abstract Modern machine learning techniques in the natural language processing domain can be used to automatically generate scripts for goal-oriented dialogue systems. The current article presents a general framework for studying the automatic generation of scripts for goal-oriented dialogue systems. A method for preprocessing dialog data sets in JSON format is described. A comparison is made of two methods for extracting user intent based on BERTopic and latent Dirichlet allocation. A comparison has been made of two implemented algorithms for classifying statements of users of a goal-oriented dialogue system based on logistic regression and BERT transformer models. The BERT transformer approach using the bert-base-uncased model showed better results for the three metrics Precision (0.80), F1-score (0.78) and Matthews correlation coefficient (0.74) in comparison with other methods.
Part Representation Learning with Teacher-Student Decoder for Occluded Person Re-identification
- Authors: Shang Gao, Chenyang Yu, Pingping Zhang, Huchuan Lu
- Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM)
- Arxiv link: https://arxiv.org/abs/2312.09797
- Pdf link: https://arxiv.org/pdf/2312.09797
- Abstract Occluded person re-identification (ReID) is a very challenging task due to the occlusion disturbance and incomplete target information. Leveraging external cues such as human pose or parsing to locate and align part features has been proven to be very effective in occluded person ReID. Meanwhile, recent Transformer structures have a strong ability of long-range modeling. Considering the above facts, we propose a Teacher-Student Decoder (TSD) framework for occluded person ReID, which utilizes the Transformer decoder with the help of human parsing. More specifically, our proposed TSD consists of a Parsing-aware Teacher Decoder (PTD) and a Standard Student Decoder (SSD). PTD employs human parsing cues to restrict Transformer's attention and imparts this information to SSD through feature distillation. Thereby, SSD can learn from PTD to aggregate information of body parts automatically. Moreover, a mask generator is designed to provide discriminative regions for better ReID. In addition, existing occluded person ReID benchmarks utilize occluded samples as queries, which will amplify the role of alleviating occlusion interference and underestimate the impact of the feature absence issue. Contrastively, we propose a new benchmark with non-occluded queries, serving as a complement to the existing benchmark. Extensive experiments demonstrate that our proposed method is superior and the new benchmark is essential. The source codes are available at https://github.com/hh23333/TSD.
Grammatical information in BERT sentence embeddings as two-dimensional arrays
- Authors: Vivi Nastase, Paola Merlo
- Subjects: Computation and Language (cs.CL)
- Arxiv link: https://arxiv.org/abs/2312.09890
- Pdf link: https://arxiv.org/pdf/2312.09890
- Abstract Sentence embeddings induced with various transformer architectures encode much semantic and syntactic information in a distributed manner in a one-dimensional array. We investigate whether specific grammatical information can be accessed in these distributed representations. Using data from a task developed to test rule-like generalizations, our experiments on detecting subject-verb agreement yield several promising results. First, we show that while the usual sentence representations encoded as one-dimensional arrays do not easily support extraction of rule-like regularities, a two-dimensional reshaping of these vectors allows various learning architectures to access such information. Next, we show that various architectures can detect patterns in these two-dimensional reshaped sentence embeddings and successfully learn a model based on smaller amounts of simpler training data, which performs well on more complex test data. This indicates that current sentence embeddings contain information that is regularly distributed, and which can be captured when the embeddings are reshaped into higher dimensional arrays. Our results cast light on representations produced by language models and help move towards developing few-shot learning approaches.
Exploring Automatic Text Simplification of German Narrative Documents
- Authors: Thorben Schomacker, Tillmann Dönicke, Marina Tropmann-Frick
- Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2312.09907
- Pdf link: https://arxiv.org/pdf/2312.09907
- Abstract In this paper, we apply transformer-based Natural Language Generation (NLG) techniques to the problem of text simplification. Currently, there are only a few German datasets available for text simplification, even fewer with larger and aligned documents, and not a single one with narrative texts. In this paper, we explore to which degree modern NLG techniques can be applied to German narrative text simplifications. We use Longformer attention and a pre-trained mBART model. Our findings indicate that the existing approaches for German are not able to solve the task properly. We conclude on a few directions for future research to address this problem.
DHFormer: A Vision Transformer-Based Attention Module for Image Dehazing
- Authors: Abdul Wasi, O. Jeba Shiney
- Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
- Arxiv link: https://arxiv.org/abs/2312.09955
- Pdf link: https://arxiv.org/pdf/2312.09955
- Abstract Images acquired in hazy conditions have degradations induced in them. Dehazing such images is a vexed and ill-posed problem. Scores of prior-based and learning-based approaches have been proposed to mitigate the effect of haze and generate haze-free images. Many conventional methods are constrained by their lack of awareness regarding scene depth and their incapacity to capture long-range dependencies. In this paper, a method that uses residual learning and vision transformers in an attention module is proposed. It essentially comprises two networks: In the first one, the network takes the ratio of a hazy image and the approximated transmission matrix to estimate a residual map. The second network takes this residual image as input and passes it through convolution layers before superposing it on the generated feature maps. It is then passed through global context and depth-aware transformer encoders to obtain channel attention. The attention module then infers the spatial attention map before generating the final haze-free image. Experimental results, including several quantitative metrics, demonstrate the efficiency and scalability of the suggested methodology.
LLaMAntino: LLaMA 2 Models for Effective Text Generation in Italian Language
- Authors: Pierpaolo Basile, Elio Musacchio, Marco Polignano, Lucia Siciliani, Giuseppe Fiameni, Giovanni Semeraro
- Subjects: Computation and Language (cs.CL)
- Arxiv link: https://arxiv.org/abs/2312.09993
- Pdf link: https://arxiv.org/pdf/2312.09993
- Abstract Large Language Models represent state-of-the-art linguistic models designed to equip computers with the ability to comprehend natural language. With its exceptional capacity to capture complex contextual relationships, the LLaMA (Large Language Model Meta AI) family represents a novel advancement in the field of natural language processing by releasing foundational models designed to improve the natural language understanding abilities of the transformer architecture thanks to their large amount of trainable parameters (7, 13, and 70 billion parameters). In many natural language understanding tasks, these models obtain the same performances as private company models such as OpenAI Chat-GPT with the advantage to make publicly available weights and code for research and commercial uses. In this work, we investigate the possibility of Language Adaptation for LLaMA models, explicitly focusing on addressing the challenge of Italian Language coverage. Adopting an open science approach, we explore various tuning approaches to ensure a high-quality text generated in Italian suitable for common tasks in this underrepresented language in the original models' datasets. We aim to release effective text generation models with strong linguistic properties for many tasks that seem challenging using multilingual or general-purpose LLMs. By leveraging an open science philosophy, this study contributes to Language Adaptation strategies for the Italian language by introducing the novel LLaMAntino family of Italian LLMs.
Accelerating Neural Network Training: A Brief Review
- Authors: Sahil Nokhwal, Priyanka Chilakalapudi, Preeti Donekal, Manoj Chandrasekharan, Suman Nokhwal, Ram Swaroop, Raj Bala, Saurabh Pahune, Ankit Chaudhary
- Subjects: Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2312.10024
- Pdf link: https://arxiv.org/pdf/2312.10024
- Abstract The process of training a deep neural network is characterized by significant time requirements and associated costs. Although researchers have made considerable progress in this area, further work is still required due to resource constraints. This study examines innovative approaches to expedite the training process of deep neural networks (DNN), with specific emphasis on three state-of-the-art models such as ResNet50, Vision Transformer (ViT), and EfficientNet. The research utilizes sophisticated methodologies, including Gradient Accumulation (GA), Automatic Mixed Precision (AMP), and Pin Memory (PM), in order to optimize performance and accelerate the training procedure. The study examines the effects of these methodologies on the DNN models discussed earlier, assessing their efficacy with regard to training rate and computational efficacy. The study showcases the efficacy of including GA as a strategic approach, resulting in a noteworthy decrease in the duration required for training. This enables the models to converge at a faster pace. The utilization of AMP enhances the speed of computations by taking advantage of the advantages offered by lower precision arithmetic while maintaining the correctness of the model. Furthermore, this study investigates the application of Pin Memory as a strategy to enhance the efficiency of data transmission between the central processing unit and the graphics processing unit, thereby offering a promising opportunity for enhancing overall performance. The experimental findings demonstrate that the combination of these sophisticated methodologies significantly accelerates the training of DNNs, offering vital insights for experts seeking to improve the effectiveness of deep learning processes.
Point Transformer V3: Simpler, Faster, Stronger
- Authors: Xiaoyang Wu, Li Jiang, Peng-Shuai Wang, Zhijian Liu, Xihui Liu, Yu Qiao, Wanli Ouyang, Tong He, Hengshuang Zhao
- Subjects: Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2312.10035
- Pdf link: https://arxiv.org/pdf/2312.10035
- Abstract This paper is not motivated to seek innovation within the attention mechanism. Instead, it focuses on overcoming the existing trade-offs between accuracy and efficiency within the context of point cloud processing, leveraging the power of scale. Drawing inspiration from recent advances in 3D large-scale representation learning, we recognize that model performance is more influenced by scale than by intricate design. Therefore, we present Point Transformer V3 (PTv3), which prioritizes simplicity and efficiency over the accuracy of certain mechanisms that are minor to the overall performance after scaling, such as replacing the precise neighbor search by KNN with an efficient serialized neighbor mapping of point clouds organized with specific patterns. This principle enables significant scaling, expanding the receptive field from 16 to 1024 points while remaining efficient (a 3x increase in processing speed and a 10x improvement in memory efficiency compared with its predecessor, PTv2). PTv3 attains state-of-the-art results on over 20 downstream tasks that span both indoor and outdoor scenarios. Further enhanced with multi-dataset joint training, PTv3 pushes these results to a higher level.
Keyword: scene understanding
Weakly-Supervised 3D Visual Grounding based on Visual Linguistic Alignment
- Authors: Xiaoxu Xu, Yitian Yuan, Qiudan Zhang, Wenhui Wu, Zequn Jie, Lin Ma, Xu Wang
- Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
- Arxiv link: https://arxiv.org/abs/2312.09625
- Pdf link: https://arxiv.org/pdf/2312.09625
- Abstract Learning to ground natural language queries to target objects or regions in 3D point clouds is quite essential for 3D scene understanding. Nevertheless, existing 3D visual grounding approaches require a substantial number of bounding box annotations for text queries, which is time-consuming and labor-intensive to obtain. In this paper, we propose \textbf{3D-VLA}, a weakly supervised approach for \textbf{3D} visual grounding based on \textbf{V}isual \textbf{L}inguistic \textbf{A}lignment. Our 3D-VLA exploits the superior ability of current large-scale vision-language models (VLMs) on aligning the semantics between texts and 2D images, as well as the naturally existing correspondences between 2D images and 3D point clouds, and thus implicitly constructs correspondences between texts and 3D point clouds with no need for fine-grained box annotations in the training procedure. During the inference stage, the learned text-3D correspondence will help us ground the text queries to the 3D target objects even without 2D images. To the best of our knowledge, this is the first work to investigate 3D visual grounding in a weakly supervised manner by involving large scale vision-language models, and extensive experiments on ReferIt3D and ScanRefer datasets demonstrate that our 3D-VLA achieves comparable and even superior results over the fully supervised methods.
Keyword: visual reasoning
One Self-Configurable Model to Solve Many Abstract Visual Reasoning Problems
- Authors: Mikołaj Małkiński, Jacek Mańdziuk
- Subjects: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2312.09997
- Pdf link: https://arxiv.org/pdf/2312.09997
- Abstract Abstract Visual Reasoning (AVR) comprises a wide selection of various problems similar to those used in human IQ tests. Recent years have brought dynamic progress in solving particular AVR tasks, however, in the contemporary literature AVR problems are largely dealt with in isolation, leading to highly specialized task-specific methods. With the aim of developing universal learning systems in the AVR domain, we propose the unified model for solving Single-Choice Abstract visual Reasoning tasks (SCAR), capable of solving various single-choice AVR tasks, without making any a priori assumptions about the task structure, in particular the number and location of panels. The proposed model relies on a novel Structure-Aware dynamic Layer (SAL), which adapts its weights to the structure of the considered AVR problem. Experiments conducted on Raven's Progressive Matrices, Visual Analogy Problems, and Odd One Out problems show that SCAR (SAL-based models, in general) effectively solves diverse AVR tasks, and its performance is on par with the state-of-the-art task-specific baselines. What is more, SCAR demonstrates effective knowledge reuse in multi-task and transfer learning settings. To our knowledge, this work is the first successful attempt to construct a general single-choice AVR solver relying on self-configurable architecture and unified solving method. With this work we aim to stimulate and foster progress on task-independent research paths in the AVR domain, with the long-term goal of development of a general AVR solver.