Donghyun Kim issues

Results: 102 issues by Donghyun Kim

[85] When Vision Transformers Outperform ResNets without Pre-training or Strong Data Augmentations

[paper](https://arxiv.org/abs/2106.01548) A paper I'm grateful for: it made me look up NTK and got me studying math again. With no augmentation and no pretraining, SAM works remarkably well on ViT and MLP-Mixer, and they end up beating ResNet...

ICLR22
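As a reminder of what SAM (Sharpness-Aware Minimization) does per step, here is a minimal sketch on a toy 2-D quadratic. The loss function, learning rate, and `rho` are made-up values for illustration and are not taken from the paper's experiments. Each step first moves to an approximate worst-case point inside a small L2 ball, then applies the gradient measured there to the original weights.

```python
import math

def loss(w):
    """Toy 2-D loss (hypothetical): steep along w[0], shallow along w[1]."""
    return 5.0 * w[0] ** 2 + 0.5 * w[1] ** 2

def grad(w):
    """Analytic gradient of the toy loss."""
    return [10.0 * w[0], 1.0 * w[1]]

def sam_step(w, lr=0.05, rho=0.05):
    g = grad(w)
    norm = math.sqrt(sum(x * x for x in g)) + 1e-12
    # Ascent step: approximate worst point within an L2 ball of radius rho.
    w_adv = [wi + rho * gi / norm for wi, gi in zip(w, g)]
    # Descent step: gradient at the perturbed point, applied to the original weights.
    g_adv = grad(w_adv)
    return [wi - lr * gi for wi, gi in zip(w, g_adv)]

w = [1.0, 1.0]
history = [loss(w)]
for _ in range(50):
    w = sam_step(w)
    history.append(loss(w))
print(f"loss: {history[0]:.4f} -> {history[-1]:.6f}")
```

In a real training loop this costs two forward/backward passes per step, which is the main price of using SAM.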

[84] A Loss Curvature Perspective on Training Instability in Deep Learning

[paper](https://arxiv.org/pdf/2110.04369.pdf) Let λ_1 denote the maximum eigenvalue of the loss Hessian. Likewise, if the loss Hessian has k eigenvalues, let λ_k denote its minimum eigenvalue. `λ_1 <...

Google
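To make λ_1 concrete, the sketch below estimates the largest eigenvalue of a small symmetric matrix with power iteration. This is a standard technique swapped in for illustration, not necessarily how the paper measures curvature, and the 2x2 `hessian` is a hypothetical stand-in for a real loss Hessian. Power iteration finds the eigenvalue of largest magnitude, which equals λ_1 only when that eigenvalue is positive, as it is here.

```python
import math

def matvec(m, v):
    """Matrix-vector product for list-of-lists matrices."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def top_eigenvalue(hessian, iters=100):
    """Estimate the dominant eigenvalue of a symmetric matrix by power iteration."""
    v = [1.0] + [0.0] * (len(hessian) - 1)
    for _ in range(iters):
        hv = matvec(hessian, v)
        norm = math.sqrt(sum(x * x for x in hv))
        v = [x / norm for x in hv]
    # Rayleigh quotient of the converged unit vector.
    return sum(a * b for a, b in zip(v, matvec(hessian, v)))

# Hypothetical Hessian of the quadratic loss 0.5 * w^T H w; its eigenvalues are 3 and 1.
hessian = [[2.0, 1.0], [1.0, 2.0]]
lam_1 = top_eigenvalue(hessian)
print(f"estimated lambda_1: {lam_1:.6f}")
```

For a real network the explicit matrix is replaced by Hessian-vector products, since the full Hessian is too large to materialize.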

[88] Training Compute-Optimal Large Language Models (Chinchilla)

Current LLMs are undertrained!! Gopher is the main baseline. Cutting Gopher's parameter count by 4x and increasing the training data by 4x hits SOTA. [paper](https://arxiv.org/pdf/2203.15556.pdf) In the figure below, FLOPs...

DeepMind

LLM
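A quick arithmetic check of the "4x fewer parameters, 4x more data" trade in the Chinchilla note, as a sketch. It assumes the common rule of thumb that training compute is roughly C ≈ 6·N·D (N parameters, D tokens), and it uses the commonly reported Gopher scale of about 280B parameters and 300B tokens as an illustrative baseline. The helper names are hypothetical.

```python
def train_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training FLOPs, assuming the C ~= 6 * N * D rule of thumb."""
    return 6.0 * n_params * n_tokens

def rescale(n_params: float, n_tokens: float, factor: float):
    """Shrink the model by `factor` and grow the dataset by the same factor."""
    return n_params / factor, n_tokens * factor

# Illustrative baseline (assumption): roughly Gopher-sized.
base_n, base_d = 280e9, 300e9
small_n, small_d = rescale(base_n, base_d, 4)

base_c = train_flops(base_n, base_d)
small_c = train_flops(small_n, small_d)
print(f"baseline: N={base_n:.2e}, D={base_d:.2e}, C={base_c:.3e}")
print(f"rescaled: N={small_n:.2e}, D={small_d:.2e}, C={small_c:.3e}")
print(f"tokens per parameter: {base_d / base_n:.2f} -> {small_d / small_n:.2f}")
```

Under this approximation the compute budget is unchanged, while the tokens-per-parameter ratio grows by 16x, which is the sense in which the larger model was undertrained.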

[83] Scaling Up Your Kernels to 31x31: Revisiting Large Kernel Design in CNNs (RepLKNet)

[paper](https://arxiv.org/pdf/2203.06717.pdf) [code - MegEngine](https://github.com/megvii-research/RepLKNet) [code - pytorch](https://github.com/DingXiaoH/RepLKNet-pytorch) Rep~-style research from MEGVII, which I personally like. It uses large 31x31 kernels to get strong performance. ERF results ![image](https://user-images.githubusercontent.com/16400591/162907189-c7390e42-a6dd-4a2b-9020-5e5419acc2ed.png) # RepLKNet ## Custom Kernel...

MEGVII
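The "Rep~ style" mentioned in the RepLKNet note refers to structural re-parameterization: train with parallel branches, then fold them into a single kernel for inference. Below is a simplified single-channel sketch of that idea. It ignores batch norm, uses 7x7 and 3x3 kernels instead of 31x31 to keep the pure-Python loop fast, and all function names are hypothetical. It checks that a large-kernel branch plus a small-kernel branch equals one convolution with the zero-padded small kernel added into the large one.

```python
import random

def conv2d_same(image, kernel):
    """Single-channel 2D cross-correlation with zero padding ('same' output size)."""
    h, w = len(image), len(image[0])
    k = len(kernel)  # square, odd-sized kernel
    pad = k // 2
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for di in range(k):
                for dj in range(k):
                    y, x = i + di - pad, j + dj - pad
                    if 0 <= y < h and 0 <= x < w:
                        acc += image[y][x] * kernel[di][dj]
            out[i][j] = acc
    return out

def merge_kernels(large, small):
    """Zero-pad the small kernel to the large size and add it to the center."""
    offset = (len(large) - len(small)) // 2
    merged = [row[:] for row in large]
    for i, row in enumerate(small):
        for j, value in enumerate(row):
            merged[i + offset][j + offset] += value
    return merged

random.seed(0)
rand = lambda n, m: [[random.uniform(-1, 1) for _ in range(m)] for _ in range(n)]
image = rand(12, 12)
large, small = rand(7, 7), rand(3, 3)

# Training-time view: two parallel branches whose outputs are summed.
a, b = conv2d_same(image, large), conv2d_same(image, small)
two_branch = [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]
# Inference-time view: one convolution with the merged kernel.
one_branch = conv2d_same(image, merge_kernels(large, small))

max_diff = max(abs(x - y) for r1, r2 in zip(two_branch, one_branch) for x, y in zip(r1, r2))
print(f"max abs difference: {max_diff:.2e}")
```

The equivalence holds because convolution is linear in the kernel, so the extra branch costs nothing at inference time.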

[87] PaLM: Scaling Language Modeling with Pathways

[paper](https://arxiv.org/pdf/2204.02311.pdf) [blog](https://ai.googleblog.com/2022/04/pathways-language-model-palm-scaling-to.html) Trains a 540B GPT-like model on 6144 TPU v4 chips, using the Pathways system (#115). Main things to look at in the paper: 1. Efficient Scaling: _how the Pathways system was used_...

Google

Pretraining

LLM

[86] Pathways: Asynchronous Distributed Dataflow for ML

[paper](https://arxiv.org/abs/2203.12533) How should work be distributed across TPUs? In PyTorch, the same program is launched on every GPU, and each program performs collective operations when needed. In TF v1, a single program...

Google

TODO LIST

# prompt Calibrate Before Use: Improving Few-Shot Performance of Language Models (https://arxiv.org/abs/2102.09690) prompt tuning (https://arxiv.org/abs/2104.08691) Do Prompt-Based Models Really Understand the Meaning of their Prompts? (https://arxiv.org/abs/2109.01247) An Empirical Study on Few-shot...

[82] Efficient Language Modeling with Sparse all-MLP (sMLP)

`rosinality`'s comment ``` Training an LM with gated MLPs. An MoE-based sparse model that reaches higher performance with less compute. Tackles autoregressive LM with an all-MLP architecture. Because it is autoregressive, in MoE routing the later tokens...

Pretraining

Meta AI

MoE

MLP

[81] Language Matters: A Weakly Supervised Pre-training Approach for Scene Text Detection and Spotting

Is everyone with "Bai" in their name just good at OCR? (~following Xiang Bai sensei..~) ![image](https://user-images.githubusercontent.com/16400591/158283259-b4b22e15-3412-4e23-a90c-d60eb84fc676.png) Proposes a pretraining strategy for OCR tasks. [paper](https://arxiv.org/pdf/2203.03911.pdf) # INTRO The three pipelines are shown in a figure 1. OCR...

OCR

SenseTime

Pretraining

[80] cosFormer: Rethinking Softmax in Attention

Ha, so ReLU is the fast one, lol ![image](https://user-images.githubusercontent.com/16400591/158086402-91a55d2a-e6e5-4aca-92ba-72262093c17d.png) [paper](https://arxiv.org/pdf/2202.08791.pdf) ## self-attention A == self attention function. The attention output is defined as follows. ![image](https://user-images.githubusercontent.com/16400591/158086447-8818426c-6466-4ca7-b5ce-b81fdb63affb.png) Usually S is defined as follows. ![image](https://user-images.githubusercontent.com/16400591/158086734-2cf02e94-6d6b-4231-8960-f664dc6bd55a.png) ![image](https://user-images.githubusercontent.com/16400591/158086935-0fc776da-5920-4691-9a88-d7b7b4b0d12b.png)...

SenseTime

Light Attention

ICLR22
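Why ReLU can be "the fast one" in the cosFormer note: once the similarity S(q, k) is a plain dot product of non-negative features rather than a softmax, the matrix product can be reordered so the n x n attention matrix is never built. The sketch below shows only that reordering with ReLU features. It leaves out cosFormer's cos-based re-weighting, and the helper names and the `EPS` stabilizer are assumptions made for illustration.

```python
import random

def matmul(a, b):
    """Plain list-of-lists matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def relu_features(m):
    """Elementwise ReLU, used here as the non-negative feature map."""
    return [[max(x, 0.0) for x in row] for row in m]

EPS = 1e-6  # keeps the normalizer away from zero; an assumption of this sketch

def quadratic_attention(q, k, v):
    """Build the full n x n similarity matrix, normalize each row, then apply to V."""
    s = matmul(q, transpose(k))
    weights = [[w / (sum(row) + EPS) for w in row] for row in s]
    return matmul(weights, v)

def linear_attention(q, k, v):
    """Compute K^T V and the column sums of K once, then reuse them for every query."""
    kv = matmul(transpose(k), v)
    k_sum = [sum(col) for col in zip(*k)]
    out = []
    for q_i, num in zip(q, matmul(q, kv)):
        z = sum(a * b for a, b in zip(q_i, k_sum)) + EPS
        out.append([x / z for x in num])
    return out

random.seed(0)
n, d = 6, 4
rand = lambda: [[random.uniform(-1.0, 1.0) for _ in range(d)] for _ in range(n)]
q, k, v = relu_features(rand()), relu_features(rand()), rand()

a = quadratic_attention(q, k, v)
b = linear_attention(q, k, v)
max_diff = max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))
print(f"max abs difference: {max_diff:.2e}")
```

Both paths give the same output, but the second costs O(n·d²) instead of O(n²·d), which is where the speedup for long sequences comes from.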