Papers-books-and-blogs
This repository contains a list of the books, blogs, research papers, white papers, and theses that I have read and found interesting.
Table of contents
- AI, DL, NLP and RL
- Calculus
- Computer Architecture
- Computer Graphics
- Data Structures and Algorithms
- Digital Electronics
- Graph Theory
- Information Theory
- Linear Algebra
- Measure Theory
- Optimization Theory
- Probability and Stochastic Processes
- Quantum Computing
- Signal Processing
AI, DL, NLP and RL
- 1-bit Adam: communication efficient large-scale training with Adam’s convergence speed
- 5 best practices for efficient model training
- 8-bit approximations for parallelism in deep learning
- 8-bit optimizers via block-wise quantization
- A 'neural' network that learns to play Backgammon
- A BetterTransformer for fast transformer inference
- A deep reinforced model for abstractive summarization
- A dynamical approach to temporal pattern processing
- A few more examples may be worth billions of parameters
- A general and adaptive robust loss function
- A generalist agent
- A gentle introduction to 8-bit matrix multiplication for transformers at scale using Hugging Face transformers, accelerate and bitsandbytes
- A note on the evaluation of generative models
- A robust self-learning method for fully unsupervised cross-lingual mappings of word embeddings
- A simple but tough-to-beat baseline for sentence embeddings
- A simple language model for task-oriented dialogue
- A simple neural attentive meta-learner
- A simple neural network module for relational reasoning
- A study of BFLOAT16 for deep learning training
- A style-based generator architecture for generative adversarial networks
- A stylometric inquiry into hyperpartisan and fake news
- A3T: adversarially augmented adversarial training
- Accelerated PyTorch 2 transformers
- Accelerating large language model training with variable sparse pre-training and dense fine-tuning
- Accelerating PyTorch with CUDA graphs
- AdapterHub: a framework for adapting transformers
- Adversarial approximate inference for speech to electroglottograph conversion
- Adversarial autoencoders
- Adversarial examples that fool both computer vision and time-limited humans
- Adversarial feature learning
- Adversarial generation of natural language
- Adversarial information factorization
- Adversarially learned inference
- AlexaTM 20B: few-shot learning using a large-scale multilingual seq2seq model
- Amazon SageMaker model parallelism: a general and flexible framework for large model training
- An image is worth 16x16 words: transformers for image recognition at scale
- An overview of gradient descent optimization algorithms
- Analysing mathematical reasoning abilities of neural models
- Approximation by superpositions of a sigmoidal function
- Artificial Intelligence: a modern approach
- Aspect based sentiment analysis with gated convolutional networks
- Attention is all you need
- Attention is off by one
- Auto-encoding variational Bayes
- Backpropagation through the void: optimizing control variates for black-box gradient estimation
- BART: denoising sequence-to-sequence pre-training for natural language generation, translation and comprehension
- Batch normalization: accelerating deep network training by reducing internal covariate shift
- Behavioral cloning from observation
- BERT: pre-training of deep bidirectional transformers for language understanding
- Better & faster large language models via multi-token prediction
- Beyond domain APIs: task-oriented conversational modeling with unstructured knowledge access
- BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation
- Blockwise parallel transformer for large context models
- BLOOM: a 176B-parameter open-access multilingual language model
- Bootstrapping entity alignment with knowledge graph embedding
- Bridging the gap between prior and posterior knowledge selection for knowledge-grounded dialogue generation
- Bringing open large language models to consumer devices
- BTLM-3B-8K: 7B performance in a 3 billion parameter model
- Building blocks for a complex-valued transformer architecture
- CATS: contextually-aware thresholding for sparsity in large language models
- ChatGPT: optimizing language models for dialogue
- ColBERT: efficient and effective passage search via contextualized late interaction over BERT
- Colossal-AI: a unified deep learning system for large-scale parallel training
- Compiling machine learning programs via high-level tracing
- Complex transformer: a framework for modeling complex-valued sequence
- Conceptual captions: a cleaned, hypernymed, image alt-text dataset for automatic image captioning
- Conditional image synthesis with auxiliary classifier GANs
- Conformal nucleus sampling
- Connecting large language models with evolutionary algorithms yields powerful prompt optimizers
- Connectivity versus entropy
- Constituency parsing with a self-attentive encoder
- Constraint based knowledge base distillation in end-to-end task oriented dialogs
- Context generation improves open domain question answering
- Convert transformers to ONNX with Hugging Face Optimum
- Convolutional networks on graphs for learning molecular fingerprints
- Convolutional neural network language models
- Countering adversarial images using input transformations
- Cramming: training a language model on a single GPU in one day
- Crosslingual generalization through multitask finetuning
- Curriculum learning
- Cutting down on prompts and parameters: simple few-shot learning with language models
- Data engineering for scaling language models to 128K context
- Deep Boltzmann machines
- Deep complex networks
- Deep learning
- Deep learning and the information bottleneck principle
- Deep learning techniques for super-resolution in video games
- Deep residual learning for image recognition
- Deep text classification can be fooled
- DeepSpeed compression: a composable library for extreme compression and zero-cost quantization
- DeepSpeed Inference: enabling efficient inference of transformer models at unprecedented scale
- DeepSpeed powers 8x larger MoE model training with high performance
- DeepSpeed Ulysses: system optimizations for enabling training of extreme long sequence transformer models
- DeepSpeed: accelerating large-scale model inference and training via system optimizations and compression
- DeepSpeed: advancing MoE inference and training to power next-generation AI scale
- Denoising distantly supervised open-domain question answering
- Diffusion convolutional recurrent neural network: data-driven traffic forecasting
- Discrete variational autoencoders
- Disentangling by factorising
- Disentangling language and knowledge in task-oriented dialogs
- Distributionally robust language modeling
- Editing models with task arithmetic
- Efficient estimation of word representations in vector space
- Efficient large scale language modeling with mixtures of experts
- Efficient large-scale language model training on GPU clusters using Megatron-LM
- Enhancing the reliability of out-of-distribution image detection in neural networks
- End-to-end task-oriented dialog modeling with semi-structured knowledge management
- Enhance reasoning for large language models in the game Werewolf
- Ensemble adversarial training: attacks and defenses
- Equilibrium propagation: bridging the gap between energy-based models and backpropagation
- Estimating or propagating gradients through stochastic neurons for conditional computation
- Exemplar encoder-decoder for neural conversation generation
- Expert human-level driving in Gran Turismo Sport using deep reinforcement learning with image-based representation
- Exploring deep recurrent models with reinforcement learning for molecule design
- Exploring the limits of transfer learning with a unified text-to-text transformer
- Extreme compression for pre-trained transformers made simple and efficient
- Fast abstractive summarization with reinforce-selected sentence rewriting
- Fast benchmarking of accuracy vs. training time with cyclic learning rates
- Fast transformer decoding: one write-head is all you need
- Few-shot parameter-efficient fine-tuning is better and cheaper than in-context learning
- FFJORD: free-form continuous dynamics for scalable reversible generative models
- Finetuned language models are zero-shot learners
- Flash-decoding for long-context inference
- FlashAttention: fast and memory-efficient exact attention with IO-awareness
- FlashAttention: fast transformer training with long sequences
- Foundations of NLP explained visually: beam search, how it works
- FP8 formats for deep learning
- FP8-LM: training FP8 large language models
- Gemini: a family of highly capable multimodal models
- Gemma: open models based on Gemini research and technology
- Generating adversarial examples with adversarial networks
- Generating sentences from a continuous space
- Generation-augmented retrieval for open-domain question answering
- Generative adversarial nets
- Generative pretraining from pixels
- Genetic algorithms in search, optimization and machine learning
- GeoMAN: multi-level attention networks for geo-sensory time series prediction
- Getting the most out of the NVIDIA A100 GPU with Multi-Instance GPU
- GLaM: efficient scaling of language models with mixture-of-experts
- GLM-130B: an open bilingual pre-trained model
- GLU variants improve transformer
- Going deeper with convolutions
- GPT-4 architecture, infrastructure, training dataset, costs, vision, MoE
- GPT-NeoX-20B: an open-source autoregressive language model
- GQA: training generalized multi-query transformer models from multi-head checkpoints
- Gradient-based hyperparameter optimization through reversible learning
- Graph attention networks
- Grounding large language models in interactive environments with online reinforcement learning
- Hierarchical neural story generation
- Hindsight: posterior-guided training of retrievers for improved open-ended generation
- HiPPO: recurrent memory with optimal polynomial projections
- HotFlip: white-box adversarial examples for text classification
- How big should my language model be?
- How PyTorch 2.0 accelerates deep learning with operator fusion and CPU/GPU code-generation
- How should AI systems behave, and who should decide?
- How we sped up transformer inference 100x for 🤗 API customers
- How 🤗 Accelerate runs very large models thanks to PyTorch
- Hydragen: high-throughput LLM inference with shared prefixes
- HyKnow: end-to-end task-oriented dialog modeling with hybrid knowledge management
- Hyperparameter search with Transformers and Ray Tune
- Image-to-image translation with conditional generative adversarial networks
- ImageNet classification with deep convolutional neural networks
- Improving entity linking by modeling latent relations between mentions
- Improving language models by retrieving from trillions of tokens
- Improving language understanding by generative pre-training
- Improving reinforcement learning from human feedback with efficient reward model ensemble
- Incredibly fast BLOOM inference with DeepSpeed and Accelerate
- Inference suboptimality in variational autoencoders
- InfoGAN: interpretable representation learning by information maximizing generative adversarial nets
- Interpretable convolutional neural networks via feedforward design
- Introducing MPT-7B: a new standard for open-source, commercially usable LLMs
- Introducing nvFuser, a deep learning compiler for PyTorch
- Introducing Turing image super resolution: AI powered image enhancements for Microsoft Edge and Bing maps
- Introducing 🤗 accelerate
- Is ChatGPT 175 billion parameters? Technical analysis
- Is the future of neural networks Sparse? An introduction (1/N)
- Jack of all trades, master of some, a multi-purpose transformer agent
- Joint reasoning on hybrid-knowledge sources for task-oriented dialog
- Judging LLM-as-a-judge with MT-bench and chatbot arena
- Know what you don't know: unanswerable questions for SQuAD
- Knowledge-grounded dialogue generation with pre-trained language models
- Language is not all you need: aligning perception with language models
- Language modeling with gated convolutional networks
- Language modelling with pixels
- Language models (mostly) know what they know
- Language models are unsupervised multitask learners
- Language models as compilers: simulating pseudocode execution improves algorithmic reasoning in language models
- Large language models are not fair evaluators
- Layer normalization
- Layer-condensed KV cache for efficient inference of large language models
- Learning activation functions to improve deep neural networks
- Learning associative inference using fast weight memory
- Learning discourse-level diversity for neural dialog models using conditional variational autoencoders
- Learning on a general network
- Learning representations by back-propagating errors
- Learning transferable visual models from natural language supervision
- Learning word embeddings efficiently with noise-contrastive estimation
- Leave no context behind: efficient infinite context transformers with infini-attention
- Lessons learned on language model safety and misuse
- Lifelong language pretraining with distribution-specialized experts
- Linear scaling made possible with weight streaming
- Linformer: self-attention with linear complexity
- LLM in a flash: efficient large language model inference with limited memory
- LLM.int8(): 8-bit matrix multiplication for transformers at scale
- Long sequence modeling with XGen: a 7B LLM trained on 8K input sequence length
- LoRA: Low-Rank Adaptation of large language models
- Lost in the middle: how language models use long contexts
- M6-10T: a sharing-delinking paradigm for efficient multi-trillion parameter pretraining
- Machine learning
- Machine learning: a probabilistic perspective
- Making deep learning go brrrr from first principles
- Making DeepSpeed ZeRO run efficiently on more-affordable hardware
- Mask & focus: conversation modelling by learning concepts
- Matryoshka representation learning
- Maximizing communication efficiency for large-scale training via 0/1 Adam
- MCR-DL: mix-and-match communication runtime for deep learning
- MegaBlocks: efficient sparse training with mixture-of-experts
- Megatron-LM: training multi-billion parameter language models using model parallelism
- Memory-efficient pipeline-parallel DNN training
- MinTL: minimalist transfer learning for task-oriented dialogue systems
- Mix and match: learning-free controllable text generation using energy language models
- Mixed precision training
- Mixture of attention heads: selecting attention heads per token
- Mixture-of-Experts meets instruction tuning: a winning combination for large language models
- mixup: beyond empirical risk minimization
- MMCoQA: conversational question answering over text, tables and images
- Mode matching in GANs through latent space learning and inversion
- Multi-level memory for task oriented dialogs
- Multitask prompt tuning enables parameter-efficient transfer learning
- MultiWOZ - A large-scale multi-domain Wizard-of-Oz dataset for task-oriented dialogue modelling
- Mutual information neural estimation
- NeMo: a toolkit for building AI applications using neural modules
- Neural GPUs learn algorithms
- Neural network methods for natural language processing
- Neural networks and physical systems with emergent collective computational abilities
- Neural networks for pattern recognition
- Neural ordinary differential equations
- No train no gain: revisiting efficient training algorithms for transformer-based language models
- Obfuscated gradients give a false sense of security: circumventing defenses to adversarial examples
- OctoPack: instruction tuning code large language models
- On the convergence of Adam and beyond
- On the power of neural networks for solving hard problems
- One model to learn them all
- Open domain question answering over tables via dense retrieval
- Open question answering over tables and text
- OPT: open pre-trained transformer language models
- Optimal brain compression: a framework for accurate post-training quantization and pruning
- Optimal perceptual inference
- Optimization story: Bloom inference
- Orca 2: teaching small language models how to reason
- Orca: progressive learning from complex explanation traces of GPT-4
- Outer product-based neural collaborative filtering
- Outrageously large neural networks: the sparsely-gated mixture-of-experts layer
- Overcoming oscillations in quantization-aware training
- PAL: Program-aided language models
- PaLM: scaling language modeling with pathways
- Parallel context windows improve in-context learning of large language models
- Pattern classification
- Pattern recognition and machine learning
- Perceptual losses for real-time style transfer and super-resolution
- Personalizing dialogue agents: I have a dog, do you have pets too?
- Phase-functioned neural networks for character control
- Playing Atari with deep reinforcement learning
- Pre-train, prompt, and predict: a systematic survey of prompting methods in natural language processing
- Prefix-tuning: optimizing continuous prompts for generation
- Probabilistic latent semantic analysis
- Progressive growing of GANs for improved quality, stability and variation
- Prompting with pseudo-code instructions
- Proximal policy optimization algorithms
- PullNet: open domain question answering with iterative retrieval on knowledge bases and text
- PyTorch trace analysis for the masses
- Q-BERT: Hessian based ultra low precision quantization of BERT
- R3Net: recurrent residual refinement network for saliency detection
- Reading Wikipedia to answer open-domain questions
- REALM: retrieval-augmented language model pretraining
- Recurrent models of visual attention
- Reducing activation recomputation in large transformer models
- Regularizing and optimizing LSTM language models
- Reinforcement Learning: An Introduction
- ReLoRA: high-rank training through low-rank updates
- Restricted Boltzmann machines for collaborative filtering
- Retrieval augmentation reduces hallucination in conversation
- Retrieval-augmented generation for knowledge-intensive NLP tasks
- Revisiting classifier two-sample tests
- RoBERTa: a robustly optimized BERT pretraining approach
- RoFormer: enhanced transformer with rotary position embedding
- SantaCoder: don't reach for the stars!
- Scaling instruction-finetuned language models
- Scaling PyTorch FSDP for training foundation models on IBM Cloud
- Scaling transformer to 1M tokens and beyond with RMT
- Scattered mixture-of-experts implementation
- Self-instruct: aligning language models with self-generated instructions
- Self-normalizing neural networks
- Semantically equivalent adversarial rules for debugging NLP models
- Seq2seq model and the exposure bias problem
- Sequence parallelism: long sequence training from system perspective
- Sequential latent knowledge selection for knowledge-grounded dialogue
- Simple and effective multi-paragraph reading comprehension
- Simplifying transformer blocks
- SlimPajama-DC: understanding data combinations for LLM training
- SmoothQuant: accurate and efficient post-training quantization for large language models
- Soft filter pruning for accelerating deep convolutional neural networks
- SOLAR 10.7B: scaling large language models with simple yet effective depth up-scaling
- SOLOIST: building task bots at scale with transfer learning and machine teaching
- Solving quantitative reasoning problems with language models
- Spatial temporal graph convolutional networks for skeleton-based action recognition
- Spatio-temporal graph convolutional networks: a deep learning framework for traffic forecasting
- Spectral normalization for generative adversarial networks
- Speech and language processing
- StarCoder: may the source be with you!
- Sticking the landing: simple, lower-variance gradient estimators for variational inference
- StitchNet: composing neural networks from pre-trained fragments
- Stochastic hyperparameter optimization through hypernetworks
- Strategies for teaching layered networks classification tasks
- Structured prompting: scaling in-context learning to 1,000 examples
- Style transfer from non-parallel text by cross-alignment
- Subword regularization: improving neural network translation models with multiple subword candidates
- Supervised learning of probability distributions by neural networks
- Supporting efficient large model training on AMD Instinct™ GPUs with DeepSpeed
- Switch transformers: scaling to trillion parameter models with simple and efficient sparsity
- Synchronization in neural nets
- Synthetic data (almost) from scratch: generalized instruction tuning for language models
- Tackling the poor assumptions of Naive Bayes text classifiers
- Tensor programs V: tuning large neural networks via zero-shot hyperparameter transfer
- TextWorld: a learning environment for text-based games
- The best of both worlds: combining recent advances in neural machine translation
- The elements of statistical learning: data mining, inference and prediction
- The Flan collection: designing data and methods for effective instruction tuning
- The information bottleneck method
- The Pile: an 800GB dataset of diverse text for language modeling
- The power of scale for parameter-efficient prompt tuning
- The wisdom of hindsight makes language models better instruction followers
- Thermometer encoding: one hot way to resist adversarial examples
- To regularize or not to regularize? The bias variance trade-off in regularized AEs
- Towards crowdsourced training of large neural networks using decentralized mixture-of-experts
- Towards deep learning models resistant to adversarial attacks
- Towards evaluating the robustness of neural networks
- Train short, test long: attention with linear biases enables input length extrapolation
- Training compute-optimal large language models
- Training language models to follow instructions with human feedback
- Transformer memory as a differentiable search index
- Transformer quality in linear time
- Transformer-XL: attentive language models beyond a fixed-length context
- Transformers explained visually (part 1): overview of functionality
- Transformers explained visually (part 2): how it works, step-by-step
- Transformers explained visually (part 3): multi-head attention, deep dive
- Turing-NLG: a 17-billion-parameter language model by Microsoft
- UL2: unifying language learning paradigms
- Understanding convolutional neural networks with a mathematical model
- Understanding disentangling in β-VAE
- Understanding the Open Pre-Trained Transformers (OPT) library
- Unit tests for stochastic optimization
- Universal language model fine-tuning for text classification
- Unlimiformer: long-range transformers with unlimited length input
- Unpaired image-to-image translation using cycle-consistent adversarial networks
- Unsupervised machine translation using monolingual corpora only
- Unsupervised representation learning by predicting image rotations
- Using DeepSpeed and Megatron to train Megatron-Turing NLG 530B, the world’s largest and most powerful generative language model
- Variational inference using implicit distributions
- Variational inference with latent space quantization for adversarial resilience
- Variational learning for unsupervised knowledge grounded dialogs
- Variational lossy autoencoder
- Vector-quantized input-contextualized soft prompts for natural language understanding
- VEEGAN: reducing mode collapse in GANs using implicit variational learning
- Very deep convolutional networks for large-scale image recognition
- Visual instruction tuning
- Visualizing data using t-SNE
- Wasserstein GAN
- wav2vec 2.0: a framework for self-supervised learning of speech representations
- Wavenet: a generative model for raw audio
- WebGPT: browser-assisted question-answering with human feedback
- What language model to train if you have one million GPU hours?
- Will GPT-4 run DOOM?
- Word translation without parallel data
- Writing CUDA kernels for PyTorch
- Yandex publishes YaLM 100B. It’s the largest GPT-like neural network in open source
- You only cache once: decoder-decoder architectures for language models
- You only look once: unified, real-time object detection
- ZeRO & DeepSpeed: new system optimizations enable training models with over 100 billion parameters
- ZeRO++: extremely efficient collective communication for giant model training
- ZeRO-2 & DeepSpeed: shattering barriers of deep learning speed & scale
- ZeRO-Infinity: breaking the GPU memory wall for extreme scale deep learning
- Zero-shot text-to-image generation
- ZeRO: memory optimizations toward training trillion parameter models
- ZeroQuant: efficient and affordable post-training quantization for large-scale transformers
- β-VAE: learning basic visual concepts with a constrained variational framework
- 🍷 FineWeb: decanting the web for the finest text data at scale
Calculus
- Calculus of variations
- Thomas' calculus
Computer Architecture
- Accelerated computing with a reconfigurable dataflow architecture
- Computer architecture: a quantitative approach
- Computer organization and design ARM edition: the hardware software interface
- Flipping bits in memory without accessing them: an experimental study of DRAM disturbance errors
- Improving DRAM performance by parallelizing refreshes with accesses
- Memory performance attacks: denial of memory service in multi-core systems
- Memory scaling: a systems architecture perspective
- Millicode in an IBM zSeries processor
- MTIA v1: Meta's first-generation AI inference accelerator
- RAIDR: Retention-Aware Intelligent DRAM Refresh
- Stall-time fair memory access scheduling for chip multiprocessors
Computer Graphics
Data Structures and Algorithms
- Data structures and algorithms in Java
- Introduction to algorithms
Digital Electronics
- Digital design: with an introduction to the Verilog HDL
Graph Theory
- Introduction to graph theory
Information Theory
- Elements of information theory
- Error detecting and error correcting codes
Linear Algebra
- Linear algebra and its applications
- Matrix analysis and applied linear algebra
- The matrix cookbook
Measure Theory
- Measure theory
Optimization Theory
- Convex Optimization
- Distributed optimization and statistical learning via the alternating direction method of multipliers
Probability and Stochastic Processes
- Introduction to probability and stochastic processes with applications
Quantum Computing
- A fast quantum mechanical algorithm for database search
- A single quantum cannot be cloned
- Can quantum-mechanical description of physical reality be considered complete?
- Image recognition with an adiabatic quantum computer I. mapping to quadratic unconstrained binary optimization
- Integer optimization toolbox (minimizing polynomials over integer lattices using quantum annealing)
- Limits on parallel speedup for classical Ising model solvers
- Partitioning optimization problems for hybrid classical/quantum execution
- Polynomial-time algorithms for prime factorization and discrete logarithms on a quantum computer
- Probabilistic cloning and identification of linearly independent quantum states
- Programming with D-Wave: map coloring problem
- Quantum computation and quantum information
- Quantum computing: a gentle introduction
- Quantum performance evaluation: a short reading list
- Quantum theory, the Church-Turing principle and the universal quantum computer
- Rapid solution of problems by quantum computation
- Teleporting an unknown quantum state via dual classical and Einstein-Podolsky-Rosen channels
Signal Processing
- Discrete-time signal processing
- Foundations of Signal Processing
- Signals and systems
- Understanding digital signal processing