HPC-Paper-Notes
My notes on various HPC papers.
List of Papers
- Energy Efficient Architecture for Graph Analytics Accelerators
- A Template Based Design Methodology for Graph Parallel Hardware Accelerators
- System Simulation with gem5 and SystemC
- GAIL: The Graph Algorithm Iron Law
- Locality Exists in Graph Processing: Workload Characterization on an Ivy Bridge Server
- Graphicionado: A High-Performance and Energy-Efficient Accelerator for Graph Analytics
- Analysis and Optimization of the Memory Hierarchy for Graph Processing Workloads
- Alleviating Irregularity in Graph Analytics Acceleration: a Hardware/Software Design Approach
- GNN Performance Optimization
- Dissecting the Graphcore IPU Architecture
- Using the Graphcore IPU for Traditional HPC Applications
- Roofline: An Insightful Visual Performance Model
- CUDA New Features and Beyond
- A Study of Persistent Threads Style GPU Programming for GPGPU Workloads
- BrainTorrent: A Peer to Peer Environment for Decentralized Federated Learning
- Whippletree: Task-based Scheduling of Dynamic Workloads on the GPU
- Groute: Asynchronous Multi-GPU Programming Model with Applications to Large-scale Graph Processing
- A Computational-Graph Partitioning Method for Training Memory-Constrained DNNs
- The Broker Queue: A Fast, Linearizable FIFO Queue for Fine-Granular Work Distribution on the GPU
- Softshell: Dynamic Scheduling on GPUs
- Gravel: Fine-Grain GPU-Initiated Network Messages
- SPIN: Seamless Operating System Integration of Peer to Peer DMA Between SSDs and GPUs
- Automatic Graph Partitioning for Very Large-scale Deep Learning
- Stateful Dataflow Multigraphs: A data-centric model for performance portability on heterogeneous architectures
- Productivity, Portability, Performance: Data-Centric Python
- Interferences between Communications and Computations in Distributed HPC Systems
- MVAPICH2-GPU: optimized GPU to GPU communication for InfiniBand clusters
- GGAS: Global GPU Address Spaces for Efficient Communication in Heterogeneous Clusters
- GPUnet: Networking Abstractions for GPU Programs
- GPUrdma: GPU-side library for high performance networking from GPU kernels
- Trends in Data Locality Abstractions for HPC Systems
- Chimera: Efficiently Training Large-Scale Neural Networks with Bidirectional Pipelines
- Benchmarking GPUs to Tune Dense Linear Algebra
- Brook for GPUs: stream computing on graphics hardware
- IPUG: Accelerating Breadth-First Graph Traversals using Manycore Graphcore IPUs
- Supporting RISC-V Performance Counters through Performance Analysis Tools for Linux
- Merrimac: Supercomputing with Streams
- Peta-scale Phase-Field Simulation for Dendritic Solidification on the TSUBAME 2.0 Supercomputer
- A RISC-V Simulator and Benchmark Suite for Designing and Evaluating Vector Architectures
- PyTorch Distributed: Experiences on Accelerating Data Parallel Training
- An Oracle for Guiding Large-Scale Model/Hybrid Parallel Training of Convolutional Neural Networks
- ZeRO: Memory Optimizations Toward Training Trillion Parameter Models
- Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM
- XeFlow: Streamlining Inter-Processor Pipeline Execution for the Discrete CPU-GPU Platform
- Architecture and Performance of Devito, a System for Automated Stencil Computation
- Distributed Training of Deep Learning Models: A Taxonomic Perspective
- Performance Trade-offs in GPU Communication: A Study of Host and Device-initiated Approaches
- Assessment of NVSHMEM for High Performance Computing
- Sparse GPU Kernels for Deep Learning
- The State of Sparsity in Deep Neural Networks
- Pruning neural networks without any data by iteratively conserving synaptic flow
- SNIP: Single-shot Network Pruning based on Connection Sensitivity
- Comparing Rewinding and Fine-tuning in Neural Network Pruning
- The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks
- Torch.fx: Practical Program Capture and Transformation for Deep Learning in Python
- An asynchronous message driven parallel framework for extreme scale deep learning
- Bolt: Bridging The Gap Between Auto Tuners And Hardware Native Performance
- Efficient Tensor Core-Based GPU Kernels for Structured Sparsity under Reduced Precision
- Attention is All You Need
- Scaling Laws for Neural Language Models
- Language Models are Few-Shot Learners
- BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
- RoBERTa: A Robustly Optimized BERT Pretraining Approach
- Longformer: The Long-Document Transformer
- Linformer: Self-Attention with Linear Complexity
- The Efficiency Misnomer
- A Survey of Transformers
- PipeTransformer: Automated Elastic Pipelining for Distributed Training of Large-scale Models
- Training Compute-Optimal Large Language Models
- WholeGraph: A Fast Graph Neural Network Training Framework with Multi-GPU Distributed Shared Memory Architecture
- SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot
Two Papers A Week Goal (Starting from 28.06.2021)
28.06.2021 - 04.07.2021
- Whippletree: Task-based Scheduling of Dynamic Workloads on the GPU
- Groute: Asynchronous Multi-GPU Programming Model with Applications to Large-scale Graph Processing
05.07.2021 - 11.07.2021
- A Computational-Graph Partitioning Method for Training Memory-Constrained DNNs
- The Broker Queue: A Fast, Linearizable FIFO Queue for Fine-Granular Work Distribution on the GPU
12.07.2021 - 18.07.2021
- Softshell: Dynamic Scheduling on GPUs
- Gravel: Fine-Grain GPU-Initiated Network Messages
09.08.2021 - 15.08.2021
- SPIN: Seamless Operating System Integration of Peer to Peer DMA Between SSDs and GPUs
- GPU-to-CPU Callbacks
16.08.2021 - 22.08.2021
- PyTorch: An Imperative Style, High-Performance Deep Learning Library -> Zero technical depth. Please give me my time back.
- Automatic Graph Partitioning for Very Large-scale Deep Learning
23.08.2021 - 29.08.2021
- Stateful Dataflow Multigraphs: A data-centric model for performance portability on heterogeneous architectures
- Productivity, Portability, Performance: Data-Centric Python
30.08.2021 - 05.09.2021
- Analyzing Put/Get APIs for Thread-collaborative Processors
- Analyzing Communication Models for Distributed Thread-Collaborative Processors in Terms of Energy and Time
06.09.2021 - 12.09.2021
- Interferences between Communications and Computations in Distributed HPC Systems
- Memory Bandwidth Contention: Communication vs Computation Tradeoffs in Supercomputers with Multicore Architectures
13.09.2021 - 19.09.2021
- MVAPICH2-GPU: optimized GPU to GPU communication for InfiniBand clusters
- GGAS: Global GPU Address Spaces for Efficient Communication in Heterogeneous Clusters
20.09.2021 - 26.09.2021
- GPUnet: Networking Abstractions for GPU Programs
- GPUrdma: GPU-side library for high performance networking from GPU kernels
27.09.2021 - 03.10.2021
- Trends in Data Locality Abstractions for HPC Systems
- Moore's Law is ending
04.10.2021 - 10.10.2021
- Chimera: Efficiently Training Large-Scale Neural Networks with Bidirectional Pipelines
- Deep Residual Learning for Image Recognition
11.10.2021 - 17.10.2021
- Benchmarking GPUs to Tune Dense Linear Algebra
- Brook for GPUs: stream computing on graphics hardware
18.10.2021 - 24.10.2021
- IPUG: Accelerating Breadth-First Graph Traversals using Manycore Graphcore IPUs
- Supporting RISC-V Performance Counters through Performance Analysis Tools for Linux
25.10.2021 - 31.10.2021
- Merrimac: Supercomputing with Streams
- Peta-scale Phase-Field Simulation for Dendritic Solidification on the TSUBAME 2.0 Supercomputer
01.11.2021 - 07.11.2021
- Toward a Scalable and Distributed Infrastructure for Deep Learning Applications
- A Data-centric Optimization Framework for Machine Learning
08.11.2021 - 14.11.2021
- A RISC-V Simulator and Benchmark Suite for Designing and Evaluating Vector Architectures
- PyTorch Distributed: Experiences on Accelerating Data Parallel Training
15.11.2021 - 21.11.2021
- An Oracle for Guiding Large-Scale Model/Hybrid Parallel Training of Convolutional Neural Networks
- ZeRO: Memory Optimizations Toward Training Trillion Parameter Models
22.11.2021 - 28.11.2021
- ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning
- Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM
29.11.2021 - 05.12.2021
- Unpublished paper (will update if it's accepted)
- XeFlow: Streamlining Inter-Processor Pipeline Execution for the Discrete CPU-GPU Platform
06.12.2021 - 12.12.2021
- Architecture and Performance of Devito, a System for Automated Stencil Computation
- Distributed Training of Deep Learning Models: A Taxonomic Perspective
13.12.2021 - 19.12.2021
- Performance Trade-offs in GPU Communication: A Study of Host and Device-initiated Approaches
- Assessment of NVSHMEM for High Performance Computing
20.12.2021 - 26.12.2021
- A Data-Centric Optimization Framework for Machine Learning
- Scaling Distributed Deep Learning Workloads beyond the Memory Capacity with KARMA
Essential Reading List in Parallel Computing (including suggestions from my advisor, Didem Unat)
Trends
- ✅ Moore's Law is ending
- A new golden age for computer architecture
- Abstract machine models and proxy architectures for exascale computing
- ✅ Trends in Data Locality Abstractions for HPC Systems
Architectures
- ✅ Merrimac: Supercomputing with Streams
- Synergistic Processing in Cell's Multicore Architecture
- Knights Landing: Second Generation Intel Xeon Phi Product
Performance Models and Tools
- ✅ Roofline: an insightful visual performance model for multicore architectures
- ExaSAT: An exascale co-design tool for performance modeling
- hwloc: A generic framework for managing hardware affinities in HPC applications
Applications
- Optimization of sparse matrix-vector multiplication on emerging multicore platforms
- Stencil computation optimization and auto-tuning on state-of-the-art multicore architectures
- ✅ Benchmarking GPUs to tune dense linear algebra
Programming Models
- ✅ Brook for GPUs: Stream Computing on Graphics Hardware
- OmpSs: A Proposal for Programming Heterogeneous Multi-core Architectures
- Productivity and performance using partitioned global address space languages
- Kokkos: Enabling manycore performance portability through polymorphic memory access patterns
Compilers
- Halide: A Language and Compiler for Optimizing Parallelism, Locality, and Recomputation in Image Processing Pipelines
- CHiLL: A Framework for High-Level Loop Transformations
- Pluto: A Practical and Automatic Polyhedral Program Optimization System
Runtime Systems
- Cilk: An Efficient Multithreaded Runtime System
- StarPU: a unified platform for task scheduling on heterogeneous multicore architectures
- Legion: expressing locality and independence with logical regions
- Charm++: A portable concurrent object oriented system based on C++
My Scalable Deep Learning List (papers I have read recently, not recommendations)
- PyTorch Distributed: Experiences on Accelerating Data Parallel Training
- An Oracle for Guiding Large-Scale Model/Hybrid Parallel Training of Convolutional Neural Networks
- ZeRO: Memory Optimizations Toward Training Trillion Parameter Models
- Distributed Training of Deep Learning Models: A Taxonomic Perspective
- Sparse GPU Kernels for Deep Learning
- The State of Sparsity in Deep Neural Networks
- Pruning neural networks without any data by iteratively conserving synaptic flow
- SNIP: Single-shot Network Pruning based on Connection Sensitivity
- Comparing Rewinding and Fine-tuning in Neural Network Pruning
- The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks
- Torch.fx: Practical Program Capture and Transformation for Deep Learning in Python
- An asynchronous message driven parallel framework for extreme scale deep learning
- Bolt: Bridging The Gap Between Auto Tuners And Hardware Native Performance
- Efficient Tensor Core-Based GPU Kernels for Structured Sparsity under Reduced Precision
- Attention is All You Need
- Scaling Laws for Neural Language Models
- Language Models are Few-Shot Learners
- BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
- RoBERTa: A Robustly Optimized BERT Pretraining Approach
- Longformer: The Long-Document Transformer
- Linformer: Self-Attention with Linear Complexity
- The Efficiency Misnomer
- A Survey of Transformers
- PipeTransformer: Automated Elastic Pipelining for Distributed Training of Large-scale Models
- Training Compute-Optimal Large Language Models
- SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot