HPC-Paper-Notes
My notes on various HPC papers.
List of Papers
- Energy Efficient Architecture for Graph Analytics Accelerators
- A Template Based Design Methodology for Graph Parallel Hardware Accelerators
- System Simulation with gem5 and SystemC
- GAIL: The Graph Algorithm Iron Law
- Locality Exists in Graph Processing: Workload Characterization on an Ivy Bridge Server
- Graphicionado: A High-Performance and Energy-Efficient Accelerator for Graph Analytics
- Analysis and Optimization of the Memory Hierarchy for Graph Processing Workloads
- Alleviating Irregularity in Graph Analytics Acceleration: a Hardware/Software Design Approach
- GNN Performance Optimization
- Dissecting the Graphcore IPU Architecture
- Using the Graphcore IPU for Traditional HPC Applications
- Roofline: An Insightful Visual Performance Model
- CUDA New Features and Beyond
- A Study of Persistent Threads Style GPU Programming for GPGPU Workloads
- BrainTorrent: A Peer to Peer Environment for Decentralized Federated Learning
- Whippletree: Task-based Scheduling of Dynamic Workloads on the GPU
- Groute: Asynchronous Multi-GPU Programming Model with Applications to Large-scale Graph Processing
- A Computational-Graph Partitioning Method for Training Memory-Constrained DNNs
- The Broker Queue: A Fast, Linearizable FIFO Queue for Fine-Granular Work Distribution on the GPU
- Softshell: Dynamic Scheduling on GPUs
- Gravel: Fine-Grain GPU-Initiated Network Messages
- SPIN: Seamless Operating System Integration of Peer to Peer DMA Between SSDs and GPUs
- Automatic Graph Partitioning for Very Large-scale Deep Learning
- Stateful Dataflow Multigraphs: A data-centric model for performance portability on heterogeneous architectures
- Productivity, Portability, Performance: Data-Centric Python
- Interferences between Communications and Computations in Distributed HPC Systems
- MVAPICH2-GPU: optimized GPU to GPU communication for InfiniBand clusters
- GGAS: Global GPU Address Spaces for Efficient Communication in Heterogeneous Clusters
- GPUnet: Networking Abstractions for GPU Programs
- GPUrdma: GPU-side library for high performance networking from GPU kernels
- Trends in Data Locality Abstractions for HPC Systems
- Chimera: Efficiently Training Large-Scale Neural Networks with Bidirectional Pipelines
- Benchmarking GPUs to Tune Dense Linear Algebra
- Brook for GPUs: stream computing on graphics hardware
- IPUG: Accelerating Breadth-First Graph Traversals using Manycore Graphcore IPUs
- Supporting RISC-V Performance Counters through Performance Analysis Tools for Linux
- Merrimac: Supercomputing with Streams
- Peta-scale Phase-Field Simulation for Dendritic Solidification on the TSUBAME 2.0 Supercomputer
- A RISC-V Simulator and Benchmark Suite for Designing and Evaluating Vector Architectures
- PyTorch Distributed: Experiences on Accelerating Data Parallel Training
- An Oracle for Guiding Large-Scale Model/Hybrid Parallel Training of Convolutional Neural Networks
- ZeRO: Memory Optimizations Toward Training Trillion Parameter Models
- Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM
- XeFlow: Streamlining Inter-Processor Pipeline Execution for the Discrete CPU-GPU Platform
- Architecture and Performance of Devito, a System for Automated Stencil Computation
- Distributed Training of Deep Learning Models: A Taxonomic Perspective
- Performance Trade-offs in GPU Communication: A Study of Host and Device-initiated Approaches
- Assessment of NVSHMEM for High Performance Computing
- Sparse GPU Kernels for Deep Learning
- The State of Sparsity in Deep Neural Networks
- Pruning neural networks without any data by iteratively conserving synaptic flow
- SNIP: Single-shot Network Pruning based on Connection Sensitivity
- Comparing Rewinding and Fine-tuning in Neural Network Pruning
- The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks
- Torch.fx: Practical Program Capture and Transformation for Deep Learning in Python
- An asynchronous message driven parallel framework for extreme scale deep learning
- Bolt: Bridging The Gap Between Auto Tuners And Hardware Native Performance
- Efficient Tensor Core-Based GPU Kernels for Structured Sparsity under Reduced Precision
- Attention is All You Need
- Scaling Laws for Neural Language Models
- Language Models are Few-Shot Learners
- BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
- RoBERTa: A Robustly Optimized BERT Pretraining Approach
- Longformer: The Long-Document Transformer
- Linformer: Self-Attention with Linear Complexity
- The Efficiency Misnomer
- A Survey of Transformers
- PipeTransformer: Automated Elastic Pipelining for Distributed Training of Large-scale Models
- Training Compute-Optimal Large Language Models
- WholeGraph: A Fast Graph Neural Network Training Framework with Multi-GPU Distributed Shared Memory Architecture
- SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot
Two Papers A Week Goal (Starting from 28.06.2021)
28.06.2021 - 04.07.2021
- Whippletree: Task-based Scheduling of Dynamic Workloads on the GPU
- Groute: Asynchronous Multi-GPU Programming Model with Applications to Large-scale Graph Processing
05.07.2021 - 11.07.2021
- A Computational-Graph Partitioning Method for Training Memory-Constrained DNNs
- The Broker Queue: A Fast, Linearizable FIFO Queue for Fine-Granular Work Distribution on the GPU
12.07.2021 - 18.07.2021
- Softshell: Dynamic Scheduling on GPUs
- Gravel: Fine-Grain GPU-Initiated Network Messages
09.08.2021 - 15.08.2021
- SPIN: Seamless Operating System Integration of Peer to Peer DMA Between SSDs and GPUs
- GPU-to-CPU Callbacks
16.08.2021 - 22.08.2021
- PyTorch: An Imperative Style, High-Performance Deep Learning Library -> Zero technical depth. Please give me my time back.
- Automatic Graph Partitioning for Very Large-scale Deep Learning
23.08.2021 - 29.08.2021
- Stateful Dataflow Multigraphs: A data-centric model for performance portability on heterogeneous architectures
- Productivity, Portability, Performance: Data-Centric Python
30.08.2021 - 05.09.2021
- Analyzing Put/Get APIs for Thread-collaborative Processors
- Analyzing Communication Models for Distributed Thread-Collaborative Processors in Terms of Energy and Time
06.09.2021 - 12.09.2021
- Interferences between Communications and Computations in Distributed HPC Systems
- Memory Bandwidth Contention: Communication vs Computation Tradeoffs in Supercomputers with Multicore Architectures
13.09.2021 - 19.09.2021
- MVAPICH2-GPU: optimized GPU to GPU communication for InfiniBand clusters
- GGAS: Global GPU Address Spaces for Efficient Communication in Heterogeneous Clusters
20.09.2021 - 26.09.2021
- GPUnet: Networking Abstractions for GPU Programs
- GPUrdma: GPU-side library for high performance networking from GPU kernels
27.09.2021 - 03.10.2021
- Trends in Data Locality Abstractions for HPC Systems
- Moore's Law is ending
04.10.2021 - 10.10.2021
- Chimera: Efficiently Training Large-Scale Neural Networks with Bidirectional Pipelines
- Deep Residual Learning for Image Recognition
11.10.2021 - 17.10.2021
- Benchmarking GPUs to Tune Dense Linear Algebra
- Brook for GPUs: stream computing on graphics hardware
18.10.2021 - 24.10.2021
- IPUG: Accelerating Breadth-First Graph Traversals using Manycore Graphcore IPUs
- Supporting RISC-V Performance Counters through Performance Analysis Tools for Linux
25.10.2021 - 31.10.2021
- Merrimac: Supercomputing with Streams
- Peta-scale Phase-Field Simulation for Dendritic Solidification on the TSUBAME 2.0 Supercomputer
01.11.2021 - 07.11.2021
- Toward a Scalable and Distributed Infrastructure for Deep Learning Applications
- A Data-centric Optimization Framework for Machine Learning
08.11.2021 - 14.11.2021
- A RISC-V Simulator and Benchmark Suite for Designing and Evaluating Vector Architectures
- PyTorch Distributed: Experiences on Accelerating Data Parallel Training
15.11.2021 - 21.11.2021
- An Oracle for Guiding Large-Scale Model/Hybrid Parallel Training of Convolutional Neural Networks
- ZeRO: Memory Optimizations Toward Training Trillion Parameter Models
22.11.2021 - 28.11.2021
- ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning
- Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM
29.11.2021 - 05.12.2021
- Unpublished paper (will update if it's accepted)
- XeFlow: Streamlining Inter-Processor Pipeline Execution for the Discrete CPU-GPU Platform
06.12.2021 - 12.12.2021
- Architecture and Performance of Devito, a System for Automated Stencil Computation
- Distributed Training of Deep Learning Models: A Taxonomic Perspective
13.12.2021 - 19.12.2021
- Performance Trade-offs in GPU Communication: A Study of Host and Device-initiated Approaches
- Assessment of NVSHMEM for High Performance Computing
20.12.2021 - 26.12.2021
- A Data-Centric Optimization Framework for Machine Learning
- Scaling Distributed Deep Learning Workloads beyond the Memory Capacity with KARMA
Essential Reading List in Parallel Computing (including suggestions from my advisor, Didem Unat)
Trends
- ✅ Moore's Law is ending
- A new golden age for computer architecture
- Abstract machine models and proxy architectures for exascale computing
- ✅ Trends in Data Locality Abstractions for HPC Systems
Architectures
- ✅ Merrimac: Supercomputing with Streams
- Synergistic Processing in Cell's Multicore Architecture
- Knights Landing: Second Generation Intel Xeon Phi Product
Performance Models and Tools
- ✅ Roofline: an insightful visual performance model for multicore architectures
- ExaSAT: An exascale co-design tool for performance modeling
- hwloc: A generic framework for managing hardware affinities in HPC applications
Applications
- Optimization of sparse matrix-vector multiplication on emerging multicore platforms
- Stencil computation optimization and auto-tuning on state-of-the-art multicore architectures
- ✅ Benchmarking GPUs to tune dense linear algebra
Programming Models
- ✅ Brook for GPUs: Stream Computing on Graphics Hardware
- OmpSs: A Proposal for Programming Heterogeneous Multi-core Architectures
- Productivity and performance using partitioned global address space languages
- Kokkos: Enabling manycore performance portability through polymorphic memory access patterns
Compilers
- Halide: A Language and Compiler for Optimizing Parallelism, Locality, and Recomputation in Image Processing Pipelines
- CHiLL: A Framework for High-Level Loop Transformations
- Pluto: A Practical and Automatic Polyhedral Program Optimization System
Runtime Systems
- Cilk: An Efficient Multithreaded Runtime System
- StarPU: a unified platform for task scheduling on heterogeneous multicore architectures
- Legion: expressing locality and independence with logical regions
- Charm++: A portable concurrent object oriented system based on C++
My Scalable Deep Learning List (papers I have read recently, not recommendations)
- PyTorch Distributed: Experiences on Accelerating Data Parallel Training
- An Oracle for Guiding Large-Scale Model/Hybrid Parallel Training of Convolutional Neural Networks
- ZeRO: Memory Optimizations Toward Training Trillion Parameter Models
- Distributed Training of Deep Learning Models: A Taxonomic Perspective
- Sparse GPU Kernels for Deep Learning
- The State of Sparsity in Deep Neural Networks
- Pruning neural networks without any data by iteratively conserving synaptic flow
- SNIP: Single-shot Network Pruning based on Connection Sensitivity
- Comparing Rewinding and Fine-tuning in Neural Network Pruning
- The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks
- Torch.fx: Practical Program Capture and Transformation for Deep Learning in Python
- An asynchronous message driven parallel framework for extreme scale deep learning
- Bolt: Bridging The Gap Between Auto Tuners And Hardware Native Performance
- Efficient Tensor Core-Based GPU Kernels for Structured Sparsity under Reduced Precision
- Attention is All You Need
- Scaling Laws for Neural Language Models
- Language Models are Few-Shot Learners
- BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
- RoBERTa: A Robustly Optimized BERT Pretraining Approach
- Longformer: The Long-Document Transformer
- Linformer: Self-Attention with Linear Complexity
- The Efficiency Misnomer
- A Survey of Transformers
- PipeTransformer: Automated Elastic Pipelining for Distributed Training of Large-scale Models
- Training Compute-Optimal Large Language Models
- SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot