HPC-Paper-Notes

My notes on various HPC papers.

List of Papers

  1. Energy Efficient Architecture for Graph Analytics Accelerators
  2. A Template Based Design Methodology for Graph Parallel Hardware Accelerators
  3. System Simulation with gem5 and SystemC
  4. GAIL: The Graph Algorithm Iron Law
  5. Locality Exists in Graph Processing: Workload Characterization on an Ivy Bridge Server
  6. Graphicionado: A High-Performance and Energy-Efficient Accelerator for Graph Analytics
  7. Analysis and Optimization of the Memory Hierarchy for Graph Processing Workloads
  8. Alleviating Irregularity in Graph Analytics Acceleration: A Hardware/Software Co-Design Approach
  9. GNN Performance Optimization
  10. Dissecting the Graphcore IPU Architecture
  11. Using the Graphcore IPU for Traditional HPC Applications
  12. Roofline: An Insightful Visual Performance Model
  13. CUDA: New Features and Beyond
  14. A Study of Persistent Threads Style GPU Programming for GPGPU Workloads
  15. BrainTorrent: A Peer-to-Peer Environment for Decentralized Federated Learning
  16. Whippletree: Task-based Scheduling of Dynamic Workloads on the GPU
  17. Groute: Asynchronous Multi-GPU Programming Model with Applications to Large-scale Graph Processing
  18. A Computational-Graph Partitioning Method for Training Memory-Constrained DNNs
  19. The Broker Queue: A Fast, Linearizable FIFO Queue for Fine-Granular Work Distribution on the GPU
  20. Softshell: Dynamic Scheduling on GPUs
  21. Gravel: Fine-Grain GPU-Initiated Network Messages
  22. SPIN: Seamless Operating System Integration of Peer-to-Peer DMA Between SSDs and GPUs
  23. Automatic Graph Partitioning for Very Large-scale Deep Learning
  24. Stateful Dataflow Multigraphs: A data-centric model for performance portability on heterogeneous architectures
  25. Productivity, Portability, Performance: Data-Centric Python
  26. Interferences between Communications and Computations in Distributed HPC Systems
  27. MVAPICH2-GPU: optimized GPU to GPU communication for InfiniBand clusters
  28. GGAS: Global GPU Address Spaces for Efficient Communication in Heterogeneous Clusters
  29. GPUnet: Networking Abstractions for GPU Programs
  30. GPUrdma: GPU-side library for high performance networking from GPU kernels
  31. Trends in Data Locality Abstractions for HPC Systems
  32. Chimera: Efficiently Training Large-Scale Neural Networks with Bidirectional Pipelines
  33. Benchmarking GPUs to Tune Dense Linear Algebra
  34. Brook for GPUs: stream computing on graphics hardware
  35. IPUG: Accelerating Breadth-First Graph Traversals using Manycore Graphcore IPUs
  36. Supporting RISC-V Performance Counters through Performance Analysis Tools for Linux
  37. Merrimac: Supercomputing with Streams
  38. Peta-scale Phase-Field Simulation for Dendritic Solidification on the TSUBAME 2.0 Supercomputer
  39. A RISC-V Simulator and Benchmark Suite for Designing and Evaluating Vector Architectures
  40. PyTorch Distributed: Experiences on Accelerating Data Parallel Training
  41. An Oracle for Guiding Large-Scale Model/Hybrid Parallel Training of Convolutional Neural Networks
  42. ZeRO: Memory Optimizations Toward Training Trillion Parameter Models
  43. Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM
  44. XeFlow: Streamlining Inter-Processor Pipeline Execution for the Discrete CPU-GPU Platform
  45. Architecture and Performance of Devito, a System for Automated Stencil Computation
  46. Distributed Training of Deep Learning Models: A Taxonomic Perspective
  47. Performance Trade-offs in GPU Communication: A Study of Host and Device-initiated Approaches
  48. Assessment of NVSHMEM for High Performance Computing
  49. Sparse GPU Kernels for Deep Learning
  50. The State of Sparsity in Deep Neural Networks
  51. Pruning neural networks without any data by iteratively conserving synaptic flow
  52. SNIP: Single-shot Network Pruning based on Connection Sensitivity
  53. Comparing Rewinding and Fine-tuning in Neural Network Pruning
  54. The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks
  55. Torch.fx: Practical Program Capture and Transformation for Deep Learning in Python
  56. An asynchronous message-driven parallel framework for extreme-scale deep learning
  57. Bolt: Bridging The Gap Between Auto Tuners And Hardware Native Performance
  58. Efficient Tensor Core-Based GPU Kernels for Structured Sparsity under Reduced Precision
  59. Attention is All You Need
  60. Scaling Laws for Neural Language Models
  61. Language Models are Few-Shot Learners
  62. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
  63. RoBERTa: A Robustly Optimized BERT Pretraining Approach
  64. Longformer: The Long-Document Transformer
  65. Linformer: Self-Attention with Linear Complexity
  66. The Efficiency Misnomer
  67. A Survey of Transformers
  68. PipeTransformer: Automated Elastic Pipelining for Distributed Training of Large-scale Models
  69. Training Compute-Optimal Large Language Models
  70. WholeGraph: A Fast Graph Neural Network Training Framework with Multi-GPU Distributed Shared Memory Architecture
  71. SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot

Two Papers A Week Goal (Starting from 28.06.2021)

28.06.2021 - 04.07.2021

  • Whippletree: Task-based Scheduling of Dynamic Workloads on the GPU
  • Groute: Asynchronous Multi-GPU Programming Model with Applications to Large-scale Graph Processing

05.07.2021 - 11.07.2021

  • A Computational-Graph Partitioning Method for Training Memory-Constrained DNNs
  • The Broker Queue: A Fast, Linearizable FIFO Queue for Fine-Granular Work Distribution on the GPU

12.07.2021 - 18.07.2021

  • Softshell: Dynamic Scheduling on GPUs
  • Gravel: Fine-Grain GPU-Initiated Network Messages

09.08.2021 - 15.08.2021

  • SPIN: Seamless Operating System Integration of Peer-to-Peer DMA Between SSDs and GPUs
  • GPU-to-CPU Callbacks

16.08.2021 - 22.08.2021

23.08.2021 - 29.08.2021

  • Stateful Dataflow Multigraphs: A data-centric model for performance portability on heterogeneous architectures
  • Productivity, Portability, Performance: Data-Centric Python

30.08.2021 - 05.09.2021

06.09.2021 - 12.09.2021

  • Interferences between Communications and Computations in Distributed HPC Systems
  • Memory Bandwidth Contention: Communication vs Computation Tradeoffs in Supercomputers with Multicore Architectures

13.09.2021 - 19.09.2021

  • MVAPICH2-GPU: optimized GPU to GPU communication for InfiniBand clusters
  • GGAS: Global GPU Address Spaces for Efficient Communication in Heterogeneous Clusters

20.09.2021 - 26.09.2021

  • GPUnet: Networking Abstractions for GPU Programs
  • GPUrdma: GPU-side library for high performance networking from GPU kernels

27.09.2021 - 03.10.2021

04.10.2021 - 10.10.2021

  • Chimera: Efficiently Training Large-Scale Neural Networks with Bidirectional Pipelines
  • Deep Residual Learning for Image Recognition

11.10.2021 - 17.10.2021

  • Benchmarking GPUs to Tune Dense Linear Algebra
  • Brook for GPUs: stream computing on graphics hardware

18.10.2021 - 24.10.2021

  • IPUG: Accelerating Breadth-First Graph Traversals using Manycore Graphcore IPUs
  • Supporting RISC-V Performance Counters through Performance Analysis Tools for Linux

25.10.2021 - 31.10.2021

  • Merrimac: Supercomputing with Streams
  • Peta-scale Phase-Field Simulation for Dendritic Solidification on the TSUBAME 2.0 Supercomputer

01.11.2021 - 07.11.2021

08.11.2021 - 14.11.2021

  • A RISC-V Simulator and Benchmark Suite for Designing and Evaluating Vector Architectures
  • PyTorch Distributed: Experiences on Accelerating Data Parallel Training

15.11.2021 - 21.11.2021

  • An Oracle for Guiding Large-Scale Model/Hybrid Parallel Training of Convolutional Neural Networks
  • ZeRO: Memory Optimizations Toward Training Trillion Parameter Models

22.11.2021 - 28.11.2021

29.11.2021 - 05.12.2021

  • Unpublished paper (will update if it's accepted)
  • XeFlow: Streamlining Inter-Processor Pipeline Execution for the Discrete CPU-GPU Platform

06.12.2021 - 12.12.2021

  • Architecture and Performance of Devito, a System for Automated Stencil Computation
  • Distributed Training of Deep Learning Models: A Taxonomic Perspective

13.12.2021 - 19.12.2021

  • Performance Trade-offs in GPU Communication: A Study of Host and Device-initiated Approaches
  • Assessment of NVSHMEM for High Performance Computing

20.12.2021 - 26.12.2021

Essential Reading List in Parallel Computing (including suggestions from my advisor, Didem Unat)

Trends

Architectures

Performance Models and Tools

Applications

Programming Models

Compilers

Runtime Systems

My Scalable Deep Learning List (just a list of papers I have read recently, not recommendations)

  • PyTorch Distributed: Experiences on Accelerating Data Parallel Training
  • An Oracle for Guiding Large-Scale Model/Hybrid Parallel Training of Convolutional Neural Networks
  • ZeRO: Memory Optimizations Toward Training Trillion Parameter Models
  • Distributed Training of Deep Learning Models: A Taxonomic Perspective
  • Sparse GPU Kernels for Deep Learning
  • The State of Sparsity in Deep Neural Networks
  • Pruning neural networks without any data by iteratively conserving synaptic flow
  • SNIP: Single-shot Network Pruning based on Connection Sensitivity
  • Comparing Rewinding and Fine-tuning in Neural Network Pruning
  • The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks
  • Torch.fx: Practical Program Capture and Transformation for Deep Learning in Python
  • An asynchronous message-driven parallel framework for extreme-scale deep learning
  • Bolt: Bridging The Gap Between Auto Tuners And Hardware Native Performance
  • Efficient Tensor Core-Based GPU Kernels for Structured Sparsity under Reduced Precision
  • Attention is All You Need
  • Scaling Laws for Neural Language Models
  • Language Models are Few-Shot Learners
  • BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
  • RoBERTa: A Robustly Optimized BERT Pretraining Approach
  • Longformer: The Long-Document Transformer
  • Linformer: Self-Attention with Linear Complexity
  • The Efficiency Misnomer
  • A Survey of Transformers
  • PipeTransformer: Automated Elastic Pipelining for Distributed Training of Large-scale Models
  • Training Compute-Optimal Large Language Models
  • SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot