Distributed Machine Learning with Python: Accelerating Model Training and Serving with Distributed Systems

Chapter 2: Parameter Server and All-Reduce -- Technical requirements -- Parameter server architecture -- Communication bottleneck in the parameter server architecture -- Sharding the model among parameter servers -- Implementing the parameter server -- Defining model layers -- Defining the parameter...


Bibliographic Details
Main Author: Wang, Guanhua
Format: Electronic Book
Language: English
Published: Birmingham: Packt Publishing Limited, 2022
Table of Contents:
  • Intro
  • Title page
  • Copyright and Credits
  • Dedication
  • Contributors
  • Table of Contents
  • Preface
  • Section 1 – Data Parallelism
  • Chapter 1: Splitting Input Data
  • Single-node training is too slow
  • The mismatch between data loading bandwidth and model training bandwidth
  • Single-node training time on popular datasets
  • Accelerating the training process with data parallelism
  • Data parallelism – the high-level bits
  • Stochastic gradient descent
  • Model synchronization
  • Hyperparameter tuning
  • Global batch size
  • Learning rate adjustment
  • Model synchronization schemes
  • Collective communication
  • Broadcast
  • Gather
  • All-Gather
  • Summary
  • Chapter 3: Building a Data Parallel Training and Serving Pipeline
  • Technical requirements
  • The data parallel training pipeline in a nutshell
  • Input pre-processing
  • Input data partition
  • Data loading
  • Training
  • Model synchronization
  • Model update
  • Single-machine multi-GPUs and multi-machine multi-GPUs
  • Single-machine multi-GPU
  • Multi-machine multi-GPU
  • Checkpointing and fault tolerance
  • Model checkpointing
  • Load model checkpoints
  • Model evaluation and hyperparameter tuning
  • Model serving in data parallelism
  • Summary
  • Chapter 4: Bottlenecks and Solutions
  • Communication bottlenecks in data parallel training
  • Analyzing the communication workloads
  • Parameter server architecture
  • The All-Reduce architecture
  • The inefficiency of state-of-the-art communication schemes
  • Leveraging idle links and host resources
  • Tree All-Reduce
  • Hybrid data transfer over PCIe and NVLink
  • On-device memory bottlenecks
  • Recomputation and quantization
  • Recomputation
  • Quantization
  • Summary
  • Section 2 – Model Parallelism
  • Chapter 5: Splitting the Model
  • Technical requirements
  • Single-node training error – out of memory
  • Fine-tuning BERT on a single GPU
  • Trying to pack a giant model inside one state-of-the-art GPU
  • ELMo, BERT, and GPT
  • Basic concepts
  • RNN
  • ELMo
  • BERT
  • GPT
  • Pre-training and fine-tuning
  • State-of-the-art hardware
  • P100, V100, and DGX-1
  • NVLink
  • A100 and DGX-2
  • NVSwitch
  • Summary
  • Chapter 6: Pipeline Input and Layer Split
  • Vanilla model parallelism is inefficient
  • Forward propagation
  • Backward propagation
  • GPU idle time between forward and backward propagation
  • Pipeline input