Distributed Machine Learning with Python: Accelerating Model Training and Serving with Distributed Systems

Chapter 2: Parameter Server and All-Reduce -- Technical requirements -- Parameter server architecture -- Communication bottleneck in the parameter server architecture -- Sharding the model among parameter servers -- Implementing the parameter server -- Defining model layers -- Defining the parameter...


Bibliographic Details
Main Author: Wang, Guanhua
Format: Electronic Book
Language: English
Published: Birmingham: Packt Publishing Limited, 2022
Table of Contents:
  • Intro
  • Title page
  • Copyright and Credits
  • Dedication
  • Contributors
  • Table of Contents
  • Preface
  • Section 1 – Data Parallelism
  • Chapter 1: Splitting Input Data
  • Single-node training is too slow
  • The mismatch between data loading bandwidth and model training bandwidth
  • Single-node training time on popular datasets
  • Accelerating the training process with data parallelism
  • Data parallelism – the high-level bits
  • Stochastic gradient descent
  • Model synchronization
  • Hyperparameter tuning
  • Global batch size
  • Learning rate adjustment
  • Model synchronization schemes
  • Collective communication
  • Broadcast
  • Gather
  • All-Gather
  • Summary
  • Chapter 3: Building a Data Parallel Training and Serving Pipeline
  • Technical requirements
  • The data parallel training pipeline in a nutshell
  • Input pre-processing
  • Input data partition
  • Data loading
  • Training
  • Model synchronization
  • Model update
  • Single-machine multi-GPUs and multi-machine multi-GPUs
  • Single-machine multi-GPU
  • Multi-machine multi-GPU
  • Checkpointing and fault tolerance
  • Model checkpointing
  • Load model checkpoints
  • Model evaluation and hyperparameter tuning
  • Model serving in data parallelism
  • Summary
  • Chapter 4: Bottlenecks and Solutions
  • Communication bottlenecks in data parallel training
  • Analyzing the communication workloads
  • Parameter server architecture
  • The All-Reduce architecture
  • The inefficiency of state-of-the-art communication schemes
  • Leveraging idle links and host resources
  • Tree All-Reduce
  • Hybrid data transfer over PCIe and NVLink
  • On-device memory bottlenecks
  • Recomputation and quantization
  • Recomputation
  • Quantization
  • Summary
  • Section 2 – Model Parallelism
  • Chapter 5: Splitting the Model
  • Technical requirements
  • Single-node training error – out of memory
  • Fine-tuning BERT on a single GPU
  • Trying to pack a giant model inside one state-of-the-art GPU
  • ELMo, BERT, and GPT
  • Basic concepts
  • RNN
  • ELMo
  • BERT
  • GPT
  • Pre-training and fine-tuning
  • State-of-the-art hardware
  • P100, V100, and DGX-1
  • NVLink
  • A100 and DGX-2
  • NVSwitch
  • Summary
  • Chapter 6: Pipeline Input and Layer Split
  • Vanilla model parallelism is inefficient
  • Forward propagation
  • Backward propagation
  • GPU idle time between forward and backward propagation
  • Pipeline input