The Gradient AllReduce algorithm is a popular synchronous data-parallel distributed algorithm. It is the algorithm implemented in most existing solutions such as PyTorch DistributedDataParallel, Horovod, and TensorFlow Mirrored Strategy.
With this algorithm, each worker does the following steps in each iteration.
- Compute the gradient using a minibatch.
- Compute the mean of the gradients on all workers by using the AllReduce collective.
- Update the model with the averaged gradient.
In Bagua, this algorithm is supported via the
class. The performance of the
GradientAllReduce implementation in Bagua by
default should be on par with PyTorch DDP and faster than Horovod in most cases.
Bagua supports additional optimizations such as hierarchical communication that
can be configured when instantiating the
GradientAllReduce class. They can
make Bagua faster than other implementations in certain scenarios, for example
when the inter-machine network is a bottleneck.
A complete example of running Gradient AllReduce can be found at Bagua examples
--algorithm gradient_allreduce command line argument.
You need to initialize the Bagua algorithm with (see API documentation for what parameters you can customize):
from bagua.torch_api.algorithms import gradient_allreduce algorithm = gradient_allreduce.GradientAllReduceAlgorithm()
Then decorate your model with:
model = model.with_bagua([optimizer], algorithm)