If you see an error like the one below, first remove the old installation record by running
`rm -rf /root/.rustup`, then reinstall.
error: could not rename component file from '/root/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/share/doc/cargo' to '/root/.rustup/tmp/m74fkrv0gv6708f6_dir/bk'
error: caused by: other os error.
Check whether the machine has multiple network interfaces. If it does, set
NCCL_SOCKET_IFNAME=network card name (such as eth01) to specify
the one you want to use (usually a physical one). Card information can be
Using a different algorithm or more GPUs has a similar effect to using a different optimizer, so you need to retune your hyperparameters. Some tricks you can try:
- Train for more epochs. Increase the number of training iterations by 20-30% over the original.
- Scale the learning rate. If the total batch size of distributed training is increased by N times, the learning rate should also be increased by N times, to N × lr.
- Use a gradual learning rate warmup over the first several epochs, which often helps (see also "Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour").
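
The learning rate scaling and warmup tricks above can be sketched in a few lines of Python. This is only an illustration, not tied to any particular framework: the function names `scaled_lr` and `lr_at_step`, the batch sizes, the base learning rate, and the warmup length are all hypothetical, and the linear ramp is one simple warmup variant rather than the exact schedule from the paper.

```python
def scaled_lr(base_lr, base_batch_size, total_batch_size):
    """Linear scaling: the learning rate grows in proportion to the total batch size."""
    return base_lr * total_batch_size / base_batch_size


def lr_at_step(step, target_lr, warmup_steps):
    """Ramp linearly up to target_lr over warmup_steps, then hold it constant."""
    if warmup_steps > 0 and step < warmup_steps:
        return target_lr * (step + 1) / warmup_steps
    return target_lr


# Hypothetical numbers: a recipe tuned for batch size 256 with lr 0.1,
# moved to a total batch size of 2048 (N = 8).
target = scaled_lr(base_lr=0.1, base_batch_size=256, total_batch_size=2048)

# Warm up over 5 steps (epochs or iterations, whichever unit your scheduler uses).
schedule = [lr_at_step(s, target, warmup_steps=5) for s in range(8)]
print(target, schedule)
```

Treat the scaled value as a starting point and still validate it on your own task, since retuning is usually needed after changing the batch size.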