Abstract: Distributed data-parallel training (DDP) is prevalent in large-scale deep learning. To increase training throughput and scalability, high-performance collective communication methods ...