[RESUBMIT] Cleanup error reporting for ProcessGroupNCCL (#112419)

Continuing some of the work from https://github.com/pytorch/pytorch/pull/108191, I realized that the majority of errors raised from ProcessGroupNCCL were just generic RuntimeErrors.

In this PR, I've added appropriate error types to all the exceptions raised from ProcessGroupNCCL.
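
To illustrate why a specific error type helps callers, here is a minimal, self-contained sketch. The `sketch::DistError` and `sketch::DistBackendError` classes, `failing_collective`, and `classify` are hypothetical stand-ins written for this example; they are not the real c10 classes, and the real hierarchy may differ. The sketch only shows that a handler for a specific type can separate backend failures from other runtime errors.

```cpp
#include <stdexcept>
#include <string>

namespace sketch {

// Hypothetical stand-ins for illustration only; NOT the real c10 classes.
struct DistError : std::runtime_error {
  explicit DistError(const std::string& msg) : std::runtime_error(msg) {}
};

struct DistBackendError : DistError {
  explicit DistBackendError(const std::string& msg) : DistError(msg) {}
};

// Simulates a collective whose backend reports a failure.
inline void failing_collective() {
  throw DistBackendError("simulated backend failure");
}

// Runs fn and reports which handler caught its exception.
// Handlers are ordered from most specific to least specific.
template <typename Fn>
std::string classify(Fn&& fn) {
  try {
    fn();
  } catch (const DistBackendError&) {
    return "backend";
  } catch (const DistError&) {
    return "dist";
  } catch (const std::exception&) {
    return "generic";
  }
  return "none";
}

} // namespace sketch
```

With a generic exception type, every failure would land in the last handler. With a dedicated type, a caller can react to backend failures specifically, for example by tearing down communicators, while still treating unrelated errors generically.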

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112419
Approved by: https://github.com/fduwjj
Pritam Damania
2023-10-31 05:58:21 +00:00
committed by PyTorch MergeBot
parent cb942ef2b1
commit e66ec5843f
4 changed files with 129 additions and 105 deletions

@@ -224,7 +224,7 @@ TEST_F(ProcessGroupNCCLErrorsTest, testNCCLTimedoutErrorsBlocking) {
   // Now run all reduce with errors.
   pg.set_timedout_error();
   work = pg.allreduce(tensors_);
-  EXPECT_THROW(work->wait(), std::runtime_error);
+  EXPECT_THROW(work->wait(), c10::DistBackendError);
   // Communicators might be aborted here, further operations would fail.
 }