pytorch/test/distributed at 2e2fb668fa63e63ee13ff0eafbcadabb9c653de2 - pytorch - Carlos Sousa's Git

OSSForks/pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2026-01-15 12:15:51 +00:00

fduwjj eaeae0ac95 [c10d] Change collective to take in a list of tensors so it work fully for all collectives (#135049)

We found that `collective` currently receives only one input tensor and one output tensor, which makes the NaN check (NaNCheck), the work numel stats, and the Flight Recorder (FR) input/output sizes inaccurate for all-to-all, scatter, and reduce. To fix this, `collective` now takes a list of tensors so that it works for all collectives inside PGNCCL.
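The undercounting described above can be illustrated with a minimal sketch in plain Python. It is not the PGNCCL code: tensors are modeled as shape tuples, and the helper names `numel_single` and `numel_all`, as well as the example shapes, are hypothetical.

```python
from math import prod


def numel(shape):
    """Number of elements in a tensor with the given shape."""
    return prod(shape)


def numel_single(shapes):
    """Hypothetical bookkeeping that only sees the first tensor."""
    return numel(shapes[0])


def numel_all(shapes):
    """Hypothetical bookkeeping that sees every tensor in the list."""
    return sum(numel(s) for s in shapes)


# An all-to-all style call with one input chunk per peer (shapes are made up).
inputs = [(2, 3), (2, 3), (4, 3), (1, 3)]

single = numel_single(inputs)  # counts only the first chunk
total = numel_all(inputs)      # counts every chunk

print(f"single-tensor view: {single}, list view: {total}")
```

With a single-tensor view, any statistic or check derived from the argument covers only the first chunk; passing the whole list lets the same bookkeeping cover every tensor the collective actually touches.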

This partially reverts the change made in https://github.com/pytorch/pytorch/pull/119421. Another round of cleanup on `collective` is planned to make it cleaner; for now, we changed it back for the sake of correctness.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135049
Approved by: https://github.com/kwen2501

2024-09-05 07:56:56 +00:00

..

[PP] Fix zero bubble composability with DP (#134052)

2024-09-04 23:46:29 +00:00

…

[reland][dtensor][MTPG] make sharding prop lru cache not shared among threads (#134509)

2024-08-30 02:13:45 +00:00

Runtime Estimator for estimating GPU compute time (#134243)

2024-08-28 20:06:54 +00:00

…

…

[DTensor] Extend implicit replication to replicate DTensor for foreach ops so model doesn't have to be fully tp-ed when using 2D (#134551)

2024-08-29 09:01:31 +00:00

[TorchElastic] make torch elastic not have to realize TCPStore backend type and rely on c10d to decide which backend to use (#134882)

2024-09-03 19:43:21 +00:00

flight_recorder

[FR] Make pg_name unique, show P2P collective status and fix bugs when running the script as command (#134780)

2024-08-30 18:03:17 +00:00

[FSDP] casting input args with dataclass(frozen=True) (#135067)

2024-09-05 01:19:53 +00:00

…

…

…

[PP] Fix zero bubble composability with DP (#134052)

2024-09-04 23:46:29 +00:00

…

tensor/parallel

Exclude test_transformers and unit tests which require recent GPU arch (#132895)

2024-08-27 20:40:53 +00:00

argparse_util_test.py

…

test_c10d_common.py

Revert "[c10d] Remove Option for ProcessGroup and Expose backend Options to reflect the correct code structure (#132931)"

2024-08-30 16:27:40 +00:00

test_c10d_functional_native.py

…

test_c10d_gloo.py

[BE][Easy] enable ruff rule PIE790: unnecessary pass statement (#133200)

2024-08-15 15:50:19 +00:00

test_c10d_logger.py

…

test_c10d_nccl.py

[c10d] Change collective to take in a list of tensors so it work fully for all collectives (#135049)

2024-09-05 07:56:56 +00:00

test_c10d_object_collectives.py

…

test_c10d_ops_nccl.py

…

test_c10d_pypg.py

…

test_c10d_spawn_gloo.py

…

test_c10d_spawn_nccl.py

…

test_c10d_spawn_ucc.py

…

test_c10d_spawn.py

…

test_c10d_ucc.py

…

test_collective_utils.py

…

test_compute_comm_reordering.py

…

test_control_collectives.py

…

test_data_parallel.py

…

test_device_mesh.py

Revert "[c10d] Remove Option for ProcessGroup and Expose backend Options to reflect the correct code structure (#132931)"

2024-08-30 16:27:40 +00:00

test_distributed_spawn.py

…

test_dynamo_distributed.py

restore CSE'd node metadata in runtime asserts pass (#134516)

2024-09-05 07:50:04 +00:00

test_fake_pg.py

…

test_functional_api.py

make make_fx collective test single threaded (#134775)

2024-08-30 15:58:20 +00:00

test_inductor_collectives.py

…

test_launcher.py

…

test_multi_threaded_pg.py

…

test_nccl.py

…

test_pg_wrapper.py

…

test_store.py

…

test_symmetric_memory.py

[CUDA][P2P] Check device capability in requires_cuda_p2p_access (#134523)

2024-08-30 14:08:55 +00:00