shard pull / linux-xenial-cuda11.3-py3.7-gcc7 / test (distributed 1->2

Fixes #ISSUE_NUMBER

shard `pull / linux-xenial-cuda11.3-py3.7-gcc7 / test (distributed ...` from 1 shard to 2

Pros:
- It currently takes about 2.6 hours and is 3rd longest running job on pull
- Theoretically minimal overhead

Cons:
- Requires changes to the run_test.py which might have correctness issues

Notes:
- Cannot shard further as one of the test files is responsible for about half of the total run time

spreadsheet regarding sharding: https://docs.google.com/spreadsheets/d/1BdtVsjRr0Is9LXMNilR02FEdPXNq7zEWl8AmR3ArsLQ/edit#gid=1153012347

Test Plan:
<details><summary>expand to see test plan (its long)</summary>

tests from a commit ran on master (90 tests ran)
```
2022-05-03T12:45:34.7974184Z Selected tests:
2022-05-03T12:45:34.7974495Z  distributed/_shard/sharded_optim/test_sharded_optim
2022-05-03T12:45:34.7974839Z  distributed/_shard/sharded_tensor/ops/test_binary_cmp
2022-05-03T12:45:34.7975209Z  distributed/_shard/sharded_tensor/ops/test_elementwise_ops
2022-05-03T12:45:34.7975575Z  distributed/_shard/sharded_tensor/ops/test_embedding
2022-05-03T12:45:34.7976180Z  distributed/_shard/sharded_tensor/ops/test_embedding_bag
2022-05-03T12:45:34.7976802Z  distributed/_shard/sharded_tensor/ops/test_init
2022-05-03T12:45:34.7977361Z  distributed/_shard/sharded_tensor/ops/test_linear
2022-05-03T12:45:34.7978157Z  distributed/_shard/sharded_tensor/ops/test_math_ops
2022-05-03T12:45:34.7978879Z  distributed/_shard/sharded_tensor/test_megatron_prototype
2022-05-03T12:45:34.7979594Z  distributed/_shard/sharded_tensor/test_sharded_tensor
2022-05-03T12:45:34.7980366Z  distributed/_shard/sharded_tensor/test_sharded_tensor_reshard
2022-05-03T12:45:34.7981066Z  distributed/_shard/sharding_plan/test_sharding_plan
2022-05-03T12:45:34.7981877Z  distributed/_shard/sharding_spec/test_sharding_spec
2022-05-03T12:45:34.7982387Z  distributed/_shard/test_partial_tensor
2022-05-03T12:45:34.7982691Z  distributed/_shard/test_replicated_tensor
2022-05-03T12:45:34.7982994Z  distributed/_shard/test_sharder
2022-05-03T12:45:34.7983280Z  distributed/algorithms/test_join
2022-05-03T12:45:34.7983695Z  distributed/elastic/events/lib_test
2022-05-03T12:45:34.7983984Z  distributed/elastic/metrics/api_test
2022-05-03T12:45:34.7984308Z  distributed/elastic/multiprocessing/api_test
2022-05-03T12:45:34.7984624Z  distributed/elastic/timer/api_test
2022-05-03T12:45:34.7984924Z  distributed/elastic/timer/local_timer_example
2022-05-03T12:45:34.7985254Z  distributed/elastic/timer/local_timer_test
2022-05-03T12:45:34.7985575Z  distributed/elastic/utils/distributed_test
2022-05-03T12:45:34.7985889Z  distributed/elastic/utils/logging_test
2022-05-03T12:45:34.7986176Z  distributed/elastic/utils/util_test
2022-05-03T12:45:34.7986492Z  distributed/fsdp/test_flatten_params_wrapper
2022-05-03T12:45:34.7986799Z  distributed/fsdp/test_fsdp_apply
2022-05-03T12:45:34.7987078Z  distributed/fsdp/test_fsdp_checkpoint
2022-05-03T12:45:34.7987388Z  distributed/fsdp/test_fsdp_clip_grad_norm
2022-05-03T12:45:34.7987691Z  distributed/fsdp/test_fsdp_comm
2022-05-03T12:45:34.7987961Z  distributed/fsdp/test_fsdp_core
2022-05-03T12:45:34.7988251Z  distributed/fsdp/test_fsdp_exec_order
2022-05-03T12:45:34.7988570Z  distributed/fsdp/test_fsdp_freezing_weights
2022-05-03T12:45:34.7988865Z  distributed/fsdp/test_fsdp_grad_acc
2022-05-03T12:45:34.7989176Z  distributed/fsdp/test_fsdp_ignored_modules
2022-05-03T12:45:34.7989478Z  distributed/fsdp/test_fsdp_input
2022-05-03T12:45:34.7989950Z  distributed/fsdp/test_fsdp_memory
2022-05-03T12:45:34.7990241Z  distributed/fsdp/test_fsdp_meta
2022-05-03T12:45:34.7990640Z  distributed/fsdp/test_fsdp_mixed_precision
2022-05-03T12:45:34.7990964Z  distributed/fsdp/test_fsdp_multiple_forward
2022-05-03T12:45:34.7991293Z  distributed/fsdp/test_fsdp_multiple_wrapping
2022-05-03T12:45:34.7991610Z  distributed/fsdp/test_fsdp_optim_state
2022-05-03T12:45:34.7991895Z  distributed/fsdp/test_fsdp_overlap
2022-05-03T12:45:34.7992195Z  distributed/fsdp/test_fsdp_pure_fp16
2022-05-03T12:45:34.7992500Z  distributed/fsdp/test_fsdp_state_dict
2022-05-03T12:45:34.7992818Z  distributed/fsdp/test_fsdp_summon_full_params
2022-05-03T12:45:34.7993117Z  distributed/fsdp/test_fsdp_traversal
2022-05-03T12:45:34.7993861Z  distributed/fsdp/test_fsdp_uneven
2022-05-03T12:45:34.7994181Z  distributed/fsdp/test_shard_utils
2022-05-03T12:45:34.7994447Z  distributed/fsdp/test_utils
2022-05-03T12:45:34.7994721Z  distributed/fsdp/test_wrap
2022-05-03T12:45:34.7995015Z  distributed/nn/jit/test_instantiator
2022-05-03T12:45:34.7995328Z  distributed/optim/test_zero_redundancy_optimizer
2022-05-03T12:45:34.7995664Z  distributed/pipeline/sync/skip/test_api
2022-05-03T12:45:34.7995983Z  distributed/pipeline/sync/skip/test_gpipe
2022-05-03T12:45:34.7996315Z  distributed/pipeline/sync/skip/test_inspect_skip_layout
2022-05-03T12:45:34.7996652Z  distributed/pipeline/sync/skip/test_leak
2022-05-03T12:45:34.7996977Z  distributed/pipeline/sync/skip/test_portal
2022-05-03T12:45:34.7997292Z  distributed/pipeline/sync/skip/test_stash_pop
2022-05-03T12:45:34.7997623Z  distributed/pipeline/sync/skip/test_tracker
2022-05-03T12:45:34.7997968Z  distributed/pipeline/sync/skip/test_verify_skippables
2022-05-03T12:45:34.7998301Z  distributed/pipeline/sync/test_balance
2022-05-03T12:45:34.7998591Z  distributed/pipeline/sync/test_bugs
2022-05-03T12:45:34.7998927Z  distributed/pipeline/sync/test_checkpoint
2022-05-03T12:45:34.7999243Z  distributed/pipeline/sync/test_copy
2022-05-03T12:45:34.7999557Z  distributed/pipeline/sync/test_deferred_batch_norm
2022-05-03T12:45:34.7999896Z  distributed/pipeline/sync/test_dependency
2022-05-03T12:45:34.8000215Z  distributed/pipeline/sync/test_inplace
2022-05-03T12:45:34.8000516Z  distributed/pipeline/sync/test_microbatch
2022-05-03T12:45:34.8000826Z  distributed/pipeline/sync/test_phony
2022-05-03T12:45:34.8001130Z  distributed/pipeline/sync/test_pipe
2022-05-03T12:45:34.8001424Z  distributed/pipeline/sync/test_pipeline
2022-05-03T12:45:34.8001733Z  distributed/pipeline/sync/test_stream
2022-05-03T12:45:34.8002055Z  distributed/pipeline/sync/test_transparency
2022-05-03T12:45:34.8002353Z  distributed/pipeline/sync/test_worker
2022-05-03T12:45:34.8002672Z  distributed/rpc/cuda/test_tensorpipe_agent
2022-05-03T12:45:34.8002982Z  distributed/rpc/test_faulty_agent
2022-05-03T12:45:34.8003270Z  distributed/rpc/test_tensorpipe_agent
2022-05-03T12:45:34.8003568Z  distributed/test_c10d_common
2022-05-03T12:45:34.8003839Z  distributed/test_c10d_gloo
2022-05-03T12:45:34.8004088Z  distributed/test_c10d_nccl
2022-05-03T12:45:34.8004369Z  distributed/test_c10d_spawn_gloo
2022-05-03T12:45:34.8004656Z  distributed/test_c10d_spawn_nccl
2022-05-03T12:45:34.8004938Z  distributed/test_data_parallel
2022-05-03T12:45:34.8005212Z  distributed/test_distributed_spawn
2022-05-03T12:45:34.8005496Z  distributed/test_launcher
2022-05-03T12:45:34.8005767Z  distributed/test_nccl
2022-05-03T12:45:34.8006019Z  distributed/test_pg_wrapper
2022-05-03T12:45:34.8006285Z  distributed/test_store
```

tests ran on first shard for distributed on this PR (34 tests)
```
2022-05-02T21:26:00.1385256Z Selected tests:
2022-05-02T21:26:00.1385767Z  distributed/test_distributed_spawn
2022-05-02T21:26:00.1386403Z  distributed/elastic/multiprocessing/api_test
2022-05-02T21:26:00.1387051Z  distributed/fsdp/test_fsdp_memory
2022-05-02T21:26:00.1387607Z  distributed/fsdp/test_fsdp_ignored_modules
2022-05-02T21:26:00.1388179Z  distributed/fsdp/test_fsdp_apply
2022-05-02T21:26:00.1388600Z  distributed/_shard/sharded_tensor/ops/test_binary_cmp
2022-05-02T21:26:00.1389181Z  distributed/_shard/sharding_spec/test_sharding_spec
2022-05-02T21:26:00.1389545Z  distributed/_shard/sharded_tensor/ops/test_linear
2022-05-02T21:26:00.1389878Z  distributed/fsdp/test_fsdp_uneven
2022-05-02T21:26:00.1390186Z  distributed/fsdp/test_fsdp_multiple_wrapping
2022-05-02T21:26:00.1390526Z  distributed/fsdp/test_fsdp_multiple_forward
2022-05-02T21:26:00.1390877Z  distributed/_shard/sharded_tensor/ops/test_embedding
2022-05-02T21:26:00.1391219Z  distributed/_shard/test_partial_tensor
2022-05-02T21:26:00.1391542Z  distributed/_shard/sharded_optim/test_sharded_optim
2022-05-02T21:26:00.1391915Z  distributed/_shard/sharded_tensor/ops/test_elementwise_ops
2022-05-02T21:26:00.1392297Z  distributed/fsdp/test_flatten_params_wrapper
2022-05-02T21:26:00.1392585Z  distributed/fsdp/test_utils
2022-05-02T21:26:00.1392883Z  distributed/nn/jit/test_instantiator
2022-05-02T21:26:00.1393167Z  distributed/test_nccl
2022-05-02T21:26:00.1393466Z  distributed/_shard/sharding_plan/test_sharding_plan
2022-05-02T21:26:00.1393787Z  distributed/_shard/test_sharder
2022-05-02T21:26:00.1394085Z  distributed/elastic/timer/api_test
2022-05-02T21:26:00.1394383Z  distributed/pipeline/sync/skip/test_api
2022-05-02T21:26:00.1394738Z  distributed/pipeline/sync/skip/test_inspect_skip_layout
2022-05-02T21:26:00.1395090Z  distributed/pipeline/sync/skip/test_portal
2022-05-02T21:26:00.1395424Z  distributed/pipeline/sync/skip/test_tracker
2022-05-02T21:26:00.1395935Z  distributed/pipeline/sync/test_balance
2022-05-02T21:26:00.1396288Z  distributed/pipeline/sync/test_checkpoint
2022-05-02T21:26:00.1396635Z  distributed/pipeline/sync/test_deferred_batch_norm
2022-05-02T21:26:00.1396953Z  distributed/pipeline/sync/test_inplace
2022-05-02T21:26:00.1397269Z  distributed/pipeline/sync/test_phony
2022-05-02T21:26:00.1397587Z  distributed/pipeline/sync/test_pipeline
2022-05-02T21:26:00.1397903Z  distributed/pipeline/sync/test_transparency
2022-05-02T21:26:00.1398221Z  distributed/rpc/test_faulty_agent
```

tests ran on second shard for distributed on this PR (56 tests)
```
2022-05-02T21:26:55.1342892Z Selected tests:
2022-05-02T21:26:55.1343201Z  distributed/rpc/cuda/test_tensorpipe_agent
2022-05-02T21:26:55.1343526Z  distributed/fsdp/test_fsdp_core
2022-05-02T21:26:55.1343829Z  distributed/test_c10d_nccl
2022-05-02T21:26:55.1344089Z  distributed/test_c10d_gloo
2022-05-02T21:26:55.1344408Z  distributed/fsdp/test_fsdp_summon_full_params
2022-05-02T21:26:55.1344749Z  distributed/fsdp/test_fsdp_mixed_precision
2022-05-02T21:26:55.1345085Z  distributed/optim/test_zero_redundancy_optimizer
2022-05-02T21:26:55.1345423Z  distributed/fsdp/test_fsdp_optim_state
2022-05-02T21:26:55.1345773Z  distributed/_shard/sharded_tensor/test_sharded_tensor
2022-05-02T21:26:55.1346088Z  distributed/fsdp/test_fsdp_state_dict
2022-05-02T21:26:55.1346379Z  distributed/test_store
2022-05-02T21:26:55.1346661Z  distributed/test_c10d_spawn_gloo
2022-05-02T21:26:55.1346966Z  distributed/test_pg_wrapper
2022-05-02T21:26:55.1347252Z  distributed/test_c10d_spawn_nccl
2022-05-02T21:26:55.1347565Z  distributed/fsdp/test_fsdp_clip_grad_norm
2022-05-02T21:26:55.1347871Z  distributed/fsdp/test_wrap
2022-05-02T21:26:55.1348369Z  distributed/fsdp/test_fsdp_grad_acc
2022-05-02T21:26:55.1348679Z  distributed/algorithms/test_join
2022-05-02T21:26:55.1349004Z  distributed/fsdp/test_fsdp_freezing_weights
2022-05-02T21:26:55.1349305Z  distributed/fsdp/test_fsdp_comm
2022-05-02T21:26:55.1349593Z  distributed/test_c10d_common
2022-05-02T21:26:55.1349885Z  distributed/fsdp/test_fsdp_meta
2022-05-02T21:26:55.1350171Z  distributed/fsdp/test_fsdp_exec_order
2022-05-02T21:26:55.1350486Z  distributed/fsdp/test_fsdp_checkpoint
2022-05-02T21:26:55.1350798Z  distributed/fsdp/test_fsdp_overlap
2022-05-02T21:26:55.1351105Z  distributed/elastic/timer/local_timer_example
2022-05-02T21:26:55.1351423Z  distributed/fsdp/test_fsdp_input
2022-05-02T21:26:55.1351749Z  distributed/_shard/sharded_tensor/ops/test_init
2022-05-02T21:26:55.1352190Z  distributed/elastic/timer/local_timer_test
2022-05-02T21:26:55.1352520Z  distributed/elastic/utils/distributed_test
2022-05-02T21:26:55.1352841Z  distributed/fsdp/test_fsdp_pure_fp16
2022-05-02T21:26:55.1353150Z  distributed/test_data_parallel
2022-05-02T21:26:55.1353437Z  distributed/fsdp/test_fsdp_traversal
2022-05-02T21:26:55.1353792Z  distributed/_shard/sharded_tensor/test_sharded_tensor_reshard
2022-05-02T21:26:55.1354174Z  distributed/_shard/sharded_tensor/ops/test_embedding_bag
2022-05-02T21:26:55.1354534Z  distributed/_shard/sharded_tensor/test_megatron_prototype
2022-05-02T21:26:55.1354858Z  distributed/test_launcher
2022-05-02T21:26:55.1355149Z  distributed/elastic/utils/util_test
2022-05-02T21:26:55.1355441Z  distributed/elastic/utils/logging_test
2022-05-02T21:26:55.1355755Z  distributed/elastic/metrics/api_test
2022-05-02T21:26:55.1356095Z  distributed/_shard/sharded_tensor/ops/test_math_ops
2022-05-02T21:26:55.1356455Z  distributed/_shard/test_replicated_tensor
2022-05-02T21:26:55.1356754Z  distributed/elastic/events/lib_test
2022-05-02T21:26:55.1357065Z  distributed/fsdp/test_shard_utils
2022-05-02T21:26:55.1357387Z  distributed/pipeline/sync/skip/test_gpipe
2022-05-02T21:26:55.1357702Z  distributed/pipeline/sync/skip/test_leak
2022-05-02T21:26:55.1358040Z  distributed/pipeline/sync/skip/test_stash_pop
2022-05-02T21:26:55.1358396Z  distributed/pipeline/sync/skip/test_verify_skippables
2022-05-02T21:26:55.1358716Z  distributed/pipeline/sync/test_bugs
2022-05-02T21:26:55.1359027Z  distributed/pipeline/sync/test_copy
2022-05-02T21:26:55.1359350Z  distributed/pipeline/sync/test_dependency
2022-05-02T21:26:55.1359662Z  distributed/pipeline/sync/test_microbatch
2022-05-02T21:26:55.1359983Z  distributed/pipeline/sync/test_pipe
2022-05-02T21:26:55.1360299Z  distributed/pipeline/sync/test_stream
2022-05-02T21:26:55.1360593Z  distributed/pipeline/sync/test_worker
2022-05-02T21:26:55.1360912Z  distributed/rpc/test_tensorpipe_agent
```
</details>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76564
Approved by: https://github.com/jeffdaily, https://github.com/janeyx99
This commit is contained in:
Catherine Lee
2022-05-03 23:01:42 +00:00
committed by PyTorch MergeBot
parent 3a2fc312be
commit 56ea57de61
2 changed files with 5 additions and 1 deletions

View File

@@ -181,7 +181,8 @@ jobs:
{ include: [
{ config: "default", shard: 1, num_shards: 2, runner: "linux.4xlarge.nvidia.gpu" },
{ config: "default", shard: 2, num_shards: 2, runner: "linux.4xlarge.nvidia.gpu" },
{ config: "distributed", shard: 1, num_shards: 1, runner: "linux.8xlarge.nvidia.gpu" },
{ config: "distributed", shard: 1, num_shards: 2, runner: "linux.8xlarge.nvidia.gpu" },
{ config: "distributed", shard: 2, num_shards: 2, runner: "linux.8xlarge.nvidia.gpu" },
]}
linux-bionic-rocm5_1-py3_7-build:

View File

@@ -538,6 +538,7 @@ def test_distributed(test_module, test_directory, options):
backend, with_init
)
)
old_environ = dict(os.environ)
os.environ["TEMP_DIR"] = tmp_dir
os.environ["BACKEND"] = backend
os.environ["INIT_METHOD"] = "env://"
@@ -588,6 +589,8 @@ def test_distributed(test_module, test_directory, options):
return return_code
finally:
shutil.rmtree(tmp_dir)
os.environ.clear()
os.environ.update(old_environ)
return 0