Mirror of https://github.com/zebrajr/pytorch.git (synced 2026-01-15 12:15:51 +00:00)
shard pull / linux-xenial-cuda11.3-py3.7-gcc7 / test (distributed 1->2
Fixes #ISSUE_NUMBER

Shard `pull / linux-xenial-cuda11.3-py3.7-gcc7 / test (distributed ...` from 1 shard to 2.

Pros:
- It currently takes about 2.6 hours and is the 3rd-longest-running job on pull
- Theoretically minimal overhead

Cons:
- Requires changes to run_test.py, which might introduce correctness issues

Notes:
- Cannot shard further, as one of the test files is responsible for about half of the total run time

Spreadsheet regarding sharding: https://docs.google.com/spreadsheets/d/1BdtVsjRr0Is9LXMNilR02FEdPXNq7zEWl8AmR3ArsLQ/edit#gid=1153012347

Test Plan:
<details><summary>expand to see test plan (it's long)</summary>

Tests from a commit run on master (90 tests ran)
```
2022-05-03T12:45:34.7974184Z Selected tests:
2022-05-03T12:45:34.7974495Z distributed/_shard/sharded_optim/test_sharded_optim
2022-05-03T12:45:34.7974839Z distributed/_shard/sharded_tensor/ops/test_binary_cmp
2022-05-03T12:45:34.7975209Z distributed/_shard/sharded_tensor/ops/test_elementwise_ops
2022-05-03T12:45:34.7975575Z distributed/_shard/sharded_tensor/ops/test_embedding
2022-05-03T12:45:34.7976180Z distributed/_shard/sharded_tensor/ops/test_embedding_bag
2022-05-03T12:45:34.7976802Z distributed/_shard/sharded_tensor/ops/test_init
2022-05-03T12:45:34.7977361Z distributed/_shard/sharded_tensor/ops/test_linear
2022-05-03T12:45:34.7978157Z distributed/_shard/sharded_tensor/ops/test_math_ops
2022-05-03T12:45:34.7978879Z distributed/_shard/sharded_tensor/test_megatron_prototype
2022-05-03T12:45:34.7979594Z distributed/_shard/sharded_tensor/test_sharded_tensor
2022-05-03T12:45:34.7980366Z distributed/_shard/sharded_tensor/test_sharded_tensor_reshard
2022-05-03T12:45:34.7981066Z distributed/_shard/sharding_plan/test_sharding_plan
2022-05-03T12:45:34.7981877Z distributed/_shard/sharding_spec/test_sharding_spec
2022-05-03T12:45:34.7982387Z distributed/_shard/test_partial_tensor
2022-05-03T12:45:34.7982691Z distributed/_shard/test_replicated_tensor
2022-05-03T12:45:34.7982994Z distributed/_shard/test_sharder
2022-05-03T12:45:34.7983280Z distributed/algorithms/test_join
2022-05-03T12:45:34.7983695Z distributed/elastic/events/lib_test
2022-05-03T12:45:34.7983984Z distributed/elastic/metrics/api_test
2022-05-03T12:45:34.7984308Z distributed/elastic/multiprocessing/api_test
2022-05-03T12:45:34.7984624Z distributed/elastic/timer/api_test
2022-05-03T12:45:34.7984924Z distributed/elastic/timer/local_timer_example
2022-05-03T12:45:34.7985254Z distributed/elastic/timer/local_timer_test
2022-05-03T12:45:34.7985575Z distributed/elastic/utils/distributed_test
2022-05-03T12:45:34.7985889Z distributed/elastic/utils/logging_test
2022-05-03T12:45:34.7986176Z distributed/elastic/utils/util_test
2022-05-03T12:45:34.7986492Z distributed/fsdp/test_flatten_params_wrapper
2022-05-03T12:45:34.7986799Z distributed/fsdp/test_fsdp_apply
2022-05-03T12:45:34.7987078Z distributed/fsdp/test_fsdp_checkpoint
2022-05-03T12:45:34.7987388Z distributed/fsdp/test_fsdp_clip_grad_norm
2022-05-03T12:45:34.7987691Z distributed/fsdp/test_fsdp_comm
2022-05-03T12:45:34.7987961Z distributed/fsdp/test_fsdp_core
2022-05-03T12:45:34.7988251Z distributed/fsdp/test_fsdp_exec_order
2022-05-03T12:45:34.7988570Z distributed/fsdp/test_fsdp_freezing_weights
2022-05-03T12:45:34.7988865Z distributed/fsdp/test_fsdp_grad_acc
2022-05-03T12:45:34.7989176Z distributed/fsdp/test_fsdp_ignored_modules
2022-05-03T12:45:34.7989478Z distributed/fsdp/test_fsdp_input
2022-05-03T12:45:34.7989950Z distributed/fsdp/test_fsdp_memory
2022-05-03T12:45:34.7990241Z distributed/fsdp/test_fsdp_meta
2022-05-03T12:45:34.7990640Z distributed/fsdp/test_fsdp_mixed_precision
2022-05-03T12:45:34.7990964Z distributed/fsdp/test_fsdp_multiple_forward
2022-05-03T12:45:34.7991293Z distributed/fsdp/test_fsdp_multiple_wrapping
2022-05-03T12:45:34.7991610Z distributed/fsdp/test_fsdp_optim_state
2022-05-03T12:45:34.7991895Z distributed/fsdp/test_fsdp_overlap
2022-05-03T12:45:34.7992195Z distributed/fsdp/test_fsdp_pure_fp16
2022-05-03T12:45:34.7992500Z distributed/fsdp/test_fsdp_state_dict
2022-05-03T12:45:34.7992818Z distributed/fsdp/test_fsdp_summon_full_params
2022-05-03T12:45:34.7993117Z distributed/fsdp/test_fsdp_traversal
2022-05-03T12:45:34.7993861Z distributed/fsdp/test_fsdp_uneven
2022-05-03T12:45:34.7994181Z distributed/fsdp/test_shard_utils
2022-05-03T12:45:34.7994447Z distributed/fsdp/test_utils
2022-05-03T12:45:34.7994721Z distributed/fsdp/test_wrap
2022-05-03T12:45:34.7995015Z distributed/nn/jit/test_instantiator
2022-05-03T12:45:34.7995328Z distributed/optim/test_zero_redundancy_optimizer
2022-05-03T12:45:34.7995664Z distributed/pipeline/sync/skip/test_api
2022-05-03T12:45:34.7995983Z distributed/pipeline/sync/skip/test_gpipe
2022-05-03T12:45:34.7996315Z distributed/pipeline/sync/skip/test_inspect_skip_layout
2022-05-03T12:45:34.7996652Z distributed/pipeline/sync/skip/test_leak
2022-05-03T12:45:34.7996977Z distributed/pipeline/sync/skip/test_portal
2022-05-03T12:45:34.7997292Z distributed/pipeline/sync/skip/test_stash_pop
2022-05-03T12:45:34.7997623Z distributed/pipeline/sync/skip/test_tracker
2022-05-03T12:45:34.7997968Z distributed/pipeline/sync/skip/test_verify_skippables
2022-05-03T12:45:34.7998301Z distributed/pipeline/sync/test_balance
2022-05-03T12:45:34.7998591Z distributed/pipeline/sync/test_bugs
2022-05-03T12:45:34.7998927Z distributed/pipeline/sync/test_checkpoint
2022-05-03T12:45:34.7999243Z distributed/pipeline/sync/test_copy
2022-05-03T12:45:34.7999557Z distributed/pipeline/sync/test_deferred_batch_norm
2022-05-03T12:45:34.7999896Z distributed/pipeline/sync/test_dependency
2022-05-03T12:45:34.8000215Z distributed/pipeline/sync/test_inplace
2022-05-03T12:45:34.8000516Z distributed/pipeline/sync/test_microbatch
2022-05-03T12:45:34.8000826Z distributed/pipeline/sync/test_phony
2022-05-03T12:45:34.8001130Z distributed/pipeline/sync/test_pipe
2022-05-03T12:45:34.8001424Z distributed/pipeline/sync/test_pipeline
2022-05-03T12:45:34.8001733Z distributed/pipeline/sync/test_stream
2022-05-03T12:45:34.8002055Z distributed/pipeline/sync/test_transparency
2022-05-03T12:45:34.8002353Z distributed/pipeline/sync/test_worker
2022-05-03T12:45:34.8002672Z distributed/rpc/cuda/test_tensorpipe_agent
2022-05-03T12:45:34.8002982Z distributed/rpc/test_faulty_agent
2022-05-03T12:45:34.8003270Z distributed/rpc/test_tensorpipe_agent
2022-05-03T12:45:34.8003568Z distributed/test_c10d_common
2022-05-03T12:45:34.8003839Z distributed/test_c10d_gloo
2022-05-03T12:45:34.8004088Z distributed/test_c10d_nccl
2022-05-03T12:45:34.8004369Z distributed/test_c10d_spawn_gloo
2022-05-03T12:45:34.8004656Z distributed/test_c10d_spawn_nccl
2022-05-03T12:45:34.8004938Z distributed/test_data_parallel
2022-05-03T12:45:34.8005212Z distributed/test_distributed_spawn
2022-05-03T12:45:34.8005496Z distributed/test_launcher
2022-05-03T12:45:34.8005767Z distributed/test_nccl
2022-05-03T12:45:34.8006019Z distributed/test_pg_wrapper
2022-05-03T12:45:34.8006285Z distributed/test_store
```

Tests run on the first shard for distributed on this PR (34 tests)
```
2022-05-02T21:26:00.1385256Z Selected tests:
2022-05-02T21:26:00.1385767Z distributed/test_distributed_spawn
2022-05-02T21:26:00.1386403Z distributed/elastic/multiprocessing/api_test
2022-05-02T21:26:00.1387051Z distributed/fsdp/test_fsdp_memory
2022-05-02T21:26:00.1387607Z distributed/fsdp/test_fsdp_ignored_modules
2022-05-02T21:26:00.1388179Z distributed/fsdp/test_fsdp_apply
2022-05-02T21:26:00.1388600Z distributed/_shard/sharded_tensor/ops/test_binary_cmp
2022-05-02T21:26:00.1389181Z distributed/_shard/sharding_spec/test_sharding_spec
2022-05-02T21:26:00.1389545Z distributed/_shard/sharded_tensor/ops/test_linear
2022-05-02T21:26:00.1389878Z distributed/fsdp/test_fsdp_uneven
2022-05-02T21:26:00.1390186Z distributed/fsdp/test_fsdp_multiple_wrapping
2022-05-02T21:26:00.1390526Z distributed/fsdp/test_fsdp_multiple_forward
2022-05-02T21:26:00.1390877Z distributed/_shard/sharded_tensor/ops/test_embedding
2022-05-02T21:26:00.1391219Z distributed/_shard/test_partial_tensor
2022-05-02T21:26:00.1391542Z distributed/_shard/sharded_optim/test_sharded_optim
2022-05-02T21:26:00.1391915Z distributed/_shard/sharded_tensor/ops/test_elementwise_ops
2022-05-02T21:26:00.1392297Z distributed/fsdp/test_flatten_params_wrapper
2022-05-02T21:26:00.1392585Z distributed/fsdp/test_utils
2022-05-02T21:26:00.1392883Z distributed/nn/jit/test_instantiator
2022-05-02T21:26:00.1393167Z distributed/test_nccl
2022-05-02T21:26:00.1393466Z distributed/_shard/sharding_plan/test_sharding_plan
2022-05-02T21:26:00.1393787Z distributed/_shard/test_sharder
2022-05-02T21:26:00.1394085Z distributed/elastic/timer/api_test
2022-05-02T21:26:00.1394383Z distributed/pipeline/sync/skip/test_api
2022-05-02T21:26:00.1394738Z distributed/pipeline/sync/skip/test_inspect_skip_layout
2022-05-02T21:26:00.1395090Z distributed/pipeline/sync/skip/test_portal
2022-05-02T21:26:00.1395424Z distributed/pipeline/sync/skip/test_tracker
2022-05-02T21:26:00.1395935Z distributed/pipeline/sync/test_balance
2022-05-02T21:26:00.1396288Z distributed/pipeline/sync/test_checkpoint
2022-05-02T21:26:00.1396635Z distributed/pipeline/sync/test_deferred_batch_norm
2022-05-02T21:26:00.1396953Z distributed/pipeline/sync/test_inplace
2022-05-02T21:26:00.1397269Z distributed/pipeline/sync/test_phony
2022-05-02T21:26:00.1397587Z distributed/pipeline/sync/test_pipeline
2022-05-02T21:26:00.1397903Z distributed/pipeline/sync/test_transparency
2022-05-02T21:26:00.1398221Z distributed/rpc/test_faulty_agent
```

Tests run on the second shard for distributed on this PR (56 tests)
```
2022-05-02T21:26:55.1342892Z Selected tests:
2022-05-02T21:26:55.1343201Z distributed/rpc/cuda/test_tensorpipe_agent
2022-05-02T21:26:55.1343526Z distributed/fsdp/test_fsdp_core
2022-05-02T21:26:55.1343829Z distributed/test_c10d_nccl
2022-05-02T21:26:55.1344089Z distributed/test_c10d_gloo
2022-05-02T21:26:55.1344408Z distributed/fsdp/test_fsdp_summon_full_params
2022-05-02T21:26:55.1344749Z distributed/fsdp/test_fsdp_mixed_precision
2022-05-02T21:26:55.1345085Z distributed/optim/test_zero_redundancy_optimizer
2022-05-02T21:26:55.1345423Z distributed/fsdp/test_fsdp_optim_state
2022-05-02T21:26:55.1345773Z distributed/_shard/sharded_tensor/test_sharded_tensor
2022-05-02T21:26:55.1346088Z distributed/fsdp/test_fsdp_state_dict
2022-05-02T21:26:55.1346379Z distributed/test_store
2022-05-02T21:26:55.1346661Z distributed/test_c10d_spawn_gloo
2022-05-02T21:26:55.1346966Z distributed/test_pg_wrapper
2022-05-02T21:26:55.1347252Z distributed/test_c10d_spawn_nccl
2022-05-02T21:26:55.1347565Z distributed/fsdp/test_fsdp_clip_grad_norm
2022-05-02T21:26:55.1347871Z distributed/fsdp/test_wrap
2022-05-02T21:26:55.1348369Z distributed/fsdp/test_fsdp_grad_acc
2022-05-02T21:26:55.1348679Z distributed/algorithms/test_join
2022-05-02T21:26:55.1349004Z distributed/fsdp/test_fsdp_freezing_weights
2022-05-02T21:26:55.1349305Z distributed/fsdp/test_fsdp_comm
2022-05-02T21:26:55.1349593Z distributed/test_c10d_common
2022-05-02T21:26:55.1349885Z distributed/fsdp/test_fsdp_meta
2022-05-02T21:26:55.1350171Z distributed/fsdp/test_fsdp_exec_order
2022-05-02T21:26:55.1350486Z distributed/fsdp/test_fsdp_checkpoint
2022-05-02T21:26:55.1350798Z distributed/fsdp/test_fsdp_overlap
2022-05-02T21:26:55.1351105Z distributed/elastic/timer/local_timer_example
2022-05-02T21:26:55.1351423Z distributed/fsdp/test_fsdp_input
2022-05-02T21:26:55.1351749Z distributed/_shard/sharded_tensor/ops/test_init
2022-05-02T21:26:55.1352190Z distributed/elastic/timer/local_timer_test
2022-05-02T21:26:55.1352520Z distributed/elastic/utils/distributed_test
2022-05-02T21:26:55.1352841Z distributed/fsdp/test_fsdp_pure_fp16
2022-05-02T21:26:55.1353150Z distributed/test_data_parallel
2022-05-02T21:26:55.1353437Z distributed/fsdp/test_fsdp_traversal
2022-05-02T21:26:55.1353792Z distributed/_shard/sharded_tensor/test_sharded_tensor_reshard
2022-05-02T21:26:55.1354174Z distributed/_shard/sharded_tensor/ops/test_embedding_bag
2022-05-02T21:26:55.1354534Z distributed/_shard/sharded_tensor/test_megatron_prototype
2022-05-02T21:26:55.1354858Z distributed/test_launcher
2022-05-02T21:26:55.1355149Z distributed/elastic/utils/util_test
2022-05-02T21:26:55.1355441Z distributed/elastic/utils/logging_test
2022-05-02T21:26:55.1355755Z distributed/elastic/metrics/api_test
2022-05-02T21:26:55.1356095Z distributed/_shard/sharded_tensor/ops/test_math_ops
2022-05-02T21:26:55.1356455Z distributed/_shard/test_replicated_tensor
2022-05-02T21:26:55.1356754Z distributed/elastic/events/lib_test
2022-05-02T21:26:55.1357065Z distributed/fsdp/test_shard_utils
2022-05-02T21:26:55.1357387Z distributed/pipeline/sync/skip/test_gpipe
2022-05-02T21:26:55.1357702Z distributed/pipeline/sync/skip/test_leak
2022-05-02T21:26:55.1358040Z distributed/pipeline/sync/skip/test_stash_pop
2022-05-02T21:26:55.1358396Z distributed/pipeline/sync/skip/test_verify_skippables
2022-05-02T21:26:55.1358716Z distributed/pipeline/sync/test_bugs
2022-05-02T21:26:55.1359027Z distributed/pipeline/sync/test_copy
2022-05-02T21:26:55.1359350Z distributed/pipeline/sync/test_dependency
2022-05-02T21:26:55.1359662Z distributed/pipeline/sync/test_microbatch
2022-05-02T21:26:55.1359983Z distributed/pipeline/sync/test_pipe
2022-05-02T21:26:55.1360299Z distributed/pipeline/sync/test_stream
2022-05-02T21:26:55.1360593Z distributed/pipeline/sync/test_worker
2022-05-02T21:26:55.1360912Z distributed/rpc/test_tensorpipe_agent
```

</details>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/76564
Approved by: https://github.com/jeffdaily, https://github.com/janeyx99
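
The note that the job cannot usefully be sharded further can be illustrated with a small sketch. It assumes a greedy longest-first assignment, which is not necessarily what run_test.py does, and the test names and durations below are hypothetical, chosen only so that one file accounts for half of the total run time. The helper names `shard_by_duration` and `is_partition` are likewise illustrative.

```python
def shard_by_duration(durations, num_shards):
    """Greedy sketch: place each test (longest first) on the currently lightest shard."""
    shards = [{"total": 0.0, "tests": []} for _ in range(num_shards)]
    for name, secs in sorted(durations.items(), key=lambda kv: kv[1], reverse=True):
        lightest = min(shards, key=lambda s: s["total"])
        lightest["tests"].append(name)
        lightest["total"] += secs
    return shards


def is_partition(shards, all_tests):
    """True if every test appears on exactly one shard (what the test plan checks by hand)."""
    assigned = [t for s in shards for t in s["tests"]]
    return len(assigned) == len(set(assigned)) and set(assigned) == set(all_tests)


# Hypothetical durations in minutes: "big_test" is half of the 100-minute total.
durations = {"big_test": 50, "a": 20, "b": 15, "c": 10, "d": 5}

two = shard_by_duration(durations, 2)
three = shard_by_duration(durations, 3)

# The slowest shard bounds the wall-clock time of the whole job.
slowest_two = max(s["total"] for s in two)
slowest_three = max(s["total"] for s in three)
print(slowest_two, slowest_three)
```

With these made-up numbers the slowest shard takes the same time with two shards as with three, because the shard holding the largest file can never finish sooner than that file alone. The same partition check applies to the logs in the test plan: the 34 and 56 tests on the two shards add up to the 90 tests run on master.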
Committed by: PyTorch MergeBot
Parent: 3a2fc312be
Commit: 56ea57de61
--- a/.github/workflows/pull.yml
+++ b/.github/workflows/pull.yml
@@ -181,7 +181,8 @@ jobs:
         { include: [
           { config: "default", shard: 1, num_shards: 2, runner: "linux.4xlarge.nvidia.gpu" },
           { config: "default", shard: 2, num_shards: 2, runner: "linux.4xlarge.nvidia.gpu" },
-          { config: "distributed", shard: 1, num_shards: 1, runner: "linux.8xlarge.nvidia.gpu" },
+          { config: "distributed", shard: 1, num_shards: 2, runner: "linux.8xlarge.nvidia.gpu" },
+          { config: "distributed", shard: 2, num_shards: 2, runner: "linux.8xlarge.nvidia.gpu" },
         ]}
 
   linux-bionic-rocm5_1-py3_7-build:
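
When a matrix like this is edited by hand, it is easy to bump `num_shards` on one entry and forget to add the matching shard. A minimal consistency check, sketched under the assumption that the entries have already been parsed into dictionaries (the `check_matrix` helper is hypothetical and not part of the PyTorch CI tooling), could look like this:

```python
from collections import defaultdict

# Entries as they appear in the updated matrix.
matrix = [
    {"config": "default", "shard": 1, "num_shards": 2, "runner": "linux.4xlarge.nvidia.gpu"},
    {"config": "default", "shard": 2, "num_shards": 2, "runner": "linux.4xlarge.nvidia.gpu"},
    {"config": "distributed", "shard": 1, "num_shards": 2, "runner": "linux.8xlarge.nvidia.gpu"},
    {"config": "distributed", "shard": 2, "num_shards": 2, "runner": "linux.8xlarge.nvidia.gpu"},
]


def check_matrix(entries):
    """Return a list of problems; an empty list means each config lists shards 1..num_shards once."""
    shards = defaultdict(list)
    totals = defaultdict(set)
    for entry in entries:
        shards[entry["config"]].append(entry["shard"])
        totals[entry["config"]].add(entry["num_shards"])

    problems = []
    for config, seen in shards.items():
        if len(totals[config]) != 1:
            problems.append(f"{config}: inconsistent num_shards {sorted(totals[config])}")
            continue
        expected = next(iter(totals[config]))
        if sorted(seen) != list(range(1, expected + 1)):
            problems.append(f"{config}: expected shards 1..{expected}, got {sorted(seen)}")
    return problems


print(check_matrix(matrix))
```

An empty result means every config is fully covered; dropping the second "distributed" entry would be reported as a missing shard.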
run_test.py
@@ -538,6 +538,7 @@ def test_distributed(test_module, test_directory, options):
                         backend, with_init
                     )
                 )
+            old_environ = dict(os.environ)
             os.environ["TEMP_DIR"] = tmp_dir
             os.environ["BACKEND"] = backend
             os.environ["INIT_METHOD"] = "env://"
@@ -588,6 +589,8 @@ def test_distributed(test_module, test_directory, options):
                     return return_code
             finally:
                 shutil.rmtree(tmp_dir)
+                os.environ.clear()
+                os.environ.update(old_environ)
     return 0
 
 
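
The run_test.py change snapshots `os.environ` before the distributed variables are set and restores it in the `finally` block, so settings such as `BACKEND` do not leak into whatever test file runs next on the same shard. The same pattern can be sketched as a reusable context manager; the name `preserved_environ` and the `EXAMPLE_BACKEND` variable are illustrative and do not appear in the commit.

```python
import os
from contextlib import contextmanager


@contextmanager
def preserved_environ():
    """Snapshot os.environ on entry and restore it on exit, even if the body raises."""
    saved = dict(os.environ)
    try:
        yield
    finally:
        # Same two calls as in the diff: drop everything, then put the snapshot back.
        os.environ.clear()
        os.environ.update(saved)


# Usage: variables set inside the block are gone afterwards.
os.environ.pop("EXAMPLE_BACKEND", None)
with preserved_environ():
    os.environ["EXAMPLE_BACKEND"] = "gloo"
    inside = os.environ.get("EXAMPLE_BACKEND")
after = os.environ.get("EXAMPLE_BACKEND")
print(inside, after)
```

Clearing before updating matters: `update` alone would restore overwritten values but leave behind any variables that were newly added inside the block.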