A few fixes:
- We weren't partitioning around index_put with boolean inputs (a minimal sketch follows below).
- Graph partitioning was skipping the whole graph any time we set V.graph.disable_cudagraphs_reason. There's no reason to use that for partitioning, so I've updated the skip logic to cover each of the reasons we would set disable_cudagraphs_reason.
- Pruned the deterministic disable-cudagraphs reason. I'm not sure how this list of ops got there originally, but I've added OpInfo tests showing they're cudagraphable.
We run into unrelated errors with only part of these fixes applied, so I'm doing this as one PR.
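A minimal sketch of the op pattern from the first bullet (hypothetical shapes and values, not the PR's test case; it assumes a CUDA device since partitioning only matters under cudagraphs):

```python
import torch

@torch.compile(mode="reduce-overhead")  # cudagraph-friendly compile mode
def masked_zero(x, mask):
    # index_put_ with a boolean index tensor: the pattern the partitioner
    # change splits the graph around
    return x.index_put_((mask,), torch.zeros((), device=x.device))

x = torch.randn(8, device="cuda")
mask = x > 0
masked_zero(x, mask)
```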
Fixes #169951
Pull Request resolved: https://github.com/pytorch/pytorch/pull/170103
Approved by: https://github.com/BoyuanFeng
Multiple TorchBench models on XPU fail accuracy tests because the numeric tolerances are too strict. Two contributing factors were identified:
1. A measurement methodology change (PyTorch 2.6.0 enforcing cosine_similarity, https://github.com/pytorch/pytorch/blob/main/benchmarks/dynamo/common.py#L2227) surfaced limitations and increased the sensitivity of the error checks for phlippe_resnet.
2. BatchNorm decomposition noise (~1e-5 RMSE per BN in fp16) accumulates through the iterations in botnet26t_256, pushing the aggregate diffs beyond current thresholds (see the sketch after the analysis below).
**Analysis**
- phlippe_resnet failures reproduce across CPU and XPU; fp16 already uses a higher tolerance, implying the bf16 thresholds are misaligned.
- Disabling BN decomposition brings botnet26t_256 outputs within tolerance; with decomposition enabled, cumulative numeric error is expected.
- CI health indicates changes are non-disruptive; failures, where present, are unrelated to these PRs.
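A back-of-the-envelope sketch of the accumulation argument in point 2 (the ~1e-5 per-BN RMSE comes from the description above; the layer count and the independence/quadrature assumption are mine):

```python
import math

per_bn_rmse = 1e-5    # approximate fp16 noise per decomposed BatchNorm (from above)
num_bn_layers = 50    # hypothetical depth for a botnet26t_256-like model
# if per-layer errors are roughly independent, they add in quadrature
aggregate_rmse = per_bn_rmse * math.sqrt(num_bn_layers)
print(f"estimated aggregate RMSE ~ {aggregate_rmse:.1e}")  # ~7.1e-05
```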
Fixes https://github.com/intel/torch-xpu-ops/issues/1799
Fixes https://github.com/intel/torch-xpu-ops/issues/1305
Pull Request resolved: https://github.com/pytorch/pytorch/pull/170552
Approved by: https://github.com/EikanWang, https://github.com/desertfire
Co-authored-by: Tomasz Bohutyn <tbohutyn@habana.ai>
`python benchmarks/dynamo/torchbench.py --performance --inference -k vision_maskrcnn`
was failing with:
```
Traceback (most recent call last):
File "/home/jansel/pytorch/benchmarks/dynamo/torchbench.py", line 490, in <module>
torchbench_main()
File "/home/jansel/pytorch/benchmarks/dynamo/torchbench.py", line 486, in torchbench_main
main(TorchBenchmarkRunner(), original_dir)
File "/home/jansel/pytorch/benchmarks/dynamo/common.py", line 3730, in main
process_entry(0, runner, original_dir, args)
File "/home/jansel/pytorch/benchmarks/dynamo/common.py", line 3655, in process_entry
result = run(runner, args, original_dir)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jansel/pytorch/benchmarks/dynamo/common.py", line 4387, in run
runner.run_one_model(
File "/home/jansel/pytorch/benchmarks/dynamo/common.py", line 2966, in run_one_model
status = self.run_performance_test(
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jansel/pytorch/benchmarks/dynamo/common.py", line 2873, in run_performance_test
experiment(
TypeError: coverage_experiment() got an unexpected keyword argument 'batch_size'
```
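For context, a hedged illustration of this kind of mismatch (simplified, hypothetical function names; not the actual benchmark code):

```python
# The runner forwards batch_size to the experiment function, so an
# experiment whose signature doesn't accept it raises the TypeError above.
def experiment_strict(model, example_inputs):
    return "ok"

def experiment_tolerant(model, example_inputs, **kwargs):
    # extra keyword args such as batch_size are simply ignored
    return "ok"

# experiment_strict(model, inputs, batch_size=4)    # -> TypeError
# experiment_tolerant(model, inputs, batch_size=4)  # -> fine
```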
Pull Request resolved: https://github.com/pytorch/pytorch/pull/170009
Approved by: https://github.com/Lucaskabela
ghstack dependencies: #170004
This pull request adds comprehensive documentation to the operator benchmark suite, detailing how CI regression tracking is performed for both CPU and GPU devices. The new section in the `README.md` explains the workflows, devices, operators tracked, schedules, triggers, and instructions for manually running benchmarks. This update will help contributors understand how performance regressions are monitored and how to interact with the CI workflows.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/168145
Approved by: https://github.com/malfet, https://github.com/huydhn
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Joel Schlosser <75754324+jbschlosser@users.noreply.github.com>
Summary:
## Tests
Standalone: `python -m torchbenchmark.models.modded_nanogpt.main`
Through dynamo benchmarks: `python benchmarks/dynamo/torchbench.py --performance --training --amp --backend inductor --device cuda --only modded_nanogpt --disable-cudagraphs`
This PR adds a tweaked version of the Aug 23rd record for the nanogpt speedrun (GPT-2 small variant): 9d9dc969c4/train_gpt.py.
The later records cannot be run without building FA3 from source, so we will omit them until the dynamo FA3 PR is merged.
The tweaks library-ify the script by commenting out everything other than the model class definitions, change the pg initialization to use a fake pg, and constant-ify some hyperparameters.
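A hedged sketch of the fake process-group pattern (based on PyTorch's test utilities; the rank/world-size values are illustrative, not the exact code added here):

```python
import torch.distributed as dist
# Importing fake_pg registers the "fake" backend; its collectives are no-ops,
# so a distributed training script can run in a single process.
from torch.testing._internal.distributed.fake_pg import FakeStore

dist.init_process_group(backend="fake", rank=0, world_size=8, store=FakeStore())
```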
The tests run locally, but this model specifically requires H100. I wasn't sure how to filter for that, so I skipped all the tests. This will be tested on the dynamo benchmark side: https://github.com/pytorch/pytorch/pull/169449.
X-link: https://github.com/pytorch/benchmark/pull/2660
Differential Revision: D88233265
Pulled By: xmfan
Pull Request resolved: https://github.com/pytorch/pytorch/pull/169505
Approved by: https://github.com/BoyuanFeng
Following the example from #149932 and the doc in [README.md](benchmarks/dynamo/pr_time_benchmarks/README.md):
`cd benchmarks/dynamo/pr_time_benchmarks`
`PYTHONPATH=./:../../../ python benchmarks/dtensor.py a`
Currently outputs:
```
collecting instruction count for dtensor_dispatch_detach
instruction count for iteration 0 is 14919468
instruction count for iteration 1 is 136283
instruction count for iteration 2 is 133750
instruction count for iteration 3 is 133757
instruction count for iteration 4 is 133751
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/167394
Approved by: https://github.com/laithsakka
It's nice to add a curve with customized compilation options so that we can compare the perf improvement of new features side-by-side.
E.g., for mix-order-reduction, by running the following command
```
python benchmarks/dynamo/genai_layers/benchmark.py --tolerance=1e-2 --exit-on-accuracy-failure --visualize rmsnorm_backward --custom-compile-name="compiled-no-fusion" --custom-compile-options='{"triton.mix_order_reduction":false}'
```
I get the following output:
```
Geomean speedup for benchmark RMSNormBackward
eager 11 data points
compiled 11 data points, 15.82x speedup
quack 11 data points, 15.45x speedup
liger 11 data points, 14.06x speedup
compiled-no-fusion 11 data points, 10.26x speedup
```
The output shows that the feature improves perf by `15.82 / 10.26 = 1.54x` on average across all the shapes tested. (I removed the (32768, 32768) shape, whose rnumel is too large to be representative.)
The new curve also shows up in the figure:
<img width="3564" height="2368" alt="RMSNormBackward_bench" src="https://github.com/user-attachments/assets/1ffac2bc-e726-4f1e-806d-e9e5de711492" />
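For reference, the same knob can be flipped directly through torch.compile's options dict outside the benchmark script (a hedged sketch: the workload function and shapes are stand-ins; only the "triton.mix_order_reduction" key comes from the command above):

```python
import torch

def rmsnorm_backward_proxy(x, w):
    # stand-in workload; the benchmark's real kernel is RMSNorm backward
    return torch.nn.functional.rms_norm(x, (x.shape[-1],), weight=w).sum()

# default inductor config
compiled = torch.compile(rmsnorm_backward_proxy)

# same workload with the option disabled, matching the "compiled-no-fusion" curve
compiled_no_fusion = torch.compile(
    rmsnorm_backward_proxy,
    options={"triton.mix_order_reduction": False},
)

x = torch.randn(1024, 2048, device="cuda", requires_grad=True)
w = torch.randn(2048, device="cuda", requires_grad=True)
compiled_no_fusion(x, w).backward()  # backward is the pass the benchmark measures
```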
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166697
Approved by: https://github.com/BoyuanFeng
ghstack dependencies: #166053, #166382, #166461, #166585, #166675