Commit Graph

2255 Commits

Author SHA1 Message Date
jainapurva
fdccb593c8 Add normalization and activation ops to operator benchmarks (#169544)
We're adding some more ops to the benchmark suite (a rough timing sketch follows the lists):

Normalization ops:
- LayerNorm
- RMSNorm
- BatchNorm1d
- BatchNorm2d
- BatchNorm3d
- GroupNorm

Activation ops:
- nn.GELU
- nn.SiLU
- nn.ReLU
- nn.LeakyReLU
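
For a rough sense of how one of these ops can be timed, a minimal sketch using `torch.utils.benchmark` (shapes and device are illustrative assumptions; the suite itself uses PyTorch's operator benchmark harness):

```
import torch
from torch.utils.benchmark import Timer

# Illustrative timing of one of the new ops (shapes and device are
# arbitrary choices, not the suite's actual configs).
layer_norm = torch.nn.LayerNorm(1024).cuda()
x = torch.randn(64, 1024, device="cuda")

timer = Timer(stmt="layer_norm(x)", globals={"layer_norm": layer_norm, "x": x})
print(timer.timeit(100))
```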
Pull Request resolved: https://github.com/pytorch/pytorch/pull/169544
Approved by: https://github.com/slayton58
2025-12-22 16:41:45 +00:00
shunting314
96b3e7d789 [ez] log why fail to gen golden ref (#170843)
Sometimes when running an accuracy test with a large batch size, it fails because the golden ref cannot be generated. It would be helpful to log why we failed to generate the golden ref. In my case, it was due to an OOM.
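
A minimal sketch of the idea, with hypothetical names (not the actual diff):

```
import logging

logging.basicConfig(level=logging.WARNING)
log = logging.getLogger("accuracy")

def generate_golden_ref():
    # stand-in for the real golden-ref computation, which can OOM
    raise RuntimeError("CUDA out of memory")

try:
    golden = generate_golden_ref()
except Exception as e:
    # the change: record *why* the golden ref could not be generated
    log.warning("failed to generate golden ref: %s", e)
```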

Pull Request resolved: https://github.com/pytorch/pytorch/pull/170843
Approved by: https://github.com/malfet, https://github.com/Skylion007
2025-12-19 21:00:07 +00:00
Markus Hoehnerbach
b9bf9a52f3 benchmarks: remove torchbench pin as it does not exist (#170304)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/170304
Approved by: https://github.com/eellison
2025-12-19 20:37:32 +00:00
eellison
e5b9bcedfe [inductor] Fix cudagraph skip for index_put_ with boolean indices, graph partitioning logic (#170103)
A few fixes:

- we weren't partitioning around index_put_ with boolean indices

- graph partitioning was skipping the whole graph any time V.graph.disable_cudagraphs_reason was set. There's no reason to do that when partitioning is available. I've updated the skip logic to account for all of the reasons we would set a disable-cudagraphs reason.

- Pruned the deterministic disable-cudagraphs reason. I'm not sure how this list of ops got there originally, but I've added OpInfo tests showing they're cudagraphable.

Applying only a subset of these fixes surfaces unrelated errors, so I'm landing this as one PR.
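
A hypothetical minimal repro for the first bullet (an assumption for illustration, not the PR's test):

```
import torch

# A boolean-mask index_put_ is data-dependent, so cudagraphs should
# partition around it instead of skipping the whole graph.
@torch.compile(mode="reduce-overhead")
def f(x, mask):
    x.index_put_((mask,), torch.tensor(0.0, device=x.device))
    return x.sin()

x = torch.randn(32, device="cuda")
print(f(x, x > 0))
```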

Fixes #169951

Pull Request resolved: https://github.com/pytorch/pytorch/pull/170103
Approved by: https://github.com/BoyuanFeng
2025-12-18 22:43:04 +00:00
Tomasz Bohutyn
7d355795e4 Raise XPU tolerances for bf16 ResNet & BotNet TorchBench (#170552)
Multiple TorchBench models on XPU fail accuracy tests because the numeric tolerances are too strict. Two contributing factors were identified:

1. Measurement methodology change (PyTorch 2.6.0 enforcing cosine_similarity https://github.com/pytorch/pytorch/blob/main/benchmarks/dynamo/common.py#L2227) surfaced limitations and increased sensitivity in error checks for phlippe_resnet.
2. BatchNorm decomposition noise (~1e-5 RMSE per BN in fp16) accumulates through the iteration in botnet26t_256, pushing aggregate diffs beyond current thresholds.

**Analysis**

- phlippe_resnet failures reproduce across CPU and XPU; fp16 already uses higher tolerance, implying bf16 thresholds are misaligned.
- Disabling BN decomposition brings botnet26t_256 outputs within tolerance; with decomposition enabled, cumulative numeric error is expected.
- CI health indicates changes are non-disruptive; failures, where present, are unrelated to these PRs.
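
For context, a hedged sketch of a cosine-similarity-style accuracy check (illustrative only; the harness's actual check in benchmarks/dynamo/common.py differs in detail):

```
import torch
import torch.nn.functional as F

# A cosine-similarity based check is far more sensitive to small
# accumulated numeric error than an element-wise allclose, which is
# why the tolerances needed raising.
def cosine_ok(ref: torch.Tensor, res: torch.Tensor, threshold: float = 0.99) -> bool:
    sim = F.cosine_similarity(ref.flatten().float(), res.flatten().float(), dim=0)
    return sim.item() >= threshold
```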

Fixes https://github.com/intel/torch-xpu-ops/issues/1799
Fixes https://github.com/intel/torch-xpu-ops/issues/1305

Pull Request resolved: https://github.com/pytorch/pytorch/pull/170552
Approved by: https://github.com/EikanWang, https://github.com/desertfire

Co-authored-by: Tomasz Bohutyn <tbohutyn@habana.ai>
2025-12-17 21:04:17 +00:00
Jeff Daily
4fb120133c [ROCm][CI] update ci expected dynamo benchmark results (#170469)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/170469
Approved by: https://github.com/desertfire
2025-12-16 03:34:07 +00:00
PyTorch MergeBot
6c75344a1d Revert "[inductor] Fix cudagraph skip for index_put_ with boolean indices, graph partitioning logic (#170103)"
This reverts commit d3008cfb3c.

Reverted https://github.com/pytorch/pytorch/pull/170103 on behalf of https://github.com/izaitsevfb due to breaks inductor:cudagraph_trees_expandable_segments tests internally, see [D89125296](https://www.internalfb.com/diff/D89125296) ([comment](https://github.com/pytorch/pytorch/pull/170103#issuecomment-3652993720))
2025-12-15 05:06:03 +00:00
Bin Bao
a773df9943 [CI] Another update to rocm expected result files (#170309)
Summary: From `python benchmarks/dynamo/ci_expected_accuracy/update_expected.py 47b09ca1c35d507849c4eb37ac1524e395ce39a2`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/170309
Approved by: https://github.com/eellison
ghstack dependencies: #170348
2025-12-14 21:23:41 +00:00
eellison
d3008cfb3c [inductor] Fix cudagraph skip for index_put_ with boolean indices, graph partitioning logic (#170103)
A few fixes:

- we weren't partitioning around index_put_ with boolean indices

- graph partitioning was skipping the whole graph any time V.graph.disable_cudagraphs_reason was set. There's no reason to do that when partitioning is available. I've updated the skip logic to account for all of the reasons we would set a disable-cudagraphs reason.

- Pruned the deterministic disable-cudagraphs reason. I'm not sure how this list of ops got there originally, but I've added OpInfo tests showing they're cudagraphable.

Applying only a subset of these fixes surfaces unrelated errors, so I'm landing this as one PR.

Fixes #169951

Pull Request resolved: https://github.com/pytorch/pytorch/pull/170103
Approved by: https://github.com/BoyuanFeng
2025-12-13 01:45:14 +00:00
Bin Bao
877a3a56f3 [CI] Update update_expected.py to skip cuda-13 results (#170348)
Summary: Since cuda-13 runs skip more tests, we should only use cuda-12 runs to update the expected result files.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/170348
Approved by: https://github.com/eellison
2025-12-13 00:02:19 +00:00
PyTorch MergeBot
8b21f924c3 Revert "[Inductor XPU GEMM] Step 1/N: Refactor cutlass configuration. (#160174)"
This reverts commit eabb7ad212.

Reverted https://github.com/pytorch/pytorch/pull/160174 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it seems to cause a perf regression ([comment](https://github.com/pytorch/pytorch/pull/160174#issuecomment-3644336456))
2025-12-12 00:13:49 +00:00
Jason Ansel
b83fc89703 [dynamo] Fix benchmarks/dynamo/common.py error (#170009)
`python benchmarks/dynamo/torchbench.py --performance --inference -k vision_maskrcnn`

was failing with:
```
Traceback (most recent call last):
  File "/home/jansel/pytorch/benchmarks/dynamo/torchbench.py", line 490, in <module>
    torchbench_main()
  File "/home/jansel/pytorch/benchmarks/dynamo/torchbench.py", line 486, in torchbench_main
    main(TorchBenchmarkRunner(), original_dir)
  File "/home/jansel/pytorch/benchmarks/dynamo/common.py", line 3730, in main
    process_entry(0, runner, original_dir, args)
  File "/home/jansel/pytorch/benchmarks/dynamo/common.py", line 3655, in process_entry
    result = run(runner, args, original_dir)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/benchmarks/dynamo/common.py", line 4387, in run
    runner.run_one_model(
  File "/home/jansel/pytorch/benchmarks/dynamo/common.py", line 2966, in run_one_model
    status = self.run_performance_test(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/benchmarks/dynamo/common.py", line 2873, in run_performance_test
    experiment(
TypeError: coverage_experiment() got an unexpected keyword argument 'batch_size'
```
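
One hedged sketch of the shape of such a fix, assuming the harness now forwards batch_size to every experiment (names are illustrative, not the actual diff):

```
# Hypothetical signature change: tolerate extra keyword arguments,
# such as batch_size, that the harness now passes to every experiment.
def coverage_experiment(model, example_inputs, **kwargs):
    batch_size = kwargs.get("batch_size")  # accepted; used or ignored as needed
    ...
```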

Pull Request resolved: https://github.com/pytorch/pytorch/pull/170009
Approved by: https://github.com/Lucaskabela
ghstack dependencies: #170004
2025-12-11 00:09:02 +00:00
Bin Bao
32324da2f5 [ci] Update expected result files for rocm (#170072)
Summary: Updated by running "python benchmarks/dynamo/ci_expected_accuracy/update_expected.py afb173d9b9440d804b5f77d0c291e53c720d1fcf".

Pull Request resolved: https://github.com/pytorch/pytorch/pull/170072
Approved by: https://github.com/jataylo, https://github.com/jeffdaily
2025-12-10 18:17:54 +00:00
Bin Bao
a764023aaf [CI] Update expected result files (#169860)
Summary: benchmarks/dynamo/ci_expected_accuracy/update_expected.py stopped working because it needs to deal with both cuda and rocm runs now. Fixed it and updated the result files for inductor-periodic jobs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/169860
Approved by: https://github.com/zou3519, https://github.com/huydhn
2025-12-10 14:34:14 +00:00
jainapurva
ab5b7fbfb9 Update operator benchmarks README (#168145)
This pull request adds comprehensive documentation to the operator benchmark suite, detailing how CI regression tracking is performed for both CPU and GPU devices. The new section in the `README.md` explains the workflows, devices, operators tracked, schedules, triggers, and instructions for manually running benchmarks. This update will help contributors understand how performance regressions are monitored and how to interact with the CI workflows.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/168145
Approved by: https://github.com/malfet, https://github.com/huydhn

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Joel Schlosser <75754324+jbschlosser@users.noreply.github.com>
2025-12-09 15:14:06 +00:00
Jack Taylor
b53d925706 Remove outdated flaky models and enable deterministic algorithms on ROCm (#169024)
Remove outdated flaky models and enable deterministic algorithms on ROCm

Pull Request resolved: https://github.com/pytorch/pytorch/pull/169024
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-12-09 13:24:38 +00:00
Ivan Zaitsev
27bf9eb778 Update expected results for basic_modules_ListOfLinears_eager (#169895)
This pull request updates the expected results for the compile time instruction count in the `benchmarks/dynamo/pr_time_benchmarks/expected_results.csv` file. The change reflects a lower instruction count for the `basic_modules_ListOfLinears_eager` benchmark.

This reflects an improvement after https://github.com/pytorch/pytorch/pull/169553 that is outside the tolerance and fails CI.

e81262f1b9/1

Pull Request resolved: https://github.com/pytorch/pytorch/pull/169895
Approved by: https://github.com/seemethere, https://github.com/atalman
2025-12-09 01:45:42 +00:00
cyy
5213a72bd3 Enable ruff SIM115 check (#169437)
This PR enables the ruff check for files opened without a context manager.
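
For illustration, the pattern the rule targets and the preferred form:

```
# The pattern SIM115 flags, and the fix:
f = open("results.txt")  # SIM115: file handle not managed by a context manager
data = f.read()
f.close()

with open("results.txt") as f:  # preferred: closed even if read() raises
    data = f.read()
```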

Pull Request resolved: https://github.com/pytorch/pytorch/pull/169437
Approved by: https://github.com/Lucaskabela, https://github.com/albanD
2025-12-05 01:58:13 +00:00
Simon Fan
e5fd7b7ac8 Add a single GPU variant of modded-nanogpt to torchbench (#169502) (#169505)
Summary:

## Tests
Standalone: `python -m torchbenchmark.models.modded_nanogpt.main`
Through dynamo benchmarks: `python benchmarks/dynamo/torchbench.py --performance --training --amp --backend inductor --device cuda --only modded_nanogpt --disable-cudagraphs`

This PR adds a tweaked version of the Aug 23rd record for the nanogpt speedrun (GPT-2 small variant): 9d9dc969c4/train_gpt.py.

The later records cannot be run without building FA3 from source, so we will omit them until the dynamo FA3 PR is merged.

The tweaks are to library-ify the script by commenting out everything other than the model class definitions, to change the pg initialization to use a fake pg, and to constant-ify some hyperparameters.
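
A sketch of the fake-pg tweak (the exact call site is an assumption):

```
import torch.distributed as dist
from torch.testing._internal.distributed.fake_pg import FakeStore

# The "fake" backend satisfies process-group initialization without
# performing any real communication, so the distributed training script
# can run on a single GPU.
dist.init_process_group(backend="fake", store=FakeStore(), rank=0, world_size=8)
```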

The tests run locally, but this model specifically requires H100. I wasn't sure how to filter for that, so I skipped all the tests. This will be tested on the dynamo benchmark side: https://github.com/pytorch/pytorch/pull/169449.

X-link: https://github.com/pytorch/benchmark/pull/2660

Differential Revision: D88233265

Pulled By: xmfan

Pull Request resolved: https://github.com/pytorch/pytorch/pull/169505
Approved by: https://github.com/BoyuanFeng
2025-12-04 20:25:02 +00:00
atalman
a36e1d39eb Triton 3.6 pin update (#168096)
Required for release 2.10

Rocm wheel build fix provided by: https://github.com/pytorch/pytorch/pull/169369

Pull Request resolved: https://github.com/pytorch/pytorch/pull/168096
Approved by: https://github.com/njriasan, https://github.com/malfet, https://github.com/huydhn
2025-12-04 15:09:20 +00:00
xinan.lin
eabb7ad212 [Inductor XPU GEMM] Step 1/N: Refactor cutlass configuration. (#160174)
This PR is the first step toward implementing RFC #160175.
Currently, all Cutlass-related Torch Inductor configs are located in `torch._inductor.config.cuda`. This PR refactors the device-agnostic Cutlass configurations into `torch._inductor.config.cutlass`, so they can be shared and reused by XPU as well.
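
A hedged illustration of the intended config move (the option name below is an assumption based on the existing `cuda.cutlass_dir` setting):

```
import torch._inductor.config as inductor_config

# Device-agnostic Cutlass settings move from the cuda namespace to a
# shared cutlass namespace (option names are assumptions):
# before the refactor:
#   inductor_config.cuda.cutlass_dir
# after the refactor:
#   inductor_config.cutlass.cutlass_dir
```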

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160174
Approved by: https://github.com/EikanWang, https://github.com/mlazos, https://github.com/jansel
2025-12-04 08:53:43 +00:00
William Wen
c55b1e8f61 [dynamo, guards] cache Source hashing (#168886)
Final fix for https://github.com/pytorch/pytorch/issues/168118. Decreases guard build time from 9s -> 0.5s on a local tlparse. On the guard build benchmark, time went from 50.18s -> 8.15s
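
An illustrative sketch of the caching idea (an assumed simplification, not dynamo's actual Source class):

```
# Sources are immutable, so an expensive hash can be computed once and
# memoized on the instance instead of recomputed on every guard lookup.
class Source:
    __slots__ = ("base", "attr", "_cached_hash")

    def __init__(self, base: str, attr: str) -> None:
        self.base = base
        self.attr = attr
        self._cached_hash = None

    def __hash__(self) -> int:
        if self._cached_hash is None:
            self._cached_hash = hash((type(self), self.base, self.attr))
        return self._cached_hash
```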

Pull Request resolved: https://github.com/pytorch/pytorch/pull/168886
Approved by: https://github.com/anijain2305
ghstack dependencies: #168131, #168203, #168386
2025-12-04 05:22:52 +00:00
Simon Fan
7eb6259200 [dynamo][benchmarks] add option to force amp to use bfloat16 instead of float16 (#169449)
For models which hardcode bf16, like modded-nanogpt.
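
What the option effectively selects, as a sketch:

```
import torch

# Run the AMP region in bfloat16 rather than the harness's default float16.
model = torch.nn.Linear(16, 16).cuda()
x = torch.randn(8, 16, device="cuda")
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    out = model(x)
print(out.dtype)  # torch.bfloat16
```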

Tested on `python benchmarks/dynamo/torchbench.py --performance --training --amp --backend inductor --device cuda --only modded_nanogpt --disable-cudagraphs` w/ https://github.com/pytorch/benchmark/pull/2660

Pull Request resolved: https://github.com/pytorch/pytorch/pull/169449
Approved by: https://github.com/Lucaskabela
2025-12-03 23:32:07 +00:00
PyTorch MergeBot
fdf863d5e1 Revert "Triton 3.6 pin update (#168096)"
This reverts commit 93d0d6838c.

Reverted https://github.com/pytorch/pytorch/pull/168096 on behalf of https://github.com/atalman due to Causes timeouts https://github.com/pytorch/pytorch/issues/169492 ([comment](https://github.com/pytorch/pytorch/pull/168096#issuecomment-3609092057))
2025-12-03 22:23:29 +00:00
William Wen
7f55ba19c4 [dynamo, guards] add guard builder microbenchmark (#169087)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/169087
Approved by: https://github.com/anijain2305
ghstack dependencies: #167888
2025-12-02 23:28:02 +00:00
atalman
93d0d6838c Triton 3.6 pin update (#168096)
Required for release 2.10

Rocm wheel build fix provided by: https://github.com/pytorch/pytorch/pull/169369

Pull Request resolved: https://github.com/pytorch/pytorch/pull/168096
Approved by: https://github.com/njriasan, https://github.com/malfet
2025-12-02 17:28:48 +00:00
Boyuan Feng
a7dc6dab9a bump transformer pin to 4.57.3 (#169226)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/169226
Approved by: https://github.com/anijain2305
2025-12-01 18:58:14 +00:00
Yuanyuan Chen
f47dd0ddef Enable SIM118 (#167399)
This PR enables the `SIM118` rule of ruff, which flags key-existence tests written against `dict.keys()` calls.
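
For illustration:

```
d = {"a": 1}
if "a" in d.keys():  # flagged by SIM118
    print("found")
if "a" in d:  # preferred: same semantics without the .keys() call
    print("found")
```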

Pull Request resolved: https://github.com/pytorch/pytorch/pull/167399
Approved by: https://github.com/albanD
2025-11-28 08:00:09 +00:00
jainapurva
c9c8a8567d Add optimizer tests in operator microbenchmarks (#168101)
This PR adds comprehensive benchmarks for PyTorch optimizers to measure optimizer.step() performance across different parameter configurations.

### Optimizers benchmarked:
  - AdamW
  - Adam
  - SGD (with momentum=0.9)
  - RMSprop
  - Adagrad

### Test configurations:
- num_params: [1, 10, 100]
- param_size: [100K, 1M, 10M]
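
A hedged sketch of this kind of measurement (a minimal stand-in using `torch.utils.benchmark`; the suite itself uses the operator benchmark harness, and the sizes below mirror one of the configurations above):

```
import torch
from torch.utils.benchmark import Timer

# 10 parameters of 1M elements each, with gradients pre-populated so that
# optimizer.step() has real work to do.
params = [torch.randn(1_000_000, device="cuda", requires_grad=True) for _ in range(10)]
for p in params:
    p.grad = torch.randn_like(p)

opt = torch.optim.AdamW(params)
timer = Timer(stmt="opt.step()", globals={"opt": opt})
print(timer.timeit(100))
```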
Pull Request resolved: https://github.com/pytorch/pytorch/pull/168101
Approved by: https://github.com/slayton58
2025-11-23 20:13:33 +00:00
PyTorch MergeBot
3f0d46c8b0 Revert "[Inductor XPU GEMM] Step 1/N: Refactor cutlass configuration. (#160174)"
This reverts commit 008ac433b0.

Reverted https://github.com/pytorch/pytorch/pull/160174 on behalf of https://github.com/yangw-dev due to failed internal tests test_cpu_/test_cpu#link-tree/torch/utils/_config_module.py line 371, in _config = self._config[name] KeyError: 'cuda.cutlass_dir' Diff: D87660662 ([comment](https://github.com/pytorch/pytorch/pull/160174#issuecomment-3567237578))
2025-11-23 00:46:16 +00:00
linhaifeng
9fa3e6e513 [BugFix] Fix incorrect type hint. (#168892)
`tuple[int]` -> `tuple[int, ...]`: the former describes a tuple with exactly one int, the latter a tuple of any length.

For example: `shape: tuple[int, ...]  # [B, Hq, M, Hkv, N, D]`

Inspired by https://github.com/pytorch/pytorch/pull/168320

Pull Request resolved: https://github.com/pytorch/pytorch/pull/168892
Approved by: https://github.com/cyyever, https://github.com/Skylion007
2025-11-22 23:50:13 +00:00
Jason Ansel
b565593c62 [dynamo] Add optree.tree_map microbenchmark (#168341)
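For reference, a minimal usage example of the op being microbenchmarked (the input is illustrative):

```
import optree

# tree_map applies a function to every leaf of a nested pytree and
# rebuilds the same container structure around the results.
result = optree.tree_map(lambda x: x * 2, {"a": 1, "b": (2, [3, 4])})
print(result)  # {'a': 2, 'b': (4, [6, 8])}
```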
Pull Request resolved: https://github.com/pytorch/pytorch/pull/168341
Approved by: https://github.com/anijain2305
ghstack dependencies: #168340
2025-11-22 06:30:10 +00:00
xinan.lin
008ac433b0 [Inductor XPU GEMM] Step 1/N: Refactor cutlass configuration. (#160174)
This PR is the first step toward implementing RFC #160175.
Currently, all Cutlass-related Torch Inductor configs are located in `torch._inductor.config.cuda`. This PR refactors the device-agnostic Cutlass configurations into `torch._inductor.config.cutlass`, so they can be shared and reused by XPU as well.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160174
Approved by: https://github.com/EikanWang, https://github.com/mlazos, https://github.com/jansel
2025-11-21 15:44:24 +00:00
Fadi Arafeh
f97c3fc8e4 Re-enable ConvTranspose operator benchmarks for AArch64 (#166731)
This was disabled by #165585 due to #165654, which was fixed by #165904.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166731
Approved by: https://github.com/malfet
ghstack dependencies: #165904
2025-11-20 18:01:10 +00:00
shunting314
b288d0020b [inductor] unittest for run2run determinism (#167482)
Not sure if the paths are already set up properly so that 'benchmarks/dynamo/huggingface.py' can be called directly from a unit test. CI will tell.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/167482
Approved by: https://github.com/v0i0, https://github.com/mlazos
2025-11-17 20:12:15 +00:00
Adrian Abeyta
2f3bb7482c Improve benchmarks/dynamo:check_perf_csv output and failure summary (#161728)
Resolves https://github.com/pytorch/pytorch/issues/161290

## Summary

Expands `dynamo/check_perf_csv.py` output capabilities with latency, compile time and memory information:

- Displays the measured speedup and the % delta from the target
- Adds clear messaging when all model tests pass and no regression is found
- Adds error handling for a missing CSV file

### Example (Failing Check)

```bash
python benchmarks/dynamo/check_perf_csv.py -f reports-dir/inductor_training_smoketest.csv -t 1.40
```

**Example Output:**
```
Checking inductor_training_smoketest.csv (speedup threshold >= 1.40x)
hf_Bert                            speedup=1.005x, latency=390.8 ms/iter, compile=1.526s, mem_ratio=1.02x (eager=360.6 GB, dynamo=369.3 GB)
Error 1 model(s) performance regressed
    hf_Bert
  - hf_Bert: 1.005x (< 1.40x; -28.2% from target)
```

### Example (Passing Check)

```bash
python benchmarks/dynamo/check_perf_csv.py -f reports-dir/inductor_training_smoketest.csv -t 1.00
```

**Example Output:**
```
Checking inductor_training_smoketest.csv (speedup threshold >= 1.00x)
hf_Bert                            speedup=1.005x, latency=390.8 ms/iter, compile=1.526s, mem_ratio=1.02x (eager=360.6 GB, dynamo=369.3 GB)
All 1 model(s) passed threshold check (>= 1.00x)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161728
Approved by: https://github.com/isuruf
2025-11-17 17:54:29 +00:00
Richard Zou
bc60b86066 Skip stable diffusion models in torchbench, get tests and benchmarks green (#167896)
Test Plan:
- wait for CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/167896
Approved by: https://github.com/aorenste, https://github.com/shunting314
ghstack dependencies: #167609
2025-11-15 02:44:36 +00:00
Will Constable
782fc3c72b [DTensor] Add CPU instruction count benchmark for dispatch (#167394)
Following example from #149932 and doc in
[README.md](benchmarks/dynamo/pr_time_benchmarks/README.md)

```
cd benchmarks/dynamo/pr_time_benchmarks
PYTHONPATH=./:../../../ python benchmarks/dtensor.py a
```

Currently outputs:

```
collecting instruction count for dtensor_dispatch_detach
instruction count for iteration 0 is 14919468
instruction count for iteration 1 is 136283
instruction count for iteration 2 is 133750
instruction count for iteration 3 is 133757
instruction count for iteration 4 is 133751
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/167394
Approved by: https://github.com/laithsakka
2025-11-13 06:54:08 +00:00
jainapurva
f9851af59b Add Attention ops to CI (#165915)
This pull request introduces a new attention operator microbenchmark workflow to the CI system, enabling automated benchmarking and reporting for attention-related operations. The main change adds a new GitHub Actions workflow that publishes attention benchmarks to the existing PyTorch operator microbenchmark [dashboard](https://hud.pytorch.org/benchmark/v3/dashboard/pytorch_operator_microbenchmark?renderGroupId=main&time.start=2025-10-27T00%3A00%3A00.000Z&time.end=2025-10-29T01%3A00%3A00.000Z&filters.device=cuda&filters.arch=NVIDIA+A100-SXM4-40GB&filters.deviceName=cuda%7C%7CNVIDIA+A100-SXM4-40GB&filters.operatorName=&lcommit.commit=665df0bc7288996d638fcc3da750f8cb2addd6d0&lcommit.workflow_id=18888994873&lcommit.date=2025-10-29T00%3A00%3A00Z&lcommit.branch=refs%2Ftags%2Fciflow%2Fop-benchmark%2F165915&rcommit.commit=665df0bc7288996d638fcc3da750f8cb2addd6d0&rcommit.workflow_id=18888994873&rcommit.date=2025-10-29T00%3A00%3A00Z&rcommit.branch=refs%2Ftags%2Fciflow%2Fop-benchmark%2F165915&lbranch=refs%2Ftags%2Fciflow%2Fop-benchmark%2F165915&rbranch=refs%2Ftags%2Fciflow%2Fop-benchmark%2F165915).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165915
Approved by: https://github.com/jbschlosser
2025-11-13 05:30:04 +00:00
Jeff Daily
ed79693706 [ROCm][CI] dynamo benchmark repvgg_a2 is flaky (#167660)
Update dynamo results due to flaky model
https://github.com/pytorch/pytorch/actions/runs/19283051320/job/55139788014

Pull Request resolved: https://github.com/pytorch/pytorch/pull/167660
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-11-12 17:41:19 +00:00
Richard Zou
4714eb7021 Update dynamic_inductor_timm_training.csv (#167609)
These tests have been failing since they were added in https://github.com/pytorch/pytorch/pull/165381.

Evidence: scrolling back in HUD shows they were failing on that commit.

I'm going to (1) set the accuracy to get CI green and (2) file an issue
for this.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/167609
Approved by: https://github.com/choijon5, https://github.com/desertfire
2025-11-12 15:15:46 +00:00
PyTorch MergeBot
5ce4a8b49f Revert "fix wrong accuracy_status when exception. (#165731)"
This reverts commit bfcdbd0a97.

Reverted https://github.com/pytorch/pytorch/pull/165731 on behalf of https://github.com/zou3519 due to broke inductor periodic ([comment](https://github.com/pytorch/pytorch/pull/165731#issuecomment-3519743601))
2025-11-12 03:36:27 +00:00
Jeff Daily
9ae62fcc18 [ROCm][CI] dynamo benchmarks update ci expected accuracy (#167574)
repvgg_a2 IMPROVED: accuracy=pass, expected=fail_accuracy

Pull Request resolved: https://github.com/pytorch/pytorch/pull/167574
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-11-11 22:54:55 +00:00
linhaifeng
3e7a66fae1 [BugFix][Refactor] fix several instances which use f = open(...) without a corresponding f.close() (#167423)
This pattern can leak file descriptors, which can cause resource exhaustion or other unpredictable issues.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/167423
Approved by: https://github.com/cyyever, https://github.com/Skylion007
2025-11-11 11:27:59 +00:00
Xu Zhao
8d5cceeb6a [torchbench][optimus] Add backend optimus (#167357)
Summary: `--optimus [all | vertical_opt | horizontal_opt]` will kick off inductor compile with different fusion strategies.

Test Plan:
TorchBench Runner:

```
$ buck2 run mode/opt //pytorch/benchmark:run -- customized_optimus_illustrative -t train -d cuda
GPU Time per batch:   56.254 milliseconds
CPU Wall Time per batch:  56.326 milliseconds
CPU Wall Time:        56.326 milliseconds
Time to first batch:          420.0777 ms
GPU 0 Peak Memory:              0.0695 GB
CPU Peak Memory:              359.6362 GB
```

PT2 Benchmark Runner (comparing with eager):

```
buck2 run mode/opt //pytorch/benchmark:pt2 -- --only customized_optimus_illustrative  --performance --training --inductor

running benchmark: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 30/30 [00:02<00:00, 14.37it/s]
4.509x
```

eager latency: ~56 ms
inductor latency: ~11 ms

Optimus backend:

```
$ buck2 run mode/opt //pytorch/benchmark:pt2 -- --only customized_optimus_illustrative --performance --training --optimus all
11.02923508733511 ms, 13.884015614166856 ms, 0.794x
```

```
$ buck2 run mode/opt //pytorch/benchmark:pt2 -- --only customized_optimus_illustrative --performance --training --optimus vertical_opt
12.47156853787601 ms, 10.699485195800662 ms, 1.166x
```

```
$ buck2 run mode/opt //pytorch/benchmark:pt2 -- --only customized_optimus_illustrative --performance --training --optimus horizontal_opt
11.078484123572707 ms, 10.797873372212052 ms, 1.026x
```

optimus latency ~10 ms

Differential Revision: D86524903

Pull Request resolved: https://github.com/pytorch/pytorch/pull/167357
Approved by: https://github.com/mengluy0125
2025-11-11 00:35:30 +00:00
Nicolas De Carli
9d9e7c7b1c [Pytorch] Extend OSS conversion benchmarks (#167099)
Summary: We are extending the OSS conversion benchmarks to include all combinations between types.

Test Plan: CI

Differential Revision: D86315975

Pull Request resolved: https://github.com/pytorch/pytorch/pull/167099
Approved by: https://github.com/mcfi
2025-11-10 23:36:57 +00:00
linhaifeng
71606b289c [BugFix] Fix compute_error in coo_mean_time and csr_mean_time (#166795)
The csr timing loop was nested inside the coo loop, producing duplicated and inconsistent measurements.
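
A hedged sketch of the corrected structure (placeholder workloads, not the file's actual code):

```
import time

def mean_time(fn, runs=5):
    # time fn() `runs` times and return the mean
    total = 0.0
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        total += time.perf_counter() - start
    return total / runs

# Corrected structure: each format gets its own sibling loop, so neither
# measurement depends on the other.
coo_mean_time = mean_time(lambda: None)  # placeholder coo workload
csr_mean_time = mean_time(lambda: None)  # placeholder csr workload
print(coo_mean_time, csr_mean_time)
```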

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166795
Approved by: https://github.com/cyyever, https://github.com/ezyang
2025-11-08 23:57:15 +00:00
Apurva Jain
85fab6c9b0 Fix duplicate benchmarking entries for addmm (#166652)
There have been duplicate entries for addmm in the dashboard. This PR fixes the duplicate-entry issue.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166652
Approved by: https://github.com/yangw-dev
2025-11-06 03:25:03 +00:00
jainapurva
b8855e7b0b Add conv ops to operator microbenchmark (#166331)
Adding `conv` (conv1d, conv2d, conv3d) to the list of operator microbenchmarks run in the CI script (`.ci/pytorch/test.sh`), ensuring convolution operators are now benchmarked alongside existing ones.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166331
Approved by: https://github.com/huydhn, https://github.com/jbschlosser
2025-11-03 20:54:52 +00:00
Shunting Zhang
9f9dbe0a9a add a curve for customized compilation in the kernel benchmarking scripts (#166697)
It's nice to add a curve with customized compilation options so that we can compare the perf improvement of new features side by side.

E.g. for mix-order-reduction, by running the following command
```
python benchmarks/dynamo/genai_layers/benchmark.py --tolerance=1e-2 --exit-on-accuracy-failure --visualize rmsnorm_backward --custom-compile-name="compiled-no-fusion" --custom-compile-options='{"triton.mix_order_reduction":false}'
```

I get following output:
```
Geomean speedup for benchmark RMSNormBackward
  eager 11 data points
  compiled 11 data points, 15.82x speedup
  quack 11 data points, 15.45x speedup
  liger 11 data points, 14.06x speedup
  compiled-no-fusion 11 data points, 10.26x speedup
```

The output shows that the feature on average improves perf by `15.82 / 10.26 = 1.54x` across the shapes tested. (I removed a shape (32768, 32768) whose rnumel is too large and not representative.)

The new curve also shows up in the figure:
![RMSNormBackward_bench](https://github.com/user-attachments/assets/1ffac2bc-e726-4f1e-806d-e9e5de711492)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166697
Approved by: https://github.com/BoyuanFeng
ghstack dependencies: #166053, #166382, #166461, #166585, #166675
2025-11-01 22:09:56 +00:00