albanD
1913ee1aec
Assert in github, docs, setup and top ( #170598 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/170598
Approved by: https://github.com/ezyang , https://github.com/cyyever
2026-01-01 05:27:53 +00:00
Masaki Kozuki
28f22d94eb
Touch __init__.py in vendored_templates for CuTeDSL Grouped MM template ( #170566 )
...
This pull request makes a small improvement to the `mirror_inductor_external_kernels` function in `setup.py` to ensure that newly created directories are recognized as Python packages by `find_packages`.
* When creating a new directory for mirrored files, the code now adds an empty `__init__.py` file to the directory so that `find_packages` treats it as a submodule. Also added inclusion of `vendored_templates` for CuTeDSL.
This fixes the `ModuleNotFoundError` for `torch._inductor.kernel.vendored_templates`:
```
$ pytest -v -s -x test/inductor/test_cutedsl_grouped_mm.py
...
FAILED [2.2725s] test_cutedsl_grouped_mm.py::TestCuTeDSLGroupedGemm::test_grouped_gemm_assorted_layouts_layout_A_contiguous_layout_B_broadcasted - torch._inductor.exc.InductorError: ModuleNotFoundError: No module named 'torch._inductor.kernel.vendored_templates'
```
This error does not occur if PyTorch is installed in development mode.
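The fix described above can be sketched as follows. This is a minimal illustration of the technique, not the actual `setup.py` code; `mirror_kernel_dir` is a hypothetical helper name.

```python
import os


def mirror_kernel_dir(dst_dir: str) -> None:
    """Hypothetical sketch: when creating a directory for mirrored files,
    also touch an empty ``__init__.py`` so that setuptools'
    ``find_packages`` discovers the directory as a package/submodule."""
    os.makedirs(dst_dir, exist_ok=True)
    init_py = os.path.join(dst_dir, "__init__.py")
    if not os.path.exists(init_py):
        # "touch" the file: find_packages() skips directories without it
        open(init_py, "a").close()
```

Without the `__init__.py`, `find_packages` omits the directory from the wheel, which is exactly why the vendored templates were missing from non-development installs.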
Pull Request resolved: https://github.com/pytorch/pytorch/pull/170566
Approved by: https://github.com/Skylion007
2025-12-20 03:09:07 +00:00
Puneet Matharu
7a38744ffa
[AArch64][Build] allow missing cutlass file if CUDA disabled ( #167720 )
...
In a CUDA-disabled build of PyTorch, you may well want to wipe the `third_party/cutlass` directory. However, this can produce an error in `setup.py:mirror_inductor_external_kernels()`. This patch ignores the missing file if `USE_CUDA=0` is set before calling `setup.py`.
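The tolerance described above can be sketched like this. This is an assumed shape of the logic, not the actual patch; `maybe_copy` and the exact env-var parsing are illustrative.

```python
import os
import shutil


def maybe_copy(src: str, dst: str) -> None:
    """Hypothetical sketch: copy a cutlass file, but tolerate it being
    missing when the user disabled CUDA (e.g. USE_CUDA=0) and may have
    wiped third_party/cutlass entirely."""
    cuda_enabled = os.environ.get("USE_CUDA", "1") not in ("0", "OFF", "False")
    if not os.path.exists(src):
        if cuda_enabled:
            raise FileNotFoundError(src)
        return  # CUDA disabled: silently skip the missing cutlass file
    shutil.copyfile(src, dst)
```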
Pull Request resolved: https://github.com/pytorch/pytorch/pull/167720
Approved by: https://github.com/NikhilAPatel , https://github.com/fadara01 , https://github.com/aditew01
2025-12-08 13:02:31 +00:00
Mikayla Gawarecki
892640e25a
Hide all symbols (except stable/headeronly/shim) if TORCH_STABLE_ONLY is defined ( #167496 )
...
Fixes https://github.com/pytorch/pytorch/issues/161660
This extends the `TORCH_STABLE_ONLY` stopgap added in https://github.com/pytorch/pytorch/pull/161658
Pull Request resolved: https://github.com/pytorch/pytorch/pull/167496
Approved by: https://github.com/janeyx99 , https://github.com/malfet , https://github.com/atalman
2025-12-02 13:10:20 +00:00
PyTorch MergeBot
acf5b204b0
Revert "Hide all symbols (except stable/headeronly/shim) if TORCH_STABLE_ONLY is defined ( #167496 )"
...
This reverts commit 8f4dc30453 .
Reverted https://github.com/pytorch/pytorch/pull/167496 on behalf of https://github.com/atalman due to Failing validations - https://github.com/pytorch/test-infra/actions/runs/19513141127/job/55857898996 ([comment](https://github.com/pytorch/pytorch/pull/167496#issuecomment-3554287955 ))
2025-11-19 19:26:12 +00:00
PyTorch MergeBot
a097e166db
Revert "Error when non stable/headeronly/shim headers are included by stable extension ( #167855 )"
...
This reverts commit a0ccd3e5ff .
Reverted https://github.com/pytorch/pytorch/pull/167855 on behalf of https://github.com/atalman due to Failing validations ([comment](https://github.com/pytorch/pytorch/pull/167855#issuecomment-3553987894 ))
2025-11-19 17:59:50 +00:00
Mikayla Gawarecki
a0ccd3e5ff
Error when non stable/headeronly/shim headers are included by stable extension ( #167855 )
...
Address Nikita's offline comment on #167496
Pull Request resolved: https://github.com/pytorch/pytorch/pull/167855
Approved by: https://github.com/janeyx99
ghstack dependencies: #167496
2025-11-19 14:13:45 +00:00
Mikayla Gawarecki
8f4dc30453
Hide all symbols (except stable/headeronly/shim) if TORCH_STABLE_ONLY is defined ( #167496 )
...
Fixes https://github.com/pytorch/pytorch/issues/161660
This extends the `TORCH_STABLE_ONLY` stopgap added in https://github.com/pytorch/pytorch/pull/161658
Pull Request resolved: https://github.com/pytorch/pytorch/pull/167496
Approved by: https://github.com/janeyx99 , https://github.com/malfet
2025-11-19 14:13:45 +00:00
Tristan Rice
f6b54d8899
flight_recorder: move to torch.distributed ( #167782 )
...
Summary: This moves torchfrtrace to be under `torch.distributed.flight_recorder` instead of `tools.flight_recorder` as the `tools` package is not included in the torch wheels. This makes it so you can use fr trace analyze without using it from a source checkout
Test Plan:
```
buck run //caffe2/fb/flight_recorder:fr_trace
```
CI
Differential Revision: D87022129
Pull Request resolved: https://github.com/pytorch/pytorch/pull/167782
Approved by: https://github.com/fduwjj
2025-11-15 01:16:59 +00:00
PyTorch MergeBot
602102be50
Revert "Hide all symbols (except stable/headeronly/shim) if TORCH_STABLE_ONLY is defined ( #167496 )"
...
This reverts commit bc09a84150 .
Reverted https://github.com/pytorch/pytorch/pull/167496 on behalf of https://github.com/jeanschmidt due to trying to revert 165139, my intention is to land it again, so, will land this once both are reverted ([comment](https://github.com/pytorch/pytorch/pull/167496#issuecomment-3534641209 ))
2025-11-14 21:33:02 +00:00
Mikayla Gawarecki
bc09a84150
Hide all symbols (except stable/headeronly/shim) if TORCH_STABLE_ONLY is defined ( #167496 )
...
Fixes https://github.com/pytorch/pytorch/issues/161660
This extends the `TORCH_STABLE_ONLY` stopgap added in https://github.com/pytorch/pytorch/pull/161658
Pull Request resolved: https://github.com/pytorch/pytorch/pull/167496
Approved by: https://github.com/janeyx99
ghstack dependencies: #167495
2025-11-12 19:15:52 +00:00
Nikhil Patel
a4c7856112
[Inductor][Grouped Gemm] Add Blackwell CuTeDSL Kernel ( #167340 )
...
Summary: This is a reland of https://github.com/pytorch/pytorch/pull/165036 , which previously contained a minor bug in the logic that determined whether the kernel should be enabled. As a result, it was incorrectly activated on non-Blackwell GPUs.
Test Plan:
Inductor test (fbcode):
`INDUCTOR_TEST_DISABLE_FRESH_CACHE=1 TORCHINDUCTOR_CACHE_DIR=~/cutetest buck2 run mode/opt //caffe2/test/inductor:cutedsl_grouped_mm -c fbcode.nvcc_arch=b200a -c fbcode.enable_gpu_sections=true -c fbcode.platform010_cuda_version=12.8 -m "ovr_config//third-party/pypi/nvidia-cutlass-dsl/constraints:4.2.1"`
Tritonbench (fbcode):
`clear; CUDA_VISIBLE_DEVICES=7 TRITON_PRINT_AUTOTUNING=1 TRITON_ALWAYS_COMPILE=1 TORCH_LOGS=+inductor TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 TORCHINDUCTOR_MAX_AUTOTUNE_GEMM=1 buck2 run mode/opt //pytorch/tritonbench:run -c fbcode.nvcc_arch=b200a -c fbcode.enable_gpu_sections=true -c fbcode.platform010_cuda_version=12.8 -m "ovr_config//third-party/pypi/nvidia-cutlass-dsl/constraints:4.2.1" -- --op grouped_gemm --only aten_grouped_mm,preprocessed_pt2_cute_grouped_mm --precision bf16 --num-inputs 1 --metrics tflops,accuracy`
Tritonbench(oss):
`clear; CUDA_VISIBLE_DEVICES=2 TRITON_PRINT_AUTOTUNING=1 TRITON_ALWAYS_COMPILE=1 TORCH_LOGS=+inductor TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 TORCHINDUCTOR_MAX_AUTOTUNE_GEMM=1 python run.py --op grouped_gemm --only aten_grouped_mm,preprocessed_pt2_triton_grouped_mm --precision bf16 --num-inputs 1 --metrics tflops,accuracy`
Unit Tests(oss):
`clear; python test/inductor/test_cutedsl_grouped_mm.py`
Differential Revision: D86537373
Pull Request resolved: https://github.com/pytorch/pytorch/pull/167340
Approved by: https://github.com/jananisriram
2025-11-10 00:29:07 +00:00
PyTorch MergeBot
12860892f8
Revert "[Inductor][Grouped Gemm] Add Blackwell CuTeDSL Kernel ( #167182 )"
...
This reverts commit 77b70970f7 .
Reverted https://github.com/pytorch/pytorch/pull/167182 on behalf of https://github.com/NikhilAPatel due to breaks local source build ([comment](https://github.com/pytorch/pytorch/pull/167182#issuecomment-3503598156 ))
2025-11-07 16:45:23 +00:00
Nikhil Patel
77b70970f7
[Inductor][Grouped Gemm] Add Blackwell CuTeDSL Kernel ( #167182 )
...
Summary: This is a reland of https://github.com/pytorch/pytorch/pull/165036 , which previously contained a minor bug in the logic that determined whether the kernel should be enabled. As a result, it was incorrectly activated on non-Blackwell GPUs.
Test Plan:
Inductor test (fbcode):
`INDUCTOR_TEST_DISABLE_FRESH_CACHE=1 TORCHINDUCTOR_CACHE_DIR=~/cutetest buck2 run mode/opt //caffe2/test/inductor:cutedsl_grouped_mm -c fbcode.nvcc_arch=b200a -c fbcode.enable_gpu_sections=true -c fbcode.platform010_cuda_version=12.8 -m "ovr_config//third-party/pypi/nvidia-cutlass-dsl/constraints:4.2.1"`
Tritonbench (fbcode):
`clear; CUDA_VISIBLE_DEVICES=7 TRITON_PRINT_AUTOTUNING=1 TRITON_ALWAYS_COMPILE=1 TORCH_LOGS=+inductor TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 TORCHINDUCTOR_MAX_AUTOTUNE_GEMM=1 buck2 run mode/opt //pytorch/tritonbench:run -c fbcode.nvcc_arch=b200a -c fbcode.enable_gpu_sections=true -c fbcode.platform010_cuda_version=12.8 -m "ovr_config//third-party/pypi/nvidia-cutlass-dsl/constraints:4.2.1" -- --op grouped_gemm --only aten_grouped_mm,preprocessed_pt2_cute_grouped_mm --precision bf16 --num-inputs 1 --metrics tflops,accuracy`
Tritonbench(oss):
`clear; CUDA_VISIBLE_DEVICES=2 TRITON_PRINT_AUTOTUNING=1 TRITON_ALWAYS_COMPILE=1 TORCH_LOGS=+inductor TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 TORCHINDUCTOR_MAX_AUTOTUNE_GEMM=1 python run.py --op grouped_gemm --only aten_grouped_mm,preprocessed_pt2_triton_grouped_mm --precision bf16 --num-inputs 1 --metrics tflops,accuracy`
Unit Tests(oss):
`clear; python test/inductor/test_cutedsl_grouped_mm.py`
Differential Revision: D86376880
Pull Request resolved: https://github.com/pytorch/pytorch/pull/167182
Approved by: https://github.com/mlazos , https://github.com/jananisriram
2025-11-06 19:55:38 +00:00
PyTorch MergeBot
5c639466f7
Revert "[Inductor][Grouped Gemm] Add Blackwell CuTeDSL Kernel ( #167003 )"
...
This reverts commit 658c5f879c .
Reverted https://github.com/pytorch/pytorch/pull/167003 on behalf of https://github.com/atalman due to regressed vllm signal: [GH job link](https://github.com/pytorch/pytorch/actions/runs/19093785744/job/54553796743 ) [HUD commit link](658c5f879c ) ([comment](https://github.com/pytorch/pytorch/pull/167003#issuecomment-3491527704 ))
2025-11-05 14:30:15 +00:00
Nikhil Patel
658c5f879c
[Inductor][Grouped Gemm] Add Blackwell CuTeDSL Kernel ( #167003 )
...
Summary: This is a reland of https://github.com/pytorch/pytorch/pull/165036 , which previously contained a minor bug in the logic that determined whether the kernel should be enabled. As a result, it was incorrectly activated on non-Blackwell GPUs.
Test Plan:
Inductor test (fbcode):
`INDUCTOR_TEST_DISABLE_FRESH_CACHE=1 TORCHINDUCTOR_CACHE_DIR=~/cutetest buck2 run mode/opt //caffe2/test/inductor:cutedsl_grouped_mm -c fbcode.nvcc_arch=b200a -c fbcode.enable_gpu_sections=true -c fbcode.platform010_cuda_version=12.8 -m "ovr_config//third-party/pypi/nvidia-cutlass-dsl/constraints:4.2.1"`
Tritonbench (fbcode):
`clear; CUDA_VISIBLE_DEVICES=7 TRITON_PRINT_AUTOTUNING=1 TRITON_ALWAYS_COMPILE=1 TORCH_LOGS=+inductor TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 TORCHINDUCTOR_MAX_AUTOTUNE_GEMM=1 buck2 run mode/opt //pytorch/tritonbench:run -c fbcode.nvcc_arch=b200a -c fbcode.enable_gpu_sections=true -c fbcode.platform010_cuda_version=12.8 -m "ovr_config//third-party/pypi/nvidia-cutlass-dsl/constraints:4.2.1" -- --op grouped_gemm --only aten_grouped_mm,preprocessed_pt2_cute_grouped_mm --precision bf16 --num-inputs 1 --metrics tflops,accuracy`
Tritonbench(oss):
`clear; CUDA_VISIBLE_DEVICES=2 TRITON_PRINT_AUTOTUNING=1 TRITON_ALWAYS_COMPILE=1 TORCH_LOGS=+inductor TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 TORCHINDUCTOR_MAX_AUTOTUNE_GEMM=1 python run.py --op grouped_gemm --only aten_grouped_mm,preprocessed_pt2_triton_grouped_mm --precision bf16 --num-inputs 1 --metrics tflops,accuracy`
Unit Tests(oss):
`clear; python test/inductor/test_cutedsl_grouped_mm.py`
Differential Revision: D86231180
Pull Request resolved: https://github.com/pytorch/pytorch/pull/167003
Approved by: https://github.com/jananisriram
2025-11-05 06:51:30 +00:00
PyTorch MergeBot
d77c24caac
Revert "[Inductor][Grouped Gemm] Add Blackwell CuTeDSL Kernel ( #165036 )"
...
This reverts commit 0e1a88904f .
Reverted https://github.com/pytorch/pytorch/pull/165036 on behalf of https://github.com/atalman due to regressed vllm signal: [GH job link](https://github.com/pytorch/pytorch/actions/runs/19059329909/job/54439919668 ) [HUD commit link](0e1a88904f ) ([comment](https://github.com/pytorch/pytorch/pull/165036#issuecomment-3487846555 ))
2025-11-04 20:13:33 +00:00
Nikhil Patel
0e1a88904f
[Inductor][Grouped Gemm] Add Blackwell CuTeDSL Kernel ( #165036 )
...
Make sure you're on cutlass 4.2.0+
Test Plan:
Tritonbench(oss):
`clear; CUDA_VISIBLE_DEVICES=2 TRITON_PRINT_AUTOTUNING=1 TRITON_ALWAYS_COMPILE=1 TORCH_LOGS=+inductor TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 TORCHINDUCTOR_MAX_AUTOTUNE_GEMM=1 python run.py --op grouped_gemm --only aten_grouped_mm,preprocessed_pt2_triton_grouped_mm --precision bf16 --num-inputs 1 --metrics tflops,accuracy`
Unit Tests(oss):
`clear; python test/inductor/test_cutedsl_grouped_mm.py`
Differential Revision: D82010227
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165036
Approved by: https://github.com/alexsamardzic , https://github.com/drisspg , https://github.com/mlazos
2025-11-04 05:58:58 +00:00
linhaifeng
369f2d6951
[3/N] fix typo in other folders ( #166606 )
...
fix typo in other folders
#166374
#166126
_typos.toml
```bash
[files]
extend-exclude = ["tools/linter/dictionary.txt"]
[default.extend-words]
nd = "nd"
arange = "arange"
Nd = "Nd"
GLOBALs = "GLOBALs"
hte = "hte"
iy = "iy"
PN = "PN"
Dout = "Dout"
optin = "optin"
gam = "gam"
PTD = "PTD"
Sur = "Sur"
nin = "nin"
tme = "tme"
inpt = "inpt"
mis = "mis"
Raison = "Raison"
ouput = "ouput"
nto = "nto"
Onwer = "Onwer"
callibrate = "callibrate"
ser = "ser"
Metdata = "Metdata"
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166606
Approved by: https://github.com/ezyang
2025-10-30 10:30:40 +00:00
Jerry Mannil
202f83dc4e
[ROCm][layer_norm] Use __builtin_amdgcn_rcpf(x) instead of 1.f/x ( #165589 )
...
Replace (more) exact calculation with hardware approximation.
Benefits:
Reduced code size.
Improved performance for certain scenarios.
Experiments show only a small reduction in precision.
Experiments show no significant performance regressions. bfloat16- and float16-related calculations may benefit significantly from this change.
Co-author: @mhalk @amd-hhashemi
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165589
Approved by: https://github.com/jeffdaily
2025-10-17 09:12:30 +00:00
Murray Steele
0fd976b65c
Enable mimalloc on non-Windows platforms and make default for AArch64 builds ( #164741 )
...
This change removes the Windows requirement for mimalloc builds, and makes mimalloc the default c10 system allocator for AArch64 builds. This significantly improves the performance of AArch64 builds of PyTorch as large allocations are better cached by mimalloc than glibc.
**Updated Results**
Torchbench FP32 eager Inference, 16 threads:
<img width="1510" height="733" alt="mimalloc-v2-fp32-diff" src="https://github.com/user-attachments/assets/7fe3ea0c-3b52-42e7-879b-612444479c90 " />
Torchbench BF16 eager Inference, 16 threads:
<img width="1510" height="733" alt="mimalloc-v2-bf16-diff" src="https://github.com/user-attachments/assets/56469a72-9e06-4d57-ae2a-aeb139ca79a3 " />
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164741
Approved by: https://github.com/fadara01 , https://github.com/aditew01 , https://github.com/malfet
2025-10-09 20:49:46 +00:00
PyTorch MergeBot
688efd9741
Revert "Enable mimalloc on non-Windows platforms and make default for AArch64 builds ( #164741 )"
...
This reverts commit 87eccf10e8 .
Reverted https://github.com/pytorch/pytorch/pull/164741 on behalf of https://github.com/malfet due to But it breaks MacOS builds, see https://github.com/pytorch/pytorch/actions/runs/18382886648/job/52373781138 ([comment](https://github.com/pytorch/pytorch/pull/164741#issuecomment-3386859778 ))
2025-10-09 17:30:25 +00:00
Murray Steele
87eccf10e8
Enable mimalloc on non-Windows platforms and make default for AArch64 builds ( #164741 )
...
This change removes the Windows requirement for mimalloc builds, and makes mimalloc the default c10 system allocator for AArch64 builds. This significantly improves the performance of AArch64 builds of PyTorch as large allocations are better cached by mimalloc than glibc.
**Updated Results**
Torchbench FP32 eager Inference, 16 threads:
<img width="1510" height="733" alt="mimalloc-v2-fp32-diff" src="https://github.com/user-attachments/assets/7fe3ea0c-3b52-42e7-879b-612444479c90 " />
Torchbench BF16 eager Inference, 16 threads:
<img width="1510" height="733" alt="mimalloc-v2-bf16-diff" src="https://github.com/user-attachments/assets/56469a72-9e06-4d57-ae2a-aeb139ca79a3 " />
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164741
Approved by: https://github.com/fadara01 , https://github.com/aditew01 , https://github.com/malfet
2025-10-09 16:45:31 +00:00
atalman
98c4e35f14
[CD] Add statically linked windows libraries to exclude list ( #163768 )
...
Fixes: https://github.com/pytorch/pytorch/issues/159514
Seeing following in the Wheel build logs:
```
Linking CXX static library lib\kineto.lib
Linking CXX static library lib\dnnl.lib
....
```
These files are around 800 MB uncompressed and 109 MB compressed, so excluding them provides a ~50% size reduction for Windows CPU builds.
Test Plan: Build Pytorch Windows binary. Build vision, audio and torchcodec with this binary. Smoke test.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163768
Approved by: https://github.com/albanD , https://github.com/malfet
2025-09-25 14:03:14 +00:00
Edward Yang
2c5a3d7e60
Delete functorch C extension entirely. ( #163340 )
...
Signed-off-by: Edward Yang <ezyang@meta.com >
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163340
Approved by: https://github.com/aorenste , https://github.com/wdvr , https://github.com/albanD , https://github.com/malfet
2025-09-24 06:08:58 +00:00
Nikita Shulga
5e7be98800
[BE] Update Python min version to 3.10 ( #162310 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162310
Approved by: https://github.com/atalman , https://github.com/Skylion007 , https://github.com/ZainRizvi
2025-09-22 17:04:21 +00:00
PyTorch MergeBot
10adeb9044
Revert "[BE] Update Python min version to 3.10 ( #162310 )"
...
This reverts commit 9f5a644f07 .
Reverted https://github.com/pytorch/pytorch/pull/162310 on behalf of https://github.com/malfet due to Broke lint, but to the best of my knowledge it's no longer possible to run lint for all files on PRs ([comment](https://github.com/pytorch/pytorch/pull/162310#issuecomment-3319289031 ))
2025-09-22 14:13:59 +00:00
Nikita Shulga
9f5a644f07
[BE] Update Python min version to 3.10 ( #162310 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162310
Approved by: https://github.com/atalman , https://github.com/Skylion007 , https://github.com/ZainRizvi
2025-09-22 13:37:02 +00:00
PyTorch MergeBot
ae5be038a6
Revert "Delete functorch C extension entirely. ( #163340 )"
...
This reverts commit 1faf6367e3 .
Reverted https://github.com/pytorch/pytorch/pull/163340 on behalf of https://github.com/wdvr due to temporary revert to pull out #162659 ([comment](https://github.com/pytorch/pytorch/pull/163340#issuecomment-3317105243 ))
2025-09-22 06:20:04 +00:00
Edward Yang
1faf6367e3
Delete functorch C extension entirely. ( #163340 )
...
Signed-off-by: Edward Yang <ezyang@meta.com >
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163340
Approved by: https://github.com/aorenste
ghstack dependencies: #160236
2025-09-21 06:02:21 +00:00
PyTorch MergeBot
578047838c
Revert "[BE] Update Python min version to 3.10 ( #162310 )"
...
This reverts commit 3016616ccb .
Reverted https://github.com/pytorch/pytorch/pull/162310 on behalf of https://github.com/malfet due to Breaks some windows tests ([comment](https://github.com/pytorch/pytorch/pull/162862#issuecomment-3310606135 ))
2025-09-19 05:16:49 +00:00
Nikita Shulga
3016616ccb
[BE] Update Python min version to 3.10 ( #162310 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162310
Approved by: https://github.com/atalman , https://github.com/Skylion007 , https://github.com/ZainRizvi
ghstack dependencies: #162862
2025-09-19 04:28:56 +00:00
Robert Hardwick
1aeac304b8
Move prioritized text linker optimization code from setup.py to cmake ( #160078 )
...
Note. This is a replica PR of #155901 which will be closed. I had to create a new PR in order to add it into my ghstack as there are some later commits which depend on it.
### Summary
🚀 This PR moves the prioritized text linker optimization from setup.py to cmake ( and enables by default on Linux aarch64 systems )
This change consolidates what was previously manual CI logic into a single location (cmake), ensuring consistent behavior across local builds, CI pipelines, and developer environments.
### Motivation
Prioritized text layout has measurable performance benefits on Arm systems by reducing code padding and improving cache utilization. This optimization was previously triggered manually via CI scripts (.ci/aarch64_linux/aarch64_ci_build.sh) or user-set environment variables. By detecting the target architecture within setup.py, this change enables the optimization automatically where applicable, improving maintainability and usability.
Note:
Due to ninja/cmake graph generation issues we cannot apply the linker file globally to all targets, so the targets must be manually defined. See CMakeLists.txt: the main libraries torch_python, torch, torch_cpu, torch_cuda, and torch_xpu have been targeted, which should be enough to maintain the performance benefits outlined above.
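The automatic architecture detection described above can be sketched as follows. The opt-in environment variable name `USE_PRIORITIZED_TEXT_FOR_LD` is assumed from the surrounding discussion; treat it as illustrative.

```python
import os
import platform
import sys


def should_use_prioritized_text_linker() -> bool:
    """Hypothetical sketch of the auto-detection: enable the prioritized-text
    linker script by default on Linux aarch64, or when the user explicitly
    opts in via an (assumed) environment variable."""
    if os.environ.get("USE_PRIORITIZED_TEXT_FOR_LD") == "1":
        return True  # explicit user opt-in on any platform
    # default-on only for Linux aarch64, where the layout benefits are measured
    return sys.platform == "linux" and platform.machine() == "aarch64"
```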
Co-authored-by: Usamah Zaheer <usamah.zaheer@arm.com >
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160078
Approved by: https://github.com/seemethere
2025-09-18 17:09:48 +00:00
PyTorch MergeBot
94db2ad51d
Revert "Move prioritized text linker optimization code from setup.py to cmake ( #160078 )"
...
This reverts commit 26b3ae5890 .
Reverted https://github.com/pytorch/pytorch/pull/160078 on behalf of https://github.com/atalman due to Sorry reverting this broke linux aarch64 CUDA nightlies [pytorch/pytorch/actions/runs/17637486681/job/50146967503](https://github.com/pytorch/pytorch/actions/runs/17637486681/job/50146967503 ) ([comment](https://github.com/pytorch/pytorch/pull/160078#issuecomment-3281426631 ))
2025-09-11 15:29:29 +00:00
Robert Hardwick
26b3ae5890
Move prioritized text linker optimization code from setup.py to cmake ( #160078 )
...
Note. This is a replica PR of #155901 which will be closed. I had to create a new PR in order to add it into my ghstack as there are some later commits which depend on it.
### Summary
🚀 This PR moves the prioritized text linker optimization from setup.py to cmake ( and enables by default on Linux aarch64 systems )
This change consolidates what was previously manual CI logic into a single location (cmake), ensuring consistent behavior across local builds, CI pipelines, and developer environments.
### Motivation
Prioritized text layout has measurable performance benefits on Arm systems by reducing code padding and improving cache utilization. This optimization was previously triggered manually via CI scripts (.ci/aarch64_linux/aarch64_ci_build.sh) or user-set environment variables. By detecting the target architecture within setup.py, this change enables the optimization automatically where applicable, improving maintainability and usability.
Note:
Due to ninja/cmake graph generation issues we cannot apply the linker file globally to all targets, so the targets must be manually defined. See CMakeLists.txt: the main libraries torch_python, torch, torch_cpu, torch_cuda, and torch_xpu have been targeted, which should be enough to maintain the performance benefits outlined above.
Co-authored-by: Usamah Zaheer <usamah.zaheer@arm.com >
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160078
Approved by: https://github.com/seemethere
2025-09-10 09:21:53 +00:00
PyTorch MergeBot
d711f27845
Revert "[ROCm] [CK] Composable Kernel integration for inductor backend ( #158747 )"
...
This reverts commit 019fed39aa .
Reverted https://github.com/pytorch/pytorch/pull/158747 on behalf of https://github.com/jithunnair-amd due to Broke linux-binary-manywheel-rocm / manywheel-py3_9-rocm6_4-test: 019fed39aa/1 ... PR didn't have this job run successfully due to CI outage ([comment](https://github.com/pytorch/pytorch/pull/158747#issuecomment-3259212343 ))
2025-09-05 17:27:45 +00:00
iupaikov-amd
019fed39aa
[ROCm] [CK] Composable Kernel integration for inductor backend ( #158747 )
...
This is a part of our effort for integrating Composable Kernel library for Inductor backend. Currently we have a submodule, but would prefer to have commit pin control over the library as with Triton. We intentionally avoid putting all installation logic in CI scripts to allow locally built versions to have this functionality.
The idea is to have CK as a pytorch dependency in pytorch 2.9 release to allow people to use it with inductor and AOT inductor and then gradually step away from submodule usage. Right now CK usage in SDPA/Gemm is tied to submodule files.
This PR is a remake of due to branch error: https://github.com/pytorch/pytorch/pull/156192
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158747
Approved by: https://github.com/jeffdaily
Co-authored-by: Jithun Nair <37884920+jithunnair-amd@users.noreply.github.com >
Co-authored-by: Jack Taylor <108682042+jataylo@users.noreply.github.com >
Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com >
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
2025-09-04 16:51:06 +00:00
Chris Thi
69a25f6888
[ROCm] Enable USE_FBGEMM_GENAI ( #160676 )
...
Summary:
X-link: https://github.com/pytorch/FBGEMM/pull/4703
X-link: https://github.com/facebookresearch/FBGEMM/pull/1728
In this diff we enable the support for the new FBGEMM backed FP8 _scaled_grouped_mm on ROCm. For now we only enable support for `gfx942` as that is what we have thoroughly tested performance and correctness on.
Rollback Plan:
Differential Revision: D79564024
Test Plan:
Ensure builds with:
- `USE_FBGEMM_GENAI=1` and without gfx942
- `USE_FBGEMM_GENAI=1` and with gfx942
- `USE_FBGEMM_GENAI=1` and all current [`PYTORCH_ROCM_ARCH`](9491d289b3/.ci/docker/libtorch/build.sh (L48) )
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160676
Approved by: https://github.com/drisspg
2025-09-04 07:13:17 +00:00
Eli Uriegas
0447f2d99b
build: Add fallback commands to setup.py ( #162009 )
...
Adds fallback commands for the following:
* python setup.py install
* python setup.py develop
Ideally these should just work and should provide backwards compat.
The thought process here is that many people rely on these commands, and just because setuptools wants to drop support for them, I don't think our downstream users who build from source expect them to disappear.
This should give developers some room to move away from these commands until we have a unified frontend that abstracts most of this away.
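The fallback mechanism described above can be sketched as a dispatch table consulted before handing `sys.argv` to setuptools. This is an assumed shape, not the actual patch; the common modern fallbacks for `install`/`develop` are pip's regular and editable installs.

```python
import sys

# Hypothetical mapping from legacy setup.py commands to pip equivalents.
LEGACY_FALLBACKS = {
    "install": [sys.executable, "-m", "pip", "install", "."],
    "develop": [sys.executable, "-m", "pip", "install", "-e", "."],
}


def resolve_command(argv):
    """Return the pip-based fallback command for a legacy invocation,
    or None to let setuptools handle the arguments unchanged."""
    if len(argv) >= 2 and argv[1] in LEGACY_FALLBACKS:
        return LEGACY_FALLBACKS[argv[1]]
    return None
```

A caller would run the returned command via `subprocess` when it is not `None`, preserving `python setup.py install` / `python setup.py develop` behavior for downstream users.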
Signed-off-by: Eli Uriegas <eliuriegas@meta.com >
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162009
Approved by: https://github.com/clee2000 , https://github.com/atalman
2025-09-03 02:56:10 +00:00
PyTorch MergeBot
ab7787fb82
Revert "[inductor] Windows inductor use intel-openmp. ( #160258 )"
...
This reverts commit 41673110cd .
Reverted https://github.com/pytorch/pytorch/pull/160258 on behalf of https://github.com/malfet due to Reverting to fix https://github.com/pytorch/pytorch/issues/160898 and https://github.com/pytorch/pytorch/issues/160962 ([comment](https://github.com/pytorch/pytorch/pull/160258#issuecomment-3220158145 ))
2025-08-25 12:57:47 +00:00
PyTorch MergeBot
1eccfb157a
Revert "[BE] Remove intel-openmp dependency in setup.py ( #160976 )"
...
This reverts commit e483947047 .
Reverted https://github.com/pytorch/pytorch/pull/160976 on behalf of https://github.com/malfet due to This PR is doing something strange ([comment](https://github.com/pytorch/pytorch/pull/160976#issuecomment-3220120462 ))
2025-08-25 12:46:12 +00:00
Wang, Chuanqi
e483947047
[BE] Remove intel-openmp dependency in setup.py ( #160976 )
...
Fixes #160962
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160976
Approved by: https://github.com/xuhancn , https://github.com/atalman
2025-08-20 16:33:16 +00:00
FFFrog
39aa3d1471
Remove the dead code in setup.py ( #160515 )
...
The following line has no effect.
34ec5ed275/setup.py (L1205)
This code was originally introduced in this PR: dd7cec680c ,
and clang11 and later now support `-fstack-clash-protection`. Can we remove this line?
@malfet
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160515
Approved by: https://github.com/isuruf , https://github.com/albanD
2025-08-14 06:02:11 +00:00
drisspg
15e49f6164
Factor out the strings to templates for better editor integration ( #160357 )
...
# Summary
More code motion, tldr is that install 'Better Jinja' in vscode and now you can get highlighting
Before
<img width="776" height="926" alt="Screenshot 2025-08-11 at 2 41 08 PM" src="https://github.com/user-attachments/assets/10868b31-f8ac-4cf5-99fe-19b8789ce06b " />
After:
<img width="1184" height="1299" alt="Screenshot 2025-08-11 at 2 40 27 PM" src="https://github.com/user-attachments/assets/45203765-589e-4d76-8196-d895a2f2fbf6 " />
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160357
Approved by: https://github.com/eellison
2025-08-14 01:07:53 +00:00
PyTorch MergeBot
c656334120
Revert "Factor out the strings to templates for better editor integration ( #160357 )"
...
This reverts commit cbffde7745 .
Reverted https://github.com/pytorch/pytorch/pull/160357 on behalf of https://github.com/clee2000 due to broke a bunch of internal builds due to not being able to find the file No such file or directory: torch/_inductor/kernel/flex/templates/flex_decode.py.jinja D80145761, might need a buck targets change? ([comment](https://github.com/pytorch/pytorch/pull/160357#issuecomment-3184435581 ))
2025-08-13 15:40:50 +00:00
Xu Han
41673110cd
[inductor] Windows inductor use intel-openmp. ( #160258 )
...
After some debugging, I found that PyTorch's torch_cpu.dll uses intel-openmp rather than MSVC OpenMP.
So, switch Windows inductor to intel-openmp as well.
It fixed: c8205cb354/test/inductor/test_aot_inductor.py (L2405-L2408)
<img width="896" height="230" alt="image" src="https://github.com/user-attachments/assets/273b00f8-7dc1-43c9-9b7f-752e16355a80 " />
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160258
Approved by: https://github.com/ezyang
2025-08-13 02:36:19 +00:00
drisspg
cbffde7745
Factor out the strings to templates for better editor integration ( #160357 )
...
# Summary
More code motion, tldr is that install 'Better Jinja' in vscode and now you can get highlighting
Before
<img width="776" height="926" alt="Screenshot 2025-08-11 at 2 41 08 PM" src="https://github.com/user-attachments/assets/10868b31-f8ac-4cf5-99fe-19b8789ce06b " />
After:
<img width="1184" height="1299" alt="Screenshot 2025-08-11 at 2 40 27 PM" src="https://github.com/user-attachments/assets/45203765-589e-4d76-8196-d895a2f2fbf6 " />
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160357
Approved by: https://github.com/eellison
2025-08-12 21:59:54 +00:00
Scott Todd
bfc873d02e
[ROCm][Windows] Revert copying hipblaslt and rocblas dirs. ( #159083 )
...
This reverts the changes from b367e5f6a6 . This will also close https://github.com/pytorch/pytorch/pull/158922 .
Since 30387ab2e4 , ROCm is bootstrapped using the 'rocm' Python module which contains these files (see https://github.com/ROCm/TheRock/blob/main/docs/packaging/python_packaging.md ), so they do not need to be bundled into torch/lib.
There was also a bug in here - if `ROCM_DIR` is unset, the code crashes:
```
File "D:\projects\TheRock\external-builds\pytorch\.venv\Lib\site-packages\setuptools\_distutils\dist.py", line 1002, in run_command
cmd_obj.run()
File "D:\b\pytorch_main\setup.py", line 853, in run
rocm_dir_path = Path(os.environ["ROCM_DIR"])
~~~~~~~~~~^^^^^^^^^^^^
File "<frozen os>", line 714, in __getitem__
KeyError: 'ROCM_DIR'
```
The code could have checked for `ROCM_PATH` too.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159083
Approved by: https://github.com/jeffdaily
2025-08-12 02:45:49 +00:00
Andres Lugo
5f5f508aa8
[ROCm] Ck backend UX refactor ( #152951 )
...
Refactors how the enablement/disablement of CK Gemms and SDPA works.
- Adds USE_ROCM_CK_GEMM compile flag for enabling CK gemms.
- USE_ROCM_CK_GEMM is set to True by default on Linux
- Updates USE_CK_FLASH_ATTENTION to USE_ROCM_CK_SDPA.
- USE_ROCM_CK_SDPA is set to False by default
- (USE_CK_FLASH_ATTENTION still works for now, but will be deprecated in a future release)
- Prevents these CK libraries from being used unless pytorch has been built specifically with the functionality AND is running on a system architecture that supports it.
- the getters for these library backends will also do some validity checking in case the user used an environment variable to change the backend. If invalid (i.e., one of the cases mentioned above is false), the backend will be set to the current non-CK default
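The getter-side validation described in the last bullet can be sketched as below. All names here (the environment variable, function, and backend strings) are illustrative, not the actual PyTorch API.

```python
import os

DEFAULT_BACKEND = "default"  # the non-CK default mentioned above


def get_gemm_backend(built_with_ck: bool, arch_supports_ck: bool) -> str:
    """Hypothetical sketch: honor an env-var backend override only when the
    build includes CK support AND the running architecture supports it;
    otherwise silently fall back to the non-CK default."""
    requested = os.environ.get("TORCH_ROCM_GEMM_BACKEND", DEFAULT_BACKEND)
    if requested == "ck" and not (built_with_ck and arch_supports_ck):
        return DEFAULT_BACKEND  # invalid request: fall back
    return requested
```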
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152951
Approved by: https://github.com/eqy , https://github.com/jeffdaily , https://github.com/m-gallus
Co-authored-by: Jeff Daily <jeff.daily@amd.com >
Co-authored-by: Jithun Nair <jithun.nair@amd.com >
Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com >
2025-08-08 18:40:17 +00:00
Edward Yang
38d65c6465
Add a USE_NIGHTLY option to setup.py ( #159965 )
...
If you run python setup.py develop with USE_NIGHTLY, instead of actually building PyTorch we will just go ahead and download the corresponding nightly version you specified and dump its binaries. This is intended to obsolete tools/nightly.py. There's some UX polish for detecting what the latest nightly is if you pass in a blank string. I only tested on OS X.
Coded with claude code.
Signed-off-by: Edward Yang <ezyang@meta.com >
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159965
Approved by: https://github.com/malfet
2025-08-07 01:44:20 +00:00