This PR applies various ruff experimental fixes to the code. Most changes simplify `if` conditions. The fixes were applied with `ruff --preview`, but my review indicates they are safe.
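For flavor, a hypothetical before/after in the spirit of ruff's preview `if`-simplification rules (e.g. SIM103, "return the condition directly"); this is an illustration, not a hunk from this PR:
```py
# Before: a boolean result routed through an if/else.
def needs_recompile(changed: bool, cache_hit: bool) -> bool:
    if changed and not cache_hit:
        return True
    else:
        return False

# After: the condition already is the result.
def needs_recompile_simplified(changed: bool, cache_hit: bool) -> bool:
    return changed and not cache_hit
```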
Pull Request resolved: https://github.com/pytorch/pytorch/pull/170968
Approved by: https://github.com/Lucaskabela
From https://github.com/pytorch/audio/blob/main/src/libtorchaudio/stable/ops.h
Technically it should have been OK not to port these, but looking at them carefully I realized the `subtract` ported to audio ~would have undefined behavior :/~ is broken
```
inline Tensor subtract(const Tensor& self, const Tensor& other) {
  // BUG: aten::subtract.Tensor takes three arguments
  // (self, other, *, alpha=1), but only two StableIValues are pushed
  // here, so the dispatcher reads past the end of the stack.
  const auto num_args = 2;
  std::array<StableIValue, num_args> stack{
      torch::stable::detail::from(self), torch::stable::detail::from(other)};
  TORCH_ERROR_CODE_CHECK(torch_call_dispatcher(
      "aten::subtract", "Tensor", stack.data(), TORCH_ABI_VERSION));
  return torch::stable::detail::to<torch::stable::Tensor>(stack[0]);
}
```
as it omits `alpha`; the signature for `subtract.Tensor` is `func: subtract.Tensor(Tensor self, Tensor other, *, Scalar alpha=1) -> Tensor`. ~This is also our bad: although out-of-bounds reads on the StableIValue stack would be caught by ASan, without ASan they are silent correctness issues (PR coming to fix).~
Use the old path to support this, as we don't support StableIValue conversion for `Scalar` yet.
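For reference, a quick sketch via the public Python API showing why `alpha` cannot simply be dropped (illustrative only; the broken code above lives in the stable C++ shim):
```py
import torch

# subtract(self, other, *, alpha) computes self - alpha * other, so
# alpha still occupies a slot on the dispatcher stack even at default.
print(torch.ops.aten.subtract.Tensor._schema)
a, b = torch.tensor([3.0]), torch.tensor([1.0])
assert torch.equal(torch.subtract(a, b, alpha=2), torch.tensor([1.0]))
```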
Pull Request resolved: https://github.com/pytorch/pytorch/pull/169880
Approved by: https://github.com/albanD
ghstack dependencies: #169703, #169709, #169711, #168062, #169872
Similar to https://github.com/pytorch/pytorch/pull/158137 (thank you, @AaronWang04, for the instructional tips and answering my questions!), but performs
`Activation(Addmm) -> _addmm_activation` replacement instead of `Activation(add(mm)) -> _addmm_activation`.
The reasons for preferring this mapping over the one in https://github.com/pytorch/pytorch/pull/158137 are:
- Prior work extended cuBLASLt coverage in `addmm` beyond just 1D bias and `beta=1, alpha=1`. As long as there is an activation after `addmm`, we can call Lt. This makes the pattern-replacement check leaner and agnostic to the inputs' metadata (we get `addmm`'s checks for free).
- Inductor intercepts `addmm` and replaces it with `alpha * [alpha != 1] * m1 @ m2 + beta * [beta != 1] * input` (the bracketed factors are emitted only when the scalar differs from 1) when it is followed by point-wise consumers (including activation functions). So it is much easier and cleaner to intercept just `addmm` (and not a combinatorial set of patterns) before such replacements.
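As a minimal numerical sketch of the equivalence the replacement relies on (illustrative, not the Inductor pattern-matcher code; assumes a CUDA build):
```py
import torch

# relu(addmm(bias, a, b)) and the fused _addmm_activation op compute the
# same result; the fused form lets cuBLASLt run the epilogue in-kernel.
bias = torch.randn(128, device="cuda", dtype=torch.float16)
a = torch.randn(64, 256, device="cuda", dtype=torch.float16)
b = torch.randn(256, 128, device="cuda", dtype=torch.float16)

unfused = torch.relu(torch.addmm(bias, a, b))
fused = torch._addmm_activation(bias, a, b)  # use_gelu=False -> ReLU
# Loose tolerances: the fused epilogue may round differently in fp16.
torch.testing.assert_close(unfused, fused, rtol=2e-3, atol=2e-3)
```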
Re-run of the benchmark script in https://github.com/pytorch/pytorch/pull/158137 on H100 yields:
`float16`:
```
============================================================
Testing with M=1024, N=1024, K=1024, dtype=float16
============================================================
Average Time per Iteration (cublas): 0.0096 ms
Average Time per Iteration (torch compile): 0.0407 ms
============================================================
Testing with M=2048, N=2048, K=2048, dtype=float16
============================================================
Average Time per Iteration (cublas): 0.0270 ms
Average Time per Iteration (torch compile): 0.0409 ms
============================================================
Testing with M=4096, N=4096, K=4096, dtype=float16
============================================================
Average Time per Iteration (cublas): 0.1828 ms
Average Time per Iteration (torch compile): 0.2415 ms
============================================================
Testing with M=8192, N=8192, K=8192, dtype=float16
============================================================
Average Time per Iteration (cublas): 1.5971 ms
Average Time per Iteration (torch compile): 1.9723 ms
```
`bfloat16`:
```
============================================================
Testing with M=1024, N=1024, K=1024, dtype=bfloat16
============================================================
Average Time per Iteration (cublas): 0.0093 ms
Average Time per Iteration (torch compile): 0.0416 ms
============================================================
Testing with M=2048, N=2048, K=2048, dtype=bfloat16
============================================================
Average Time per Iteration (cublas): 0.0264 ms
Average Time per Iteration (torch compile): 0.0411 ms
============================================================
Testing with M=4096, N=4096, K=4096, dtype=bfloat16
============================================================
Average Time per Iteration (cublas): 0.1768 ms
Average Time per Iteration (torch compile): 0.2430 ms
============================================================
Testing with M=8192, N=8192, K=8192, dtype=bfloat16
============================================================
Average Time per Iteration (cublas): 1.5564 ms
Average Time per Iteration (torch compile): 1.8916 ms
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/168157
Approved by: https://github.com/eellison, https://github.com/eqy
Summary:
Original commit changeset: d6d62d0c96dd
Original Phabricator Diff: D84468451 and D84613184
D84468451 caused a CUDA OutOfMemoryError in a model.
Test Plan:
D84468451 was found through bisect. Also double checked on recent trunk 9866939225248c2adc307be7a804b26db0b9b555: f815887517
With this diff that backs out D84468451 and D84613184 : f816114560
Differential Revision: D85025378
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165910
Approved by: https://github.com/clee2000
This fixes AOTAutograd `rms_norm` not being bitwise equivalent to eager, because it avoids a decomposition. You can still force the decomposition by putting it in the dispatch table, but if eager mode wouldn't have decomposed (because it went to the fused kernel), we now preserve the fused call by default.
This largely reverts https://github.com/pytorch/pytorch/pull/103275/ for view ops. This means that in inference mode we could hit the wrong C++ kernel; if this occurs we should just SymInt'ify the C++ kernel.
Another neat side effect of this change is that Inductor's generated kernels for rms_norm now have rms_norm in their name.
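A sketch of the invariant being restored (hedged: assumes a CUDA build where eager dispatches to the fused kernel and this change is active):
```py
import torch
import torch.nn.functional as F

# With the fused op preserved through AOTAutograd, compiled rms_norm
# should reproduce eager bit-for-bit, not merely within tolerance.
x = torch.randn(8, 512, device="cuda")
w = torch.randn(512, device="cuda")

eager = F.rms_norm(x, (512,), weight=w)
compiled = torch.compile(F.rms_norm)(x, (512,), weight=w)
assert torch.equal(eager, compiled)  # bitwise equality
```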
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164939
Approved by: https://github.com/bdhirsh
ghstack dependencies: #164573
Summary:
Original commit changeset: 06888d7ebff0
Original Phabricator Diff: D82932788
Restricted the test to SM90 for scaled_grouped_mm
Test Plan: TBD (will share the linux CI results)
Differential Revision: D83283991
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163905
Approved by: https://github.com/angelayi
[relanding again after fixing internal build]
Summary:
This might cause some new DDEs at call sites that use neither is_contiguous_or_false() nor sym_is_contiguous(), but we want to find those call sites so they can be handled properly, by explicitly calling is_contiguous_or_false() rather than is_contiguous() where appropriate.
I had to fix one issue after removing the implicit size-oblivious reasoning; here is the context.
In https://github.com/pytorch/pytorch/pull/157472 we defined sym_is_contiguous as the function that computes contiguity for dynamic shapes in C++. It returns a symbolic expression representing contiguity and is guaranteed not to throw a DDE.
When callers use is_contiguous, we do sym_is_contiguous().guard_bool();
when callers use is_contiguous_or_false, we do sym_is_contiguous().guard_or_false().
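For intuition, a small Python-side sketch of the difference between the two guard flavors (a hypothetical helper, not PyTorch internals; `guard_or_false` accepts plain bools as well as SymBools):
```py
from torch.fx.experimental.symbolic_shapes import guard_or_false

def pick_kernel(is_contig) -> str:
    # bool(is_contig) / guard_bool() forces an answer and can raise a
    # data-dependent error (DDE) on unbacked symbolic expressions.
    # guard_or_false never raises: "unknown" degrades to False, so we
    # conservatively take the general path instead of erroring.
    if guard_or_false(is_contig):
        return "contiguous fast path"
    return "general path"

print(pick_kernel(True))   # contiguous fast path
print(pick_kernel(False))  # general path
```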
One path that was not handled well:
```
c10::SymBool TensorImpl::sym_is_contiguous_custom(
    at::MemoryFormat memory_format) const {
  if (C10_UNLIKELY(matches_python_custom(SizesStridesPolicy::CustomStrides))) {
    // Problem: this crossed into the Python is_contiguous path, which
    // relied on the implicit size-oblivious reasoning removed here.
    return pyobj_slot_.load_pyobj_interpreter()->is_contiguous(
        this, memory_format);
  }
  return sym_is_contiguous_default(memory_format);
}
```
Namely, if sym_is_contiguous_custom is called and matches_python_custom(SizesStridesPolicy::CustomStrides) returns true, we used to call is_contiguous(this, memory_format).
That went through load_pyobj_interpreter and ended up calling the Python is_contiguous, which used implicit size-oblivious reasoning.
Once that implicit size-oblivious reasoning was removed, the right thing is to call
`return pyobj_slot_.load_pyobj_interpreter()->sym_is_contiguous(this, memory_format);`
otherwise we would get a DDE even when the caller is using sym_is_contiguous.
So I had to define it for the pyinterpreter, and then override it for nested tensors.
Test Plan:
contbuild & OSS CI, see e444cd24d4
Rollback Plan:
Differential Revision: D80435179
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160869
Approved by: https://github.com/ezyang
This PR implements a post_grad pass that fuses `activation(add + mm)`.
This was previously done similarly in #106912, but that was reverted for performance reasons and replaced with a pass that unfuses the activation and add from addmm/addmm_activation and lets Inductor handle the fusion.
However, the cuBLAS team has since made a lot of perf improvements here. I will update this post with more benchmarks, but preliminary benchmarks show good results.
Perf dashboard:
<img width="3371" height="1240" alt="Screenshot from 2025-08-07 13-41-35" src="https://github.com/user-attachments/assets/d44d6205-b33a-4a20-9f0f-d9db176b3738" />
ReLU works with both training and inference, but GELU only works in inference mode due to a fundamental limitation: GELU's derivative depends on its input while ReLU's doesn't. I don't think this is fixable with the current `_addmm_activation` API.
Graph module before and after this pass
Relu(addmm)
```
graph():
    %primals_1 : [num_users=1] = placeholder[target=primals_1]
    %primals_2 : [num_users=2] = placeholder[target=primals_2]
    %primals_3 : [num_users=2] = placeholder[target=primals_3]
    %addmm : [num_users=1] = call_function[target=torch.ops.aten.addmm.default](args = (%primals_1, %primals_3, %primals_2), kwargs = {})
    %relu : [num_users=2] = call_function[target=torch.ops.aten.relu.default](args = (%addmm,), kwargs = {})
    %le : [num_users=1] = call_function[target=torch.ops.aten.le.Scalar](args = (%relu, 0), kwargs = {})
    %permute_1 : [num_users=1] = call_function[target=torch.ops.aten.permute.default](args = (%primals_3, [1, 0]), kwargs = {})
    return (relu, primals_2, le, permute_1)

graph():
    %primals_1 : [num_users=1] = placeholder[target=primals_1]
    %primals_2 : [num_users=2] = placeholder[target=primals_2]
    %primals_3 : [num_users=2] = placeholder[target=primals_3]
    %_addmm_activation_default : [num_users=2] = call_function[target=torch.ops.aten._addmm_activation.default](args = (%primals_1, %primals_3, %primals_2), kwargs = {})
    %le : [num_users=1] = call_function[target=torch.ops.aten.le.Scalar](args = (%_addmm_activation_default, 0), kwargs = {})
    %permute_1 : [num_users=1] = call_function[target=torch.ops.aten.permute.default](args = (%primals_3, [1, 0]), kwargs = {})
    return (_addmm_activation_default, primals_2, le, permute_1)
```
Gelu (addmm)
```
graph():
    %arg0_1 : [num_users=1] = placeholder[target=arg0_1]
    %arg1_1 : [num_users=1] = placeholder[target=arg1_1]
    %arg2_1 : [num_users=1] = placeholder[target=arg2_1]
    %addmm : [num_users=4] = call_function[target=torch.ops.aten.addmm.default](args = (%arg0_1, %arg2_1, %arg1_1), kwargs = {})
    %mul : [num_users=1] = call_function[target=torch.ops.aten.mul.Tensor](args = (%addmm, %addmm), kwargs = {})
    %mul_1 : [num_users=1] = call_function[target=torch.ops.aten.mul.Tensor](args = (%mul, %addmm), kwargs = {})
    %mul_2 : [num_users=1] = call_function[target=torch.ops.aten.mul.Tensor](args = (%mul_1, 0.044715), kwargs = {})
    %add : [num_users=1] = call_function[target=torch.ops.aten.add.Tensor](args = (%addmm, %mul_2), kwargs = {})
    %mul_3 : [num_users=1] = call_function[target=torch.ops.aten.mul.Tensor](args = (%add, 0.7978845608028654), kwargs = {})
    %mul_4 : [num_users=1] = call_function[target=torch.ops.aten.mul.Tensor](args = (%addmm, 0.5), kwargs = {})
    %tanh : [num_users=1] = call_function[target=torch.ops.aten.tanh.default](args = (%mul_3,), kwargs = {})
    %add_1 : [num_users=1] = call_function[target=torch.ops.aten.add.Tensor](args = (%tanh, 1), kwargs = {})
    %mul_5 : [num_users=1] = call_function[target=torch.ops.aten.mul.Tensor](args = (%mul_4, %add_1), kwargs = {})
    return (mul_5,)

graph():
    %arg0_1 : [num_users=1] = placeholder[target=arg0_1]
    %arg1_1 : [num_users=1] = placeholder[target=arg1_1]
    %arg2_1 : [num_users=1] = placeholder[target=arg2_1]
    %_addmm_activation_default : [num_users=1] = call_function[target=torch.ops.aten._addmm_activation.default](args = (%arg0_1, %arg2_1, %arg1_1), kwargs = {use_gelu: True})
    return (_addmm_activation_default,)
```
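Before the benchmark numbers, a quick numerical sanity check mirroring the pair compared by the script further down (a sketch; requires a CUDA build):
```py
import torch

# _addmm_activation with use_gelu=True should match tanh-approximate
# gelu applied to addmm's output, up to epilogue rounding.
bias = torch.randn(128, device="cuda")
a = torch.randn(64, 256, device="cuda")
b = torch.randn(256, 128, device="cuda")

ref = torch.nn.functional.gelu(torch.addmm(bias, a, b), approximate="tanh")
fused = torch._addmm_activation(bias, a, b, use_gelu=True)
torch.testing.assert_close(ref, fused, rtol=1e-4, atol=1e-4)
```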
Benchmark setup:
NGC pytorch 25.06 container
cublas version: 12.9.1.4
torch.compile was run with dynamic=False and max_autotune
H100
```
============================================================
Testing with M=1024, N=1024, K=1024, dtype=bfloat16
============================================================
Average Time per Iteration (cublas): 0.0107 ms
Average Time per Iteration (torch compile): 0.0296 ms
============================================================
Testing with M=2048, N=2048, K=2048, dtype=bfloat16
============================================================
Average Time per Iteration (cublas): 0.0262 ms
Average Time per Iteration (torch compile): 0.0327 ms
============================================================
Testing with M=4096, N=4096, K=4096, dtype=bfloat16
============================================================
Average Time per Iteration (cublas): 0.1763 ms
Average Time per Iteration (torch compile): 0.2457 ms
============================================================
Testing with M=8192, N=8192, K=8192, dtype=bfloat16
============================================================
Average Time per Iteration (cublas): 1.5280 ms
Average Time per Iteration (torch compile): 1.9437 ms
```
A100
```
############################################################
Testing with dtype: float16
############################################################
============================================================
Testing with M=1024, N=1024, K=1024, dtype=float16
============================================================
Average Time per Iteration (cublas): 0.0313 ms
Average Time per Iteration (torch compile): 0.0643 ms
============================================================
Testing with M=2048, N=2048, K=2048, dtype=float16
============================================================
Average Time per Iteration (cublas): 0.1149 ms
Average Time per Iteration (torch compile): 0.1255 ms
============================================================
Testing with M=4096, N=4096, K=4096, dtype=float16
============================================================
Average Time per Iteration (cublas): 0.6297 ms
Average Time per Iteration (torch compile): 0.7547 ms
============================================================
Testing with M=8192, N=8192, K=8192, dtype=float16
============================================================
Average Time per Iteration (cublas): 4.3821 ms
Average Time per Iteration (torch compile): 5.0740 ms
```
Script
```py
import torch

torch.manual_seed(0)
warmup, numrun = 10, 100
sizes = [1024, 2048, 4096, 8192]
dtypes = [torch.float16, torch.bfloat16, torch.float32]
device = torch.device("cuda")

for dtype in dtypes:
    dtype_name = str(dtype).split('.')[-1]
    print(f"\n{'#'*60}")
    print(f"Testing with dtype: {dtype_name}")
    print(f"{'#'*60}")
    for size in sizes:
        M, N, K = size, size, size
        print(f"\n{'='*60}")
        print(f"Testing with M={M}, N={N}, K={K}, dtype={dtype_name}")
        print(f"{'='*60}")
        A = torch.randn(M, K, device=device, dtype=dtype)
        B = torch.randn(K, N, device=device, dtype=dtype)
        C = torch.randn(M, device=device, dtype=dtype)

        def func1():
            return torch._addmm_activation(C, A, B, use_gelu=True)

        def func2():
            return torch.nn.functional.gelu(torch.add(C, torch.mm(A, B)), approximate="tanh")

        func2_compiled = torch.compile(
            func2,
            dynamic=False,
            options={
                "force_disable_caches": True,
                "max_autotune": True,
                "max_autotune_gemm": True,
                "max_autotune_gemm_backends": "TRITON",
                "autotune_fallback_to_aten": False,
            },
        )

        for _ in range(warmup): func1()
        torch.cuda.synchronize(device=device)
        start_event = torch.cuda.Event(enable_timing=True)
        end_event = torch.cuda.Event(enable_timing=True)
        total_time_ms = 0.0
        start_event.record()
        for _ in range(numrun): func1()
        end_event.record()
        torch.cuda.synchronize(device=device)
        total_time_ms += start_event.elapsed_time(end_event)
        avg_time_ms = total_time_ms / numrun
        print(f"Average Time per Iteration (cublas):\t {avg_time_ms:.4f} ms")

        for _ in range(warmup): func2_compiled()
        torch.cuda.synchronize(device=device)
        start_event = torch.cuda.Event(enable_timing=True)
        end_event = torch.cuda.Event(enable_timing=True)
        total_time_ms = 0.0
        start_event.record()
        for _ in range(numrun): func2_compiled()
        end_event.record()
        torch.cuda.synchronize(device=device)
        total_time_ms += start_event.elapsed_time(end_event)
        avg_time_ms = total_time_ms / numrun
        print(f"Average Time per Iteration (torch compile):\t {avg_time_ms:.4f} ms")
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158137
Approved by: https://github.com/eellison
Automatically replaces `split` with `rsplit` when relevant, and only performs the split up to the first (or last) value. This allows the split call to return early and improves efficiency.
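A small illustration of the rewrite (hypothetical strings; `maxsplit=1` is what allows the early return):
```py
path = "torch/_inductor/fx_passes/post_grad.py"

# Before: splits on every "/" and materializes all segments.
module = path.split("/")[-1]
# After: a single split from the right, then take the last piece.
module_fast = path.rsplit("/", 1)[-1]

assert module == module_fast == "post_grad.py"
```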
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160107
Approved by: https://github.com/albanD