3275 Commits

Author SHA1 Message Date
Yuanyuan Chen
77470cdbfb Remove old CUDA conditions (#171235)
This PR removes old code branches for CUDA <= 12.3.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/171235
Approved by: https://github.com/ezyang
2025-12-31 09:05:08 +00:00
Yu, Guangye
c4c211a81b [xpu][fix] Use small pool for 1MB allocation (#171453)
# Motivation
Previously, XPU treated an explicit 1 MB allocation as a large-pool allocation. This PR aligns XPU’s behavior with other backends by treating a 1 MB allocation as a small-pool allocation, while preserving the existing behavior that allocations larger than 1 MB use the large pool.

This issue only occurs when a user explicitly allocates exactly 1 MB.
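
A minimal sketch of the boundary case, assuming an XPU device is available and that `torch.xpu.memory_stats` exposes pool-scoped keys mirroring the CUDA allocator (the exact key names are an assumption for illustration):

```python
import torch

# Allocating exactly 1 MB should now be counted against the small pool,
# matching the behavior of other backends (key names assumed to mirror CUDA).
x = torch.empty(1024 * 1024, dtype=torch.uint8, device="xpu")
stats = torch.xpu.memory_stats()
print(stats.get("allocation.small_pool.current"), stats.get("allocation.large_pool.current"))
```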

# Additional Context
fix https://github.com/pytorch/pytorch/issues/171451
Pull Request resolved: https://github.com/pytorch/pytorch/pull/171453
Approved by: https://github.com/EikanWang
2025-12-30 05:47:56 +00:00
Aaron Gokaslan
0db3b1eee4 [BE]: Mark more hash impls as noexcept for efficiency (#171388)
Marking hashes as noexcept allows them to be recomputed as needed by STL container objects, which lets those containers use more efficient implementations. Most of these types have very simple hash implementations or cache their hash values anyway.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/171388
Approved by: https://github.com/Lucaskabela, https://github.com/drisspg
2025-12-30 03:18:13 +00:00
Yu, Guangye
949476b243 Support torch.accelerator.get_device_capability on XPU (#170747)
# Motivation
This PR adds support for `torch.accelerator.get_device_capability` on XPU. At the current stage, it reports a limited set of basic scalar data types, taking potential software emulation into account where native hardware support may not be available.
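
A hedged usage sketch, assuming the accelerator-level API accepts a device string and returns a dict with a `supported_dtypes` key, as described for #165631 later in this log:

```python
import torch

# Query the basic capability report on the first XPU device.
caps = torch.accelerator.get_device_capability("xpu:0")
print(caps["supported_dtypes"])
```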

Pull Request resolved: https://github.com/pytorch/pytorch/pull/170747
Approved by: https://github.com/EikanWang
2025-12-23 09:34:38 +00:00
Yu, Guangye
ec969a2278 [xpu][feature] Add skip actions support to filter out memory trace (#170760)
# Motivation
This PR introduces a flag to skip specific events in the memory snapshot, reducing trace file size and improving HTML viewer usability. It does the same thing for XPU as https://github.com/pytorch/pytorch/issues/168183.
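
A hedged usage sketch; the exact XPU-side entry point is an assumption modeled on the CUDA API from #168183:

```python
import torch

# Skip free_requested events when recording the XPU memory history
# (function name assumed to mirror torch.cuda.memory._record_memory_history).
torch.xpu.memory._record_memory_history(skip_actions=["free_requested"])
```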

Pull Request resolved: https://github.com/pytorch/pytorch/pull/170760
Approved by: https://github.com/EikanWang
ghstack dependencies: #169812
2025-12-22 15:15:28 +00:00
Isalia20
c2acb4e85a [MPS] binary dense scalar kernels (#170337)
Fixes #168964

Pull Request resolved: https://github.com/pytorch/pytorch/pull/170337
Approved by: https://github.com/malfet
2025-12-18 15:36:41 +00:00
Yu, Guangye
d2d70b9822 [xpu][feature] Introduce some additional metrics for memory stats of XPU caching allocator (#169812)
# Motivation
This PR introduces useful metrics into xpu device statistics, including `stats.allocation`, `stats.segment`, `stats.active`, `inactive_split`, and `inactive_split_bytes`.

# Additional Context
The unit test will be added in https://github.com/pytorch/pytorch/pull/169442; it will check that the memory stats are consistent with `torch.xpu.memory_stats` and `torch.xpu.memory_snapshot`.
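
A hedged sketch of how the new counters could be inspected, assuming they surface through `torch.xpu.memory_stats()` under CUDA-style key names (the exact keys are an assumption):

```python
import torch

buf = torch.empty(8 * 1024 * 1024, dtype=torch.uint8, device="xpu")
stats = torch.xpu.memory_stats()
# Key names assumed to follow the CUDA allocator's "<stat>.<pool>.<metric>" scheme.
for key in ("allocation.all.current", "segment.all.current", "active.all.current",
            "inactive_split.all.current", "inactive_split_bytes.all.current"):
    print(key, stats.get(key))
```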

Pull Request resolved: https://github.com/pytorch/pytorch/pull/169812
Approved by: https://github.com/EikanWang
2025-12-18 06:04:16 +00:00
Daniel Galvez
416fccb5cd Make CachingHostAllocator work with memory pools. (#167507)
Both allocation to a cuda graph's private pool via stream capture and
allocation to a memory pool in non-stream-captured code are supported.

In the case of stream capture, we refuse to reuse a host memory block
as soon as record_event() is called on that block. This prevents a
stream-captured CUDA kernel from reading different contents from a
memory block than it would read if, counterfactually, that kernel were
running eagerly on a CUDA stream.

See
https://github.com/pytorch/pytorch/pull/161583#issuecomment-3229885771
for elaboration.

This is still lacking test cases for pageable host memory copies. We must
make sure that record_event() does not fail in that case.
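
A hedged illustration of the user-visible scenario (not of the allocator internals): an H2D copy from a pinned host buffer recorded into a CUDA graph, where the host block must stay intact as long as the captured copy can replay.

```python
import torch

host = torch.randn(1 << 20, pin_memory=True)   # pinned host buffer
dev = torch.empty(1 << 20, device="cuda")

g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    dev.copy_(host, non_blocking=True)         # captured H2D copy reads `host`

g.replay()                                     # each replay reads `host` again
torch.cuda.synchronize()
```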
Pull Request resolved: https://github.com/pytorch/pytorch/pull/167507
Approved by: https://github.com/eqy, https://github.com/eee4017, https://github.com/ngimel
2025-12-17 07:15:58 +00:00
Nikita Shulga
48f3158d1d [MPS] Migrate pow_tensor_scalar/reciprocal ops to Metal shaders (#170077)
The motivation is to fix `torch.signal.windows.kaiser` (see https://github.com/pytorch/pytorch/issues/164712), which started failing because `c10::metal::pow(x, 2)` is no longer identical to `x * x`
- Redispatch to `reciprocal`/`sqrt`/`rsqrt` when scalar is `2.0`/`.5`/`-.5` respectively
- Implement pow operation for complex numbers using $a^b = e^{b \cdot \log(a)} = e^{b \cdot (\log(r) + i\theta)}$ if $a=r e^{i\theta}$
- This also fixes backward for `tanh`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/170077
Approved by: https://github.com/Skylion007
2025-12-17 06:55:42 +00:00
Yu, Guangye
118b0d9037 [xpu][feature] [3/6] Add snapshot support on XPU caching allocator (#169203)
# Motivation
This PR introduces memory snapshot functionality for the XPU caching allocator. Our design philosophy is to keep the implementation as simple as possible without unnecessary features. We will be able to extend the functionality in the future if real use cases arise. The `c10::xpu::XPUCachingAllocator::snapshot` API introduced in this PR will be leveraged by the Python frontend in a follow-up PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/169203
Approved by: https://github.com/albanD
ghstack dependencies: #168262, #169280
2025-12-15 10:39:18 +00:00
Yu, Guangye
5cc4ebf398 [xpu][feature] [2/6] Track stack context for xpu caching allocator (#169280)
# Motivation
This PR introduces stack context tracking support in the XPU caching allocator. The design remains straightforward and leverages `record_trace` appropriately to capture relevant allocation history for debugging and analysis.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/169280
Approved by: https://github.com/EikanWang, https://github.com/albanD
ghstack dependencies: #168262
2025-12-15 10:39:17 +00:00
Yu, Guangye
bfb2ad2a88 [xpu][feature] [1/6] Add trace support on XPU caching allocator (#168262)
# Motivation
This PR introduces the `TraceEntry` structure and the `recordHistory` functionality to enable trace support in the XPU caching allocator. At this stage, the implementation is XPU-specific; however, we plan to refactor and consolidate this logic with the existing CUDA implementation to maximize code reuse and consistency across backends. The design principle is to remain minimal and straightforward while laying the foundation for future enhancements.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/168262
Approved by: https://github.com/EikanWang, https://github.com/albanD
2025-12-15 10:39:00 +00:00
linhaifeng
52c7ad7bb7 [BE] remove redundant items in unordered_set/unordered_map (#170055)
Remove redundant items in unordered_set/unordered_map.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/170055
Approved by: https://github.com/cyyever, https://github.com/malfet
2025-12-15 07:04:32 +00:00
yifeng.jyf
7a41672d38 [Memory Snapshot] Release alloc_buffer memory after stopping recording history (#169650)
In our use case, we use memory snapshots for on-demand GPU memory allocation analysis. We found that memory usage stayed significantly higher after the analysis completed. One important reason is that the buffer recording the GPU memory allocation history was not released after recording stopped, so this PR releases that buffer.
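
A hedged sketch of the recording lifecycle this targets; with the change, stopping recording should also let the allocator drop the internal event buffer:

```python
import torch

torch.cuda.memory._record_memory_history(max_entries=100_000)  # start recording
# ... run the workload under analysis ...
snapshot = torch.cuda.memory._snapshot()                        # capture the snapshot
torch.cuda.memory._record_memory_history(enabled=None)          # stop; the buffer can now be freed
```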
Pull Request resolved: https://github.com/pytorch/pytorch/pull/169650
Approved by: https://github.com/ngimel, https://github.com/guangyey
2025-12-15 04:40:50 +00:00
Jiannan Wang
dc79403a25 Add skip_actions flag to filter out memory snapshot events (#168183)
Fixes [T244199247](https://www.internalfb.com/intern/tasks/?t=244199247).

This PR adds a flag to skip specific events in the memory snapshot, reducing trace file size and improving HTML viewer usability.

Docstring
```
      skip_actions (list[str], optional): List of action types to skip when recording
            memory history. This can be used to reduce memory overhead by excluding
            certain types of events from being recorded. Valid action types are:

            - `"alloc"`: Memory allocation events
            - `"free_requested"`: Free requests (memory marked for freeing)
            - `"free_completed"`: Completed free operations (memory actually freed)
            - `"segment_alloc"`: Segment allocation from cudaMalloc
            - `"segment_free"`: Segment freed back to CUDA via cudaFree
            - `"oom"`: Out-of-memory exceptions
            - `"snapshot"`: Memory snapshot generation events

            For example, to skip recording free_requested events:
            `skip_actions=["free_requested"]`
```

Usage:
`torch.cuda.memory._record_memory_history(skip_actions=['free_requested'])`

The following screenshot shows that there are still free_completed events but no free_requested events.


Pull Request resolved: https://github.com/pytorch/pytorch/pull/168183
Approved by: https://github.com/sraikund16
2025-12-12 22:26:07 +00:00
Yuanyuan Chen
28fe2d3dbd Fix clang-tidy warnings on c10/xpu files (#169231)
This PR fixes clang-tidy warnings on c10/xpu files.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/169231
Approved by: https://github.com/guangyey, https://github.com/EikanWang, https://github.com/albanD
2025-12-11 09:48:53 +00:00
Yuanyuan Chen
e33fa0ece3 Apply more clang-tidy fixes (#169794)
This PR applies more clang-tidy fixes about readability and modernization.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/169794
Approved by: https://github.com/Skylion007
2025-12-11 00:24:28 +00:00
Dan Johnson
f45018788b Mempool use_on_oom order (#169699)
Reorder OOM mitigation steps so that we reuse opt-in (`use_on_oom`) mempools before the more expensive step of releasing cached blocks.

Additionally, make sure mempools are removed from use_on_oom_pools upon deletion.
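
A hedged sketch of the setup this affects (sizes are illustrative only):

```python
import torch

# A pool opted into OOM mitigation: cached memory held here may be reused before
# the allocator falls back to releasing all of its cached blocks.
pool = torch.cuda.MemPool(use_on_oom=True)
with torch.cuda.use_mem_pool(pool):
    warm = torch.empty(64 * 1024 * 1024, dtype=torch.uint8, device="cuda")
del warm  # memory stays cached inside the pool

# A later allocation that would otherwise OOM can now be served from `pool` first.
big = torch.empty(64 * 1024 * 1024, dtype=torch.uint8, device="cuda")
```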
New test before fix:
```
======================================================================
ERROR: test_deleted_mempool_not_used_on_oom (__main__.TestMemPool.test_deleted_mempool_not_used_on_oom)
Test that a deleted mempool with use_on_oom=True is properly removed from use_on_oom_pools.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/danielsjohnson/oss_pytorch/pytorch/torch/testing/_internal/common_utils.py", line 3325, in wrapper
    method(*args, **kwargs)
  File "/home/danielsjohnson/oss_pytorch/pytorch/test/test_cuda.py", line 5696, in test_deleted_mempool_not_used_on_oom
    c = torch.randn(20 * nelem_1mb, device="cuda")
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: it->second->use_count > 0 INTERNAL ASSERT FAILED at "/home/danielsjohnson/oss_pytorch/pytorch/c10/cuda/CUDACachingAllocator.cpp":2700, please report a bug to PyTorch.
To execute this test, run the following from the base repo dir:
    python test/test_cuda.py TestMemPool.test_deleted_mempool_not_used_on_oom
This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
----------------------------------------------------------------------
Ran 1 test in 0.691s
FAILED (errors=1)
Segmentation fault (core dumped)
```

New test after fix:
```
----------------------------------------------------------------------
Ran 1 test in 0.651s
OK
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/169699
Approved by: https://github.com/ngimel, https://github.com/eqy
2025-12-10 22:23:26 +00:00
Isalia20
87a052ffd8 [MPS] Binary dense scalar kernels (#169764)
Partially addresses #168964

This:
```python
import torch
import time

input = torch.rand(16, 8, 343, 64).to("mps")
scalar = 1.0

start = time.time()

for _ in range(10000):
    out = input.mul(scalar)

torch.mps.synchronize()
end = time.time()
print(f"Took {end - start}")
```

Takes 0.89 seconds for torch 2.6 (MPS Graph implementation).
On main it takes 2.95 seconds.
After these changes, it takes 0.71-0.76 seconds.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/169764
Approved by: https://github.com/malfet
2025-12-10 18:17:03 +00:00
Jeremy Braun
f795b99ab2 [pytorch] redirect fbcode//caffe2/c10:c10 to the OSS/conda version (#169004)
Summary: Some build rules want to consume c10 from an external source. Convert it to an alias (like other pytorch entry points), and redirect conda builds to use the `fbcode//conda/buck_integration/toolchains/third-party:torch-cpp` target.

Test Plan: CI.

Reviewed By: athmasagar

Differential Revision: D81518888

Pull Request resolved: https://github.com/pytorch/pytorch/pull/169004
Approved by: https://github.com/malfet
2025-12-10 17:59:51 +00:00
Jacob Szwejbka
d026ae7be6 Protect files that ET depends on to be pinned to c++17 (#169899)
Needed for https://github.com/pytorch/pytorch/issues/167822

ExecuTorch is still pinned to c++17 and will remain that way for the foreseeable future. We take a dependency on these headers, and pytorch/pytorch wants to bump to c++20. This is a pretty minimal way to avoid the issue, recommended by @r-barnes.

Added a unit test with the version restricted, and added a new linter that I manually tested by modifying the files to use a c++20 feature and verifying that it failed.

Previous version that went under review before I trashed the git history and broke the CLA job: https://github.com/pytorch/pytorch/pull/168204
Pull Request resolved: https://github.com/pytorch/pytorch/pull/169899
Approved by: https://github.com/larryliu0820
2025-12-09 22:59:44 +00:00
Ke Wen
57beef6e3c [MemPool] Add no-split option (#169739)
SymmMem creates one signal pad per memory allocation. As a result, it is undesirable for the CachingAllocator to let two tensors share the space of one memory allocation; otherwise, operations on those two tensors would use the same signal pad, leading to incorrect synchronization.

This PR adds a `no_split` flag to MemPool so that the pool would not use the remaining part of a memory segment to allocate for another block (tensor).
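
A hedged sketch, assuming the flag is also exposed on the Python `MemPool` constructor (this PR describes the C++-side flag, so the exact Python spelling is an assumption):

```python
import torch

pool = torch.cuda.MemPool(no_split=True)  # keyword name assumed
with torch.cuda.use_mem_pool(pool):
    a = torch.empty(1024, device="cuda")
    b = torch.empty(1024, device="cuda")  # not carved from the remainder of a's segment
```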
Pull Request resolved: https://github.com/pytorch/pytorch/pull/169739
Approved by: https://github.com/eqy, https://github.com/ngimel
2025-12-09 22:08:29 +00:00
Yu, Guangye
87449b1c16 Document PYTORCH_ALLOC_CONFIG and mark PYTORCH_CUDA_ALLOC_CONFIG as its alias (#167659)
# Motivation
This PR documents `PYTORCH_ALLOC_CONF` in the docs and comments and reminds the user that the previous environment variable `PYTORCH_CUDA_ALLOC_CONF` is its alias.
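
A minimal sketch of the documented usage; the variable must be set before the allocator reads its configuration:

```python
import os

# Backend-agnostic spelling; PYTORCH_CUDA_ALLOC_CONF remains as an alias.
os.environ["PYTORCH_ALLOC_CONF"] = "expandable_segments:True,max_split_size_mb:128"

import torch  # the allocator picks up the setting when it initializes
x = torch.empty(1024, device="cuda")
```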
Pull Request resolved: https://github.com/pytorch/pytorch/pull/167659
Approved by: https://github.com/albanD
2025-12-09 08:21:16 +00:00
Yufei Yuan
9c4e5352e4 Fix vhappi_common_backend GCC build by moving Clang-specific flags (#169746)
Summary: The vhappi_common_backend target was applying the Clang-specific warning flag -Wdeprecated-non-prototype to all compiler builds, causing GCC builds to fail since GCC doesn't support this flag. This moves the Clang-specific flag from the target's cpp_compiler_flags to be applied only during Clang compilation via the BUILD_MODE.bzl file, allowing the target to build successfully with both compilers.

Test Plan:
```
buck build fbcode//mode/dbg-gcc fbcode//infra_asic_fpga/validation/common_gen2:vhappi_common_backend
buck build fbcode//mode/dev fbcode//infra_asic_fpga/validation/common_gen2:vhappi_common_backend
```
Both builds succeed.

Reviewed By: ghua-fb

Differential Revision: D88011353

Pull Request resolved: https://github.com/pytorch/pytorch/pull/169746
Approved by: https://github.com/drisspg
2025-12-08 19:26:31 +00:00
PyTorch MergeBot
a9dd532d99 Revert "Mempool use_on_oom order (#169699)"
This reverts commit 05e06deaca.

Reverted https://github.com/pytorch/pytorch/pull/169699 on behalf of https://github.com/dsjohns2 due to Failing internal test due to shadow variables. Will reland with fix. ([comment](https://github.com/pytorch/pytorch/pull/169699#issuecomment-3628386257))
2025-12-08 18:15:38 +00:00
Dan Johnson
05e06deaca Mempool use_on_oom order (#169699)
Reorder OOM mitigation steps so that we reuse opt-in (`use_on_oom`) mempools before the more expensive step of releasing cached blocks.

Additionally, make sure mempools are removed from use_on_oom_pools upon deletion.
New test before fix:
```
======================================================================
ERROR: test_deleted_mempool_not_used_on_oom (__main__.TestMemPool.test_deleted_mempool_not_used_on_oom)
Test that a deleted mempool with use_on_oom=True is properly removed from use_on_oom_pools.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/danielsjohnson/oss_pytorch/pytorch/torch/testing/_internal/common_utils.py", line 3325, in wrapper
    method(*args, **kwargs)
  File "/home/danielsjohnson/oss_pytorch/pytorch/test/test_cuda.py", line 5696, in test_deleted_mempool_not_used_on_oom
    c = torch.randn(20 * nelem_1mb, device="cuda")
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: it->second->use_count > 0 INTERNAL ASSERT FAILED at "/home/danielsjohnson/oss_pytorch/pytorch/c10/cuda/CUDACachingAllocator.cpp":2700, please report a bug to PyTorch.
To execute this test, run the following from the base repo dir:
    python test/test_cuda.py TestMemPool.test_deleted_mempool_not_used_on_oom
This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
----------------------------------------------------------------------
Ran 1 test in 0.691s
FAILED (errors=1)
Segmentation fault (core dumped)
```

New test after fix:
```
----------------------------------------------------------------------
Ran 1 test in 0.651s
OK
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/169699
Approved by: https://github.com/ngimel, https://github.com/eqy
2025-12-06 05:45:45 +00:00
eqy
c44798d730 [CUDA][CUDA Graphs] Use count when allocating storage for edgeData (#169576)
`cudaGraphNodeGetDependencies` expects the pointer to `edgeData` to have enough storage for as many entries as `deps`

For #169390

Pull Request resolved: https://github.com/pytorch/pytorch/pull/169576
Approved by: https://github.com/eee4017, https://github.com/ngimel
2025-12-05 16:25:44 +00:00
Ke Wen
ca6cd47ec8 [SymmMem] Skip multicast init if any CUDA call fails (#168049)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/168049
Approved by: https://github.com/fduwjj
2025-12-05 16:06:11 +00:00
Pavan Balaji
a70d81a285 [pytorch] Add env variable to enable IPC for expandable segments (#169487)
Summary:
PyTorch's expandable segments IPC capability was disabled in fbcode
due to job failures (see
https://github.com/pytorch/pytorch/pull/132890).  However, some use
cases like CTran require IPC functionality for multi-process GPU
communication. This change introduces the
PYTORCH_CUDA_EXPANDABLE_SEGMENTS_IPC environment variable to allow
opt-in enablement of IPC handle types for expandable segments in
fbcode builds while maintaining backward compatibility.

IPC is enabled by default in non-fbcode builds and disabled by default
in fbcode builds (existing behavior). In both cases, it can be
explicitly enabled by setting PYTORCH_CUDA_EXPANDABLE_SEGMENTS_IPC=1.

Test Plan: CI

Differential Revision: D88274246

Pull Request resolved: https://github.com/pytorch/pytorch/pull/169487
Approved by: https://github.com/ngimel
2025-12-05 08:58:34 +00:00
Frank Lin
e64f1eece0 expandable_segments + memory pool (#169491)
Fixes #147851
Please also see #165419 and #148378

Pull Request resolved: https://github.com/pytorch/pytorch/pull/169491
Approved by: https://github.com/ngimel
2025-12-04 20:49:41 +00:00
Yu, Guangye
2e0c2e170f [xpu][feature] [1/2] Introduce XPUPluggableAllocator in cpp part (#168966)
# Motivation
This PR aims to introduce `XPUPluggableAllocator` and we make it as simple as possible. The follow-up PR would introduce the code related to the Python frontend part.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/168966
Approved by: https://github.com/gujinghui, https://github.com/EikanWang, https://github.com/eellison
2025-12-04 02:50:01 +00:00
can-gaa-hou
89e3bbcb5b [Accelerator] Add Accelerator Capabilities API (#165631)
# Motivation
There are several issues related to the data types and precision that an accelerator supports (see #165038 and #143112). Sometimes we have to look these capabilities up in the documentation and then hard-code them. This PR proposes a new unified API for users to query their accelerator's capabilities.

# Changes
This PR creates a new data structure `DeviceCapability` containing the capabilities that an accelerator commonly has:
- Supported data types (marked as supported by default):
  - `fp16`, `int32`, `complex`, etc.
- Other capabilities (to be discussed)

To access the structure, this PR defines a new Python API in the Accelerator module -- `get_device_capability`. It takes `device` as an input and returns a dictionary containing the capabilities (now we have `supported_dtypes` as the key).

# Usage
```python
>>> import torch
>>> import torch_openreg
>>> torch.accelerator.get_device_capability('openreg:0')
{'supported_dtypes': [torch.uint8, torch.int8, torch.int16, torch.int32, torch.int64, torch.float16, torch.float32, torch.float64, torch.complex32, torch.complex64, torch.complex128, torch.bool, torch.qint8, torch.quint8, torch.qint32, torch.bfloat16, torch.quint4x2, torch.quint2x4, torch.bits1x8, torch.bits2x4, torch.bits4x2, torch.bits8, torch.bits16, torch.float8_e5m2, torch.float8_e4m3fn, torch.float8_e5m2fnuz, torch.float8_e4m3fnuz, torch.uint16, torch.uint32, torch.uint64, torch.uint1, torch.uint2, torch.uint3, torch.uint4, torch.uint5, torch.uint6, torch.uint7, torch.int1, torch.int2, torch.int3, torch.int4, torch.int5, torch.int6, torch.int7, torch.float8_e8m0fnu, torch.float4_e2m1fn_x2]}
```
# TODO
- So far, precision (supported data types) is the only capability tracked, to my knowledge. More capabilities that accelerators have in common can be added, and the API should be designed to extend well.
- It will support other in-tree accelerators, such as **cuda** and **mps**.
- Clarify whether the capabilities are software or hardware supported. (By @guangyey )

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165631
Approved by: https://github.com/guangyey, https://github.com/albanD

Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com>
Co-authored-by: Jiawei Li <ljw1101.vip@gmail.com>
2025-12-03 21:37:30 +00:00
Kurt Mohler
8c73bbbb02 [MPS] Migrate clamp.Tensor_out to Metal (#169407)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/169407
Approved by: https://github.com/malfet
2025-12-03 19:36:00 +00:00
Frank Lin
d038b0130e The Nested Pool (#168382)
This PR fixes issue #161193 by simply reversing the iteration order over captures_underway.
After discussing with @galv, we decided to land this minimal fix first to unblock nested MemPool usage.

Long-term, the underlying infrastructure (e.g., captures_underway) still needs refactoring to clearly define the interaction between graph capture, MemPools, and threads. That broader cleanup will be addressed in #168137.
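
A hedged sketch of the nested-pool pattern this unblocks (the exact nesting semantics are an assumption for illustration):

```python
import torch

outer = torch.cuda.MemPool()
inner = torch.cuda.MemPool()
with torch.cuda.use_mem_pool(outer):
    a = torch.empty(1024, device="cuda")      # served from `outer`
    with torch.cuda.use_mem_pool(inner):
        b = torch.empty(1024, device="cuda")  # served from the nested `inner` pool
    c = torch.empty(1024, device="cuda")      # back to `outer` after the inner scope exits
```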

Pull Request resolved: https://github.com/pytorch/pytorch/pull/168382
Approved by: https://github.com/eqy, https://github.com/ngimel, https://github.com/galv
2025-12-03 18:38:01 +00:00
PyTorch MergeBot
6c261c6cb0 Revert "[Accelerator] Add Accelerator Capabilities API (#165631)"
This reverts commit 285779b162.

Reverted https://github.com/pytorch/pytorch/pull/165631 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it has a small bug when building this internally ([comment](https://github.com/pytorch/pytorch/pull/165631#issuecomment-3604616720))
2025-12-03 01:16:33 +00:00
can-gaa-hou
285779b162 [Accelerator] Add Accelerator Capabilities API (#165631)
# Motivation
There are several issues related to the data types and precision that an accelerator supports (see #165038 and #143112). Sometimes we have to look these capabilities up in the documentation and then hard-code them. This PR proposes a new unified API for users to query their accelerator's capabilities.

# Changes
This PR creates a new data structure `DeviceCapability` containing the capabilities that an accelerator commonly has:
- Supported data types (marked as supported by default):
  - `fp16`, `int32`, `complex`, etc.
- Other capabilities (to be discussed)

To access the structure, this PR defines a new Python API in the Accelerator module -- `get_device_capability`. It takes `device` as an input and returns a dictionary containing the capabilities (now we have `supported_dtypes` as the key).

# Usage
```python
>>> import torch
>>> import torch_openreg
>>> torch.accelerator.get_device_capability('openreg:0')
{'supported_dtypes': [torch.uint8, torch.int8, torch.int16, torch.int32, torch.int64, torch.float16, torch.float32, torch.float64, torch.complex32, torch.complex64, torch.complex128, torch.bool, torch.qint8, torch.quint8, torch.qint32, torch.bfloat16, torch.quint4x2, torch.quint2x4, torch.bits1x8, torch.bits2x4, torch.bits4x2, torch.bits8, torch.bits16, torch.float8_e5m2, torch.float8_e4m3fn, torch.float8_e5m2fnuz, torch.float8_e4m3fnuz, torch.uint16, torch.uint32, torch.uint64, torch.uint1, torch.uint2, torch.uint3, torch.uint4, torch.uint5, torch.uint6, torch.uint7, torch.int1, torch.int2, torch.int3, torch.int4, torch.int5, torch.int6, torch.int7, torch.float8_e8m0fnu, torch.float4_e2m1fn_x2]}
```
# TODO
- So far, precision (supported data types) is the only capability tracked, to my knowledge. More capabilities that accelerators have in common can be added, and the API should be designed to extend well.
- It will support other in-tree accelerators, such as **cuda** and **mps**.
- Clarify whether the capabilities are software or hardware supported. (By @guangyey )

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165631
Approved by: https://github.com/guangyey, https://github.com/albanD

Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com>
Co-authored-by: Jiawei Li <ljw1101.vip@gmail.com>
2025-12-02 15:14:30 +00:00
Yu, Guangye
5f0030ba63 [xpu][fix] Support xpu custom raw_alloc/delete in caching allocator (#168957)
# Motivation
Memory Pool needs to support the custom `raw_alloc` and `raw_delete` from a custom allocator.

# Solution
When the custom allocator is provided in the memory pool, use its `raw_alloc` and `raw_delete`. Otherwise, use the `sycl::aligned_alloc_device` and `sycl::free` from SYCL runtime.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/168957
Approved by: https://github.com/EikanWang, https://github.com/gujinghui
ghstack dependencies: #168956
2025-12-01 16:47:05 +00:00
PyTorch MergeBot
7d2a33e4eb Revert "[Accelerator] Add Accelerator Capabilities API (#165631)"
This reverts commit c8210e7d94.

Reverted https://github.com/pytorch/pytorch/pull/165631 on behalf of https://github.com/pytorch-auto-revert due to Reverted automatically by pytorch's autorevert, to avoid this behaviour add the tag autorevert: disable ([comment](https://github.com/pytorch/pytorch/pull/165631#issuecomment-3595505779))
2025-12-01 09:35:13 +00:00
can-gaa-hou
c8210e7d94 [Accelerator] Add Accelerator Capabilities API (#165631)
# Motivation
There are several issues related to the data types and precision that an accelerator supports (see #165038 and #143112). Sometimes we have to look these capabilities up in the documentation and then hard-code them. This PR proposes a new unified API for users to query their accelerator's capabilities.

# Changes
This PR creates a new data structure `DeviceCapability` containing the capabilities that an accelerator commonly has:
- Supported data types (marked as supported by default):
  - `fp16`, `int32`, `complex`, etc.
- Other capabilities (to be discussed)

To access the structure, this PR defines a new Python API in the Accelerator module -- `get_device_capability`. It takes `device` as an input and returns a dictionary containing the capabilities (now we have `supported_dtypes` as the key).

# Usage
```python
>>> import torch
>>> import torch_openreg
>>> torch.accelerator.get_device_capability('openreg:0')
{'supported_dtypes': [torch.uint8, torch.int8, torch.int16, torch.int32, torch.int64, torch.float16, torch.float32, torch.float64, torch.complex32, torch.complex64, torch.complex128, torch.bool, torch.qint8, torch.quint8, torch.qint32, torch.bfloat16, torch.quint4x2, torch.quint2x4, torch.bits1x8, torch.bits2x4, torch.bits4x2, torch.bits8, torch.bits16, torch.float8_e5m2, torch.float8_e4m3fn, torch.float8_e5m2fnuz, torch.float8_e4m3fnuz, torch.uint16, torch.uint32, torch.uint64, torch.uint1, torch.uint2, torch.uint3, torch.uint4, torch.uint5, torch.uint6, torch.uint7, torch.int1, torch.int2, torch.int3, torch.int4, torch.int5, torch.int6, torch.int7, torch.float8_e8m0fnu, torch.float4_e2m1fn_x2]}
```
# TODO
- So far, precision (supported data types) is the only capability tracked, to my knowledge. More capabilities that accelerators have in common can be added, and the API should be designed to extend well.
- It will support other in-tree accelerators, such as **cuda** and **mps**.
- Clarify whether the capabilities are software or hardware supported. (By @guangyey )

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165631
Approved by: https://github.com/fffrog, https://github.com/guangyey, https://github.com/albanD

Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com>
Co-authored-by: Jiawei Li <ljw1101.vip@gmail.com>
2025-12-01 06:09:15 +00:00
linhaifeng
7bc2a66ded [CUDA][BugFix] fix truncated error messages (#168942)
Inspired by #168369

I found 9a38bb8622/c10/core/Device.h (L19)

When device indices (DeviceIndex) with value 0 are passed to TORCH_CHECK macros, they are interpreted as string terminators (`\0`), causing error messages to be truncated.

For example:

```cpp
#include <iostream>
#include <sstream>
#include <cstring>
#include <cstdint>

int8_t device = 0;

int main() {
    std::stringstream a, b, c;
    a << "Head" << device << "Tail";                    // streams the raw byte 0 (a null terminator)
    b << "Head" << static_cast<int>(device) << "Tail";  // streams the digit '0'
    c << "Head" << +device << "Tail";                   // unary + promotes int8_t to int
    std::cout << std::strlen(a.str().c_str()) << std::endl;
    std::cout << std::strlen(b.str().c_str()) << std::endl;
    std::cout << std::strlen(c.str().c_str()) << std::endl;
    return 0;
}

```

output

```bash
4
9
9
```

Maybe we can use `+` instead of `static_cast<int>`, but it needs discussion.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/168942
Approved by: https://github.com/cyyever, https://github.com/eqy
2025-11-28 04:58:46 +00:00
Yuanyuan Chen
9cd055e547 [2/N] Remove unused header inclusion (#165831)
Remove unused header inclusion in JIT code and other locations.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165831
Approved by: https://github.com/ngimel, https://github.com/albanD
2025-11-28 00:55:01 +00:00
Yu, Guangye
7c350369a7 [xpu][fix] Refine memory pool logic when expandable segement enabled (#168956)
# Motivation
This is a bug in the interaction between the memory pool for XPUGraph and expandable segments.
When `unmap_block` is called, the allocator decreases `allocation_count` as expected:
(see lines 862–867)
265397e178/c10/xpu/XPUCachingAllocator.cpp (L862-L867)

However, when an expandable segment is created via `try_allocate_expandable_block`, we never increment `allocation_count`.
As a result, `allocation_count` can drop below its correct value after unmapping.

# Solution
This patch fixes the issue by ensuring `allocation_count` is incremented when creating a new expandable segment.

# Additional Context
PyTorch currently does not support using a custom allocator together with the expandable-segment feature in the memory pool. Therefore, we add an assertion to fail fast when this unsupported condition is detected.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/168956
Approved by: https://github.com/EikanWang
2025-11-25 01:10:54 +00:00
PyTorch MergeBot
33d4cf4fcb Revert "Move CUDAEvent to c10 (#158219)"
This reverts commit 4909fd89dc.

Reverted https://github.com/pytorch/pytorch/pull/158219 on behalf of https://github.com/jeffdaily due to broke ROCm dynamo inductor benchmarks on ciflow/inductor-periodic label which wasn't run by default for this PR ([comment](https://github.com/pytorch/pytorch/pull/158219#issuecomment-3572110617))
2025-11-24 18:18:31 +00:00
Daniel Galvez
9a38bb8622 [CUDA] Fix truncated error messages in cudaMallocAsync Allocator (#168369)
Previously, these error messages would get truncated when they were hit on device 0, because the device index is a "char" (actually an int8_t), and therefore a value of 0 is interpreted as the null byte that terminates a string. Essentially, it is the same issue as https://github.com/pytorch/pytorch/pull/123984.

There's something strange in the TORCH_CHECK_WITH macro that is causing this. I don't feel like figuring out those obscure macro details right now, though.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/168369
Approved by: https://github.com/eqy
2025-11-24 00:03:19 +00:00
Yu, Guangye
4909fd89dc Move CUDAEvent to c10 (#158219)
# Motivation
When I refactored the caching allocator, I noticed that there are two separate implementations of `EventPool`: one in [aten/cuda/CachingHostAllocator.cpp](0f21fa84fb/aten/src/ATen/cuda/CachingHostAllocator.cpp (L23)) and another in [c10/cuda/CUDACachingAllocator](0f21fa84fb/c10/cuda/CUDACachingAllocator.cpp (L869)). I would like to refactor these so that they share a single implementation.
To achieve this, I have to move `aten/cuda/CUDAEvent.h` to `c10/cuda`, which I understand is a big change. However, I think it makes sense conceptually: `CUDAStream` and `CUDAEvent` are both fundamental CUDA abstractions, and since `CUDAStream` is already in `c10/cuda`, placing `CUDAEvent` there as well seems reasonable for consistency.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158219
Approved by: https://github.com/albanD
2025-11-22 04:47:17 +00:00
Lucy Qiu
7717bbaccd Add template for add_overflows (#168035)
Summary: Check for non-uint64_t add overflows. See usage in D87115901.

Afterwards, update the pytorch pin in executorch and then land the security patch.

Test Plan: CI

Differential Revision: D87272275

Pull Request resolved: https://github.com/pytorch/pytorch/pull/168035
Approved by: https://github.com/larryliu0820
2025-11-21 18:37:23 +00:00
Frank Lin
8b0314d1a7 Fix edge-data handling in cudaGraphNodeGetDependencies for CUDA 13 in graph_capture_record_stream_reuse (#168305)
CUDA 13 introduced stricter behavior for querying graph edges with edge data.
According to the CUDA documentation for [cudaGraphNodeGetDependencies](https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__GRAPH.html#group__CUDART__GRAPH_1g94ee7ba53ade560483e9c5d06e8ef50d)

> If an edge has non-zero (non-default) edge data and edgeData is NULL, this API returns cudaErrorLossyQuery.
If edgeData is non-NULL, then pDependencies must also be non-NULL.

When a graph contains edge data, we must provide a non-NULL edgeData buffer during dependency queries. Otherwise CUDA 13 will raise a cudaErrorLossyQuery.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/168305
Approved by: https://github.com/eqy, https://github.com/ezyang
2025-11-21 07:16:06 +00:00
Yuanyuan Chen
d3ccb8f3d0 Remove c10::is_pod (#166383)
`c10::is_pod` is not used in OSS. New code should instead use `std::is_trivial`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166383
Approved by: https://github.com/albanD
2025-11-21 03:17:06 +00:00
Jagadish Krishnamoorthy
4887c46900 [ROCm] Fix HIP document url. (#168220)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/168220
Approved by: https://github.com/jeffdaily
2025-11-20 20:12:51 +00:00
arkadip-maitra
f4382d7f98 Fixes floor divide int min overflow issue (#166127)
Fixes #127804

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166127
Approved by: https://github.com/albanD
2025-11-20 17:27:50 +00:00