# Motivation
Previously, XPU treated an explicit 1 MB allocation as a large-pool allocation. This PR aligns XPU’s behavior with other backends by treating a 1 MB allocation as a small-pool allocation, while preserving the existing behavior that allocations larger than 1 MB use the large pool.
This issue only occurs when a user explicitly allocates exactly 1 MB.
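As a hedged illustration (requires an XPU build and device, and assumes the stat keys follow the CUDA allocator's naming convention), an allocation of exactly 1 MB now lands in the small pool:
```python
import torch

# Exactly 1 MB (2**20 bytes here); with this change the request is served from
# the small pool, matching the behavior of other backends. Larger requests
# still go to the large pool.
x = torch.empty(1 << 20, dtype=torch.uint8, device="xpu")

stats = torch.xpu.memory_stats()
print(stats.get("allocation.small_pool.current"))
print(stats.get("allocation.large_pool.current"))
```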
# Additional Context
fix https://github.com/pytorch/pytorch/issues/171451
Pull Request resolved: https://github.com/pytorch/pytorch/pull/171453
Approved by: https://github.com/EikanWang
# Motivation
This PR adds support for `torch.accelerator.get_device_capability` on XPU. At the current stage, it reports a limited set of basic scalar data types, taking potential software emulation into account where native hardware support may not be available.
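A minimal usage sketch (assumes an XPU build and device; the reported set depends on the hardware and on software emulation):
```python
import torch

caps = torch.accelerator.get_device_capability("xpu:0")
# At this stage only a limited set of basic scalar dtypes is reported.
print(caps["supported_dtypes"])
```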
Pull Request resolved: https://github.com/pytorch/pytorch/pull/170747
Approved by: https://github.com/EikanWang
# Motivation
This PR introduces useful metrics into the XPU device statistics, including `stats.allocation`, `stats.segment`, `stats.active`, `inactive_split`, and `inactive_split_bytes`.
# Additional Context
The unit test will be added in https://github.com/pytorch/pytorch/pull/169442. It will check that these memory stats are consistent with `torch.xpu.memory_stats` and `torch.xpu.memory_snapshot`.
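As a hedged sketch of how these counters can be read from Python (assuming an XPU device and that the key layout mirrors `torch.cuda.memory_stats`):
```python
import torch

x = torch.randn(1024, 1024, device="xpu")
stats = torch.xpu.memory_stats()
for key in (
    "allocation.all.current",
    "segment.all.current",
    "active.all.current",
    "inactive_split.all.current",
    "inactive_split_bytes.all.current",
):
    print(key, stats.get(key))
```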
Pull Request resolved: https://github.com/pytorch/pytorch/pull/169812
Approved by: https://github.com/EikanWang
Both allocation to a CUDA graph's private pool via stream capture and
allocation to a memory pool in non-stream-captured code are supported.
In the case of stream capture, we refuse to reuse a host memory block
as soon as record_event() is called on that block. This is to prevent
a stream-captured CUDA kernel from reading different contents from a
memory block than it would read if, counterfactually, that CUDA
kernel were running eagerly on a CUDA stream.
See
https://github.com/pytorch/pytorch/pull/161583#issuecomment-3229885771
for elaboration.
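A hedged sketch of the kind of pattern this is meant to support (illustrative only, not the tests added here): a pinned host-to-device copy captured into a CUDA graph, where the host block must not be reused once record_event() has been called on it during capture.
```python
import torch

src = torch.randn(1 << 20, pin_memory=True)   # pinned host buffer
dst = torch.empty(1 << 20, device="cuda")

# Warm up on a side stream, the usual pattern before graph capture.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    dst.copy_(src, non_blocking=True)
torch.cuda.current_stream().wait_stream(s)

g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    dst.copy_(src, non_blocking=True)  # H2D copy captured into the graph

g.replay()
torch.cuda.synchronize()
```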
This is lacking test cases for pageable host memory copies. We must
make sure that record_event() does not fail in that case.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/167507
Approved by: https://github.com/eqy, https://github.com/eee4017, https://github.com/ngimel
The motivation is to fix `torch.signal.windows.kaiser` (see https://github.com/pytorch/pytorch/issues/164712), which started to fail because `c10::metal::pow(x, 2)` is no longer identical to `x * x`
- Redispatch to `reciprocal`/`sqrt`/`rsqrt` when scalar is `2.0`/`.5`/`-.5` respectively
- Implement pow operation for complex numbers using $a^b = e^{b \cdot \log(a)} = e^{b \cdot (\log(r) + i\theta)}$ if $a=r e^{i\theta}$ (a quick numerical check of this identity follows this list)
- This also fixes backward for `tanh`
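A quick numerical sanity check of the complex-power identity in plain Python (this exercises the math, not the Metal kernel):
```python
import cmath
import torch

a, b = complex(3.0, 4.0), complex(0.5, -1.25)
expected = cmath.exp(b * cmath.log(a))  # a**b = exp(b * log(a))
result = torch.pow(
    torch.tensor(a, dtype=torch.complex128),
    torch.tensor(b, dtype=torch.complex128),
).item()
assert abs(result - expected) < 1e-6
```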
Pull Request resolved: https://github.com/pytorch/pytorch/pull/170077
Approved by: https://github.com/Skylion007
# Motivation
This PR introduces memory snapshot functionality for the XPU caching allocator. Our design philosophy is to keep the implementation as simple as possible without unnecessary features. We will be able to extend the functionality in the future if real use cases arise. The `c10::xpu::XPUCachingAllocator::snapshot` API introduced in this PR will be leveraged by the Python frontend in a follow-up PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/169203
Approved by: https://github.com/albanD
ghstack dependencies: #168262, #169280
# Motivation
This PR introduces the `TraceEntry` structure and the `recordHistory` functionality to enable trace support in the XPU caching allocator. At this stage, the implementation is XPU-specific; however, we plan to refactor and consolidate this logic with the existing CUDA implementation to maximize code reuse and consistency across backends. The design principle is to remain minimal and straightforward while laying the foundation for future enhancements.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/168262
Approved by: https://github.com/EikanWang, https://github.com/albanD
In our use case, we use memory snapshots for on-demand GPU memory allocation analysis. We found that memory usage would increase significantly and not come back down after the analysis completed. One important reason is that the buffer recording the GPU memory allocation history is not released once recording finishes, so this PR releases that buffer.
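For context, a hedged sketch of the on-demand flow (parameter names may differ slightly across versions); with this change, disabling recording also frees the internal history buffer:
```python
import torch

torch.cuda.memory._record_memory_history(max_entries=100_000)

# ... run the workload under analysis ...
x = torch.randn(4096, 4096, device="cuda")

snapshot = torch.cuda.memory._snapshot()  # collect the allocation history

# Stop recording. After this PR, the buffer holding the recorded history is
# released here, so host memory no longer stays elevated once analysis is done.
torch.cuda.memory._record_memory_history(enabled=None)
```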
Pull Request resolved: https://github.com/pytorch/pytorch/pull/169650
Approved by: https://github.com/ngimel, https://github.com/guangyey
Fixes [T244199247](https://www.internalfb.com/intern/tasks/?t=244199247).
This PR adds a flag to skip specific events in the memory snapshot, which reduces trace file size and improves HTML viewer usability.
Docstring
```
skip_actions (list[str], optional): List of action types to skip when recording
memory history. This can be used to reduce memory overhead by excluding
certain types of events from being recorded. Valid action types are:
- `"alloc"`: Memory allocation events
- `"free_requested"`: Free requests (memory marked for freeing)
- `"free_completed"`: Completed free operations (memory actually freed)
- `"segment_alloc"`: Segment allocation from cudaMalloc
- `"segment_free"`: Segment freed back to CUDA via cudaFree
- `"oom"`: Out-of-memory exceptions
- `"snapshot"`: Memory snapshot generation events
For example, to skip recording free_requested events:
`skip_actions=["free_requested"]`
```
Usage:
`torch.cuda.memory._record_memory_history(skip_actions=['free_requested'])`
The following screenshot shows that there are still free_completed events but no free_requested events.
![Screenshot: free_completed events are still present while free_requested events are absent](https://github.com/user-attachments/assets/2adbc21e-3982-4616-a6d2-08e21e4446ca)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/168183
Approved by: https://github.com/sraikund16
Reorder the OOM mitigation steps so that we reuse opt-in mempools before the more expensive step of releasing cached blocks.
Additionally, make sure mempools are removed from `use_on_oom_pools` upon deletion.
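A hedged sketch of the lifecycle the new test exercises: once a `use_on_oom=True` pool is deleted, the allocator must no longer consult it when falling back on OOM.
```python
import torch

# A pool whose cached blocks the allocator may reuse under OOM pressure.
pool = torch.cuda.MemPool(use_on_oom=True)
with torch.cuda.use_mem_pool(pool):
    a = torch.randn(1024 * 1024, device="cuda")

del a
del pool  # with this fix, the pool is also removed from use_on_oom_pools

# Subsequent allocations under memory pressure must not touch the deleted pool.
b = torch.randn(20 * 1024 * 1024, device="cuda")
```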
New test before fix:
```
======================================================================
ERROR: test_deleted_mempool_not_used_on_oom (__main__.TestMemPool.test_deleted_mempool_not_used_on_oom)
Test that a deleted mempool with use_on_oom=True is properly removed from use_on_oom_pools.
----------------------------------------------------------------------
Traceback (most recent call last):
File "/home/danielsjohnson/oss_pytorch/pytorch/torch/testing/_internal/common_utils.py", line 3325, in wrapper
method(*args, **kwargs)
File "/home/danielsjohnson/oss_pytorch/pytorch/test/test_cuda.py", line 5696, in test_deleted_mempool_not_used_on_oom
c = torch.randn(20 * nelem_1mb, device="cuda")
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: it->second->use_count > 0 INTERNAL ASSERT FAILED at "/home/danielsjohnson/oss_pytorch/pytorch/c10/cuda/CUDACachingAllocator.cpp":2700, please report a bug to PyTorch.
To execute this test, run the following from the base repo dir:
python test/test_cuda.py TestMemPool.test_deleted_mempool_not_used_on_oom
This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
----------------------------------------------------------------------
Ran 1 test in 0.691s
FAILED (errors=1)
Segmentation fault (core dumped)
```
New test after fix:
```
----------------------------------------------------------------------
Ran 1 test in 0.651s
OK
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/169699
Approved by: https://github.com/ngimel, https://github.com/eqy
Summary: Some build rules want to consume c10 from an external source. Convert it to an alias (like other pytorch entry points), and redirect conda builds to use the `fbcode//conda/buck_integration/toolchains/third-party:torch-cpp` target.
Test Plan: CI.
Reviewed By: athmasagar
Differential Revision: D81518888
Pull Request resolved: https://github.com/pytorch/pytorch/pull/169004
Approved by: https://github.com/malfet
Needed for https://github.com/pytorch/pytorch/issues/167822
ExecuTorch is still pinned to C++17 and will remain that way for the foreseeable future. We take a dependency on these headers, and pytorch/pytorch wants to bump to C++20. This is a pretty minimal way, recommended by @r-barnes, to avoid the issue.
Added a unit test with the version restricted, and added a new linter that I manually tested by modifying the files to use a C++20 feature and verifying that it failed as expected.
Previous version that went under review before I trashed the git history and broke the CLA job: https://github.com/pytorch/pytorch/pull/168204
Pull Request resolved: https://github.com/pytorch/pytorch/pull/169899
Approved by: https://github.com/larryliu0820
SymmMem creates one signal pad per memory allocation. As a result, it is undesirable for the CachingAllocator to let two tensors share the space of a single memory allocation; otherwise, operations on those two tensors would use the same signal pad, leading to incorrect synchronization.
This PR adds a `no_split` flag to MemPool so that the pool does not use the remaining part of a memory segment to allocate another block (tensor).
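A hypothetical usage sketch: the PR describes the flag on the C++ MemPool, so exposing it through the Python `torch.cuda.MemPool` constructor as shown here is an assumption.
```python
import torch

# Hypothetical no_split argument: each allocation gets its own segment, so no
# two tensors share one memory allocation (and hence one signal pad).
pool = torch.cuda.MemPool(no_split=True)
with torch.cuda.use_mem_pool(pool):
    t0 = torch.empty(1024, device="cuda")
    t1 = torch.empty(1024, device="cuda")  # not carved out of t0's segment
```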
Pull Request resolved: https://github.com/pytorch/pytorch/pull/169739
Approved by: https://github.com/eqy, https://github.com/ngimel
# Motivation
This PR documents `PYTORCH_ALLOC_CONF` in the docs and comments and reminds users that the previous environment variable `PYTORCH_CUDA_ALLOC_CONF` remains as its alias.
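For example (the option name shown is an existing allocator setting; the variable must be set before the allocator is initialized):
```python
import os

# Backend-agnostic spelling; PYTORCH_CUDA_ALLOC_CONF remains as an alias.
os.environ["PYTORCH_ALLOC_CONF"] = "max_split_size_mb:128"

import torch  # import after setting the variable so the allocator picks it up

x = torch.empty(1024, device="cuda")
```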
Pull Request resolved: https://github.com/pytorch/pytorch/pull/167659
Approved by: https://github.com/albanD
Summary: The vhappi_common_backend target was applying the Clang-specific warning flag -Wdeprecated-non-prototype to all compiler builds, causing GCC builds to fail since GCC doesn't support this flag. This moves the Clang-specific flag from the target's cpp_compiler_flags to be applied only during Clang compilation via the BUILD_MODE.bzl file, allowing the target to build successfully with both compilers.
Test Plan:
```
buck build fbcode//mode/dbg-gcc fbcode//infra_asic_fpga/validation/common_gen2:vhappi_common_backend
buck build fbcode//mode/dev fbcode//infra_asic_fpga/validation/common_gen2:vhappi_common_backend
```
Both builds succeed.
Reviewed By: ghua-fb
Differential Revision: D88011353
Pull Request resolved: https://github.com/pytorch/pytorch/pull/169746
Approved by: https://github.com/drisspg
Summary:
PyTorch's expandable segments IPC capability was disabled in fbcode
due to job failures (see
https://github.com/pytorch/pytorch/pull/132890). However, some use
cases like CTran require IPC functionality for multi-process GPU
communication. This change introduces the
PYTORCH_CUDA_EXPANDABLE_SEGMENTS_IPC environment variable to allow
opt-in enablement of IPC handle types for expandable segments in
fbcode builds while maintaining backward compatibility.
IPC is enabled by default in non-fbcode builds and disabled by default
in fbcode builds (existing behavior). In both cases, it can be
explicitly enabled by setting PYTORCH_CUDA_EXPANDABLE_SEGMENTS_IPC=1.
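A hedged usage sketch (mainly relevant for fbcode builds, where IPC is off by default; set the variables before the CUDA allocator is initialized):
```python
import os

os.environ["PYTORCH_CUDA_EXPANDABLE_SEGMENTS_IPC"] = "1"            # opt in to IPC handle types
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"  # use expandable segments

import torch

x = torch.empty(1 << 20, device="cuda")  # backed by IPC-capable expandable segments
```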
Test Plan: CI
Differential Revision: D88274246
Pull Request resolved: https://github.com/pytorch/pytorch/pull/169487
Approved by: https://github.com/ngimel
# Motivation
There are several issues related to the data types and precision that an accelerator supports (see #165038 and #143112). Sometimes we have to look up these capabilities in the documentation and then hard-code them. This PR proposes a new unified API for users to query their accelerator's capabilities.
# Changes
This PR creates a new data structure `DeviceCapability` containing the capabilities that an accelerator commonly has:
- Supported data types (set to supported by default):
  - `fp16`, `int32`, `complex`, etc.
- Other capabilities (to be discussed)
To access the structure, this PR defines a new Python API in the Accelerator module -- `get_device_capability`. It takes a `device` as input and returns a dictionary containing the capabilities (currently `supported_dtypes` is the only key).
# Usage
```python
>>> import torch
>>> import torch_openreg
>>> torch.accelerator.get_device_capability('openreg:0')
{'supported_dtypes': [torch.uint8, torch.int8, torch.int16, torch.int32, torch.int64, torch.float16, torch.float32, torch.float64, torch.complex32, torch.complex64, torch.complex128, torch.bool, torch.qint8, torch.quint8, torch.qint32, torch.bfloat16, torch.quint4x2, torch.quint2x4, torch.bits1x8, torch.bits2x4, torch.bits4x2, torch.bits8, torch.bits16, torch.float8_e5m2, torch.float8_e4m3fn, torch.float8_e5m2fnuz, torch.float8_e4m3fnuz, torch.uint16, torch.uint32, torch.uint64, torch.uint1, torch.uint2, torch.uint3, torch.uint4, torch.uint5, torch.uint6, torch.uint7, torch.int1, torch.int2, torch.int3, torch.int4, torch.int5, torch.int6, torch.int7, torch.float8_e8m0fnu, torch.float4_e2m1fn_x2]}
```
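The intent is to replace hard-coded dtype assumptions with a query; for instance (reusing the `openreg` device from the example above):
```python
caps = torch.accelerator.get_device_capability("openreg:0")
dtype = torch.bfloat16 if torch.bfloat16 in caps["supported_dtypes"] else torch.float32
```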
# TODO
- So far, data type and precision support is the only capability tracked, to my knowledge, but more common capabilities may be found, and the API should be designed for easy extension.
- Support will be extended to other in-tree accelerators, such as **cuda** and **mps**.
- Clarify whether the capabilities are software or hardware supported. (By @guangyey )
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165631
Approved by: https://github.com/guangyey, https://github.com/albanD
Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com>
Co-authored-by: Jiawei Li <ljw1101.vip@gmail.com>
This PR fixes issue #161193 by simply reversing the iteration order over captures_underway.
After discussing with @galv, we decided to land this minimal fix first to unblock nested MemPool usage.
Long-term, the underlying infrastructure (e.g., captures_underway) still needs refactoring to clearly define the interaction between graph capture, MemPools, and threads. That broader cleanup will be addressed in #168137.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/168382
Approved by: https://github.com/eqy, https://github.com/ngimel, https://github.com/galv
# Motivation
The memory pool needs to support the custom `raw_alloc` and `raw_delete` provided by a custom allocator.
# Solution
When a custom allocator is provided to the memory pool, use its `raw_alloc` and `raw_delete`; otherwise, use `sycl::aligned_alloc_device` and `sycl::free` from the SYCL runtime.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/168957
Approved by: https://github.com/EikanWang, https://github.com/gujinghui
ghstack dependencies: #168956
# Motivation
This PR fixes a bug in the interaction between the memory pool for XPUGraph and expandable segments.
When `unmap_block` is called, the allocator decreases `allocation_count` as expected (see 265397e178/c10/xpu/XPUCachingAllocator.cpp (L862-L867)).
However, when an expandable segment is created via `try_allocate_expandable_block`, we never increment `allocation_count`.
As a result, `allocation_count` can drop below its correct value after unmapping.
# Solution
This patch fixes the issue by ensuring `allocation_count` is incremented when creating a new expandable segment.
# Additional Context
PyTorch currently does not support using a custom allocator together with the expandable-segment feature in the memory pool. Therefore, we add an assertion to fail fast when this unsupported condition is detected.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/168956
Approved by: https://github.com/EikanWang
Previously, these error messages would get truncated when they were hit on device 0, because the device index is a "char" (actually, an int8_t), so a device index of 0 is interpreted as the null byte that terminates a string. Essentially, it is the same issue as https://github.com/pytorch/pytorch/pull/123984.
There's something strange in the TORCH_CHECK_WITH macro that is causing this. I don't feel like figuring out those obscure macro details right now, though.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/168369
Approved by: https://github.com/eqy
# Motivation
While refactoring the caching allocator, I noticed that there are two separate implementations of `EventPool`: one in [aten/cuda/CachingHostAllocator.cpp](0f21fa84fb/aten/src/ATen/cuda/CachingHostAllocator.cpp (L23)) and another in [c10/cuda/CUDACachingAllocator](0f21fa84fb/c10/cuda/CUDACachingAllocator.cpp (L869)). I would like to refactor these so that they share a single implementation.
To achieve this, I have to move `aten/cuda/CUDAEvent.h` to `c10/cuda`, which I understand is a big change. However, I think it makes sense conceptually: `CUDAStream` and `CUDAEvent` are both fundamental CUDA abstractions, and since `CUDAStream` is already in `c10/cuda`, placing `CUDAEvent` there as well seems reasonable for consistency.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158219
Approved by: https://github.com/albanD
Summary: Check for non-uint64_t add overflows. See usage in D87115901.
Afterwards, update the pytorch pin in executorch and then land the security patch.
Test Plan: CI
Differential Revision: D87272275
Pull Request resolved: https://github.com/pytorch/pytorch/pull/168035
Approved by: https://github.com/larryliu0820