# Motivation
Previously, XPU treated an explicit 1 MB allocation as a large-pool allocation. This PR aligns XPU’s behavior with other backends by treating a 1 MB allocation as a small-pool allocation, while preserving the existing behavior that allocations larger than 1 MB use the large pool.
This issue only occurs when a user explicitly allocates exactly 1 MB.
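As a hedged illustration (requires an XPU build and device, and assumes the stat keys follow the CUDA allocator's naming convention), an allocation of exactly 1 MB now lands in the small pool:
```python
import torch

# Exactly 1 MB (2**20 bytes here); with this change the request is served from
# the small pool, matching the behavior of other backends. Larger requests
# still go to the large pool.
x = torch.empty(1 << 20, dtype=torch.uint8, device="xpu")

stats = torch.xpu.memory_stats()
print(stats.get("allocation.small_pool.current"))
print(stats.get("allocation.large_pool.current"))
```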
# Additional Context
fix https://github.com/pytorch/pytorch/issues/171451
Pull Request resolved: https://github.com/pytorch/pytorch/pull/171453
Approved by: https://github.com/EikanWang
# Motivation
This PR adds support for `torch.accelerator.get_device_capability` on XPU. At the current stage, it reports a limited set of basic scalar data types, taking potential software emulation into account where native hardware support may not be available.
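A minimal usage sketch (assumes an XPU build and device; the reported set depends on the hardware and on software emulation):
```python
import torch

caps = torch.accelerator.get_device_capability("xpu:0")
# At this stage only a limited set of basic scalar dtypes is reported.
print(caps["supported_dtypes"])
```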
Pull Request resolved: https://github.com/pytorch/pytorch/pull/170747
Approved by: https://github.com/EikanWang
# Motivation
This PR introduces useful metrics into the XPU device statistics, including `stats.allocation`, `stats.segment`, `stats.active`, `inactive_split`, and `inactive_split_bytes`.
# Additional Context
The unit test will be added in https://github.com/pytorch/pytorch/pull/169442. It will check that these memory stats are consistent with `torch.xpu.memory_stats` and `torch.xpu.memory_snapshot`.
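As a hedged sketch of how these counters can be read from Python (assuming an XPU device and that the key layout mirrors `torch.cuda.memory_stats`):
```python
import torch

x = torch.randn(1024, 1024, device="xpu")
stats = torch.xpu.memory_stats()
for key in (
    "allocation.all.current",
    "segment.all.current",
    "active.all.current",
    "inactive_split.all.current",
    "inactive_split_bytes.all.current",
):
    print(key, stats.get(key))
```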
Pull Request resolved: https://github.com/pytorch/pytorch/pull/169812
Approved by: https://github.com/EikanWang
Both allocation to a CUDA graph's private pool via stream capture and
allocation to a memory pool in non-stream-captured code are supported.
In the case of stream capture, we refuse to reuse a host memory block
as soon as record_event() is called on that block. This is to prevent
a stream-captured CUDA kernel from reading different contents from a
memory block than it would read if, counterfactually, that CUDA
kernel were running eagerly on a CUDA stream.
See
https://github.com/pytorch/pytorch/pull/161583#issuecomment-3229885771
for elaboration.
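A hedged sketch of the kind of pattern this is meant to support (illustrative only, not the tests added here): a pinned host-to-device copy captured into a CUDA graph, where the host block must not be reused once record_event() has been called on it during capture.
```python
import torch

src = torch.randn(1 << 20, pin_memory=True)   # pinned host buffer
dst = torch.empty(1 << 20, device="cuda")

# Warm up on a side stream, the usual pattern before graph capture.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    dst.copy_(src, non_blocking=True)
torch.cuda.current_stream().wait_stream(s)

g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    dst.copy_(src, non_blocking=True)  # H2D copy captured into the graph

g.replay()
torch.cuda.synchronize()
```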
This is lacking test cases for pageable host memory copies. We must
make sure that record_event() does not fail in that case.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/167507
Approved by: https://github.com/eqy, https://github.com/eee4017, https://github.com/ngimel
The motivation is to fix `torch.signal.windows.kaiser` (see https://github.com/pytorch/pytorch/issues/164712), which started to fail because `c10::metal::pow(x, 2)` is no longer identical to `x * x`
- Redispatch to `reciprocal`/`sqrt`/`rsqrt` when scalar is `2.0`/`.5`/`-.5` respectively
- Implement pow operation for complex numbers using $a^b = e^{b \cdot \log(a)} = e^{b \cdot (\log(r) + i\theta)}$ if $a=r e^{i\theta}$ (a quick numerical check of this identity follows this list)
- This also fixes backward for `tanh`
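A quick numerical sanity check of the complex-power identity in plain Python (this exercises the math, not the Metal kernel):
```python
import cmath
import torch

a, b = complex(3.0, 4.0), complex(0.5, -1.25)
expected = cmath.exp(b * cmath.log(a))  # a**b = exp(b * log(a))
result = torch.pow(
    torch.tensor(a, dtype=torch.complex128),
    torch.tensor(b, dtype=torch.complex128),
).item()
assert abs(result - expected) < 1e-6
```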
Pull Request resolved: https://github.com/pytorch/pytorch/pull/170077
Approved by: https://github.com/Skylion007
# Motivation
This PR introduces memory snapshot functionality for the XPU caching allocator. Our design philosophy is to keep the implementation as simple as possible without unnecessary features. We will be able to extend the functionality in the future if real use cases arise. The `c10::xpu::XPUCachingAllocator::snapshot` API introduced in this PR will be leveraged by the Python frontend in a follow-up PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/169203
Approved by: https://github.com/albanD
ghstack dependencies: #168262, #169280
# Motivation
This PR introduces the `TraceEntry` structure and the `recordHistory` functionality to enable trace support in the XPU caching allocator. At this stage, the implementation is XPU-specific; however, we plan to refactor and consolidate this logic with the existing CUDA implementation to maximize code reuse and consistency across backends. The design principle is to remain minimal and straightforward while laying the foundation for future enhancements.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/168262
Approved by: https://github.com/EikanWang, https://github.com/albanD
In our use case, we use memory snapshots for on-demand GPU memory allocation analysis. We found that memory usage would increase significantly and not come back down after the analysis completed. One important reason is that the buffer recording the GPU memory allocation history is not released once recording finishes, so this PR releases that buffer.
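For context, a hedged sketch of the on-demand flow (parameter names may differ slightly across versions); with this change, disabling recording also frees the internal history buffer:
```python
import torch

torch.cuda.memory._record_memory_history(max_entries=100_000)

# ... run the workload under analysis ...
x = torch.randn(4096, 4096, device="cuda")

snapshot = torch.cuda.memory._snapshot()  # collect the allocation history

# Stop recording. After this PR, the buffer holding the recorded history is
# released here, so host memory no longer stays elevated once analysis is done.
torch.cuda.memory._record_memory_history(enabled=None)
```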
Pull Request resolved: https://github.com/pytorch/pytorch/pull/169650
Approved by: https://github.com/ngimel, https://github.com/guangyey
Fixes [T244199247](https://www.internalfb.com/intern/tasks/?t=244199247).
This PR adds a flag to skip specific events in the memory snapshot, which reduces trace file size and improves HTML viewer usability.
Docstring
```
skip_actions (list[str], optional): List of action types to skip when recording
memory history. This can be used to reduce memory overhead by excluding
certain types of events from being recorded. Valid action types are:
- `"alloc"`: Memory allocation events
- `"free_requested"`: Free requests (memory marked for freeing)
- `"free_completed"`: Completed free operations (memory actually freed)
- `"segment_alloc"`: Segment allocation from cudaMalloc
- `"segment_free"`: Segment freed back to CUDA via cudaFree
- `"oom"`: Out-of-memory exceptions
- `"snapshot"`: Memory snapshot generation events
For example, to skip recording free_requested events:
`skip_actions=["free_requested"]`
```
Usage:
`torch.cuda.memory._record_memory_history(skip_actions=['free_requested'])`
The following screenshot shows that there are still free_completed events but no free_requested events.
![Screenshot: free_completed events are still present while free_requested events are absent](https://github.com/user-attachments/assets/2adbc21e-3982-4616-a6d2-08e21e4446ca)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/168183
Approved by: https://github.com/sraikund16
Reorder the OOM mitigation steps so that we reuse opt-in mempools before the more expensive step of releasing cached blocks.
Additionally, make sure mempools are removed from `use_on_oom_pools` upon deletion.
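A hedged sketch of the lifecycle the new test exercises: once a `use_on_oom=True` pool is deleted, the allocator must no longer consult it when falling back on OOM.
```python
import torch

# A pool whose cached blocks the allocator may reuse under OOM pressure.
pool = torch.cuda.MemPool(use_on_oom=True)
with torch.cuda.use_mem_pool(pool):
    a = torch.randn(1024 * 1024, device="cuda")

del a
del pool  # with this fix, the pool is also removed from use_on_oom_pools

# Subsequent allocations under memory pressure must not touch the deleted pool.
b = torch.randn(20 * 1024 * 1024, device="cuda")
```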
New test before fix:
```
======================================================================
ERROR: test_deleted_mempool_not_used_on_oom (__main__.TestMemPool.test_deleted_mempool_not_used_on_oom)
Test that a deleted mempool with use_on_oom=True is properly removed from use_on_oom_pools.
----------------------------------------------------------------------
Traceback (most recent call last):
File "/home/danielsjohnson/oss_pytorch/pytorch/torch/testing/_internal/common_utils.py", line 3325, in wrapper
method(*args, **kwargs)
File "/home/danielsjohnson/oss_pytorch/pytorch/test/test_cuda.py", line 5696, in test_deleted_mempool_not_used_on_oom
c = torch.randn(20 * nelem_1mb, device="cuda")
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: it->second->use_count > 0 INTERNAL ASSERT FAILED at "/home/danielsjohnson/oss_pytorch/pytorch/c10/cuda/CUDACachingAllocator.cpp":2700, please report a bug to PyTorch.
To execute this test, run the following from the base repo dir:
python test/test_cuda.py TestMemPool.test_deleted_mempool_not_used_on_oom
This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
----------------------------------------------------------------------
Ran 1 test in 0.691s
FAILED (errors=1)
Segmentation fault (core dumped)
```
New test after fix:
```
----------------------------------------------------------------------
Ran 1 test in 0.651s
OK
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/169699
Approved by: https://github.com/ngimel, https://github.com/eqy
Summary: Some build rules want to consume c10 from an external source. Convert it to an alias (like other pytorch entry points), and redirect conda builds to use the `fbcode//conda/buck_integration/toolchains/third-party:torch-cpp` target.
Test Plan: CI.
Reviewed By: athmasagar
Differential Revision: D81518888
Pull Request resolved: https://github.com/pytorch/pytorch/pull/169004
Approved by: https://github.com/malfet
Needed for https://github.com/pytorch/pytorch/issues/167822
ExecuTorch is still pinned to C++17 and will remain that way for the foreseeable future. We take a dependency on these headers, and pytorch/pytorch wants to bump to C++20. This is a pretty minimal way, recommended by @r-barnes, to avoid the issue.
Added a unit test with the version restricted, and added a new linter that I manually tested by modifying the files to use a C++20 feature and verifying that it failed as expected.
Previous version that went under review before I trashed the git history and broke the CLA job: https://github.com/pytorch/pytorch/pull/168204
Pull Request resolved: https://github.com/pytorch/pytorch/pull/169899
Approved by: https://github.com/larryliu0820
SymmMem creates one signal pad per memory allocation. As a result, it is undesirable for the CachingAllocator to let two tensors share the space of a single memory allocation; otherwise, operations on those two tensors would use the same signal pad, leading to incorrect synchronization.
This PR adds a `no_split` flag to MemPool so that the pool does not use the remaining part of a memory segment to allocate another block (tensor).
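A hypothetical usage sketch: the PR describes the flag on the C++ MemPool, so exposing it through the Python `torch.cuda.MemPool` constructor as shown here is an assumption.
```python
import torch

# Hypothetical no_split argument: each allocation gets its own segment, so no
# two tensors share one memory allocation (and hence one signal pad).
pool = torch.cuda.MemPool(no_split=True)
with torch.cuda.use_mem_pool(pool):
    t0 = torch.empty(1024, device="cuda")
    t1 = torch.empty(1024, device="cuda")  # not carved out of t0's segment
```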
Pull Request resolved: https://github.com/pytorch/pytorch/pull/169739
Approved by: https://github.com/eqy, https://github.com/ngimel
# Motivation
This PR documents `PYTORCH_ALLOC_CONF` in the docs and comments and reminds users that the previous environment variable `PYTORCH_CUDA_ALLOC_CONF` remains as its alias.
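For example (the option name shown is an existing allocator setting; the variable must be set before the allocator is initialized):
```python
import os

# Backend-agnostic spelling; PYTORCH_CUDA_ALLOC_CONF remains as an alias.
os.environ["PYTORCH_ALLOC_CONF"] = "max_split_size_mb:128"

import torch  # import after setting the variable so the allocator picks it up

x = torch.empty(1024, device="cuda")
```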
Pull Request resolved: https://github.com/pytorch/pytorch/pull/167659
Approved by: https://github.com/albanD
Summary: The vhappi_common_backend target was applying the Clang-specific warning flag -Wdeprecated-non-prototype to all compiler builds, causing GCC builds to fail since GCC doesn't support this flag. This moves the Clang-specific flag from the target's cpp_compiler_flags to be applied only during Clang compilation via the BUILD_MODE.bzl file, allowing the target to build successfully with both compilers.
Test Plan:
```
buck build fbcode//mode/dbg-gcc fbcode//infra_asic_fpga/validation/common_gen2:vhappi_common_backend
buck build fbcode//mode/dev fbcode//infra_asic_fpga/validation/common_gen2:vhappi_common_backend
```
Both builds succeed.
Reviewed By: ghua-fb
Differential Revision: D88011353
Pull Request resolved: https://github.com/pytorch/pytorch/pull/169746
Approved by: https://github.com/drisspg
Summary:
PyTorch's expandable segments IPC capability was disabled in fbcode
due to job failures (see
https://github.com/pytorch/pytorch/pull/132890). However, some use
cases like CTran require IPC functionality for multi-process GPU
communication. This change introduces the
PYTORCH_CUDA_EXPANDABLE_SEGMENTS_IPC environment variable to allow
opt-in enablement of IPC handle types for expandable segments in
fbcode builds while maintaining backward compatibility.
IPC is enabled by default in non-fbcode builds and disabled by default
in fbcode builds (existing behavior). In both cases, it can be
explicitly enabled by setting PYTORCH_CUDA_EXPANDABLE_SEGMENTS_IPC=1.
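A hedged usage sketch (mainly relevant for fbcode builds, where IPC is off by default; set the variables before the CUDA allocator is initialized):
```python
import os

os.environ["PYTORCH_CUDA_EXPANDABLE_SEGMENTS_IPC"] = "1"            # opt in to IPC handle types
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"  # use expandable segments

import torch

x = torch.empty(1 << 20, device="cuda")  # backed by IPC-capable expandable segments
```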
Test Plan: CI
Differential Revision: D88274246
Pull Request resolved: https://github.com/pytorch/pytorch/pull/169487
Approved by: https://github.com/ngimel
# Motivation
There are several issues related to the data types and precision that an accelerator supports (see #165038 and #143112). Sometimes we have to look up these capabilities in the documentation and then hard-code them. This PR proposes a new unified API for users to query their accelerator's capabilities.
# Changes
This PR creates a new data structure `DeviceCapability` containing the capabilities that an accelerator commonly has:
- Supported data types (set to supported by default):
  - `fp16`, `int32`, `complex`, etc.
- Other capabilities (to be discussed)
To access the structure, this PR defines a new Python API in the Accelerator module -- `get_device_capability`. It takes a `device` as input and returns a dictionary containing the capabilities (currently `supported_dtypes` is the only key).
# Usage
```python
>>> import torch
>>> import torch_openreg
>>> torch.accelerator.get_device_capability('openreg:0')
{'supported_dtypes': [torch.uint8, torch.int8, torch.int16, torch.int32, torch.int64, torch.float16, torch.float32, torch.float64, torch.complex32, torch.complex64, torch.complex128, torch.bool, torch.qint8, torch.quint8, torch.qint32, torch.bfloat16, torch.quint4x2, torch.quint2x4, torch.bits1x8, torch.bits2x4, torch.bits4x2, torch.bits8, torch.bits16, torch.float8_e5m2, torch.float8_e4m3fn, torch.float8_e5m2fnuz, torch.float8_e4m3fnuz, torch.uint16, torch.uint32, torch.uint64, torch.uint1, torch.uint2, torch.uint3, torch.uint4, torch.uint5, torch.uint6, torch.uint7, torch.int1, torch.int2, torch.int3, torch.int4, torch.int5, torch.int6, torch.int7, torch.float8_e8m0fnu, torch.float4_e2m1fn_x2]}
```
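The intent is to replace hard-coded dtype assumptions with a query; for instance (reusing the `openreg` device from the example above):
```python
caps = torch.accelerator.get_device_capability("openreg:0")
dtype = torch.bfloat16 if torch.bfloat16 in caps["supported_dtypes"] else torch.float32
```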
# TODO
- So far, data type and precision support is the only capability tracked, to my knowledge, but more common capabilities may be found, and the API should be designed for easy extension.
- Support will be extended to other in-tree accelerators, such as **cuda** and **mps**.
- Clarify whether the capabilities are software or hardware supported. (By @guangyey )
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165631
Approved by: https://github.com/guangyey, https://github.com/albanD
Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com>
Co-authored-by: Jiawei Li <ljw1101.vip@gmail.com>
This PR fixes issue #161193 by simply reversing the iteration order over captures_underway.
After discussing with @galv, we decided to land this minimal fix first to unblock nested MemPool usage.
Long-term, the underlying infrastructure (e.g., captures_underway) still needs refactoring to clearly define the interaction between graph capture, MemPools, and threads. That broader cleanup will be addressed in #168137.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/168382
Approved by: https://github.com/eqy, https://github.com/ngimel, https://github.com/galv
# Motivation
The memory pool needs to support the custom `raw_alloc` and `raw_delete` provided by a custom allocator.
# Solution
When a custom allocator is provided to the memory pool, use its `raw_alloc` and `raw_delete`; otherwise, use `sycl::aligned_alloc_device` and `sycl::free` from the SYCL runtime.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/168957
Approved by: https://github.com/EikanWang, https://github.com/gujinghui
ghstack dependencies: #168956
# Motivation
This PR fixes a bug in the interaction between the memory pool for XPUGraph and expandable segments.
When `unmap_block` is called, the allocator decreases `allocation_count` as expected (see 265397e178/c10/xpu/XPUCachingAllocator.cpp (L862-L867)).
However, when an expandable segment is created via `try_allocate_expandable_block`, we never increment `allocation_count`.
As a result, `allocation_count` can drop below its correct value after unmapping.
# Solution
This patch fixes the issue by ensuring `allocation_count` is incremented when creating a new expandable segment.
# Additional Context
PyTorch currently does not support using a custom allocator together with the expandable-segment feature in the memory pool. Therefore, we add an assertion to fail fast when this unsupported condition is detected.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/168956
Approved by: https://github.com/EikanWang
Previously, these error messages would get truncated when they were hit on device 0, because the device index is a "char" (actually, an int8_t), so a device index of 0 is interpreted as the null byte that terminates a string. Essentially, it is the same issue as https://github.com/pytorch/pytorch/pull/123984.
There's something strange in the TORCH_CHECK_WITH macro that is causing this. I don't feel like figuring out those obscure macro details right now, though.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/168369
Approved by: https://github.com/eqy
# Motivation
While refactoring the caching allocator, I noticed that there are two separate implementations of `EventPool`: one in [aten/cuda/CachingHostAllocator.cpp](0f21fa84fb/aten/src/ATen/cuda/CachingHostAllocator.cpp (L23)) and another in [c10/cuda/CUDACachingAllocator](0f21fa84fb/c10/cuda/CUDACachingAllocator.cpp (L869)). I would like to refactor these so that they share a single implementation.
To achieve this, I have to move `aten/cuda/CUDAEvent.h` to `c10/cuda`, which I understand is a big change. However, I think it makes sense conceptually: `CUDAStream` and `CUDAEvent` are both fundamental CUDA abstractions, and since `CUDAStream` is already in `c10/cuda`, placing `CUDAEvent` there as well seems reasonable for consistency.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158219
Approved by: https://github.com/albanD
Summary: Check for non-uint64_t add overflows. See usage in D87115901.
Afterwards, update the pytorch pin in executorch and then land the security patch.
Test Plan: CI
Differential Revision: D87272275
Pull Request resolved: https://github.com/pytorch/pytorch/pull/168035
Approved by: https://github.com/larryliu0820