Fixes #167991
Example of the new warning message:
```
/home/guilhermel/git/pytorch313/torch/_dynamo/variables/functions.py:2159: UserWarning: Dynamo detected a call to a `functools.lru_cache`-wrapped function at 'script.py:12'. Dynamo ignores the cache wrapper and directly traces the wrapped function. Silent incorrectness is only a *potential* risk, not something we have observed. Enable TORCH_LOGS=+dynamo for a DEBUG stack trace.
This call originates from:
  File "/path/to/script.py", line 12, in bar
    return baz(x)
  torch._dynamo.utils.warn_once(msg)
```
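The `torch._dynamo.utils.warn_once` call at the bottom of the trace deduplicates warnings per message. A minimal self-contained sketch of that pattern (not the actual Dynamo implementation) looks like:

```python
import warnings

_warned: set = set()

def warn_once(msg: str) -> None:
    # Emit each distinct message at most once per process, mirroring the
    # deduplication behavior suggested by torch._dynamo.utils.warn_once.
    if msg not in _warned:
        _warned.add(msg)
        warnings.warn(msg, UserWarning, stacklevel=2)
```

Repeated calls with the same message therefore produce a single warning, so tracing the same `lru_cache`-wrapped function many times does not spam the log.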
Pull Request resolved: https://github.com/pytorch/pytorch/pull/171496
Approved by: https://github.com/Lucaskabela
## Summary
This PR updates the expected accuracy CSV files for inductor benchmarks based on CI results from PyTorch commit 3c98eef883.
These files serve as reference points for dynamo/inductor CI to track:
- Graph breaks
- Model accuracy
## Changes
- Updated CUDA expected accuracy files in `benchmarks/dynamo/ci_expected_accuracy/`
- Updated ROCm expected accuracy files in `benchmarks/dynamo/ci_expected_accuracy/rocm/`
## Test Plan
- [ ] Verify that the CI jobs pass with the updated expected accuracy files
- [ ] Review the diff to ensure changes are reasonable and expected
- [ ] Check that no unexpected regressions are being marked as "expected"
Pull Request resolved: https://github.com/pytorch/pytorch/pull/171533
Approved by: https://github.com/jataylo, https://github.com/atalman
**Summary:** Currently, whenever we subtract two partial DTensors, we redistribute, since linearity is -1 for `aten.sub.Tensor`. However, this redistribution is unnecessary and can be avoided in the same way as for its `add` counterpart. I moved the op to `linear_ops` and ensured that subtracting a scalar from a partial DTensor continues to redistribute.
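To see why subtraction is linear over partial placements: a partial tensor's logical value is the sum of its per-rank shards, so shard-wise subtraction commutes with the reduction, while subtracting a scalar does not. A toy sketch with plain Python numbers (no DTensor API involved) illustrates this:

```python
# A "partial" tensor's logical value is the sum of its per-rank shards.
a_shards = [1.0, 2.0, 3.0]  # logical value: 6.0
b_shards = [0.5, 0.5, 1.0]  # logical value: 2.0

# Subtracting shard-wise and then reducing matches reducing first and then
# subtracting, so partial - partial needs no redistribution.
shardwise = sum(x - y for x, y in zip(a_shards, b_shards))
reduced_first = sum(a_shards) - sum(b_shards)

# A scalar, however, would be subtracted once per shard, over-counting it
# by (num_shards - 1) * scalar -- so partial - scalar must still redistribute.
scalar = 1.0
overcounted = sum(x - scalar for x in a_shards)
correct = sum(a_shards) - scalar
```

Here `shardwise` and `reduced_first` agree (both 4.0), while `overcounted` (3.0) disagrees with `correct` (5.0), matching the scalar carve-out described above.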
**Test Cases:**
1. pytest test/distributed/tensor/test_pointwise_ops.py -k test_add_sub_scalar_norm_partial
2. pytest test/distributed/tensor/test_pointwise_ops.py -k test_add_sub_scalar_partial
Pull Request resolved: https://github.com/pytorch/pytorch/pull/170040
Approved by: https://github.com/wconstab
ghstack dependencies: #170030, #170035
We will eventually remove `current_tx` in favor of directly passing it to VTs. We also eventually intend to change call sites involving TXes so that the leaf TX is always passed. Currently, this is inconsistent, since `InstructionTranslator.current_tx()` returns the root TX.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/170234
Approved by: https://github.com/guilhermeleobas
Fixes #168862
Previously, the test broke on the MI300X at this assertion:
```cpp
TORCH_INTERNAL_ASSERT(2 * BLOCK_THREADS >= grid_size);
```
because the MI300X has 304 SMs, so `grid_size` would be set to at least 304 while the number of threads within a block would be 128 for 16-byte types (2 * 128 = 256, which is not >= 304).
The reason for this assertion seems to be that the kernel performed a simple reduction over the block aggregates: each thread held one block's aggregate, and if there were fewer threads per block than the number of blocks, each thread would add up two aggregates (hence the assertion as a safeguard). Changing the conditional to a loop should incur minimal overhead, since it executes at most one more time per thread than the old behavior.
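The conditional-to-loop change can be sketched in Python, with each list element standing in for one block's aggregate (function and variable names are illustrative, not the kernel's):

```python
def reduce_block_aggregates(aggregates, block_threads):
    # Each of block_threads "threads" accumulates block aggregates.
    # Old behavior: thread t read aggregates[t] and conditionally
    # aggregates[t + block_threads], requiring 2 * block_threads >= grid_size.
    # New behavior: thread t strides over the list in steps of block_threads,
    # so any grid size works (e.g. 304 blocks on MI300X with 128 threads).
    partials = [0] * block_threads
    for t in range(block_threads):
        i = t
        while i < len(aggregates):  # loop replaces the single conditional add
            partials[t] += aggregates[i]
            i += block_threads
    return sum(partials)
```

With 304 aggregates and 128 threads, each thread now performs at most three additions instead of failing the `2 * BLOCK_THREADS >= grid_size` check.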
Pull Request resolved: https://github.com/pytorch/pytorch/pull/170763
Approved by: https://github.com/jeffdaily
Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
Previously, `auto_chunker.num_chunks` could not be overridden via the `options` argument of `torch.compile` due to a missing type annotation. The config's type was inferred from its default value, `None`, so overriding it with an integer during compilation triggered a type mismatch and failed.
Adding type annotations fixes that.
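A minimal sketch of the failure mode, using a hypothetical `ConfigEntry` class rather than the real config machinery: when no annotation is given, the accepted type is inferred from the default value, so a `None` default rejects integer overrides until an explicit `Optional[int]` annotation is supplied.

```python
from typing import Optional, Union, get_args, get_origin

class ConfigEntry:
    """Hypothetical config entry; the real torch config machinery differs."""

    def __init__(self, default, annotation=None):
        self.default = default
        # With no explicit annotation, the accepted type is inferred from
        # the default value -- type(None) when the default is None.
        self.annotation = annotation if annotation is not None else type(default)

    def override(self, value):
        ann = self.annotation
        # Optional[int] is Union[int, None]; unpack it into a tuple of types.
        allowed = get_args(ann) if get_origin(ann) is Union else (ann,)
        if not isinstance(value, allowed):
            raise TypeError(f"expected {ann}, got {type(value).__name__}")
        return value

# Inferred from the default None: an integer override is rejected.
unannotated = ConfigEntry(default=None)
# Annotated as Optional[int]: the same override succeeds.
annotated = ConfigEntry(default=None, annotation=Optional[int])
```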
Pull Request resolved: https://github.com/pytorch/pytorch/pull/171477
Approved by: https://github.com/v0i0, https://github.com/eellison
ghstack dependencies: #171359
[No functional change] Refactor the `redistribute_cost` code by extracting the logic that computes the cost of a single collective op into `_compute_placement_transition_cost`. This lets `DTensorRedistributePlanner` reuse the same single-collective-op cost from `_collective_utils.py` when traversing the graph. Below is how the call stack will look:
```
DTensorRedistributePlanner --> one_step_redistribute_cost ----------------|
| -----> _compute_placement_transition_cost
|
redistribute_cost ---> DTensorRedistributePlanner (get transform_infos)---|
```
Without the refactor, `redistribute_cost` and `DTensorRedistributePlanner` would call each other circularly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/170108
Approved by: https://github.com/mori360
ghstack dependencies: #170106, #170107
While looking into https://github.com/pytorch/pytorch/issues/169439, we noticed that `redistribute_cost` is incorrect when verifying the cost against the following condition:
```
redistribute_cost(SRC, DST) <= redistribute_cost(SRC, INT) + redistribute_cost(INT, DST) for all INT
```
The failing case is:
1. For SRC --> DST, the redistribution path is: `S(1)S(0)[0]S(0)[1]->S(1)S(0)R->S(1)[0]S(0)S(1)[1]->S(1)[0]RS(1)[1]->S(1)[0]S(1)[2]S(1)[1]`
The redistribute cost is then the sum of the following four step costs:
```
current=S(0), target=R, comm_bytes_gb=1.1920928955078125e-07, step_cost=7.2006796424717825
current=R, target=S(1), comm_bytes_gb=1.1920928955078125e-07, step_cost=0.0 <<<<<<<<<<<<<<<<<<< comm_bytes_gb incorrect
current=S(0), target=R, comm_bytes_gb=2.384185791015625e-07, step_cost=7.201359284943566 <<< mismatch with number 7.2006796424717825
current=R, target=S(1), comm_bytes_gb=2.384185791015625e-07, step_cost=0.0
```
2. For SRC --> INT, the redistribution path is: `S(1)S(0)[0]S(0)[1]->S(1)S(0)R->S(1)[0]S(0)S(1)[1]`
The redistribute cost is then the sum of the following two step costs:
```
current=S(0), target=R, comm_bytes_gb=1.1920928955078125e-07, step_cost=7.2006796424717825
current=R, target=S(1), comm_bytes_gb=1.1920928955078125e-07, step_cost=0.0
```
3. For INT --> DST, the redistribution path is: `S(1)[0]S(0)S(1)[1]->S(1)[0]RS(1)[1]->S(1)[0]S(1)[2]S(1)[1]`
The redistribute cost is then the sum of the following two step costs:
```
current=S(0), target=R, comm_bytes_gb=1.1920928955078125e-07, step_cost=7.2006796424717825
current=R, target=S(1), comm_bytes_gb=1.1920928955078125e-07, step_cost=0.0
```
As we can see, `redistribute_cost(SRC, DST) > redistribute_cost(SRC, INT) + redistribute_cost(INT, DST)` in this failing case. The difference comes from the conversion from `R` to `S(1)`, which produces an incorrect `comm_bytes_gb` for the subsequent cost computation.
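The violated inequality can be checked directly with the step costs from the logs above:

```python
# Step costs copied from the logs above.
src_to_dst = 7.2006796424717825 + 0.0 + 7.201359284943566 + 0.0
src_to_int = 7.2006796424717825 + 0.0
int_to_dst = 7.2006796424717825 + 0.0

# A correct cost model must satisfy the triangle-style condition
#   redistribute_cost(SRC, DST) <= redistribute_cost(SRC, INT)
#                                  + redistribute_cost(INT, DST)
# but the doubled comm_bytes_gb after the R -> S(1) step violates it here.
violated = src_to_dst > src_to_int + int_to_dst
```

Summing the logged values gives about 14.40204 for the direct path versus about 14.40136 via the intermediate, confirming the violation.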
Pull Request resolved: https://github.com/pytorch/pytorch/pull/170107
Approved by: https://github.com/mori360
ghstack dependencies: #170106
When the source and target placements are the same, we may still need to compute the `redistribute_cost`, because they can have different shard orders.
The test case is skipped in this PR, as #170109 will add a stricter test.
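As a toy illustration (hypothetical layout records, not the DTensor API): two layouts can carry identical placements yet shard the mesh dimensions in a different order, so a transformation between them still has a nonzero cost:

```python
# Both layouts shard tensor dims 0 and 1 over a 2-D mesh, but map mesh
# dims to tensor dims in the opposite order, so the physical data layout
# differs even though the placements tuples compare equal.
src = {"placements": ("S(0)", "S(1)"), "shard_order": (0, 1)}
dst = {"placements": ("S(0)", "S(1)"), "shard_order": (1, 0)}

same_placements = src["placements"] == dst["placements"]
needs_cost = src["shard_order"] != dst["shard_order"]
```

A placement-equality check alone would therefore short-circuit to zero cost and miss the required data movement.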
Pull Request resolved: https://github.com/pytorch/pytorch/pull/170106
Approved by: https://github.com/mori360
Replace the exception-based control-flow pattern with an attribute-based approach for `SkipCodeRecursiveException` and `RecompileLimitExceeded`.
Instead of using specific exception types for control flow, add a `frame_exec_strategy` attribute to `TorchDynamoException` that lets exceptions optionally specify how `convert_frame` should handle them.
Benefits:
- Cleaner separation of concerns (exceptions for errors, attributes for control flow)
- More flexible - any exception can specify a frame execution strategy
- Easier to extend - no need for new exception types for new strategies
- Better type safety with isinstance(e, exc.TorchDynamoException) check
Changes:
- torch/_dynamo/exc.py:
* Add frame_exec_strategy attribute to TorchDynamoException with documentation
* Remove SkipCodeRecursiveException and RecompileLimitExceeded classes
- torch/_dynamo/convert_frame.py:
* Remove imports of removed exception classes
* Replace isinstance checks with frame_exec_strategy attribute check
* Set frame_exec_strategy on Unsupported exception in recompile limit handler
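A minimal sketch of the new pattern, using the class and attribute names from the description (the real Dynamo code differs in detail):

```python
class TorchDynamoException(Exception):
    # Optional frame-execution strategy. convert_frame consults this
    # attribute instead of matching on dedicated exception subclasses.
    frame_exec_strategy = None

class Unsupported(TorchDynamoException):
    pass

def convert_frame(compile_fn):
    try:
        return compile_fn()
    except TorchDynamoException as e:
        # Attribute check replaces isinstance checks against the removed
        # SkipCodeRecursiveException / RecompileLimitExceeded classes.
        if e.frame_exec_strategy is not None:
            return e.frame_exec_strategy
        raise

def hit_recompile_limit():
    e = Unsupported("recompile limit exceeded")
    e.frame_exec_strategy = "skip_code_recursive"  # set at the raise site
    raise e
```

Any exception subclass can now opt into a strategy by setting the attribute, while exceptions without one propagate as ordinary errors.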
Pull Request resolved: https://github.com/pytorch/pytorch/pull/171358
Approved by: https://github.com/Lucaskabela, https://github.com/guilhermeleobas
ghstack dependencies: #170587