Raise XPU tolerances for bf16 ResNet & BotNet TorchBench (#170552)

Multiple TorchBench models on XPU fail accuracy tests because the numeric tolerances are too strict, not because of genuine correctness regressions. Two contributing factors were identified:

1. A measurement-methodology change (PyTorch 2.6.0 enforces cosine_similarity; see https://github.com/pytorch/pytorch/blob/main/benchmarks/dynamo/common.py#L2227) increased the sensitivity of the error check and surfaced its limitations for phlippe_resnet. A minimal sketch of the two check styles follows this list.
2. BatchNorm decomposition noise (~1e-5 RMSE per BN in fp16) accumulates across the network in botnet26t_256, pushing the aggregate difference beyond the current threshold.
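For context, here is a minimal sketch of the two accuracy-check styles involved in factor 1. This is illustrative only, not the harness code; the helper names and the simulated noise magnitude are assumptions:

```python
# Minimal sketch of the two accuracy-check styles (illustrative helpers,
# not the benchmark-harness implementation).
import torch

def elementwise_check(ref: torch.Tensor, res: torch.Tensor, tol: float) -> bool:
    # Elementwise check: every element must lie within rtol/atol of the reference.
    return torch.allclose(ref, res, rtol=tol, atol=tol)

def cosine_check(ref: torch.Tensor, res: torch.Tensor, tol: float) -> bool:
    # Global check: the flattened outputs' cosine similarity must be within tol of 1.0.
    cos = torch.nn.functional.cosine_similarity(ref.flatten(), res.flatten(), dim=0)
    return bool(cos > 1.0 - tol)

ref = torch.randn(1000)
res = ref + 1e-3 * torch.randn(1000)  # stand-in for low-precision numeric noise
print(elementwise_check(ref, res, 1e-2), cosine_check(ref, res, 1e-2))
```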

**Analysis**

- phlippe_resnet failures reproduce on both CPU and XPU; fp16 already uses a higher tolerance, implying the bf16 thresholds are misaligned.
- Disabling the BN decomposition brings botnet26t_256 outputs within tolerance; with the decomposition enabled, cumulative numeric error is expected (illustrated in the sketch after this list).
- CI health indicates the changes are non-disruptive; failures, where present, are unrelated to this PR.
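The per-BN noise behind factor 2 can be reproduced with a self-contained sketch. This is an illustration under simplifying assumptions (eval-style statistics, identity affine parameters), not the decomposition the compiler actually emits:

```python
# Illustrative per-BatchNorm noise in reduced precision: eval-style BN with
# identity affine parameters, written out in primitive ops. Not the
# compiler's actual decomposition.
import torch

def decomposed_bn(x: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    # (x - mean) / sqrt(var + eps), per channel, built from primitive ops.
    mean = x.mean(dim=(0, 2, 3), keepdim=True)
    var = ((x - mean) ** 2).mean(dim=(0, 2, 3), keepdim=True)
    return (x - mean) / torch.sqrt(var + eps)

torch.manual_seed(0)
x = torch.randn(8, 32, 16, 16)
ref = decomposed_bn(x)                 # float32 reference
low = decomposed_bn(x.half()).float()  # the same primitives in fp16
rmse = (ref - low).pow(2).mean().sqrt()
print(f"per-BN RMSE in fp16: {rmse.item():.1e}")  # small per layer; compounds across stacked BNs
```

Each BN in botnet26t_256 contributes an error of roughly this magnitude, so the aggregate diff grows with depth even though every individual layer is accurate.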

Fixes https://github.com/intel/torch-xpu-ops/issues/1799
Fixes https://github.com/intel/torch-xpu-ops/issues/1305

Pull Request resolved: https://github.com/pytorch/pytorch/pull/170552
Approved by: https://github.com/EikanWang, https://github.com/desertfire

Co-authored-by: Tomasz Bohutyn <tbohutyn@habana.ai>

```diff
@@ -71,6 +71,10 @@ REQUIRE_HIGHER_TOLERANCE = {
     "mobilenetv3_large_100",
 }
+REQUIRE_HIGHER_TOLERANCE_FP16_XPU = {
+    "botnet26t_256",
+}
 REQUIRE_HIGHER_TOLERANCE_AMP = {}
 REQUIRE_EVEN_HIGHER_TOLERANCE = {
@@ -366,6 +370,12 @@ class TimmRunner(BenchmarkRunner):
             self.args.amp and name in REQUIRE_HIGHER_TOLERANCE_AMP
         ):
             tolerance = 4 * 1e-2
+        elif (
+            name in REQUIRE_HIGHER_TOLERANCE_FP16_XPU
+            and self.args.float16
+            and current_device == "xpu"
+        ):
+            tolerance = 4 * 1e-2
         else:
             tolerance = 1e-2
         return tolerance, cosine
```
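The new branch deliberately mirrors the existing AMP special case: it reuses the 4 * 1e-2 tolerance but scopes it to fp16 runs of the listed models on XPU devices, leaving the default 1e-2 untouched for every other configuration.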

```diff
@@ -52,6 +52,7 @@ tolerance:
   # These models need higher tolerance for xpu devices with bf16
   higher_bf16_xpu:
     - squeezenet1_1
+    - phlippe_resnet
   freezing:
     # Similar logic to timm_models.py:get_tolerance_and_cosine_flag
```
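For completeness, a hedged sketch of how a tolerance table like this could be consumed. The reader below is hypothetical: the file path, `pick_tolerance` helper, and lookup are assumptions, and the real logic lives in the benchmark runners per the comment above:

```python
# Hypothetical consumer of a tolerance table like the one above; the real
# lookup lives in the TorchBench runner, per the YAML comment.
import yaml  # requires PyYAML

def pick_tolerance(cfg: dict, name: str, dtype: str, device: str) -> float:
    models = cfg.get("tolerance", {}).get("higher_bf16_xpu", []) or []
    if dtype == "bfloat16" and device == "xpu" and name in models:
        return 4e-2  # raised threshold, mirroring timm_models.py above
    return 1e-2      # default threshold

with open("config.yaml") as f:  # hypothetical path
    cfg = yaml.safe_load(f)
print(pick_tolerance(cfg, "phlippe_resnet", "bfloat16", "xpu"))
```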