Raise XPU tolerances for bf16 ResNet & BotNet TorchBench (#170552)
Multiple TorchBench models on XPU fail accuracy tests because the numeric tolerances are too strict. Two contributing factors were identified:

1. A measurement-methodology change (PyTorch 2.6.0 enforcing cosine_similarity, https://github.com/pytorch/pytorch/blob/main/benchmarks/dynamo/common.py#L2227) surfaced limitations and increased the sensitivity of the error checks for phlippe_resnet.
2. BatchNorm decomposition noise (~1e-5 RMSE per BN in fp16) accumulates across iterations in botnet26t_256, pushing aggregate diffs beyond the current thresholds.

**Analysis**

- The phlippe_resnet failures reproduce on both CPU and XPU; fp16 already uses a higher tolerance, implying the bf16 thresholds are misaligned.
- Disabling BN decomposition brings botnet26t_256 outputs within tolerance; with decomposition enabled, cumulative numeric error is expected.
- CI health indicates the changes are non-disruptive; failures, where present, are unrelated to these PRs.

Fixes https://github.com/intel/torch-xpu-ops/issues/1799
Fixes https://github.com/intel/torch-xpu-ops/issues/1305

Pull Request resolved: https://github.com/pytorch/pytorch/pull/170552
Approved by: https://github.com/EikanWang, https://github.com/desertfire

Co-authored-by: Tomasz Bohutyn <tbohutyn@habana.ai>
Committed by: PyTorch MergeBot
Parent: 643d3a9676
Commit: 7d355795e4
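For intuition, here is a standalone sketch (not part of the commit; the uniform 2% deviation is illustrative, not measured from the failing models) of why raising the elementwise tolerance from 1e-2 to 4 * 1e-2 flips these checks from fail to pass, while a cosine-similarity view of the same outputs sees no error at all:

```python
import torch

ref = torch.randn(10_000)
res = 1.02 * ref  # uniform 2% relative deviation, standing in for accumulated BN noise

# Old threshold: 2% error exceeds rtol=1e-2, so the elementwise check fails.
print(torch.allclose(ref, res, rtol=1e-2, atol=0))      # False
# New threshold for the listed models: 2% error is within rtol=4e-2, so it passes.
print(torch.allclose(ref, res, rtol=4 * 1e-2, atol=0))  # True

# Cosine similarity is scale-invariant, so it stays at 1.0 regardless.
cos = torch.nn.functional.cosine_similarity(ref, res, dim=0)
print(f"{cos.item():.6f}")  # 1.000000
```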
@@ -71,6 +71,10 @@ REQUIRE_HIGHER_TOLERANCE = {
     "mobilenetv3_large_100",
 }
 
+REQUIRE_HIGHER_TOLERANCE_FP16_XPU = {
+    "botnet26t_256",
+}
+
 REQUIRE_HIGHER_TOLERANCE_AMP = {}
 
 REQUIRE_EVEN_HIGHER_TOLERANCE = {
@@ -366,6 +370,12 @@ class TimmRunner(BenchmarkRunner):
                 self.args.amp and name in REQUIRE_HIGHER_TOLERANCE_AMP
             ):
                 tolerance = 4 * 1e-2
+            elif (
+                name in REQUIRE_HIGHER_TOLERANCE_FP16_XPU
+                and self.args.float16
+                and current_device == "xpu"
+            ):
+                tolerance = 4 * 1e-2
             else:
                 tolerance = 1e-2
         return tolerance, cosine
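Restated outside the harness, the effect of the new elif branch is: a model in REQUIRE_HIGHER_TOLERANCE_FP16_XPU gets the relaxed 4 * 1e-2 tolerance only on the fp16-on-XPU path; every other configuration keeps 1e-2. A minimal sketch (the pick_tolerance helper is hypothetical, not the runner's actual method):

```python
REQUIRE_HIGHER_TOLERANCE_FP16_XPU = {"botnet26t_256"}

def pick_tolerance(name: str, device: str, float16: bool) -> float:
    # Mirrors the elif added above: relax only for listed models in fp16 on XPU.
    if name in REQUIRE_HIGHER_TOLERANCE_FP16_XPU and float16 and device == "xpu":
        return 4 * 1e-2
    return 1e-2

assert pick_tolerance("botnet26t_256", "xpu", float16=True) == 4 * 1e-2
assert pick_tolerance("botnet26t_256", "cuda", float16=True) == 1e-2   # other devices unchanged
assert pick_tolerance("botnet26t_256", "xpu", float16=False) == 1e-2   # non-fp16 unaffected here
```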
@@ -52,6 +52,7 @@ tolerance:
   # These models need higher tolerance for xpu devices with bf16
   higher_bf16_xpu:
     - squeezenet1_1
+    - phlippe_resnet
 
   freezing:
     # Similar logic to timm_models.py:get_tolerance_and_cosine_flag
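For completeness, a minimal sketch (the needs_higher_bf16_xpu_tolerance helper is hypothetical; the real lookup happens inside the TorchBench runner) of how a model name can be checked against the YAML list above:

```python
import yaml

# Inline copy of the relevant config fragment after this change.
CONFIG = yaml.safe_load("""
tolerance:
  higher_bf16_xpu:
    - squeezenet1_1
    - phlippe_resnet
""")

def needs_higher_bf16_xpu_tolerance(name: str) -> bool:
    return name in CONFIG["tolerance"]["higher_bf16_xpu"]

assert needs_higher_bf16_xpu_tolerance("phlippe_resnet")
assert not needs_higher_bf16_xpu_tolerance("resnet50")
```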