tensorflow/third_party/xla/docs/errors_overview.md
Mateusz Sokół 4785243e41 PR #32628: DOC: errors doc pages
Imported from GitHub PR https://github.com/openxla/xla/pull/32628

📝 Summary of Changes

This PR introduces errors overview docs and error codes pages.

🎯 Justification

Errors and error code docs.

🚀 Kind of Contribution

📚 Documentation

Copybara import of the project:

--
27585e2606e334c5233784e6f105d2d3892f208f by Mateusz Sokół <mat646@gmail.com>:

Skeleton of errors doc pages

--
ed6133c7e1093c4fddf37c0f5f8dee27dbfbc705 by Mateusz Sokół <mat646@gmail.com>:

Remove all errors' details for separate PRs

--
3c463b91704a80205a7bbed346a4f6a83425d65b by Mateusz Sokół <mat646@gmail.com>:

Remove empty pages

--
cdff9f75fe0154929e737129e4190091cb54bb61 by Mateusz Sokół <mat646@gmail.com>:

Remove dangling error page

--
fed970962e5b38eb2ae977bf57d50eacd0f85acf by Mateusz Sokół <mat646@gmail.com>:

Removed unnecessary please

Merging this change closes #32628

PiperOrigin-RevId: 828881269
2025-11-06 05:04:38 -08:00


XLA Errors Overview

XLA errors are grouped by error source. Each source defines a list of additional context fields, beyond the error message itself, that are attached to every error in that category.

🚧 Note that this standardization effort is a work in progress, so not all error messages have an attached error code yet.

An example error log might look like:

XlaRuntimeError: RESOURCE_EXHAUSTED: XLA:TPU compile permanent error. Ran out of memory in memory space hbm. Used 49.34G of 32.00G hbm. Exceeded hbm capacity by 17.34G. Total hbm usage >= 49.34G: reserved 3.12M program unknown size arguments 49.34G

JaxRuntimeError: RESOURCE_EXHAUSTED: Ran out of memory in memory space vmem while allocating on stack for %ragged_latency_optimized_all_gather_lhs_contracting_gated_matmul_kernel.18 = bf16[2048,4096]{1,0:T(8,128)(2,1)} custom-call(%get-tuple-element.18273, %get-tuple-element.18274, %get-tuple-element.18275, %get-tuple-element.18276, %get-tuple-element.18277, /*index=5*/%bitcast.8695, %get-tuple-element.19201, %get-tuple-element.19202, %get-tuple-element.19203, %get-tuple-element.19204), custom_call_target=""
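Both sample lines share the shape `ExceptionType: STATUS_CODE: message`. The sketch below splits a line on that shape, which can help when grouping logs by status code. It assumes only what the two samples show; the layout is not a documented guarantee, and `ParsedError` and `ParseErrorLine` are hypothetical names, not part of XLA.

```cpp
#include <cstddef>
#include <string>

// Hypothetical container for the three parts visible in the sample logs.
struct ParsedError {
  std::string exception_type;  // e.g. "XlaRuntimeError"
  std::string status_code;     // e.g. "RESOURCE_EXHAUSTED"
  std::string message;         // everything after the status code
};

// Splits "ExceptionType: STATUS_CODE: message" at the first two ": "
// separators. Returns false if the line does not have that shape.
// Assumption: the layout matches the samples above; it is not guaranteed.
bool ParseErrorLine(const std::string& line, ParsedError* out) {
  const std::string sep = ": ";
  const std::size_t first = line.find(sep);
  if (first == std::string::npos) return false;
  const std::size_t second = line.find(sep, first + sep.size());
  if (second == std::string::npos) return false;

  out->exception_type = line.substr(0, first);
  out->status_code =
      line.substr(first + sep.size(), second - first - sep.size());
  out->message = line.substr(second + sep.size());
  return true;
}
```

Because the split looks for a colon followed by a space, a token such as `XLA:TPU` inside the message stays intact.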

Statuses and CHECK failures

In general, XLA flags corrupted execution through two mechanisms: statuses and CHECK macro failures.

Statuses are meant for non-fatal, recoverable errors. The function returns normally, and execution continues along a path where the caller explicitly checks the returned Status object. Statuses are useful for handling invalid user input or expected resource constraints.

CHECK failures, on the other hand, cover programmer errors or violations of invariants that should never occur if the code is correct. When a CHECK fails, the program logs the error message and terminates immediately. CHECKs are useful for ensuring internal consistency, such as verifying that a pointer is non-null before dereferencing it.
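The contrast between the two mechanisms can be sketched in a few lines of C++. This is an illustration only: `SketchStatus` and `SKETCH_CHECK` are hypothetical, simplified stand-ins written for this example, not the status type or CHECK macro that XLA itself uses, and `ReserveBytes` and `ReadValue` are made-up functions.

```cpp
#include <cstdio>
#include <cstdlib>
#include <string>

// Hypothetical, simplified stand-in for a status object.
struct SketchStatus {
  bool ok;
  std::string message;
};

// Hypothetical stand-in for a CHECK macro: if the condition is false,
// log the failed condition and terminate the program immediately.
#define SKETCH_CHECK(cond)                                             \
  do {                                                                 \
    if (!(cond)) {                                                     \
      std::fprintf(stderr, "Check failed: %s (%s:%d)\n", #cond,        \
                   __FILE__, __LINE__);                                \
      std::abort();                                                    \
    }                                                                  \
  } while (false)

// Recoverable case: the request depends on user input and may exceed the
// available capacity, so the problem is reported through the return value.
SketchStatus ReserveBytes(long requested, long capacity) {
  if (requested < 0) {
    return {false, "INVALID_ARGUMENT: negative size"};
  }
  if (requested > capacity) {
    return {false, "RESOURCE_EXHAUSTED: requested " +
                       std::to_string(requested) + " bytes, capacity " +
                       std::to_string(capacity)};
  }
  return {true, ""};
}

// Invariant case: a null pointer here means a bug in the caller, so the
// program stops instead of returning an error.
int ReadValue(const int* ptr) {
  SKETCH_CHECK(ptr != nullptr);
  return *ptr;
}
```

A caller of `ReserveBytes` is expected to inspect `ok` and decide how to proceed, for example by retrying with a smaller request. A caller of `ReadValue` has nothing to inspect: passing a null pointer ends the process.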

Error codes

Here is an index of all error codes.