diff --git a/doc/tutorials/core/univ_intrin/univ_intrin.markdown b/doc/tutorials/core/univ_intrin/univ_intrin.markdown
index a80b6d4bd3..2f4c116ac7 100644
--- a/doc/tutorials/core/univ_intrin/univ_intrin.markdown
+++ b/doc/tutorials/core/univ_intrin/univ_intrin.markdown
@@ -7,7 +7,7 @@ Vectorizing your code using Universal Intrinsics {#tutorial_univ_intrin}
| | |
| -: | :- |
-| Compatibility | OpenCV >= 3.0 |
+| Compatibility | OpenCV >= 4.11 |
Goal
----
@@ -28,19 +28,16 @@ SIMD stands for **Single Instruction, Multiple Data**. SIMD Intrinsics allow the
Depending on what *Instruction Sets* your CPU supports, you may be able to use the different registers. To learn more, look [here](https://en.wikipedia.org/wiki/Instruction_set_architecture)
+### VLA
+VLA stands for **Vector Length Agnostic**.
+It describes instruction sets whose vector register width is chosen by each hardware implementation and discovered by the program at runtime, rather than being fixed at compile time.
+This allows a single binary to scale its performance across different CPUs of the same architecture (e.g., RVV or SVE).
+
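To make the VLA idea concrete, here is a plain C++ sketch (not OpenCV code) of the loop shape that vector-length-agnostic code relies on. The lane count is a runtime value, the main loop advances by that many elements, and a scalar tail handles whatever is left. In OpenCV the lane count of a register type is queried at runtime with `VTraits<v_float32>::vlanes()`; here the hypothetical parameter `lanes` stands in for it, and the inner loop emulates one vector addition.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical scalar model of a vector-length-agnostic loop (illustration only).
// `lanes` plays the role of the hardware vector length, which a VLA target
// only reveals at runtime (in OpenCV: VTraits<v_float32>::vlanes()).
std::vector<float> vla_add(const std::vector<float>& a,
                           const std::vector<float>& b,
                           std::size_t lanes)
{
    assert(a.size() == b.size());  // element-wise op: inputs must match in length
    const std::size_t n = a.size();
    std::vector<float> out(n);
    std::size_t i = 0;
    // Main loop: each iteration models one "vector" add of `lanes` elements.
    for (; lanes > 0 && i + lanes <= n; i += lanes)
        for (std::size_t k = 0; k < lanes; ++k)
            out[i + k] = a[i + k] + b[i + k];
    // Scalar tail for the elements that do not fill a whole register.
    for (; i < n; ++i)
        out[i] = a[i] + b[i];
    return out;
}
```

The result is identical for any value of `lanes`; only the split between "vector" iterations and tail iterations changes, which is exactly what lets one binary run on hardware with different register widths.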
Universal Intrinsics
--------------------
-OpenCVs universal intrinsics provides an abstraction to SIMD vectorization methods and allows the user to use intrinsics without the need to write system specific code.
-
-OpenCV Universal Intrinsics support the following instruction sets:
-* *128 bit* registers of various types support is implemented for a wide range of architectures including
- * x86(SSE/SSE2/SSE4.2),
- * ARM(NEON),
- * PowerPC(VSX),
- * MIPS(MSA).
-* *256 bit* registers are supported on x86(AVX2) and
-* *512 bit* registers are supported on x86(AVX512)
+OpenCV's universal intrinsics provide an abstraction over SIMD and VLA vectorization methods and allow users to use intrinsics without writing system-specific code.
+Supported SIMD/VLA technologies are detailed in @ref core_hal_intrin .
**We will now introduce the available structures and functions:**
* Register structures
@@ -150,33 +147,35 @@ Now that we know how registers work, let us look at the functions used for filli
The universal intrinsics set provides element wise binary and unary operations.
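+@note Since OpenCV 4.11, the C++ operator overloads (e.g., `+`, `*`) in Universal Intrinsics are deprecated in favor of explicit wrapper functions (e.g., `v_add`, `v_mul`) to ensure compatibility with VLA architectures, where the register types are compiler built-in sizeless types for which C++ operators cannot be overloaded.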
+See also: https://github.com/opencv/opencv/issues/27267
+
* **Arithmetics**: We can add, subtract, multiply and divide two registers element-wise. The registers must be of the same width and hold the same type. To multiply two registers, for example:
v_float32 a, b; // {a1, ..., an}, {b1, ..., bn}
- v_float32 c;
- c = a + b // {a1 + b1, ..., an + bn}
- c = a * b; // {a1 * b1, ..., an * bn}
+ v_float32 c = v_add(a, b); // {a1 + b1, ..., an + bn}
+ v_float32 d = v_mul(a, b); // {a1 * b1, ..., an * bn}
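The lane-wise semantics of `v_add` and `v_mul` can be modelled in plain C++ with a fixed-size array. This sketch assumes a 128-bit register holding four `float` lanes; the helpers `lanewise`, `lane_add` and `lane_mul` are hypothetical stand-ins, not part of OpenCV.

```cpp
#include <array>
#include <cstddef>
#include <functional>

// Hypothetical model of a 128-bit register: four float lanes.
using f32x4 = std::array<float, 4>;

// Apply a binary operation lane by lane, as v_add / v_mul do.
template <typename Op>
f32x4 lanewise(const f32x4& a, const f32x4& b, Op op)
{
    f32x4 r{};
    for (std::size_t i = 0; i < r.size(); ++i)
        r[i] = op(a[i], b[i]);
    return r;
}

// Illustrative stand-ins for v_add(a, b) and v_mul(a, b).
f32x4 lane_add(const f32x4& a, const f32x4& b) { return lanewise(a, b, std::plus<float>()); }
f32x4 lane_mul(const f32x4& a, const f32x4& b) { return lanewise(a, b, std::multiplies<float>()); }
```

With `a = {1, 2, 3, 4}` and `b = {5, 6, 7, 8}`, `lane_add` yields `{6, 8, 10, 12}` and `lane_mul` yields `{5, 12, 21, 32}`: lane `i` of the result depends only on lane `i` of each input.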
-* **Bitwise Logic and Shifts**: We can left shift or right shift the bits of each element of the register. We can also apply bitwise &, |, ^ and ~ operators between two registers element-wise:
+* **Bitwise Logic and Shifts**: We can shift the bits of each element of the register left or right. We can also apply bitwise AND, OR and XOR between two registers element-wise (`v_and`, `v_or`, `v_xor`), and bitwise NOT to a single register (`v_not`):
v_int32 as; // {a1, ..., an}
- v_int32 al = as << 2; // {a1 << 2, ..., an << 2}
- v_int32 bl = as >> 2; // {a1 >> 2, ..., an >> 2}
+ v_int32 al = v_shl(as, 2); // {a1 << 2, ..., an << 2}
+ v_int32 bl = v_shr(as, 2); // {a1 >> 2, ..., an >> 2}
v_int32 a, b;
- v_int32 a_and_b = a & b; // {a1 & b1, ..., an & bn}
+ v_int32 a_and_b = v_and(a, b); // {a1 & b1, ..., an & bn}
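As a plain C++ sketch of what `v_shl`, `v_shr` and `v_and` compute, the model below assumes a 128-bit register of four `int32` lanes; the `lane_*` helpers are hypothetical and only mirror the per-lane semantics.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical model of a 128-bit register holding four int32 lanes.
using i32x4 = std::array<std::int32_t, 4>;

// Stand-in for v_shl(a, n): shift every lane left by n bits.
// Keep lanes non-negative: left-shifting negative values is undefined before C++20.
i32x4 lane_shl(const i32x4& a, int n)
{
    assert(n >= 0 && n < 32);  // shift count must be smaller than the lane width
    i32x4 r{};
    for (std::size_t i = 0; i < r.size(); ++i) r[i] = a[i] << n;
    return r;
}

// Stand-in for v_shr(a, n): shift every lane right by n bits.
i32x4 lane_shr(const i32x4& a, int n)
{
    assert(n >= 0 && n < 32);
    i32x4 r{};
    for (std::size_t i = 0; i < r.size(); ++i) r[i] = a[i] >> n;
    return r;
}

// Stand-in for v_and(a, b): bitwise AND, lane by lane.
i32x4 lane_and(const i32x4& a, const i32x4& b)
{
    i32x4 r{};
    for (std::size_t i = 0; i < r.size(); ++i) r[i] = a[i] & b[i];
    return r;
}
```

For example, `lane_shl({1, 4, 12, 64}, 2)` gives `{4, 16, 48, 256}` and `lane_shr` with the same input gives `{0, 1, 3, 16}`.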
-* **Comparison Operators**: We can compare values between two registers using the <, >, <= , >=, == and != operators. Since each register contains multiple values, we don't get a single bool for these operations. Instead, for true values, all bits are converted to one (0xff for 8 bits, 0xffff for 16 bits, etc), while false values return bits converted to zero.
+* **Comparison Operators**: We can compare values between two registers using `v_lt` (<), `v_gt` (>), `v_le` (<=), `v_ge` (>=), `v_eq` (==) and `v_ne` (!=). Since each register contains multiple values, these operations do not return a single bool. Instead, each lane of the result has all of its bits set to one (0xff for 8 bits, 0xffff for 16 bits, etc.) where the comparison is true, and all of its bits set to zero where it is false.
// let us consider the following code is run in a 128-bit register
- v_uint8 a; // a = {0, 1, 2, ..., 15}
- v_uint8 b; // b = {15, 14, 13, ..., 0}
+ v_uint8 a; // a = {0, 1, 2, ..., 13, 14, 15}
+ v_uint8 b; // b = {15, 14, 13, ..., 2, 1, 0}
- v_uint8 c = a < b;
+ v_uint8 c = v_lt(a, b); // c = {255, 255, 255, ..., 0, 0, 0}
/*
let us look at the first 4 values in binary
@@ -192,7 +191,7 @@ The universal intrinsics set provides element wise binary and unary operations.
v_int32 a; // a = {1, 2, 3, 4, 5, 6, 7, 8}
v_int32 b; // b = {8, 7, 6, 5, 4, 3, 2, 1}
- v_int32 c = (a < b); // c = {-1, -1, -1, -1, 0, 0, 0, 0}
+ v_int32 c = v_lt(a, b); // c = {-1, -1, -1, -1, 0, 0, 0, 0}
/*
The true values are 0xffffffff, which in signed 32-bit integer representation is equal to -1.
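The all-ones/all-zeros convention is what makes comparison results usable as bit masks, for example with OpenCV's `v_select(mask, a, b)`. The plain C++ sketch below models this; `lane_lt` and `lane_select` are hypothetical stand-ins for `v_lt` and `v_select`, and the select is written as the classic `(mask & a) | (~mask & b)` blend.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <type_traits>

// Stand-in for v_lt(a, b): per lane, all bits set where a < b, zero elsewhere.
template <typename T, std::size_t N>
std::array<T, N> lane_lt(const std::array<T, N>& a, const std::array<T, N>& b)
{
    static_assert(std::is_integral<T>::value, "this model covers integer lanes only");
    std::array<T, N> r{};
    for (std::size_t i = 0; i < N; ++i)
        r[i] = a[i] < b[i] ? static_cast<T>(~T(0)) : T(0);  // 0xff.., or 0
    return r;
}

// Stand-in for v_select(mask, a, b): take a where the mask is all ones, b where it is zero.
template <typename T, std::size_t N>
std::array<T, N> lane_select(const std::array<T, N>& mask,
                             const std::array<T, N>& a,
                             const std::array<T, N>& b)
{
    std::array<T, N> r{};
    for (std::size_t i = 0; i < N; ++i)
        r[i] = static_cast<T>((mask[i] & a[i]) | (~mask[i] & b[i]));
    return r;
}
```

For `uint8` lanes a true result is stored as 255, while for `int32` lanes the same all-ones pattern reads as -1. Feeding `lane_lt(x, y)` back into `lane_select(mask, x, y)` computes a lane-wise minimum without any branches.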
diff --git a/modules/core/include/opencv2/core/hal/intrin_cpp.hpp b/modules/core/include/opencv2/core/hal/intrin_cpp.hpp
index 9c7922445f..756602c710 100644
--- a/modules/core/include/opencv2/core/hal/intrin_cpp.hpp
+++ b/modules/core/include/opencv2/core/hal/intrin_cpp.hpp
@@ -81,9 +81,26 @@ CV_CPU_OPTIMIZATION_HAL_NAMESPACE_BEGIN
"Universal intrinsics" is a types and functions set intended to simplify vectorization of code on
different platforms. Currently a few different SIMD extensions on different architectures are supported.
-128 bit registers of various types support is implemented for a wide range of architectures
-including x86(__SSE/SSE2/SSE4.2__), ARM(__NEON__), PowerPC(__VSX__), MIPS(__MSA__).
-256 bit long registers are supported on x86(__AVX2__) and 512 bit long registers are supported on x86(__AVX512__).
+
+OpenCV Universal Intrinsics support the following instruction sets:
+
+- *128-bit* registers of various types are supported on a wide range of architectures, including:
+ - x86(SSE/SSE2/SSE4.2)
+ - ARM(NEON): 64-bit float (64F) requires AArch64
+ - PowerPC(VSX)
+ - MIPS(MSA)
+ - LoongArch(LSX)
+ - RISC-V(RVV 0.7.1): fixed-length implementation
+ - WASM: 64-bit float (64F) is not supported
+- *256-bit* registers are supported on:
+ - x86(AVX2)
+ - LoongArch(LASX)
+- *512-bit* registers are supported on:
+ - x86(AVX512)
+- *Vector Length Agnostic (VLA)* registers are supported on:
+ - RISC-V(RVV 1.0)
+ - ARM(SVE/SVE2): via the Arm KleidiCV integration (OpenCV 4.11+)
+
In case when there is no SIMD extension available during compilation, fallback C++ implementation of intrinsics
will be chosen and code will work as expected although it could be slower.