docs(core): update Universal Intrinsics for VLA (RVV/SVE) and OpenCV 4.11+ API changes

2026-01-15 12:15:17 +00:00 · 2025-12-27 10:54:06 +09:00
parent a49eb47057
commit 364f21fd24
2 changed files with 42 additions and 26 deletions
--- a/doc/tutorials/core/univ_intrin/univ_intrin.markdown
+++ b/doc/tutorials/core/univ_intrin/univ_intrin.markdown
@@ -7,7 +7,7 @@ Vectorizing your code using Universal Intrinsics {#tutorial_univ_intrin}

 |    |    |
 | -: | :- |
-| Compatibility | OpenCV >= 3.0 |
+| Compatibility | OpenCV >= 4.11 |

 Goal
 ----
@@ -28,19 +28,16 @@ SIMD stands for **Single Instruction, Multiple Data**. SIMD Intrinsics allow the

 Depending on what *Instruction Sets* your CPU supports, you may be able to use the different registers. To learn more, look [here](https://en.wikipedia.org/wiki/Instruction_set_architecture)

+### VLA
+VLA stands for **Vector Length Agnostic**.
+It is a mechanism in which the register width is determined by the hardware at runtime rather than being fixed at compile time.
+This allows a single binary to scale its performance across different CPUs within the same architecture (e.g., RVV or SVE).
+
 Universal Intrinsics
 --------------------

-OpenCVs universal intrinsics provides an abstraction to SIMD vectorization methods and allows the user to use intrinsics without the need to write system specific code.
-
-OpenCV Universal Intrinsics support the following instruction sets:
-* *128 bit* registers of various types support is implemented for a wide range of architectures including
-    * x86(SSE/SSE2/SSE4.2),
-    * ARM(NEON),
-    * PowerPC(VSX),
-    * MIPS(MSA).
-* *256 bit* registers are supported on x86(AVX2) and
-* *512 bit* registers are supported on x86(AVX512)
+OpenCV's universal intrinsics provide an abstraction over SIMD and VLA vectorization methods and allow the user to use intrinsics without writing system-specific code.
+Supported SIMD/VLA technologies are detailed in @ref core_hal_intrin .

 **We will now introduce the available structures and functions:**
 * Register structures
@@ -150,33 +147,35 @@ Now that we know how registers work, let us look at the functions used for filli

 The universal intrinsics set provides element wise binary and unary operations.

+@note Since OpenCV 4.11, C++ operator overloading (e.g., `+`, `*`) in Universal Intrinsics has been deprecated in favor of explicit wrapper functions (e.g., v_add, v_mul) to ensure compatibility with VLA architectures.
+See also: https://github.com/opencv/opencv/issues/27267
+
 * **Arithmetics**: We can add, subtract, multiply and divide two registers element-wise. The registers must be of the same width and hold the same type. To multiply two registers, for example:

        v_float32 a, b;                          // {a1, ..., an}, {b1, ..., bn}
-        v_float32 c;
-        c = a + b                                // {a1 + b1, ..., an + bn}
-        c = a * b;                               // {a1 * b1, ..., an * bn}
+        v_float32 c = v_add(a, b);               // {a1 + b1, ..., an + bn}
+        v_float32 d = v_mul(a, b);               // {a1 * b1, ..., an * bn}

 <br>

-* **Bitwise Logic and Shifts**: We can left shift or right shift the bits of each element of the register. We can also apply bitwise &, |, ^ and ~ operators between two registers element-wise:
+* **Bitwise Logic and Shifts**: We can left shift or right shift the bits of each element of the register. We can also apply bitwise and, or, xor and not operators between two registers element-wise:

        v_int32 as;                              // {a1, ..., an}
-        v_int32 al = as << 2;                    // {a1 << 2, ..., an << 2}
-        v_int32 bl = as >> 2;                    // {a1 >> 2, ..., an >> 2}
+        v_int32 al = v_shl(as, 2);               // {a1 << 2, ..., an << 2}
+        v_int32 bl = v_shr(as, 2);               // {a1 >> 2, ..., an >> 2}

        v_int32 a, b;
-        v_int32 a_and_b = a & b;                 // {a1 & b1, ..., an & bn}
+        v_int32 a_and_b = v_and(a, b);           // {a1 & b1, ..., an & bn}

 <br>

-* **Comparison Operators**: We can compare values between two registers using the <, >, <= , >=, == and != operators. Since each register contains multiple values, we don't get a single bool for these operations. Instead, for true values, all bits are converted to one (0xff for 8 bits, 0xffff for 16 bits, etc), while false values return bits converted to zero.
+* **Comparison Operators**: We can compare values between two registers using the functions v_lt (<), v_gt (>), v_le (<=), v_ge (>=), v_eq (==) and v_ne (!=). Since each register contains multiple values, we don't get a single bool for these operations. Instead, for true values, all bits are set to one (0xff for 8 bits, 0xffff for 16 bits, etc.), while for false values, all bits are set to zero.

        // let us consider the following code is run in a 128-bit register
-        v_uint8 a;                               // a = {0, 1, 2, ..., 15}
-        v_uint8 b;                               // b = {15, 14, 13, ..., 0}
+        v_uint8 a;                               // a = {0, 1, 2, ..., 13, 14, 15}
+        v_uint8 b;                               // b = {15, 14, 13, ..., 2, 1, 0}

-        v_uint8 c = a < b;
+        v_uint8 c = v_lt(a, b);                  // c = {255, 255, 255, ..., 0, 0, 0}

        /*
            let us look at the first 4 values in binary
@@ -192,7 +191,7 @@ The universal intrinsics set provides element wise binary and unary operations.
        v_int32 a;                               // a = {1, 2, 3, 4, 5, 6, 7, 8}
        v_int32 b;                               // b = {8, 7, 6, 5, 4, 3, 2, 1}

-        v_int32 c = (a < b);                     // c = {-1, -1, -1, -1, 0, 0, 0, 0}
+        v_int32 c = v_lt(a, b);                  // c = {-1, -1, -1, -1, 0, 0, 0, 0}

        /*
            The true values are 0xffffffff, which in signed 32-bit integer representation is equal to -1.
--- a/modules/core/include/opencv2/core/hal/intrin_cpp.hpp
+++ b/modules/core/include/opencv2/core/hal/intrin_cpp.hpp
@@ -81,9 +81,26 @@ CV_CPU_OPTIMIZATION_HAL_NAMESPACE_BEGIN

 "Universal intrinsics" is a types and functions set intended to simplify vectorization of code on
 different platforms. Currently a few different SIMD extensions on different architectures are supported.
-128 bit registers of various types support is implemented for a wide range of architectures
-including x86(__SSE/SSE2/SSE4.2__), ARM(__NEON__), PowerPC(__VSX__), MIPS(__MSA__).
-256 bit long registers are supported on x86(__AVX2__) and 512 bit long registers are supported on x86(__AVX512__).
+
+OpenCV Universal Intrinsics support the following instruction sets:
+
+- *128 bit* registers of various types are supported on a wide range of architectures, including
+  - x86(SSE/SSE2/SSE4.2),
+  - ARM(NEON): 64-bit float (64F) requires AArch64,
+  - PowerPC(VSX),
+  - MIPS(MSA),
+  - LoongArch(LSX),
+  - RISC-V(RVV 0.7.1): Fixed-length implementation,
+  - WASM: 64-bit float (64F) is not supported,
+- *256 bit* registers are supported on
+  - x86(AVX2),
+  - LoongArch (LASX),
+- *512 bit* registers are supported on
+  - x86(AVX512),
+- *Vector Length Agnostic (VLA)* registers are supported on
+  - RISC-V(RVV 1.0),
+  - ARM(SVE/SVE2): Powered by Arm KleidiCV integration (OpenCV 4.11+).
+
 In case when there is no SIMD extension available during compilation, fallback C++ implementation of intrinsics
 will be chosen and code will work as expected although it could be slower.
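
The comparison examples in the tutorial diff above describe a mask convention: every lane that compares true has all bits set, and every lane that compares false has all bits cleared. The following is a minimal scalar sketch of that convention in plain C++, with no OpenCV dependency. The names `lanes_lt` and `mask_convention_demo` are hypothetical and are not part of the OpenCV API; the code only models the per-lane result that `v_lt` is described as producing, and says nothing about how any backend implements it.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical scalar model of a lane-wise "less than" comparison.
// Assumption: this mirrors only the mask convention described in the tutorial
// (all bits one for true, all bits zero for false); it is not OpenCV code.
template <typename T, std::size_t N>
std::array<T, N> lanes_lt(const std::array<T, N>& a, const std::array<T, N>& b)
{
    std::array<T, N> mask{};
    for (std::size_t i = 0; i < N; ++i)
    {
        // ~T(0) is computed in int after promotion; the cast narrows it back
        // to an all-ones pattern of the lane type.
        mask[i] = (a[i] < b[i]) ? static_cast<T>(~T(0)) : T(0);
    }
    return mask;
}

// Small self-check reproducing the two cases from the tutorial text.
inline void mask_convention_demo()
{
    // Unsigned 8-bit lanes: true is 0xff, which reads as 255.
    const std::array<std::uint8_t, 4> a8{0, 1, 14, 15};
    const std::array<std::uint8_t, 4> b8{15, 14, 1, 0};
    const auto m8 = lanes_lt(a8, b8);
    assert(m8[0] == 0xff && m8[1] == 0xff && m8[2] == 0 && m8[3] == 0);

    // Signed 32-bit lanes: true is 0xffffffff, which reads as -1.
    const std::array<std::int32_t, 4> a32{1, 2, 7, 8};
    const std::array<std::int32_t, 4> b32{8, 7, 2, 1};
    const auto m32 = lanes_lt(a32, b32);
    assert(m32[0] == -1 && m32[1] == -1 && m32[2] == 0 && m32[3] == 0);
}
```

Calling `mask_convention_demo()` from a test or from `main` exercises both cases. The same bit pattern reads as 255 in an unsigned 8-bit lane and as -1 in a signed 32-bit lane, which is why the two tutorial examples show different-looking values for "true".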