mirror of
https://github.com/zebrajr/opencv.git
synced 2026-01-15 12:15:17 +00:00
docs(core): update Universal Intrinsics for VLA (RVV/SVE) and OpenCV 4.11+ API changes
@@ -7,7 +7,7 @@ Vectorizing your code using Universal Intrinsics {#tutorial_univ_intrin}

 | | |
 | -: | :- |
-| Compatibility | OpenCV >= 3.0 |
+| Compatibility | OpenCV >= 4.11 |

 Goal
 ----
@@ -28,19 +28,16 @@ SIMD stands for **Single Instruction, Multiple Data**. SIMD Intrinsics allow the

 Depending on what *Instruction Sets* your CPU supports, you may be able to use the different registers. To learn more, look [here](https://en.wikipedia.org/wiki/Instruction_set_architecture)

+### VLA
+VLA stands for **Vector Length Agnostic**.
+It is a mechanism in which the register width is determined by the hardware at runtime rather than being fixed at compile time.
+This allows a single binary to scale its performance across different CPUs within the same architecture (e.g., RVV or SVE).
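The vector-length-agnostic idea described in the added VLA section can be illustrated with a small scalar model. This is only a sketch under stated assumptions: `add_arrays` and `add_with_lanes` are hypothetical helpers, not OpenCV functions, and the `lanes` parameter stands in for the register width that real hardware would report at runtime. The point is that the same loop produces the same result whatever lane count it is given.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical model: `lanes` plays the role of the hardware vector length,
// which a VLA target only reveals at run time.
void add_arrays(const float* a, const float* b, float* out,
                std::size_t n, std::size_t lanes)
{
    assert(lanes > 0);
    std::size_t i = 0;
    // Main loop: one "register" worth of elements per iteration.
    for (; i + lanes <= n; i += lanes)
        for (std::size_t k = 0; k < lanes; ++k)  // models a single vector add
            out[i + k] = a[i + k] + b[i + k];
    // Tail loop: leftover elements that do not fill a whole register.
    for (; i < n; ++i)
        out[i] = a[i] + b[i];
}

// Convenience wrapper so results for different lane counts can be compared.
std::vector<float> add_with_lanes(const std::vector<float>& a,
                                  const std::vector<float>& b,
                                  std::size_t lanes)
{
    assert(a.size() == b.size());
    std::vector<float> out(a.size());
    add_arrays(a.data(), b.data(), out.data(), a.size(), lanes);
    return out;
}
```

Calling `add_with_lanes(a, b, 4)` and `add_with_lanes(a, b, 16)` on the same inputs should give identical vectors; only the number of loop iterations differs.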

 Universal Intrinsics
 --------------------

-OpenCVs universal intrinsics provides an abstraction to SIMD vectorization methods and allows the user to use intrinsics without the need to write system specific code.
-OpenCV Universal Intrinsics support the following instruction sets:
-* *128 bit* registers of various types support is implemented for a wide range of architectures including
-* x86(SSE/SSE2/SSE4.2),
-* ARM(NEON),
-* PowerPC(VSX),
-* MIPS(MSA).
-* *256 bit* registers are supported on x86(AVX2) and
-* *512 bit* registers are supported on x86(AVX512)
+OpenCV's universal intrinsics provide an abstraction over SIMD and VLA vectorization methods and allow the user to use intrinsics without writing system-specific code.
+Supported SIMD/VLA technologies are detailed in @ref core_hal_intrin .

 **We will now introduce the available structures and functions:**
 * Register structures
@@ -150,33 +147,35 @@ Now that we know how registers work, let us look at the functions used for filli

 The universal intrinsics set provides element wise binary and unary operations.

+@note Since OpenCV 4.11, C++ operator overloading (e.g., +, *) in Universal Intrinsics has been deprecated in favor of explicit wrapper functions (e.g., v_add, v_mul) to ensure compatibility with VLA architectures.
+See also: https://github.com/opencv/opencv/issues/27267
+
 * **Arithmetics**: We can add, subtract, multiply and divide two registers element-wise. The registers must be of the same width and hold the same type. To multiply two registers, for example:

 v_float32 a, b; // {a1, ..., an}, {b1, ..., bn}
-v_float32 c;
-c = a + b // {a1 + b1, ..., an + bn}
-c = a * b; // {a1 * b1, ..., an * bn}
+v_float32 c = v_add(a, b); // {a1 + b1, ..., an + bn}
+v_float32 d = v_mul(a, b); // {a1 * b1, ..., an * bn}

 <br>

-* **Bitwise Logic and Shifts**: We can left shift or right shift the bits of each element of the register. We can also apply bitwise &, |, ^ and ~ operators between two registers element-wise:
+* **Bitwise Logic and Shifts**: We can left shift or right shift the bits of each element of the register. We can also apply bitwise and, or, xor and not operators between two registers element-wise:

 v_int32 as; // {a1, ..., an}
-v_int32 al = as << 2; // {a1 << 2, ..., an << 2}
-v_int32 bl = as >> 2; // {a1 >> 2, ..., an >> 2}
+v_int32 al = v_shl(as, 2); // {a1 << 2, ..., an << 2}
+v_int32 bl = v_shr(as, 2); // {a1 >> 2, ..., an >> 2}

 v_int32 a, b;
-v_int32 a_and_b = a & b; // {a1 & b1, ..., an & bn}
+v_int32 a_and_b = v_and(a, b); // {a1 & b1, ..., an & bn}

 <br>

-* **Comparison Operators**: We can compare values between two registers using the <, >, <= , >=, == and != operators. Since each register contains multiple values, we don't get a single bool for these operations. Instead, for true values, all bits are converted to one (0xff for 8 bits, 0xffff for 16 bits, etc), while false values return bits converted to zero.
+* **Comparison Operators**: We can compare values between two registers using the v_lt (<), v_gt (>), v_le (<=), v_ge (>=), v_eq (==) and v_ne (!=) functions. Since each register contains multiple values, we don't get a single bool for these operations. Instead, where the comparison is true, all bits of the element are set to one (0xff for 8 bits, 0xffff for 16 bits, etc.), and where it is false, all bits are set to zero.

 // let us consider the following code is run in a 128-bit register
-v_uint8 a; // a = {0, 1, 2, ..., 15}
-v_uint8 b; // b = {15, 14, 13, ..., 0}
+v_uint8 a; // a = {0, 1, 2, ..., 13, 14, 15}
+v_uint8 b; // b = {15, 14, 13, ..., 2, 1, 0}

-v_uint8 c = a < b;
+v_uint8 c = v_lt(a, b); // c = {255, 255, 255, ..., 0, 0, 0}

 /*
 let us look at the first 4 values in binary
@@ -192,7 +191,7 @@ The universal intrinsics set provides element wise binary and unary operations.
 v_int32 a; // a = {1, 2, 3, 4, 5, 6, 7, 8}
 v_int32 b; // b = {8, 7, 6, 5, 4, 3, 2, 1}

-v_int32 c = (a < b); // c = {-1, -1, -1, -1, 0, 0, 0, 0}
+v_int32 c = v_lt(a, b); // c = {-1, -1, -1, -1, 0, 0, 0, 0}

 /*
 The true values are 0xffffffff, which in signed 32-bit integer representation is equal to -1.
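The mask semantics of the comparison examples above can be reproduced with a plain C++ scalar model. This is a sketch, not OpenCV code: `lt_mask_u8` and `lt_mask_s32` are hypothetical helpers that mimic what a lane-wise "less than" returns for 16 unsigned 8-bit lanes and 8 signed 32-bit lanes, respectively.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Scalar model of a lane-wise "less than" on 16 unsigned 8-bit lanes.
// Hypothetical helper; it only imitates the mask convention described above.
std::array<std::uint8_t, 16> lt_mask_u8(const std::array<std::uint8_t, 16>& a,
                                        const std::array<std::uint8_t, 16>& b)
{
    std::array<std::uint8_t, 16> c{};
    for (std::size_t i = 0; i < c.size(); ++i)
        c[i] = (a[i] < b[i]) ? 0xFF : 0x00;  // all bits set where true
    return c;
}

// Same model for 8 signed 32-bit lanes: 0xffffffff reads back as -1.
std::array<std::int32_t, 8> lt_mask_s32(const std::array<std::int32_t, 8>& a,
                                        const std::array<std::int32_t, 8>& b)
{
    std::array<std::int32_t, 8> c{};
    for (std::size_t i = 0; i < c.size(); ++i)
        c[i] = (a[i] < b[i]) ? -1 : 0;  // -1 has every bit set in two's complement
    return c;
}
```

With `a = {0, ..., 15}` and `b = {15, ..., 0}` the 8-bit model yields 255 in the first eight lanes and 0 in the rest; with `a = {1, ..., 8}` and `b = {8, ..., 1}` the 32-bit model yields `{-1, -1, -1, -1, 0, 0, 0, 0}`, matching the comments in the diff.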
@@ -81,9 +81,26 @@ CV_CPU_OPTIMIZATION_HAL_NAMESPACE_BEGIN

 "Universal intrinsics" is a types and functions set intended to simplify vectorization of code on
 different platforms. Currently a few different SIMD extensions on different architectures are supported.
-128 bit registers of various types support is implemented for a wide range of architectures
-including x86(__SSE/SSE2/SSE4.2__), ARM(__NEON__), PowerPC(__VSX__), MIPS(__MSA__).
-256 bit long registers are supported on x86(__AVX2__) and 512 bit long registers are supported on x86(__AVX512__).
+
+OpenCV Universal Intrinsics support the following instruction sets:
+
+- *128 bit* registers of various types support is implemented for a wide range of architectures including
+- x86(SSE/SSE2/SSE4.2),
+- ARM(NEON): 64-bit float (64F) requires AArch64,
+- PowerPC(VSX),
+- MIPS(MSA),
+- LoongArch(LSX),
+- RISC-V(RVV 0.7.1): Fixed-length implementation,
+- WASM: 64-bit float (64F) is not supported,
+- *256 bit* registers are supported on
+- x86(AVX2),
+- LoongArch (LASX),
+- *512 bit* registers are supported on
+- x86(AVX512),
+- *Vector Length Agnostic (VLA)* registers are supported on
+- RISC-V(RVV 1.0)
+- ARM(SVE/SVE2): Powered by Arm KleidiCV integration (OpenCV 4.11+),
+
 In case when there is no SIMD extension available during compilation, fallback C++ implementation of intrinsics
 will be chosen and code will work as expected although it could be slower.

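The compile-time fallback mentioned in the last context lines can be sketched without OpenCV. The example below is an illustration only, assuming an x86 compiler that defines `__SSE2__` when SSE2 is enabled; `add_f32` and `add_f32_backend` are hypothetical names and do not reflect how OpenCV implements its own dispatch. When the macro is absent, only the plain C++ loop is compiled, so the function still works, just without vector instructions.

```cpp
#include <cstddef>
#if defined(__SSE2__)
#include <emmintrin.h>  // SSE/SSE2 intrinsics, only pulled in when available
#endif

// Adds two float arrays element-wise.
// The vector path is selected at compile time; the scalar loop is the fallback
// and also handles the tail when the vector path is active.
void add_f32(const float* a, const float* b, float* out, std::size_t n)
{
    std::size_t i = 0;
#if defined(__SSE2__)
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);   // unaligned load of 4 floats
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(out + i, _mm_add_ps(va, vb));
    }
#endif
    for (; i < n; ++i)
        out[i] = a[i] + b[i];
}

// Reports which path the compiler selected for this build.
const char* add_f32_backend()
{
#if defined(__SSE2__)
    return "sse2";
#else
    return "scalar";
#endif
}
```

Both builds return the same sums; only the speed and the string returned by `add_f32_backend()` differ.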