nishith-fujitsu 8efc0fd47b Merge pull request #28055 from nishith-fujitsu:sve_fastGEMM1t
dnn: add SVE optimized fastGEMM1T function and SVE dispatch #28055

### Pull Request Readiness Checklist

See details at https://github.com/opencv/opencv/wiki/How_to_contribute#making-a-good-pull-request

- [x] I agree to contribute to the project under Apache 2 License.
- [x] To the best of my knowledge, the proposed patch is not based on a code under GPL or another license that is incompatible with OpenCV
- [x] The PR is proposed to the proper branch

**Description**
This PR adds an SVE-vectorized fastGEMM1T kernel for the AArch64 architecture. The kernel is called by the recurrent and fully connected layers and is selected through the SVE dispatch mechanism.
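
Dispatch of this kind generally comes down to a CPU feature check followed by a choice between kernels. The sketch below illustrates that idea only; it is not OpenCV's dispatch code. All names (`DotKernel`, `dot_baseline`, `dot_sve_placeholder`, `cpu_has_sve`, `select_kernel`) are hypothetical, the "SVE" kernel is a scalar placeholder, and the feature check assumes Linux on AArch64, falling back to `false` elsewhere.

```cpp
#include <cassert>
#include <cstddef>
#if defined(__linux__) && defined(__aarch64__)
#include <sys/auxv.h>   // getauxval, AT_HWCAP
#include <asm/hwcap.h>  // HWCAP_SVE
#endif

// Hypothetical kernel signature: dot product of two float arrays of length n.
using DotKernel = float (*)(const float*, const float*, std::size_t);

// Baseline kernel (stands in for the NEON path).
float dot_baseline(const float* a, const float* b, std::size_t n)
{
    assert(n == 0 || (a != nullptr && b != nullptr));
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; ++i)
        sum += a[i] * b[i];
    return sum;
}

// Placeholder for an SVE kernel. In a real build this body would live in a
// translation unit compiled with SVE enabled; here it reuses the scalar code.
float dot_sve_placeholder(const float* a, const float* b, std::size_t n)
{
    return dot_baseline(a, b, n);
}

// Reports whether the running CPU and kernel advertise SVE (Linux/AArch64 only).
bool cpu_has_sve()
{
#if defined(__linux__) && defined(__aarch64__) && defined(HWCAP_SVE)
    return (getauxval(AT_HWCAP) & HWCAP_SVE) != 0;
#else
    return false;
#endif
}

// Picks the kernel once; callers then invoke it through the pointer.
DotKernel select_kernel(bool has_sve)
{
    return has_sve ? dot_sve_placeholder : dot_baseline;
}
```

A caller would typically evaluate `select_kernel(cpu_has_sve())` once and reuse the returned pointer, so the feature check stays out of the hot loop.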

**ARM Compatibility:**
Modified the build scripts and configuration files to ensure compatibility with ARM processors.

**Checklist**

Code changes have been tested on ARM devices (Graviton3).

**Modifications**

- Implemented the fastGEMM1T kernel in SVE using a vector-length-agnostic approach.

- Added flags and checks so that the recurrent and fully connected layers call the ported kernel.

- Changed CMakeLists.txt to dispatch the ported kernel for SVE.

- Added an SVE option to OpenCV's CPU dispatch so that the SVE kernel can be built and dispatched. Following the OpenCV CPU optimization build options (https://github.com/opencv/opencv/wiki/CPU-optimizations-build-options), configure with:

      cmake \
          -DCPU_BASELINE=NEON \
          -DCPU_DISPATCH=SVE \
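
To illustrate the vector-length-agnostic approach mentioned above, here is a minimal sketch and not the code from this PR. It assumes that fastGEMM1T computes one dot product per weight row plus a bias, `dst[i] = dot(vec, weights_row_i) + bias[i]`; the function name `gemm1t_sketch` and its parameter layout are hypothetical. The SVE path is compiled only when the compiler defines `__ARM_FEATURE_SVE`; otherwise a scalar loop with the same result is used.

```cpp
#include <cassert>
#include <cstddef>
#if defined(__ARM_FEATURE_SVE)
#include <arm_sve.h>
#endif

// Hypothetical sketch. Assumed semantics:
//   dst[i] = sum_k vec[k] * weights[i * wstep + k] + bias[i]
// wstep is the distance, in floats, between consecutive weight rows.
void gemm1t_sketch(const float* vec, const float* weights, std::size_t wstep,
                   const float* bias, float* dst, int nvecs, int vecsize)
{
    assert(nvecs >= 0 && vecsize >= 0);
    for (int i = 0; i < nvecs; ++i)
    {
        const float* row = weights + static_cast<std::size_t>(i) * wstep;
#if defined(__ARM_FEATURE_SVE)
        // Vector-length-agnostic loop: svcntw() returns the number of 32-bit
        // lanes of the hardware vector, so no vector width is hard-coded.
        const int step = static_cast<int>(svcntw());
        svfloat32_t acc = svdup_n_f32(0.0f);
        for (int k = 0; k < vecsize; k += step)
        {
            // Predicate is true for lanes with k + lane < vecsize; it masks
            // the tail, so no separate scalar remainder loop is needed.
            svbool_t pg = svwhilelt_b32(k, vecsize);
            svfloat32_t v = svld1_f32(pg, vec + k);
            svfloat32_t w = svld1_f32(pg, row + k);
            // acc += v * w on active lanes; inactive lanes keep their value.
            acc = svmla_f32_m(pg, acc, v, w);
        }
        // Horizontal sum across all lanes of the accumulator.
        float sum = svaddv_f32(svptrue_b32(), acc);
#else
        // Scalar fallback used when SVE is not enabled at compile time.
        float sum = 0.0f;
        for (int k = 0; k < vecsize; ++k)
            sum += vec[k] * row[k];
#endif
        dst[i] = sum + bias[i];
    }
}
```

Because the loop step and the tail predicate both come from the hardware at run time, the same binary works on 128-bit and wider SVE implementations without changes.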

**Performance Improvement**
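- The proposed optimization improves the performance of the LSTM and fully connected layers. In the benchmarks below, the LSTM cases speed up by 1.15x to 1.38x, while the fully connected cases range from 0.93x to 1.18x.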

Name of Test | dnn_neon | dnn_sve | dnn_sve vs dnn_neon (x-factor)
-- | -- | -- | --
lstm::Layer_LSTM::BATCH=1, IN=64, HIDDEN=192, TS=100 | 2.878 | 2.326 | 1.24
lstm::Layer_LSTM::BATCH=1, IN=192, HIDDEN=192, TS=100 | 4.162 | 3.08 | 1.35
lstm::Layer_LSTM::BATCH=1, IN=192, HIDDEN=512, TS=100 | 18.627 | 16.152 | 1.15
lstm::Layer_LSTM::BATCH=1, IN=1024, HIDDEN=192, TS=100 | 10.98 | 7.976 | 1.38
lstm::Layer_LSTM::BATCH=64, IN=64, HIDDEN=192, TS=2 | 4.41 | 3.459 | 1.27
lstm::Layer_LSTM::BATCH=64, IN=192, HIDDEN=192, TS=2 | 6.567 | 4.807 | 1.37
lstm::Layer_LSTM::BATCH=64, IN=192, HIDDEN=512, TS=2 | 28.471 | 22.909 | 1.24
lstm::Layer_LSTM::BATCH=64, IN=1024, HIDDEN=192, TS=2 | 15.491 | 12.537 | 1.24
lstm::Layer_LSTM::BATCH=128, IN=64, HIDDEN=192, TS=2 | 8.848 | 6.821 | 1.30
lstm::Layer_LSTM::BATCH=128, IN=192, HIDDEN=192, TS=2 | 12.969 | 9.522 | 1.36
lstm::Layer_LSTM::BATCH=128, IN=192, HIDDEN=512, TS=2 | 55.52 | 45.746 | 1.21
lstm::Layer_LSTM::BATCH=128, IN=1024, HIDDEN=192, TS=2 | 31.226 | 26.132 | 1.19

Name of Test | dnn_neon | dnn_sve | dnn_sve vs dnn_neon (x-factor)
-- | -- | -- | --
fc::Layer_FullyConnected::([5, 16, 512, 128], 256, false, OCV/CPU) | 5.086 | 4.483 | 1.13
fc::Layer_FullyConnected::([5, 16, 512, 128], 256, true, OCV/CPU) | 8.512 | 8.347 | 1.02
fc::Layer_FullyConnected::([5, 16, 512, 128], 512, false, OCV/CPU) | 9.467 | 8.965 | 1.06
fc::Layer_FullyConnected::([5, 16, 512, 128], 512, true, OCV/CPU) | 14.855 | 13.527 | 1.10
fc::Layer_FullyConnected::([5, 16, 512, 128], 1024, false, OCV/CPU) | 18.821 | 18.023 | 1.04
fc::Layer_FullyConnected::([5, 16, 512, 128], 1024, true, OCV/CPU) | 27.558 | 24.966 | 1.10
fc::Layer_FullyConnected::([5, 512, 384, 0], 256, false, OCV/CPU) | 0.924 | 0.804 | 1.15
fc::Layer_FullyConnected::([5, 512, 384, 0], 256, true, OCV/CPU) | 1.259 | 1.126 | 1.12
fc::Layer_FullyConnected::([5, 512, 384, 0], 512, false, OCV/CPU) | 1.957 | 1.655 | 1.18
fc::Layer_FullyConnected::([5, 512, 384, 0], 512, true, OCV/CPU) | 2.831 | 2.775 | 1.02
fc::Layer_FullyConnected::([5, 512, 384, 0], 1024, false, OCV/CPU) | 5.92 | 6.379 | 0.93
fc::Layer_FullyConnected::([5, 512, 384, 0], 1024, true, OCV/CPU) | 8.924 | 8.993 | 0.99
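
The x-factor column in the tables above is consistent with the dnn_neon time divided by the dnn_sve time, so values above 1 mean the SVE build is faster. The small sketch below reproduces that arithmetic; the helper names `speedup` and `round2` are hypothetical, and the timing unit is not stated in the source.

```cpp
#include <cassert>
#include <cmath>

// x-factor as it appears in the tables: baseline (NEON) time divided by
// optimized (SVE) time. Values above 1.0 indicate the SVE build is faster.
double speedup(double neon_time, double sve_time)
{
    assert(sve_time > 0.0);
    return neon_time / sve_time;
}

// Rounds to two decimal places, matching the precision of the tables.
double round2(double x)
{
    return std::round(x * 100.0) / 100.0;
}
```

For example, the first LSTM row gives `round2(speedup(2.878, 2.326))`, which evaluates to 1.24, and the fully connected row with times 5.92 and 6.379 gives 0.93, a slight slowdown.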
2025-12-03 10:42:28 +03:00