commit 1ed746df45 (2017-03-28 11:54:09 -07:00)
Author: Aapo Kyrola

BatchMatMulOp: use cuBLAS batched strided gemm for CUDA
Summary:
Instead of doing the gemms in a for-loop (which is not parallelized), it is much better to do the batched matmuls using CUDA 8's new strided batched version of gemm.
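
As an illustration, here is a minimal sketch of the change (not the actual BatchMatMulOp code: the function names, handle setup, and row-major layout are assumptions for the example, and error checking is omitted):

#include <cublas_v2.h>

// Before: one cublasSgemm per batch item, launched serially from the host.
// A, B, C hold `batch` contiguous row-major matrices of sizes m x k, k x n,
// and m x n respectively.
void gemm_loop(cublasHandle_t handle, int batch, int m, int n, int k,
               const float* A, const float* B, float* C) {
  const float alpha = 1.0f, beta = 0.0f;
  for (int i = 0; i < batch; ++i) {
    // cuBLAS is column-major, so compute C^T = B^T * A^T to get row-major C.
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, m, k, &alpha,
                B + (size_t)i * k * n, n,
                A + (size_t)i * m * k, k, &beta,
                C + (size_t)i * m * n, n);
  }
}

// After: a single CUDA 8 strided batched call covering every batch item;
// the stride arguments give the element offset between consecutive matrices.
void gemm_strided_batched(cublasHandle_t handle, int batch, int m, int n,
                          int k, const float* A, const float* B, float* C) {
  const float alpha = 1.0f, beta = 0.0f;
  cublasSgemmStridedBatched(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, m, k, &alpha,
                            B, n, (long long)k * n,
                            A, k, (long long)m * k, &beta,
                            C, n, (long long)m * n, batch);
}

The single strided batched call hands all batch items to cuBLAS at once, so it can schedule them together instead of serializing one kernel launch per matrix.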

With the MT team's test, we get a 5-10% improvement in overall walltime, which is significant:

----

Without batched gemm:

I0328 10:46:48.118605 58068 prof_dag_net.cc:136]    424.757 ms/iter (   283.878 ms/iter) RecurrentNetwork
I0328 10:46:48.118609 58068 prof_dag_net.cc:136]    352.603 ms/iter (    265.85 ms/iter) RecurrentNetworkGradient

With batched gemm:
I0328 10:53:48.169996 85617 prof_dag_net.cc:136]    407.438 ms/iter (   269.564 ms/iter) RecurrentNetwork
I0328 10:53:48.169999 85617 prof_dag_net.cc:136]    322.393 ms/iter (   287.625 ms/iter) RecurrentNetworkGradient

Reviewed By: jamesr66a

Differential Revision: D4788272

fbshipit-source-id: 210e8b94c1e036b6ef0f039ce000d455258651f4