A. Abdelfattah, D. Keyes, and H. Ltaief. In: I. Caragiannis, M. Alexander, R. Badia, M. Cannataro, A. Costan, M. Danelutto, F. Desprez, B. Krammer, J. Sahuquillo, S. Scott, and J. Weidendorfer (eds.),
Lecture Notes in Computer Science, vol. 7640, Springer, pp. 207-216, 2013.
Keywords: Matrix-Vector Multiplication, GPU Optimizations, Memory-Bound Operations, Hessenberg Reduction, Bidiagonal Reduction
The use of GPUs has been very beneficial in accelerating dense linear
algebra (DLA) computational kernels. Many high performance numerical
libraries like CUBLAS, MAGMA, and CULA provide BLAS and LAPACK
implementations on GPUs, as well as hybrid computations involving both
CPUs and GPUs. GPUs usually achieve higher performance than CPUs for
compute-bound operations, especially those characterized by a regular
data access pattern. This paper highlights a systematic approach for
efficiently implementing memory-bound DLA kernels on GPUs by taking
advantage of the underlying device's architecture (e.g., high
throughput). In recent work (Abdelfattah et al., VECPAR 2012), this
methodology was shown to outperform existing state-of-the-art GPU
implementations of the symmetric matrix-vector multiplication (SYMV)
kernel, which is characterized by an irregular data access pattern. We propose
to extend this methodology to the general matrix-vector multiplication
(GEMV) kernel. The performance results show that our GEMV implementation
achieves better performance for relatively small to medium matrix
sizes, making it particularly valuable for the Hessenberg and
bidiagonal reductions of general matrices (radar applications), which
are the first steps toward computing eigenvalues and singular values,
respectively. For small and medium matrix sizes (dimension ≤ 4500), our
GEMV kernel achieves an average 60% improvement in single precision (SP)
and an average 25% improvement in double precision (DP) over existing
open-source and commercial software solutions. These gains carry over to
the reduction algorithms for both small and large matrices. The improved
GEMV performance yields an average 30% (SP) and 15% (DP) improvement in the
Hessenberg reduction and up to 25% (SP) and 14% (DP) improvement in the
bidiagonal reduction over the implementation provided by CUBLAS 5.0.
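
For context, the CUBLAS baseline referenced above is the standard cublas<t>gemv routine. The following minimal sketch (assuming a CUDA-capable device and the cuBLAS v2 C API; the problem size n and file layout are illustrative assumptions, not values from the paper) shows how a single-precision GEMV, y = alpha*A*x + beta*y, is invoked on the GPU. An optimized GEMV kernel such as the one described in the abstract targets this same operation.

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main(void) {
    /* Illustrative size only; the paper's gains are reported for dimensions <= 4500. */
    const int n = 1024;
    const float alpha = 1.0f, beta = 0.0f;

    /* Host buffers: A is n-by-n in column-major order (cuBLAS convention). */
    float *A = (float*)malloc((size_t)n * n * sizeof(float));
    float *x = (float*)malloc(n * sizeof(float));
    float *y = (float*)malloc(n * sizeof(float));
    for (int i = 0; i < n * n; ++i) A[i] = 1.0f;
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 0.0f; }

    /* Device buffers and host-to-device copies. */
    float *dA, *dx, *dy;
    cudaMalloc((void**)&dA, (size_t)n * n * sizeof(float));
    cudaMalloc((void**)&dx, n * sizeof(float));
    cudaMalloc((void**)&dy, n * sizeof(float));
    cudaMemcpy(dA, A, (size_t)n * n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dx, x, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dy, y, n * sizeof(float), cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);

    /* y = alpha * A * x + beta * y, no transpose, leading dimension n. */
    cublasSgemv(handle, CUBLAS_OP_N, n, n, &alpha, dA, n, dx, 1, &beta, dy, 1);

    cudaMemcpy(y, dy, n * sizeof(float), cudaMemcpyDeviceToHost);
    printf("y[0] = %f (expected %d)\n", y[0], n);

    cublasDestroy(handle);
    cudaFree(dA); cudaFree(dx); cudaFree(dy);
    free(A); free(x); free(y);
    return 0;
}

A file like this (hypothetical name gemv_example.cu) compiles with, e.g., nvcc gemv_example.cu -lcublas.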