A. Abdelfattah, J. Dongarra, D. Keyes, and H. Ltaief, in High Performance Computing for Computational Science - VECPAR 2012, M. Daydé, O. Marques, and K. Nakajima, eds., vol. 7851 of Lecture Notes in Computer Science, Springer, 2013, pp. 72-79.
Hardware accelerators are becoming ubiquitous in high performance
scientific computing. They are capable of delivering an unprecedented
level of concurrent execution contexts. High-level programming language
extensions (e.g., CUDA) and profiling tools (e.g., PAPI-CUDA, CUDA
Profiler) are paramount to improving productivity while effectively
exploiting the underlying hardware. We present an optimized numerical
kernel for computing the symmetric matrix-vector product (SYMV) on
NVIDIA Fermi GPUs. Because of its inherently memory-bound nature, this
kernel is critical to the tridiagonalization of a symmetric dense
matrix, a preprocessing step in computing the eigenpairs. Using a novel
design that addresses the irregular memory accesses by hiding latency
and increasing bandwidth, our preliminary asymptotic results show 3.5x
and 2.5x speedups over the equivalent CUBLAS 4.0 kernel, and 7-8% and
30% improvements over the Matrix Algebra on GPU and Multicore
Architectures (MAGMA) library, in single and double precision
arithmetic, respectively.
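
For concreteness, the kernel in question implements the BLAS-2 operation
y = alpha*A*x + beta*y with A symmetric. Exploiting symmetry, SYMV performs
roughly 2n^2 floating-point operations while touching only about n^2/2
distinct matrix elements, so its performance is limited by memory bandwidth
rather than compute throughput. The minimal sketch below is not the paper's
optimized kernel; it shows how the CUBLAS baseline referenced above is
invoked, assuming the cuBLAS v2 interface rather than the CUBLAS 4.0 release
used in the paper. The matrix order n and the lower-triangular storage
choice are illustrative.

/* Baseline DSYMV call via the cuBLAS v2 API: y = alpha*A*x + beta*y,
 * with only the lower triangle of the symmetric matrix A referenced.
 * Error checking omitted for brevity. */
#include <cublas_v2.h>
#include <cuda_runtime.h>

int main(void) {
    const int n = 1024;                  /* illustrative matrix order */
    const double alpha = 1.0, beta = 0.0;
    double *dA, *dx, *dy;

    cudaMalloc((void **)&dA, (size_t)n * n * sizeof(double));
    cudaMalloc((void **)&dx, (size_t)n * sizeof(double));
    cudaMalloc((void **)&dy, (size_t)n * sizeof(double));
    /* ... fill dA (symmetric, lower triangle) and dx on the device ... */

    cublasHandle_t handle;
    cublasCreate(&handle);

    /* Symmetric matrix-vector product; only the lower triangle of A
       is required, which is the source of the kernel's low flop-to-byte
       ratio and hence its memory-bound behavior. */
    cublasDsymv(handle, CUBLAS_FILL_MODE_LOWER, n,
                &alpha, dA, n, dx, 1, &beta, dy, 1);

    cudaDeviceSynchronize();
    cublasDestroy(handle);
    cudaFree(dA); cudaFree(dx); cudaFree(dy);
    return 0;
}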