A. Abdelfattah, H. Ltaief, D. Keyes, J. Dongarra
Concurrency and Computation: Practice and Experience 28, 3447-3465 (2016)
Simulations of many multi-component PDE-based applications, such as
petroleum reservoirs or reacting flows, are dominated by the solution,
on each time step and within each Newton step, of large sparse linear
systems. The standard solver is a preconditioned Krylov method. Along
with application of the preconditioner, memory-bound Sparse
Matrix-Vector Multiplication (SpMV) is the most time-consuming operation
in such solvers. Multi-species models produce Jacobians with a dense
block structure, where the block size can be as large as a few dozen.
Failing to exploit this dense block structure vastly underutilizes
hardware capable of delivering high performance on dense BLAS
operations. This paper presents a GPU-accelerated SpMV kernel for
block-sparse matrices. Dense matrix-vector multiplications within the
sparse-block structure leverage optimization techniques from the KBLAS
library, a high-performance library for dense BLAS kernels whose design
ideas carry over naturally to block-sparse matrices. Furthermore, a
technique is proposed to balance the workload among thread blocks when
there are large variations in the lengths of nonzero rows. Multi-GPU
performance is highlighted. The proposed SpMV kernel outperforms
existing state-of-the-art implementations on matrices whose sparsity
structures arise from different real applications. Copyright © 2016 John Wiley
& Sons, Ltd.
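
To make the block-sparse layout concrete, the following is a minimal CUDA sketch of an SpMV over a BSR-style matrix (dense b x b blocks stored within a compressed-sparse-row skeleton), assigning one thread block per block row and one thread per row inside each dense block. This illustrates only the general idea, not the kernel described in the paper: it omits the KBLAS-derived optimizations and the workload-balancing scheme mentioned in the abstract. All names (bsr_spmv, row_ptr, col_ind, val) are illustrative.

```cuda
// Minimal BSR SpMV sketch: one thread block per block row,
// one thread per row within each dense b x b block.
#include <cstdio>
#include <cstring>
#include <cuda_runtime.h>

__global__ void bsr_spmv(int b,                 // dense block dimension
                         const int *row_ptr,    // block-row pointers, length nbrows+1
                         const int *col_ind,    // block-column indices, length nnzb
                         const double *val,     // blocks, row-major, length nnzb*b*b
                         const double *x,       // input vector
                         double *y)             // output vector
{
    int brow = blockIdx.x;   // this thread block owns block row 'brow'
    int r    = threadIdx.x;  // this thread owns row 'r' of that block row
    if (r >= b) return;

    double sum = 0.0;
    for (int k = row_ptr[brow]; k < row_ptr[brow + 1]; ++k) {
        const double *blk = val + (size_t)k * b * b;  // k-th dense block
        const double *xs  = x + col_ind[k] * b;       // matching slice of x
        for (int c = 0; c < b; ++c)
            sum += blk[r * b + c] * xs[c];            // dense row times vector slice
    }
    y[brow * b + r] = sum;
}

int main() {
    // Tiny example: 2 block rows, block size 2, three nonzero 2x2 blocks.
    const int b = 2, nbrows = 2;
    int h_row_ptr[] = {0, 2, 3};         // block row 0 has 2 blocks, row 1 has 1
    int h_col_ind[] = {0, 1, 1};
    double h_val[]  = {1, 2, 3, 4,       // block (0,0)
                       5, 6, 7, 8,       // block (0,1)
                       9, 10, 11, 12};   // block (1,1)
    double h_x[] = {1, 1, 1, 1};

    int *row_ptr, *col_ind; double *val, *x, *y;
    cudaMallocManaged(&row_ptr, sizeof(h_row_ptr));
    cudaMallocManaged(&col_ind, sizeof(h_col_ind));
    cudaMallocManaged(&val, sizeof(h_val));
    cudaMallocManaged(&x, sizeof(h_x));
    cudaMallocManaged(&y, nbrows * b * sizeof(double));
    memcpy(row_ptr, h_row_ptr, sizeof(h_row_ptr));
    memcpy(col_ind, h_col_ind, sizeof(h_col_ind));
    memcpy(val, h_val, sizeof(h_val));
    memcpy(x, h_x, sizeof(h_x));

    bsr_spmv<<<nbrows, b>>>(b, row_ptr, col_ind, val, x, y);
    cudaDeviceSynchronize();
    for (int i = 0; i < nbrows * b; ++i) printf("y[%d] = %g\n", i, y[i]);
    return 0;
}
```

With a workload-balancing scheme of the kind the abstract refers to, a block row with many nonzero blocks would typically be split across several thread blocks, each computing a partial sum that is then combined (for example, with atomic additions or a second reduction pass); the sketch above keeps the simpler one-block-row-per-thread-block mapping and is not the paper's specific balancing technique.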