Toward Accelerating the Matrix Inversion Computation of Symmetric Positive-Definite Matrices on Heterogenous GPU-Based Systems

​Block vs. Tile Algorithms

Block Algorithm

  • Computation proceeds with two phases:

    • Panel

    • Update

  • Bottlenecks

    • Unnecessary synchronization points preventing lookaheads

    • The fork-join model

 

Tile Algorithm

  • Split the original column-major matrix into tiles using block data layout

  • Use tiles as the fundamental unit of computations

  • Out-of-order execution of tasks

 

Cholesky-based Matrix Invverse Computation​

  1.     Compute the Cholesky factorization

                       A=LLT

    2.      Invert teh Cholesky factor
                       T=L-1
    3.      Form the product of the inverted Cholesky factor with its transpose to get the final inverted matrix
                        TTT=(LLT​)-1=A-1​

StarPU Runtime System

  • Dynamic, out-of-order task scheduling on accelerator-based platforms

  • Ensures data availability and coherency between the memories of different units

Mixing PLASMA and MAGMA with StarPU

  • PLASMA kernels on CPUs, MAGMA kernels on GPUs

  • Scheduling tasks with StarPU

  • Advantage: Programmability and Productivity

Experimental Results

Compared our implementation with state-of-the-art, high performance dense linear algebra software libraries: LAPACK, PLASMA, and MAGMA

Experimental Platform

  • Dual-socket quad-core Intel Xeon (24GB)

  • Enhanced with two Fermi GPUs 2070 (6GB)

​Results

Our high performance implementation achieves almost half a Tflop/s (448 Gflop/s), which corresponds to 5 and 6-fold improvement compared to the equivalent routines from MAGMA adn PLASMA, respectively, and 10-fold improvement compared to LAPACK

References

  1. StarPU Users Guide, A Unified Runtime System for Heterogenous Multicore Architectures (version 0.9.2), INRIA Bordeaux, France, September 2011

  2. E. Agullo, C. Augonnet, J. Dongarra, H. Ltaief, R. Namyst, S. Thibault, and S. Tomov, ``Faster, Cheaper, Better: A Hybridyzation Methodology to Develop Linear Algebra Software for GPUs", in GPU Gems, 34:473--484. (2010)

Related Publications