Block vs. Tile Algorithms
Block Algorithm
Tile Algorithm
Split the original column-major matrix into tiles using block data layout
Use tiles as the fundamental unit of computations
Out-of-order execution of tasks
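The layout change above can be sketched in NumPy. This is a minimal illustration, not the PLASMA conversion routine: the helper names to_tiles and from_tiles, the tile size nb, and the requirement that n be a multiple of nb are all assumptions made for the example.

```python
import numpy as np

def to_tiles(a, nb):
    """Copy an n-by-n matrix into an (nt, nt, nb, nb) array of tiles.

    Hypothetical helper: each tiles[i, j] is one nb-by-nb tile stored
    contiguously, so a kernel can work on it as a single unit.
    """
    n = a.shape[0]
    assert a.shape == (n, n) and n % nb == 0, "sketch assumes n is a multiple of nb"
    nt = n // nb
    tiles = np.empty((nt, nt, nb, nb), dtype=a.dtype)
    for i in range(nt):
        for j in range(nt):
            tiles[i, j] = a[i * nb:(i + 1) * nb, j * nb:(j + 1) * nb]
    return tiles

def from_tiles(tiles):
    """Inverse of to_tiles: reassemble the full matrix from its tiles."""
    nt, _, nb, _ = tiles.shape
    # Reorder axes to (tile row, row in tile, tile col, col in tile).
    return tiles.transpose(0, 2, 1, 3).reshape(nt * nb, nt * nb)

# Column-major (Fortran-order) input, as in LAPACK-style storage.
a = np.asfortranarray(np.arange(36.0).reshape(6, 6))
tiles = to_tiles(a, nb=2)
```

In the column-major original, the entries of one tile are spread over nb separate columns; after conversion each tile occupies one contiguous chunk of memory.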
Cholesky-based Matrix Inverse Computation
1. Compute the Cholesky factorization
A = L L^T
2. Invert the Cholesky factor
T = L^{-1}
3. Form the product of the inverted Cholesky factor with its transpose to get the final inverted matrix
T^T T = (L L^T)^{-1} = A^{-1}
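The three steps can be checked on a small symmetric positive definite matrix. This is a minimal NumPy sketch, not the tiled implementation: the function name cholesky_inverse and the test matrix are assumptions. Step 2 uses a general solve for brevity; LAPACK provides dedicated routines for the three steps (potrf, trtri, lauum) that exploit the triangular structure.

```python
import numpy as np

def cholesky_inverse(a):
    """Invert a symmetric positive definite matrix via its Cholesky factor."""
    n = a.shape[0]
    # Step 1: A = L L^T, with L lower triangular.
    l = np.linalg.cholesky(a)
    # Step 2: T = L^{-1}, obtained by solving L T = I.
    t = np.linalg.solve(l, np.eye(n))
    # Step 3: A^{-1} = (L L^T)^{-1} = L^{-T} L^{-1} = T^T T.
    return t.T @ t

# Illustrative input: M M^T + 5 I is symmetric positive definite.
rng = np.random.default_rng(0)
m = rng.standard_normal((5, 5))
a = m @ m.T + 5.0 * np.eye(5)
a_inv = cholesky_inverse(a)
print(np.allclose(a @ a_inv, np.eye(5)))
```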

StarPU Runtime System
Dynamic, out-of-order task scheduling on accelerator-based platforms
Ensures data availability and coherency between the memories of different units
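To illustrate dynamic, out-of-order scheduling, the sketch below submits the tasks of a tile Cholesky factorization in sequential order, infers dependencies from the tiles each task reads and writes, and then runs any ready task in a random order. ToyRuntime is a hypothetical stand-in written for this example and is not the StarPU API; it runs serially and ignores data transfers between memories.

```python
import random
from functools import partial
import numpy as np

class ToyRuntime:
    """Hypothetical task runtime: dependencies inferred from data accesses."""
    def __init__(self):
        self.tasks, self.deps = [], []
        self.last_writer, self.readers = {}, {}

    def submit(self, func, reads=(), writes=()):
        k, deps = len(self.tasks), set()
        for h in list(reads) + list(writes):      # writes are read-write
            if h in self.last_writer:
                deps.add(self.last_writer[h])     # wait for the last writer
        for h in writes:
            deps.update(self.readers.get(h, ()))  # wait for earlier readers
            self.last_writer[h], self.readers[h] = k, []
        for h in reads:
            self.readers.setdefault(h, []).append(k)
        self.tasks.append(func)
        self.deps.append(deps)

    def run(self, seed=0):
        rng, done, order = random.Random(seed), set(), []
        while len(done) < len(self.tasks):
            ready = [k for k in range(len(self.tasks))
                     if k not in done and self.deps[k] <= done]
            k = rng.choice(ready)                 # any ready task may run next
            self.tasks[k]()
            done.add(k)
            order.append(k)
        return order

# Tile kernels, each operating in place on the tile array t[i, j].
def potrf(t, k):
    t[k, k] = np.linalg.cholesky(t[k, k])

def trsm(t, i, k):
    t[i, k] = np.linalg.solve(t[k, k], t[i, k].T).T   # A_ik L_kk^{-T}

def syrk(t, i, k):
    t[i, i] -= t[i, k] @ t[i, k].T

def gemm(t, i, j, k):
    t[i, j] -= t[i, k] @ t[j, k].T

def tile_cholesky(a, nb, seed=0):
    n = a.shape[0]
    nt = n // nb
    t = a.reshape(nt, nb, nt, nb).transpose(0, 2, 1, 3).copy()
    rt = ToyRuntime()
    for k in range(nt):
        rt.submit(partial(potrf, t, k), writes=[(k, k)])
        for i in range(k + 1, nt):
            rt.submit(partial(trsm, t, i, k), reads=[(k, k)], writes=[(i, k)])
        for i in range(k + 1, nt):
            rt.submit(partial(syrk, t, i, k), reads=[(i, k)], writes=[(i, i)])
            for j in range(k + 1, i):
                rt.submit(partial(gemm, t, i, j, k),
                          reads=[(i, k), (j, k)], writes=[(i, j)])
    order = rt.run(seed)
    return np.tril(t.transpose(0, 2, 1, 3).reshape(n, n)), order
```

Different seeds execute the tasks in different orders, but every order respects the inferred dependencies, so the computed factor is the same.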

Mixing PLASMA and MAGMA with StarPU
PLASMA kernels on CPUs, MAGMA kernels on GPUs
Scheduling tasks with StarPU
Advantage: Programmability and Productivity
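The idea of one task type with a CPU implementation and a GPU implementation can be sketched as follows. Everything here is hypothetical: potrf_cpu and potrf_gpu merely stand in for a PLASMA and a MAGMA kernel (both run on the CPU in this sketch), ASSUMED_COST holds made-up relative timings, and the greedy earliest-finish rule is a simple illustrative policy, not a description of StarPU's schedulers.

```python
import numpy as np

def potrf_cpu(tile):
    # Stand-in for a CPU kernel (PLASMA in the text above).
    return np.linalg.cholesky(tile)

def potrf_gpu(tile):
    # Stand-in for a GPU kernel (MAGMA in the text above); runs on the CPU here.
    return np.linalg.cholesky(tile)

# One task type, two implementations keyed by worker kind.
POTRF = {"cpu": potrf_cpu, "gpu": potrf_gpu}

# Made-up relative execution times per worker kind (assumption).
ASSUMED_COST = {"cpu": 3.5, "gpu": 1.0}

def schedule(codelet, inputs, workers):
    """Greedy placement: give each task to the worker that would finish it first.

    workers is a list of (name, kind) pairs; returns (results, placement).
    """
    free_at = {name: 0.0 for name, _ in workers}
    results, placement = [], []
    for x in inputs:
        name, kind = min(workers,
                         key=lambda w: free_at[w[0]] + ASSUMED_COST[w[1]])
        free_at[name] += ASSUMED_COST[kind]
        results.append(codelet[kind](x))   # same call site, either device
        placement.append(name)
    return results, placement

tiles = [np.eye(3) * (i + 2.0) for i in range(6)]
results, placement = schedule(POTRF, tiles, [("cpu0", "cpu"), ("gpu0", "gpu")])
```

The caller only submits tasks; which implementation runs is decided by the scheduler, which is where the programmability benefit comes from.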
Experimental Results
Compared our implementation against state-of-the-art, high-performance dense linear algebra libraries: LAPACK, PLASMA, and MAGMA
Experimental Platform
Results
Our high-performance implementation achieves almost half a Tflop/s (448 Gflop/s), a 5-fold and a 6-fold improvement over the equivalent routines from MAGMA and PLASMA, respectively, and a 10-fold improvement over LAPACK

References
StarPU User's Guide: A Unified Runtime System for Heterogeneous Multicore Architectures (version 0.9.2), INRIA Bordeaux, France, September 2011
E. Agullo, C. Augonnet, J. Dongarra, H. Ltaief, R. Namyst, S. Thibault, and S. Tomov, "Faster, Cheaper, Better: A Hybridization Methodology to Develop Linear Algebra Software for GPUs", in GPU Gems, 34:473-484 (2010)