D. Sukkari, H. Ltaief, M. Faverge, D.E. Keyes
IEEE Transactions on Parallel and Distributed Systems, volume 29, issue 2, pp. 312-323, (2017)
Polar decomposition, Asynchronous execution, Dynamic runtime system, Fine-grained execution
This paper introduces the first asynchronous, task-based formulation of the polar decomposition and its corresponding implementation on manycore architectures. Based on a new formulation of the iterative QR dynamically-weighted Halley algorithm (QDWH) for the calculation of the polar decomposition, the proposed implementation replaces the original and hostile LU factorization for the condition number estimator by the more adequate QR factorization to enable software portability across various architectures. Relying on fine-grained computations, the novel task-based implementation is also capable of taking advantage of the identity structure of the matrix involved during the QDWH iterations, which decreases the overall algorithmic complexity. Furthermore, the artifactual synchronization points have been weakened compared to previous implementations, unveiling look-ahead opportunities for better hardware occupancy. The overall QDWH-based polar decomposition can then be represented as a directed acyclic graph (DAG), where nodes represent computational tasks and edges define the inter-task data dependencies. The StarPU dynamic runtime system is employed to traverse the DAG, to track the various data dependencies and to asynchronously schedule the computational tasks on the underlying hardware resources, resulting in an out-of-order task scheduling. Benchmarking experiments show significant improvements against existing state-of-the-art high performance implementations (i.e., Intel MKL and Elemental) for the polar decomposition on latest shared-memory vendors' systems (i.e., Intel Haswell/Broadwell/Knights Landing, NVIDIA K80/P100 GPUs and IBM Power8), while maintaining high numerical accuracy.
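For readers unfamiliar with QDWH, the following is a minimal dense NumPy sketch of the QR-based iteration in its textbook form. It is not the paper's task-based implementation. The function name `qdwh_polar`, the tolerance, and the iteration cap are illustrative assumptions. For brevity the sketch takes the exact extreme singular values from an SVD, whereas a practical code would use cheap estimates (the paper uses a QR-based condition number estimator). It also assumes a real matrix with full column rank.

```python
import numpy as np

def qdwh_polar(A, tol=1e-12, max_iter=50):
    """Polar decomposition A = U @ H via the QR-based QDWH iteration.

    Illustrative sketch: assumes A is real with full column rank.
    """
    A = np.asarray(A, dtype=float)
    m, n = A.shape

    # Assumption: exact extreme singular values are used here for simplicity;
    # production codes replace this SVD with inexpensive estimates.
    s = np.linalg.svd(A, compute_uv=False)
    alpha = s[0]           # scaling so that ||X||_2 <= 1
    l = s[-1] / alpha      # lower bound on the smallest singular value of X
    X = A / alpha
    I = np.eye(n)

    for _ in range(max_iter):
        # Dynamically chosen weights a, b, c from the current lower bound l.
        l2 = l * l
        d = np.cbrt(4.0 * (1.0 - l2) / (l2 * l2))
        sq = np.sqrt(1.0 + d)
        a = sq + 0.5 * np.sqrt(8.0 - 4.0 * d + 8.0 * (2.0 - l2) / (l2 * sq))
        b = (a - 1.0) ** 2 / 4.0
        c = a + b - 1.0

        # QR factorization of the stacked matrix [sqrt(c) X; I].
        # The lower block is the identity, which is the structure the
        # abstract says a fine-grained implementation can exploit.
        Q, _ = np.linalg.qr(np.vstack([np.sqrt(c) * X, I]))
        Q1, Q2 = Q[:m], Q[m:]

        X_new = (b / c) * X + ((a - b / c) / np.sqrt(c)) * (Q1 @ Q2.T)

        # Update the lower bound; clamp at 1 to guard against rounding.
        l = min(1.0, l * (a + b * l2) / (1.0 + c * l2))

        converged = np.linalg.norm(X_new - X, "fro") <= tol
        X = X_new
        if converged:
            break

    U = X
    H = U.T @ A
    H = 0.5 * (H + H.T)    # enforce symmetry of the positive semidefinite factor
    return U, H
```

A quick check is to call `U, H = qdwh_polar(A)` on a small random matrix and confirm that `U.T @ U` is close to the identity and `U @ H` reproduces `A`. In the paper's setting, each QR factorization and matrix product is instead split into tile-level tasks whose data dependencies form the DAG that the runtime schedules.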