T. Malas, A. J. Ahmadia, J. Brown, J. A. Gunnels, and D. E. Keyes
International Journal of High Performance Computing Applications, 27, pp. 193-209, (2013)
High Performance Computing, Performance Optimization, Code Generation, SIMD, Blue Gene/P
Several emerging petascale architectures use energy-efficient processors with
vectorized computational units and in-order thread processing. On these
architectures the sustained performance of streaming numerical kernels,
ubiquitous in the solution of partial differential equations, represents a
challenge despite the regularity of memory access. Sophisticated optimization
techniques are required to fully utilize the Central Processing Unit (CPU).
We propose a new method for constructing streaming numerical kernels using a
high-level assembly synthesis and optimization framework. We describe an
implementation of this method in Python targeting the IBM Blue Gene/P
supercomputer's PowerPC 450 core. This paper details the high-level design,
construction, simulation, verification, and analysis of these kernels utilizing
a subset of the CPU's instruction set.
We demonstrate the effectiveness of our approach by implementing several
three-dimensional stencil kernels over a variety of cached memory scenarios and
analyzing the mechanically scheduled variants, including a 27-point stencil
achieving a 1.7x speedup over the best previously published results.