T. Malas, G. Hager, H. Ltaief, D.E. Keyes
ACM Transactions on Parallel Computing (TOPC), Volume 4, Issue 3, Article No. 12 (2018)
Optimizing the performance of stencil algorithms has been the subject of
intense research over the last two decades. Since many stencil schemes
have low arithmetic intensity, most optimizations focus on increasing
the temporal data access locality, thus reducing the data traffic
through the main memory interface with the ultimate goal of decoupling
from this bottleneck. There are, however, only a few approaches that
explicitly leverage the shared cache feature of modern multicore chips.
If every thread works on its private, separate cache block, the
available cache space can become too small, and sufficient temporal
locality may not be achieved. We propose a flexible multidimensional
intratile parallelization method for stencil algorithms on multicore
CPUs with a shared outer-level cache. This method leads to a significant
reduction in the required cache space without adverse effects from
hardware prefetching or TLB shortage. Our Girih framework
includes an autotuner to select optimal parameter configurations on the
target hardware. We conduct performance experiments on two contemporary
Intel processors and compare with the state-of-the-art stencil
frameworks Pluto and Pochoir, using four corner-case stencil schemes and
a wide range of problem sizes. Girih shows substantial
performance advantages and best arithmetic intensity at almost all
problem sizes, especially on low-intensity stencils with variable
coefficients. We study in detail the performance behavior at varying
grid sizes using phenomenological performance modeling. Our analysis of
energy consumption reveals that our method can save energy through
reduced DRAM bandwidth usage even at a marginal performance gain. It is
thus well suited for future architectures that will be strongly
challenged by the cost of data movement, be it in terms of performance
or energy consumption.
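
The two central arguments of the abstract, that temporal locality raises arithmetic intensity and that sharing a cache block among threads leaves more cache space per block, can be illustrated with a back-of-envelope model. The sketch below is not taken from Girih; the function names and all numbers (8 flops and 24 bytes per grid-point update, a 32 MiB shared cache, 16 threads) are hypothetical values chosen only for illustration, and the model idealizes temporal blocking as dividing main-memory traffic evenly by the number of time steps kept in cache.

```python
# Back-of-envelope model. All names and numbers are illustrative
# assumptions, not measurements or parameters from the paper.


def arithmetic_intensity(flops_per_update, bytes_per_update, time_steps_in_cache=1):
    """Flops per byte of main-memory traffic.

    Idealization: if a block stays in cache for `time_steps_in_cache`
    updates, its memory traffic is paid once for all of those updates.
    """
    if time_steps_in_cache < 1:
        raise ValueError("time_steps_in_cache must be at least 1")
    return flops_per_update * time_steps_in_cache / bytes_per_update


def cache_per_block(shared_cache_bytes, num_threads, threads_per_block):
    """Shared-cache space available to each cache block.

    With one thread per block, every thread needs its own block, so the
    cache is split num_threads ways. With several threads cooperating on
    one block, fewer blocks compete for the same cache.
    """
    if threads_per_block < 1 or num_threads % threads_per_block != 0:
        raise ValueError("threads_per_block must evenly divide num_threads")
    num_blocks = num_threads // threads_per_block
    return shared_cache_bytes // num_blocks


if __name__ == "__main__":
    MIB = 2**20
    flops, traffic = 8, 24  # assumed flops and bytes per grid-point update

    for steps in (1, 2, 4, 8):
        ai = arithmetic_intensity(flops, traffic, steps)
        print(f"time steps in cache: {steps}  intensity: {ai:.2f} flop/byte")

    for group in (1, 4, 16):
        size = cache_per_block(32 * MIB, 16, group)
        print(f"threads per block: {group:2d}  cache per block: {size // MIB} MiB")
```

In this simplified picture, larger thread groups leave more cache per block, which in turn permits more time steps per block and a higher intensity. Real behavior also depends on tile shape, prefetching, and TLB effects, which this sketch ignores.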