Optimizing the Performance of Streaming Numerical Kernels on the IBM Blue Gene/P PowerPC 450 Processor

Streaming Kernels in Scientific Applications

  • A frequent performance bottleneck in scientific codes

  • KAUST applications:

    • Explicit high-order methods for hyperbolic PDEs (Ketcheson)

    • Inverse seismic imaging (Schuster)

    • Navier-Stokes-Korteweg (Calo)

    • Plasmoid simulations (Samtaney)

 

Challenges in Tuning for PowerPC 450 Processor

  • Full utilization of its SIMD unit and dual-issue pipeline is difficult to achieve

  • The in-order PowerPC 450 requires instructions to be reordered ahead of time so that they complete earlier

 

Efficient SIMD Algorithm for the 3-point Stencil

  • The 3-point stencil is a building block of many stencils

  • Fully utilizes SIMD-like capabilities with no wasted instructions
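
As a rough illustration of the "building block" idea, the sketch below composes a 2D 9-point stencil from three 3-point passes, one per contributing row. It is plain scalar Python, not the SIMD algorithm itself; the function names (stencil3, stencil9, stencil9_direct) and the weight layout are assumptions made for this example only.

```python
def stencil3(row, w):
    """Apply a 3-point stencil with weights w = (left, center, right)
    to the interior points of a 1D row."""
    wl, wc, wr = w
    return [wl * row[j - 1] + wc * row[j] + wr * row[j + 1]
            for j in range(1, len(row) - 1)]


def stencil9(grid, weights):
    """9-point 2D stencil built from three 3-point passes per output row.

    weights is a 3x3 nested sequence; weights[d] is applied to row i - 1 + d.
    """
    out = []
    for i in range(1, len(grid) - 1):
        # One 3-point pass over each of the three contributing rows.
        partials = [stencil3(grid[i - 1 + d], weights[d]) for d in range(3)]
        # Sum the three partial results element by element.
        out.append([a + b + c for a, b, c in zip(*partials)])
    return out


def stencil9_direct(grid, weights):
    """Reference version: explicit sum over the 3x3 neighbourhood."""
    rows, cols = len(grid), len(grid[0])
    return [[sum(weights[di][dj] * grid[i - 1 + di][j - 1 + dj]
                 for di in range(3) for dj in range(3))
             for j in range(1, cols - 1)]
            for i in range(1, rows - 1)]
```

The same composition extends to three dimensions: a 27-point stencil can be written as nine 3-point passes over the neighbouring rows of three adjacent planes.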

 

Python Code Synthesis and Modeling Framework

  • Simplifies coding with a faster development-testing loop

  • Automates out-of-order scheduling and cycle-accurate performance modeling
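
To make the scheduling and modeling idea concrete, here is a minimal toy sketch, not the framework itself. It assumes a simplified in-order pipeline that can issue at most one load/store ("ls") and one floating-point ("fp") instruction per cycle. The Instr record, the function names, and the latencies (4 and 5 cycles) are hypothetical values chosen for illustration, not PowerPC 450 figures.

```python
from collections import namedtuple

# Hypothetical instruction record: unit is "ls" (load/store) or "fp".
Instr = namedtuple("Instr", "name unit latency deps")


def simulate_in_order(program):
    """Modeled cycles to finish program on the toy in-order, dual-issue pipeline.

    Each cycle issues at most one "ls" and one "fp" instruction, strictly in
    program order, and only once every dependency has produced its result.
    Dependencies must appear earlier in program than their consumers.
    """
    ready_at = {}  # name -> cycle at which the result becomes available
    cycle, pos = 0, 0
    while pos < len(program):
        used = set()
        while pos < len(program):
            ins = program[pos]
            deps_ready = all(ready_at[d] <= cycle for d in ins.deps)
            if ins.unit in used or not deps_ready:
                break  # in-order: a stalled instruction blocks all later ones
            ready_at[ins.name] = cycle + ins.latency
            used.add(ins.unit)
            pos += 1
        cycle += 1
    return max(ready_at.values(), default=0)


def list_schedule(program):
    """Greedy list scheduling: each cycle, take the first ready instruction
    per unit. Assumes an acyclic dependency graph and latencies >= 1."""
    ready_at, order = {}, []
    pending = list(program)
    cycle = 0
    while pending:
        used = set()
        for ins in list(pending):
            deps_ready = all(ready_at.get(d, float("inf")) <= cycle
                             for d in ins.deps)
            if ins.unit not in used and deps_ready:
                ready_at[ins.name] = cycle + ins.latency
                used.add(ins.unit)
                order.append(ins)
                pending.remove(ins)
        cycle += 1
    return order


# Naive order: every multiply immediately follows the load it depends on.
naive = [
    Instr("ld0", "ls", 4, ()), Instr("mul0", "fp", 5, ("ld0",)),
    Instr("ld1", "ls", 4, ()), Instr("mul1", "fp", 5, ("ld1",)),
    Instr("ld2", "ls", 4, ()), Instr("mul2", "fp", 5, ("ld2",)),
]
reordered = list_schedule(naive)
```

Comparing simulate_in_order(naive) with simulate_in_order(reordered) shows the effect of hoisting the loads ahead of the multiplies: the reordered sequence stalls less and finishes in fewer modeled cycles.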

 

Modeling and Performance Within L1 Cache

  • Accurate modeling of the PowerPC 450 pipeline enables optimization

 

27-point Stencil Results

  • 1.72x speedup over the best published results for large problem sizes [1]

  • 2.16x speedup over optimized C codes for domain sizes that fit in L1 cache

 

References

  1. K. Datta, "Auto-tuning Stencil Codes for Cache-Based Multicore Platforms", PhD Thesis, EECS Department, University of California, Berkeley, December 2009

  2. T. Malas, A. Ahmadia, J. Brown, J. Gunnels, and D. Keyes, "Optimizing the Performance of Streaming Numerical Kernels on the IBM Blue Gene/P PowerPC 450 Processor", conditionally accepted, IJHPCA

Related Publications