It is easy to focus on the more glamorous aspects of the Earth System Model / High Performance Computing intersection: novel algorithm development, big models, glittering hardware. But one must not forget that the modeling efforts, and thus many aspects of the science, progress through a constant flux of model history output transformed into the graphs, tables and charts that feed day-to-day research. Further, a given line of research requires not one but many simulation runs, related by time series and/or parameter variations. All of this imposes workflow throughput requirements on accomplishing the science.

At the same time, the environments in which these simulations run have undergone enormous (and perhaps some would say catastrophic) growth in size and complexity. This complexity drives myriad interactions between and among system hardware, system software and user applications, producing reactions sometimes subtle and sometimes not. Beyond the loss of individual job executions, these interactions can rob workflows of throughput in ways that often vary over time. And regardless of that time variation, the root causes of throughput slowdowns are typically time consuming and difficult to track down.

This talk will review some previous and current efforts at GFDL to capture and use information generated by the workflow itself. While the current state represents progress over the almost 20 years I have worked with the lab, you will readily see that it encompasses only islands of data capture and analysis. Motivated by this body of work and the sometimes painful lessons learned, I will describe efforts to design and build a much more comprehensive workflow data gathering infrastructure to enable detailed throughput analysis. Of necessity, the infrastructure must be lightweight and non-intrusive, and must deal gracefully with missing data. Further, it must be modular, encapsulated and extensible, since economics dictate that it will be deployed in stages, starting simple and building toward complexity. The end goal is to understand and optimize scientific data production throughput in environments of increasing complexity, for I fear that without such analysis capabilities the ability to run at exascale will do us little good if throughput remains held to petascale levels.
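To make the design requirements above concrete, the sketch below illustrates the general idea of lightweight, non-intrusive workflow instrumentation: wrap an existing stage, record what is available, and never let the logging itself disturb the job. All names here (the environment variables, the event-log path, the capture helper) are hypothetical illustrations, not the GFDL infrastructure described in the talk.

```python
"""Minimal sketch of non-intrusive workflow event capture (illustrative only)."""
import json
import os
import socket
import time
from contextlib import contextmanager

# Hypothetical location for an append-only, one-JSON-record-per-line event log.
LOG_PATH = os.environ.get("WORKFLOW_EVENT_LOG", "workflow_events.jsonl")


def _emit(record):
    """Append one record; never let logging failures break the workflow."""
    try:
        with open(LOG_PATH, "a") as fh:
            fh.write(json.dumps(record) + "\n")
    except OSError:
        pass  # graceful degradation: missing data is tolerated, lost jobs are not


@contextmanager
def capture(stage, **metadata):
    """Wrap a workflow stage (e.g. 'history_transfer', 'postprocess')."""
    start = time.time()
    record = {
        "stage": stage,
        "host": socket.gethostname(),
        "start": start,
        # Optional metadata: absent or None values simply do not appear.
        **{k: v for k, v in metadata.items() if v is not None},
    }
    try:
        yield record
        record["status"] = "ok"
    except Exception as exc:
        record["status"] = "failed"
        record["error"] = repr(exc)
        raise
    finally:
        record["elapsed_s"] = round(time.time() - start, 3)
        _emit(record)


if __name__ == "__main__":
    # Instrument an existing step without changing its logic.
    with capture("example_stage", experiment="demo",
                 jobid=os.environ.get("SLURM_JOB_ID")):
        time.sleep(0.1)  # stand-in for real work
```

The append-only, self-describing record format is one way to start simple and build complexity later: throughput analysis tools can be layered on top of the accumulated records in stages, without changing the instrumented workflows themselves.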