Dealing with unreliable computing platforms at extreme scale


High Performance Computing
SC4I/Digitization, Innovation, and Competitiveness of the Production System
Luc Giraud
INRIA (Inria Bordeaux - Sud-Ouest)
Wednesday 23rd January 2019
Aula Consiglio VII Piano - Edificio 14, Dipartimento di Matematica POLITECNICO DI MILANO
The advent of extreme-scale computing platforms will require the use of parallel resources at an unprecedented scale. On the technological side, the continuous shrinking of transistor geometry and the increasing complexity of these devices dramatically increase their sensitivity to natural radiation, leading to a high rate of hardware faults and thus diminishing their reliability. Fully handling these faults at the computer system level may have a prohibitive computational and energy cost. High performance computing applications that aim to exploit all these resources will thus need to be resilient. In this talk, we will first give an overview of the current trends towards exascale. We will discuss the new challenges in terms of platform reliability and the associated variety of possible faults. We will then review some of the solutions that have been proposed to tackle these errors, before discussing in more detail some contributions in sparse numerical linear algebra. First, in the context of computing node crashes, we will discuss possible remedies in the framework of linear system or eigenproblem solutions, which are the innermost numerical kernels in many scientific and engineering applications and also among the most time-consuming parts. Second, we will discuss a somewhat more challenging problem related to silent transient soft errors, which are produced by natural radiation and consist of a bit-flip in a memory cell that produces unexpected results at the application level. In that context we will consider the conjugate gradient (CG) method, which is the most widely used iterative scheme for the solution of large sparse systems of linear equations when the matrix is symmetric positive definite. We will investigate through extensive numerical experiments the sensitivity of CG to bit-flips and further discuss possible numerical criteria to detect the occurrence of such faults.
The above-mentioned research activities have been conducted in collaboration with many colleagues, including E. Agullo (Inria), S. Cools (University of Antwerpen), E. Fatih-Yetkin (Kadir Has University), P. Salas (CERFACS), W. Vanroose (University of Antwerpen) and M. Zounon (NAG).
Luc Giraud received his PhD from the Institut National Polytechnique de Toulouse in 1991. He then joined CERFACS as a post-doc, later becoming senior researcher and deputy project leader. In 2005, he left for a full professor position in applied mathematics at the engineering school ENSEEIHT. He joined Inria in 2009, where he leads the Inria project team HiePACS, which works on the design of scalable numerical techniques for emerging parallel computers. His main research interests are in numerical linear algebra with applications in various engineering domains. He currently serves on the editorial boards of SIAM SISC and SIMAX.