FT-GCR: a fault-tolerant generalized conjugate residual elliptic solver
Code:
19/2021
Title:
FT-GCR: a fault-tolerant generalized conjugate residual elliptic solver
Date:
Wednesday 31st March 2021
Author(s):
Gillard, M.; Benacchio, T.
Abstract:
With the steady advance of high-performance computing systems
featuring smaller and smaller hardware components, the systems and
algorithms used for numerical simulations increasingly contend with
disruptions caused by hardware failures and bit-level
misrepresentations of computing data. In numerical frameworks exploiting massive
processing power, the solution of linear systems often represents the
most computationally intensive component. Given the large number
of repeated operations involved, iterative solvers are particularly
vulnerable to bit-flips.
A new method named FT-GCR is proposed here that supplies the
preconditioned Generalized Conjugate Residual Krylov solver with
detection of, and recovery from, soft faults. The algorithm checks
that the residual norm decreases monotonically and, when the check
fails, restarts the iteration within the local Krylov space. Numerical experiments
on the solution of an elliptic problem arising from a stationary flow
over an isolated hill on the sphere show the skill of the method in
addressing bit-flips on a range of grid sizes and data loss scenarios,
with best returns and detection rates obtained for larger corruption
events. The simplicity of the method makes it easily extendable to
other solvers and an ideal candidate for algorithmic fault tolerance
within integrated model resilience strategies.
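
The detection-and-restart idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the names ft_gcr_sketch, flip_bit, slack and fault are hypothetical, the test matrix is a small 1D Laplacian rather than the elliptic problem on the sphere, and the recovery step (discard the stored search directions and recompute the residual from the last accepted iterate) is one assumed reading of "restarts the iteration". It relies only on the fact that, in exact arithmetic, each GCR step cannot increase the residual norm.

```python
import numpy as np

def flip_bit(vec, index, bit):
    """Flip one bit of a float64 entry in place (simulated soft fault)."""
    bits = vec.view(np.uint64)  # reinterpret the same memory as integers
    bits[index] ^= np.uint64(1) << np.uint64(bit)

def ft_gcr_sketch(A, b, apply_prec=lambda r: r, tol=1e-8, max_iter=200,
                  slack=1e-8, fault=None):
    """GCR with a residual-monotonicity check (illustrative assumption only).

    fault: optional (iteration, index, bit) describing one injected bit-flip.
    Returns the approximate solution and the number of detected faults.
    """
    x = np.zeros_like(b)
    r = b - A @ x
    bnorm = np.linalg.norm(b)
    rnorm = np.linalg.norm(r)
    P, AP = [], []          # stored search directions p_j and A p_j
    detected = 0
    for k in range(max_iter):
        if rnorm <= tol * bnorm:
            break
        z = apply_prec(r)   # preconditioned residual
        Az = A @ z
        p, Ap = z.copy(), Az.copy()
        # Make A p orthogonal to all stored A p_j.
        for pj, Apj in zip(P, AP):
            beta = -(Az @ Apj) / (Apj @ Apj)
            p += beta * pj
            Ap += beta * Apj
        alpha = (r @ Ap) / (Ap @ Ap)
        x_trial = x + alpha * p
        r_trial = r - alpha * Ap
        if fault is not None and k == fault[0]:
            flip_bit(r_trial, fault[1], fault[2])  # corrupt the new residual
        rnorm_trial = np.linalg.norm(r_trial)
        # Written as "not (<=)" so that a NaN norm also counts as a fault.
        if not (rnorm_trial <= rnorm * (1.0 + slack)):
            detected += 1
            P, AP = [], []          # discard the stored Krylov directions
            r = b - A @ x           # recompute from the last accepted iterate
            rnorm = np.linalg.norm(r)
            continue
        x, r, rnorm = x_trial, r_trial, rnorm_trial
        P.append(p)
        AP.append(Ap)
    return x, detected

# Usage: 1D Laplacian test problem, with and without one injected bit-flip.
n = 30
A = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
b = np.ones(n) / n
x_clean, faults_clean = ft_gcr_sketch(A, b)
x_fault, faults_seen = ft_gcr_sketch(A, b, fault=(3, 7, 62))
```

In this sketch the injected flip hits the highest exponent bit of a residual entry, so the residual norm jumps and the check fires. A flip that lowers the norm or changes it only slightly (for example in a low mantissa bit) would pass this simple test unnoticed, which is in line with the abstract's remark that detection rates are best for larger corruption events.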