Fault-Tolerance Techniques for High-Performance Computing

Author: Thomas Herault
Publisher: Springer
Total Pages: 325
Release: 2015-07-01
Genre: Computers
ISBN: 3319209434


This timely text presents a comprehensive overview of fault tolerance techniques for high-performance computing (HPC). The text opens with a detailed introduction to the concepts of checkpoint protocols and scheduling algorithms, prediction, replication, silent error detection and correction, together with some application-specific techniques such as ABFT. Emphasis is placed on analytical performance models. This is then followed by a review of general-purpose techniques, including several checkpoint and rollback recovery protocols. Relevant execution scenarios are also evaluated and compared through quantitative models. Features: provides a survey of resilience methods and performance models; examines the various sources for errors and faults in large-scale systems; reviews the spectrum of techniques that can be applied to design a fault-tolerant MPI; investigates different approaches to replication; discusses the challenge of energy consumption of fault-tolerance methods in extreme-scale systems.
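As an illustration of the kind of analytical performance model the description refers to, the classic first-order Young/Daly approximation for periodic checkpointing can be written as follows. It is a standard result, given here as a sketch and not quoted from the book. The symbols $C$ (time to take one checkpoint), $\mu$ (platform mean time between failures) and $T$ (checkpoint period) are notation chosen for this example. The approximation assumes $C \ll T \ll \mu$ and ignores downtime and recovery cost.

```latex
\[
\mathrm{Waste}(T) \;\approx\; \frac{C}{T} + \frac{T}{2\mu},
\qquad
\frac{d\,\mathrm{Waste}}{dT} = -\frac{C}{T^{2}} + \frac{1}{2\mu} = 0
\;\Longrightarrow\;
T_{\mathrm{opt}} = \sqrt{2\mu C}.
\]
```

For example, under these assumptions a checkpoint cost of $C = 60$ s and an MTBF of $\mu = 86400$ s (one day) give $T_{\mathrm{opt}} = \sqrt{2 \cdot 86400 \cdot 60} \approx 3220$ s, roughly 54 minutes between checkpoints, with an estimated waste of about 3.7% of the execution time.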

Fault Tolerance

Author: Peter A. Lee
Publisher: Springer Science & Business Media
Total Pages: 326
Release: 2012-12-06
Genre: Computers
ISBN: 370918990X


The production of a new version of any book is a daunting task, as many authors will recognise. In the field of computer science, the task is made even more daunting by the speed with which the subject and its supporting technology move forward. Since the publication of the first edition of this book in 1981 much research has been conducted, and many papers have been written, on the subject of fault tolerance. Our aim then was to present for the first time the principles of fault tolerance together with current practice to illustrate those principles. We believe that the principles have (so far) stood the test of time and are as appropriate today as they were in 1981. Much work on the practical applications of fault tolerance has been undertaken, and techniques have been developed for ever more complex situations, such as those required for distributed systems. Nevertheless, the basic principles remain the same.

Software Fault Tolerance Techniques and Implementation

Author: Laura L. Pullum
Publisher: Artech House
Total Pages: 358
Release: 2001
Genre: Computers
ISBN: 1580531377


Look to this innovative resource for the most-comprehensive coverage of software fault tolerance techniques available in a single volume. It offers you a thorough understanding of the operation of critical software fault tolerance techniques and guides you through their design, operation and performance. You get an in-depth discussion on the advantages and disadvantages of specific techniques, so you can decide which ones are best suited for your work.

Software-Implemented Hardware Fault Tolerance

Author: Olga Goloubeva
Publisher: Springer Science & Business Media
Total Pages: 238
Release: 2006-09-19
Genre: Technology & Engineering
ISBN: 0387329374


This book presents the theory behind software-implemented hardware fault tolerance, as well as the practical aspects needed to put it to work on real examples. By evaluating accurately the advantages and disadvantages of the already available approaches, the book provides a guide to developers willing to adopt software-implemented hardware fault tolerance in their applications. Moreover, the book identifies open issues for researchers willing to improve the already available techniques.

Fault Tolerance for Iterative Methods in High-performance Computing

Author: Dingwen Tao
Publisher:
Total Pages: 154
Release: 2018
Genre: Cellular automata
ISBN: 9780438429512


Iterative methods are commonly used approaches to solve large, sparse linear systems, which are fundamental operations for many modern scientific simulations. When the large-scale iterative methods are running with a large number of ranks in parallel, they are anticipated to be more susceptible to soft errors in both logic circuits and memory subsystems and fail-stop errors in the entire system, considering large component counts and lower power margins of emerging high-performance computing (HPC) platforms.

Software Performability: From Concepts to Applications

Author: Ann T. Tai
Publisher: Springer Science & Business Media
Total Pages: 207
Release: 2012-12-06
Genre: Computers
ISBN: 1461313252


Computers are currently used in a variety of critical applications, including systems for nuclear reactor control, flight control (both aircraft and spacecraft), and air traffic control. Moreover, experience has shown that the dependability of such systems is particularly sensitive to that of their software components, both the system software of the embedded computers and the application software they support. Software Performability: From Concepts to Applications addresses the construction and solution of analytic performability models for critical-application software. The book includes a review of general performability concepts along with notions which are peculiar to software performability. Since fault tolerance is widely recognized as a viable means for improving the dependability of computer systems (beyond what can be achieved by fault prevention), the examples considered are fault-tolerant software systems that incorporate particular methods of design diversity and fault recovery. Software Performability: From Concepts to Applications will be of direct benefit to both practitioners and researchers in the area of performance and dependability evaluation, fault-tolerant computing, and dependable systems for critical applications. For practitioners, it supplies a basis for defining combined performance-dependability criteria (in the form of objective functions) that can be used to enhance the performability (performance/dependability) of existing software designs. For those with research interests in model-based evaluation, the book provides an analytic framework and a variety of performability modeling examples in an application context of recognized importance. The material contained in this book will both stimulate future research on related topics and, for teaching purposes, serve as a reference text in courses on computer system evaluation, fault-tolerant computing, and dependable high-performance computer systems.

Scalable Techniques for Fault Tolerant High Performance Computing

Author:
Publisher:
Total Pages: 174
Release: 2006
Genre:
ISBN:


As the number of processors in today's parallel systems continues to grow, the mean-time-to-failure of these systems is becoming significantly shorter than the execution time of many parallel applications. It is increasingly important for large parallel applications to be able to continue to execute in spite of the failure of some components in the system. Today's long running scientific applications typically tolerate failures by checkpoint/restart in which all process states of an application are saved into stable storage periodically. However, as the number of processors in a system increases, the amount of data that need to be saved into stable storage increases linearly. Therefore, the classical checkpoint/restart approach has a potential scalability problem for large parallel systems. In this research, we explore scalable techniques to tolerate a small number of process failures in large scale parallel computing. The goal of this research is to develop scalable fault tolerance techniques to help to make future high performance computing applications self-adaptive and fault survivable. The fundamental challenge in this research is scalability. To approach this challenge, this research (1) extended existing diskless checkpointing techniques to enable them to better scale in large scale high performance computing systems; (2) designed checkpoint-free fault tolerance techniques for linear algebra computations to survive process failures without checkpoint or rollback recovery; (3) developed coding approaches and novel erasure correcting codes to help applications to survive multiple simultaneous process failures. The fault tolerance schemes we introduce in this dissertation are scalable in the sense that the overhead to tolerate a failure of a fixed number of processes does not increase as the number of total processes in a parallel system increases. Two prototype examples have been developed to demonstrate the effectiveness of our techniques.
In the first example, we developed a fault survivable conjugate gradient solver that is able to survive multiple simultaneous process failures with negligible overhead. In the second example, we incorporated our checkpoint-free fault tolerance technique into the ScaLAPACK/PBLAS matrix-matrix multiplication code to evaluate the overhead, survivability, and scalability. Theoretical analysis indicates that, to survive a fixed number of process failures, the fault tolerance overhead (without recovery) for matrix-matrix multiplication decreases to zero as the total number of processes (assuming a fixed amount of data per process) increases to infinity. Experimental results demonstrate that the checkpoint-free fault tolerance technique introduces surprisingly low overhead even when the total number of processes used in the application is small.
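The idea of surviving a process failure in matrix-matrix multiplication without a checkpoint can be sketched with a checksum encoding. The Python below is a minimal, single-process illustration and not the dissertation's ScaLAPACK/PBLAS code. Each row of the product stands in for the share held by one process, and one extra checksum row is carried through the multiplication. The function names (`matmul`, `add_checksum_row`, `recover_row`) are hypothetical. The sketch tolerates only one lost row and uses integers so that recovery is exact; with floating-point data the recovered values would carry round-off error.

```python
def matmul(a, b):
    """Plain matrix product of two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def add_checksum_row(a):
    """Return `a` with one extra row holding its column sums."""
    return a + [[sum(col) for col in zip(*a)]]


def recover_row(c_encoded, lost):
    """Rebuild row `lost` of the product.

    `c_encoded` is the product of the checksum-encoded left operand with
    the right operand, so its last row equals the sum of all data rows.
    The lost row is that checksum minus the surviving data rows; the
    entry at index `lost` is never read, so it may be None.
    """
    checksum = c_encoded[-1]
    data_rows = c_encoded[:-1]
    return [
        checksum[j] - sum(row[j] for i, row in enumerate(data_rows) if i != lost)
        for j in range(len(checksum))
    ]


if __name__ == "__main__":
    a = [[1, 2], [3, 4], [5, 6]]
    b = [[7, 8], [9, 10]]
    c_encoded = matmul(add_checksum_row(a), b)
    expected = list(c_encoded[1])
    c_encoded[1] = None  # simulate losing the "process" that held row 1
    print(recover_row(c_encoded, 1) == expected)
```

Because the checksum row is computed as part of the multiplication itself, no state is saved separately and nothing is rolled back: the lost share is reconstructed from the survivors. Tolerating several simultaneous failures would need more than one checksum row with distinct weights, which this sketch does not attempt.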

Enhancing Software Fault Prediction With Machine Learning: Emerging Research and Opportunities

Author: Ekbal Rashid
Publisher: IGI Global
Total Pages: 143
Release: 2017-09-13
Genre: Computers
ISBN: 1522531866


Software development and design is an intricate and complex process that requires a multitude of steps to ultimately create a quality product. One crucial aspect of this process is minimizing potential errors through software fault prediction. Enhancing Software Fault Prediction With Machine Learning: Emerging Research and Opportunities is an innovative source of material on the latest advances and strategies for software quality prediction. Including a range of pivotal topics such as case-based reasoning, rate of improvement, and expert systems, this book is an ideal reference source for engineers, researchers, academics, students, professionals, and practitioners interested in novel developments in software design and analysis.

Fault-Tolerant Parallel and Distributed Systems

Author: Dimiter R. Avresky
Publisher: Springer Science & Business Media
Total Pages: 396
Release: 2012-12-06
Genre: Computers
ISBN: 1461554497


The most important use of computing in the future will be in the context of the global "digital convergence" where everything becomes digital and everything is inter-networked. The application will be dominated by storage, search, retrieval, analysis, exchange and updating of information in a wide variety of forms. Heavy demands will be placed on systems by many simultaneous requests. And, fundamentally, all this shall be delivered at much higher levels of dependability, integrity and security. Increasingly, large parallel computing systems and networks are providing unique challenges to industry and academia in dependable computing, especially because of the higher failure rates intrinsic to these systems. The challenge in the last part of this decade is to build a system that is both inexpensive and highly available. A machine cluster built of commodity hardware parts, with each node running an OS instance and a set of applications extended to be fault resilient can satisfy the new stringent high-availability requirements. The focus of this book is to present recent techniques and methods for implementing fault-tolerant parallel and distributed computing systems. Section I, Fault-Tolerant Protocols, considers basic techniques for achieving fault-tolerance in communication protocols for distributed systems, including synchronous and asynchronous group communication, static total causal ordering protocols, and fail-aware datagram service that supports communications by time.