Transparent Fault Tolerance for Job Healing in HPC Environments

Release: 2004

As the number of nodes in high-performance computing (HPC) environments keeps increasing, faults are becoming commonplace, causing losses in intermediate results of HPC jobs. Furthermore, storage systems providing job input data have been shown to consistently rank as the primary source of system failures, leading to data unavailability and job resubmissions. This dissertation presents a combination of multiple fault tolerance (FT) techniques that realize significant advances in fault resilience of HPC jobs. The efforts encompass two broad areas. First, at the job level, novel, scalable mechanisms are built in support of proactive FT and to significantly enhance reactive FT. The contributions of this dissertation in this area are (1) a transparent job pause mechanism, which allows a job to pause when a process fails and prevents it from having to re-enter the job queue; (2) a proactive fault-tolerant approach that combines process-level live migration with health monitoring to complement reactive with proactive FT and to reduce the number of checkpoints when a majority of the faults can be handled proactively; (3) a novel back migration approach to eliminate load imbalance or bottlenecks caused by migrated tasks; and (4) an incremental checkpointing mechanism, which is combined with full checkpoints to explore the potential of reducing the overhead of checkpointing by performing fewer full checkpoints interspersed with multiple smaller incremental checkpoints. Second, for the job input data, transparent techniques are provided to improve the reliability, availability and performance of HPC I/O systems.
In this area, the dissertation contributes (1) a mechanism for offline job input data reconstruction to ensure availability of job input data and to improve center-wide performance at no cost to job owners; (2) an approach to automatically recover job input data at run-time during failures by recovering staged data from an original source; and (3) "just in time" replication.

Fault-Tolerance Techniques for High-Performance Computing

Author: Thomas Herault
Publisher: Springer
Total Pages: 325
Release: 2015-07-01
Genre: Computers
ISBN: 3319209434

This timely text presents a comprehensive overview of fault tolerance techniques for high-performance computing (HPC). The text opens with a detailed introduction to the concepts of checkpoint protocols and scheduling algorithms, prediction, replication, silent error detection and correction, together with some application-specific techniques such as ABFT. Emphasis is placed on analytical performance models. This is then followed by a review of general-purpose techniques, including several checkpoint and rollback recovery protocols. Relevant execution scenarios are also evaluated and compared through quantitative models. Features: provides a survey of resilience methods and performance models; examines the various sources for errors and faults in large-scale systems; reviews the spectrum of techniques that can be applied to design a fault-tolerant MPI; investigates different approaches to replication; discusses the challenge of energy consumption of fault-tolerance methods in extreme-scale systems.

Handbook of Cloud Computing

Author: Borko Furht
Publisher: Springer Science & Business Media
Total Pages: 638
Release: 2010-09-11
Genre: Computers
ISBN: 1441965246

Cloud computing has become a significant technology trend. Experts believe cloud computing is currently reshaping information technology and the IT marketplace. The advantages of using cloud computing include cost savings, speed to market, access to greater computing resources, high availability, and scalability. Handbook of Cloud Computing includes contributions from world experts in the field of cloud computing from academia, research laboratories and private industry. This book presents the systems, tools, and services of the leading providers of cloud computing, including Google, Yahoo, Amazon, IBM, and Microsoft. The basic concepts of cloud computing and cloud computing applications are also introduced. Current and future technologies applied in cloud computing are also discussed. Case studies, examples, and exercises are provided throughout. Handbook of Cloud Computing is intended for advanced-level students and researchers in computer science and electrical engineering as a reference book. This handbook is also beneficial to computer and system infrastructure designers, developers, business managers, entrepreneurs and investors within the cloud computing related industry.

Administering Data Centers

Author: Kailash Jayaswal
Publisher: John Wiley & Sons
Total Pages: 668
Release: 2005-10-28
Genre: Computers
ISBN: 0471783358

"This book covers a wide spectrum of topics relevant to implementing and managing a modern data center. The chapters are comprehensive and the flow of concepts is easy to understand." -Cisco reviewer

Gain a practical knowledge of data center concepts

To create a well-designed data center (including storage and network architecture, VoIP implementation, and server consolidation) you must understand a variety of key concepts and technologies. This book explains those factors in a way that smoothes the path to implementation and management. Whether you need an introduction to the technologies, a refresher course for IT managers and data center personnel, or an additional resource for advanced study, you'll find these guidelines and solutions provide a solid foundation for building reliable designs and secure data center policies.

* Understand the common causes and high costs of service outages
* Learn how to measure high availability and achieve maximum levels
* Design a data center using optimum physical, environmental, and technological elements
* Explore a modular design for cabling, Points of Distribution, and WAN connections from ISPs
* See what must be considered when consolidating data center resources
* Expand your knowledge of best practices and security
* Create a data center environment that is user- and manager-friendly
* Learn how high availability, clustering, and disaster recovery solutions can be deployed to protect critical information
* Find out how to use a single network infrastructure for IP data, voice, and storage

Data Center Networks

Author: Yang Liu
Publisher: Springer Science & Business Media
Total Pages: 77
Release: 2013-09-26
Genre: Computers
ISBN: 331901949X

This SpringerBrief presents a survey of data center network designs and topologies and compares several properties in order to highlight their advantages and disadvantages. The brief also explores several routing protocols designed for these topologies and compares the basic algorithms to establish connections, the techniques used to gain better performance, and the mechanisms for fault-tolerance. Readers will be equipped to understand how current research on data center networks enables the design of future architectures that can improve performance and dependability of data centers. This concise brief is designed for researchers and practitioners working on data center networks, comparative topologies, fault tolerance routing, and data center management systems. The context provided and information on future directions will also prove valuable for students interested in these topics.

Software Fault Tolerance Techniques and Implementation

Author: Laura L. Pullum
Publisher: Artech House
Total Pages: 358
Release: 2001
Genre: Computers
ISBN: 1580531377

Look to this innovative resource for the most comprehensive coverage of software fault tolerance techniques available in a single volume. It offers you a thorough understanding of the operation of critical software fault tolerance techniques and guides you through their design, operation and performance. You get an in-depth discussion of the advantages and disadvantages of specific techniques, so you can decide which ones are best suited for your work.

Cloud Computing and Software Services

Author: Syed A. Ahson
Publisher: CRC Press
Total Pages: 458
Release: 2010-07-19
Genre: Computers
ISBN: 9781439803165

Whether you're already in the cloud, or determining whether or not it makes sense for your organization, Cloud Computing and Software Services: Theory and Techniques provides the technical understanding needed to develop and maintain state-of-the-art cloud computing and software services. From basic concepts and recent research findings to fut

Pervasive Computing

Author: Ciprian Dobre
Publisher: Morgan Kaufmann
Total Pages: 550
Release: 2016-05-06
Genre: Computers
ISBN: 0128037024

Pervasive Computing: Next Generation Platforms for Intelligent Data Collection presents current advances and state-of-the-art work on methods, techniques, and algorithms designed to support pervasive collection of data under ubiquitous networks of devices able to intelligently collaborate towards common goals. Using numerous illustrative examples and following both theoretical and practical results, the authors discuss: a coherent and realistic image of today’s architectures, techniques, protocols, components, orchestration, choreography, and developments related to pervasive computing; components for intelligently collecting data; resource and data management issues; the importance of data security and privacy in the era of big data; and the benefits of pervasive computing and the development process for scientific and commercial applications and platforms to support them in this field. Pervasive computing has developed technology that allows sensing, computing, and wireless communication to be embedded in everyday objects, from cell phones to running shoes, enabling a range of context-aware applications. Pervasive computing is supported by technology able to acquire and make use of the ubiquitous data sensed or produced by many sensors blended into our environment, designed to make available a wide range of new context-aware applications and systems. While such applications and systems are useful, the time has come to develop the next generation of pervasive computing systems. Future systems will be data oriented and need to support quality data, in terms of accuracy, latency and availability. Pervasive Computing is intended as a platform for the dissemination of research efforts and presentation of advances in the pervasive computing area, and constitutes a flagship driver towards presenting and supporting advanced research in this area.
Indexing: The books of this series are submitted to EI-Compendex and SCOPUS

* Offers a coherent and realistic image of today’s architectures, techniques, protocols, components, orchestration, choreography, and development related to pervasive computing
* Explains the state-of-the-art technological solutions necessary for the development of next-generation pervasive data systems, including: components for intelligently collecting data, resource and data management issues, fault tolerance, data security, monitoring and controlling big data, and applications for pervasive context-aware processing
* Presents the benefits of pervasive computing, and the development process of scientific and commercial applications and platforms to support them in this field
* Provides numerous illustrative examples and follows both theoretical and practical results to serve as a platform for the dissemination of research advances in the pervasive computing area