Distribution and fault tolerance are closely interrelated. Fault tolerance is a major area of
distributed system design and covers various methodologies. Studies show that,
in the past, many critical applications failed under faults because their level
of fault tolerance was insufficient. Various issues must be examined during distributed
system design and properly addressed to achieve the desired level of fault
tolerance. In this paper we define terminology such as
failure, fault, fault tolerance, recovery, redundancy, and security, and explain
the basic concepts of fault tolerance in distributed environments. We also
describe two kinds of fault tolerance and ways of achieving them, and
present various solutions and architectures that implement fault tolerance in
different facets of distributed systems. The overall goal of this paper is to give an
understanding of fault tolerance and its different mechanisms in distributed systems
and to familiarize the reader with current research in this area.
Distributed systems consist of a variety of
hardware and software components, so fault tolerance mechanisms can be classified
into software mechanisms, hardware mechanisms, and system fault tolerance;
this paper discusses all three.
The majority of fault-tolerant designs have been directed toward building computers
that automatically recover from random faults occurring in hardware components.
The techniques employed generally involve partitioning a computing
system into modules that act as fault-containment regions. Each module is backed
up with protective redundancy so that, if the module fails, others can assume
its function. Special mechanisms are added to detect errors and implement recovery.
Efforts to attain software that can tolerate software design faults
(programming errors) have made use of static and dynamic redundancy approaches similar
to those used for hardware faults. One such approach, N-version programming, uses
static redundancy in the form of independently written programs (versions) that
perform the same functions, and their outputs are voted at special checkpoints.
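The voting step of N-version programming can be sketched as follows. The three versions and the majority voter below are illustrative assumptions, not taken from any particular system:

```python
from collections import Counter

def nvp_vote(results):
    """Majority vote over the outputs of independently written versions.

    Returns the value produced by a strict majority of versions; raises
    if no majority exists (the voter cannot decide).
    """
    value, count = Counter(results).most_common(1)[0]
    if count > len(results) // 2:
        return value
    raise RuntimeError("no majority among versions")

# Three hypothetical versions of the same function; version_c has a design fault.
def version_a(x): return x * x
def version_b(x): return x ** 2
def version_c(x): return x + x   # programming error: wrong for most inputs

def fault_tolerant_square(x):
    # Each version runs independently; outputs are voted at a checkpoint.
    return nvp_vote([version_a(x), version_b(x), version_c(x)])
```

With a strict majority vote, the faulty version is outvoted as long as the remaining versions agree, which is why the versions must be developed independently.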
Software and System Fault Tolerance
The main mechanisms of software fault tolerance are checkpoint storage and rollback
recovery, since they are simple and have low system overheads. A checkpoint is
a snapshot of the entire state of the process at the moment it was taken: it
represents all the information needed to restart the process from
that point. One implementation of software fault tolerance is a
utility program that regularly stores checkpoints of the target system. When errors
occur, this utility program is used for rollback recovery. The system described
in this study is a type of software fault tolerance. The final fault-tolerance
method is called system fault tolerance. Here, a complete system architecture is built that
can automatically store program checkpoints, memory blocks, and memory ranges.
Such a system can detect errors occurring in the application itself and, when
errors occur, provide corresponding processing to correct them.
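A minimal sketch of checkpoint storage and rollback recovery, assuming for illustration that the process state fits in a dictionary (a real utility would snapshot memory, registers, and open files):

```python
import copy

class CheckpointedProcess:
    """Minimal sketch of checkpoint storage and rollback recovery."""

    def __init__(self, state):
        self.state = state
        self._checkpoints = []

    def take_checkpoint(self):
        # A checkpoint is a snapshot of the entire state at this moment.
        self._checkpoints.append(copy.deepcopy(self.state))

    def rollback(self):
        # On error, restart the process from the most recent checkpoint.
        if not self._checkpoints:
            raise RuntimeError("no checkpoint to roll back to")
        self.state = copy.deepcopy(self._checkpoints[-1])

proc = CheckpointedProcess({"counter": 0})
proc.state["counter"] = 10
proc.take_checkpoint()          # snapshot taken: counter == 10
proc.state["counter"] = -999    # a fault corrupts the state
proc.rollback()                 # recovery restores the snapshot
```

The deep copies matter: storing a reference to the live state would let later corruption silently overwrite the checkpoint as well.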
Failure of any of these components can lead
to unanticipated, potentially disruptive behavior and to service
unavailability. We therefore need solutions for software fault tolerance
over and above redundancy. Distributed systems are also vulnerable to
security threats because of the openness of their operating environment. Researchers
term these problems system faults. Providing proper fault detection mechanisms, along
with redundancy in the system through replication of data and resources, can prevent
failures. Recovery also requires rolling back the execution of all the
affected processes. Security is mainly concerned with issues such as
authentication and authorization, and we do not discuss it in
this paper. The aim of fault-tolerant distributed systems is to provide
proper solutions to these system faults upon their occurrence and to make the
system more dependable by increasing its reliability. The solutions to these
system faults should be transparent to users of the system. Such fault-tolerant
behavior is essential in critical applications like flight control
systems, hazardous industries, and nuclear power plants, as well as in non-critical
ones like communications and transaction processing; these days, transaction
processing applications are also considered critical. More formal and abstract
approaches have led to a better understanding of the problems faced and are key
to developing fault-resilient systems. Researchers have pursued work in various
paradigms of distributed fault tolerance, from failure detection to mobile
security. One such approach specifies action models and a path-based solution
algorithm to provide an intuitive, high-level modeling formalism for fault-tolerant
distributed computing systems and to analyze the impact of fault tolerance
mechanisms on user-perceived reliability.
Our focus is on understanding the basic
concepts of fault tolerance and their implementation with the help of case studies.
The paper covers the basic concepts and terminology used, broadly defines
two forms of fault tolerance, and describes the phases of achieving fault tolerance,
such as failure detection and redundancy.
Basic Concepts and Terminologies
Being fault tolerant is very much related
to what are called dependable systems. A system is dependable when it is
trustworthy enough that reliance can be placed on the service that it delivers.
For a system to be dependable, it must be available (e.g., ready for use when
we need it), reliable (e.g., able to provide continuity of service while we are
using it), safe (e.g., does not have a catastrophic consequence on the
environment), and secure (e.g., able to preserve confidentiality). Following
are a few terms that are closely related to the dependability of a system
and its behavior:
Fault – A defect at the lowest
level of abstraction. It can lead to an erroneous system state. Faults may be
classified as transient, intermittent, or permanent, and
can be of the following types:
Processor Faults (Node Faults): Processor faults occur when the processor
behaves in an unexpected manner. They may be classified into three kinds:
Fail-Stop – Here a processor either is active and participates in distributed
protocols, or has failed totally and will never respond. In this case the
neighboring processors can detect the failed processor.
Slowdown – Here a processor might run in a degraded fashion or might fail totally.
Byzantine – Here a processor can fail, run in a degraded fashion for some time, or
execute at normal speed while trying to make the computation fail.
Network Faults (Link Faults): Network faults occur when (live and working)
processors are prevented from communicating with each other. Link faults can
cause new kinds of problems, such as:
One-way Links – Here one processor can send messages to another but is not able to
receive messages in return. This kind of problem is similar to that caused by a
failed processor, since the sender cannot tell whether the other side is alive.
Network Partition – Here a portion of the network is completely isolated from the
rest of the network.
Error – An undesirable system state that may lead to failure of the system.
Failure – Deviation of the delivered service from the system's intended
behavior; the externally observable consequence of an error.
Fault Tolerance – Ability of system to behave in a
well-defined manner upon occurrence of faults.
Recovery – Recovery is a passive approach in which
the state of the system is maintained and is used to roll back the execution to
a predefined checkpoint.
Redundancy – With respect to fault tolerance it is
replication of hardware, software components or computation.
Security – Robustness of the system characterized by
secrecy, integrity, availability, reliability and safety during its operation.
Types of Fault Tolerance and Failure Detection
Types of Fault Tolerance
For a distributed program to behave
correctly, it must satisfy two properties: safety and liveness. We explain
these concepts before going ahead with types of fault tolerance.
Safety means that some specific “bad thing”
never happens within a system. Formally, this can be characterized by
specifying when an execution e is “not safe” for a property p: if e ∉ p, there must be an identifiable discrete
event in e that prohibits all possible extensions of the execution from being safe
(e.g., simultaneous updating of a shared object). A distributed program is safe if the system
always remains within its set of safe states. On the other hand, a liveness
property claims that some “good thing” will eventually happen during the system
execution. Formally, a partial execution of a system is live for property p, if
and only if it can be legally extended so as to remain in p. “Legally” here
means that the extension must be allowed by the system itself. For example, a process
waiting for access to a shared object will eventually be allowed to do so. To
behave correctly, a distributed program A must satisfy both its safety and its
liveness property. Now, upon occurrence of a fault, how are the properties of A affected?
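As an illustration of the formal definitions, the sketch below checks a finite execution trace against the safety property "no two processes hold a shared lock simultaneously". The trace format and function names are assumptions for illustration; note that a safety violation is witnessed by a single discrete event in a finite prefix, whereas liveness can only be judged over complete executions:

```python
def violates_mutual_exclusion(trace):
    """Check the safety property 'no two simultaneous lock holders' over a
    finite trace of (process, 'acquire'|'release') events.

    Returns the index of the first violating event, or None if the
    execution is safe. Once the violating event occurs, no extension of
    the execution can be safe again.
    """
    holder = None
    for i, (proc, op) in enumerate(trace):
        if op == "acquire":
            if holder is not None and holder != proc:
                return i          # the identifiable discrete "bad" event
            holder = proc
        elif op == "release" and holder == proc:
            holder = None
    return None

safe_run   = [("p1", "acquire"), ("p1", "release"), ("p2", "acquire")]
unsafe_run = [("p1", "acquire"), ("p2", "acquire")]
```

A liveness property such as "every waiting process eventually acquires the lock" cannot be refuted by any finite prefix alone, which is exactly the formal asymmetry between the two properties.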
Numerous efforts have been made by researchers to efficiently
detect and determine the type of fault in a system. One significant contribution
has been the solution of the consensus and
atomic broadcast problems with unreliable failure detectors, using rigorous
formalization. Others have further investigated their
efficiency, using measures such as the longest message chain before a decision and
greatest lower bounds for the main classes of failure detectors.
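A common realization of an unreliable failure detector is heartbeat-plus-timeout. The sketch below is a simplified, logical-time version, not any specific published detector; the class and method names are illustrative assumptions:

```python
class HeartbeatDetector:
    """Sketch of an unreliable, timeout-based failure detector.

    A process is suspected if no heartbeat has arrived within `timeout`
    time units. The detector is 'unreliable': a merely slow process may
    be wrongly suspected, and is un-suspected when a late heartbeat
    finally arrives.
    """

    def __init__(self, processes, timeout):
        self.timeout = timeout
        self.last_seen = {p: 0 for p in processes}  # logical timestamps

    def heartbeat(self, process, now):
        # Record that `process` was alive at logical time `now`.
        self.last_seen[process] = now

    def suspected(self, now):
        # Every process silent for longer than `timeout` is suspected.
        return {p for p, t in self.last_seen.items() if now - t > self.timeout}

fd = HeartbeatDetector(["p1", "p2"], timeout=3)
fd.heartbeat("p1", now=5)
# p2 has been silent since time 0, so at time 6 it is suspected.
```

The possibility of wrong suspicion is what the formal results quantify: consensus remains solvable as long as the detector eventually stops suspecting some correct process.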
No matter how well a system is designed
to tolerate faults, it is always possible that it fails under repeated or severe
faults. We believe that redundancy is a necessary, though not
sufficient, condition for fault tolerance. Redundancy takes two forms:
redundancy in space and redundancy in time.
Redundancy in space refers to the set of configurations of a
program that are never reached in the absence of faults, while redundancy in time refers to the set of actions of a program
that are never executed in the absence of faults.
It is important to note that proper fault
detection is essential for leading the system to safety. Detection
needs information about the state space and/or program actions. Correction upon
detection of a fault ensures the liveness property of the system. From this we can
see that ensuring safety (by detection) is easier to achieve than liveness (by
correction). It is also necessary to add that these detection and correction
mechanisms should themselves be fault tolerant.
Redundancy in space is achieved by replication of components.
Examples of space redundancy are tandem systems in hardware and
adding parity bits to transmissions in software. Redundancy in time means repeating the computation in the
same system; an example of time redundancy is calculating a result more
than once and comparing the outcomes.
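Redundancy in time might be sketched as follows; the accept-on-agreement policy and the simulated transient fault are illustrative assumptions:

```python
def with_time_redundancy(computation, x, max_runs=3):
    """Repeat a computation on the same system and accept the result only
    when two consecutive runs agree.

    This masks transient faults (which corrupt a single run) but not
    permanent ones, which corrupt every repetition identically.
    """
    results = [computation(x)]
    for _ in range(max_runs - 1):
        results.append(computation(x))
        if results[-1] == results[-2]:
            return results[-1]  # two consecutive runs agree: accept
    raise RuntimeError("no two consecutive runs agreed: possible permanent fault")

# Simulate a transient fault that corrupts only the first run.
calls = {"n": 0}
def flaky_double(x):
    calls["n"] += 1
    if calls["n"] == 1:
        return x + 1    # transient fault: wrong result once
    return 2 * x
```

Here the first run returns a corrupted value, the second disagrees with it, and the second and third agree, so the correct result is accepted at the cost of extra time rather than extra hardware.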
Hardware redundancy takes the form of replaceable hardware units, meaning units that fail independently
of other units (or can be added or removed independently). The service provided
by the hardware servers in each replaceable unit should have good failure
semantics (crash and omission). The replaceable hardware units can follow a coarse-granularity or a fine-granularity architecture. The latter architectures, e.g. Stratus and
Sequoia, make each elementary hardware server a replaceable unit by itself.
Designers make certain assumptions regarding hardware failure semantics, e.g.
memory having read-omission failure semantics, in order to use known
hierarchical masking techniques. Failure is detected in hardware either by
error detection codes (where error-detecting circuitry is used) or by
duplication with comparison (where hardware duplication with comparison logic
is used). Error detection codes are used in storage and communication
hardware servers, while duplication and
matching is used in complex circuitry. Hardware server failures are masked
in hardware itself by implementing redundancy management mechanisms (mostly
multiplexing). But such mechanisms do not eliminate the need to
handle processor service failures, operating
system level failures, and application level failures at the application software level. Software redundancy, in a
similar way to hardware redundancy, should provide good failure semantics. These
failure semantics depend on the persistent state of the service.
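Parity, mentioned above as an example of an error detection code, can be illustrated with a small sketch. This is even parity over a list of bits, an assumption for illustration; it detects any single-bit flip but not an even number of flips:

```python
def add_parity(bits):
    """Append an even-parity bit so the total number of 1s is even."""
    return bits + [sum(bits) % 2]

def parity_error_detected(word):
    """An odd number of 1s means at least one bit flipped in transit."""
    return sum(word) % 2 == 1

word = add_parity([1, 0, 1, 1])      # parity bit makes the 1-count even
corrupted = word.copy()
corrupted[0] ^= 1                    # single-bit fault during transmission
```

This is space redundancy in its cheapest form: one extra bit buys detection, though not correction, of single-bit faults.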
Researchers mention the use of simplicity
and clarity during design; hierarchical design methods based on information
hiding and abstract data types; rigorous design specification and verification
techniques; systematic identification, detection, and handling of all exception
occurrences; and the use of modern inspection and testing methods to prevent design
faults in programs. Any software service should replicate its resources in order
to mask service failures. Several issues arise while maintaining software server
groups, such as consistency of states, communication, and replication management.
System failure semantics depends on the failure semantics of all
levels of abstraction of the system, i.e. hardware, operating system, and
application. The goal is to make hardware have crash or omission failure
semantics and to make software totally, or at least partially, correct.
The increasing number of critical
applications has brought more focus and importance to fault tolerance.
The area has grown more and more complex with the advent of networked systems and
distributed applications. Even after substantial research, both theoretical and
practical, this area is still far from convincingly
explored. Though there are some inherent limitations, fault tolerance is not
impossible to implement. As we have seen, formal approaches that deal with the
problem using safety and liveness have helped us understand the concept well enough to
implement it with an exact understanding. Researchers have worked hard to
formalize this field and make it more understandable. We have tried to derive
certain basic properties a system needs in order to be fault tolerant. Researchers
have also formalized the construction of masking fault-tolerant programs by a
component-based method, using a stepwise approach that adds components to a
fault-intolerant program to transform it first into a non-masking fault-tolerant
program and finally into a masking one. There are, however, cases where detection
is not possible before the state of the system transits to an unsafe state, and
only liveness may be guaranteed. Fault detection algorithms have been successfully
implemented to solve consensus problems efficiently. In practice, designing
fault-tolerant systems is largely a compromise of design decisions with respect to
replication level, protocols, fault detection algorithms, and so on.