REGENESYS - Regeneration of Replicated Systems

Executive Summary

Information technology (IT) systems deployed in environments where malicious adversaries may be present (e.g., the Internet) are exposed to dangerous attacks. Moreover, it is well known that attackers are continuously developing new techniques to carry out these attacks. A successful attack may result in an intrusion, giving the intruder arbitrary control over the compromised system.

The usual way of fighting such evolving malicious behavior is to apply security patches to operating systems or to introduce newer (better) versions of the application code. Typically, these activities are carried out by a system administrator and may introduce periods of unavailability in system operation.

We argue that the security of money-critical (e.g., online banking, e-commerce websites) and safety-critical (e.g., power/water/gas infrastructures connected to the Internet) systems should not depend on human intervention and that unavailability should be avoided at all costs. To increase the security of IT systems exposed to malicious attacks, these systems should be able to deal with attacks and intrusions in an automatic way.

In the last decade, a large number of Byzantine fault-tolerant (BFT) protocols have been proposed. These protocols, also called intrusion-tolerant protocols, may be used in replicated systems to tolerate the arbitrary failure of a bounded number of replicas, denoted by f (typically, f=1 or f=2). However, BFT protocols alone are not enough. They have limited utility in long-lived systems where malicious adversaries are constantly deploying attacks and causing intrusions, because the allowed number of failures (f) may eventually be exhausted. To deal with this problem, we argue that intrusion-tolerant protocols should be complemented with regeneration mechanisms able to reduce the probability of an adversary compromising more than f replicas. The regeneration of a replica may include various actions but, at a minimum, it removes the effects of any existing intrusions and applies security patches, restoring the replica to a correct state. These actions imply a non-negligible unavailability time for the replica being regenerated.
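
The exhaustion argument above can be made concrete with a small back-of-the-envelope calculation. The sketch below is illustrative only and rests on assumptions that are not part of the proposal: replicas are compromised independently, the time to compromise a replica is exponentially distributed with a hypothetical rate `rate`, and a regeneration resets a replica to a correct state so that the exposure window is bounded by the regeneration period. Real replicas often share vulnerabilities, so the independence assumption in particular is optimistic.

```python
from math import comb, exp


def p_exceeds_f(n, f, rate, window):
    """Probability that more than f of n replicas are compromised
    within an exposure window of the given length.

    Assumptions (hypothetical, for illustration): independent replicas and
    exponentially distributed time-to-compromise with parameter `rate`.
    """
    # Probability that a single replica is compromised within the window.
    p = 1.0 - exp(-rate * window)
    # Binomial tail: at least f + 1 of the n replicas are compromised.
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k)
               for k in range(f + 1, n + 1))


if __name__ == "__main__":
    n, f, rate = 4, 1, 0.01  # illustrative values only
    for window in (1, 10, 100, 1000):
        print(window, p_exceeds_f(n, f, rate, window))
```

Under these assumptions the probability grows toward 1 as the window grows, which corresponds to a long-lived system without regeneration, and stays small when the window is short, which is the effect regeneration is meant to achieve.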

The goal of the project is to design, implement, and evaluate a regeneration service able to enhance the security of replicated systems exposed to accidental (e.g., server crashes) and malicious (e.g., virus infection, server intrusions) faults. To achieve this goal, the regeneration service should integrate with existing BFT protocols, enhancing their intrusion tolerance properties. Moreover, the service should be flexible, allowing both planned and unplanned regeneration actions. Planned regenerations are defined at deployment time and are triggered periodically. Unplanned regenerations are triggered on demand when a dangerous situation is predicted. Planned and unplanned regenerations are combined in a way that maximizes the availability of the replicas needed to ensure the normal operation of the replicated system.
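
One way to picture the combination of planned and unplanned regenerations is the toy scheduler below. It is a minimal sketch under stated assumptions, not the service proposed here: time advances in discrete ticks, a regeneration takes a fixed number of ticks, and the system remains operational as long as at least `min_available` replicas are online. The class name, its parameters, and the policy of serving unplanned requests before planned ones are all hypothetical.

```python
from collections import deque


class RegenerationScheduler:
    """Toy tick-based scheduler (hypothetical names and policy)."""

    def __init__(self, n, min_available, period, regen_ticks):
        if not 0 < min_available < n:
            raise ValueError("need 0 < min_available < n")
        self.n = n
        # Replicas that may be offline at once without losing availability.
        self.max_offline = n - min_available
        self.period = period            # ticks between planned regenerations
        self.regen_ticks = regen_ticks  # assumed fixed regeneration duration
        self.now = 0
        self.planned_due = 0            # tick at which the next planned one is due
        self.next_planned = 0           # round-robin pointer over replica ids
        self.urgent = deque()           # unplanned requests, first come first served
        self.offline = {}               # replica id -> tick at which it is back

    def request_unplanned(self, replica):
        """Queue an on-demand regeneration, e.g. after a predicted danger."""
        if replica not in self.urgent and replica not in self.offline:
            self.urgent.append(replica)

    def _start(self, replica, started):
        self.offline[replica] = self.now + self.regen_ticks
        started.append(replica)

    def tick(self):
        """Advance one tick; return the replicas whose regeneration starts now."""
        # Replicas whose regeneration has finished come back online.
        self.offline = {r: t for r, t in self.offline.items() if t > self.now}
        started = []
        # Unplanned regenerations are served first, within the offline budget.
        while self.urgent and len(self.offline) < self.max_offline:
            self._start(self.urgent.popleft(), started)
        # A planned regeneration runs when due and is deferred if no slot is free.
        if self.now >= self.planned_due and len(self.offline) < self.max_offline:
            replica = self.next_planned
            self.next_planned = (replica + 1) % self.n
            self.planned_due = self.now + self.period
            if replica not in self.offline:  # skip if already being regenerated
                self._start(replica, started)
        self.now += 1
        return started
```

For example, with `RegenerationScheduler(n=4, min_available=3, period=5, regen_ticks=2)`, the first call to `tick()` starts the planned regeneration of replica 0. An unplanned request for replica 2 issued right afterwards is held back until replica 0 is online again, so no more than one replica is ever offline at a time.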

The two main scientific challenges of the proposed regeneration service are the following:

the combination of planned and unplanned regenerations in a way that does not disturb the normal operation of the replicated system, namely its availability;
the integration of the regeneration service with existing BFT protocols. This second challenge is especially difficult given that most existing BFT protocols only deal with permanent faults (i.e., they assume that a replica either is correct during the entire execution or fails arbitrarily at some instant and is never restored) and, consequently, are not prepared to deal with the transient faults introduced by the regeneration of replicas. Solving this challenge will allow the creation of a novel class of BFT protocols.

The project team has a solid background in the design of BFT protocols and regeneration mechanisms, and therefore has the know-how needed to address the scientific challenges enumerated above.

The expected results from the project are:

(1) the definition of a novel class of BFT protocols that are able to tolerate permanent and transient arbitrary faults;
(2) the specification of an abstract methodology to convert existing BFT protocols into protocols that tolerate permanent and transient arbitrary faults;
(3) the specification of a novel regeneration service that accommodates both planned and unplanned regenerations and that integrates with BFT protocols converted using the methodology defined in (2);
(4) the implementation of this regeneration service and of one or more converted BFT protocols;
(5) the evaluation of the security and availability of an intrusion-tolerant application in two scenarios: using classical BFT protocols and using the novel class of BFT protocols combined with the regeneration service.