Abstract
Real-time fault tolerance (RTFT) is a core technology for increasing the reliability of computer-based safety-critical applications such as space systems and factory automation systems. In recent years, the real-time computing market has begun to grow explosively. Realizing highly robust real-time fault-tolerant computing stations requires several component techniques. Among the most significant are (a) a scalable RTFT scheme, (b) a network surveillance (NS) scheme, and (c) a timeliness-guaranteed kernel that supports both the RTFT and the NS schemes.
This dissertation attempts to make a significant step toward the goal of realizing ultra-reliable computer-based safety-critical systems. First, the following new technologies have been devised: (i) the primary-shadow time-triggered message-triggered object (TMO) replication (PSTR) scheme, which provides time-bounded recovery from faults in TMO-structured systems, and (ii) the supervisor-based network surveillance (SNS) scheme, which is effective in a variety of point-to-point networks and is amenable to analysis of fault detection latency bounds.
Second, although a few promising component technologies that address specific requirements of real-time fault-tolerant computing stations have been established, little effort has been made to integrate them. Only such integrated technologies can meet the diverse demands imposed by safety-critical applications. This dissertation attempts to establish guidelines for such integration. The following integrated schemes have been devised: (i) the PSTR scheme combined with the SNS scheme, (ii) the previously established distributed recovery block (DRB) scheme combined with the SNS scheme, and (iii) the previously established adaptable DRB scheme combined with the SNS scheme.
Third, convincing demonstrations of the validity and potential utility of the devised schemes would facilitate their use in real-world applications. To this end, a timeliness-guaranteed kernel developed earlier was extended to support all the devised schemes, and a TMO-structured defense application supported by the newly extended kernel was made fault-tolerant.
Finally, performance analysis of RTFT and NS schemes, though of great importance, has rarely been carried out. We have analyzed the performance of the devised schemes and obtained some time bounds. The modeling and analysis techniques presented should serve as useful guides for system engineers.