Redundant Systems With A Common Threat
Suppose a system is equipped with two redundant components, and the failure of both components (at the same time) represents total failure of the system. Also, each component contains provisions to protect it from an external threat to which both components are subjected, at some unknown rate, in unison. Furthermore, the failure of the protection is not inherently detectable, except when specifically checked at some periodic inspection interval. Between inspections, a failure of the protection provisions becomes apparent only if/when the component is subsequently subjected to the external threat and fails. Given the failure rates of the protection provisions of each component, and the length of the inspection interval, can we define a realistic upper bound on the probability of simultaneous failure of both components?
Letting $\lambda$ denote the failure rate of the protection provisions of each component, and $T$ the periodic inspection interval, a very conservative approach would be to assume that failure of the protection of a single component is never detected (and repaired) other than at periodic inspections. Admittedly we will also detect such failures if we encounter the external threat, but since we do not know the rate at which the system encounters this threat, it may seem that we're forced to assume the worst case, i.e., that we never encounter the threat, and therefore the threat does not provide a means of detecting failures of the protection provisions. On this basis, the exposure time for each of the protection failures is the full inspection interval $T$, so they each have probability $\lambda T$ of being failed at the end of the interval, and the probability of both being failed is $(\lambda T)^2$. (Of course, the actual probabilities for exponentially distributed failures are $1-e^{-\lambda T}$, but since $\lambda T$ is small for the cases of interest, we can use the conventional approximation $\lambda T$.)
Now, given that the protection features of both components are failed, the overall system will not fail until/unless it encounters the external threat, but, again, since we do not know the rate of encountering this threat, we need to make a conservative assumption, which, in this case, is that we will certainly encounter the threat. Moreover, we assume that the threat will definitely be encountered at the end of the inspection period, when the probability of both protections being failed is a maximum. Combining all these worst-case assumptions, we conclude that the probability of a total system failure during the last incremental unit of operation in the $T$-hour inspection period is simply $(\lambda T)^2$.
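As a quick numerical illustration of this worst-case bound, the following sketch evaluates the single-protection probability both exactly and with the linear approximation, and then squares the latter. The values of `lam` and `T` are those of the typical case quoted later in the text; the variable names are illustrative only.

```python
import math

lam = 5.0e-7   # protection failure rate per hour (typical case used later in the text)
T = 5000.0     # inspection interval in hours (same typical case)

# Exact probability that one protection has failed by time T: 1 - exp(-lam*T).
# expm1 keeps full precision when lam*T is small.
p_exact = -math.expm1(-lam * T)

# Conventional small-argument approximation.
p_approx = lam * T

# Worst-case bound: both protections failed and the threat certainly encountered.
naive_bound = p_approx ** 2

print(f"exact single-protection probability: {p_exact:.6e}")
print(f"approximation lam*T:                 {p_approx:.6e}")
print(f"worst-case bound (lam*T)^2:          {naive_bound:.3e}")
```

The approximation slightly overstates the exact probability, so using it keeps the bound on the conservative side.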
This is certainly a valid upper bound, but it seems unnecessarily conservative, because it assumes the external threat never strikes the system except at the very end of the inspection interval, at which point we assume it strikes the system with certainty. This combination of assumptions is extremely unrealistic, because the occurrences of the external threat are presumed to be exponentially distributed with just a single rate, independent of the periodic inspections. Hence it would be valid to assume certainty for an encounter at the end of the interval only if we assume the rate of encounters is very high (infinite, recalling that the probability of an exponentially distributed event is really of the form $1-e^{-\lambda T}$), but this grossly conflicts with the assumption that we never encounter the threat at any other time during the interval.
A more realistic representation of the system is given by the Markov model shown below.
In this model the rate $\mu$ represents the (unknown) rate of encountering the external threat. The premise is that if/when the external threat is encountered while only one of the components has failed protection (i.e., State 1), the component will fail and thereby be detected, leading to its repair, returning the system to the Full-Up condition (State 0). This same rate also represents the transition rate from State 2 (the condition when both components have failed protection) to State 3 (total system failure). Our strategy will be to determine the explicit time-dependent solution of this model as a function of $\lambda$ and $\mu$, and then, beginning with the system in State 0, evaluate the rate of entry into State 3 at the time $T$, i.e., the end of the periodic inspection interval, which is when the rate reaches a maximum. This equals $\mu P_2(T)/(1-P_3(T))$, although the denominator is so close to 1 that we can accurately represent the total failure rate at time $T$ simply by $\mu P_2(T)$.
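Since either of the two protections may fail first, the transition rate from State 0 to State 1 is $2\lambda$, and the rate from State 1 to State 2 is $\lambda$. The differential equations corresponding to this model are therefore
$$
\begin{aligned}
\frac{dP_0}{dt} &= -2\lambda P_0 + \mu P_1 \\
\frac{dP_1}{dt} &= 2\lambda P_0 - (\lambda+\mu) P_1 \\
\frac{dP_2}{dt} &= \lambda P_1 - \mu P_2 \\
\frac{dP_3}{dt} &= \mu P_2
\end{aligned}
$$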
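Before solving the model in closed form, it can be integrated numerically as a cross-check. The sketch below uses a classical fourth-order Runge-Kutta step in plain Python. It assumes the transition rates implied by the state descriptions (twice the protection failure rate from State 0 to State 1, the single rate from State 1 to State 2, and the encounter rate from State 1 back to State 0 and from State 2 to State 3); the function names and the step count are illustrative choices, not part of the original analysis.

```python
def derivatives(p, lam, mu):
    """Right-hand side of the four-state model (rates 2*lam and lam are assumed)."""
    p0, p1, p2, p3 = p
    return (
        -2.0 * lam * p0 + mu * p1,          # State 0: full-up
        2.0 * lam * p0 - (lam + mu) * p1,   # State 1: one protection failed
        lam * p1 - mu * p2,                 # State 2: both protections failed
        mu * p2,                            # State 3: total system failure
    )


def integrate(lam, mu, T, steps=5000):
    """Runge-Kutta integration from t = 0 (all probability in State 0) to t = T."""
    h = T / steps
    p = (1.0, 0.0, 0.0, 0.0)
    for _ in range(steps):
        k1 = derivatives(p, lam, mu)
        k2 = derivatives(tuple(x + 0.5 * h * k for x, k in zip(p, k1)), lam, mu)
        k3 = derivatives(tuple(x + 0.5 * h * k for x, k in zip(p, k2)), lam, mu)
        k4 = derivatives(tuple(x + h * k for x, k in zip(p, k3)), lam, mu)
        p = tuple(
            x + (h / 6.0) * (a + 2.0 * b + 2.0 * c + d)
            for x, a, b, c, d in zip(p, k1, k2, k3, k4)
        )
    return p


# Example values: lam and T from the typical case in the text, mu chosen for illustration.
lam, mu, T = 5.0e-7, 3.2e-4, 5000.0
p0, p1, p2, p3 = integrate(lam, mu, T)
print("state probabilities at T:", p0, p1, p2, p3)
print("rate of entry into State 3 at T:", mu * p2 / (1.0 - p3))
```

Because the four derivatives sum to zero, the state probabilities should continue to sum to 1, which gives a simple sanity check on the integration.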
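The characteristic polynomial of the coefficient matrix (for $P_0$, $P_1$, and $P_2$) is
$$
(s+\mu)\left[s^2 + (3\lambda+\mu)s + 2\lambda^2\right]
$$
so the eigenvalues of the system are
$$
s_1 = \frac{-(3\lambda+\mu) + r}{2}, \qquad s_2 = \frac{-(3\lambda+\mu) - r}{2}, \qquad s_3 = -\mu
$$
where
$$
r = \sqrt{\lambda^2 + 6\lambda\mu + \mu^2}
$$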
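The eigenvalues can be checked numerically. The sketch below assumes the characteristic polynomial factors into a linear term with root at minus the encounter rate and a quadratic with coefficients (3·lam + mu) and 2·lam², which is what the model with assumed rates 2·lam and lam produces. The smaller root is obtained from the product of the roots rather than from the quadratic formula, a standard numerical trick added here to avoid cancellation; the function names are hypothetical.

```python
import math


def eigenvalues(lam, mu):
    """Return (s1, s2, s3) for the assumed three-state coefficient matrix."""
    r = math.sqrt(lam * lam + 6.0 * lam * mu + mu * mu)
    s2 = (-(3.0 * lam + mu) - r) / 2.0   # larger-magnitude root, no cancellation
    s1 = 2.0 * lam * lam / s2            # product of the two roots is 2*lam^2
    s3 = -mu                             # root of the linear factor
    return s1, s2, s3


def char_poly(s, lam, mu):
    """Characteristic polynomial in factored form."""
    return (s + mu) * (s * s + (3.0 * lam + mu) * s + 2.0 * lam * lam)


lam, mu = 5.0e-7, 3.2e-4   # illustrative values
for s in eigenvalues(lam, mu):
    print(f"s = {s:.6e}   polynomial value = {char_poly(s, lam, mu):.3e}")
```

All three eigenvalues are negative, so every exponential term in the solution decays with time.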
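Therefore, each $P_j(t)$ for $j$ = 0, 1, or 2 is of the form
$$
P_j(t) = A_j e^{s_1 t} + B_j e^{s_2 t} + C_j e^{s_3 t}
$$
where $s_1$, $s_2$, $s_3$ are the three eigenvalues and $A_j$, $B_j$, and $C_j$ are constants determined by the initial conditions. At the time $t = 0$ we have
$$
P_0(0) = 1, \qquad P_1(0) = 0, \qquad P_2(0) = 0
$$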
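Inserting these into the system equations gives the first derivatives
$$
P_0'(0) = -2\lambda, \qquad P_1'(0) = 2\lambda, \qquad P_2'(0) = 0
$$
Differentiating the system equations and inserting the first derivatives gives the second derivatives
$$
P_0''(0) = 4\lambda^2 + 2\lambda\mu, \qquad P_1''(0) = -6\lambda^2 - 2\lambda\mu, \qquad P_2''(0) = 2\lambda^2
$$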
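Differentiating the expression for $P_j(t)$ twice and inserting the initial conditions, we have three equations in the unknown coefficients $A_j$, $B_j$, and $C_j$
$$
\begin{aligned}
A_j + B_j + C_j &= P_j(0) \\
s_1 A_j + s_2 B_j + s_3 C_j &= P_j'(0) \\
s_1^2 A_j + s_2^2 B_j + s_3^2 C_j &= P_j''(0)
\end{aligned}
$$
Solving this system for the coefficients, we get
$$
\begin{aligned}
A_j &= \frac{P_j''(0) - (s_2+s_3)P_j'(0) + s_2 s_3 P_j(0)}{(s_1-s_2)(s_1-s_3)} \\
B_j &= \frac{P_j''(0) - (s_1+s_3)P_j'(0) + s_1 s_3 P_j(0)}{(s_2-s_1)(s_2-s_3)} \\
C_j &= \frac{P_j''(0) - (s_1+s_2)P_j'(0) + s_1 s_2 P_j(0)}{(s_3-s_1)(s_3-s_2)}
\end{aligned}
$$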
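The three-equation system for the coefficients has Vandermonde form, so it can be solved in closed form whenever the three exponents are distinct. The following sketch is one way to do that, with a check that the result reproduces the prescribed value and first two derivatives at t = 0. The function name and the test exponents (−1, −2, −3) are made up for illustration and are not the eigenvalues of the model.

```python
def solve_coefficients(s, value, first, second):
    """Coefficients (a, b, c) of a*exp(s1*t) + b*exp(s2*t) + c*exp(s3*t)
    matching a given value, first derivative and second derivative at t = 0.
    Requires the three exponents to be distinct."""
    s1, s2, s3 = s
    a = (second - (s2 + s3) * first + s2 * s3 * value) / ((s1 - s2) * (s1 - s3))
    b = (second - (s1 + s3) * first + s1 * s3 * value) / ((s2 - s1) * (s2 - s3))
    c = (second - (s1 + s2) * first + s1 * s2 * value) / ((s3 - s1) * (s3 - s2))
    return a, b, c


# Illustrative exponents and initial data (value 0, slope 0, curvature 2).
s = (-1.0, -2.0, -3.0)
a, b, c = solve_coefficients(s, 0.0, 0.0, 2.0)
print("coefficients:", a, b, c)

# Verify the three defining equations.
print("value at 0:      ", a + b + c)
print("first derivative:", s[0] * a + s[1] * b + s[2] * c)
print("second derivative:", s[0] ** 2 * a + s[1] ** 2 * b + s[2] ** 2 * c)
```

For this made-up case the coefficients come out as 1, −2 and 1, which can be confirmed by hand from the three equations.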
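This gives the explicit expression for $P_2(t)$, which we can multiply by $\mu$ to give the rate of total system failure at any time $t$
$$
\mu P_2(t) = 2\lambda^2\mu\left[\frac{e^{s_1 t}}{(s_1-s_2)(s_1+\mu)} - \frac{e^{s_2 t}}{(s_1-s_2)(s_2+\mu)} + \frac{e^{-\mu t}}{(s_1+\mu)(s_2+\mu)}\right]
$$
where $s_1$ and $s_2$ are the two roots of the quadratic factor $s^2 + (3\lambda+\mu)s + 2\lambda^2$ of the characteristic polynomial.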
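A direct evaluation of this rate might look like the sketch below. It assumes the closed form that follows from the model with rates 2·lam and lam, with zero value and slope and curvature 2·lam² for the State 2 probability at t = 0. The function name is hypothetical, and the smaller root is again computed from the product of the roots for numerical stability.

```python
import math


def failure_rate(lam, mu, t):
    """Rate of entry into State 3, mu * P2(t), from the assumed closed form.

    Not valid at mu == 2*lam/3, where -mu coincides with a root of the
    quadratic factor and the three-exponential form degenerates.
    """
    r = math.sqrt(lam * lam + 6.0 * lam * mu + mu * mu)
    s2 = (-(3.0 * lam + mu) - r) / 2.0   # larger-magnitude root of the quadratic
    s1 = 2.0 * lam * lam / s2            # smaller root via the product of roots
    p2 = 2.0 * lam * lam * (
        math.exp(s1 * t) / ((s1 - s2) * (s1 + mu))
        - math.exp(s2 * t) / ((s1 - s2) * (s2 + mu))
        + math.exp(-mu * t) / ((s1 + mu) * (s2 + mu))
    )
    return mu * p2


# Example: the typical case from the text with an illustrative encounter rate.
lam, T = 5.0e-7, 5000.0
for mu in (1.0e-5, 1.0e-4, 3.2e-4, 1.0e-3, 1.0e-2):
    print(f"mu = {mu:.1e}/hr   rate at T = {failure_rate(lam, mu, T):.3e}/hr")
```

The second and third terms are individually much larger than their sum, so some precision is lost to cancellation, though double precision leaves ample margin for values of this size.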
Evaluating this at the time $t = T$ gives the rate of total system failure at the end of a periodic inspection interval. We know the values of $\lambda$ and $T$, so the only unknown is the rate $\mu$ of encountering the external threat. Thus we can plot $\mu P_2(T)$ versus $\mu$ as shown below for a typical case with $\lambda = 5.0 \times 10^{-7}$/hour and $T$ = 5000 hours.
Naturally if $\mu = 0$ the rate of total system failure would also be zero, because the system would never make the transition from State 2 to State 3. On the other hand, if $\mu$ is infinite, the rate of total system failure is again zero, because the system would never get past State 1. Therefore, we expect that there is some intermediate value of $\mu$ that maximizes the rate of total system failure, and this is confirmed by the plot above. The maximum for this example occurs at $\mu = 3.2 \times 10^{-4}$/hr, corresponding to a total system failure rate of $7.4 \times 10^{-10}$/hr. In contrast, the more simplistic (and unrealistic) approach described previously predicts a rate of $(\lambda T)^2 = 6.25 \times 10^{-6}$/hr for this same case.
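The search for the worst-case encounter rate can be sketched as a simple scan over a logarithmic grid. The code below assumes the same closed form for the failure rate as the model with rates 2·lam and lam; the grid limits (10⁻⁶ to 10⁻¹ per hour), the grid size and all names are arbitrary illustrative choices.

```python
import math


def failure_rate(lam, mu, t):
    """mu * P2(t) from the assumed closed form (not valid at mu == 2*lam/3)."""
    r = math.sqrt(lam * lam + 6.0 * lam * mu + mu * mu)
    s2 = (-(3.0 * lam + mu) - r) / 2.0
    s1 = 2.0 * lam * lam / s2
    p2 = 2.0 * lam * lam * (
        math.exp(s1 * t) / ((s1 - s2) * (s1 + mu))
        - math.exp(s2 * t) / ((s1 - s2) * (s2 + mu))
        + math.exp(-mu * t) / ((s1 + mu) * (s2 + mu))
    )
    return mu * p2


lam, T = 5.0e-7, 5000.0   # typical case from the text

# Logarithmic grid of candidate encounter rates, 1e-6 to 1e-1 per hour.
grid = [10.0 ** (-6.0 + 5.0 * k / 2000.0) for k in range(2001)]

# Keep the (mu, rate) pair with the largest rate at the end of the interval.
best_mu, best_rate = max(
    ((mu, failure_rate(lam, mu, T)) for mu in grid),
    key=lambda pair: pair[1],
)

print(f"worst-case encounter rate on the grid: {best_mu:.2e}/hr")
print(f"maximum total failure rate:            {best_rate:.2e}/hr")
print(f"simplistic bound (lam*T)^2:            {(lam * T) ** 2:.2e}")
```

The curve is quite flat near its peak, so the height of the maximum is much more sharply determined than its location.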
To approximate the results of the Markov model in the form of a fault tree, the top event is total system failure during the last incremental unit of operational time, denoted by $\Delta t$. This event is generated by {protection failed on both components} AND {encounter with the external threat}. The probability of encountering the external threat during this increment of time is $\mu\,\Delta t$. The probability of both components being without protection is the square of the probability of the loss of protection for a single component. The latter is simply the failure rate $\lambda$ for the protection multiplied by the appropriate exposure time. Now, the mean time between occurrences is the reciprocal of the rate, and rates are additive, so the appropriate exposure time is the reciprocal of the sum of the inspection rate $1/T$ and the encounter rate $\mu$. However, for a fixed interval the mean exposure time is actually half the interval, so we divide $\mu$ by 2 to give the overall exposure time $1/[(\mu/2) + (1/T)]$. Dividing the top event probability by $\Delta t$ to normalize the probability to a per-hour basis for the last incremental unit of operational time, we get
$$
\mu\left[\frac{\lambda}{(\mu/2) + (1/T)}\right]^2
$$
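The fault tree estimate described in this paragraph is easy to evaluate directly. The sketch below follows the stated recipe: exposure time 1/[(mu/2) + (1/T)], single-protection probability equal to the failure rate times that exposure time, squared and multiplied by the encounter rate. The function name and the sample encounter rates are illustrative.

```python
def fault_tree_rate(lam, mu, T):
    """Per-hour top-event rate from the fault tree approximation."""
    exposure = 1.0 / (mu / 2.0 + 1.0 / T)   # effective exposure time in hours
    p_single = lam * exposure               # one protection failed
    return mu * p_single ** 2               # both failed AND threat encountered


lam, T = 5.0e-7, 5000.0   # typical case from the text
for mu in (1.0e-5, 1.0e-4, 4.0e-4, 1.0e-3, 1.0e-2):
    print(f"mu = {mu:.1e}/hr   fault tree rate = {fault_tree_rate(lam, mu, T):.3e}/hr")
```

When the encounter rate is zero the exposure time reduces to the full interval T, but the top-event rate vanishes because the threat is never encountered.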
A plot of this function is shown in the figure below, superimposed on the exact Markov model result.
Thus both methods give comparable results. One advantage of the simple fault tree representation is that it can easily be differentiated to find the maximum point explicitly. Not surprisingly, we find that the worst-case value of $\mu$ (i.e., the one that gives the maximum probability of total system failure) is $2/T$. Substituting this back into the equation, we find that the maximum rate of total system failure (per hour for an incremental unit of operation at the end of the inspection interval) is simply $\lambda^2 T/2$.
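The differentiation mentioned above can be written out as follows, assuming the fault tree expression implied by the exposure time $1/[(\mu/2) + (1/T)]$. The name $R(\mu)$ is introduced here only for convenience.

```latex
\begin{aligned}
R(\mu) &= \frac{\mu\lambda^2}{\left(\frac{\mu}{2} + \frac{1}{T}\right)^2} \\[4pt]
\frac{dR}{d\mu} &= \lambda^2\,
  \frac{\left(\frac{\mu}{2} + \frac{1}{T}\right) - \mu}
       {\left(\frac{\mu}{2} + \frac{1}{T}\right)^3} = 0
  \quad\Longrightarrow\quad \frac{1}{T} = \frac{\mu}{2}
  \quad\Longrightarrow\quad \mu = \frac{2}{T} \\[4pt]
R\!\left(\frac{2}{T}\right) &= \frac{(2/T)\,\lambda^2}{(2/T)^2} = \frac{\lambda^2 T}{2}
\end{aligned}
```

For the typical case with a failure rate of $5.0 \times 10^{-7}$/hour and an interval of 5000 hours, this gives a worst-case encounter rate of $4.0 \times 10^{-4}$/hr and a maximum of $6.25 \times 10^{-10}$/hr.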