Making Reliability a Reality

David E. Mortin, Ph.D., Chief, Reliability Branch, U.S. Army Materiel Systems Analysis Activity (AMSAA)

Stephen P. Yuhas, Chief, Reliability Directorate, U.S. Army Evaluation Center (AEC)

There has been a significant push to strive for very high levels of weapon system reliability, sometimes referred to as “ultra reliability.” Recent articles by senior Army leadership have stressed the importance of increasing reliability well beyond legacy values [1,2]. Draft reliability requirements for the Future Combat System (FCS) are four to 12 times current values, and multiple organizations are suggesting that even higher levels are needed. These high levels of reliability will not be achieved with legacy reliability design practices. Given the recognition that very high levels of reliability are required for our future systems, we must make major changes to legacy design practices to make higher reliability a reality. The following paragraphs discuss some of the changes that need to occur if we are to make ultra reliability more than just a slogan.

Too much emphasis on reliability predictions

The reliability portions of our contracts often devote considerable attention to reliability predictions. A reliability prediction may have little or nothing to do with the actual reliability of the product and can actually encourage poor design practices [3]. As one example, when nine contractors came in with separate radio designs and predictions, subsequent testing showed that the reliability predictions ranged from 30 percent to 3,900 percent of the actual values. Contractors and subcontractors that frequently quote predictions may not understand the engineering and design considerations necessary to minimize risk and to produce a reliable design. In many cases, the person producing the prediction may not be a direct contributor to the design team. The historic focus on the accounting of predictions, rather than on the engineering activities needed to eliminate failures during the design process, has significantly limited our ability to produce highly reliable products. High reliability is not obtained through reliability predictions.

The real reliability models

When most people think of reliability models, they think of reliability block diagrams; failure mode, effects, and criticality analysis (FMECA); fault trees; and reliability growth. When directly used to influence the design team, or when used by the Army to manage reliability progress, these tools can be extremely useful to focus engineering and testing efforts. However, the most important reliability tools are the structural, thermal, fatigue, failure mechanism, and vibration models used by the design team to ensure they are producing a product that will have a sufficiently large failure-free operating period. A good contractor routinely conducts thermal and vibration analyses to address potential failure mechanisms and failure sites (i.e., a physics-of-failure approach to reliable design). These analyses can include the use of fatigue analysis tools, finite element modeling, dynamic simulation, heat transfer analyses, etc. Without such engineering analyses, the risk of failure is very high.

Changing the perception that very high reliability must be unaffordable

When reliability is designed into systems early, many potential failure mechanisms and sources of failure can be eliminated with little cost. However, as time goes on, the cost to fix failures that were not addressed earlier in design can become very large. Early analysis of the engineering design, combined with early low-level testing and substantial integration testing, can greatly improve the reliability of the product before designs are locked in, and well before any formal testing program. Programs that do not focus on designing-in reliability up front and seek to increase reliability primarily through a test-fix-test-fix-test process will likely not produce a cost-effective and reliable product.

Many still equate high reliability to gold plating (i.e., using more expensive materials or exotic designs). High reliability is the direct result of a strong engineering design effort combined with smart testing and management focus. As an example of how small investments can make a big difference, a reliability structural and thermal analysis for a circuit board can be completed for as little as $15K plus the cost of highly accelerated life testing (HALT) if confirmation is required. Based on just one of the projects that we have worked on, the savings from identifying problems with a single circuit card can be in excess of $27M.
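To put the two figures in the paragraph above side by side, here is a minimal sketch of the arithmetic. The $15K analysis cost and the $27M savings come from the text; the cost of HALT is not given, so the `halt_cost` parameter and the value passed to it below are hypothetical placeholders, and the function name `return_ratio` is illustrative only.

```python
def return_ratio(savings, analysis_cost, halt_cost=0.0):
    """Dollars saved per dollar spent on the analysis plus optional HALT."""
    return savings / (analysis_cost + halt_cost)


ANALYSIS_COST = 15_000   # from the text: "as little as $15K"
SAVINGS = 27_000_000     # from the text: "in excess of $27M"

if __name__ == "__main__":
    # Analysis alone, with no HALT confirmation.
    print(return_ratio(SAVINGS, ANALYSIS_COST))
    # Same case with a hypothetical HALT cost (assumed, not from the article).
    print(return_ratio(SAVINGS, ANALYSIS_COST, halt_cost=50_000))
```

Even if the assumed HALT cost were several times the analysis cost, the savings in this single example would still exceed the investment by a wide margin.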

By one estimate, operating and support costs represent 60 percent of total life-cycle costs. Reliability improvements directly influence the majority of the operating and support cost contributors. Over the life-cycle of a major weapon system, moderate improvements in reliability can result in savings in the hundreds of millions to billions of dollars. For a major system development such as the FCS, design-based engineering reliability improvement activities, to include early HALT and integration testing, will cost a few million dollars. However, the early elimination of just a few failure modes can bring about acquisition and operating and support cost savings that are many times this investment.

Testing

Even with today's failure mechanism models and engineering tools, there is still a need for smart and focused testing. Lower-level testing (e.g., HALT) is critical for precipitating failures early and identifying weaknesses in the design. Integration testing is critical for identifying unforeseen interface issues. Some programs are conducting these lower-level tests; however, many do not, or perform them for only a small subset of the components.

Developmental Testing (DT) serves as one of the last opportunities to fix remaining problems and increase the probability of system success. Some programs choose to have very limited or no formal DT. When a system meets the reliability requirement in DT, there is a 68 percent chance that it will meet the reliability requirement in Operational Testing (OT). If the system fails to meet the requirement in DT, there is only an 18 percent chance that it will meet the OT reliability requirement. Significant program setbacks often happen when testing is reduced or eliminated to meet schedule or cost constraints. In some cases, the systems fail and have to repeat OT. In other cases, we pay the price in operating and support costs for years to come. For OT, it is not uncommon for programs to schedule test durations so short that, to have a good chance of demonstrating the reliability requirement, the contractor has to design to a reliability level several times higher than the requirement, almost ensuring failure.
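The last point above, that a short OT forces the design to far exceed the requirement, can be illustrated with a minimal sketch. It is not taken from the article and rests on stated assumptions: failures follow a constant failure rate (exponential times between failures), the test is a fixed-duration (time-terminated) demonstration, and the system "passes" when the observed failure count is low enough to demonstrate the required mean time between failures (MTBF) at a chosen confidence level. All function names and the example inputs (a test three times as long as the required MTBF, 80 percent confidence, 80 percent desired chance of passing) are hypothetical.

```python
import math


def poisson_cdf(k, mean):
    """P(N <= k) for a Poisson-distributed failure count N with the given mean."""
    term = math.exp(-mean)
    total = term
    for i in range(1, k + 1):
        term *= mean / i
        total += term
    return total


def max_allowed_failures(test_hours, required_mtbf, confidence):
    """Largest failure count that still demonstrates required_mtbf at the
    given confidence; None if even zero failures would not be enough."""
    mean_at_requirement = test_hours / required_mtbf
    allowed = None
    r = 0
    # A count r demonstrates the requirement if seeing r or fewer failures
    # would be unlikely (probability <= 1 - confidence) were the true MTBF
    # only equal to the requirement.
    while poisson_cdf(r, mean_at_requirement) <= 1.0 - confidence:
        allowed = r
        r += 1
    return allowed


def pass_probability(test_hours, required_mtbf, true_mtbf, confidence):
    """Chance of passing the demonstration given the system's true MTBF."""
    allowed = max_allowed_failures(test_hours, required_mtbf, confidence)
    if allowed is None:
        return 0.0
    return poisson_cdf(allowed, test_hours / true_mtbf)


def required_design_ratio(test_hours, required_mtbf, confidence, target_pass=0.8):
    """Smallest true-MTBF / required-MTBF ratio that gives at least
    target_pass chance of passing, found by bisection."""
    if max_allowed_failures(test_hours, required_mtbf, confidence) is None:
        return math.inf

    def passes(ratio):
        p = pass_probability(test_hours, required_mtbf,
                             ratio * required_mtbf, confidence)
        return p >= target_pass

    lo, hi = 0.01, 1.0
    while not passes(hi):          # pass probability rises with true MTBF
        lo, hi = hi, hi * 2.0
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if passes(mid):
            hi = mid
        else:
            lo = mid
    return hi


if __name__ == "__main__":
    # Hypothetical inputs: required MTBF of 100 hours, 300-hour test.
    print(max_allowed_failures(300.0, 100.0, 0.80))
    print(required_design_ratio(300.0, 100.0, 0.80))
```

Under these illustrative inputs, the test tolerates at most one failure, and the design would need a true MTBF of roughly 3.6 times the requirement for an 80 percent chance of passing. Shortening the test further pushes that ratio higher, which is the effect the paragraph describes.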

Early low-level testing, along with focused higher-level testing, is key to producing products with high reliability. Without comprehensive lower-level testing (e.g., HALT) on most or all critical subassemblies, and without significant integration and developmental testing, there is little likelihood that high levels of reliability will be achieved.

We need to change the way that we approach commercial-off-the-shelf (COTS) equipment

COTS equipment represents a great opportunity to improve reliability, reduce costs, and leverage the latest technologies. However, COTS does not imply that we abandon engineering analyses and early testing. On many occasions, we have heard the expression, “that piece of equipment is COTS so its reliability is what it is.” Thermal, vibration, fatigue, and failure mechanism modeling, combined with early accelerated testing, can quantify and qualify the risk of COTS equipment failing in the military operating environment. We still have cases where a major COTS failure mode is discovered relatively late in the program. Often COTS equipment data is proprietary; however, there are often workarounds that can be used to develop data that can support sufficiently detailed engineering analyses. Relatively simple vibration and thermal analyses can detect some potential show stoppers. The show stoppers that have emerged because of inadequate early analysis have cost the Army millions of dollars and have significantly slowed our ability to put some critical systems into the field.

Incentives

For many, if not most, of our procurements, the contractor does not have a strong incentive to make the product reliable. Even when reliability is mentioned in the Statement of Work (SOW), the weight of reliability in the selection criteria is usually small. Contractors have to bid low in order to be competitive. When they have to trim their programs, reliability is often one of the first areas to go. To further complicate things, contractors typically make significant amounts of profit from follow-on replenishment spares. Unless the contractor sees value in directing and resourcing the design team to achieve high reliability, we will continue to field equipment with reliability values that fall far short of what the commercial consumer typically experiences. Most contractors have the engineering staff and technical know-how to produce highly reliable systems. If we were to make reliability one of our high priorities in the SOW and specification, and provided incentives, most of our major defense contractors could, and likely would, develop highly reliable systems. If we do not provide incentives for contractors and prioritize reliability, reliability activities will continue to consist of predictions and documents that do little to improve the design and do little to produce products that will meet our future needs.

What to look for in a good contractor

AMSAA and ATEC have developed a list of reliability best practices. These practices can be used to help discriminate between contractors when it comes to reliability. In addition to the engineering modeling and testing activities discussed above, a good contractor will:

Conclusion

There is little doubt that our legacy reliability practices have produced low reliability values. If we do not change our approach to reliability, there is little to no chance of achieving the reliability requirements and the footprint reductions envisioned for the FCS and other Army systems. For the most part, contractors have the capability to design equipment that achieves much higher levels of reliability than we see today without huge increases in cost. However, today, they do not have the incentives to do so. We also have to become much more involved in the contractor's engineering efforts. This does not mean verifying that the contractors have made reliability predictions that exceed the requirement; it means engaging the contractor to see what their finite element, thermal, and vibration modeling is showing them, seeing that they understand what failure mechanisms are putting them most at risk, and examining their low-level testing programs. We do not need to violate acquisition reform to do this; we just need to be smart buyers. There is no easy way to achieve high reliability. The commercial companies that produce highly reliable products have put significant engineering effort behind their development programs to ensure that their products last and do not break frequently. We need to do the same.


References

  1. Mahan, Charles S., “Linking Acquisition and Operational Logistics,” Army AL&T, September-October 2002.
  2. Orsini, Eric A., “The Importance of Improved Supportability: A Historical Perspective,” Army AL&T, September-October 2002.
  3. Cushing, Michael J., “Comparison of Electronics-Reliability Assessment Approaches,” IEEE Transactions on Reliability, Vol. 42, December 1993.