Failure Mode and Effects Analysis - An explanation
Much of the work of failure analysis happens before any actual testing is done. It might seem that putting a chip under a scanning electron microscope or thermal microscope to find points of failure is the most direct way to detect where a problem is occurring, but this approach is reactive and the most costly one. A failure that reaches customers also damages a company's reputation, which can be very costly.
Failure analysis is a procedure that should start from the ground up, at the design stage itself. The initial investment in designing around the failures of past systems can be repaid many times over through reduced failure costs later on.
In this article, we look at the Failure Mode and Effects Analysis (FMEA) procedure, a technique for preventing failures from occurring in a chip in the first place.
Steps in FMEA
The first step in a formal FMEA is simple - find out what can possibly go wrong. Here, the choice of the team used for the analysis is critical. Since it's a top-down style of analysis, we need a set of individuals who have worked with failing integrated circuits before. Only then will they be able to form a comprehensive and detailed list of problems. These problems must be stated from the user's point of view. For example, "such a failure will cause complex math computations with floating point numbers to give inaccurate results."
We then rate each of these failures in terms of severity. For example, a failure that might lead to a lawsuit gets a very high severity ranking.
The second step is to analyze the causes of each failure identified in the first step and determine how likely each is to occur. Higher rankings are given if the occurrence probability is high. Actions must also be formulated based on these rankings. Naturally, a particularly severe failure gets attention even if its likelihood is low.
Finally, we determine how likely it is that a particular failure will go undetected. This is done by examining all existing failure detection controls and the tests that are likely to be applied. Once more, rankings are given, with a higher ranking for failures that are harder to detect.
By multiplying these three rankings, we get risk priority numbers (RPNs), which are used to determine which potential failures get the most attention from the design team. All this may sound basic, but there's a lot involved and it is harder than it looks. If it's done properly, the results will show up later on as fewer failures and decreased costs to the firm.
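The RPN calculation can be illustrated with a short Python sketch. Several things here are assumptions made for illustration rather than part of the procedure as described above: the 1 to 10 ranking scale (a common convention, but teams choose their own), the FailureMode and prioritize names, the severity_floor threshold, and all of the example rankings. Only the floating-point failure comes from the text; the other two failure modes are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class FailureMode:
    """One potential failure, ranked on three assumed 1-10 scales."""
    description: str
    severity: int    # 1 = negligible, 10 = e.g. could lead to a lawsuit
    occurrence: int  # 1 = very unlikely, 10 = almost certain to occur
    detection: int   # 1 = almost always caught, 10 = likely to go undetected

    @property
    def rpn(self) -> int:
        # Risk priority number: the product of the three rankings.
        return self.severity * self.occurrence * self.detection


def prioritize(modes, severity_floor=9):
    """Order failure modes for the design team's attention.

    Modes at or above severity_floor (an assumed threshold) come first,
    so a severe failure is not buried by a low occurrence ranking.
    Within each group, modes are ordered by RPN, highest first.
    """
    return sorted(
        modes,
        key=lambda m: (m.severity >= severity_floor, m.rpn),
        reverse=True,
    )


# Illustrative rankings only; a real team would assign these.
modes = [
    FailureMode("Floating-point results are inaccurate", 9, 2, 3),
    FailureMode("Chip overheats under sustained load", 6, 5, 4),
    FailureMode("Slow wake from sleep state", 3, 4, 2),
]

for m in prioritize(modes):
    print(f"RPN {m.rpn:4d}  severity {m.severity:2d}  {m.description}")
```

With these made-up rankings, the floating-point failure has the second-highest RPN (54 against 120 for overheating) but is listed first because of its severity. Sorting on RPN alone would be the simpler choice; the severity override is one possible way to reflect the point that a severe failure gets attention even when it is unlikely.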
Systematic identification of failures using FMEA provides a solid foundation for preventing failures before they cause serious damage.