Reliability — Part I: The Problem

I’m sitting at my desk composing this column, watching my Internet and email access lights intermittently light up and extinguish. Our server has been in a foul mood all morning, and is taking it out on our office. Time for daydreaming. How did we survive before the Internet? Oh well: I really didn’t want to communicate with anybody today anyway. Now I don’t have a choice in the matter. At times like these, my misanthropic side holds sway. (There really should be a circle of Hell reserved for IT people who, I’m convinced, often design systems to fail periodically to ensure their own job security.) On the other hand, happily, with no digital distractions, I have more free time, and since we all know work expands to fill the available time, I’m inspired by my sporadically functioning surroundings to compose reflections like the following:

Definition:

Re-li-a-bil-i-ty (Noun)

  1. The quality or state of being reliable.
  2. The extent to which an experiment, test, or measuring procedure yields the same results on repeated trials.

First Use: 1816

Synonyms: Accuracy, authenticity, constancy, dependability, faithfulness, fidelity, honesty, loyalty, safety, security, solidity, solidness, soundness, steadfastness, sureness, trustability, trustworthiness.

Antonyms: Dodginess (chiefly British), unreliability.

Reliability Engineering: The ability of a system or component to perform its required functions under stated conditions for a specified period of time*.

*Reliability Engineering, Cynic’s Addendum: The period of time which, coincidentally, typically lasts one day longer than the warranty period for the system or component.**

**Reliability Engineering, CFO’s Corollary: At which time we begin charging full freight for our services all over again. Praised be capitalism and the planned obsolescence driving it!

A clever commercial, run last fall for Xfinity, illustrates the issue, and the underlying frustration we feel when we are mugged by experience. It shows a father at home with his family. He sits on his living room sofa, next to his (presumed) teenage daughter, working on his laptop. Daughter, in true teenage fashion, has her face buried in a tablet, ignoring him. He looks up just long enough to declare, in no-nonsense fashion, that he likes products that work as advertised. Every time. No exceptions. The camera then traverses the house, focusing on features that work: beds, hamster wheel, toaster, computers, etc. Constant theme: “It just works.” The implication, naturally, is that the product being advertised “just works.” Xfinity users can judge the validity of that message.

The problem, once again, is that reliability exists in the eye of the beholder. For most of us, that means all-weather performance, with little or no thought given to the cost of achieving it. For example, we want our car to start as reliably in the dead of winter as it does in the doldrums of summer. For years. Typically despite minimal, often-deferred maintenance. We want our laptops and mobile devices to boot up faithfully every time, with no delays, and no applications hanging or locking up. We don’t want to be inconvenienced, or worse: We want the turbines on that airliner we’re riding to keep spinning (at least for the duration of our travel on it) until we’re safely deposited on dry land. The FAA even has a term for that safety category of reassuringly spooling turbine blades: ETOPS, which means extended operations. Extended long enough to get one from Point A, on one coast, to Point B, on another coast, and over the intervening body of water. Wags have a more prosaic translation: Engines Turn Or Passengers Swim. Swimming is a suboptimal outcome.

So what is acceptable ETOPS for printed circuit boards? One never wants one’s product or process to stand accused and convicted of dodginess. The horror.

Of course, irritatingly, acceptable reliability depends on lots of factors. Factors of application. Factors of specification. Factors of design, cost, duration, performance, and efficiency.

My company is in the scrutiny business. We perform nondestructive and destructive PCBA failure analysis. We get hired regularly to figure out why things break, or we get to break them (the super fun part of the job), and then hypothesize why they broke when they did, and under which forces and parameters the breakage occurred. Bad boards are good for business and, I can assure you, there are a lot of bad boards out there, and business has never been better. Failed and substandard PCBAs are mother’s milk to us, as illness is to a physician. Newer, faster production equipment churns out defects more efficiently. MTBF still has an F in it, and that number ain’t zero. Most assuredly the future looks bright for us.
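To put a rough number on that F, here is a minimal sketch of the textbook constant-failure-rate model, in which the probability that a unit is still working after t hours is exp(-t / MTBF). It is an illustration only: the function name, the 50,000-hour MTBF, and the constant-failure-rate assumption itself are all hypothetical, and real PCBA failure distributions often do not follow it.

```python
import math


def survival_probability(hours: float, mtbf_hours: float) -> float:
    """Chance a unit is still working after `hours`.

    Assumes a constant failure rate (exponential model), which is a
    simplifying assumption, not a property of any particular board.
    """
    if mtbf_hours <= 0:
        raise ValueError("mtbf_hours must be positive")
    if hours < 0:
        raise ValueError("hours must be non-negative")
    return math.exp(-hours / mtbf_hours)


if __name__ == "__main__":
    mtbf = 50_000.0  # hypothetical MTBF figure, chosen only for illustration
    checkpoints = [
        ("1 year of continuous use", 8_760.0),
        ("5 years of continuous use", 43_800.0),
        ("the MTBF itself", mtbf),
    ]
    for label, hours in checkpoints:
        share = survival_probability(hours, mtbf)
        print(f"After {label}: {share:.1%} of units expected to survive")
```

Under this model only about 37% of units are still running when the clock reaches the MTBF figure, which is worth remembering the next time that number is quoted as if it were a lifetime.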

Some time ago a customer hired us to X-ray a series of board failures in servo motors. These motors were being installed in a high-performance aerospace application, essentially attitude control (pitch/yaw/roll) of flight vehicles. We found many defects. Some were assembly-related (insufficient solder, inadequate thru-hole barrel fill, partial solder bridging—violating airgaps but not enough to cause a short); others were embedded in the bare board (voids in thru-hole barrels, cracked signal lines, resin recession).

Over the course of several tense days we X-rayed many boards. Many of them were marginal but electrically functional. As I conversed with several process engineers over those days, it became alarmingly clear that no one had determined whether these boards, originally built to Class 2 standards (Commercial/Industrial), were fit for purpose in a ruggedized Class 3-plus (Aerospace/High Reliability/Life Support) environment. Now (surprise!) they were failing in that environment. I’m sure the unit price of these commercial-grade servo motors made them very attractive to the buyer.

Except…

The first simple truth to acknowledge is that reliability is not high on the list of priorities for most EMS companies. It doesn’t have to be, so long as the stated acceptance requirements, often defaulting to IPC-A-610, are met. Once the product is shipped, it’s out of sight, out of mind, with the expectation that it will remain shipped for the warranty-plus-one-day period noted above.

Unless, of course, it comes back sooner than that. Which it often does. Like those flight servo motor boards. Seven months later. Or from an oil rig in the middle of the Gulf of Mexico, where it failed during a hurricane, pummeled by salt spray. Or the hardware that is shot into space and stays there and stops working one day. Tough to make a service call in space when the black box fails and all those SiriusXM listeners go ballistic because the satellite carrying B.B. King’s Bluesville crapped out. It’s not a happy, soulful day when the blues hardware abruptly breathes its last. At those times no one cares that it passed the tests on the production line. Like the commercial, they want it to simply work.

How do I know this to be true? Remember what I said about bad boards being good for business? The evidence is overwhelming.

The second simple truth to be acknowledged is that most board test procedures have little or nothing to say about long-term performance and ultimate reliability of the product. In-circuit, flying probe, and JTAG testing are snapshots. They tell us how the board is working today, if it is working at all, and whether that board was correctly assembled in conformity with the customer’s bill of materials, schematic, and IPC-A-610. Nothing more than that. That sort of test has no predictive capability: It has nothing to say about how long that passing state will last. Solder on certain pads may hang by a thread and escape visual detection, assuming such detection methods were used at all. By the rules of manufacturing and electrical testing, the joint is good because current flows in the designed direction and the product works, insofar as it matches the BOM. It just works. How long it will work is anybody’s guess.

Reliability is the 900-pound gorilla nobody talks about. This column is going to talk about it. First we state the problem. Next time we’ll examine some tools and standards of measurement of the reliability of PCBAs. Finally, we’ll consider some pathways leading toward a solution to the problem.

Good ideas. They just work.