Vanderbilt University
When your computer
crashes and you get the dreaded blue screen or your smartphone freezes and you
have to go through the time-consuming process of a reset, most likely you blame
the manufacturer: Microsoft or Apple or Samsung.
In many instances,
however, these operational failures may be caused by the impact of electrically
charged particles generated by cosmic rays that originate outside the solar
system.
"This is a really
big problem, but it is mostly invisible to the public," said Bharat Bhuva,
professor of electrical engineering at Vanderbilt University, in a presentation
on Friday, Feb. 17 at a session titled "Cloudy with a Chance of Solar
Flares: Quantifying the Risk of Space Weather" at the annual meeting of
the American Association for the Advancement of Science in Boston.
When cosmic rays traveling at fractions of the speed of light strike Earth's atmosphere, they create cascades of secondary particles, including energetic neutrons, muons, pions and alpha particles.
Millions of these
particles strike your body each second. Despite their numbers, this subatomic
torrent is imperceptible and has no known harmful effects on living organisms.
However, a fraction of
these particles carry enough energy to interfere with the operation of
microelectronic circuitry. When they interact with integrated circuits, they
may alter individual bits of data stored in memory.
This is called a
single-event upset or SEU. Since it is difficult to know when and where these
particles will strike and they do not do any physical damage, the malfunctions
they cause are very difficult to characterize.
As a result,
determining the prevalence of SEUs is not straightforward.
"When you have a
single bit flip, it could have any number of causes. It could be a software bug
or a hardware flaw, for example. The only way you can determine that it is a
single-event upset is by eliminating all the other possible causes," Bhuva
explained.
There have been a
number of incidents that illustrate how serious the problem can be, Bhuva
reported.
For example, in 2003
in the town of Schaerbeek, Belgium, a bit flip in an electronic voting machine
added 4,096 extra votes to one candidate. The error was only detected because
it gave the candidate more votes than were possible, and it was traced to a single
bit flip in the machine's register.
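The figure of 4,096 is itself a clue: it equals 2 to the 12th power, the value carried by a single binary digit. The following minimal Python sketch shows how inverting one bit of a stored counter changes it by exactly that amount. The starting tally of 514 and the helper name `flip_bit` are hypothetical, chosen only for illustration, and are not taken from the Schaerbeek incident.

```python
def flip_bit(value: int, bit: int) -> int:
    """Return value with the bit at the given position inverted."""
    return value ^ (1 << bit)


# Hypothetical tally for illustration; not the actual Schaerbeek count.
true_count = 514

# Invert bit 12, whose place value is 2**12 = 4096.
corrupted = flip_bit(true_count, 12)

print(corrupted - true_count)  # 4096
```

Because bit 12 of the hypothetical tally starts at 0, the flip adds 4,096. Had that bit already been 1, the same upset would have subtracted 4,096 instead.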
In 2008, the avionics
system of a Qantas passenger jet flying from Singapore to Perth appeared to
suffer from a single-event upset that caused the autopilot to disengage. As a
result, the aircraft dove 690 feet in only 23 seconds, injuring about a third
of the passengers seriously enough to cause the aircraft to divert to the
nearest airstrip.
In addition, there
have been a number of unexplained glitches in airline computers -- some of
which experts feel must have been caused by SEUs -- that have led to the
cancellation of hundreds of flights and significant economic losses.
An analysis of SEU
failure rates for consumer electronic devices performed by Ritesh Mastipuram
and Edwin Wee at Cypress Semiconductor on a previous generation of technology
shows how prevalent the problem may be. Their results were published in 2004 in
Electronic Design News and provided the following estimates:
· A simple cell phone
with 500 kilobytes of memory should only have one potential error every 28
years.
· A router farm like
those used by Internet providers with only 25 gigabytes of memory may
experience one potential networking error that interrupts their operation every
17 hours.
· A person flying in
an airplane at 35,000 feet (where radiation levels are considerably higher than
they are at sea level) who is working on a laptop with 500 kilobytes of memory
may experience one potential error every five hours.
Bhuva is a member of
Vanderbilt's Radiation Effects Research Group, which was established in 1987
and is the largest academic program in the United States that studies the
effects of radiation on electronic systems.
The group's primary
focus was originally on military and space applications. Since 2001, the group
has also been analyzing radiation effects on consumer electronics in the
terrestrial environment.
They have studied this
phenomenon in the last eight generations of computer chip technology, including
the current generation that uses 3D transistors (known as FinFET) that are only
16 nanometers in size.
The 16-nanometer study
was funded by a group of top microelectronics companies, including Altera, ARM,
AMD, Broadcom, Cisco Systems, Marvell, MediaTek, Renesas, Qualcomm, Synopsys,
and TSMC.
"The
semiconductor manufacturers are very concerned about this problem because it is
getting more serious as the size of the transistors in computer chips shrink
and the power and capacity of our digital systems increase," Bhuva said.
"In addition, microelectronic circuits are everywhere and our society is
becoming increasingly dependent on them."
To determine the rate
of SEUs in 16-nanometer chips, the Vanderbilt researchers took samples of the
integrated circuits to the Irradiation of Chips and Electronics (ICE) House at
Los Alamos National Laboratory.
There they exposed
them to a neutron beam and analyzed how many SEUs the chips experienced.
Experts measure the failure rate of microelectronic circuits in a unit called a
FIT, which stands for failure in time. One FIT is one failure per transistor in
one billion hours of operation.
That may seem
infinitesimal but it adds up extremely quickly with billions of transistors in
many of our devices and billions of electronic systems in use today (the number
of smartphones alone is in the billions). Most electronic components have
failure rates measured in hundreds and thousands of FITs.
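The way small FIT values add up can be sketched with a few lines of Python. This is an illustration under stated assumptions, not a result from the Vanderbilt study: the function names are hypothetical, the 1,000-FIT rating and the one billion units in service are made-up round numbers, and a "unit" here means whatever item the FIT rate is quoted for.

```python
BILLION_HOURS = 1e9  # a FIT rate counts failures per billion hours of operation


def expected_failures(fit: float, units: float, hours: float) -> float:
    """Expected failures across `units` items, each rated at `fit`, over `hours`."""
    return fit * units * hours / BILLION_HOURS


def mean_hours_between_failures(fit: float, units: float) -> float:
    """Average hours between failures somewhere in a population of `units` items."""
    return BILLION_HOURS / (fit * units)


# Hypothetical inputs: a component rated at 1,000 FIT, one billion in service.
print(expected_failures(fit=1000, units=1e9, hours=1))  # 1000.0
print(mean_hours_between_failures(fit=1000, units=1))   # 1000000.0
```

Under these assumed numbers, a single component would go roughly a million hours between failures, yet a population of a billion such components would see on the order of a thousand failures every hour.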
"Our study
confirms that this is a serious and growing problem," said Bhuva.
"This did not come as a surprise. Through our research on radiation
effects on electronic circuits developed for military and space applications,
we have been anticipating such effects on electronic systems operating in the
terrestrial environment."
Although the details
of the Vanderbilt studies are proprietary, Bhuva described the general trend
that they have found in the last three generations of integrated circuit
technology: 28-nanometer, 20-nanometer and 16-nanometer.
As transistor sizes
have shrunk, they have required less and less electrical charge to represent a
logical bit. So the likelihood that one bit will "flip" from 0 to 1
(or 1 to 0) when struck by an energetic particle has been increasing.
This has been
partially offset by the fact that as the transistors have gotten smaller they
have become smaller targets so the rate at which they are struck has decreased.
More significantly,
the current generation of 16-nanometer circuits has a 3D architecture that
replaced the previous 2D architecture and has proven to be significantly less
susceptible to SEUs.
Although this
improvement has been partially offset by the increase in the number of
transistors in each chip, the failure rate at the chip level has also dropped slightly.
However, the increase
in the total number of transistors being used in new electronic systems has
meant that the SEU failure rate at the device level has continued to rise.
Unfortunately, it is
not practical to simply shield microelectronics from these energetic particles.
For example, it would
take more than 10 feet of concrete to keep a circuit from being zapped by
energetic neutrons. However, there are ways to design computer chips to
dramatically reduce their vulnerability.
For cases where
reliability is absolutely critical, you can simply design the processors in
triplicate and have them vote.
Bhuva pointed out:
"The probability that SEUs will occur in two of the circuits at the same
time is vanishingly small. So if two circuits produce the same result it should
be correct." This is the approach that NASA used to maximize the reliability
of spacecraft computer systems.
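The voting idea can be sketched in software, although real systems implement it in hardware. The Python example below is a minimal illustration, not a description of any NASA design: the function name `majority_vote` and the sample data word are assumptions made for the example. Each output bit takes the value that at least two of the three copies agree on.

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise majority: each output bit is the value held by at least two inputs."""
    return (a & b) | (a & c) | (b & c)


correct = 0b10110010            # hypothetical data word
upset = correct ^ (1 << 5)      # one copy suffers a single bit flip

# The two unaffected copies outvote the corrupted one.
print(majority_vote(correct, upset, correct) == correct)  # True
```

The vote fails only if two copies are corrupted in the same bit position at the same time, which is the coincidence Bhuva describes as vanishingly unlikely.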
The good news, Bhuva
said, is that the aviation, medical equipment, IT, transportation,
communications, financial and power industries are all aware of the problem and
are taking steps to address it. "It is only the consumer electronics
sector that has been lagging behind in addressing this problem."
The engineer's bottom
line: "This is a major problem for industry and engineers, but it isn't
something that members of the general public need to worry much about."