New framework for programming unreliable chips

For handling the future unreliable chips, a research group at MIT’s Computer Science and Artificial Intelligence Laboratory has developed a new programming framework that enables software developers to specify when errors may be tolerable. The system then calculates the probability that the software will perform as it’s intended.

As transistors get smaller, they also become less reliable. This reliability won’t be a major issue in some cases. For example, if few pixels in each frame of a high-definition video are improperly decoded, viewers probably won’t notice — but relaxing the requirement of perfect decoding could yield gains in speed or energy efficiency.

The researchers’ system, which they’ve dubbed Rely, begins with a specification of the hardware on which a program is intended to run. That specification includes the expected failure rates of individual low-level instructions, such as the addition, multiplication, or comparison of two values. In its current version, Rely assumes that the hardware also has a failure-free mode of operation — one that might require slower execution or higher power consumption.

A developer who thinks that a particular program instruction can tolerate a little error simply adds a period (i-e “dot”) to the appropriate line of code. So the instruction “total = total + new_value” becomes “total = total +. new_value.” Where Rely encounters that telltale dot, it knows to evaluate the program’s execution using the failure rates in the specification. Otherwise, it assumes that the instruction needs to be executed properly.

With the existing version of Rely, a programmer who finds that permitting a few errors yields an unacceptably low probability of success can go back and tinker with his or her code, removing dots here and there and adding them elsewhere. Re-evaluating the code, the researchers say, generally takes no more than a few seconds.

But in ongoing work, they’re trying to develop a version of the system that allows the programmer to simply specify the accepted failure rate for whole blocks of code: say, pixels in a frame of video need to be decoded with 97 percent reliability. The system would then go through and automatically determine how the code should be modified to both meet those requirements and maximize either power savings or speed of execution.

Related News: