Cyber Physical System Homework Its gonna be articles Cyber Physical Systems: Design Challenges Edward A. Lee ∗ Center for Hybrid and Embedded Software Sy

Cyber Physical System Homework Its gonna be articles Cyber Physical Systems: Design Challenges

Edward A. Lee ∗

Center for Hybrid and Embedded Software Systems, EECS
University of California, Berkeley

Berkeley, CA 94720, USA


Cyber-Physical Systems (CPS) are integrations of com-
putation and physical processes. Embedded computers and
networks monitor and control the physical processes, usu-
ally with feedback loops where physical processes affect
computations and vice versa. The economic and soci-
etal potential of such systems is vastly greater than what
has been realized, and major investments are being made
worldwide to develop the technology. There are consider-
able challenges, particularly because the physical compo-
nents of such systems introduce safety and reliability re-
quirements qualitatively different from those in general-
purpose computing. Moreover, physical components are
qualitatively different from object-oriented software com-
ponents. Standard abstractions based on method calls and
threads do not work. This paper examines the challenges in
designing such systems, and in particular raises the ques-
tion of whether today’s computing and networking tech-
nologies provide an adequate foundation for CPS. It con-
cludes that it will not be sufficient to improve design pro-
cesses, raise the level of abstraction, or verify (formally or
otherwise) designs that are built on today’s abstractions.
To realize the full potential of CPS, we will have to re-
build computing and networking abstractions. These ab-
stractions will have to embrace physical dynamics and com-
putation in a unified way.

1 Introduction

Cyber-Physical Systems (CPS) are integrations of com-
putation with physical processes. Embedded computers and
networks monitor and control the physical processes, usu-
ally with feedback loops where physical processes affect
computations and vice versa. In the physical world, the pas-

∗This work is supported by the National Science Foundation (CNS-
0647591 and CNS-0720841).

sage of time is inexorable and concurrency is intrinsic. Nei-
ther of these properties is present in today’s computing and
networking abstractions.

Applications of CPS arguably have the potential to dwarf
the 20-th century IT revolution. They include high confi-
dence medical devices and systems, assisted living, traffic
control and safety, advanced automotive systems, process
control, energy conservation, environmental control, avion-
ics, instrumentation, critical infrastructure control (electric
power, water resources, and communications systems for
example), distributed robotics (telepresence, telemedicine),
defense systems, manufacturing, and smart structures. It is
easy to envision new capabilities, such as distributed micro
power generation coupled into the power grid, where tim-
ing precision and security issues loom large. Transportation
systems could benefit considerably from better embedded
intelligence in automobiles, which could improve safety
and efficiency. Networked autonomous vehicles could dra-
matically enhance the effectiveness of our military and
could offer substantially more effective disaster recovery
techniques. Networked building control systems (such as
HVAC and lighting) could significantly improve energy effi-
ciency and demand variability, reducing our dependence on
fossil fuels and our greenhouse gas emissions. In communi-
cations, cognitive radio could benefit enormously from dis-
tributed consensus about available bandwidth and from dis-
tributed control technologies. Financial networks could be
dramatically changed by precision timing. Large scale ser-
vices systems leveraging RFID and other technologies for
tracking of goods and services could acquire the nature of
distributed real-time control systems. Distributed real-time
games that integrate sensors and actuators could change the
(relatively passive) nature of on-line social interactions.

The economic impact of any of these applications would
be huge. Computing and networking technologies today,
however, may unnecessarily impede progress towards these
applications. For example, the lack of temporal seman-
tics and adequate concurrency models in computing, and
today’s “best effort” networking technologies make pre-

11th IEEE Symposium on Object Oriented Real-Time Distributed Computing (ISORC)

978-0-7695-3132-8/08 $25.00 © 2008 IEEE
DOI 10.1109/ISORC.2008.25


dictable and reliable real-time performance difficult. Soft-
ware component technologies, including object-oriented
design and service-oriented architectures, are built on ab-
stractions that match software better than physical systems.
Many applications will not be achievable without substan-
tial changes in the core abstractions.

2 Requirements for CPS

Embedded systems have always been held to a higher
reliability and predictability standard than general-purpose
computing. Consumers do not expect their TV to crash and
reboot. They have come to count on highly reliable cars,
where in fact the use of computer controller has dramati-
cally improved both the reliability and efficiency of the cars.
In the transition to CPS, this expectation of reliability will
only increase. In fact, without improved reliability and pre-
dictability, CPS will not be deployed into such applications
as traffic control, automotive safety, and health care.

The physical world, however, is not entirely predictable.
Cyber physical systems will not be operating in a controlled
environment, and must be robust to unexpected conditions
and adaptable to subsystem failures. An engineer faces an
intrinsic tension; designing predictable and reliable com-
ponents makes it easier to assemble these components into
predictable and reliable systems. But no component is per-
fectly reliable, and the physical environment will manage
to foil predictability by presenting unexpected conditions.
Given components that are predictable and reliable, how
much can a designer depend on that predictability and re-
liability when designing the system? How does she avoid
brittle designs, where small deviations from expected oper-
ating conditions cause catastrophic failures?

This is not a new problem in engineering. Digital circuit
designers have come to rely on astonishingly predictable
and reliable circuits. Circuit designers have learned to har-
ness intrinsically stochastic processes (the motions of elec-
trons) to deliver a precision and reliability that is unprece-
dented in the history of human innovation. They can deliver
circuits that will perform a logical function essentially per-
fectly, on time, billions of times per second, for years. All
this is built on a highly random substrate. Should system
designers rely on this predictability and reliability?

In fact, every digital system we use today relies on this
to some degree. There is considerable debate about whether
this reliance is in fact impeding progress in circuit technol-
ogy. Circuits with extremely small feature sizes are more
vulnerable to the randomness of the underlying substrate,
and if system designers would rely less on the predictabil-
ity and reliability of digital circuits, then we could progress
more rapidly to smaller feature sizes.

No major semiconductor foundry has yet taken the
plunge and designed a circuit fabrication process that deliv-

ers logic gates that work as specified 80% of the time. Such
gates are deemed to have failed completely, and a process
that delivers such gates routinely has a rather poor yield.

But system designers do design systems that are robust
to such failures. The purpose is to improve yield, not to im-
prove reliability of the end product. A gate that fails 20%
of the time is a failed gate, and a successful system has to
route around it. The gates that have not failed will work
essentially 100% of the time. The question, therefore, be-
comes not whether to design robust systems, but rather at
what level to build in robustness. Should we design sys-
tems that work with gates that perform as specified 80%
of the time? Or should we design systems that reconfigure
around gates that fail 20% of the time, and then assume that
remaining gates work essentially 100% of the time?

I believe that the value of being able to count on gates
that have passed the yield test to work essentially 100% of
the time is enormous. Such solidity at any level of abstrac-
tion in system design is valuable. But it does not eliminate
the need for robustness at the higher levels of abstraction.
Designers of memory systems, despite the high reliability
and predictability of the components, still put in checksums
and error-correcting codes. If you have a billion compo-
nents (one gigabit RAM, for example) operating a billion
times per second, then even nearly perfect reliability will
deliver errors upon occasion.

The principle that we need to follow is simple. Compo-
nents at any level of abstraction should be made predictable
and reliable if this is technologically feasible. If it is not
technologically feasible, then the next level of abstraction
above these components must compensate with robustness.
Successful designs today follow this principle. It is (still)
technically feasible to make predictable and reliable gates.
So we design systems that count on this. It is harder to make
wireless links predictable and reliable. So we compensate
one level up, using robust coding and adaptive protocols.

The obvious question is whether it is technically feasi-
ble to make software systems predictable and reliable. At
the foundations of computer architecture and programming
languages, software is essentially perfectly predictable and
reliable, if we limit the term “software” to refer to what is
expressed in simple programming languages. Given an im-
perative programming language with no concurrency, like
C, designers can count on a computer to perform exactly
what is specified with essentially 100% reliability.

The problem arises when we scale up from simple
programs to software systems, and particularly to cyber-
physical systems. The fact is that even the simplest C pro-
gram is not predictable and reliable in the context of CPS
because the program does not express aspects of the behav-
ior that are essential to the system. It may execute perfectly,
exactly matching its semantics, and still fail to deliver the
behavior needed by the system. For example, it could miss


timing deadlines. Since timing is not in the semantics of
C, whether a program misses deadlines is in fact irrelevant
to determining whether it has executed correctly. But it is
very relevant to determining whether the system has per-
formed correctly. A component that is perfectly predictable
and reliable turns out not to be predictable and reliable in
the dimensions that matter. This is a failure of abstraction.

The problem gets worse as software systems get more
complex. If we step outside C and use operating system
primitives to perform I/O or to set up concurrent threads,
we immediately move from essentially perfect predictabil-
ity and reliability to wildly nondeterministic behavior that
must be carefully reigned in by the software designer [19].
Semaphores, mutual exclusion locks, transactions, and pri-
orities are some of the tools that software designers have
developed to attempt to compensate for this loss of pre-
dictability and reliability.

But the question we must ask is whether this loss of pre-
dictability and reliability is really necessary. I believe it is
not. Predictable and reliable software does not eliminate
the need to design robust systems, but dramatically changes
the nature of the challenge. We must follow the principle
of making systems predictable and reliable if this is tech-
nically feasible, and give up only when there is convincing
evidence that this is not possible or cost effective. There
is no such evidence for software. Moreover, we have an
enormous asset: the substrate on which we build software
systems (digital circuits) is essentially perfectly predictable
and reliable with respect to properties we care about (timing
and functionality).

Let us examine further the failure of abstraction. Fig-
ure 1 illustrates some of the abstraction layers on which we
depend when designing embedded systems. In this three-
dimensional Venn diagram, each box represents a set. E.g.,
at the bottom, we have the set of all microprocessors. An
element of this set, e.g., the Intel P4-M 1.6GHz, is a par-
ticular microprocessor. Above that is the set of all x86 pro-
grams, each of which can run on that processor. This set is
defined precisely (unlike the previous set, which is difficult
to define) by the x86 instruction set architecture (ISA). Any
program coded in that instruction set is a member of the set,
such as a particular implementation of a Java virtual ma-
chine. Associated with that member is another set, the set
of all JVM bytecode programs. Each of these programs is
(typically) synthesized by a compiler from a Java program,
which is a member of the set of all syntactically valid Java
programs. Again, this set is defined precisely by Java syn-
tax. Each of these sets provides an abstraction layer that is
intended to isolate a designer (the person or program that
selects elements of the set) from the details below. Many of
the best innovations in computing have come from careful
and innovative construction and definition of these sets.

However, in the current state of embedded software,

silicon chips




VHDL programs

VHDL programs

C++ programs


Java programs

Java byte code programs

FPGA configurations


x86 programs


P4-M 1.6GHz




C programs

models Linux processesPosix



task-level models

Figure 1. Abstraction layers in computing

nearly every abstraction has failed. The ISA, meant to
hide hardware implementation details from the software,
has failed because the user of the ISA cares about timing
properties the ISA does not guarantee. The programming
language, which hides details of the ISA from the program
logic, has failed because no widely used programming lan-
guage expresses timing properties. Timing is merely an ac-
cident of the implementation. A real-time operating system
hides details of the program from their concurrent orches-
tration, yet this fails because the timing may affect the re-
sult. The network hides details of signaling from systems,
but most standard networks provide no timing guarantees.

All embedded systems designers face versions of this
problem. Aircraft manufacturers have to stockpile the elec-
tronic parts needed for the entire production line of an air-
craft model to avoid having to recertify the software if the
hardware changes. “Upgrading” a microprocessor in an en-
gine control unit for a car requires thorough re-testing of the
system. Even “bug fixes” in the software or hardware can
be extremely risky, since they can change timing behavior.

The design of an abstraction layer involves many
choices, and computer scientists have chosen to hide timing
properties from all higher abstractions. Wirth [31] says “It
is prudent to extend the conceptual framework of sequential
programming as little as possible and, in particular, to avoid
the notion of execution time.” In an embedded system,
however, computations interact directly with the physical
world, where time cannot be abstracted away. Even general-
purpose computing suffers from these choices. Since timing
is neither specified in programs nor enforced by execution


platforms, a program’s timing properties are not repeatable.
Concurrent software often has timing-dependent behavior
in which small changes in timing have big consequences.

Designers have traditionally covered these failures by
finding worst case execution time (WCET) bounds and us-
ing real-time operating systems (RTOS’s) with predictable
scheduling policies. But these require substantial margins
for reliability, and ultimately reliability is (weakly) deter-
mined by bench testing. Moreover, WCET is an increas-
ingly problematic fiction as processor architectures develop
ever more elaborate techniques for dealing stochastically
with deep pipelines, memory hierarchy, and parallelism.
Modern processor architectures render WCET virtually un-
knowable; even simple problems demand heroic efforts. In
practice, reliable WCET numbers come with many caveats
that are increasingly rare in software. The processor ISA
has failed to provide an adequate abstraction.

Timing behavior in RTOSs is coarse and becomes in-
creasingly uncontrollable as the complexity of the sys-
tem increases, e.g., by adding inter-process communication.
Locks, priority inversion, interrupts and similar issues break
the formalisms, forcing designers to rely on bench testing,
which rarely identifies subtle timing bugs. Worse, these
techniques produce brittle systems in which small changes
can cause big failures.

While there are no true guarantees in life, we should
not blithely discard predictability that is achievable. Syn-
chronous digital hardware delivers astonishingly precise
timing behavior, and software abstractions discard several
orders of magnitude of precision. Compare the nanosecond-
scale precision with which hardware can raise an interrupt
request to the millisecond-level precision with which soft-
ware threads respond. We don’t have to do it this way.

3 Background

Integration of physical processes and computing is not
new. The term “embedded systems” has been used for some
time to describe engineered systems that combine physical
processes with computing. Successful applications include
communication systems, aircraft control systems, automo-
tive electronics, home appliances, weapons systems, games
and toys, for example. However, most such embedded sys-
tems are closed “boxes” that do not expose the computing
capability to the outside. The radical transformation that
we envision comes from networking these devices. Such
networking poses considerable technical challenges.

For example, prevailing practice in embedded software
relies on bench testing for concurrency and timing proper-
ties. This has worked reasonably well, because programs
are small, and because software gets encased in a box with
no outside connectivity that can alter the behavior. How-
ever, the applications we envision demand that embedded

systems be feature-rich and networked, so bench testing
and encasing become inadequate. In a networked environ-
ment, it becomes impossible to test the software under all
possible conditions. Moreover, general-purpose networking
techniques themselves make program behavior much more
unpredictable. A major technical challenge is to achieve
predictable timing in the face of such openness.

Historically, embedded systems were largely an indus-
trial problem, one of using small computers to enhance the
performance or functionality of a product. In this earlier
context, embedded software differed from other software
only in its resource limitations (small memory, small data
word sizes, and relatively slow clocks). In this view, the
“embedded software problem” is an optimization problem.
Solutions emphasize efficiency; engineers write software at
a very low level (in assembly code or C), avoid operating
systems with a rich suite of services, and use specialized
computer architectures such as programmable DSPs and
network processors that provide hardware support for com-
mon operations. These solutions have defined the practice
of embedded software design and development for the last
30 years or so. In an analysis that remains as valid today
as 19 years ago, Stankovic [26] laments the resulting mis-
conceptions that real-time computing “is equivalent to fast
computing” or “is performance engineering” (most embed-
ded computing is real-time computing).

But the resource limitations of 30 years ago are surely
not resource limitations today. Indeed, the technical chal-
lenges have centered more on predictability and robustness
than on efficiency. Safety-critical embedded systems, such
as avionics control systems for passenger aircraft, are forced
into an extreme form of the “encased box” mentality. For
example, in order to assure a 50 year production cycle for
a fly-by-wire aircraft, an aircraft manufacturer is forced to
purchase, all at once, a 50 year supply of the microproces-
sors that will run the embedded software. To ensure that val-
idated real-time performance is maintained, these micropro-
cessors must all be manufactured on the same production
line from the same masks. The systems will be unable to
benefit from the next 50 years of technology improvements
without redoing the (extremely expensive) validation and
certification of the software. Evidently, efficiency is nearly
irrelevant compared to predictability, and predictability is
difficult to achieve without freezing the design at the phys-
ical level. Clearly, something is wrong with the software
abstractions being used.

The lack of timing in computing abstractions has been
exploited heavily in such computer science disciplines as ar-
chitecture, programming languages, operating systems, and
networking. Caches, dynamic dispatch, and speculative ex-
ecution improve average case performance at the expense of
predictability. These techniques make it nearly impossible
to tell how long it will take to execute a particular piece of


code.1 To deal with these architectural problems, designers
may choose alternative processor architectures such as pro-
grammable DSPs not for efficiency, but for predictability.

Even less time-sensitive applications are affected. Anec-
dotal evidence from computer-based instrumentation, for
example, indicates that real-time performance delivered by
today’s PCs is about the same as that delivered by PCs in
the mid-1980’s. Twenty years of Moore’s law have not im-
proved things in this dimension. This is not entirely due to
hardware architecture, of course. Operating systems, pro-
gramming languages, user interfaces, and networking have
become more elaborate. All have been built on an abstrac-
tion of software where time is irrelevant.

The prevailing view of real-time appears to have been
established well before embedded computing was common
[31]. “Computation” is accomplished by a terminating se-
quence of state transformations. This core abstraction un-
derlies the design of nearly all computers, programming
languages, and operating systems in use today. But unfor-
tunately, this core abstraction may not fit CPS very well.

The most interesting and revolutionary cyber-physical
systems will be networked. The most widely used net-
working techniques today introduce a great deal of timing
variability and stochastic behavior. Today, embedded sys-
tems often use specialized networking technologies (such as
CAN busses in manufacturing systems and FlexRay in au-
tomotive applications). What aspects of those networking
technologies should or could be important in larger scale
networks? Which are compatible with global networking?

To be specific, recent advances in time synchronization
across networks promise networked platforms that share a
common notion of time to a known precision [16]. How
would that change how distributed cyber-physical applica-
tions are developed? What are the implications for security?
Can we mitigate security risks created by the possibility of
disrupting the shared notion of time? Can security tech-
niques effectively exploit a shared notion of time to improve

Operating systems technology is also groaning under the
weight of the requirements of embedded systems. RTOS’s
are still essentially best-effort technologies. To specify real-
time properties of a program, the designer has to step out-
side the programming abstractions, making operating sys-
tem calls to set priorities or to set up timer interrupts. Are
RTOS’s merely a temporary patch for inadequate comput-
ing foundations? What would replace them? Is the concep-
tual boundary between the operating system and the pro-
gramming language (a boundary established in the 1960’s)
still the right one? It would be truly amazing if it were.

1A glib response is that execution time in a Turing-complete language
is undecidable anyway, so it’s not worth even trying to predict execution
time. This is nonsense. No cyber-physical system that depends on timeli-
ness can be deployed without timing assurances. If Turing completeness
interferes with this, then Turing completeness must be sacrificed.

Cyber-physical systems by nature will be concurrent.
Physical processes are intrinsically concurrent, and their
coupling with computing requires, at a minimum, concur-
rent composition of the computing processes with the phys-
ical ones. Even today, embedded systems must react to mul-
tiple real-time streams of sensor stimuli and control multi-
ple actuators concurrently. Regrettably, the mechanisms of
interaction with sensor and actuator hardware, built for ex-
ample on the concept of interrupts, are not well represented
in programming languages. They have been deemed to be
the domain of operating systems, not of software design.
Instead, the concurrent interactions with hardware are ex-
posed to programmers through the abstraction of threads.

Threads, however, are a notoriously problematic [19,
32]. This fact is often blamed on humans rather than on
the abstraction. Sutter and Larus [27] observe that “hu-
mans are quickly overwhelmed by concurrency and find it
much more difficult to reason about concurrent than sequen-
tial code. Even careful people miss possible interleavings
among even simple collections of partially ordered oper-
ations.” The problem will get far worse with extensively
networked cyber-physical systems.

Yet humans are quite adept at reasoning about concur-
rent systems. The physical world is concurrent, and our
very survival depends on our ability to reason about con-
current physical dynamics. The problem is that we have
chosen concurrent abstractions for software that do not even
vaguely resemble the concurrency of the physical world.
We have become so used to these that we have lost track
of the fact that they are not immutable. Could it be that the
difficulty of concurrent programming is a consequence of
the abstractions, and that if we were are willing to let go of
those abstractions, then the problem would be fixable?

4 Solutions

These problems are not entirely new; many creative re-
searchers have made contributions. Advances in formal ver-
ification, emulation and simulation techniques, certification
methods, software engineering processes, design patterns,
and software component technologies all help. We would
be lost without these improvements. But I believe that to re-
alize its full potential, CPS systems will require fundamen-
tally new technologies. It is possible that these will emerge
as incremental improvements on existing technologies, but
given the lack of timing in the core abstractions of comput-
ing, this seems improbable.

Nonetheless, incremental improvements can have a con-
siderable impact. For example, concurrent programming
can be done in much better ways than threads. For example,
Split-C [10] and Cilk [8] are C-like languages supporting
multithreading with constructs that are easier to understand
and control than raw threads. A related approach combines


language extensions with constraints that limit expressive-
ness of established languages in order to get more consistent
and predictable behavior. For example, the Guava language
[5] constrains Java so that unsynchronized objects cannot be
accessed from multiple threads. It further makes explicit the
distinction between locks that ensure the integrity of read
data (read locks) and locks that enable safe modification
of the data (write locks). SHIM also provides more con-
trollable thread interactions [29]. These language changes
prune away considerable nondeterminacy without sacrific-
ing much performance, but they still have deadlock risk, and
again, none of them confronts the lack of temporal seman-

As stated above, I believe that the best approach has to be
predictable where it is technically feasible. Predictable con-
current computation is possible, but it requires approaching
the problem differently. Instead of starting with a highly
nondeterministic mechanism like threads, and relying on
the programmer to prune that nondeterminacy, we should
start with deterministic, composable mechanisms, and in-
troduce nondeterminism only where needed.

One approach that is very much a bottom-up approach is
to modify computer architectures to deliver precision tim-
ing [12]. This can allow for deterministic orchestration
of concurrent actions. But it leaves open the question of
how the software will be designed, given that programming
languages and methodologies have so thoroughly banished
time from the domain of discourse.

Achieving timing precision is easy if we are willing to
forgo performance; the engineering challenge is to deliver
both precision and performance. While we cannot aban-
don structures such as caches and pipelines and 40 years of
progress in programming languages, compilers, operating
systems, and networking, many will have to be re-thought.
Fortunately, throughout the abstraction stack, there is much
work on which to build. ISAs can be extended with instruc-
tions that deliver precise timing with low overhead [15].
Scratchpad memories can …

Submit a Comment

Open chat