While the era of the ubiquitous computer, in
which computers are embedded everywhere to yield smart cars, smart buildings,
smart offices, etc., has not yet arrived, it is clear that more and more devices
in daily use involve embedded, information-processing computer systems. Computers are certainly becoming more
pervasive, but I would argue that before computing systems can ever become truly
ubiquitous, they will have to exhibit a far greater ability to engage in natural
interaction with humans than they are capable of at present. Not only must they be able to understand
the manner in which humans communicate by word and gesture, but they will have
to become aware of their environment, how it behaves, and when they are being
addressed or communicated with by humans.
At that point, we will have entered the age of the vigilant environment -
an environment populated by innocuous devices that are perceptive and
autonomous, that interact naturally with humans, that know when interaction is
required, and that understand the intention of the communication, even if the
information communicated is ill-posed or insufficient. Such an understanding requires both
robust, adaptive, anticipatory perceptual systems and a sufficient
understanding of the semantics of interaction.
The foregoing 'dream' has been held by many
scientists and others for several years. However, the scientific challenges are
significant. For a vigilant
environment to be possible, we will need fundamental developments on several
fronts.
We will need robust, computer-vision-based tools
that can reliably derive structural information about an observed scene. Arguably, these tools will require
either active (mobile) sensors or multiple-camera stereopsis (or, probably,
both). This requirement is an
inevitable consequence of the loss of information inherent in the imaging
projection process; one needs multiple constraints to effect the information
recovery. Note well, however, that
this does not necessarily imply that the vision tools must be capable of
producing 3-D CAD-like representations of the environment; rather, the
representations they do produce must be in some sense invariant over time and space
- that the representations are valid for at least moderate periods of time and
that they are valid irrespective of the relative positioning of the observed and
the observer and of their mutual behaviour.
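To make concrete why multiple views supply the constraints that a single projection loses, the following sketch (Python with NumPy; the cameras, the identity intrinsics, and the 3-D point are purely illustrative assumptions, not a proposed system) performs standard linear triangulation of a point from two corresponding image observations - something no single view can do on its own.

    import numpy as np

    def project(P, X):
        # Project a 3-D point X through the 3x4 camera matrix P to pixel coordinates.
        x = P @ np.append(X, 1.0)
        return x[:2] / x[2]

    def triangulate(P1, P2, x1, x2):
        # Each view contributes two linear constraints on the homogeneous point.
        A = np.array([
            x1[0] * P1[2] - P1[0],
            x1[1] * P1[2] - P1[1],
            x2[0] * P2[2] - P2[0],
            x2[1] * P2[2] - P2[1],
        ])
        # The least-squares solution is the right singular vector with the
        # smallest singular value.
        _, _, Vt = np.linalg.svd(A)
        X = Vt[-1]
        return X[:3] / X[3]          # de-homogenise

    # Two hypothetical cameras with identity intrinsics: one at the origin,
    # one translated one unit along the x-axis (the stereo baseline).
    P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
    P2 = np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])
    point = np.array([0.5, 0.2, 4.0])              # ground-truth 3-D point
    x1, x2 = project(P1, point), project(P2, point)
    print(triangulate(P1, P2, x1, x2))             # ~ [0.5, 0.2, 4.0]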
I believe it is a core research issue to find such invariants and ways of
computing them from the uncertain, time-varying, and noisy world that makes up
our visual environment. I believe
too, that it may be difficult to do this in a conventional analytic framework
and that the search for such invariants might benefit from being re-cast as an
evolutionary problem (i.e. an evolutionary computation problem). Whilst an unbounded search for such
constraints is inherently ill-posed (and, at best, NP-hard), some constraints
might be provided by the nature of the interaction itself.
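As one illustration - not a specific proposal - of what re-casting the search for invariants as an evolutionary computation problem might look like, the sketch below evolves a linear feature whose response stays as constant as possible across repeated 'views' of the same toy scene; the genome, fitness function, and data are all simplifying assumptions.

    import numpy as np

    rng = np.random.default_rng(0)

    def fitness(weights, views):
        # Higher when the feature response varies little across views of one scene.
        responses = views @ weights
        return -np.var(responses)

    def evolve(views, pop_size=50, generations=200, mutation=0.1):
        dim = views.shape[1]
        population = rng.normal(size=(pop_size, dim))
        for _ in range(generations):
            # Evaluate unit-normalised genomes so that shrinking the weights to
            # zero is not a cheap way of achieving "invariance".
            scores = np.array([fitness(w / np.linalg.norm(w), views) for w in population])
            # Keep the fitter half as parents; refill the population with mutated copies.
            parents = population[np.argsort(scores)[-pop_size // 2:]]
            children = parents + mutation * rng.normal(size=parents.shape)
            population = np.vstack([parents, children])
        best = max(population, key=lambda w: fitness(w / np.linalg.norm(w), views))
        return best / np.linalg.norm(best)

    # Toy data: 20 "views" that share the same underlying content but differ
    # along one nuisance direction (a crude stand-in for viewpoint change).
    base = np.array([0.2, 0.7, -0.4])
    nuisance = np.array([1.0, 0.0, 0.0])
    views = base + np.outer(rng.normal(size=20), nuisance)
    print(evolve(views))   # expect a unit vector nearly orthogonal to [1, 0, 0]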
However hard the development of robust
visual and auditory tools may be, it is only one side of the vigilant-interface
coin. The other side, I believe, is
actually much harder. I take it as
a given that a computational artifact which requires the use of some
externally-sourced form of semantics (information on the structure and behaviour
of the world) can only function well in very limited circumstances and with
quite small semantic data-sets. My principal reason for asserting this is that
any externally-sourced semantics reflect the perceptions and conceptions of the
human observer - programmer prejudice, if you will - rather than any consistent
mapping between observed and observer in the computational agent's domain of
discourse. The larger the set of a
priori semantic information embedded in a system, the greater the possibility of
a mismatch between the ability of the system to generate such information from
sensory data and its ability to use it to solve the problem at hand. If we want to develop truly resilient,
robust perceptual agents (ubiquitous human-computer interfaces), I believe it is
essential that the agent be capable of closing the epistemological loop that
makes the perception and human-computer interaction meaningful in the first
place.
There are a number of important implications of
this point of view.
The first is that the agent must be able to
learn its own epistemology (within, perhaps, a group of perceiving agents) and
build its own semantic understanding of the world. This means that it must be
able to interact with its environment and falsify postulated hypotheses
regarding the perceptual-interaction mapping.
The second is that goal-directed activity (i.e.
getting the agent to do what we want of it) can only then be effected by
training - it can't be effected by implanting rules in the
agent.
The third is that the development of a
resilient natural interface becomes an exercise in the behavioural development of a
computational (perceptual) agent (actor).
The fourth is that cognition might then be
viewed as a (temporal) pattern of attentional behaviour as the visual agent
interacts with its environment based on prior learned semantics and
expectation-driven reasoning about the likely outcome of future
interaction.
Finally, the semantics of the agent's world can
thus be seen to be consistent function-dependent patterns of inter-object
behaviour (of which the agent may itself be one object). A table, for example, is not a table
because of its structure but because an agent or actor is consistently perceived
to be using it to support the things they are using. One major research challenge is to find
out how to identify - to learn, to represent, and to reason about - these
function-dependent patterns.
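The sketch below is one hypothetical way of operationalising this idea: the object names, the observed 'supports' events, and the thresholds all stand in for the output of a perceptual front end, and the point is only that the label 'table-like' is earned from consistent observed use, and falsified by inconsistent use, rather than being implanted as a rule.

    from collections import defaultdict

    class FunctionLearner:
        def __init__(self, threshold=0.8, min_observations=5):
            # Per-object evidence for the hypothesis "this object functions as a support".
            self.counts = defaultdict(lambda: {"supports": 0, "total": 0})
            self.threshold = threshold
            self.min_observations = min_observations

        def observe(self, obj, used_as_support):
            # Record one interaction episode involving `obj`.
            record = self.counts[obj]
            record["total"] += 1
            if used_as_support:
                record["supports"] += 1

        def is_table_like(self, obj):
            # The hypothesis survives only if it is consistently confirmed;
            # otherwise it is rejected (falsified).
            record = self.counts[obj]
            if record["total"] < self.min_observations:
                return None                      # not enough evidence yet
            return record["supports"] / record["total"] >= self.threshold

    # Usage: episodes in which an actor is seen placing and using items on objects.
    learner = FunctionLearner()
    for _ in range(8):
        learner.observe("flat_object_1", used_as_support=True)     # consistently used as a support
    for outcome in [True, False, False, True, False]:
        learner.observe("flat_object_2", used_as_support=outcome)  # inconsistent use
    print(learner.is_table_like("flat_object_1"))   # True  - functions as a table
    print(learner.is_table_like("flat_object_2"))   # False - hypothesis falsified

The thresholding here is deliberately naive; the research challenge identified above is precisely to learn which relational patterns (such as 'supports') matter at all, and how to represent and reason about them.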
It is clear that we will require continued
development of computer perception tools (mathematical formalisms,
representations, learning paradigms) of the kind outlined above. We will also need a flexible test bed to
investigate new techniques and to experiment with interactive systems. Such a test bed might come in the form
of a vigilant room, populated by a series of sensor systems of different
configurations and different degrees of mobility (consider, for example, a
series of articulated sensor surfaces populated by cameras, like a large-scale
active plenoptic array). All sensor surfaces should be capable of sharing
information; this is one possible way in which the perceptual agents can create
a shared epistemology. There will
also be a need for robotic agents to effect training. Finally, there will be a need for
significant computational resources, especially if an evolutionary approach is
taken to learning. Above all,
this research dream should be enacted in a multi-disciplinary environment where
scientists, engineers, psycho-physicists, psychologists, and others can come
together for extended periods to share ideas and develop new ways of
thinking.