While the era of the ubiquitous computer, in
which computers are embedded everywhere to yield smart cars, smart buildings,
smart offices, etc., has not yet arrived, it is clear that more and more devices
in daily use involve embedded, information-processing computer systems. Computers are certainly becoming more
pervasive, but I would argue that before computing systems can ever become truly
ubiquitous, they will have to exhibit a far greater ability to engage in natural
interaction with humans than they are capable of at present. Not only must they be able to understand
the manner in which humans communicate by word and gesture, but they will have
to become aware of their environment, how it behaves, and when they are being
addressed or communicated with by humans.
At that point, we will have entered the age of the vigilant environment -
an environment populated by innocuous devices that are perceptive and
autonomous, that interact naturally with humans, that know when interaction is
required, and that understand the intention of the communication, even if the
information communicated is ill-posed or insufficient. Such an understanding requires both
robust, adaptive, anticipatory perceptual systems and a sufficient
understanding of the semantics of interaction.
The foregoing 'dream' has been held by many
scientists and others for several years. However, the scientific challenges are
significant. For a vigilant
environment to be possible, we will need fundamental developments on several
fronts.
We will need robust, computer-vision-based tools
that can reliably derive structural information about an observed scene. Arguably, these tools will require
either active (mobile) sensors or multiple-camera stereopsis (or, probably,
both). This requirement is an
inevitable consequence of the loss of information inherent in the imaging
projection process; one needs multiple constraints to effect the information
recovery. Note well, however, that
this does not necessarily imply that the vision tools must be capable of
producing 3-D CAD-like representations of the environment; rather, the
representations they do produce must be in some sense invariant over time and space
- that the representations are valid for at least moderate periods of time and
that they are valid irrespective of the relative positioning of the observed and
the observer and of their mutual behaviour.
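To make concrete why multiple views supply the constraints that a single projection loses, the following sketch (Python with NumPy; the cameras, the identity intrinsics, and the 3-D point are purely illustrative assumptions, not a proposed system) performs standard linear triangulation of a point from two corresponding image observations - something no single view can do on its own.

    import numpy as np

    def project(P, X):
        # Project a 3-D point X through the 3x4 camera matrix P to pixel coordinates.
        x = P @ np.append(X, 1.0)
        return x[:2] / x[2]

    def triangulate(P1, P2, x1, x2):
        # Each view contributes two linear constraints on the homogeneous point.
        A = np.array([
            x1[0] * P1[2] - P1[0],
            x1[1] * P1[2] - P1[1],
            x2[0] * P2[2] - P2[0],
            x2[1] * P2[2] - P2[1],
        ])
        # The least-squares solution is the right singular vector with the
        # smallest singular value.
        _, _, Vt = np.linalg.svd(A)
        X = Vt[-1]
        return X[:3] / X[3]          # de-homogenise

    # Two hypothetical cameras with identity intrinsics: one at the origin,
    # one translated one unit along the x-axis (the stereo baseline).
    P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
    P2 = np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])
    point = np.array([0.5, 0.2, 4.0])              # ground-truth 3-D point
    x1, x2 = project(P1, point), project(P2, point)
    print(triangulate(P1, P2, x1, x2))             # ~ [0.5, 0.2, 4.0]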
I believe it is a core research issue to find such invariants and ways of
computing them from the uncertain, time-varying, and noisy world that makes up
our visual environment. I believe
too, that it may be difficult to do this in a conventional analytic framework
and that the search for such invariants might benefit from being re-cast as an
evolutionary problem (i.e. an evolutionary computation problem). Whilst an unbounded search for such
constraints is inherently ill-posed (and, at best, NP-hard), some constraints
might be provided by the nature of the interaction itself.
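As one illustration - not a specific proposal - of what re-casting the search for invariants as an evolutionary computation problem might look like, the sketch below evolves a linear feature whose response stays as constant as possible across repeated 'views' of the same toy scene; the genome, fitness function, and data are all simplifying assumptions.

    import numpy as np

    rng = np.random.default_rng(0)

    def fitness(weights, views):
        # Higher when the feature response varies little across views of one scene.
        responses = views @ weights
        return -np.var(responses)

    def evolve(views, pop_size=50, generations=200, mutation=0.1):
        dim = views.shape[1]
        population = rng.normal(size=(pop_size, dim))
        for _ in range(generations):
            # Evaluate unit-normalised genomes so that shrinking the weights to
            # zero is not a cheap way of achieving "invariance".
            scores = np.array([fitness(w / np.linalg.norm(w), views) for w in population])
            # Keep the fitter half as parents; refill the population with mutated copies.
            parents = population[np.argsort(scores)[-pop_size // 2:]]
            children = parents + mutation * rng.normal(size=parents.shape)
            population = np.vstack([parents, children])
        best = max(population, key=lambda w: fitness(w / np.linalg.norm(w), views))
        return best / np.linalg.norm(best)

    # Toy data: 20 "views" that share the same underlying content but differ
    # along one nuisance direction (a crude stand-in for viewpoint change).
    base = np.array([0.2, 0.7, -0.4])
    nuisance = np.array([1.0, 0.0, 0.0])
    views = base + np.outer(rng.normal(size=20), nuisance)
    print(evolve(views))   # expect a unit vector nearly orthogonal to [1, 0, 0]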
However hard the development of robust
visual and auditory tools may be, it is only one side of the vigilant-interface
coin. The other side, I believe, is
actually much harder. I take it as
a given that a computational artifact which requires the use of some
externally-sourced form of semantics (information on the structure and behaviour
of the world) can only function well in very limited circumstances and with
quite small semantic data-sets. My principal reason for asserting this is that
any externally-sourced semantics reflect the perceptions and conceptions of the
human observer - programmer prejudice, if you will - rather than any consistent
mapping between observed and observer in the computational agent's domain of
discourse. The larger the set of a
priori semantic information embedded in a system, the greater the possibility of
a mismatch between the ability of the system to generate such information from
sensory data and its ability to use it to solve the problem at hand. If we want to develop truly resilient,
robust perceptual agents (ubiquitous human-computer interfaces), I believe it is
essential that the agent be capable of closing the epistemological loop that
makes the perception and human-computer interaction meaningful in the first
place.
There are a number of important implications of
this point of view.
The first is that the agent must be able to
learn its own epistemology (within, perhaps, a group of perceiving agents) and
build its own semantic understanding of the world. This means that it must be
able to interact with its environment and falsify postulated hypotheses
regarding the perceptual-interaction mapping.
The second is that goal-directed activity (i.e.
getting the agent to do what we want of it) can only then be effected by
training - it can't be effected by implanting rules in the
agent.
The third is that the development of a
resilient natural interface becomes an exercise in the behavioural development of a
computational (perceptual) agent (actor).
The fourth is that cognition might then be
viewed as a (temporal) pattern of attentional behaviour as the visual agent
interacts with its environment based on prior learned semantics and
expectation-driven reasoning about the likely outcome of future
interaction.
Finally, the semantics of the agent's world can
thus be seen to be consistent function-dependent patterns of inter-object
behaviour (of which the agent may itself be one object). A table, for example, is not a table
because of its structure but because an agent or actor is consistently perceived
to be using it to support the things they are using. One major research challenge is to find
out how to identify - to learn, to represent, and to reason about - these
function-dependent patterns.
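The sketch below is one hypothetical way of operationalising this idea: the object names, the observed 'supports' events, and the thresholds all stand in for the output of a perceptual front end, and the point is only that the label 'table-like' is earned from consistent observed use, and falsified by inconsistent use, rather than being implanted as a rule.

    from collections import defaultdict

    class FunctionLearner:
        def __init__(self, threshold=0.8, min_observations=5):
            # Per-object evidence for the hypothesis "this object functions as a support".
            self.counts = defaultdict(lambda: {"supports": 0, "total": 0})
            self.threshold = threshold
            self.min_observations = min_observations

        def observe(self, obj, used_as_support):
            # Record one interaction episode involving `obj`.
            record = self.counts[obj]
            record["total"] += 1
            if used_as_support:
                record["supports"] += 1

        def is_table_like(self, obj):
            # The hypothesis survives only if it is consistently confirmed;
            # otherwise it is rejected (falsified).
            record = self.counts[obj]
            if record["total"] < self.min_observations:
                return None                      # not enough evidence yet
            return record["supports"] / record["total"] >= self.threshold

    # Usage: episodes in which an actor is seen placing and using items on objects.
    learner = FunctionLearner()
    for _ in range(8):
        learner.observe("flat_object_1", used_as_support=True)     # consistently used as a support
    for outcome in [True, False, False, True, False]:
        learner.observe("flat_object_2", used_as_support=outcome)  # inconsistent use
    print(learner.is_table_like("flat_object_1"))   # True  - functions as a table
    print(learner.is_table_like("flat_object_2"))   # False - hypothesis falsified

The thresholding here is deliberately naive; the research challenge identified above is precisely to learn which relational patterns (such as 'supports') matter at all, and how to represent and reason about them.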
It is clear that we will require continued
development of computer perception tools (mathematical formalisms,
representations, learning paradigms) of the kind outlined above. We will also need a flexible test bed to
investigate new techniques and to experiment with interactive systems. Such a test bed might come in the form
of a vigilant room, populated by a series of sensor systems of different
configurations and different degrees of mobility (consider, for example, a
series of articulated sensor surfaces populated by cameras, like a large-scale
active plenoptic array). All sensor surfaces should be capable of sharing
information; this is one possible way in which the perceptual agents can create
a shared epistemology. There will
also be a need for robotic agents to effect training. Finally, there will be a need for
significant computational resources, especially if an evolutionary approach is
taken to learning. Above all,
this research dream should be enacted in a multi-disciplinary environment where
scientists, engineers, psycho-physicists, psychologists, and others can come
together for extended periods to share ideas and develop new ways of
thinking.