Towards conscious general vision systems

Jan-Mark Geusebroek and Arnold W.M. Smeulders

Intelligent Sensory Information Systems, Informatics Institute,
Faculty of Science, University of Amsterdam
Kruislaan 403, 1098 SJ Amsterdam, The Netherlands
{mark, smeulders}@science.uva.nl

This research dream sketches the problem of content awareness in general vision systems. The common conceptual framework in visual cognition is Marr's (1982) view of a layered visual system, consisting of a low-level image processing layer, followed by an intermediate level of segmentation and grouping, and ending in a high-level interpretation layer. However, when visual information enters the brain it is redirected to dozens of cognitive modules (Van Essen et al. 1992), interconnected by various visual pathways. Knowledge and expectation are included at an early stage in human perception. These facts more or less rule out the existence of a strictly layered vision system, in which interpretation is a top-level conclusion reached without steering and feedback from intermediate or low-level vision modules. We consider cognition a highly distributed task, in which several cognitive modules together reach consensus about the visual scene. In such a complex system, conscious and unconscious behaviour may be initiated at various modules. Understanding the working of the individual visual modules, and their interconnection with other vision modules and with other perceptual or cognitive modules, gives insight into the semantic, conscious interpretation of the visual scene.

The discipline of computer vision aims at modelling vision to the extent that it can be performed by a machine. Understanding visual perception at this level of detail is a long-term goal, which once seemed merely a grand challenge. As vision absorbs 30% of the brain's processing power, and what is known about it points to a highly distributed task interwoven with many other modules, it is now clear that understanding and modelling human vision is far off. Nevertheless, vision remains an intriguing challenge, as it dominates our senses and societal interaction, and points at the essence of individual existence. And, as computers are expected to reach the capacity of the human brain by 2015, now is the time to start thinking about how to construct modules for general vision systems.

General vision is concerned with processing visual sensory information in order to act and react in the constantly changing environment the sensory system is observing. The human visual system is a very well adapted example of a general vision system. Computer vision in this respect is disappointing, as there are only a few areas where it can compete with the capabilities of human perception. Two reasons are apparent. The first major bottleneck is the difference between the physical scene and the observation of that scene. The observation is affected by accidental imaging circumstances: the viewpoint on the scene, the interaction between light and material, the limited resolution of the sensory system, and several other factors. Only recently has computer vision aimed at solving this bottleneck. The second bottleneck stems from knowledge and expectations about what we see. Human perception actively assigns knowledge and anticipation to the observed scene, using semantic information to reason at a higher level than can be achieved from purely visual evidence. A better understanding of the assignment of semantic information to visual data is fundamental to sustained development in computer vision.

We start from the premise that any general sensory system is adapted to the outside world it is processing, specifically to the statistical structure of its input signals. For one, the statistics of the sensory input are dominated by the physical laws of image formation and of reflection from materials. These generate scene-specific imaging aspects, which are desirable for a precise understanding of the scene but undesirable for recognising objects in the scene and labelling them with general categories. A bag of sugar will show a specific reflection pattern, and from scene-specific aspects of the image the observer may estimate its position in the scene, points of contact, and ways to grab the bag. When the aim is to recognise the bag as a bag, and to identify that the bag may contain sugar, the scene-accidental conditions have to be removed first by an invariant description. This is a general requirement for general vision systems, generalised in current philosophy as the proper way to describe all conscious perception (Nozick 2002).
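
As a minimal illustration of such an invariant description, consider illumination intensity: the same bag of sugar observed under dim or bright light yields different pixel values but, assuming uniform illumination, identical normalized chromaticity. The sketch below is our own illustrative example of this simplest of photometric invariants, not a method prescribed by the text.

```python
import numpy as np

def chromaticity(rgb):
    """Normalized chromaticity: invariant to the overall intensity of the
    illumination (shading, light-source strength), though not to its colour."""
    rgb = rgb.astype(np.float64)
    s = rgb.sum(axis=-1, keepdims=True) + 1e-8  # guard against division by zero
    return rgb / s  # per pixel, the three channels now sum to one

# The same surface under dim and twice-as-bright light: pixel values differ,
# the invariant representation does not.
dim    = np.array([[[ 60,  50,  40]]])
bright = np.array([[[120, 100,  80]]])
assert np.allclose(chromaticity(dim), chromaticity(bright))
```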

To counteract the accidental aspects of the scene, a general vision system will represent the image in many diverse invariant representations. We consider the transformation of sensory responses into invariants an important and inevitable information reduction stage in a general vision system. The resulting directly observable quantities are believed to be an essential part of human perception (Foster and Nascimento 1994; Koenderink 1984). As many physical parameters may or may not affect the image formation process, a large variety of invariants can be deduced from visual data. A scene may contain white light, blue diffuse light (the open sky), directed light, and shadows, each requiring a different invariant transformation. Computer vision has partly solved the problem of invariant transformations. Koenderink (1984) made a significant step forward with his work on the structure and scaling behaviour of receptive fields. He has been followed by many, among others in the subsequent categorisation of geometrical invariants by Florack (1991) and Van Gool et al. (1995), and the derivation of colour invariants by Geusebroek et al. (2001). The simplification of the sensory input by invariant representation advances towards a better formulated computational theory for visual cognition. The use of invariants results in less complicated algorithms (Mundy and Zisserman 1992), simpler object representations, and has made image retrieval under varying conditions possible (Smeulders et al. 2000).
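
To make the receptive-field view concrete, the following sketch computes first-order Gaussian derivatives at an observation scale sigma, in the spirit of Koenderink's (1984) scale-space; the gradient magnitude combines them into a quantity invariant to image-plane rotation. The function name and the use of SciPy are our own illustrative choices.

```python
import numpy as np
from scipy import ndimage

def scale_space_gradient(image, sigma):
    """First-order Gaussian derivative filters at observation scale sigma,
    modelling the receptive-field measurements of scale-space theory."""
    Lx = ndimage.gaussian_filter(image, sigma, order=(0, 1))  # derivative along x
    Ly = ndimage.gaussian_filter(image, sigma, order=(1, 0))  # derivative along y
    # The gradient magnitude is invariant under rotation of the image plane,
    # one of the simplest geometrical invariants.
    return np.hypot(Lx, Ly)

image = np.random.rand(128, 128)
edges_fine   = scale_space_gradient(image, sigma=1.0)  # fine spatial structure
edges_coarse = scale_space_gradient(image, sigma=4.0)  # coarse structure only
```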

The visual content of the scene is laid down in invariant representations, leaving the interpretation of the object-specific visual characteristics. The essence of invariant representations is to simplify the visual interpretation task. Adaptation to the remaining statistical structure implies tuning to the a priori occurrence of the observed phenomenon, which constrains the visual input. To limit the enormous computational burden arising from the complex task of interpretation, any efficient general vision system will ignore the common statistics in its input signals. Hence, the apparent occurrence of invariant representations decides what is salient enough to pay attention to. In this respect, the study of natural image statistics is essential (Knill et al. 2003). Focus-of-attention mechanisms restrict processing to a limited number of stimuli (Schmid et al. 2000). For the human visual system, sensory information is selected by our consciousness, giving attention to only a few stimuli at any time (O'Shaughnessy 2002). Cognitive vision starts to play a role as soon as we pay attention to the visual stimulus.
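
A hedged sketch of this principle: if the system ignores the common statistics of its input, saliency reduces to statistical rarity, the self-information of a local response under the image's own response histogram. The choice of local contrast as the measurement channel is illustrative only.

```python
import numpy as np
from scipy import ndimage

def saliency(image, sigma=2.0, bins=64):
    """Attention as statistical rarity: a response is salient when it is
    improbable under the empirical statistics of the image itself."""
    response = ndimage.gaussian_gradient_magnitude(image, sigma)
    hist, edges = np.histogram(response, bins=bins)
    p = hist / hist.sum()                    # empirical response statistics
    idx = np.digitize(response, edges[1:-1]) # bin index per pixel, 0..bins-1
    return -np.log(p[idx] + 1e-12)           # rare responses -> high saliency

image = np.random.rand(128, 128)
conspicuous = np.unravel_index(np.argmax(saliency(image)), image.shape)
```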

Semantic interpretation assigns knowledge to the objects in the scene, either by memory association, or by learning more abstract rules from feedback or reasoning. The complexity of the learning space is reduced tremendously once the statistical structure in the sensory input has been exploited. Invariant representations reduce the information content to the essential visual characteristics, and the selection of statistically descriptive properties by focus-of-attention mechanisms limits the dimensionality of the learning space.
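
As an illustration of how small the remaining learning space can be, the sketch below describes an attended region by a histogram of an invariant representation and recognises it by memory association, here plain nearest-neighbour matching. All names and the descriptor choice are our assumptions, not a prescription from the text.

```python
import numpy as np

def invariant_descriptor(invariant_map, bins=16):
    """Histogram of an invariant representation: a compact descriptor from
    which the accidental imaging conditions have already been removed."""
    hist, _ = np.histogram(invariant_map, bins=bins, range=(0.0, 1.0))
    return hist / max(hist.sum(), 1)

def recognise(query, memory):
    """Memory association as nearest neighbour in the reduced learning space."""
    labels = list(memory)
    distances = [np.abs(query - memory[label]).sum() for label in labels]  # L1
    return labels[int(np.argmin(distances))]

# Toy memory of two known objects, each stored as an invariant histogram.
memory = {"sugar bag": invariant_descriptor(np.random.rand(64, 64)),
          "red ball":  invariant_descriptor(np.random.rand(64, 64))}
label = recognise(invariant_descriptor(np.random.rand(64, 64)), memory)
```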

Expectation about the scene may be used to steer the selection of invariant representations. Hence, focal attention is not only triggered by visual stimuli; it is also affected by knowledge about the scene, initiating conscious behaviour. At a first glance at the scene, expectation may affect focal attention. Closer inspection of the scene will use the hypothesised contents of the visual field to steer the focal attention mechanisms. This feedback mechanism sketches a form of consciousness for general vision systems. The research question is how knowledge and expectation should steer focal attention to yield an efficient vision system.
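
One way such feedback could look, sketched under our own assumptions: bottom-up saliency maps from several invariant channels are re-weighted by top-down expectation before the next fixation is chosen. The channel names and weights below are purely illustrative.

```python
import numpy as np

def steer_attention(channels, expectation):
    """Combine bottom-up saliency channels under top-down expectation:
    knowledge about the scene re-weights what is conspicuous enough to
    attend to next."""
    combined = sum(expectation.get(name, 1.0) * saliency_map
                   for name, saliency_map in channels.items())
    return np.unravel_index(np.argmax(combined), combined.shape)

channels = {"colour": np.random.rand(64, 64),
            "motion": np.random.rand(64, 64)}
# 'Highway mode': the hypothesis that moving objects matter more than colour.
fixation = steer_attention(channels, {"motion": 3.0, "colour": 0.5})
```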

In our view, the physical and statistical constraints on the sensory input determine the construction of general vision systems. Representing the scene in a multiplicity of invariants is opposite to mainstream computer vision, in that it breaks with the tradition of Marr's simplification of the scene in terms of geometrical entities, Marr's primal sketch (Marr 1982). Rather than aiming for a complete geometrical representation of the visual field, visual cognition may be based on a weak description of the important features in the scene, as long as the mutual correspondence between observation and objects in the world is maintained. Interpretation of the scene, hence the extraction of knowledge, starts after focal attention to conspicuous stimuli. Expectation and knowledge about the scene steer these focal attention mechanisms to adapt the selected set of invariant representations.

Envisaged results of the research sketched here include answers to the question of what kind of knowledge can be extracted from specific visual modules. In other words, which rudimentary visual tasks can be solved at a low level of processing, and which tasks require an enormous amount of intelligence, in that several cognitive modules must be connected to solve them. For example, consider the assignment of text primitives to visual entities like "a red ball", which may be solved at a completely different level than the interpretation of a complex scene as "a memorial meeting".

A second fundamental insight aimed for is how to use knowledge to steer the focus of attention in order to increase the accuracy of vision systems. This will have a major impact on computer vision, as it allows general algorithms to adapt to specific circumstances or tasks, and to include expectation in vision algorithms. For instance, general vision sensors in cars may be switched from highway circumstances, where focal attention may be geared towards the detection of speed limits, to a city mode focusing on pedestrian detection. Another area in which results may be expected is man-machine interaction. For internet browsing, the combination of textual information, representing direct knowledge, with visual data may yield better image retrieval systems.

This research dream sketches a computational theory for cognitive vision, starting at the sensory input and including a first connection between knowledge and vision through statistical learning and focal attention shifting. In this theory, the first two steps are visible: invariant representation and focal attention. The most important ingredients for invariant representation include colour scale-space, geometric invariance, general histograms, and motion. As a plurality of invariant representations results, focal attention reduces the complexity of the visual representation. The remaining statistical structure of each of the invariant representations determines the focal attention mechanisms, and expectation and hypothesised knowledge about the scene content steer them. The presence of large annotated pictorial databases and the availability of massive computing power allow the sketched direction to be fruitful.
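
The sketch below strings the visible steps together into one perception cycle, under the same caveats as before: the two invariant channels, the rarity measure, and the expectation weights are our illustrative stand-ins for the much richer set of representations the theory calls for.

```python
import numpy as np
from scipy import ndimage

def perceive(image, expectation, sigma=2.0):
    """One cycle of the sketched theory: invariant representations,
    statistical saliency, expectation-steered focal attention."""
    # Step 1: invariant representations (two illustrative channels).
    channels = {
        "contrast": ndimage.gaussian_gradient_magnitude(image, sigma),
        "blob":     np.abs(ndimage.gaussian_laplace(image, sigma)),
    }
    # Steps 2 and 3: per-channel statistical rarity, re-weighted by expectation.
    combined = np.zeros_like(image)
    for name, channel in channels.items():
        rarity = np.abs(channel - channel.mean()) / (channel.std() + 1e-8)
        combined += expectation.get(name, 1.0) * rarity
    return np.unravel_index(np.argmax(combined), combined.shape)

fixation = perceive(np.random.rand(64, 64), {"contrast": 2.0})
```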

References

D. C. Van Essen, C. H. Anderson, and D. J. Felleman, Information processing in the primate visual system: An integrated systems perspective, Science 255, 419-423, 1992.

L. J. M. Florack, Image Structure, PhD Thesis Utrecht University, 1991.

D. H. Foster and S. M. C. Nascimento, Relational colour constancy from invariant cone-excitation ratios, Proc. R. Soc. London B 257, 115-121, 1994.

J. M. Geusebroek et al., Color invariance, IEEE Trans. Pattern Anal. Machine Intell. 23, 1338-1350, 2001.

L. Van Gool et al., Vision and Lie's approach to invariance, Image Vision Comput. 13, 259-277, 1995.

D. Knill, W. T. Freeman, and W. S. Geisler (editors), Special issue of JOSA-A on Bayesian and Statistical Approaches to Vision, Optical Society of America, to appear, 2003.

J. J. Koenderink, The structure of images, Biol. Cybern. 50, 363-370, 1984.

D. Marr, Vision, Freeman and Co., 1982.

J. Mundy and A. Zisserman (editors), Geometric Invariance in Computer Vision, Springer-Verlag, 1992.

R. Nozick, Invariances: The Structure of the Objective World, Harvard University Press, 2002.

B. O'Shaughnessy, Consciousness and the World, Oxford University Press, 2002.

C. Schmid et al., Evaluation of interest point detectors, Int. J. Comput. Vision 37, 151-172, 2000.

A. W. M. Smeulders et al., Content based image retrieval at the end of the early years, IEEE Trans. Pattern Anal. Machine Intell. 22, 1349-1379, 2000.