Theories of Object Recognition

Compare and contrast Marr and Nishihara’s and Biederman’s theories of object recognition. How well do they explain how we are able to recognize three dimensional objects despite changes in viewing angle?

Humphreys and Bruce (1989) proposed a model of object recognition that fits within a wider context of cognition. According to them, the recognition of objects occurs in a series of stages. First, sensory input is generated, leading to perceptual classification, where the information is compared with previously stored descriptions of objects. Then, the object is recognized and can be semantically classified and subsequently named.

This approach is, however, over-simplified. Other theories, such as Marr and Nishihara’s and Biederman’s, explain in more detail the processes involved in the stages of perceptual and semantic classification. This essay will compare and contrast these two theories and evaluate their contribution to explaining 3D object recognition. In doing so, it will consider the debate between viewpoint-invariant and viewpoint-dependent accounts and compare both approaches with others, such as Tarr and Bülthoff’s and Foster and Gilson’s.

According to Humphreys and Bruce (1989), the first stage of object recognition is the early visual processing of the retinal image, for example Marr’s primal sketch, in which a two-dimensional description is formed. In the second stage a description of the object is generated, for example Marr’s 2½D sketch, in which a description of the depth and orientation of visible surfaces is formed in relation to the viewpoint of the observer and is therefore viewpoint dependent. In the third stage (perceptual classification) a structural description is created, similar to the processes forming Marr’s 3D model representation. The main focus of both Marr and Nishihara’s and Biederman’s theories appears to be on the second and third stages of this sequence.

Marr and Nishihara (1978) proposed a theory of object recognition based on generating a 3D object-centered representation, which allows the object to be recognized from any angle. According to them, this representation is based on a canonical coordinate frame, which is established by defining the central axis of the object.

To locate the main axis, the shape of the object is derived from the information provided by the 2½D sketch, based on the object’s occluding contours. The boundaries of the object’s silhouette are used to generate the contour of the object and are referred to as the contour generator. Once the shape of the object has been generated, the main axis is located. Areas of concavity and convexity are then used to divide the object into smaller parts. Next, the axes of each sub-section are identified and each component is represented by a generalized cone known as a primitive. In this way a 3D representation of the object is generated, and the arrangement of components is matched against stored 3D model descriptions to identify the object. These 3D models are hierarchical and include both global and detailed information, stored in a hierarchically organized catalog (Kaye, 2010).
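The hierarchical matching described above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the `Cone` class, the `describe` and `recognize` functions, the coarse-to-fine matching order and the relative axis lengths are hypothetical and are not taken from Marr and Nishihara (1978).

```python
from dataclasses import dataclass, field


@dataclass
class Cone:
    """A generalized cone (primitive); field names are hypothetical."""
    rel_length: float                          # axis length relative to the main axis
    parts: list = field(default_factory=list)  # sub-components, each with its own axis


def describe(cone, levels):
    """Order-independent description of a cone and its parts, `levels` deep."""
    own = round(cone.rel_length, 1)
    if levels == 0:
        return (own,)                          # global information only
    subs = tuple(sorted(describe(p, levels - 1) for p in cone.parts))
    return (own, subs)


def recognize(obj, catalog, max_levels=2):
    """Narrow the catalog from global to detailed descriptions (assumed order)."""
    candidates = list(catalog)
    for levels in range(max_levels + 1):
        target = describe(obj, levels)
        candidates = [n for n in candidates
                      if describe(catalog[n], levels) == target]
        if len(candidates) <= 1:
            break
    return candidates


# Invented stored models: a main axis plus limbs of different relative lengths.
catalog = {
    "biped": Cone(1.0, [Cone(0.3), Cone(0.6), Cone(0.6), Cone(0.8), Cone(0.8)]),
    "quadruped": Cone(1.0, [Cone(0.3), Cone(0.5), Cone(0.5), Cone(0.5), Cone(0.5)]),
}

# The same components listed in a different order still match.
seen = Cone(1.0, [Cone(0.8), Cone(0.6), Cone(0.3), Cone(0.8), Cone(0.6)])
print(recognize(seen, catalog))  # prints ['biped']
```

With these invented values both stored models share the same global description (a single main axis), so the match is only resolved once the arrangement of the component parts is compared. Because the description is built relative to the object's own axes, it does not change with the order or direction from which the parts are encountered, which is the sense in which such a representation is object-centered.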

Marr’s ideas about object recognition have been extended and adapted by Biederman. Like Marr and Nishihara’s theory, Biederman’s is based on representing complex objects using a set of simpler primitives. However, Biederman’s primitives are not limited to generalized cones. Instead, he proposes that complex objects are made up of arrangements of basic component parts, such as cylinders and cubes, known as geons. As in Marr and Nishihara’s theory, this division into component parts is based on geometrical properties of occluding contours in the image; in particular, parts are defined in relation to sharp concavities on contours. However, unlike Marr, Biederman claims that contour generation is not needed to recover a 3D shape. Instead, he proposes that each geon has a key feature that remains invariant regardless of viewpoint. So, first the key features are located in the 2D primal sketch and then they are matched to a geon, generating a 3D structural description of the object. This description is then matched against those stored in memory.

According to Biederman, geons are detected on the basis of non-accidental properties such as collinearity, symmetry and parallelism. Like Marr and Nishihara, Biederman maintains that primitives are invariant under changes in viewpoint. Both theories are also supported by research. Lawson and Humphreys (1996), for example, showed that recognition is affected more by tilt of the major axis (foreshortening) than by any other rotation, which supports Marr and Nishihara’s prediction that establishing a central axis is crucial to the process of recognition. Warrington and Taylor (1978) reported that brain-damaged patients could recognize objects only when they were presented in a typical view. These patients found it difficult to say whether two photographs presented simultaneously showed the same object when one image was a typical view and the other an unusual view. Although this could be explained as the patients’ inability to transform a 2D version of the atypical view into a 3D model, it could also be due to difficulty in establishing the central axis or to some features of the object being hidden.

In a later study, Humphreys and Riddoch (1984) used images in which either the axis had been foreshortened through rotation or a critical feature was hidden. They found that patients had more problems recognizing the images with a foreshortened axis than those in which a critical feature was hidden. Their findings provide some support for Marr and Nishihara’s idea that axis location plays a key role in generating 3D descriptions. In addition, Biederman and Gerhardstein (1993) investigated the extent to which recognition is object-centered. They used a technique known as repetition priming to find out whether presenting one viewpoint of an object would help it to be recognized from another viewpoint. Their results showed that priming occurred if the change in viewing angle was less than 135 degrees.

However, priming was less effective if one or more geons were hidden between the first and the second view, even when the change in angle was smaller than 135 degrees (Kaye, 2010). These findings not only support Biederman’s idea that geons are used to generate descriptions, but also support the claim made by both theories that the production of an object description is viewpoint independent. There is also evidence to support the claim made by both theories that concavities are used to divide objects into components. A study by Biederman (1987a) demonstrated that deleting the concave parts of the contour in an image disrupted recognition more than deleting other parts of the contour.

Both theories have a clear advantage over earlier models, such as template matching and feature recognition, in which the complexities of 3D recognition were not taken into account. However, both have limitations, and there is research that neither can accommodate. Bülthoff and Edelman (1992), for example, found that participants were not able to recognize novel objects presented from a novel viewpoint, even though the view presented allowed the creation of an object-centered description. Besides, although both theories accommodate between-category discriminations well, such as deciding whether an animal is a cat or a dog, they cannot account for subtle perceptual discriminations within classes of objects. For example, the same geons or cones would describe any poodle, so they cannot distinguish between two specific poodles. Furthermore, the two theories see the process of recognition as essentially passive, whereas other types of recognition, such as recognition by touch, take a more active approach. In addition, neither theory is specific about how visual information is matched with what is held in long-term memory, or about how semantic information is accessed and naming takes place. Moreover, they downplay the importance of context in object recognition.

Palmer, for example, found that the probability of identifying an object was greater when the object was appropriate to the context (Eysenck and Keane, 1995). In addition, studies related to both theories may lack ecological validity because they rely on static pictures or models. Thus, it makes sense to think that different types of recognition may be achieved in different ways. While structural description models like Marr and Nishihara’s and Biederman’s predict that recognition should not be greatly affected by a change in viewpoint as long as the same structural elements are visible from all views (Hayward, 2003), others, like Tarr and Bülthoff (1995), have argued that recognition is indeed viewpoint dependent. This second class of theories assumes that objects are encoded in memory in the poses in which they are seen by viewers, and they are thus called “view-based” theories.

They propose that a 2D, rather than a 3D, projection of an object from a particular position is encoded using specific coordinates, and that a separate representation is formed for each new view of an object. Old views are recognized by matching the encoded 2D features, while new views require a process that generalizes them to the closest stored view. Unlike structural description theories, they make no mention of non-accidental properties. They predict increasing recognition difficulty as an observed view is rotated further from the nearest studied view. Both approaches agree that changing viewpoint will result in costs and that some visual properties, especially structural ones, are important for generalizing across viewpoints. The debate between viewpoint-dependent and viewpoint-invariant accounts has been extensive over the last decade (Hayward, 2003).
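The prediction that difficulty grows with rotation away from the nearest stored view can be sketched in Python. This is a minimal illustration, assuming that views differ only by rotation about a single axis measured in degrees; the function names and the stored angles are hypothetical, and the angular distance is only a stand-in for recognition cost, not a model taken from Tarr and Bülthoff (1995).

```python
def angular_distance(a, b):
    """Smallest rotation in degrees between two viewing angles."""
    d = abs(a - b) % 360
    return min(d, 360 - d)


def distance_to_nearest_view(observed, stored_views):
    """Rotation needed to reach the closest stored view (a proxy for cost)."""
    return min(angular_distance(observed, v) for v in stored_views)


stored_views = [0, 90]  # hypothetical studied views
for observed in (0, 10, 45, 180, 350):
    print(observed, distance_to_nearest_view(observed, stored_views))
```

Under these assumptions a studied view (0 degrees) has zero cost, a view at 45 degrees is as far as possible from both stored views that flank it, and a view at 350 degrees is treated as only 10 degrees away from the stored view at 0. A strictly viewpoint-invariant account would instead predict no systematic relation between this distance and performance, provided the same structural elements remain visible.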

Tarr and Bülthoff (1995) defended the viewpoint-dependent model. They argued that viewpoint-invariant mechanisms lack generality, as the conditions proposed for obtaining invariance do not characterize everyday recognition. They also argued that the wide range of studies that find viewpoint-dependent recognition performance cannot be dismissed as arising from non-recognition systems or experimental artifacts, as suggested by Biederman. Moreover, they stated that Biederman’s theory does not provide an account of entry-level recognition, because in some cases it represents different entry-level items as the same object and in others it represents the same entry-level item as different objects. In addition, they argued that geon structural descriptions cannot account for object recognition and concluded that the visual system may use multiple object representation systems for different tasks and/or different classes of object. They suggested that view-independent representations may be used for object classification, and view-dependent representations for discrimination within an object class.

So, there may be multiple representational systems. Besides Tarr and Bülthoff’s, there are other models predicting that recognition performance is affected by a change of view in some situations but is invariant in others. Foster and Gilson, however, go beyond these and propose a model in which viewpoint-dependent and viewpoint-invariant processes are integrated within a cooperative framework (Hayward, 2003). “Foster and Gilson’s study of view dependency of novel objects that were created by combining structural properties (number of parts) with metric properties (thickness, size of parts) has found that both view-dependent and view-independent processing seem to be combined in object recognition” (Biederman, Osaka and Rentschler, 2007, p.90). There are, however, some problems with this study, for example the need to minimize part occlusion and the fact that the number of parts in an object is too simplistic a structural property. Nevertheless, it marks a move away from the debate between the two extreme perspectives of view-based and view-invariant processing, towards the possibility of a recognition process in which features are selected according to the current task and the amount of visual experience in that task (Biederman, Osaka and Rentschler, 2007).

In conclusion, recognition of 3D objects is a complex process and cognitive psychologists still have a lot to learn about it. There are not only different types of recognition, but also different ways of recognizing objects, which may involve different processes. Therefore, neither Marr and Nishihara’s nor Biederman’s viewpoint-invariant theory is an entirely satisfactory account of the processes involved in recognition. Although both theories clarify some aspects of the process, both have problems. Then again, viewpoint-dependent theories also give an incomplete account of object recognition. There may be, as suggested by Tarr and Bülthoff, multiple representational systems, and they may work cooperatively rather than alternatively, as suggested by Foster and Gilson. Further research that moves beyond the viewpoint debate is needed for a better understanding of object recognition.

Biederman, I. and Gerhardstein, P.C. (1993) ‘Recognizing depth-rotated objects: evidence and conditions for three-dimensional viewpoint invariance’, Journal of Experimental Psychology: Human Perception and Performance, vol.19, pp.1162–82, in Kaye, H. (2010) (Ed) “Cognitive Psychology”, Milton Keynes, The Open University, p.126.

Bülthoff, H.H. and Edelman, S. (1992) ‘Psychophysical support for a two dimensional view interpolation theory of object recognition’, Proceedings of the National Academy of Sciences of the USA, vol.89, pp.60–4, in Kaye, H. (2010) (Ed) “Cognitive Psychology”, Milton Keynes, The Open University, p.126.

Eysenck, M.W. and Keane, M.T. (1995) “Cognitive psychology: A student’s handbook”, East Sussex, Psychology Press.

Hayward, W.G. (2003) ‘After the viewpoint debate: where next in object recognition?’, Trends in Cognitive Sciences, vol.7, no.10, pp.425–7.

Humphreys, G.W. and Bruce, V. (1989) “Visual Cognition: Computational, Experimental and Neuropsychological Perspectives”, Hove, Lawrence Erlbaum Associates Ltd, in Kaye, H. (2010) (Ed) “Cognitive Psychology”, Milton Keynes, The Open University, p.106.

Humphreys, G.W. and Riddoch, M.J. (1984) ‘Routes to object constancy: implications from neurological impairments of object constancy’, Quarterly Journal of Experimental Psychology, vol.36A, pp.385–415, in Kaye, H. (2010) (Ed) “Cognitive Psychology”, Milton Keynes, The Open University, p.123.

Kaye, H. (2010) (Ed) “Cognitive Psychology”, Milton Keynes, The Open University.

Lawson, R. and Humphreys, G.W. (1996) ‘View-specificity in object processing: evidence from picture matching’, Journal of Experimental Psychology: Human Perception and Performance, vol.22, pp.395–416, in Kaye, H. (2010) (Ed) “Cognitive Psychology”, Milton Keynes, The Open University.

Marr, D. and Nishihara, H.K. (1978) ‘Representation and recognition of the spatial organization of three-dimensional shapes’, Proceedings of the Royal Society of London, Series B, vol.211, pp.151–80, in Kaye, H. (2010) (Ed) “Cognitive Psychology”, Milton Keynes, The Open University, pp.116–23.

Tarr, M.J. and Bülthoff, H.H. (1995) ‘Is human object recognition better described by geon structural descriptions or by multiple views? Comment on Biederman and Gerhardstein (1993)’, Journal of Experimental Psychology: Human Perception and Performance, vol.21, no.6, pp.1494–505.

Warrington, E.K. and Taylor, A.M. (1978) ‘Two categorical stages of object recognition’, Perception, vol.7, pp.695–705, in Kaye, H. (2010) (Ed) “Cognitive Psychology”, Milton Keynes, The Open University, p.123.