The identity, emotion and gaze behind Apple’s Vision Pro
When Apple introduced the Vision Pro, it represented another iteration of the immersive head-mounted display, a lineage dating back to Ivan Sutherland’s experiments in the 1960s, but with a difference. The Vision Pro is not only a high-resolution immersive display for the person wearing it; it also features an external display that gives those nearby the impression that they can see the wearer’s eyes. Adding a curved glass front with a complex lenticular display, which presents different images of the user’s eyes depending on the viewing angle, was an expensive but motivated choice. Although others may think these are the user’s actual eyes, the image is an animated 3D model driven by sensors inside the headset and generated from data collected when the user set up the device.
So, why so much effort to reveal an animated image of the user’s eyes on the external display? There are some basic communicative principles at play in the ways eyes see and are seen in everyday space. The external display is designed to have social efficacy because human eyes are highly communicative, in three ways.
First, the unique appearance of the eyes can communicate identity. People are able to distinguish friends from strangers, and recognition confirms for both participants a shared history, experience and established relationship. Recognising others is, for most people, a precondition for sociality (Bruce, 2017).
Second, facial expressions are a key way of communicating feelings. Ekman (1971) claims that these expressions manifest a set of universal emotions: anger, contempt, disgust, happiness, fear, sadness and surprise. Others disagree that emotional expressions are universal, arguing that there are cultural differences in how faces and eyes are able to communicate (Keltner et al., 2019). Either way, if the Vision Pro can represent the emotional expressiveness of eyes, so the logic goes, the user can participate in social interactions more fully.
Third, the eyes communicate through gaze, which indicates attention, interest and interpersonal demands. Gaze structures social power, in particular gendered power relations. The gaze is central to other media – in particular, the cinema, which can be understood as a medium structured according to the logic of gazes: between the viewer and the eyes of characters in the film, and among the characters within the film (Mulvey, 2001). Many social robots are designed to give the impression that they are meeting the user’s gaze (Chesher & Andreallo, 2022). If ‘spatial computing’ becomes a widespread media paradigm, it will operate to mediate the power of the gaze (Beer, 2018).
The Vision Pro is a critique in design of virtual reality goggles that immerse users in another space altogether but eliminate social presence. There is something inhuman about a person wearing a traditional VR head-mounted display like Meta’s Quest. For most of the time, the user’s vision is completely occluded, and when the device is running, users are segregated from the physical and social space around them. While users are aware there may be others nearby, and can even conduct a conversation, they effectively become blind and masked. It could be a form of torture. For those nearby, the VR user becomes a subject of some pity, lost in a world that they cannot share with others (unless their view is externalised somehow – such as by being projected onto a monitor – which changes the scene again).
Apple is distancing itself from this segregated model of virtual reality, even if its technology is quite capable of working in this way. It’s an effort to head off Meta’s Pied Piper quest to draw everyone into an animated Metaverse, phenomenologically separated from the everyday world. That Metaverse rests on a very 1990s vision of VR as opening new frontiers of virtual territory (Chesher, 1994).
By contrast, Apple’s promotional videos are quite conservative in representing users as having a familiar experience of working with interface elements like screens, windows, desktops and floating keyboards situated in the user’s immediate space. The visionOS operating system incorporates these elements into the everyday lifeworld of the user, who can use eye movements and hand gestures to manipulate the virtual elements embedded in the high-resolution view of the world. This is an extension of working with Macs, iPhones and Apple TVs.
The idea of combining computer-generated elements with a view of the surroundings in fact preceded the closed-off version of ‘virtual reality’. Sutherland’s 1968 ‘Sword of Damocles’, slung from the ceiling, used a half-silvered mirror that allowed users to see the world around them – and for those present to see the users’ eyes (McLellan, 1996). The computer would mechanically track the movement of the suspended headset and project rudimentary graphics calculated to give the impression of hallucinated objects in the room.
There have been several such systems since. At Boeing in the early 1990s, Thomas Caudell’s ‘augmented reality’ experiments projected images of computer graphic schematics of aircraft components for technicians working on the 777 (Azuma, 1997). Again, this still allowed others to communicate with them. Google’s ill-fated and overhyped 2013 Glass experiment promised to offer a ubiquitous computing display that guided explorers through the physical world, but it was controversial and underpowered, and was discontinued in 2015 (Klein et al., 2020).
Similarly, the heavier and more technically advanced augmented reality smart glasses released by Microsoft – the 2016 HoloLens and 2019 HoloLens 2 – allow the user to see the world around them directly, but use near-eye devices to project images onto the user’s retina. These ‘optical waveguides’ can project colourful ‘holograms’ that appear to be in the space of the room. Depth sensors allow the device to map the 3D space of the room, and cameras track hand gestures. However, these devices have a very narrow field of view, the images are insubstantial, and they have failed to get widespread take-up.
In contrast to other augmented reality devices, Apple’s strategy is essentially to build a virtual reality headset and enhance it with augmented reality features. Its desegregation strategy is based on a doubled form of illusion that attempts to recover the sense of the copresence of the headset user with other people and things surrounding them. The user’s illusion that they remain immersed in their everyday space comes from high-resolution cameras and greater-than-4K displays close to the eye that mimic natural vision. The operating system is configured to sense the presence of other people and to reveal them to the user. A digital crown allows the user to cross-fade between the virtual scene and the view of the surrounding space. The user’s eye movements and hand gestures become user inputs that bring the body into interaction with their space in new ways.
The prospects for developing a communicative medium for ‘spatial computing’, ‘metaverse’ or ‘holograms’ remain uncertain. Despite their differences, all current head-mounted displays are physically cumbersome and expensive, and (most importantly) lack compelling use cases for a mass market, even with extravagant R&D budgets from the ‘Magnificent Seven’.
The iterative processes of establishing new media paradigms are often slow, and rarely successful. The telephone was patented in 1876, but was not widely in use before the turn of the century. Radio broadcasting was slower still: the principles of radio (which also uses only one sensory modality) were known in the 1880s, but broadcasting did not start stabilising as a popular cultural form until the 1920s (Hilmes & Loviglio, 2002). Spatial computing is among the most complex consumer technologies ever built, with the goal of making the most natural and versatile user interface.
Apple chief Tim Cook has been vociferous in criticising VR as socially isolating, but it seems unlikely that his augmented reality headset, with its unique doubled digitally mediated gaze, will ever be the basis for a technology as ubiquitous as the telephone.
References
Azuma, R. T. (1997). A survey of augmented reality. Presence: Teleoperators & Virtual Environments, 6(4), 355-385.
Beer, D. (2018). The data gaze: Capitalism, power and perception. Sage.
Bruce, V. (2017). Recognising faces (Vol. 3). Routledge.
Chesher, C. (1994). Colonizing virtual reality: Construction of the discourse of virtual reality. Cultronix, 1(1), 1-27.
Chesher, C., & Andreallo, F. (2022). Eye machines: Robot eye, vision and gaze. International Journal of Social Robotics, 14(10), 2071-2081.
Ekman, P. (1971). Universals and cultural differences in facial expressions of emotion. In Nebraska symposium on motivation. University of Nebraska Press.
Hilmes, M., & Loviglio, J. (Eds.). (2002). Radio reader: Essays in the cultural history of radio. Psychology Press.
Keltner, D., Tracy, J. L., Sauter, D., & Cowen, A. (2019). What basic emotion theory really says for the twenty-first century study of emotion. Journal of Nonverbal Behavior, 43, 195-201.
Klein, A., Sørensen, C., de Freitas, A. S., Pedron, C. D., & Elaluf-Calderwood, S. (2020). Understanding controversies in digital platform innovation processes: The Google Glass case. Technological Forecasting and Social Change, 152, 119883.
McLellan, H. (1996). Virtual realities. Handbook of research for educational communications and technology, 457-487.
Mulvey, L. (2001). Unmasking the gaze: some thoughts on new feminist film theory and history. Lectora: revista de dones i textualitat, (7), 5-14.
Recommended citation
Chesher, C. (July, 2023) The identity, emotion, and gaze behind Apple’s Vision Pro. Critical Augmented and Virtual Reality Researchers Network (CAVRN).
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.