3D: the world seen through multiple cameras
Capturing images of the real world using multiple cameras and reconstructing those images in 3D is starting to look like a very promising line of research. Researchers with UPC-Barcelona Tech’s Image and Video Processing Group (GPI) explain where things stand with this technology.
The world around us is a space that has three dimensions (length, width and depth). But when we capture an image of the real world using a camera, information about depth, the third dimension, is lost. A photograph is just a projection of the three-dimensional world on a two-dimensional surface (the plane of the image). Spatial information that exists in the original scene is lost, and photos have a series of limitations that researchers working on applications to analyze and represent images are well aware of.
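The loss of depth described above follows directly from the geometry of the pinhole camera model. The sketch below (with illustrative values not taken from the article) shows that any two points lying along the same ray from the camera project to the same image coordinates, so depth cannot be recovered from a single photo.

```python
# Pinhole-camera projection sketch: a 3D point (X, Y, Z) maps to the
# image plane as (f*X/Z, f*Y/Z), where f is the focal length.
# All points on the same ray through the camera center collapse to
# one pixel, which is why the third dimension is lost.
def project(point, f=1.0):
    X, Y, Z = point
    return (f * X / Z, f * Y / Z)

near = (1.0, 2.0, 4.0)   # a point 4 units from the camera
far = (2.0, 4.0, 8.0)    # a point twice as far, on the same ray

print(project(near))     # (0.25, 0.5)
print(project(far))      # (0.25, 0.5) -- same pixel, depth is gone
```

Because both points land on the same pixel, no analysis of that single image can tell them apart; this is exactly the ambiguity that multiple cameras resolve.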
“The limitations of the image-formation process are clear. Projection yields only an apparent image: we cannot recover the positions the objects occupy within the scene. Projection also produces an occlusion effect: we can only see the objects in front, the ones that aren’t hidden by other objects,” explains Josep R. Casas, a researcher with the Image and Video Processing Group (GPI).
The problem of not being able to analyze or represent what cannot be seen can be solved by using multiple sensors (cameras). “If we use many cameras simultaneously, we can capture a scene from all viewpoints and then produce a reconstruction in three dimensions. This is called free-viewpoint video because it provides access to any viewpoint, enabling us to move around objects and choose any perspective to observe the scene,” says Casas.
With 3D technology the viewpoint can be chosen at will
Image and video processing technologies are constantly evolving. Fifteen years ago, working with multiple cameras simultaneously was not seen as a viable option by the scientific community: the computing tools for capturing video were limited and available only on high-end workstations or specialized supercomputers. More powerful computers have since opened up new possibilities, and applications that were previously unthinkable have been developed.
Experts say the main challenge now is to capture and work with images from multiple cameras; technological research in this area is still at an early stage. Javier Ruiz Hidalgo, a GPI researcher, explains that research on generating multi-camera content needs to address three problems: managing multiple cameras; representing three-dimensional information efficiently (how video streams are analyzed so they can be combined, referenced to each other, and used to produce the three-dimensional reconstruction); and exploiting the resulting three-dimensional model.
The GPI, a group attached to the Department of Signal Theory and Communications, is conducting research in these three areas.
Though it is still too early to talk about particular products, the researchers believe that in the future this technology could be applied in the analysis and representation of images.
A three-dimensional reconstruction can be used to analyze the scene or object represented in real dimensions and even to view the “back” side. This paves the way for richer analysis and opens up a broad range of possibilities. The technology could be exploited in areas such as video surveillance: the use of multiple cameras to control access to a building would make it more difficult for someone to hide behind an object to avoid being spotted by a camera.
It will also facilitate the development of visual interfaces between users and computer systems. The connection between a user and a system requires a physical interface for sending orders that the system responds to. Apart from the keyboard, screen and mouse, the use of other interfaces such as the voice has also become widespread. Gestures that can be recognized by a system could be another way to interact with the computing environment. The availability of three-dimensional information makes it easier to recognize user gestures, as Microsoft’s Kinect gaming system demonstrates. Kinect, a device that works with the Xbox console, analyzes the position of the user’s body to facilitate game play.
As for representation, the use of multiple cameras makes it possible to represent a scene or object in a different way. When three-dimensional information about an object is available, it can be rendered and viewed from any perspective. This approach to creating images could have a revolutionary impact on the production of film and TV content.
In television and film, the main advantage of this technology is that the camera can be positioned anywhere. If systems of this type come into use, experts believe they will change the paradigm for creating natural content. At present, when a scene is being shot it is clear that the actors and the set need to be on one side, and the cameras, technicians, lights and director on the other. But if the goal is to capture a scene from every viewpoint, the question of where to put each element will need to be rethought.
It is already common practice to use multiple cameras when shooting a scene, but the director chooses just one for any given shot. In the future, creators will also determine the possible viewpoint locations of a virtual camera (a device used to project a scene).

Studying people’s gestures and movements, and tracking individuals to recognize distinctive aspects of the way they move, is the focus of the GPI’s contribution to the European project Unobtrusive Authentication Using Activity Related and Soft Biometrics (ACTIBIO), aimed at identifying individuals based on dynamic biometric characteristics. “Authentication systems of this type will recognize users by the way they move and their gestures. These systems will be more robust than the ones currently in use, more difficult to trick, because the analysis of users will be continuous,” says Josep Ramon Casas.
The GPI is also working on the design of the immersive and interactive television of the future, a highly advanced technology that is the focus of another European project dubbed the Format-Agnostic Script-Based Interactive Experience (FASCINATE). In this case, the GPI’s role is to develop the interface between the user and the device used to view broadcasts, which could be a mobile handset, computer or television. “The goal is for users to be able to control the device without using a mouse, keyboard or remote control, for example by using gestures,” says Javier Ruiz Hidalgo.
The projects—funded by the European Commission under the Seventh Framework Program for Research and Technological Development—focus on analysis but are likely to lead to many varied applications.
When we go to see a movie like Avatar at the theater and put on our special glasses, what we are seeing is not a three-dimensional image, but rather a stereoscopic image that simulates natural vision.
Human vision is binocular (stereoscopic): we have two sensors, our two eyes, and because they are horizontally separated we receive two images of the same scene with slightly different viewpoints. The brain superimposes and interprets the images to create a sensation of depth or three-dimensional vision.
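The depth sensation described above can be quantified: for two parallel cameras, the depth of a scene point follows from how far that point shifts between the left and right images (its disparity). The sketch below uses the standard stereo relation Z = f·B/d with illustrative values (focal length, baseline, and disparity are assumptions, not figures from the article).

```python
# Hedged depth-from-disparity sketch for a parallel two-camera rig.
# f_px: focal length in pixels; baseline_m: distance between the two
# cameras in meters; disparity_px: horizontal shift of the same scene
# point between the left and right images, in pixels.
def depth_from_disparity(f_px, baseline_m, disparity_px):
    # Z = f * B / d: nearby objects shift more between the two views
    # (larger disparity), so computed depth shrinks as disparity grows.
    return f_px * baseline_m / disparity_px

# A point shifting 50 px between images from a rig with a 0.1 m
# baseline and a 500 px focal length lies about 1 m from the cameras.
print(depth_from_disparity(500.0, 0.1, 50.0))  # 1.0
```

This is essentially the computation the brain performs implicitly: the wider the separation between the two viewpoints (the baseline), the stronger the depth cue.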
Recently, the film industry has once again begun to make movies using a process that imitates human vision. Two parallel cameras are used to simultaneously capture scenes. When the movie is shown, the image recorded with the left camera is viewed only by the left eye, while the one recorded with the right camera is captured only by the right eye.
Despite the hype, stereoscopy is not a new technology for capturing and viewing images or constructing a film narrative. By contrast, the reconstruction of images in three dimensions does bring something new, because it allows the viewpoint to be freely selected after the images have been recorded.