
About the Volumetric Working Groups

VFA’s working groups collaborate to integrate volumetric video’s tech verticals

How We Work

VFA’s Working Groups collaborate to develop specifications that will allow the interoperable integration of what are today the disparate technologies and products required to create true three-dimensional video.

Volumetric technology depends on several distinct technologies (we call them Verticals) communicating with one another. Our members participate in several volumetric working groups to ensure the interoperability of Capture, Process, Compress, and Playback.

What Are the Challenges of Volumetric Video?

The biggest challenge in volumetric video is the lack of interoperability: there are four different ways to capture, four different ways to process, three different ways to compress, and four different ways to render.

 

Currently the technology is fragmented. All the members of the VFA are committed to developing interoperability specifications and best practices that are defined by real-world uses.

Capture

An array of cameras captures a real scene.

Process

Camera input is converted to 3D models.

Compress

3D models are prepared to be streamed over a network.

Playback

Volumetric captures are played back on devices.

Capture

Capture & Acquisition Working Group


The capture and acquisition working group focuses on the first vertical, known as capture.

 

In this step, a person or scene is surrounded by over a dozen volumetric cameras.


There are four ways to perform volumetric capture: time of flight, structured light, photogrammetry & multiview depth, and stereo disparity. After the scene is captured, it is processed in the reconstruction step.

Time of flight

This technique generates a color image and a depth image by emitting infrared light and measuring how long it takes to return, which gives the distance from the camera to the person being captured. Microsoft’s Azure Kinect camera uses time of flight.
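The underlying idea fits in a few lines. The sketch below is illustrative only (not any vendor’s SDK): the camera measures the round-trip time of the emitted infrared pulse, and the distance is half the path travelled at the speed of light.

```python
# Illustrative time-of-flight distance calculation, not vendor SDK code.
SPEED_OF_LIGHT_M_PER_S = 299_792_458.0

def tof_distance_m(round_trip_time_s: float) -> float:
    """The pulse travels to the subject and back, so the distance is half the path."""
    return SPEED_OF_LIGHT_M_PER_S * round_trip_time_s / 2.0

# A pulse that returns after 10 nanoseconds hit a surface roughly 1.5 m away.
print(tof_distance_m(10e-9))  # ~1.499
```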

Photogrammetry & Multiview Depth

This technique uses several color cameras to generate a depth image or a point cloud of the person being captured. The cameras produce both depth and color images.

Structured Light

This uses two monochrome cameras that can detect infrared, a color camera, and a laser that projects infrared dots onto the scene. This technique generates a depth image and a color image. Intel’s RealSense camera uses structured light.

Stereo Disparity

This technique uses two color cameras to simulate the left and right eye, generating a depth image and a single color image taken from one of the two cameras.
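As a rough sketch of the underlying math (the focal length and baseline below are made-up calibration numbers), depth falls out of the classic pinhole relation depth = focal length × baseline / disparity:

```python
import numpy as np

def depth_from_disparity(disparity_px: np.ndarray,
                         focal_length_px: float,
                         baseline_m: float) -> np.ndarray:
    """Pinhole stereo relation: depth = f * B / d; zero disparity means no match."""
    depth = np.zeros_like(disparity_px, dtype=np.float64)
    valid = disparity_px > 0
    depth[valid] = focal_length_px * baseline_m / disparity_px[valid]
    return depth

# Hypothetical rig: 700 px focal length, 12 cm baseline between the two cameras.
disparity = np.array([[35.0, 70.0],
                      [0.0, 14.0]])
print(depth_from_disparity(disparity, focal_length_px=700.0, baseline_m=0.12))
# [[2.4 1.2]
#  [0.  6. ]]
```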

Process

Reconstruction & Encoding Working Group


The reconstruction and encoding working group focuses on the second vertical, known as process.

 

During Process, a 3D world is reconstructed from the captured scene. This step involves processing the 2D videos into 3D models. The four methods for processing are depth map, voxel, point cloud, and mesh generation. The 3D models are then sent to compression so they can be streamed over the network.

Depth Map Generation

A depth map is an image that represents the location of every pixel (or subpixel) in 3D space. This is similar to how a terrain map works, but far more precise. Depth map generation can be built into volumetric cameras such as Intel’s RealSense cameras or Microsoft’s Azure Kinect cameras.
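To make the idea concrete, here is a minimal sketch (plain NumPy, with illustrative intrinsics rather than a real camera’s calibration) of how each depth pixel maps to a point in 3D space through the pinhole camera model:

```python
import numpy as np

def depth_map_to_points(depth_m: np.ndarray,
                        fx: float, fy: float, cx: float, cy: float) -> np.ndarray:
    """Unproject every valid depth pixel into camera-space XYZ using the pinhole model."""
    h, w = depth_m.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))   # pixel coordinates
    z = depth_m
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    return points[points[:, 2] > 0]                  # drop pixels with no depth reading

# Illustrative intrinsics; real values come from the camera's calibration data.
depth = np.full((480, 640), 1.5)                     # a flat surface 1.5 m away
print(depth_map_to_points(depth, fx=525.0, fy=525.0, cx=319.5, cy=239.5).shape)
```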

Voxel Generation

This technique usually takes a point cloud and turns each point into a 3D cube. This fills in any holes in the point cloud. Rather than a single color per voxel, a small color image can be wrapped around each voxel to represent the subject being captured.
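A simplified sketch of voxelization, assuming we keep only one averaged color per voxel (real pipelines, as noted above, may wrap a small color image around each voxel instead):

```python
import numpy as np

def voxelize(points_xyz: np.ndarray, colors_rgb: np.ndarray, voxel_size_m: float):
    """Snap points onto a regular grid; each occupied cell becomes one cube (voxel)
    colored with the average of the points that fell into it."""
    cells = np.floor(points_xyz / voxel_size_m).astype(np.int64)
    uniq, inverse = np.unique(cells, axis=0, return_inverse=True)
    inverse = inverse.ravel()                      # flatten for bincount
    counts = np.bincount(inverse, minlength=len(uniq))
    voxel_colors = np.zeros((len(uniq), 3))
    for channel in range(3):
        voxel_colors[:, channel] = (
            np.bincount(inverse, weights=colors_rgb[:, channel], minlength=len(uniq))
            / counts
        )
    centers = (uniq + 0.5) * voxel_size_m          # cube centers in metres
    return centers, voxel_colors

# Synthetic cloud: 10,000 random points (and colors) inside a 1 m cube.
pts = np.random.rand(10_000, 3)
cols = np.random.rand(10_000, 3)
centers, colors = voxelize(pts, cols, voxel_size_m=0.05)
print(len(centers), "voxels")
```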

Point Cloud Generation

This technique can use SLAM or a depth map to generate points in 3D space to represent the subject being captured.  These points also include a color value that matches the color of the subject being captured.
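A minimal sketch of building a colored point cloud from a depth map plus a color image; it assumes the color image is already registered to the depth image, and the intrinsics below are illustrative:

```python
import numpy as np

def colored_point_cloud(depth_m: np.ndarray, color_rgb: np.ndarray,
                        fx: float, fy: float, cx: float, cy: float):
    """Unproject every valid depth pixel and pair it with the color at the same
    pixel, giving an (N, 3) array of XYZ positions and an (N, 3) array of colors."""
    h, w = depth_m.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    valid = depth_m > 0                      # pixels where the camera saw something
    z = depth_m[valid]
    x = (u[valid] - cx) * z / fx
    y = (v[valid] - cy) * z / fy
    return np.stack([x, y, z], axis=-1), color_rgb[valid]

# Synthetic frame: a flat grey surface 2 m from the camera.
depth = np.full((480, 640), 2.0)
color = np.full((480, 640, 3), 128, dtype=np.uint8)
xyz, rgb = colored_point_cloud(depth, color, 525.0, 525.0, 319.5, 239.5)
print(xyz.shape, rgb.shape)                  # (307200, 3) (307200, 3)
```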

Mesh Generation

This technique usually takes a point cloud and turns every three points into a triangle that represents the surface of the subject being captured. This results in hundreds to thousands of triangles connected to each other as a single mesh. The color of the mesh is usually stitched together from the color cameras and then applied as a UV map.
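A much-simplified sketch: when the points come from a regular depth-image grid, neighbouring points can be connected directly into triangles (general, unstructured point clouds need a surface-reconstruction step instead, which is beyond this example):

```python
import numpy as np

def grid_mesh(h: int, w: int) -> np.ndarray:
    """Connect an h-by-w grid of points (e.g. unprojected depth pixels) into
    triangles: every 2x2 block of neighbouring points becomes two triangles."""
    idx = np.arange(h * w).reshape(h, w)             # vertex index of each grid point
    tl, tr = idx[:-1, :-1].ravel(), idx[:-1, 1:].ravel()
    bl, br = idx[1:, :-1].ravel(), idx[1:, 1:].ravel()
    upper = np.stack([tl, tr, bl], axis=1)           # upper-left triangle of the block
    lower = np.stack([tr, br, bl], axis=1)           # lower-right triangle of the block
    return np.concatenate([upper, lower])            # (2*(h-1)*(w-1), 3) vertex indices

faces = grid_mesh(480, 640)
print(len(faces), "triangles")                       # 612162 triangles
```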

Compress

Decode & Render Working Group


The decode and render working group collaborates to address compression techniques.

 

The 3D models are compressed so they can be streamed over a network. There are three different methods of compression: mesh compression, point cloud compression, and depth & UV map compression.

Mesh Compression

This technique compresses a mesh’s data over a sequence of frames so it can be streamed over a network. The playback device needs to decompress the mesh but does not have to generate it from scratch.
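As an illustration of the principle only (real mesh codecs do far more), the sketch below quantizes vertex positions and stores a key frame plus per-frame deltas, which is the kind of small, repetitive data an entropy coder can then shrink:

```python
import numpy as np

def compress_vertex_stream(frames_xyz, step_m=0.001):
    """Quantize vertex positions to 1 mm, then keep frame 0 plus per-frame deltas.
    Because the mesh moves only slightly between frames, the deltas are small and
    compress far better than raw positions when fed to an entropy coder (not shown)."""
    quantized = [np.round(f / step_m).astype(np.int32) for f in frames_xyz]
    key_frame = quantized[0]
    deltas = [quantized[i] - quantized[i - 1] for i in range(1, len(quantized))]
    return key_frame, deltas

def decompress_vertex_stream(key_frame, deltas, step_m=0.001):
    frames = [key_frame]
    for d in deltas:
        frames.append(frames[-1] + d)
    return [f * step_m for f in frames]

# Two synthetic frames of the same 1,000-vertex mesh, shifted slightly between frames.
f0 = np.random.rand(1000, 3)
f1 = f0 + 0.002
key, deltas = compress_vertex_stream([f0, f1])
recovered = decompress_vertex_stream(key, deltas)
print(np.abs(recovered[1] - f1).max())               # within the 0.5 mm quantization error
```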

Depth & UV Map Compression

This technique compresses the depth maps and UV maps over a sequence of frames so they can be streamed over a network. The playback device then has to generate points, a mesh, or voxels in order to render the model.
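One common trick, shown here as a simplified, lossless-only sketch (not a VFA specification): split the 16-bit depth values into two 8-bit planes so they can ride alongside the color video in ordinary 8-bit streams.

```python
import numpy as np

def split_depth_16_to_8(depth_mm: np.ndarray):
    """Split a 16-bit depth image into high and low byte planes so each can be
    carried in an ordinary 8-bit video stream and recombined after decoding."""
    high = (depth_mm >> 8).astype(np.uint8)
    low = (depth_mm & 0xFF).astype(np.uint8)
    return high, low

def merge_depth_8_to_16(high: np.ndarray, low: np.ndarray) -> np.ndarray:
    return (high.astype(np.uint16) << 8) | low.astype(np.uint16)

# Round-trips exactly here; a lossy video codec would disturb the low byte, so real
# pipelines need a more error-tolerant depth packing than this sketch.
depth = np.random.randint(0, 65536, size=(480, 640), dtype=np.uint16)   # depth in mm
high, low = split_depth_16_to_8(depth)
assert np.array_equal(merge_depth_8_to_16(high, low), depth)
```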

Point Cloud Compression

This technique compresses the points over a sequence of frames so they can be streamed over a network. The playback device will either render the decompressed points as-is, generate a mesh and render the mesh, or turn the points into voxels and render the voxels.
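A toy sketch of per-frame geometry compression, under the assumption that 10 bits per axis is enough precision: quantize each point inside its bounding box, pack the three coordinates into one integer key, and delta-encode the sorted keys before handing them to an entropy coder (not shown):

```python
import numpy as np

def pack_point_cloud(points_xyz: np.ndarray, bits: int = 10):
    """Quantize each axis to `bits` bits inside the cloud's bounding box, pack the
    three coordinates into one integer key per point, and delta-encode the sorted
    keys; the small deltas are what an entropy coder would then shrink."""
    lo = points_xyz.min(axis=0)
    extent = np.maximum(points_xyz.max(axis=0) - lo, 1e-9)
    scale = ((1 << bits) - 1) / extent
    q = np.round((points_xyz - lo) * scale).astype(np.uint64)
    keys = (q[:, 0] << (2 * bits)) | (q[:, 1] << bits) | q[:, 2]
    keys = np.unique(keys)                           # sorts and collapses duplicates
    deltas = np.diff(keys, prepend=keys[:1])
    return deltas, lo, 1.0 / scale                   # enough to reconstruct positions

pts = np.random.rand(100_000, 3)
deltas, origin, step = pack_point_cloud(pts)
print(len(deltas), "encoded points")
```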

Playback

Out-of-Mux Interactivity Working Group

 

The Out-of-Mux Interactivity Working Group focuses on playback of the 3D model on devices.

 

The four modes of playback are traditional 2D video, 3D rendering on a big screen TV, 3D rendering on an XR device, and 3D rendering on a smartphone.

Traditional 2D Video

This technique can combine the volumetrically captured model with the SFX from a movie or TV show and output 2D video so it can be part of the movie or show. Another example is a live sports event, in which a virtual camera can be used to generate a highlight from the game that is then displayed on the stadium’s big screens.
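In essence, the virtual camera is a projection of the 3D model into an ordinary 2D frame. The sketch below (plain NumPy, with hypothetical camera parameters) splats a colored point cloud through a pinhole camera with a simple depth test:

```python
import numpy as np

def render_points_2d(points_xyz, colors_rgb, fx, fy, cx, cy, width, height):
    """Project camera-space points through a virtual pinhole camera and splat their
    colors into a 2D frame, keeping the nearest point per pixel (a simple z-test)."""
    image = np.zeros((height, width, 3), dtype=np.uint8)
    zbuffer = np.full((height, width), np.inf)
    z = points_xyz[:, 2]
    in_front = z > 0                                  # ignore points behind the camera
    u = np.round(points_xyz[in_front, 0] * fx / z[in_front] + cx).astype(int)
    v = np.round(points_xyz[in_front, 1] * fy / z[in_front] + cy).astype(int)
    zs, cols = z[in_front], colors_rgb[in_front]
    on_screen = (u >= 0) & (u < width) & (v >= 0) & (v < height)
    for ui, vi, zi, ci in zip(u[on_screen], v[on_screen], zs[on_screen], cols[on_screen]):
        if zi < zbuffer[vi, ui]:
            zbuffer[vi, ui] = zi
            image[vi, ui] = ci
    return image

# A hypothetical colored cloud placed 2-3 m in front of the virtual camera,
# rendered into a 1280x720 frame that could feed a normal 2D video pipeline.
pts = np.random.rand(50_000, 3) + np.array([0.0, 0.0, 2.0])
cols = np.random.randint(0, 256, (50_000, 3), dtype=np.uint8)
frame = render_points_2d(pts, cols, 900.0, 900.0, 640.0, 360.0, 1280, 720)
print(frame.shape)                                    # (720, 1280, 3)
```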

3D Rendering on an XR Device

This technique sends the 3D model to an AR or VR device, allowing the viewer to walk around the 3D model and/or blend it into the real world.

3D Rendering on a Big Screen TV

This technique sends the 3D model to a game console, a set-top box / streaming stick, or a smart TV. The viewer can use a remote control to select the perspective they’d like from a menu of options. The viewer can also use a game controller to control the virtual camera that is rendering the 3D model.

3D Rendering on a Smartphone

This technique lets the user move around the 3D model using the smartphone’s touch screen and/or project the 3D model into the real world using the AR capabilities built into the smartphone.
