How to maximize AR, VR performance with advanced stereo rendering

By Rob Srinivasiah, Unity



With Unity 2017.2, we released support for Stereo Instancing for XR devices running on DX11, meaning that developers will have access to even more performance optimizations for HTC Vive, Oculus Rift, and the brand new Windows Mixed Reality immersive headsets. We thought we would take this opportunity to tell you more about this exciting rendering advancement and how you can take advantage of it.

Brief history

One of the unique, and obvious, aspects of XR rendering is the necessity to generate two views, one per eye. We need these two views in order to generate the stereoscopic 3D effect for the viewer. But before we dive deeper into how we could render two viewpoints, let's take a look into the classic single viewpoint case.

In a traditional rendering environment, we render our scene from a single view. We take our objects and transform them into a space that's appropriate for rendering. We do this by applying a series of transformations to our objects, where we take them from a locally defined space, into a space that we can draw on our screen.

The classic transformation pipeline starts with objects in their own local (object) space. We then transform the objects with our model or world matrix, in order to bring the objects to world space. World space is a common space for the initial, relative placement of objects. Next, we transform our objects from world space to view space, with our view matrix. Now our objects are arranged relative to our viewpoint. Once in view space, we can project them onto our 2D screen with our projection matrix, putting the objects into clip space. The perspective divide follows, resulting in NDC (normalized device coordinate) space, and finally, the viewport transform is applied, resulting in screen space. Once we are in screen space, we can generate fragments for our render target. For the purposes of our discussion, we will just be rendering to a single render target.

This series of transformations is sometimes referred to as the "graphics transformation pipeline", and is a classic technique in rendering.
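To make the chain of spaces concrete, here is a minimal sketch in plain Python that pushes one point through each step. Everything in it is an assumption for illustration: the helper names (mat_vec, translation, perspective, to_screen) are hypothetical, and the projection uses an OpenGL-style convention (view space looks down -Z, NDC depth in [-1, 1]), which is not necessarily the convention of any particular engine or API.

```python
import math

def mat_vec(m, v):
    """Multiply a 4x4 row-major matrix by a 4-component column vector."""
    return [sum(m[r][c] * v[c] for c in range(4)) for r in range(4)]

def translation(tx, ty, tz):
    """Build a 4x4 translation matrix."""
    return [[1, 0, 0, tx],
            [0, 1, 0, ty],
            [0, 0, 1, tz],
            [0, 0, 0, 1]]

def perspective(fov_y_deg, aspect, near, far):
    """OpenGL-style perspective projection (assumed convention)."""
    f = 1.0 / math.tan(math.radians(fov_y_deg) / 2.0)
    return [[f / aspect, 0, 0, 0],
            [0, f, 0, 0],
            [0, 0, (far + near) / (near - far), 2 * far * near / (near - far)],
            [0, 0, -1, 0]]

def to_screen(local_pos, model, view, proj, width, height):
    """Carry a local-space point all the way to screen space."""
    world = mat_vec(model, local_pos + [1.0])     # local -> world
    eye = mat_vec(view, world)                    # world -> view
    clip = mat_vec(proj, eye)                     # view -> clip
    ndc = [clip[i] / clip[3] for i in range(3)]   # perspective divide -> NDC
    screen_x = (ndc[0] * 0.5 + 0.5) * width       # viewport transform
    screen_y = (ndc[1] * 0.5 + 0.5) * height
    return screen_x, screen_y, ndc[2]

# Object origin placed 5 units in front of a camera sitting at the world origin.
model = translation(0.0, 0.0, -5.0)
view = translation(0.0, 0.0, 0.0)   # identity view matrix
proj = perspective(90.0, 1.0, 0.1, 100.0)
screen_x, screen_y, depth = to_screen([0.0, 0.0, 0.0], model, view, proj, 800, 600)
```

A point on the camera's forward axis lands in the center of the 800 x 600 target. For stereo, the only inputs that change between eyes are the view and projection matrices.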


Besides current XR rendering, there were scenarios where we wanted to present simultaneous viewpoints. Maybe we had split-screen rendering for local multiplayer. We might have had a separate mini-viewpoint that we would use for an in-game map or security camera feed. These alternative views might share scene data with each other, but they often share little else besides the final destination render target.

At a minimum, each view usually owns its own view and projection matrices. In order to composite the final render target, we also need to manipulate other properties of the graphics transformation pipeline. In the 'early' days when we had only one render target, we could use viewports to dictate sub-rects on the screen to render into. As GPUs and their corresponding APIs evolved, we were able to render into separate render targets and manually composite them later.

Enter the XRagon

Modern XR devices introduced the requirement of driving two views in order to provide the stereoscopic 3D effect that creates depth for the device wearer. Each view represents an eye. While the two eyes are viewing the same scene from a similar vantage point, each view does possess a unique set of view and projection matrices.

Before proceeding, a quick aside into defining some terminology. These aren't necessarily industry standard terms, as rendering engineers tend to have varied terms and definitions across different engines and use cases. Treat these terms as a local convenience.

Scene graph - A scene graph is a term used to describe a data structure that organizes the information needed in order to render our scene and is consumed by the renderer. The scene graph can refer to either the scene in its entirety, or the portion visible to the view, which we will call the culled scene graph.

Render loop/pipeline - The render loop refers to the logical architecture of how we compose the rendered frame. A high level example of a render loop could be this:

Culling -> Shadows -> Opaque -> Transparent -> Post Processing -> Present

We go through these stages every frame in order to generate an image to present to the display. We also use the term render pipeline at Unity, as it relates to some upcoming rendering features we are exposing (e.g. Scriptable Render Pipeline). Render pipeline should not be confused with similar terms such as the graphics pipeline, which refers to the GPU pipeline that processes draw commands.

OK, with those definitions, we can get back to VR rendering.

Multi-Camera

In order to render the view for each eye, the simplest method is to run the render loop twice. Each eye will configure and run through its own iteration of the render loop. At the end, we will have two images that we can submit to the display device. The underlying implementation uses two Unity cameras, one for each eye, and they run through the process of generating the stereo images. This was the initial method of XR support in Unity, and is still provided by 3rd party headset plugins.

While this method certainly works, Multi-Camera relies on brute force, and is the least efficient as far as the CPU and GPU are concerned. The CPU has to iterate twice through the render loop completely, and the GPU is likely not able to take advantage of any caching of objects drawn twice across the eyes.

Multi-Pass

Multi-Pass was Unity's initial attempt to optimize the XR render loop. The core idea was to extract portions of the render loop that were view-independent. This means that any work that is not explicitly reliant on the XR eye viewpoints doesn't need to be done per eye.

The most obvious candidate for this optimization would be shadow rendering. Shadows are not explicitly reliant on the camera viewer location. Unity actually implements shadows in two steps: generate cascaded shadow maps and then map the shadows into screen space. For multi-pass, we can generate one set of cascaded shadow maps, and then generate two screen space shadow maps, as the screen space shadow maps are dependent on the viewer location. Because of how our shadow generation is architected, the screen space shadow maps benefit from locality as the shadow map generation loop is relatively tightly coupled. This can be compared to the remaining render workload, which requires a full iteration over the render loop before returning to a similar stage (e.g. the eye specific opaque passes are separated by the remaining render loop stages).
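The difference in shadow work can be sketched by logging which stages run for which eye. This is only an illustrative model in Python, not Unity's actual render loop: the stage names and the functions multi_camera_frame, multi_pass_frame and stage_counts are hypothetical, and culling is left out to keep the focus on shadows.

```python
from collections import Counter

EYES = ["left", "right"]
PER_EYE_STAGES = ["opaque", "transparent", "post_processing"]

def multi_camera_frame():
    """Brute force: every stage, including cascade generation, runs once per eye."""
    log = []
    for eye in EYES:
        log.append(("shadow_cascades", eye))
        log.append(("screen_space_shadows", eye))
        log.extend((stage, eye) for stage in PER_EYE_STAGES)
    return log

def multi_pass_frame():
    """View-independent cascades run once; view-dependent stages still run per eye."""
    log = [("shadow_cascades", "shared")]
    # Both screen space shadow maps are generated back to back,
    # right after the cascades they read from.
    log.extend(("screen_space_shadows", eye) for eye in EYES)
    # The remaining stages run as one full pass per eye.
    for eye in EYES:
        log.extend((stage, eye) for stage in PER_EYE_STAGES)
    return log

def stage_counts(log):
    """Count how many times each stage ran, regardless of eye."""
    return Counter(stage for stage, _ in log)

camera_counts = stage_counts(multi_camera_frame())
pass_counts = stage_counts(multi_pass_frame())
```

In this model the cascade stage drops from two executions to one, while the screen space shadow maps and the opaque, transparent and post-processing stages still run once per eye.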

The other step that can be shared between the two eyes might not be obvious at first: we can perform a single cull for both eyes. With our initial implementation, we used frustum culling to generate two lists of objects, one per eye. However, we could create a unified culling frustum shared between our two eyes (see this post by Cass Everitt). This means that each eye renders slightly more than it would with its own culling frustum, but we considered the benefits of a single cull to outweigh the cost of some extra vertex shader work, clipping, and rasterization.
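One simple way to build such a shared frustum, shown here as a sketch under strong assumptions, is to pull the apex back behind the eyes until its side planes pass through both eye positions. The Python below assumes parallel eyes with identical, symmetric horizontal fields of view, which real headsets generally do not have, and it is not necessarily the construction used by Unity or by the referenced post. The function names and the 64 mm eye separation are hypothetical.

```python
import math

def combined_frustum(eye_separation, half_fov_deg, near, far):
    """Return (pull_back, near, far) for a frustum enclosing both eye frusta.

    The apex sits pull_back units behind the midpoint between the eyes, so the
    near and far distances grow by the same amount to stay on the same planes.
    """
    pull_back = (eye_separation / 2.0) / math.tan(math.radians(half_fov_deg))
    return pull_back, near + pull_back, far + pull_back

def inside_horizontal(apex_x, apex_z, half_fov_deg, px, pz):
    """True if (px, pz) is inside the horizontal wedge of a frustum looking down -Z."""
    depth = apex_z - pz                 # distance in front of the apex
    if depth <= 0.0:
        return False
    half_width = depth * math.tan(math.radians(half_fov_deg))
    return abs(px - apex_x) <= half_width + 1e-9

# Eyes at x = -0.032 and x = +0.032, both at z = 0, each with a 45 degree half-FOV.
pull_back, shared_near, shared_far = combined_frustum(0.064, 45.0, 0.1, 100.0)

# A point near the left edge: seen by the left eye only, but kept by the shared cull.
point = (-10.02, -10.0)
seen_by_left = inside_horizontal(-0.032, 0.0, 45.0, *point)
seen_by_right = inside_horizontal(0.032, 0.0, 45.0, *point)
kept_by_shared = inside_horizontal(0.0, pull_back, 45.0, *point)
```

The sample point shows the trade-off described above: the right eye still has to process an object it cannot see, in exchange for culling only once.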

Multi-Pass offered us some nice savings over Multi-Camera, but there was still more to do. Which brought us to...

Single-Pass

Single-Pass Stereo Rendering means that we make a single traversal of the entire render loop, instead of traversing all of it, or certain portions of it, twice.

In order to perform both draws, we need to make sure that we have all the constant data bound, along with an index.

What about the draws themselves? How can we perform each draw? In Multi-Pass, the two eyes each have their own render target, but we can't do that for Single-Pass because the cost of toggling render targets for consecutive draw calls would be prohibitive. A similar option would be to use render target arrays, but we would need to export the slice index out of the geometry shader on most platforms, which can also be expensive on the GPU, and invasive for existing shaders.

The solution we settled upon was to use a Double-Wide render target, and switch the viewport between draw calls, allowing each eye to render into half of the Double-Wide render target. While switching viewports does incur a cost, it's less than switching render targets, and less invasive than using the geometry shader (though Double-Wide presents its own set of challenges, particularly with post-processing). There is also the related option of using viewport arrays, but they have the same issue as render target arrays, in that the index can only be exported from a geometry shader. There is yet another technique that uses dynamic clipping, which we won't explore here.

Now that we have a solution to kick off two consecutive draws in order to render both eyes, we need to configure our supporting infrastructure. In Multi-Pass, because it was similar to monoscopic rendering, we could use our existing view and projection matrix infrastructure. We simply had to replace the view and projection matrix with the matrices sourced from the current eye. However, with single-pass, we don't want to toggle constant buffer bindings unnecessarily. So instead, we bind both eyes' view and projection matrices and index into them with unity_StereoEyeIndex, which we can flip between the draws. This allows our shader infrastructure to choose which set of view and projection matrices to render with, inside the shader pass.
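The per-eye state for a Double-Wide draw can be sketched as follows. This is a simplified Python model, not engine code: eye_viewport and draw_object_stereo are hypothetical helpers, the strings stand in for real 4x4 matrices, and we assume the left eye occupies the left half of the target.

```python
def eye_viewport(eye_index, eye_width, eye_height):
    """Viewport (x, y, width, height) for one eye inside a double-wide target."""
    return (eye_index * eye_width, 0, eye_width, eye_height)

# Both eyes' matrices are bound once, up front. Placeholder strings stand in
# for the actual matrices.
stereo_view = ["view_left", "view_right"]
stereo_proj = ["proj_left", "proj_right"]

def draw_object_stereo(name, eye_width, eye_height):
    """Emit one draw per eye; only the viewport and the eye index change."""
    commands = []
    for stereo_eye_index in (0, 1):
        commands.append({
            "object": name,
            "viewport": eye_viewport(stereo_eye_index, eye_width, eye_height),
            # The shader would pick its matrices with the eye index.
            "view": stereo_view[stereo_eye_index],
            "proj": stereo_proj[stereo_eye_index],
        })
    return commands

commands = draw_object_stereo("crate", 1024, 1024)
```

The constant data is never rebound between the two draws. The only per-draw updates are the viewport rectangle and the index used to select the matrices.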

One extra detail: In order to minimize our viewport and unity_StereoEyeIndex state changes, we can modify our eye draw pattern. Instead of drawing left, right, left, right, and so on, we can instead use the left, right, right, left, left, etc. cadence. This allows us to halve the number of state updates compared to the alternating cadence.
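A quick way to check the saving is to count how often the eye changes between consecutive submissions. The Python sketch below uses hypothetical helper names and treats a viewport switch plus an eye index update as a single state change.

```python
def eye_order(num_draws, ping_pong):
    """Eye index (0 = left, 1 = right) for each submission of num_draws stereo draws."""
    order = []
    for draw in range(num_draws):
        # With ping-pong, each draw starts on the eye the previous draw ended on.
        first = draw % 2 if ping_pong else 0
        order.extend([first, 1 - first])
    return order

def state_changes(order):
    """Number of times the eye changes between consecutive submissions."""
    return sum(1 for prev, cur in zip(order, order[1:]) if prev != cur)

alternating = state_changes(eye_order(4, ping_pong=False))   # left, right, left, right, ...
ping_ponged = state_changes(eye_order(4, ping_pong=True))    # left, right, right, left, ...
```

For four draws the alternating order needs 7 changes and the ping-pong order needs 4. In general that is 2N - 1 versus N changes for N draws, which approaches half as N grows.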

Single-Pass isn't exactly twice as fast as Multi-Pass. This is because Multi-Pass was already optimized for culling and shadows, and because we are still dispatching a draw per eye and switching viewports, which does incur some CPU and GPU cost.

There is more information in the Unity Manual page for Single-Pass Stereo Rendering.

Stereo Instancing (Single-Pass Instanced)

Previously, we mentioned the possibility of using a render target array. Render target arrays are a natural solution for stereo rendering. The eye textures share format and size, qualifying them to be used in a render target array. But using the geometry shader in order to export the array slice is a large drawback. What we really want is the ability to export the render target array index from the vertex shader, allowing for simpler integration and better performance.

The ability to export render target array index from the vertex shader does actually exist on some GPUs and APIs, and is becoming more prevalent. On DX11, this functionality is exposed as a feature option, VPAndRTArrayIndexFromAnyShaderFeedingRasterizer.

Now that we can dictate which slice of our render target array we will render to, how can we select the slice? We leverage the existing infrastructure from Single-Pass Double-Wide. We can use unity_StereoEyeIndex to populate the SV_RenderTargetArrayIndex semantic in the shader. On the API side, we no longer need to toggle the viewport, as the same viewport can be used for both slices of the render target array. And we already have our matrices configured to be indexed from the vertex shader.

Though we could continue to use the existing technique of issuing two draws and toggling the value of unity_StereoEyeIndex in the constant buffer before each draw, there is a more efficient technique. We can use GPU Instancing in order to issue a single draw call and allow the GPU to multiplex our draws across both eyes. We can double the existing instance count of a draw (if the draw does not already use instancing, we just set the instance count to 2). Then in the vertex shader, we can decode the instance ID in order to determine which eye we are rendering to.
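The doubling and decoding can be sketched as below. The bit layout is an assumption made for illustration: we put the eye in the lowest bit of the instance ID and the original instance in the remaining bits. The function names are hypothetical, and real shader code would do this in the vertex shader rather than in Python.

```python
def stereo_instance_count(instance_count):
    """Double the instance count; a non-instanced draw is treated as one instance."""
    return max(instance_count, 1) * 2

def decode_instance_id(instance_id):
    """Split a stereo instance ID into (eye_index, original_instance_id)."""
    eye_index = instance_id & 1            # lowest bit selects the eye (assumed layout)
    original_instance = instance_id >> 1   # remaining bits are the original instance
    return eye_index, original_instance

# A draw that originally had 2 instances is submitted once with 4 instances.
decoded = [decode_instance_id(i) for i in range(stereo_instance_count(2))]
```

With this layout, consecutive instance IDs alternate between the left and right eye, and each original instance appears exactly once per eye. The decoded eye index can then select the matrices and the render target array slice.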

The biggest impact of using this technique is we literally halve the number of draw calls we generate on the API side, saving a chunk of CPU time. Additionally, the GPU itself is able to more efficiently process the draws, even though the same amount of work is being generated, since it doesn't have to process two individual draw calls. We also minimize state updates by not having to change the viewport between draws, like we do in traditional Single-Pass.

Please note: This will only be available to users running their desktop VR experiences on Windows 10 or HoloLens.

Single-Pass Multi-View

Multi-View is an extension available on certain OpenGL/OpenGL ES implementations where the driver itself handles the multiplexing of individual draw calls across both eyes. Instead of explicitly instancing the draw call and decoding the instance into an eye index in the shader, the driver is responsible for duplicating the draws and generating the array index (via gl_ViewID) in the shader.

There is one underlying implementation detail that differs from stereo instancing: instead of the vertex shader explicitly selecting the render target array slice which will be rasterized to, the driver itself determines the render target. gl_ViewID is used to compute view dependent state, but not to select the render target. In usage, it doesn't matter much to the developer, but is an interesting detail.

Because of how we use the Multi-View extension, we are able to use the same infrastructure that we built for Single-Pass Instancing. Developers are able to use the same scaffolding to support both Single-Pass techniques.

High level performance overview

At Unite Austin 2017, the XR Graphics team presented on some of the XR Graphics infrastructure, and had a quick discussion on the performance impact of the various stereo rendering modes (you can watch the talk here). A proper performance analysis could belong in its own blog, but we can quickly go over this chart.

As you can see, Single-Pass and Single-Pass Instancing represent a significant CPU advantage over Multi-Pass. However, the delta between Single-Pass and Single-Pass Instancing is relatively small. The reasoning is that the bulk of the CPU overhead is already saved by switching to Single-Pass. Single-Pass Instancing does reduce the number of draw calls, but that cost is quite low compared to processing the scene graph. And when you consider most modern graphics drivers are multi-threaded, issuing draw calls can be quite fast on the dispatching CPU thread.