Sunday, November 11, 2018

Single Pass Dynamic Cube Mapping [In Unity]

Traditional dynamic cube mapping needs to update all six faces in a frame.
This means we have to render the scene six times per frame.
That increases CPU overhead a lot.


Fig 1. A cube map contains six images, one for each direction.


Optimization with Geometry Shader

Fortunately, we can use a single pass and output each triangle to all six faces at once with a geometry shader.
This method needs hardware support for SV_RenderTargetArrayIndex, but it generally works on modern GPUs.



                 
   // UnityCG.cginc provides TRANSFORM_TEX and unity_ObjectToWorld
   #include "UnityCG.cginc"

   struct appdata
   {
    float4 vertex : POSITION;
    float2 uv : TEXCOORD0;
   };

   struct v2g
   {
    float2 uv : TEXCOORD0;
    float4 vertex : POSITION;
   };

   struct g2f
   {
    float2 uv : TEXCOORD0;
    float4 vertex : SV_POSITION;
    uint targetIdx : SV_RenderTargetArrayIndex;
   };

   sampler2D _MainTex;
   float4 _MainTex_ST;
   // per-face view and projection matrices, set from the CPU side
   float4x4 _CubeMapView[6];
   float4x4 _CubeMapProj[6];

   v2g vert (appdata v)
   {
    v2g o;                                // world transform here
    o.vertex = mul(unity_ObjectToWorld, v.vertex);
    o.uv = TRANSFORM_TEX(v.uv, _MainTex);

    return o;
   }

   // one triangle in, six triangles out (one per cube face): 6 * 3 = 18 vertices
   [maxvertexcount(18)]
   void geo(triangle v2g input[3], inout TriangleStream<g2f> TriStream)
   {
    [unroll]
    for (uint i = 0; i < 6; i++)
    {
     g2f o;
      o.targetIdx = i;                     // route this triangle to array slice i
     [unroll]
     for (uint j = 0; j < 3; j++)
     {
                                                // view/proj transform here
      o.uv = input[j].uv;
      o.vertex = mul(_CubeMapView[i], input[j].vertex);
      o.vertex = mul(_CubeMapProj[i], o.vertex);

      TriStream.Append(o);
     }
     TriStream.RestartStrip();
    }
   }
   
   float4 frag (g2f i) : SV_Target
   {
    // sample the texture
    float4 col = tex2D(_MainTex, i.uv);
    return col;
   }
         

The code here is simple.
It's basically an unlit shader that only renders the texture color to the render target.
The key is to separate the world transform from the view/projection transform.
The steps are as follows:

  1. Transform from local to world space in the vertex shader.
  2. Process each input triangle six times in the geometry shader, outputting 18 vertices (one triangle per cube face) using six different camera matrices.
  3. Use SV_RenderTargetArrayIndex to select the output render target slice.
Note that SV_RenderTargetArrayIndex only works with array-type resources.
So my cube map is rendered into a Texture2DArray render texture.


Work on the CPU Side

OK, the GPU side is simple.
But what do we need to do on the CPU side?
First of all, we need to create and bind our textures for rendering.


                 

        // the cube map resource that shaders will sample
        cubeMap = new RenderTexture(cubeSize, cubeSize, 16, RenderTextureFormat.ARGB32, RenderTextureReadWrite.Linear);
        cubeMap.name = "Cube map";
        cubeMap.anisoLevel = 16;
        cubeMap.antiAliasing = msaa;
        cubeMap.dimension = TextureDimension.Cube;
        cubeMap.Create();

        // the Texture2DArray render target with six slices, one per cube face
        cubeMapArray = new RenderTexture(cubeSize, cubeSize, 16, RenderTextureFormat.ARGB32, RenderTextureReadWrite.Linear);
        cubeMapArray.name = "Cube map array";
        cubeMapArray.anisoLevel = 16;
        cubeMapArray.antiAliasing = msaa;
        cubeMapArray.volumeDepth = 6;
        cubeMapArray.dimension = TextureDimension.Tex2DArray;
        cubeMapArray.Create();


Prepare a Texture2DArray with six slices, plus a cube map resource.
The code that renders the scene into the cube faces is below:


                 

        // clear target
        mainCamera.RemoveCommandBuffer(CameraEvent.BeforeForwardOpaque, cubeMapRender);
        cubeMapRender.Clear();
        cubeMapRender.SetRenderTarget(cubeMapArray, 0, CubemapFace.Unknown, -1);
        cubeMapRender.ClearRenderTarget(true, true, Color.black);

        if (cubeMapMaterial)
        {
            SetCubeFaceMatrix();

            Vector3 camPos = mainCameraTrans.position;
            RenderSceneObjectsHere();

            // copy slice to rendertexture
            for (int i = 0; i < 6; i++)
            {
                cubeMapRender.CopyTexture(cubeMapArray, i, cubeMap, i);
            }

            cubeMapRender.SetGlobalTexture("_SinglePassCube", cubeMap);
            mainCamera.AddCommandBuffer(CameraEvent.BeforeForwardOpaque, cubeMapRender);
        }

We use a command buffer for rendering instead of DrawMeshNow(),
because DrawMeshNow() submits the rendering work immediately,
which causes a performance hit.
This command buffer executes before the scene objects are rendered, much like a shadow map pass.

Use SetRenderTarget() with the Texture2DArray resource, and note that the depth slice argument must be set to -1.
Setting the depth slice to -1 makes Unity bind all slices of the Texture2DArray to the GPU.
That's exactly what we need.
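
The command buffer code above calls SetCubeFaceMatrix() without showing it. Below is a minimal sketch of what it could look like, assuming a farPlane field and the usual D3D cube-face orientations (both the field name and the exact up vectors are my assumptions; flip the up vectors if your faces come out mirrored):

        void SetCubeFaceMatrix()
        {
            // assumed face order: +X, -X, +Y, -Y, +Z, -Z
            Vector3[] dirs = { Vector3.right, Vector3.left, Vector3.up, Vector3.down, Vector3.forward, Vector3.back };
            Vector3[] ups = { Vector3.up, Vector3.up, Vector3.back, Vector3.forward, Vector3.up, Vector3.up };

            // a 90-degree FOV with square aspect covers exactly one cube face
            Matrix4x4 proj = GL.GetGPUProjectionMatrix(Matrix4x4.Perspective(90f, 1f, 0.1f, farPlane), true);

            Matrix4x4[] views = new Matrix4x4[6];
            Matrix4x4[] projs = new Matrix4x4[6];
            Vector3 pos = mainCameraTrans.position;

            for (int i = 0; i < 6; i++)
            {
                // world-to-camera = inverse of the face camera's local-to-world,
                // with the z row negated to match Unity's view-space convention
                Matrix4x4 view = Matrix4x4.LookAt(pos, pos + dirs[i], ups[i]).inverse;
                view.SetRow(2, -view.GetRow(2));
                views[i] = view;
                projs[i] = proj;
            }

            cubeMapMaterial.SetMatrixArray("_CubeMapView", views);
            cubeMapMaterial.SetMatrixArray("_CubeMapProj", projs);
        }

GL.GetGPUProjectionMatrix() takes care of the platform-specific projection convention when rendering into a texture.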

The last step is to use CopyTexture() to copy from the Texture2DArray source to the cube map resource.
Why do we need this extra step?
Can't we just bind the cube map for rendering?

The answer is: no, we just can't :/
Cube map support in Unity is really poor.
We can't bind all six faces of a cube map at once.
And we can't use a Texture2DArray as a cube map ShaderResourceView either.

In native D3D11, we would normally render into a Texture2DArray and create a cube map ShaderResourceView over it, so shaders can use the cube faces directly.

But Unity can't do this without a native plugin.
So CopyTexture() is needed here; I want to keep things as simple as possible.

Fortunately, CopyTexture() is basically the equivalent of CopyResource() in native D3D11.
It's well optimized and shouldn't hurt performance much.
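
After the copy, scene shaders can sample the result through the global texture set earlier. A minimal sketch (the reflection math here is just the standard lookup, not something specific to this post):

   samplerCUBE _SinglePassCube;

   float4 SampleReflection(float3 worldPos, float3 worldNormal)
   {
    // reflect the view ray and look it up in the dynamic cube map
    float3 viewDir = normalize(worldPos - _WorldSpaceCameraPos);
    float3 r = reflect(viewDir, worldNormal);
    return texCUBE(_SinglePassCube, r);
   }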


Result


Yeah!
Here we have a cube map with all six faces rendered.
And it only takes one pass!

Summary

Does this method really improve performance?
Frankly, it depends on the bottleneck you're facing.
This method is clearly GPU-bound:
the overhead is transferred from the CPU side to the GPU side.

Worst case:
Scene objects lie in all six faces, with no frustum overlap.
What does this mean?
For example, each of the six faces renders 20 objects, 120 objects in total.
Single pass dynamic cube mapping still draws those 120 objects.
But pipeline state changes are reduced a lot.

Limited culling:
We can at most use layer or distance culling.
Since we render objects to all six faces at once, we can't do per-face frustum culling or occlusion culling as usual.
And of course, frustum culling or occlusion culling for all six faces takes time.


Overall, it's still a good method to try!

Sunday, May 27, 2018

Async Shadow Mapping With DirectX 12


This is about rendering shadows with a native DirectX 12 plugin in Unity 2017.4.3f1.
As the title says, the shadows are rendered entirely on another thread and aren't batched.
The demo renders 10000 GameObjects on screen.
It also uses GPU instancing (not only per-instance colors, but textures too) to save performance.
(Instancing has been around for years.)

After enabling indirect drawing, shadow rendering got a 0.5 ms speedup.
This is the power of DX12!

Why is multithreaded rendering hard and limited in D3D11?

In D3D11, we submit work through contexts:


  • Immediate context - used for executing work; a D3D11 device has only one immediate context.
  • Deferred context - used for recording commands, which won't be executed immediately.
Neither context is thread-safe.
We can use deferred contexts to record work (draw, copy, map, etc.) on other threads.
But eventually, we need to synchronize our threads and use the immediate context to execute the commands recorded by the deferred contexts.

In the best case, we can divide our work into chunks and synchronize at certain points to submit it. But we can't truly submit work asynchronously.
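
A minimal sketch of this flow, assuming device, immediate, rtv, and clearColor already exist (error handling omitted):

    // worker thread: record commands into a deferred context
    ID3D11DeviceContext* deferred = nullptr;
    device->CreateDeferredContext(0, &deferred);
    deferred->ClearRenderTargetView(rtv, clearColor);
    // ... more draw/copy calls ...

    ID3D11CommandList* commandList = nullptr;
    deferred->FinishCommandList(FALSE, &commandList);

    // main thread, after synchronizing: only the immediate context may execute
    immediate->ExecuteCommandList(commandList, FALSE);
    commandList->Release();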

New rendering models in D3D12.

In D3D12, we no longer use contexts.
Instead, Microsoft provides a new model for work submission:
  • CommandAllocator - an allocator that backs a command list. (not thread-safe)
  • CommandList - an interface for recording work. (not thread-safe)
  • CommandQueue - an interface for executing the work recorded in command lists. (thread-safe)
A CommandList is similar to a deferred context: we can create different command lists on different threads and record rendering work on each.

But this time, we execute our commands through a CommandQueue.
And a CommandQueue can be used from any thread.
This makes async rendering possible. The only thing we need to care about is how to submit our work correctly.
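
A minimal sketch of the new flow, assuming device, queue, rtvHandle, and clearColor already exist (error handling omitted):

    // each thread owns its own allocator and command list; the queue is shared
    ID3D12CommandAllocator* allocator = nullptr;
    ID3D12GraphicsCommandList* cmdList = nullptr;
    device->CreateCommandAllocator(D3D12_COMMAND_LIST_TYPE_DIRECT, IID_PPV_ARGS(&allocator));
    device->CreateCommandList(0, D3D12_COMMAND_LIST_TYPE_DIRECT, allocator, nullptr, IID_PPV_ARGS(&cmdList));

    // record work on a worker thread
    cmdList->ClearRenderTargetView(rtvHandle, clearColor, 0, nullptr);
    // ... draw calls ...
    cmdList->Close();

    // any thread may submit, because ID3D12CommandQueue is thread-safe
    ID3D12CommandList* lists[] = { cmdList };
    queue->ExecuteCommandLists(1, lists);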

D3D12 also provides Bundles: we can record work into a bundle once,
then play the bundle back from command lists as many times as we like.
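
Sketched under the same assumptions (pso is a hypothetical pipeline state object):

    ID3D12CommandAllocator* bundleAlloc = nullptr;
    ID3D12GraphicsCommandList* bundle = nullptr;

    // record once into a bundle...
    device->CreateCommandAllocator(D3D12_COMMAND_LIST_TYPE_BUNDLE, IID_PPV_ARGS(&bundleAlloc));
    device->CreateCommandList(0, D3D12_COMMAND_LIST_TYPE_BUNDLE, bundleAlloc, pso, IID_PPV_ARGS(&bundle));
    // ... record draw calls ...
    bundle->Close();

    // ...then replay it from a direct command list as many times as needed
    cmdList->ExecuteBundle(bundle);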


Resource Binding in D3D12.

In D3D11, resource binding is done almost automatically.
In D3D12, resource binding is separated from resource management.
This time, we need to make sure the resources the GPU is about to use are resident.
And we can't touch resources on the CPU side while the GPU is using them.

For accessing resources, D3D12 introduces Descriptors.
A descriptor is similar to a pointer and comes in two flavors:
  • CPU handles - for immediate use, such as copying descriptors.
  • GPU handles - not for immediate use; consumed at GPU execution time.
With descriptors, we can access our resources efficiently.
We can even build a descriptor table for dynamically indexing our shader resources.

For example, we may want to use textures like this:

                 
Texture2D DiffuseMap[128] : register(t0);
SamplerState sampler_DiffuseMap;
uint _TexIndex;

float4 PS(v2f i) : SV_Target
{
    return DiffuseMap[_TexIndex].Sample(sampler_DiffuseMap, i.uv);
}


This is possible with D3D12.
In D3D11, dynamic indexing of shader resources is limited: the index can only be a literal, otherwise the compiler emits waterfall code to index the shader resource views.

Dynamic indexing provides unprecedented flexibility and unlocks new rendering techniques.

CPU/GPU Synchronization

In D3D12, we need to synchronize the CPU and GPU on our own.
D3D12 provides Fences for this.
Combined with a ring buffer implementation, fences let us synchronize the CPU and GPU efficiently.

In D3D11, synchronization is almost automatic, but it usually waits for the GPU to finish all of its jobs, which is clearly not very efficient.
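
A minimal fence sketch, again assuming device and queue (error handling omitted):

    ID3D12Fence* fence = nullptr;
    UINT64 fenceValue = 0;
    HANDLE fenceEvent = CreateEvent(nullptr, FALSE, FALSE, nullptr);
    device->CreateFence(0, D3D12_FENCE_FLAG_NONE, IID_PPV_ARGS(&fence));

    // after submitting work: ask the queue to signal when the GPU gets here
    queue->Signal(fence, ++fenceValue);

    // later, before touching resources the GPU may still be using:
    if (fence->GetCompletedValue() < fenceValue)
    {
        fence->SetEventOnCompletion(fenceValue, fenceEvent);
        WaitForSingleObject(fenceEvent, INFINITE);
    }

With a ring buffer, you keep one fence value per in-flight frame and only block when the oldest frame hasn't finished yet.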

Powerful Indirect Drawing

It's basically an enhanced version of DrawInstancedIndirect/DrawIndexedInstancedIndirect.
With this technique, a million draw calls become possible.
By packing draw calls into an indirect argument buffer, CPU overhead is reduced significantly.
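
A minimal ExecuteIndirect sketch for plain indexed-instanced draws (argBuffer and maxDraws are assumptions; no count buffer and no root-constant changes):

    // one command signature describes the layout of each argument-buffer entry
    D3D12_INDIRECT_ARGUMENT_DESC arg = {};
    arg.Type = D3D12_INDIRECT_ARGUMENT_TYPE_DRAW_INDEXED;

    D3D12_COMMAND_SIGNATURE_DESC sigDesc = {};
    sigDesc.ByteStride = sizeof(D3D12_DRAW_INDEXED_ARGUMENTS);
    sigDesc.NumArgumentDescs = 1;
    sigDesc.pArgumentDescs = &arg;

    ID3D12CommandSignature* cmdSig = nullptr;
    device->CreateCommandSignature(&sigDesc, nullptr, IID_PPV_ARGS(&cmdSig));

    // argBuffer holds maxDraws packed D3D12_DRAW_INDEXED_ARGUMENTS structs
    cmdList->ExecuteIndirect(cmdSig, maxDraws, argBuffer, 0, nullptr, 0);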

But it seems only 32-bit index buffers work properly for now.
I posted a question on the MSDN forum but still got no answer.

D3D12 in Unity

Before Unity 2017.3.0, we could only use DX12 in the editor through the command-line flag -force-d3d12.
But that wasn't a safe way to work. I originally created my demo with Unity 5.5; if I modified one line of CG shader code and made the editor compile it, Unity would crash.

Since 2017.3.0, we can use DX12 in the Unity editor, but it is still experimental.
Unity hasn't fully implemented D3D12.
For example, the target level of CG shaders still caps at Shader Model 5.0, while native D3D12 supports Shader Model 5.1.

If we want to fully utilize D3D12 now, we must write a native plugin interface.
With this interface, we can use native D3D12 rendering directly.

My demo requires Unity 2017.3.0 or above. (That version also provides 32-bit index buffers, which I need for indirect drawing.)

Summary

D3D12 gives us more control over the rendering pipeline, but it also increases the complexity of development. It's recommended to adopt D3D12 only if you are already skilled with D3D11.

D3D12 aims at reducing CPU overhead. If your project isn't CPU-bound, porting it to D3D12 won't make a significant difference.

Finally, there are already some games implemented with D3D12.
Not all developers have done well with it (some games even perform worse than their D3D11 versions).
And some of these games are done well and are impressive (for example, Gears of War 4).

Since D3D12 is such a big change, developers need to spend some time optimizing for it (the same goes for Vulkan).
I'm looking forward to the development of these new APIs; one day we will enjoy even more impressive games!

The project is available on my GitHub:
https://github.com/SquallLiu99/Async-Shadow-Mapping