Sunday, November 11, 2018

Single Pass Dynamic Cube Mapping [In Unity]

Traditional dynamic cube mapping has to update all six faces every frame,
which means rendering the scene six times per frame.
That adds a lot of CPU overhead.


Fig 1. A cube map contains six images, one for each direction.


Optimization with Geometry Shader

Fortunately, we can use a single pass and output our triangles to all six faces at once.
This method needs geometry shader support in hardware, but it generally works on modern GPUs.



                 
   struct appdata
   {
    float4 vertex : POSITION;
    float2 uv : TEXCOORD0;
   };

   struct v2g
   {
    float2 uv : TEXCOORD0;
    float4 vertex : POSITION;
   };

   struct g2f
   {
    float2 uv : TEXCOORD0;
    float4 vertex : SV_POSITION;
    uint targetIdx : SV_RenderTargetArrayIndex;
   };

   sampler2D _MainTex;
   float4 _MainTex_ST;
   float4x4 _CubeMapView[6];
   float4x4 _CubeMapProj[6];

   v2g vert (appdata v)
   {
    v2g o;                                // world transform here
    o.vertex = mul(unity_ObjectToWorld, v.vertex);
    o.uv = TRANSFORM_TEX(v.uv, _MainTex);

    return o;
   }

   [maxvertexcount(18)]
   void geo(triangle v2g input[3], inout TriangleStream<g2f> TriStream)
   {
    [unroll]
    for (uint i = 0; i < 6; i++)
    {
     g2f o;
     o.targetIdx = i;
     [unroll]
     for (uint j = 0; j < 3; j++)
     {
                                                // view/proj transform here
      o.uv = input[j].uv;
      o.vertex = mul(_CubeMapView[i], input[j].vertex);
      o.vertex = mul(_CubeMapProj[i], o.vertex);

      TriStream.Append(o);
     }
     TriStream.RestartStrip();
    }
   }
   
   float4 frag (g2f i) : SV_Target
   {
    // sample the texture
    float4 col = tex2D(_MainTex, i.uv);
    return col;
   }
         

The code is simple.
It's basically an unlit shader that only renders the texture color to the render target.
The key is to separate the world transform from the view/proj transform.
The steps are as follows:

  1. Transform from local to world space in the vertex shader.
  2. Process each input triangle six times, and output six triangles (18 vertices), one per cube face, using six different camera matrices.
  3. Use SV_RenderTargetArrayIndex to choose the output render target.
It's important to note that SV_RenderTargetArrayIndex only works with array-type resources.
So my cube map is rendered into a texture2darray render texture.
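
The shader expects _CubeMapView and _CubeMapProj to be filled from script, which the post does not show. Below is a minimal sketch of how that might look, not the original implementation. The class name CubeFaceMatrices and its methods are hypothetical, and the face order and up vectors assume the usual +X, -X, +Y, -Y, +Z, -Z cube map layout. Depending on the graphics API, the result may still need a vertical flip.

```csharp
using UnityEngine;

// Hypothetical helper (not from the original post): builds one view matrix and
// one projection matrix per cube face and uploads them as global shader arrays.
public static class CubeFaceMatrices
{
    // Assumed face order: +X, -X, +Y, -Y, +Z, -Z.
    static readonly Vector3[] Forward =
    {
        Vector3.right, Vector3.left, Vector3.up, Vector3.down, Vector3.forward, Vector3.back
    };

    // Up vector used for each face (assumed standard cube map convention).
    static readonly Vector3[] Up =
    {
        Vector3.up, Vector3.up, Vector3.back, Vector3.forward, Vector3.up, Vector3.up
    };

    public static Matrix4x4[] BuildViews(Vector3 center)
    {
        var views = new Matrix4x4[6];
        // Unity's view space looks down -Z, hence the Z flip.
        Matrix4x4 flipZ = Matrix4x4.Scale(new Vector3(1f, 1f, -1f));
        for (int i = 0; i < 6; i++)
        {
            Quaternion rot = Quaternion.LookRotation(Forward[i], Up[i]);
            Matrix4x4 faceToWorld = Matrix4x4.TRS(center, rot, Vector3.one);
            views[i] = flipZ * faceToWorld.inverse;
        }
        return views;
    }

    public static Matrix4x4[] BuildProjections(float near, float far)
    {
        // 90 degree field of view with aspect 1, so the six faces cover every direction.
        Matrix4x4 proj = Matrix4x4.Perspective(90f, 1f, near, far);
        // Convert to the platform's clip space conventions for rendering into a texture.
        proj = GL.GetGPUProjectionMatrix(proj, true);

        var projs = new Matrix4x4[6];
        for (int i = 0; i < 6; i++) projs[i] = proj;
        return projs;
    }

    public static void Upload(Vector3 center, float near, float far)
    {
        Shader.SetGlobalMatrixArray("_CubeMapView", BuildViews(center));
        Shader.SetGlobalMatrixArray("_CubeMapProj", BuildProjections(near, far));
    }
}
```

Something like CubeFaceMatrices.Upload(probePosition, 0.1f, 100f) would then run once per frame before the cube map draw calls are recorded; the near and far values here are placeholders.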


Work on the CPU Side

OK, the GPU side is simple.
But what do we need to do on the CPU side?
First of all, we need to create and bind our textures for rendering.


                 

        cubeMap = new RenderTexture(cubeSize, cubeSize, 16, RenderTextureFormat.ARGB32, RenderTextureReadWrite.Linear);
        cubeMap.name = "Cube map";
        cubeMap.anisoLevel = 16;
        cubeMap.antiAliasing = msaa;
        cubeMap.dimension = TextureDimension.Cube;
        cubeMap.Create();

        cubeMapArray = new RenderTexture(cubeSize, cubeSize, 16, RenderTextureFormat.ARGB32, RenderTextureReadWrite.Linear);
        cubeMapArray.name = "Cube map array";
        cubeMapArray.anisoLevel = 16;
        cubeMapArray.antiAliasing = msaa;
        cubeMapArray.volumeDepth = 6;
        cubeMapArray.dimension = TextureDimension.Tex2DArray;
        cubeMapArray.Create();


This prepares a texture2darray with six slices and a cubemap resource.
The code to render the scene into the cube faces is below:


                 

        // clear target
        mainCamera.RemoveCommandBuffer(CameraEvent.BeforeForwardOpaque, cubeMapRender);
        cubeMapRender.Clear();
        cubeMapRender.SetRenderTarget(cubeMapArray, 0, CubemapFace.Unknown, -1);
        cubeMapRender.ClearRenderTarget(true, true, Color.black);

        if (cubeMapMaterial)
        {
            SetCubeFaceMatrix();

            Vector3 camPos = mainCameraTrans.transform.position;
            RenderSceneObjectsHere();

            // copy slice to rendertexture
            for (int i = 0; i < 6; i++)
            {
                cubeMapRender.CopyTexture(cubeMapArray, i, cubeMap, i);
            }

            cubeMapRender.SetGlobalTexture("_SinglePassCube", cubeMap);
            mainCamera.AddCommandBuffer(CameraEvent.BeforeForwardOpaque, cubeMapRender);
        }

We use a command buffer for rendering instead of DrawMeshNow(),
because DrawMeshNow() submits its rendering work immediately,
which causes a performance hit.
The command buffer executes before the scene objects are rendered, much like a shadow map pass.
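
The snippet above leaves RenderSceneObjectsHere() undefined. One possible shape for it is sketched below, as an assumption rather than the original code: the helper CubeMapDraw.RecordDraws is hypothetical, and it treats the length of sharedMaterials as the submesh count. It only records draw commands into the command buffer; nothing is submitted until the camera executes the buffer.

```csharp
using UnityEngine;
using UnityEngine.Rendering;

// Hypothetical helper (not from the original post): records one draw per submesh
// into the command buffer. The geometry shader then replicates every triangle
// to all six slices of the texture2darray.
public static class CubeMapDraw
{
    public static int RecordDraws(CommandBuffer cmd, Renderer[] renderers, Material cubeMapMaterial)
    {
        int recorded = 0;
        foreach (Renderer r in renderers)
        {
            if (r == null || !r.enabled) continue;

            // Assumption: one material slot per submesh.
            int subMeshCount = r.sharedMaterials.Length;
            for (int s = 0; s < subMeshCount; s++)
            {
                // Shader pass 0 is assumed to be the single-pass cube map pass.
                cmd.DrawRenderer(r, cubeMapMaterial, s, 0);
                recorded++;
            }
        }
        return recorded;
    }
}
```

Because every object is drawn with the same cubeMapMaterial, they all sample the same _MainTex in this sketch. Per-object textures would need extra work, for example a MaterialPropertyBlock on each renderer.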

Call SetRenderTarget() with the texture2darray resource, and make sure to set the depth slice value to -1.
A depth slice of -1 makes Unity bind all slices of the texture2darray to the GPU.
That's exactly what we need.

As the last step, use CopyTexture() to copy from the texture2darray to the cubemap.
Why do we need this extra step?
Can't we just bind the cubemap for rendering?

The answer is: no, we just can't :/
Cubemap support in Unity is really poor.
We can't bind all six faces of a cubemap at once.
And we can't use a texture2darray as a cubemap ShaderResourceView either.

In native D3D11, we would normally render the cube faces into a texture2darray and create a cubemap ShaderResourceView on the same resource, so the shader can sample it as a cubemap directly.

But Unity can't do that without a native plugin.
So CopyTexture() is needed here; I want to keep things as simple as possible.

Fortunately, CopyTexture() is basically the equivalent of CopyResource() in native D3D11.
It's well optimized and shouldn't hurt performance much.


Result


Yeah!
Here we have a cubemap with all six faces rendered.
And it only takes one pass!!

Summary

Does this method really improve performance?
Frankly, that depends on the bottleneck you are facing.
This method is clearly GPU-bound:
the overhead is moved from the CPU side to the GPU side.

Worst case:
The scene objects are spread across all six faces, and no object overlaps more than one frustum.
What does this mean?
For example, say we have six faces and each one renders 20 different objects.
That's 120 objects in total, and single pass dynamic cube mapping still draws all 120 of them.
But pipeline state changes are reduced a lot.

Limited culling:
We are mostly limited to layer or distance culling.
Since we render the objects to all faces at once, we can't do frustum culling or occlusion culling the normal way.
And of course, doing frustum culling or occlusion culling for all six faces takes time.
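
Layer and distance culling can be done with a simple filter before the draw calls are recorded. The sketch below is only an illustration: CubeMapCulling, PassesCull and Filter are hypothetical names, and using the center of the renderer's bounds as the object position is an assumption.

```csharp
using System.Collections.Generic;
using UnityEngine;

// Hypothetical helper (not from the original post): keeps only renderers that are
// on an allowed layer and close enough to the cube map position.
public static class CubeMapCulling
{
    public static bool PassesCull(int layer, Vector3 objectPos, Vector3 cubePos,
                                  LayerMask mask, float maxDistance)
    {
        bool layerOk = (mask.value & (1 << layer)) != 0;
        // Compare squared distances to avoid a square root per object.
        bool nearEnough = (objectPos - cubePos).sqrMagnitude <= maxDistance * maxDistance;
        return layerOk && nearEnough;
    }

    public static List<Renderer> Filter(IEnumerable<Renderer> candidates, Vector3 cubePos,
                                        LayerMask mask, float maxDistance)
    {
        var result = new List<Renderer>();
        foreach (Renderer r in candidates)
        {
            if (r == null) continue;
            // Assumption: the bounds center stands in for the object position.
            if (PassesCull(r.gameObject.layer, r.bounds.center, cubePos, mask, maxDistance))
                result.Add(r);
        }
        return result;
    }
}
```

The filtered list would then be the input for whatever records the draw calls into the command buffer.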


Overall, it's still a good method to try!