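Using compute shaders

This tutorial walks you through creating a minimal compute shader. First, though, here is some background on compute shaders and how they work in Godot.

Note

This tutorial assumes you are broadly familiar with shaders. If you are new to shaders, please read "Introduction to shaders" and "Your first shader" before continuing with this tutorial.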

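A compute shader is a special type of shader program oriented towards general-purpose programming. In other words, compute shaders are more flexible than vertex and fragment shaders because they have no fixed purpose (such as transforming vertices or writing colors to an image). Unlike vertex and fragment shaders, compute shaders have very little going on behind the scenes. The code you write is what the GPU runs, and very little else. This makes them a useful tool for offloading heavy calculations to the GPU.

Now let's get started by creating a short compute shader.

First, in the external text editor of your choice, create a new file called compute_example.glsl in your project folder. Compute shaders in Godot are written directly in GLSL. The Godot shader language is based on GLSL, so if you are familiar with normal shaders in Godot, the syntax below will look somewhat familiar.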

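Note

Compute shaders can only be used from RenderingDevice-based renderers (Forward+ or Mobile). To follow along with this tutorial, make sure you are using the Forward+ or Mobile renderer. The renderer setting is located in the top-right corner of the editor.

Note that although mobile devices technically support compute shaders, that support is generally poor (due to driver bugs).

Let's take a look at this compute shader code: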

#[compute]
#version 450

// Invocations in the (x, y, z) dimension
layout(local_size_x = 2, local_size_y = 1, local_size_z = 1) in;

// A binding to the buffer we create in our script
layout(set = 0, binding = 0, std430) restrict buffer MyDataBuffer {
    float data[];
}
my_data_buffer;

// The code we want to execute in each invocation
void main() {
    // gl_GlobalInvocationID.x uniquely identifies this invocation across all work groups
    my_data_buffer.data[gl_GlobalInvocationID.x] *= 2.0;
}

This code takes an array of floats, multiplies each element by 2, and stores the results back in the same array. Now let's look at it line by line.

#[compute]
#version 450

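These two lines communicate two things:

The following code is a compute shader. This is a Godot-specific hint that the editor needs in order to import the shader file correctly.

The code uses GLSL version 450.

When writing compute shaders, you should always begin the file with these two lines.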

// Invocations in the (x, y, z) dimension
layout(local_size_x = 2, local_size_y = 1, local_size_z = 1) in;

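Next, we communicate the number of invocations to be used in each workgroup. Invocations are instances of the shader that run within the same workgroup. When we launch a compute shader from the CPU, we tell it how many workgroups to run. Workgroups run in parallel with each other. While running, one workgroup cannot access information in another workgroup. However, invocations in the same workgroup have some limited access to each other.

You can think of workgroups and invocations as one giant nested for loop.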

for (int x = 0; x < workgroup_size_x; x++) {
    for (int y = 0; y < workgroup_size_y; y++) {
        for (int z = 0; z < workgroup_size_z; z++) {
            // Each workgroup runs independently and in parallel.
            for (int local_x = 0; local_x < invocation_size_x; local_x++) {
                for (int local_y = 0; local_y < invocation_size_y; local_y++) {
                    for (int local_z = 0; local_z < invocation_size_z; local_z++) {
                        // Compute shader runs here.
                    }
                }
            }
        }
    }
}

Workgroups and invocations are an advanced topic. For now, just remember that we run two invocations per workgroup.
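To make the nested-loop picture concrete, here is a minimal sketch in Python that models a one-dimensional dispatch on the CPU. It is not Godot or GLSL code: the function name global_invocation_ids and the count of 5 workgroups are assumptions chosen for illustration. It relies on the GLSL rule that gl_GlobalInvocationID equals gl_WorkGroupID * gl_WorkGroupSize + gl_LocalInvocationID.

```python
# CPU-side model of a 1D compute dispatch. Illustration only, not Godot API.
def global_invocation_ids(num_groups_x: int, local_size_x: int) -> list[int]:
    """Return the gl_GlobalInvocationID.x value of every invocation."""
    ids = []
    for group_x in range(num_groups_x):        # models gl_WorkGroupID.x
        for local_x in range(local_size_x):    # models gl_LocalInvocationID.x
            ids.append(group_x * local_size_x + local_x)
    return ids


# Assumed example: 5 workgroups, local_size_x = 2 as in the shader above.
print(global_invocation_ids(num_groups_x=5, local_size_x=2))
# [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```

On a real GPU these invocations run in parallel and in no guaranteed order; the loop only shows which IDs exist.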

// A binding to the buffer we create in our script
layout(set = 0, binding = 0, std430) restrict buffer MyDataBuffer {
    float data[];
}
my_data_buffer;

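Here we provide information about the memory that the compute shader will have access to. The layout property tells the shader where to look for the buffer. We will need to match these set and binding positions from the CPU side later.

The restrict keyword tells the shader that this buffer will only be accessed from one place in this shader. In other words, we won't bind this buffer to another set or binding index. This matters because it allows the shader compiler to optimize the shader code. Always use restrict when you can.

This is an unsized buffer, which means it can be any size. We therefore need to be careful not to read from an index beyond the end of the buffer.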

// The code we want to execute in each invocation
void main() {
    // gl_GlobalInvocationID.x uniquely identifies this invocation across all work groups
    my_data_buffer.data[gl_GlobalInvocationID.x] *= 2.0;
}

Finally, we write the main function, which is where all the logic happens. We access a position in the storage buffer using the built-in variable gl_GlobalInvocationID, which gives you the globally unique ID of the current invocation.

To continue, write the code above into your newly created compute_example.glsl file.

Create a local RenderingDevice

To interact with and execute a compute shader, we need a script. Create a new script in the language of your choice and attach it to any Node in your scene.

Now to execute our shader we need a local RenderingDevice which can be created using the RenderingServer:

# Create a local rendering device.
var rd := RenderingServer.create_local_rendering_device()

// Create a local rendering device.
var rd = RenderingServer.CreateLocalRenderingDevice();

After that, we can load the newly created shader file compute_example.glsl and create a precompiled version of it using this:

# Load GLSL shader
var shader_file := load("res://compute_example.glsl")
var shader_spirv: RDShaderSPIRV = shader_file.get_spirv()
var shader := rd.shader_create_from_spirv(shader_spirv)

// Load GLSL shader
var shaderFile = GD.Load<RDShaderFile>("res://compute_example.glsl");
var shaderBytecode = shaderFile.GetSpirV();
var shader = rd.ShaderCreateFromSpirV(shaderBytecode);

Warning

Local RenderingDevices cannot be debugged using tools such as RenderDoc.

Provide input data

As you might remember, we want to pass an input array to our shader, multiply each element by 2 and get the results.

We need to create a buffer to pass values to a compute shader. We are dealing with an array of floats, so we will use a storage buffer for this example. A storage buffer takes an array of bytes and allows the CPU to transfer data to and from the GPU.

So let's initialize an array of floats and create a storage buffer:

# Prepare our data. We use floats in the shader, so we need 32 bit.
var input := PackedFloat32Array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
var input_bytes := input.to_byte_array()

# Create a storage buffer that can hold our float values.
# Each float has 4 bytes (32 bit) so 10 x 4 = 40 bytes
var buffer := rd.storage_buffer_create(input_bytes.size(), input_bytes)

// Prepare our data. We use floats in the shader, so we need 32 bit.
float[] input = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10];
var inputBytes = new byte[input.Length * sizeof(float)];
Buffer.BlockCopy(input, 0, inputBytes, 0, inputBytes.Length);

// Create a storage buffer that can hold our float values.
// Each float has 4 bytes (32 bit) so 10 x 4 = 40 bytes
var buffer = rd.StorageBufferCreate((uint)inputBytes.Length, inputBytes);
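The byte layout that the storage buffer receives can be checked outside Godot. The following is a minimal Python sketch, assuming a little-endian host (the common case on desktop hardware); pack_floats is a hypothetical helper, not part of any Godot API. It shows that ten 32-bit floats occupy 40 bytes and that 1.0 is stored as the bytes 00 00 80 3f.

```python
import struct


def pack_floats(values: list[float]) -> bytes:
    """Pack values as consecutive little-endian 32-bit floats."""
    return struct.pack(f"<{len(values)}f", *values)


values = [float(v) for v in range(1, 11)]  # 1.0 through 10.0
input_bytes = pack_floats(values)

print(len(input_bytes))       # 40 (10 floats x 4 bytes each)
print(input_bytes[:4].hex())  # 0000803f (the float 1.0)
```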

With the buffer in place we need to tell the rendering device to use this buffer. To do that we will need to create a uniform (like in normal shaders) and assign it to a uniform set which we can pass to our shader later.

# Create a uniform to assign the buffer to the rendering device
var uniform := RDUniform.new()
uniform.uniform_type = RenderingDevice.UNIFORM_TYPE_STORAGE_BUFFER
uniform.binding = 0 # this needs to match the "binding" in our shader file
uniform.add_id(buffer)
var uniform_set := rd.uniform_set_create([uniform], shader, 0) # the last parameter (the 0) needs to match the "set" in our shader file

// Create a uniform to assign the buffer to the rendering device
var uniform = new RDUniform
{
    UniformType = RenderingDevice.UniformType.StorageBuffer,
    Binding = 0
};
uniform.AddId(buffer);
var uniformSet = rd.UniformSetCreate([uniform], shader, 0);

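Defining a compute pipeline

The next step is to create a set of instructions that our GPU can execute. We need a pipeline and a compute list for that.

The steps we need to perform to compute our result are:

Create a new pipeline.
Begin a list of instructions for our GPU to execute.
Bind our compute list to our pipeline.
Bind our buffer uniform to our pipeline.
Specify how many workgroups to use.
End the list of instructions.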

# Create a compute pipeline
var pipeline := rd.compute_pipeline_create(shader)
var compute_list := rd.compute_list_begin()
rd.compute_list_bind_compute_pipeline(compute_list, pipeline)
rd.compute_list_bind_uniform_set(compute_list, uniform_set, 0)
rd.compute_list_dispatch(compute_list, 5, 1, 1)
rd.compute_list_end()

// Create a compute pipeline
var pipeline = rd.ComputePipelineCreate(shader);
var computeList = rd.ComputeListBegin();
rd.ComputeListBindComputePipeline(computeList, pipeline);
rd.ComputeListBindUniformSet(computeList, uniformSet, 0);
rd.ComputeListDispatch(computeList, xGroups: 5, yGroups: 1, zGroups: 1);
rd.ComputeListEnd();

Note that we are dispatching the compute shader with 5 work groups in the X axis, and one in the others. Since we have 2 local invocations in the X axis (specified in our shader), 10 compute shader invocations will be launched in total. If you read from or write to indices outside the range of your buffer, you may access memory outside of your shader's control or parts of other variables, which may cause issues on some hardware.
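When the element count is not a multiple of the local size, the dispatch has to round up, and the extra invocations must skip their out-of-range index. The Python sketch below models this on the CPU; work_groups_needed and run_guarded are hypothetical names, and this is not Godot API. In GLSL, the corresponding guard would be an early return when gl_GlobalInvocationID.x is not less than my_data_buffer.data.length().

```python
def work_groups_needed(element_count: int, local_size_x: int) -> int:
    """Smallest number of workgroups that covers element_count (ceiling division)."""
    return (element_count + local_size_x - 1) // local_size_x


def run_guarded(data: list[float], local_size_x: int) -> list[float]:
    """Model the doubling shader with a bounds check on every invocation."""
    out = list(data)
    groups = work_groups_needed(len(out), local_size_x)
    for global_id in range(groups * local_size_x):
        if global_id >= len(out):
            continue  # Out of range: this invocation does nothing.
        out[global_id] *= 2.0
    return out


print(work_groups_needed(10, 2))        # 5, as in the dispatch above
print(work_groups_needed(11, 2))        # 6, the last workgroup is half idle
print(run_guarded([1.0, 2.0, 3.0], 2))  # [2.0, 4.0, 6.0]
```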

Execute a compute shader

After all of this we are almost done, but we still need to execute our pipeline. So far we have only recorded what we would like the GPU to do; we have not actually run the shader program.

To execute our compute shader we need to submit the pipeline to the GPU and wait for the execution to finish:

# Submit to GPU and wait for sync
rd.submit()
rd.sync()

// Submit to GPU and wait for sync
rd.Submit();
rd.Sync();

Ideally, you would not call sync() immediately after submitting, because it makes the CPU wait for the GPU to finish working. In our example, we synchronize right away because we want to read the data back immediately. In general, you will want to wait at least 2 or 3 frames before synchronizing so that the GPU is able to run in parallel with the CPU.

Warning

Long computations can cause Windows graphics drivers to "crash" when Windows triggers Timeout Detection and Recovery (TDR). This mechanism reinitializes the graphics driver after a certain amount of time has passed without any activity from the graphics driver (usually 5 to 10 seconds).

Depending on the duration your compute shader takes to execute, you may need to split it into multiple dispatches to reduce the time each dispatch takes and reduce the chances of triggering a TDR. Given TDR is time-dependent, slower GPUs may be more prone to TDRs when running a given compute shader compared to a faster GPU.
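One way to plan such a split is to divide the total workgroup count into fixed-size chunks. This is a minimal Python sketch of the bookkeeping only; split_dispatch and the chunk size of 4 are assumptions for illustration. The shader would also need to know each chunk's starting offset, for example through a push constant or an extra buffer, which is not shown here.

```python
def split_dispatch(total_groups: int, max_groups_per_dispatch: int) -> list[tuple[int, int]]:
    """Return (first_group, group_count) pairs that together cover total_groups."""
    if max_groups_per_dispatch <= 0:
        raise ValueError("max_groups_per_dispatch must be positive")
    chunks = []
    first_group = 0
    while first_group < total_groups:
        count = min(max_groups_per_dispatch, total_groups - first_group)
        chunks.append((first_group, count))
        first_group += count
    return chunks


# Assumed example: 10 workgroups, at most 4 per dispatch.
print(split_dispatch(10, 4))  # [(0, 4), (4, 4), (8, 2)]
```

Each pair would become one dispatch call, so no single dispatch handles more than the chosen number of workgroups.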

Retrieving results

You may have noticed that, in the example shader, we modified the contents of the storage buffer. In other words, the shader read from our array and stored the data in the same array again so our results are already there. Let's retrieve the data and print the results to our console.

# Read back the data from the buffer
var output_bytes := rd.buffer_get_data(buffer)
var output := output_bytes.to_float32_array()
print("Input: ", input)
print("Output: ", output)

// Read back the data from the buffers
var outputBytes = rd.BufferGetData(buffer);
var output = new float[input.Length];
Buffer.BlockCopy(outputBytes, 0, output, 0, outputBytes.Length);
GD.Print("Input: ", string.Join(", ", input));
GD.Print("Output: ", string.Join(", ", output));
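To check the printed output, it can help to have a CPU reference for what the shader should produce. The Python sketch below decodes a byte buffer as 32-bit floats and doubles each value; expected_output is a hypothetical helper and a little-endian layout is assumed. It is a model for comparison, not Godot API.

```python
import struct


def expected_output(input_bytes: bytes) -> list[float]:
    """Decode little-endian 32-bit floats and double each one, like the shader."""
    count = len(input_bytes) // 4  # 4 bytes per 32-bit float
    values = struct.unpack(f"<{count}f", input_bytes)
    return [value * 2.0 for value in values]


input_bytes = struct.pack("<10f", *[float(v) for v in range(1, 11)])
print(expected_output(input_bytes))
# [2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0, 18.0, 20.0]
```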

With that, you have everything you need to get started working with compute shaders.

See also

The demo projects repository contains a Compute Shader Heightmap demo. This project performs heightmap image generation on the CPU and GPU separately, which lets you compare how a similar algorithm can be implemented in two different ways (with the GPU implementation being faster in most cases).