Looking for help to identify a regression when using Clang-based OpenCL platform with LuxMark/LuxCore

illwieckz
Posts: 19
Joined: Tue May 17, 2022 11:41 pm

Post by illwieckz »

Hi, I found a bug related to the `-ffast-math` option of the LLVM Clang compiler. It can be reproduced when compiling the OpenCL code used by LuxMark.

I'm looking for help to narrow down the reproduction case, in the hope of getting this Clang bug fixed.

Some context: the Mesa Clover radeonsi OpenCL driver for AMD GCN and later devices relies on the LLVM Clang compiler. I noticed a regression when running the LuxMark v3.1 LuxBall benchmark: it now renders garbage with the default OpenCL compiler options. After some investigation, it appears the bug is not in Mesa but in LLVM, and is reproduced as soon as `-cl-fast-relaxed-math` is enabled. Further investigation shows the culprit is `-ffast-math`, which appears to be automatically enabled by `-cl-fast-relaxed-math`: enabling `-ffast-math` without `-cl-fast-relaxed-math` reproduces the bug, and removing the dependency of `-cl-fast-relaxed-math` on `-ffast-math` works around it.

One problem I face is that reproducing the bug requires running the complete LuxMark v3.1 benchmark, which does not make it easy to identify what's wrong.

Even LuxMark 3 is hard to build today (though I wrote a script to make it easy). This makes me think it would be good if newer LuxMark versions kept the LuxBall scene, even if not as the default one, just because it's a simple scene.

So I'm looking for help to narrow down the reproduction case to help identify the bug in Clang. Here is the issue on the Mesa side but, more importantly, here is the issue on the LLVM side; this comment may be relevant.

Maybe it would be useful to identify which kernel is suffering from the bug, to begin with? I feel like I did the maximum I could and hit my limit, and I don't know what else I can do myself to help.

For reference, this is what the bug looks like,

without `-ffast-math` (without `-cl-fast-relaxed-math`):

Image

with `-ffast-math` (implied by default `-cl-fast-relaxed-math`):

Image
Last edited by illwieckz on Thu Oct 20, 2022 11:46 pm, edited 2 times in total.
Dade
Developer
Posts: 5672
Joined: Mon Dec 04, 2017 8:36 pm
Location: Italy

Re: Looking for help to identify a compiler bug in Clang uncovered by LuxMark/LuxCore

Post by Dade »

LuxMark v3.1 is very old and the latest OpenCL code has changed a lot so it may not show the same bug at all.

From the look of your rendering, I can tell you that you probably have NaNs "traveling" across the code and landing on pixels: NaNs are like a poison, everything they touch becomes a NaN itself. I guess some math function is returning a NaN somewhere instead of a correct value, but tracking down where may be very hard.
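
To make the poisoning effect concrete, here is a minimal Python sketch. It is not LuxCore code: `shade` and `accumulate` are hypothetical stand-ins for kernel math and pixel accumulation. It only shows that a single NaN sample is enough to turn a whole pixel into NaN:

```python
import math

def shade(albedo, light):
    """Toy shading step: a plain product standing in for real kernel math."""
    return albedo * light

def accumulate(samples):
    """Average per-sample radiance values into one pixel value."""
    return sum(samples) / len(samples)

good = [shade(0.75, 1.0), shade(0.75, 0.5)]
bad = good + [shade(0.75, math.nan)]   # one poisoned sample among valid ones

print(accumulate(good))                # 0.5625
print(math.isnan(accumulate(bad)))     # True: the single NaN poisons the pixel
```

Checking intermediate values with an `isnan`-style test at several points in the pipeline is one way to find where the first NaN appears.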

Re: Looking for help to identify a compiler bug in Clang uncovered by LuxMark/LuxCore

Post by illwieckz »

Thank you for your answer. I was afraid investigating this from the LuxMark side would not be easy, and you seem to confirm it. One interesting thing I noticed in the meantime is that PoCL pthread (a CPU implementation), which also relies heavily on LLVM, suffered from a very similar problem with LLVM 11. This means a very similar bug was introduced for CPUs between LLVM 10 and LLVM 11 and fixed in LLVM 12, while the bug affecting AMD GCN GPUs is known to have been introduced between LLVM 9 and LLVM 11 (I did not find an easy way to test LLVM 10) but was never fixed.

I wrote more about my findings there.

I wonder if it would be possible to produce a simpler scene involving just this surface, since we know this part is badly rendered while looking simple. That may help reduce the number of LuxMark OpenCL functions involved:

Image

Re: Looking for help to identify a compiler bug in Clang uncovered by LuxMark/LuxCore

Post by illwieckz »

I'm still able to reproduce the bug when I reduce the luxball/scene.scn scene file to this:

Code: Select all

scene.camera.lookat.orig = 0.649246990680694580078125 0.279195994138717651367188 0.148695006966590881347656
scene.camera.lookat.target = 0.107509911060333251953125 1.0487911701202392578125 -0.189305052161216735839844
scene.camera.up = 0 0 1
scene.lights.infinitelight.type = infinite
scene.lights.infinitelight.file = scenes/luxball/imagemap-00000.exr
scene.materials.material-0x7f4e0b47a5e0.type = matte
scene.objects.extmesh-0x7f4e0b4c3920.material = material-0x7f4e0b47a5e0
scene.objects.extmesh-0x7f4e0b4c3920.ply = scenes/luxball/mesh-00006.ply

This may help to identify which OpenCL code is failing, by reducing the amount of OpenCL code the scene relies on to be rendered.

I've noticed those material types reproduce the bug:

Code: Select all

matte
roughmatte
cloth

While these do not:

Code: Select all

mattetranslucent
mirror
metal2
glossy2
velvet
archglass
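
Testing each material type by hand is tedious, so here is a small Python sketch that generates one reduced scene per material type. The property names are copied from the scene above; the helper names (`make_scene`, `make_all_scenes`) are hypothetical, and it assumes every material type works with its default parameters, as `matte` does in the reduced scene:

```python
# Reduced LuxBall scene with a placeholder for the material type.
SCENE_TEMPLATE = """\
scene.camera.lookat.orig = 0.649246990680694580078125 0.279195994138717651367188 0.148695006966590881347656
scene.camera.lookat.target = 0.107509911060333251953125 1.0487911701202392578125 -0.189305052161216735839844
scene.camera.up = 0 0 1
scene.lights.infinitelight.type = infinite
scene.lights.infinitelight.file = scenes/luxball/imagemap-00000.exr
scene.materials.material-0x7f4e0b47a5e0.type = {material_type}
scene.objects.extmesh-0x7f4e0b4c3920.material = material-0x7f4e0b47a5e0
scene.objects.extmesh-0x7f4e0b4c3920.ply = scenes/luxball/mesh-00006.ply
"""

MATERIAL_TYPES = [
    "matte", "roughmatte", "cloth",          # reported as reproducing the bug
    "mattetranslucent", "mirror", "metal2",  # reported as not reproducing it
    "glossy2", "velvet", "archglass",
]

def make_scene(material_type):
    """Return the reduced scene text using the given material type."""
    return SCENE_TEMPLATE.format(material_type=material_type)

def make_all_scenes():
    """Map each material type to its scene text."""
    return {name: make_scene(name) for name in MATERIAL_TYPES}

if __name__ == "__main__":
    for name, text in make_all_scenes().items():
        print(f"# --- {name} ---")
        print(text)
```

Each generated text can be saved as `scene.scn` in turn and rendered with and without `-cl-fast-relaxed-math`.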

Re: Looking for help to identify a compiler bug in Clang uncovered by LuxMark/LuxCore

Post by illwieckz »

I verified that ROCm is affected as well, which is not surprising since both Clover and ROCm rely on LLVM.

Here is the issue as reported on ROCm side: https://github.com/RadeonOpenCompute/ROCm/issues/1828

I provide a script to (re)build LuxMark 3.1 on modern systems; I successfully tested it on Ubuntu 22.04.1. It requires GCC 10. Details and instructions can be found there.

It may be possible to tweak the OpenCL code to find out where the issue is.

For example, I noticed that modifying the `kdVal` value in the `MatteMaterial_ConstSample()` function from the `LuxCore/include/slg/materials/materialdefs_funcs_matte.cl` file can make the bug disappear (but obviously introduces a bias that makes the render unable to validate). Details can be found there.

I'm looking for help to identify how that `kdVal` value becomes wrong. Basically, when the bug happens, `kdVal` gets a wrong value that is clamped to 1.

Re: Looking for help to identify a compiler bug in Clang uncovered by LuxMark/LuxCore

Post by illwieckz »

I did more research. The bug happens when the `MatteMaterial_ConstSample()` function returns `(1.0, 1.0, 1.0)`.

So, something looks really wrong. `(1.0, 1.0, 1.0)` is neither NaN nor zero, so we can rule out both this value propagating as a NaN into another computation and another computation dividing by it as a zero.

Details can be found there: https://github.com/llvm/llvm-project/issues/54947#issuecomment-1276482676

The only affected drivers are the ones relying on current LLVM amdgcn bytecode.

I need help to know what makes use of the return value of `MatteMaterial_ConstSample()`.
Last edited by illwieckz on Thu Oct 13, 2022 1:03 pm, edited 1 time in total.

Re: Looking for help to identify a compiler bug in Clang uncovered by LuxMark/LuxCore

Post by illwieckz »

So, after more research, it appears that the bug is reproducible as soon as at least one component of the vec3 returned by `MatteMaterial_ConstSample` is not zero.

This function is probably not the cause. The input is believed to be always valid, as it is believed to always come from hardcoded values or images set in the scene file. The bug can also be reproduced with output that is believed to be valid.

The bug can be reproduced with a simple scene like this:

Code: Select all

scene.camera.lookat.orig = 0.65 0.28 0.15
scene.camera.lookat.target = 0.11 1.05 -0.19
scene.camera.up = 0 0 1
scene.lights.infinitelight.type = infinite
scene.lights.infinitelight.file = scenes/luxball/imagemap-00000.exr
scene.materials.material-0x7f4e0b47a5e0.type = matte
scene.materials.material-0x7f4e0b47a5e0.kd = 0.75 0.75 0.75
scene.objects.extmesh-0x7f4e0b4c3920.material = material-0x7f4e0b47a5e0
scene.objects.extmesh-0x7f4e0b4c3920.ply = scenes/luxball/mesh-00006.ply

So there are only three things: geometry, matte surface, lighting. Geometry is probably not the cause (both flat surfaces and spheres are affected). The matte material is probably not the cause (its inputs are believed to be valid, and outputs believed to be valid still reproduce the bug). This leaves us with the lighting code.

More details about my investigations can be found there: https://github.com/RadeonOpenCompute/ROCm/issues/1828#issuecomment-1277517852

I need help to find where in LuxCore the lighting code used by LuxMark 3.1 for such a very simple scene lives, in this tree:

https://github.com/LuxCoreRender/LuxCore/tree/luxmark_v3.1/

Re: Looking for help to identify a compiler bug in Clang uncovered by LuxMark/LuxCore

Post by illwieckz »

I looked for code in LuxCore that contributes to the lighting. I picked `AdvancePaths_MK_HIT_OBJECT` from `include/slg/engines/pathocl/kernels/pathocl_kernels_micro.cl`, which has a comment mentioning `MK_DL_ILLUMINATE`.

I tried to modify it in a way that produces an obvious effect on the render while behaving differently depending on whether `-cl-fast-relaxed-math` is used or not.

I did this:

Code: Select all

diff --git a/include/slg/engines/pathocl/kernels/pathocl_kernels_micro.cl b/include/slg/engines/pathocl/kernels/pathocl_kernels_micro.cl
index 9d15fc36c..b8119ab0d 100644
--- a/include/slg/engines/pathocl/kernels/pathocl_kernels_micro.cl
+++ b/include/slg/engines/pathocl/kernels/pathocl_kernels_micro.cl
@@ -206,8 +206,6 @@ __kernel __attribute__((work_group_size_hint(64, 1, 1))) void AdvancePaths_MK_HI
        // Read the path state
        __global GPUTaskState *taskState = &tasksState[gid];
        PathState pathState = taskState->state;
-       if (pathState != MK_HIT_OBJECT)
-               return;
 
        //--------------------------------------------------------------------------
        // Start of variables setup

The visual effect produced is obvious enough to make me sure my modification did something, and it differs depending on whether `-cl-fast-relaxed-math` is used or not.

More details can be found there:

https://github.com/RadeonOpenCompute/ROCm/issues/1828#issuecomment-1278046464

I need help to know how and where `taskState->state` is computed.

Re: Looking for help to identify a compiler bug in Clang uncovered by LuxMark/LuxCore

Post by illwieckz »

I managed to bisect the old LLVM 11 development history, which I suspected to have introduced the regression, while keeping an environment able to build Mesa so I could check the behaviour of LuxMark at every bisect step. I found the LLVM commit that introduced the regression. In this commit, what introduces the regression is an optimization applied to the `FMUL` operation, here:

https://github.com/llvm/llvm-project/blob/29a2b20ab363bcc0b9573e358a5ad12c0eddca86/llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp#L12508-L12509
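
For readers unfamiliar with bisecting, here is a minimal Python sketch of the idea. It is only an illustration: the commit names and the `render_is_garbage` check are made up, and in practice `git bisect` drives this search, with each step being a full LLVM and Mesa build followed by a LuxMark run:

```python
def first_bad(commits, is_bad):
    """Return the first commit for which is_bad(commit) is true.

    Assumes commits are in chronological order, commits[0] is known good,
    commits[-1] is known bad, and the bug never disappears once introduced.
    """
    good, bad = 0, len(commits) - 1
    while bad - good > 1:
        mid = (good + bad) // 2
        if is_bad(commits[mid]):
            bad = mid      # the regression is at mid or earlier
        else:
            good = mid     # the regression is after mid
    return commits[bad]

# Made-up history: "c5" stands for the commit that introduced the regression.
history = [f"c{i}" for i in range(10)]

def render_is_garbage(commit):
    # Placeholder for: build LLVM and Mesa at this commit, run LuxMark, inspect the image.
    return int(commit[1:]) >= 5

print(first_bad(history, render_is_garbage))  # c5
```

The search needs only about log2(N) builds for N commits, which is what makes it practical despite each step being slow.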

However, we still don't know what the bug itself is.

We now know that the regression may happen if:

- an FMUL optimization in `llvm-project/llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp` is applied,
- `MatteMaterial_ConstSample` from `LuxCore/include/slg/materials/materialdefs_funcs_matte.cl` returns something other than `(0.0, 0.0, 0.0)`.

If the bug is in the LuxCore code used by LuxMark 3.1, we may have to find a float multiplication that can go wrong when lighting a matte surface.

Everything from here on is uncertain. Maybe I'm wrong and tweaking the `pathState != MK_HIT_OBJECT` test just produces another bug that cancels out the first one, but this is what I noticed while tweaking the code:

Maybe the bug we are looking for makes the `pathState != MK_HIT_OBJECT` test in `LuxCore/include/slg/engines/pathocl/kernels/pathocl_kernels_micro.cl` go wrong, making the code return when it shouldn't, here:

https://github.com/LuxCoreRender/LuxCore/blob/358d7754f29508b73d6e9cd8c11ec8141b2176b9/include/slg/engines/pathocl/kernels/pathocl_kernels_micro.cl#L209-L210

Code: Select all

	if (pathState != MK_HIT_OBJECT)
		return;

And `pathState` is another name for `taskState->state`. I found an obvious place where `taskState->state` can be set to `MK_HIT_OBJECT`; it's even in the same file, here:

https://github.com/LuxCoreRender/LuxCore/blob/358d7754f29508b73d6e9cd8c11ec8141b2176b9/include/slg/engines/pathocl/kernels/pathocl_kernels_micro.cl#L100

Code: Select all

const bool rayMiss = (rayHits[gid].meshIndex == NULL_INDEX);
taskState->state = rayMiss ? MK_HIT_NOTHING : MK_HIT_OBJECT;

This is all I know.

Re: Looking for help to identify a compiler bug in Clang uncovered by LuxMark/LuxCore

Post by illwieckz »

The bug has been tracked down. The root cause was a division by zero but, for some reason, the wreckage only happened later, when the result of that division was multiplied and a multiplication optimization was applied in LLVM.

Matt Arsenault provided invaluable help in tracking down this bug.

The result of the bad division by zero is used in a multiplication that is optimized in such a way that the multiplication result itself becomes wrong.
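
Here is a minimal Python sketch of how this kind of interaction can go wrong. It does not reproduce the actual LLVM fold linked below: `assumed_finite_mul` is a hypothetical rewrite (`x * 0 -> 0`) of the kind a compiler may apply once fast-math lets it assume that values are never infinite or NaN. `ieee_div` is a helper written for this example because Python raises an exception on division by zero instead of returning infinity:

```python
import math

def ieee_div(a, b):
    """Divide like IEEE 754 hardware does (Python raises ZeroDivisionError instead)."""
    if b != 0.0:
        return a / b
    if a == 0.0 or math.isnan(a):
        return math.nan                       # 0/0 is NaN
    sign = math.copysign(1.0, a) * math.copysign(1.0, b)
    return math.copysign(math.inf, sign)      # x/0 is +inf or -inf

def strict_mul(a, b):
    """Plain IEEE 754 multiplication: inf * 0 is NaN."""
    return a * b

def assumed_finite_mul(a, b):
    """Hypothetical rewrite x * 0 -> 0, only valid if x is never inf or NaN."""
    return 0.0 if b == 0.0 else a * b

quotient = ieee_div(0.5, 0.0)                 # inf: the "bad" division
print(strict_mul(quotient, 0.0))              # nan
print(assumed_finite_mul(quotient, 0.0))      # 0.0
```

Both results are "legal" for their respective rules, but they disagree as soon as an infinity enters the computation, which is why the division by zero has to be avoided at the source.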

Here is a patch for LuxMark 3.1 (LuxCore patch):

https://gitlab.com/illwieckz/i-love-compute/-/commit/c5d298f36dfc17f45f2d1eda96d602d7f0cd1a7b

The patch fixes the issue when LuxMark 3.1 is running on Mesa Clover or on AMD ROCm OpenCL platform, as both platforms use LLVM amdgcn bytecode generation.

The LLVM FMUL optimization that uncovered the bug is there:

https://github.com/llvm/llvm-project/blob/29a2b20ab363bcc0b9573e358a5ad12c0eddca86/llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp#L12508-L12509

The broken computation, which divides a value by the pdf clamp value even when that clamp value is zero, is there:

https://github.com/LuxCoreRender/LuxCore/blob/luxmark_v3.1/include/slg/engines/pathocl/kernels/pathocl_kernels_micro.cl#L642-L643

The pdf clamp value is set to zero there:

https://github.com/LuxCoreRender/LuxCore/blob/luxmark_v3.1/src/slg/engines/pathocl/pathocl.cpp#L121

The patch rewrites this:

Code: Select all

min(1.f, something / PARAM_PDF_CLAMP_VALUE)

into this:

Code: Select all

min(1.f, something / (PARAM_PDF_CLAMP_VALUE == 0.f ? something : PARAM_PDF_CLAMP_VALUE))

Since `something / something == 1.f`, the patched expression never divides by zero, whereas the following alternative could still divide by zero (GPUs may still compute the unused branch even if its result is discarded):

Code: Select all

min(1.f, PARAM_PDF_CLAMP_VALUE == 0.f ? 1.f : something / PARAM_PDF_CLAMP_VALUE)
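
The difference between the two forms can be sketched in Python. This is an emulation under stated assumptions, not the OpenCL code: `select` models a conditional whose two operands are both computed beforehand, `ieee_div` is a helper that returns infinity instead of raising like Python does, and `something` is assumed to be nonzero:

```python
import math

def ieee_div(a, b):
    """Divide like IEEE 754 hardware does (Python raises ZeroDivisionError instead)."""
    if b != 0.0:
        return a / b
    if a == 0.0 or math.isnan(a):
        return math.nan
    sign = math.copysign(1.0, a) * math.copysign(1.0, b)
    return math.copysign(math.inf, sign)

def select(cond, if_true, if_false):
    """Pick one of two values that were both computed beforehand."""
    return if_true if cond else if_false

def patched(something, clamp):
    """Patched form: the divisor is selected first, so it is nonzero whenever something is."""
    divisor = select(clamp == 0.0, something, clamp)
    return min(1.0, ieee_div(something, divisor))

def alternative(something, clamp):
    """Alternative form: the quotient is computed even when it is discarded."""
    quotient = ieee_div(something, clamp)      # inf when clamp == 0
    return min(1.0, select(clamp == 0.0, 1.0, quotient)), quotient

print(patched(0.5, 0.0))       # 1.0, no division by zero involved
print(alternative(0.5, 0.0))   # (1.0, inf): same result, but an inf was produced on the way
```

Both forms give the same final value here, but only the patched one guarantees that no infinity is ever produced for the optimizer to mishandle.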

Here are some validated runs of patched LuxMark 3.1 running with `-cl-fast-relaxed-math` on Mesa Clover and ROCm:

Mesa Clover on Hawaii GCN2, validated run with `-cl-fast-relaxed-math`: http://luxmark.info/node/9562

Image

Image

AMD ROCm on VanGogh RDNA2, validated run with `-cl-fast-relaxed-math`: http://luxmark.info/node/9563

Image

Image