Looking for help to identify a compiler bug in Clang uncovered by LuxMark/LuxCore

illwieckz
Posts: 3
Joined: Tue May 17, 2022 11:41 pm

Looking for help to identify a compiler bug in Clang uncovered by LuxMark/LuxCore

Post by illwieckz »

Hi, I found a bug in the `-ffast-math` option of the LLVM Clang compiler that reproduces when compiling LuxMark's OpenCL code.

I'm looking for help narrowing down the reproduction case, in the hope of getting this Clang bug fixed.

Some context: the Mesa Clover radeonsi OpenCL driver for AMD GCN and later devices relies on the LLVM Clang compiler. I noticed a regression when running the LuxMark v3.1 LuxBall benchmark: it now renders garbage with the default OpenCL compiler options. After some investigation, it appears the bug is not in Mesa but in LLVM, and that it reproduces once `-cl-fast-relaxed-math` is enabled. Further investigation points to `-ffast-math` as the culprit, which appears to be enabled automatically by `-cl-fast-relaxed-math`: enabling `-ffast-math` without `-cl-fast-relaxed-math` reproduces the bug, and removing the dependency of `-cl-fast-relaxed-math` on `-ffast-math` works around it.

One problem I face is that reproducing the bug requires running the complete LuxMark v3.1 benchmark, which makes it hard to identify what is wrong.

Even LuxMark 3 is hard to build today (though I wrote a script to make it easy). This makes me think it would be good if newer LuxMark versions kept the LuxBall scene, even if not as the default one, simply because it is a simple scene.

So I'm looking for help narrowing down the reproduction case to help identify the bug in Clang. Here is the issue on the Mesa side but, more importantly, here is the issue on the LLVM side; this comment may be relevant.

Maybe it would be useful to identify which kernel is affected by the bug, to begin with? I feel I have done as much as I can and have hit my limit; I don't know what else I can do myself to help.

For reference, this is what the bug looks like:

without `-ffast-math` (without `-cl-fast-relaxed-math`):

Image

with `-ffast-math` (implied by default `-cl-fast-relaxed-math`):

Image
Last edited by illwieckz on Thu May 26, 2022 4:51 pm, edited 1 time in total.
Dade
Developer
Posts: 5664
Joined: Mon Dec 04, 2017 8:36 pm
Location: Italy

Re: Looking for help to identify a compiler bug in Clang uncovered by LuxMark/LuxCore

Post by Dade »

LuxMark v3.1 is very old, and the latest OpenCL code has changed a lot, so it may not show the same bug at all.

From the look of your rendering, I can tell you that you probably have NaNs "traveling" across the code and landing on pixels: NaNs are like a poison, everything they touch becomes a NaN itself. I guess you have some kind of math function returning a NaN somewhere instead of a correct value, but tracking down where it is may be very hard.
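
To illustrate the "poison" behaviour, here is a minimal Python sketch. It is not LuxCore code: the sample values and the `accumulate` and `is_nan` helpers are made up for illustration. It shows one NaN sample contaminating the value accumulated for a pixel, and the usual `x != x` test for spotting it.

```python
import math


def accumulate(samples):
    """Sum the samples of one pixel, the way a film buffer would (hypothetical helper)."""
    total = 0.0
    for sample in samples:
        total += sample
    return total


def is_nan(x):
    """NaN is the only float value that compares unequal to itself."""
    return x != x


clean = accumulate([0.25, 0.5, 0.125])
poisoned = accumulate([0.25, float("nan"), 0.125])

print(clean)                       # a finite sum
print(is_nan(poisoned))            # the single NaN sample poisoned the whole sum
print(math.isnan(poisoned * 0.0))  # even multiplying by zero does not clear it
```

Note that fast-math style options generally let the compiler assume NaNs never occur, so a check like `x != x` written in the affected kernel itself may be optimised away; checking the output buffer on the host side is likely more reliable.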
illwieckz

Re: Looking for help to identify a compiler bug in Clang uncovered by LuxMark/LuxCore

Post by illwieckz »

Thank you for your answer. I was afraid investigating this from the LuxMark side would not be easy, and you seem to confirm it. One interesting thing I noticed in the meantime is that PoCL pthread (a CPU implementation), which also relies heavily on LLVM, suffered from a very similar problem with LLVM 11. This means a very similar bug was introduced for CPUs between LLVM 10 and LLVM 11 and fixed in LLVM 12, while the bug affecting AMD GCN GPUs is known to have been introduced between LLVM 9 and LLVM 11 (I did not find an easy way to test LLVM 10) but was never fixed.

I wrote more about my findings there.

I wonder if it would be possible to produce a simpler scene involving only this surface, since we know this part is badly rendered even though it looks simple. That may help reduce the number of LuxMark OpenCL functions involved:

Image
illwieckz

Re: Looking for help to identify a compiler bug in Clang uncovered by LuxMark/LuxCore

Post by illwieckz »

I'm still able to reproduce the bug when I reduce the luxball/scene.scn scene file to this:

Code: Select all

scene.camera.lookat.orig = 0.649246990680694580078125 0.279195994138717651367188 0.148695006966590881347656
scene.camera.lookat.target = 0.107509911060333251953125 1.0487911701202392578125 -0.189305052161216735839844
scene.camera.up = 0 0 1
scene.lights.infinitelight.type = infinite
scene.lights.infinitelight.file = scenes/luxball/imagemap-00000.exr
scene.materials.material-0x7f4e0b47a5e0.type = matte
scene.objects.extmesh-0x7f4e0b4c3920.material = material-0x7f4e0b47a5e0
scene.objects.extmesh-0x7f4e0b4c3920.ply = scenes/luxball/mesh-00006.ply

This may help identify which OpenCL code is failing, by reducing the amount of OpenCL code the scene needs in order to be rendered.

I've noticed that these material types reproduce the bug:

Code: Select all

matte
roughmatte
cloth

While these do not:

Code: Select all

mattetranslucent
mirror
metal2
glossy2
velvet
archglass
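
To make this kind of per-material comparison repeatable, the scene variants could be generated with a small script. The following is only a sketch: it reuses the reduced scene above and swaps the material type, the `scene-<type>.scn` file names and both helper functions are hypothetical, and it assumes every material type renders with default parameters when only `type` is set.

```python
from pathlib import Path

# Reduced LuxBall scene from this post; only the material type varies.
SCENE_TEMPLATE = """\
scene.camera.lookat.orig = 0.649246990680694580078125 0.279195994138717651367188 0.148695006966590881347656
scene.camera.lookat.target = 0.107509911060333251953125 1.0487911701202392578125 -0.189305052161216735839844
scene.camera.up = 0 0 1
scene.lights.infinitelight.type = infinite
scene.lights.infinitelight.file = scenes/luxball/imagemap-00000.exr
scene.materials.material-0x7f4e0b47a5e0.type = {material_type}
scene.objects.extmesh-0x7f4e0b4c3920.material = material-0x7f4e0b47a5e0
scene.objects.extmesh-0x7f4e0b4c3920.ply = scenes/luxball/mesh-00006.ply
"""

# Material types tested in this post (the first three reproduce the bug).
MATERIAL_TYPES = [
    "matte", "roughmatte", "cloth",
    "mattetranslucent", "mirror", "metal2", "glossy2", "velvet", "archglass",
]


def scene_variants():
    """Return a mapping of hypothetical file name -> scene text, one per material type."""
    return {
        f"scene-{material_type}.scn": SCENE_TEMPLATE.format(material_type=material_type)
        for material_type in MATERIAL_TYPES
    }


def write_variants(out_dir):
    """Write every variant into out_dir and return the list of written paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for name, text in scene_variants().items():
        path = out_dir / name
        path.write_text(text)
        paths.append(path)
    return paths
```

Calling `write_variants("luxball-variants")` would write nine scene files into that (arbitrarily named) directory, each of which can then be rendered with and without `-cl-fast-relaxed-math` and compared.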