Optimizing NVidia kernel compile (Linux + macOS builds in testing)

Discussion related to the LuxCore functionality, implementations and API.
jensverwiebe
Supporting Users
Posts: 141
Joined: Tue Jan 09, 2018 6:48 pm

Optimizing NVidia kernel compile (Linux + macOS builds in testing)

Post by jensverwiebe »

Hi,
I profiled the ptxas stage of the NV kernels and have now hand-optimized the kernel compile.

Linux test build with the latest optimizations:
Basically, the kernel compile speedup is now around a factor of 100 to 300.
There might be a minimal trade-off in render speed for older GPU series; I found nothing significant in my setup.
Hotel now takes 3.5 sec and Mic 15 sec to compile the kernels.

As of March 10th, with master combined with profiled in-/un-inlining:
Hotel is now at 2.7 sec and Mic at 6.5 sec to compile the kernels.

EDIT: the days are always too short. I have Linux 2.0 profiled kernels and macOS 2.0 with NVidia support
in testing by friends, but absolutely no time to work out the NVidia render speed optimizations I was aiming for.

Insights as of June 12th:
- macOS 2.0 is fine; 2.1alpha works atm. without the multithreaded denoiser, as Apple removed OpenMP from their clang
- Linux 2.0 profiled is fine; 2.1alpha needs the inlining conflicts solved that arise from deviations in master

Jens
Last edited by jensverwiebe on Tue Jun 12, 2018 12:59 pm, edited 18 times in total.
jensverwiebe
Supporting Users
Posts: 141
Joined: Tue Jan 09, 2018 6:48 pm

Re: Optimizing NVidia kernel compile

Post by jensverwiebe »

I think I got it: I mark all functions in texture_blender_noisefunctions2.cl as __attribute__((always_inline)) and am down to 16 seconds per GPU.
Compare: before it was > 3 minutes per GPU, now 16 sec per GPU; for the record, with optimizations disabled it is 3 seconds per GPU.
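For illustration, the pattern looks like this (a minimal sketch on a dummy function; the real signatures in texture_blender_noisefunctions2.cl differ):

Code: Select all

// Dummy helper standing in for the real noise functions; the point
// is the attribute on the definition:
__attribute__((always_inline)) float BlenderNoise_ExampleHelper(const float3 p) {
	return .5f * (p.x + p.y + p.z);
}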

More testing ....

EDIT: this works across all my test files, a stunningly faster kernel compile and also a tad faster rendering.
Btw: that is possibly because NV now uses an LLVM-based compiler, where the inline keyword is only a mild suggestion; formerly it was stricter.

Todo: make a #define force_inline __attribute__((always_inline))
for the Nvidia case, as sketched below. This could perhaps speed up compilation in other (nested) places too.
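A minimal sketch of what that define could look like (the vendor guard PARAM_NVIDIA_OCL is a hypothetical name, not existing LuxCore code):

Code: Select all

// Hypothetical vendor guard; the real detection define may be named differently:
#if defined(PARAM_NVIDIA_OCL)
#define force_inline __attribute__((always_inline))
#else
#define force_inline
#endif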
Trying to get it done on the weekend; my one week of holidays ends on Sunday :(

The actual diff is available on request.

Jens
Last edited by jensverwiebe on Sat Feb 10, 2018 2:13 pm, edited 1 time in total.
jensverwiebe
Supporting Users
Posts: 141
Joined: Tue Jan 09, 2018 6:48 pm

Re: Optimizing NVidia kernel compile

Post by jensverwiebe »

Down to 8 seconds :shock: .... I also inlined texture_bump_funcs.cl.
Although with all procedurals it is a bit mixed, mostly I now start in < 1 minute where it was 9 min before.
It also seems NV caches something extra in ~/.nv/ComputeCache, which makes testing a bit odd sometimes.
@Dade, I propose you check this out on your GTX 980 and implement it to your liking.

Jens :D
Sharlybg
Donor
Posts: 3101
Joined: Mon Dec 04, 2017 10:11 pm
Location: Ivory Coast

Re: Optimizing NVidia kernel compile

Post by Sharlybg »

You're improving the Nvidia scene compile time? If yes, that's really great (compile time is so long on it that I had stopped trying).
Support LuxCoreRender project with salts and bounties

Portfolio : https://www.behance.net/DRAVIA
Dade
Developer
Posts: 5672
Joined: Mon Dec 04, 2017 8:36 pm
Location: Italy

Re: Optimizing NVidia kernel compile

Post by Dade »

jensverwiebe wrote: Fri Feb 09, 2018 10:07 pm Down to 8 seconds :shock: .... I also inlined texture_bump_funcs.cl.
Although with all procedurals it is a bit mixed, mostly I now start in < 1 minute where it was 9 min before.
It also seems NV caches something extra in ~/.nv/ComputeCache, which makes testing a bit odd sometimes.
NV caches kernels exactly like we do (only at the driver level). Are you sure you have deleted all kernel caches? GPUs have no function calls at all, so inlining functions should simply do nothing.
Support LuxCoreRender project with salts and bounties
jensverwiebe
Supporting Users
Posts: 141
Joined: Tue Jan 09, 2018 6:48 pm

Re: Optimizing NVidia kernel compile

Post by jensverwiebe »

I just deleted ~/.nv/ComputeCache and the ~/.config/luxcore cache again and got all kernels
for my GTX 980 Ti + 2 × GTX 1080 in 30 seconds, where it was 9 minutes before.
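For a clean timing run, both caches can be wiped like this (paths as named above):

Code: Select all

# wipe NVIDIA's driver-level kernel cache and LuxCore's own kernel cache:
rm -rf ~/.nv/ComputeCache
rm -rf ~/.config/luxcore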

Cycles also handles inlinings; why should it if they were meaningless?
By changing inlining we overcame performance problems in Cycles with the CUDA 8
release, for example.

Here is the CUDA example for that fix: https://developer.blender.org/D2269
NVidia OpenCL simply wraps to CUDA, thus ptxas remains involved.
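The kind of qualifiers that patch plays with look roughly like this (approximate, from memory; see D2269 for the exact defines):

Code: Select all

/* CUDA-side inline control as used in Cycles (approximate): */
#define ccl_device_inline   __device__ __forceinline__
#define ccl_device_noinline __device__ __noinline__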

Jens
Dade
Developer
Posts: 5672
Joined: Mon Dec 04, 2017 8:36 pm
Location: Italy

Re: Optimizing NVidia kernel compile

Post by Dade »

jensverwiebe wrote: Sat Feb 10, 2018 1:20 pm I just deleted ~/.nv/ComputeCache and the ~/.config/luxcore cache again and got all kernels
for my GTX 980 Ti + 2 × GTX 1080 in 30 seconds, where it was 9 minutes before.
I ran this test:

Code: Select all

./bin/luxcoreui -D opencl.devices.select 100 -D renderengine.type PATHOCL -D sampler.type SOBOL scenes/luxball/proctexball-mix.cfg
The kernel compilation time is 50 secs with the NVIDIA GTX 980. It requires 3.7 secs with the AMD R290X.

I added "__attribute__((always_inline))" to all FBM- and CheckerBoard2D-related functions. They are the only two procedural textures used in the scene.

The compile time is still 50 secs.

You can edit (for instance, add a comment to) one of the .cl files used, to be sure to trigger a kernel re-compilation and avoid any cache.
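For example (any harmless edit of a used file works; the path below is just one of the files from this thread):

Code: Select all

# append a comment to force a cache miss and a full recompile:
echo "// cache buster" >> include/slg/textures/texture_funcs.cl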
jensverwiebe wrote: Sat Feb 10, 2018 1:20 pm Cycles also handles inlinings; why should it if they were meaningless?
By changing inlining we overcame performance problems in Cycles with the CUDA 8
release, for example.

Here is the CUDA example for that fix: https://developer.blender.org/D2269
NVidia OpenCL simply wraps to CUDA, thus ptxas remains involved.
The above patch doesn't mention compilation times anywhere; as far as I can see, it is only about rendering performance.

The NVIDIA problem is likely related to loop unrolling, and it is something akin to a bug (because it is not normal to be 14 times slower than the AMD compiler).
Support LuxCoreRender project with salts and bounties
jensverwiebe
Supporting Users
Posts: 141
Joined: Tue Jan 09, 2018 6:48 pm

Re: Optimizing NVidia kernel compile

Post by jensverwiebe »

I was talking about bump maps! Procedurals as color maps are not that bad (although they could be much better too).
Your test case was simple procedurals on color input (also 43 seconds here, i.e. a case not covered by my patch).

I nowhere said the Cycles example is about compile times; it is about inlining attributes used in GPGPU code.

Test:
screenshot-area-2018-02-10-182043.png

Code: Select all

./bin/luxcoreui -D opencl.devices.select 1000 -D renderengine.type PATHOCL -D sampler.type SOBOL scenes/inline_test/render.cfg

Without my patch:

Code: Select all

[LuxCore][1.044] [PathOCLBaseRenderThread::0] Compiling kernels 
[LuxCore][311.704] [PathOCLBaseRenderThread::0] Kernels not cached
[LuxCore][311.704] [PathOCLBaseRenderThread::0] Compiling Film_Clear Kernel
[LuxCore][311.704] [PathOCLBaseRenderThread::0] Compiling InitSeed Kernel
[LuxCore][311.704] [PathOCLBaseRenderThread::0] Compiling Init Kernel
[LuxCore][311.704] [PathOCLBaseRenderThread::0] Compiling AdvancePaths_MK_RT_NEXT_VERTEX Kernel
[LuxCore][311.704] [PathOCLBaseRenderThread::0] Compiling AdvancePaths_MK_HIT_NOTHING Kernel
[LuxCore][311.704] [PathOCLBaseRenderThread::0] Compiling AdvancePaths_MK_HIT_OBJECT Kernel
[LuxCore][311.704] [PathOCLBaseRenderThread::0] Compiling AdvancePaths_MK_RT_DL Kernel
[LuxCore][311.704] [PathOCLBaseRenderThread::0] Compiling AdvancePaths_MK_DL_ILLUMINATE Kernel
[LuxCore][311.704] [PathOCLBaseRenderThread::0] Compiling AdvancePaths_MK_DL_SAMPLE_BSDF Kernel
[LuxCore][311.704] [PathOCLBaseRenderThread::0] Compiling AdvancePaths_MK_GENERATE_NEXT_VERTEX_RAY Kernel
[LuxCore][311.704] [PathOCLBaseRenderThread::0] Compiling AdvancePaths_MK_SPLAT_SAMPLE Kernel
[LuxCore][311.704] [PathOCLBaseRenderThread::0] Compiling AdvancePaths_MK_NEXT_SAMPLE Kernel
[LuxCore][311.704] [PathOCLBaseRenderThread::0] Compiling AdvancePaths_MK_GENERATE_CAMERA_RAY Kernel
[LuxCore][311.704] [PathOCLBaseRenderThread::0] Kernels compilation time: 310661ms <------------
[LuxCore][311.716] Film OpenCL image pipeline
[LuxCore][311.717] Film OpenCL Device used: GeForce GTX 1080 Intersect
[LuxCore][311.717]   Device OpenCL version: OpenCL 1.2 CUDA
With patch:

Code: Select all

[LuxCore][1.055] [PathOCLBaseRenderThread::0] Compiling kernels 
[LuxCore][18.120] [PathOCLBaseRenderThread::0] Kernels not cached
[LuxCore][18.120] [PathOCLBaseRenderThread::0] Compiling Film_Clear Kernel
[LuxCore][18.120] [PathOCLBaseRenderThread::0] Compiling InitSeed Kernel
[LuxCore][18.120] [PathOCLBaseRenderThread::0] Compiling Init Kernel
[LuxCore][18.120] [PathOCLBaseRenderThread::0] Compiling AdvancePaths_MK_RT_NEXT_VERTEX Kernel
[LuxCore][18.120] [PathOCLBaseRenderThread::0] Compiling AdvancePaths_MK_HIT_NOTHING Kernel
[LuxCore][18.120] [PathOCLBaseRenderThread::0] Compiling AdvancePaths_MK_HIT_OBJECT Kernel
[LuxCore][18.120] [PathOCLBaseRenderThread::0] Compiling AdvancePaths_MK_RT_DL Kernel
[LuxCore][18.120] [PathOCLBaseRenderThread::0] Compiling AdvancePaths_MK_DL_ILLUMINATE Kernel
[LuxCore][18.120] [PathOCLBaseRenderThread::0] Compiling AdvancePaths_MK_DL_SAMPLE_BSDF Kernel
[LuxCore][18.120] [PathOCLBaseRenderThread::0] Compiling AdvancePaths_MK_GENERATE_NEXT_VERTEX_RAY Kernel
[LuxCore][18.120] [PathOCLBaseRenderThread::0] Compiling AdvancePaths_MK_SPLAT_SAMPLE Kernel
[LuxCore][18.120] [PathOCLBaseRenderThread::0] Compiling AdvancePaths_MK_NEXT_SAMPLE Kernel
[LuxCore][18.120] [PathOCLBaseRenderThread::0] Compiling AdvancePaths_MK_GENERATE_CAMERA_RAY Kernel
[LuxCore][18.120] [PathOCLBaseRenderThread::0] Kernels compilation time: 17066ms <------------
[LuxCore][18.133] Film OpenCL image pipeline
[LuxCore][18.134] Film OpenCL Device used: GeForce GTX 1080 Intersect

Atm. I have only concentrated on the procedural stuff that made sensible usage on NVidia impossible.
Keep in mind the 310661 ms is only for the 1080; the 980 Ti would add a similar duration on top.
That said, I have not had time to go any deeper.

Sidenote: exporting to luxcoreui text files ended up in:

Code: Select all

[LuxCore][1.201] Building visibility map of light source: __WORLD_BACKGROUND_LIGHT__  memory acces violation
unless you add:

Code: Select all

scene.lights.__WORLD_BACKGROUND_LIGHT__.visibilitymap.enable = 0
to the scene file. I guess it's a forgotten export, as it works normally in Blender/pyluxcore.

Jens
Dade
Developer
Posts: 5672
Joined: Mon Dec 04, 2017 8:36 pm
Location: Italy

Re: Optimizing NVidia kernel compile

Post by Dade »

jensverwiebe wrote: Sat Feb 10, 2018 5:21 pm Sidenote: exporting to luxcoreui text files ended up in:

Code: Select all

[LuxCore][1.201] Building visibility map of light source: __WORLD_BACKGROUND_LIGHT__  memory acces violation
unless you add:

Code: Select all

scene.lights.__WORLD_BACKGROUND_LIGHT__.visibilitymap.enable = 0
to the scene file. I guess it's a forgotten export, as it works normally in Blender/pyluxcore.
It was fixed yesterday: https://github.com/LuxCoreRender/LuxCore/issues/61
Support LuxCoreRender project with salts and bounties
jensverwiebe
Supporting Users
Posts: 141
Joined: Tue Jan 09, 2018 6:48 pm

Re: Optimizing NVidia kernel compile

Post by jensverwiebe »

@Dade: I worked on the scene you used as a test case (proctexball-mix):

Code: Select all

./bin/luxcoreui -D opencl.devices.select 100 -D renderengine.type PATHOCL -D sampler.type SOBOL scenes/luxball/proctexball-mix.cfg
[LuxCore][4.977] [PathOCLBaseRenderThread::0] Kernels compilation time: 3681ms (was 43164 ms)

This time I made sure some functions are not inlined:

Code: Select all

[LuxCore][1.362] [PathOCLBaseRenderThread::0] Compiling kernels 
[LuxCore][4.997] [PathOCLBaseRenderThread::0] Kernels not cached
[LuxCore][4.997] [PathOCLBaseRenderThread::0] Compiling Film_Clear Kernel
[LuxCore][4.997] [PathOCLBaseRenderThread::0] Compiling InitSeed Kernel
[LuxCore][4.997] [PathOCLBaseRenderThread::0] Compiling Init Kernel
[LuxCore][4.997] [PathOCLBaseRenderThread::0] Compiling AdvancePaths_MK_RT_NEXT_VERTEX Kernel
[LuxCore][4.997] [PathOCLBaseRenderThread::0] Compiling AdvancePaths_MK_HIT_NOTHING Kernel
[LuxCore][4.997] [PathOCLBaseRenderThread::0] Compiling AdvancePaths_MK_HIT_OBJECT Kernel
[LuxCore][4.997] [PathOCLBaseRenderThread::0] Compiling AdvancePaths_MK_RT_DL Kernel
[LuxCore][4.997] [PathOCLBaseRenderThread::0] Compiling AdvancePaths_MK_DL_ILLUMINATE Kernel
[LuxCore][4.997] [PathOCLBaseRenderThread::0] Compiling AdvancePaths_MK_DL_SAMPLE_BSDF Kernel
[LuxCore][4.997] [PathOCLBaseRenderThread::0] Compiling AdvancePaths_MK_GENERATE_NEXT_VERTEX_RAY Kernel
[LuxCore][4.997] [PathOCLBaseRenderThread::0] Compiling AdvancePaths_MK_SPLAT_SAMPLE Kernel
[LuxCore][4.997] [PathOCLBaseRenderThread::0] Compiling AdvancePaths_MK_NEXT_SAMPLE Kernel
[LuxCore][4.997] [PathOCLBaseRenderThread::0] Compiling AdvancePaths_MK_GENERATE_CAMERA_RAY Kernel
[LuxCore][4.997] [PathOCLBaseRenderThread::0] Kernels compilation time: 3637ms
[LuxCore][4.999] Film OpenCL image pipeline
[LuxCore][5.000] Film OpenCL Device used: GeForce GTX 1080 Intersect
The diff for just that scene's compile speedup:

Code: Select all

diff --git a/include/slg/textures/texture_funcs.cl b/include/slg/textures/texture_funcs.cl
index 4f7f273..203173c 100644
--- a/include/slg/textures/texture_funcs.cl
+++ b/include/slg/textures/texture_funcs.cl
@@ -159,12 +159,12 @@ float3 FresnelApproxKTexture_ConstEvaluateSpectrum(__global HitPoint *hitPoint,
 
 #if defined(PARAM_ENABLE_TEX_MIX)
 
-float MixTexture_ConstEvaluateFloat(__global HitPoint *hitPoint,
+__attribute__((noinline)) float MixTexture_ConstEvaluateFloat(__global HitPoint *hitPoint,
 		const float amt, const float value1, const float value2) {
 	return mix(value1, value2, clamp(amt, 0.f, 1.f));
 }
 
-float3 MixTexture_ConstEvaluateSpectrum(__global HitPoint *hitPoint,
+__attribute__((noinline)) float3 MixTexture_ConstEvaluateSpectrum(__global HitPoint *hitPoint,
 		const float3 amt, const float3 value1, const float3 value2) {
 	return mix(value1, value2, clamp(amt, 0.f, 1.f));
 }
@@ -177,7 +177,7 @@ float3 MixTexture_ConstEvaluateSpectrum(__global HitPoint *hitPoint,
 
 #if defined(PARAM_ENABLE_CHECKERBOARD2D)
 
-float CheckerBoard2DTexture_ConstEvaluateFloat(__global HitPoint *hitPoint,
+__attribute__((noinline)) float CheckerBoard2DTexture_ConstEvaluateFloat(__global HitPoint *hitPoint,
 		const float value1, const float value2, __global const TextureMapping2D *mapping) {
 	const float2 uv = VLOAD2F(&hitPoint->uv.u);
 	const float2 mapUV = TextureMapping2D_Map(mapping, hitPoint);
@@ -185,7 +185,7 @@ float CheckerBoard2DTexture_ConstEvaluateFloat(__global HitPoint *hitPoint,
 	return ((Floor2Int(mapUV.s0) + Floor2Int(mapUV.s1)) % 2 == 0) ? value1 : value2;
 }
 
-float3 CheckerBoard2DTexture_ConstEvaluateSpectrum(__global HitPoint *hitPoint,
+__attribute__((noinline)) float3 CheckerBoard2DTexture_ConstEvaluateSpectrum(__global HitPoint *hitPoint,
 		const float3 value1, const float3 value2, __global const TextureMapping2D *mapping) {
 	const float2 uv = VLOAD2F(&hitPoint->uv.u);
 	const float2 mapUV = TextureMapping2D_Map(mapping, hitPoint);
@@ -197,7 +197,7 @@ float3 CheckerBoard2DTexture_ConstEvaluateSpectrum(__global HitPoint *hitPoint,
 
 #if defined(PARAM_ENABLE_CHECKERBOARD3D)
 
-float CheckerBoard3DTexture_ConstEvaluateFloat(__global HitPoint *hitPoint,
+__attribute__((noinline)) float CheckerBoard3DTexture_ConstEvaluateFloat(__global HitPoint *hitPoint,
 		const float value1, const float value2, __global const TextureMapping3D *mapping) {
 	// The +DEFAULT_EPSILON_STATIC is there as workaround for planes placed exactly on 0.0
 	const float3 mapP = TextureMapping3D_Map(mapping, hitPoint) +  + DEFAULT_EPSILON_STATIC;
@@ -205,7 +205,7 @@ float CheckerBoard3DTexture_ConstEvaluateFloat(__global HitPoint *hitPoint,
 	return ((Floor2Int(mapP.x) + Floor2Int(mapP.y) + Floor2Int(mapP.z)) % 2 == 0) ? value1 : value2;
 }
 
-float3 CheckerBoard3DTexture_ConstEvaluateSpectrum(__global HitPoint *hitPoint,
+__attribute__((noinline)) float3 CheckerBoard3DTexture_ConstEvaluateSpectrum(__global HitPoint *hitPoint,
 		const float3 value1, const float3 value2, __global const TextureMapping3D *mapping) {
 	// The +DEFAULT_EPSILON_STATIC is there as workaround for planes placed exactly on 0.0
 	const float3 mapP = TextureMapping3D_Map(mapping, hitPoint) +  + DEFAULT_EPSILON_STATIC;
Now the Herculean work: understanding when to inline and when not, grrrr ....
As far as I can see, one must go through all the textures and check where inlining, noinlining, or possibly #pragma unroll 1 (== no unroll)
should be used; see the sketch below.
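The three knobs in question, sketched on dummy functions (illustrative only, not the real texture code):

Code: Select all

// force inlining of a small, hot helper:
__attribute__((always_inline)) float SmallHelper_Eval(const float a) {
	return a * a;
}

// keep a large body out of line so it is compiled only once:
__attribute__((noinline)) float BigTexture_Eval(const float a) {
	return a + 1.f;
}

// forbid unrolling of a loop that would otherwise blow up compile time:
float SumOctaves(const float x) {
	float sum = 0.f;
	#pragma unroll 1
	for (int i = 0; i < 8; ++i)
		sum += x / (float)(i + 1);
	return sum;
}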

Another success:

Code: Select all

./bin/luxcoreui -D opencl.devices.select 1000 -D renderengine.type PATHOCL -D sampler.type SOBOL scenes/bump/bump-proc-mix.cfg
[LuxCore][6.839] [PathOCLBaseRenderThread::0] Kernels compilation time: 5867ms (was > 10 minutes, I always aborted)
Got marble down from 65s to 2.3s! Got succi down to 5s too.
I now understand the underlying problem.
Empirically verified.

By doing selective inlining, my own Musgrave color/bump texture went from 15 down to 12 secs per GPU.

Jens
Last edited by jensverwiebe on Wed Mar 07, 2018 5:29 pm, edited 2 times in total.