Plus: coding OpenCL is a real nightmare: commenting out one line or adding one useless line makes the result 100% different.
Sorry for my national holiday, but the result is exciting.
I totally agree, coding OpenCL is a nightmare. Unfortunately I can explain the theory of why this happens: the SIMD model is working against us. Adding or deleting instructions that threads have to skip, or that are used to sync execution, weighs heavily on overall performance.
Let’s look at the code at the end of fastkdf():
if (a >= output_len)
    // copy
else
    // merge
Now “a” depends on the input data. When a bunch of threads execute this conditional on different data (remember, every thread has its own distinct data), chances are that some threads take the then part (copy) while others take the else part (merge). SIMD dictates that all threads execute the same instruction or skip it. In other words: all threads step through both parts of the conditional. OpenCL is able to switch off some of the threads, i.e., a thread sees the instruction but does not execute it. It idles. The compiler tries to handle this, but is not always successful.
So if just one thread needs to execute a different part of the conditional than all the other threads in its SIMD group (wavefront/warp), all threads of that group will nevertheless step through all instructions in both the then and the else branch.
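To put rough numbers on this, here is a tiny cost model in plain C. It is only a sketch of the lockstep behaviour described above, not code from fastkdf(): the function name simd_group_cost and the instruction counts COPY_COST and MERGE_COST are made up for illustration, and real hardware is more complicated.

```c
#include <assert.h>
#include <stddef.h>

/* Made-up instruction counts for the two branches (assumptions, not measured). */
enum { COPY_COST = 10, MERGE_COST = 30 };

/*
 * Hypothetical cost model: instruction slots one SIMD group spends on
 *     if (a >= output_len) copy; else merge;
 * when all of its lanes run in lockstep. A branch costs its full length
 * as soon as at least one lane needs it; the other lanes just idle.
 */
int simd_group_cost(const unsigned *a, size_t lanes, unsigned output_len)
{
    int any_copy = 0;   /* does any lane take the then (copy) branch?  */
    int any_merge = 0;  /* does any lane take the else (merge) branch? */

    assert(lanes > 0);
    for (size_t i = 0; i < lanes; ++i) {
        if (a[i] >= output_len)
            any_copy = 1;
        else
            any_merge = 1;
    }
    return any_copy * COPY_COST + any_merge * MERGE_COST;
}
```

With output_len = 32, four lanes holding a = {40, 50, 60, 70} all take the copy branch and the model returns 10. Change a single lane to a = 5 and the whole group pays for both branches, so it returns 40.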
So much for the background. :)