Your Own Constant Folder in C/C++
I was talking with someone today who really, really wanted the sqrtps instruction to be used in some code they were writing. And because of a quirk in clang (still there as of clang 18.1.0), if you happened to use -ffast-math, clang would butcher the use of the intrinsic. So for the code:
__m128 test(const __m128 vec)
{
    return _mm_sqrt_ps(vec);
}
Clang would compile it correctly without fast-math:
test: # @test
sqrtps xmm0, xmm0
ret
And create this monstrosity with -ffast-math:
.LCPI0_0:
.long 0xbf000000 # float -0.5
.long 0xbf000000 # float -0.5
.long 0xbf000000 # float -0.5
.long 0xbf000000 # float -0.5
.LCPI0_1:
.long 0xc0400000 # float -3
.long 0xc0400000 # float -3
.long 0xc0400000 # float -3
.long 0xc0400000 # float -3
test:
rsqrtps xmm1, xmm0
movaps xmm2, xmm0
mulps xmm2, xmm1
movaps xmm3, xmmword ptr [rip + .LCPI0_0] # xmm3 = [-5.0E-1,-5.0E-1,-5.0E-1,-5.0E-1]
mulps xmm3, xmm2
mulps xmm2, xmm1
addps xmm2, xmmword ptr [rip + .LCPI0_1]
mulps xmm2, xmm3
xorps xmm1, xmm1
cmpneqps xmm0, xmm1
andps xmm0, xmm2
ret
The optimization flow here in LLVM is:
- Under fast-math conditions, sqrt(x) == x * rsqrt(x), so it uses rsqrtps instead.
- But that has precision issues between Intel and AMD due to a high ULP tolerance for the rsqrtps instruction.
- So LLVM does two Newton-Raphson iterations any time it calls rsqrtps to correct the precision between the CPU implementations (a sketch of what the emitted sequence computes follows below).
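Here is that sketch - a rough intrinsics reconstruction of what the fast-math output above computes. This is my own illustration of the emitted sequence, not code that lives in LLVM:
#include <xmmintrin.h>

// Approximate sqrt(x) the way the fast-math codegen does: start from the
// low-precision rsqrtps estimate y ~= 1/sqrt(x), refine it with a
// Newton-Raphson step, then zero any lane where x == 0 to avoid NaNs.
__m128 fast_math_sqrt_ps(__m128 x)
{
    const __m128 y  = _mm_rsqrt_ps(x);                     // rough 1/sqrt(x)
    const __m128 xy = _mm_mul_ps(x, y);                    // x * y
    const __m128 a  = _mm_mul_ps(xy, _mm_set1_ps(-0.5f));  // -0.5 * x * y
    const __m128 b  = _mm_add_ps(_mm_mul_ps(xy, y),        // x * y * y - 3
                                 _mm_set1_ps(-3.0f));
    const __m128 r  = _mm_mul_ps(a, b);                    // ~= sqrt(x)
    const __m128 nz = _mm_cmpneq_ps(x, _mm_setzero_ps());  // lanes where x != 0
    return _mm_and_ps(nz, r);                              // keep sqrt(0) == 0
}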
The ‘fix’ here is just to use inline assembly to guarantee you’ll always get the instruction selection you want:
__m128 test(__m128 vec)
{
    // "=x"(vec): output operand %0 in an SSE register;
    // "x"(vec):  input operand %1 in an SSE register.
    // (AT&T operand order in the template: source %1, then destination %0.)
    __asm__ ("sqrtps %1, %0" : "=x"(vec) : "x"(vec));
    return vec;
}
But there is one additional thing I’d advocate you do if you need to use inline assembly - write your own constant folding.
See, the one downside to the inline assembly above is that if test is inlined and vec was a constant, it wouldn’t constant fold it. For example:
__attribute__((always_inline)) __m128 test(__m128 vec)
{
    __asm__ ("sqrtps %1, %0" : "=x"(vec) : "x"(vec));
    return vec;
}

__m128 call_test()
{
    return test(_mm_setr_ps(1.f, 2.f, 3.f, 4.f));
}
Will produce:
test:
sqrtps xmm0, xmm0
ret
.LCPI1_0:
.long 0x3f800000 # float 1
.long 0x40000000 # float 2
.long 0x40400000 # float 3
.long 0x40800000 # float 4
call_test:
movaps xmm0, xmmword ptr [rip + .LCPI1_0] # xmm0 = [1.0E+0,2.0E+0,3.0E+0,4.0E+0]
sqrtps xmm0, xmm0
ret
So even under inlining, when we could have constant folded it away entirely, we are still calling sqrtps when we don’t have to. So what is the fix?
LLVM has an intrinsic, is_constant, which can be accessed via the Clang-supported GCC extension __builtin_constant_p. If we extend our test above to check when vec is constant, we can call _mm_sqrt_ps when it is, and benefit from the constant folder doing its thing and removing the call entirely. So our code becomes:
__attribute__((always_inline)) __m128 test(__m128 vec)
{
    if (__builtin_constant_p(vec))
    {
        return _mm_sqrt_ps(vec);
    }

    __asm__ ("sqrtps %1, %0" : "=x"(vec) : "x"(vec));
    return vec;
}

__m128 call_test()
{
    return test(_mm_setr_ps(1.f, 2.f, 3.f, 4.f));
}
And we get:
call_test:
movaps xmm0, xmmword ptr [rip + .LCPI11_0] # xmm0 = [1.0E+0,2.0E+0,3.0E+0,4.0E+0]
sqrtps xmm0, xmm0
ret
What the heck?! It hasn’t constant folded! It turns out GCC is a bit picky with this builtin, and it looks like LLVM has inherited that funky behaviour: you cannot use it with a vector, even though LLVM happily has support for it in the IR. But there is a workaround, albeit an ugly one:
__attribute__((always_inline)) __m128 test(__m128 vec)
{
    if (__builtin_constant_p(vec[0]) &&
        __builtin_constant_p(vec[1]) &&
        __builtin_constant_p(vec[2]) &&
        __builtin_constant_p(vec[3]))
    {
        return _mm_sqrt_ps(vec);
    }

    __asm__ ("sqrtps %1, %0" : "=x"(vec) : "x"(vec));
    return vec;
}

__m128 call_test()
{
    return test(_mm_setr_ps(1.f, 2.f, 3.f, 4.f));
}
Will produce:
.LCPI15_0:
.long 0x3f800000 # float 1
.long 0x3fb504f3 # float 1.41421354
.long 0x3fddb3d7 # float 1.73205078
.long 0x40000000 # float 2
call_test:
movaps xmm0, xmmword ptr [rip + .LCPI15_0] # xmm0 = [1.0E+0,1.41421354E+0,1.73205078E+0,2.0E+0]
ret
Nice! We’ve got the constant folding we want. And also nicely, if we mark test as noinline instead, the code for test is:
test:
sqrtps xmm0, xmm0
ret
Meaning the branch is folded away. In both cases we now get the behaviour we want. We’ve written our own constant folder. Nice! You can see the full example on godbolt.
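For completeness, the noinline variant referred to above is just the same function with the attribute swapped - nothing else changes:
__attribute__((noinline)) __m128 test(__m128 vec)
{
    // Without inlining, vec is never a compile-time constant here, so the
    // __builtin_constant_p checks evaluate to false and the branch folds
    // down to just the inline assembly.
    if (__builtin_constant_p(vec[0]) &&
        __builtin_constant_p(vec[1]) &&
        __builtin_constant_p(vec[2]) &&
        __builtin_constant_p(vec[3]))
    {
        return _mm_sqrt_ps(vec);
    }

    __asm__ ("sqrtps %1, %0" : "=x"(vec) : "x"(vec));
    return vec;
}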
It’d be nice if we could just use the vector in __builtin_constant_p, but I think the LLVM folks have purposefully tried to match what GCC would do. I’d personally advocate for a loosening of the builtin, and I might file a GitHub issue about just that.