Using Floating-Point Scalar Evolution to Propagate Fast-Math Flags
As a follow-up to my post An Experimental Floating-Point Scalar Evolution, I’ve started looking into using my previous analysis to actually change the LLVM IR for the better.
The LLVM optimization pass discussed in this post is available on GitHub at https://github.com/sheredom/fpscev.
Fast-Math Flags⌗
LLVM has a way of encoding fast-math flags on various floating-point operations. These flags let the compiler make assumptions about an operation, which generally enables better performance.
Of these flags, there are two that my previous floating-point scalar evolution (fpscev) analysis can help propagate through other instructions: no-NaNs (nnan) and no-infinities (ninf).
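For reference, this is roughly how the two flags look through LLVM's C++ API. This is just an illustrative sketch; a real pass would only set a flag once an analysis has proven it safe.
#include "llvm/IR/Instruction.h"
#include "llvm/IR/Operator.h"

using namespace llvm;

// Illustrative sketch: query and set the two fast-math flags this post
// cares about on an instruction.
void markNoNaNsNoInfs(Instruction &inst) {
  // Only floating-point operations can carry fast-math flags.
  if (!isa<FPMathOperator>(&inst)) {
    return;
  }

  const FastMathFlags flags = inst.getFastMathFlags();

  // In a real pass these would be guarded by a proof that no NaN or
  // infinity can flow through the operation.
  if (!flags.noNaNs()) {
    inst.setHasNoNaNs(true);
  }

  if (!flags.noInfs()) {
    inst.setHasNoInfs(true);
  }
}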
Propagating these flags can change the assembly the compiler emits in noticeable ways: being able to assume that you will never see a NaN or an infinity means more optimal code paths can be taken. For instance, let us look at the difference between some fcmp instructions, with and without fast-math flags.
; Normal fcmp (as would be generated by a C/C++ a < b check)
define i1 @a(float, float) {
%3 = fcmp olt float %0, %1
ret i1 %3
}
; Same fcmp as above but with fast-math flags
define i1 @b(float, float) {
%3 = fcmp nnan ninf olt float %0, %1
ret i1 %3
}
; The equivalent unordered less than check for comparison
define i1 @c(float, float) {
%3 = fcmp ult float %0, %1
ret i1 %3
}
If the above is compiled for a generic aarch64 target (64-bit ARM), I get:
a: // @a
fcmp s0, s1
cset w0, mi
ret
b: // @b
fcmp s0, s1
cset w0, lt
ret
c: // @c
fcmp s0, s1
cset w0, lt
ret
As can be seen, the code for b and c is identical, but a is different. With the fast-math flags we specified, the compiler generates the same code for an ordered less-than compare as it does for an unordered less-than compare.
One sidenote here: even if there were zero impact on codegen today, I follow a firm philosophy that the more information you can give a compiler, the better the results will be over time. I run into this issue a ton during my day job as a compiler engineer, where frontends won’t encode all the information they know because ‘the compiler did not do anything with it anyway, so who cares’. What happens in the end is a chicken-and-egg problem: the frontends won’t provide the information because nothing is done with it, and the middle/backends don’t optimize in the presence of the information because the frontends don’t generate it. So more information is always a good idea in my book.
Overview of the Pass⌗
The new pass I’ve added is FastMathPropagationPass (see https://github.com/sheredom/fpscev/blob/master/fpscev.cpp#L1384). It runs over a function using an InstVisitor, and uses the fpscev analysis to propagate nnan and ninf when the inputs and outputs of certain operations cannot contain NaNs or infinities.
There are only a few instructions that we can propagate fast-math flags through, so let’s run through them in turn.
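Structurally, the visitor looks something like the sketch below. This is a simplified reconstruction rather than the code from the repository, and FPScalarEvolution is a stand-in name I’m assuming for whatever type holds the fpscev analysis results behind the fpse pointer used in the snippets that follow.
#include "llvm/IR/InstVisitor.h"
#include "llvm/IR/IntrinsicInst.h"

using namespace llvm;

class FPScalarEvolution; // hypothetical stand-in for the fpscev analysis

// Simplified sketch of the pass's visitor; the real FastMathPropagationPass
// lives in fpscev.cpp.
struct FastMathPropagator : InstVisitor<FastMathPropagator> {
  FPScalarEvolution *fpse = nullptr;
  bool modified = false;

  // One visit method per instruction kind the pass can propagate fast-math
  // flags through, each covered in the sections below.
  void visitFCmpInst(FCmpInst &inst);
  void visitUnaryOperator(UnaryOperator &inst);
  void visitBinaryOperator(BinaryOperator &inst);
  void visitIntrinsicInst(IntrinsicInst &inst);
};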
Floating-Point Comparison⌗
As in the example above, the first obvious case for fast-math propagation is floating-point comparison. We’ve already seen a difference in codegen when these flags are specified, so greater minds than mine have (I hope!) already thought about the optimal mapping of these comparisons in the various backends.
%b = fcmp olt float %f, %f2
For the above fcmp statement: if %f and %f2 are both definitely not NaN when I evaluate their ranges, I can set the nnan flag. If they are both definitely not infinity, I can set the ninf flag.
%b = fcmp nnan olt float %f, %f2
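In the pass this looks much the same as the visitBinaryOperator code shown in a later section. Here is a sketch of the comparison case, assuming the fpse pointer and modified flag from the visitor above; it is my reconstruction, not the repository code verbatim.
void visitFCmpInst(FCmpInst &inst) {
  const FPSCEV *const xFpscev = fpse->getFPSCEV(inst.getOperand(0));
  const FPSCEV *const yFpscev = fpse->getFPSCEV(inst.getOperand(1));

  // If neither operand can be an infinity, the compare never sees one.
  if (xFpscev->isFinite() && yFpscev->isFinite()) {
    inst.setHasNoInfs(true);
    modified = true;
  }

  // If neither operand can be a NaN, the compare never sees one either.
  if (!xFpscev->isNaN() && !yFpscev->isNaN()) {
    inst.setHasNoNaNs(true);
    modified = true;
  }
}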
Floating-Point Negation⌗
%f2 = fneg float %f
For negation, if %f is known to be not NaN or not infinity, I can propagate nnan or ninf onto the fneg itself.
%f2 = fneg nnan ninf float %f
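Sketched in the same style as the comparison case above (again a reconstruction under the same assumptions, not the repository code verbatim), the negation visitor only needs to inspect its single operand, since fneg never introduces a NaN or an infinity on its own:
void visitUnaryOperator(UnaryOperator &inst) {
  // fneg is currently the only floating-point unary operator.
  if (inst.getOpcode() != Instruction::FNeg) {
    return;
  }

  const FPSCEV *const xFpscev = fpse->getFPSCEV(inst.getOperand(0));

  // Negation just flips the sign, so the operand's properties carry over.
  if (xFpscev->isFinite()) {
    inst.setHasNoInfs(true);
    modified = true;
  }

  if (!xFpscev->isNaN()) {
    inst.setHasNoNaNs(true);
    modified = true;
  }
}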
Floating-Point Binary Operators⌗
%f3 = fadd float %f, %f2
%f4 = fsub float %f, %f2
%f5 = fmul float %f, %f2
%f6 = fdiv float %f, 4.000000e+08
For the binary operators fadd, fsub, fmul, fdiv, and frem, rather than teaching the pass to re-derive the intricacies of each operation, I simply look at the fpscevs of the input operands and the fpscev of the instruction itself. If the inputs and the instruction are all known not to be NaN, we can set the nnan flag, and likewise if they are all known to be finite, we can set the ninf flag.
void visitBinaryOperator(BinaryOperator &inst) {
  // Only the floating-point binary operators are interesting here.
  switch (inst.getOpcode()) {
  default:
    return;
  case Instruction::FAdd:
  case Instruction::FSub:
  case Instruction::FMul:
  case Instruction::FDiv:
  case Instruction::FRem:
    break;
  }

  // Look up the ranges of both operands and of the result itself.
  const FPSCEV *const xFpscev = fpse->getFPSCEV(inst.getOperand(0));
  const FPSCEV *const yFpscev = fpse->getFPSCEV(inst.getOperand(1));
  const FPSCEV *const fpscev = fpse->getFPSCEV(&inst);

  // If the result and both inputs are finite, no infinity can appear.
  if (fpscev->isFinite() && xFpscev->isFinite() && yFpscev->isFinite()) {
    inst.setHasNoInfs(true);
    modified = true;
  }

  // If the result and both inputs cannot be NaN, no NaN can appear.
  if (!fpscev->isNaN() && !xFpscev->isNaN() && !yFpscev->isNaN()) {
    inst.setHasNoNaNs(true);
    modified = true;
  }
}
This lets us annotate the above operations whenever their fpscev ranges rule out NaNs and infinities.
%f3 = fadd nnan ninf float %f, %f2
%f4 = fsub nnan ninf float %f, %f2
%f5 = fmul nnan ninf float %f, %f2
%f6 = fdiv nnan ninf float %f, 4.000000e+08
Floating-Point Intrinsics⌗
For all the floating-point intrinsics that can take fast-math flags, I follow a similar approach to how I handle binary operators.
void visitIntrinsicInst(IntrinsicInst &inst) {
  bool atLeastOneFP = false;
  bool allFinite = true;
  bool allNotNaN = true;

  // Check every argument that has an fpscev (every floating-point value
  // should already have one).
  for (Value *const arg : inst.args()) {
    const FPSCEV *const fpscev = fpse->getFPSCEV(arg);

    if (fpscev) {
      atLeastOneFP = true;
      allFinite = allFinite && fpscev->isFinite();
      allNotNaN = allNotNaN && !fpscev->isNaN();
    }
  }

  // The result has to conform too.
  const FPSCEV *const fpscev = fpse->getFPSCEV(&inst);

  if (fpscev) {
    atLeastOneFP = true;
    allFinite = allFinite && fpscev->isFinite();
    allNotNaN = allNotNaN && !fpscev->isNaN();
  }

  // Only set flags if this was a floating-point intrinsic at all.
  if (atLeastOneFP) {
    if (allFinite) {
      inst.setHasNoInfs(true);
      modified = true;
    }

    if (allNotNaN) {
      inst.setHasNoNaNs(true);
      modified = true;
    }
  }
}
Basically: loop over all the arguments to the intrinsic and, for each that has an fpscev (which all floating-point values should already have), record whether it could be NaN or infinity. Then do the same for the result’s fpscev, to make sure the result conforms too.
If I have seen at least one floating-point argument or result, and none of them can be NaN or infinity, I can set the corresponding fast-math flags.
%f3 = call nnan ninf float @llvm.fma.f32(float %f, float %f2, float 4.000000e+00)
This lets us correctly handle all the floating-point intrinsics I previously identified, and propagate fast-math flags onto them.
Conclusion⌗
So now I’ve got a pass that uses my fpscev analysis to change the IR - hopefully for the better!
One place where this kind of propagation could be super useful is after inlining has taken place. Let’s assume I have a function:
define float @called_func(float %x, float %y) {
%f = fmul float %x, %y
%f2 = fadd float %f, 1.0
ret float %f2
}
define float @func(float %x, float %y) {
%f = call nnan ninf float @called_func(float %x, float %y)
ret float %f
}
And let’s look at the function after inlining:
define float @func(float %x, float %y) {
%f.i = fmul float %x, %y
%f2.i = fadd float %f.i, 1.000000e+00
ret float %f2.i
}
You can see that the nnan and ninf flags on the original call are lost after inlining. %x and %y were not NaN and not infinity, so the fmul and fadd could have had the nnan/ninf flags on them. We could change the inliner to use my fpscev pass to propagate this information after inlining has taken place, which could be super powerful!
In the next post I’ll look at the next of my optimization ideas: instruction simplification using the fpscev ranges. Stay tuned for some really cool optimization investigations to come.