Using Floating-Point Scalar Evolution to Simplify Instructions
In this post, we’ll use floating-point scalar evolution (fpscev) to simplify instructions based on any range data we’ve managed to deduce.
This post is the latest in a series about my experimentations with floating-point scalar evolution; you really want to read them in order:
- An Experimental Floating-Point Scalar Evolution
- Using Floating-Point Scalar Evolution to Propagate Fast-Math Flags
The LLVM optimization pass discussed in this post is available on github here.
Instruction Simplification⌗
LLVM is missing a bunch of instruction simplifications surrounding floating-point - precisely because it is missing the fpscev analysis. Take the following:
%f2 = tail call float @llvm.sqrt.f32(float %f)
In the above example, let's say we have deduced that %f is always negative. The rules for sqrt state that if the input is negative, the output is NaN. LLVM cannot spot this because it cannot deduce the range of %f.
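As a quick sanity check of that rule in plain C++ (IEEE 754 requires sqrt of any negative input other than -0.0 to produce a quiet NaN):

#include <cmath>
#include <cstdio>

int main() {
  // sqrt of a negative number is a quiet NaN per IEEE 754.
  const float f = std::sqrt(-4.0f);
  std::printf("isnan = %d\n", std::isnan(f)); // isnan = 1
}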
Another example is with the minnum/maxnum/minimum/maximum intrinsics:
%f3 = tail call float @llvm.minnum.f32(float %f, float %f2)
If %f2 was strictly less than %f, then we could replace this call with %f2, since it would always be the smaller value. Again, LLVM cannot spot this.
LLVM does have some pseudo floating-point evaluation surrounding fcmp and the various round-to-integer intrinsics. For instance, the following fcmp:
%b = fcmp oeq float %f, %f2
Will be simplified to the constant false if the compiler has locally worked out that %f can never be equal to %f2.
It is great that LLVM has some insight into these values - but I'd still argue that a single mechanism based around the fpscev analysis would be more fruitful for optimizations in the long term.
Overview of the Pass⌗
The new pass I've added is FPInstSimplifyPass (see https://github.com/sheredom/fpscev/blob/master/fpscev.cpp#L1585) - it runs over a function using an InstVisitor, and uses the fpscev analysis to simplify instructions and intrinsics.
The main purpose of this pass is to remove instructions, or replace them with simpler variants, where knowing the range of the inputs lets us deduce a simpler form.
For example, let's say we have a simple code snippet of the form:
define i1 @fcmp_oeq_false(i4 %i, i4 %i2) {
%f = uitofp i4 %i to float
%f2 = uitofp i4 %i2 to float
%f3 = fadd float %f2, 16.0
%b = fcmp oeq float %f3, %f
ret i1 %b
}
Because %f3 and %f cannot overlap (and neither can be NaN), their values can never compare equal. LLVM already manages to optimize this to:
define i1 @fcmp_oeq_false(i4 %i, i4 %i2) {
ret i1 false
}
It does this in a roundabout way though - it turns the floating-point operations above into integer operations (in the float-to-int pass), and then relies on scalar evolution of integers to optimize the code.
Another case, this time with something that fails to optimize with LLVM:
define i1 @fcmp_ole_true(i4 %i, i4 %i2) {
%f = uitofp i4 %i to float
%f2 = uitofp i4 %i2 to float
%f3 = fadd float %f2, 15.0
%b = fcmp ole float %f, %f3
ret i1 %b
}
LLVM turns this into:
define i1 @fcmp_ole_true(i4 %i, i4 %i2) {
%1 = zext i4 %i to i32
%2 = zext i4 %i2 to i32
%f31 = add nuw nsw i32 %2, 15
%b2 = icmp uge i32 %f31, %1
ret i1 %b2
}
A little weird - but it basically canonicalizes the ordered-less-than-equal into an ordered-greater-than-equal (and swaps the operands), and then converts this into integer operations like it did above. For some reason LLVM won’t optimize this further (even though I think integer scalar-evolution should be able to?).
I personally think for both the above cases LLVM shouldn't be converting them to use integers, and I suspect it does this only so that scalar evolution can help with optimization. Instead, my fpscev analysis can be used to optimize both of the cases above. Let's take a look!
Floating-Point Comparison⌗
Let's start our floating-point comparison optimization by looking at the example above that LLVM doesn't do a fantastic job of:
define i1 @fcmp_ole_true(i4 %i, i4 %i2) {
%f = uitofp i4 %i to float
%f2 = uitofp i4 %i2 to float
%f3 = fadd float %f2, 15.0
%b = fcmp ole float %f, %f3
ret i1 %b
}
In this example %f is always ordered-less-than-equal to %f3, because %f3 is in the range [15.0..30.0] and %f is in the range [0.0..15.0].
Using my fpscev analysis we can use the range of these values to fold this fcmp away:
FastMathFlags fmf = inst.getFastMathFlags();

const FPSCEV xFpscev =
    fpse->getFPSCEV(inst.getOperand(0))->cloneWithFastMathFlags(fmf);
const FPSCEV yFpscev =
    fpse->getFPSCEV(inst.getOperand(1))->cloneWithFastMathFlags(fmf);

CmpInst::Predicate predicate = CmpInst::BAD_FCMP_PREDICATE;

switch (inst.getPredicate()) {
default:
  return;
case CmpInst::FCMP_OLE:
  if (!xFpscev.isNaN() && !yFpscev.isNaN()) {
    if (xFpscev.isLessThanEqual(yFpscev)) {
      predicate = CmpInst::FCMP_TRUE;
    } else if (xFpscev.isGreaterThan(yFpscev)) {
      predicate = CmpInst::FCMP_FALSE;
    }
  }
  break;
  // all the other cases...
}

if (predicate != CmpInst::BAD_FCMP_PREDICATE) {
  inst.setPredicate(predicate);
  modified = true;
}
Since LLVM already has logic to fold FCMP_TRUE and FCMP_FALSE away, I just modify the predicate of the existing fcmp operation and let LLVM handle the rest.
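The range comparisons are where the real work happens. As a rough mental model - a minimal sketch assuming an FPSCEV that tracks a [min, max] interval of APFloats (the min/max fields match the ones used elsewhere in the pass, but this helper is illustrative, not the exact fpscev.cpp code):

// x is always ordered-less-than-equal y iff the largest value x can take
// compares less-than-or-equal to the smallest value y can take. Note that
// APFloat::compare returns cmpUnordered when either side is a NaN, which
// safely fails both tests below.
bool isLessThanEqual(const FPSCEV &x, const FPSCEV &y) {
  const APFloat::cmpResult result = x.max.compare(y.min);
  return result == APFloat::cmpLessThan || result == APFloat::cmpEqual;
}

bool isGreaterThan(const FPSCEV &x, const FPSCEV &y) {
  return x.min.compare(y.max) == APFloat::cmpGreaterThan;
}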
The resulting code after applying this optimization is:
define i1 @fcmp_ole_true(i4 %i, i4 %i2) {
%f = uitofp i4 %i to float
%f2 = uitofp i4 %i2 to float
%f3 = fadd float %f2, 1.500000e+01
%b = fcmp true float %f, %f3
ret i1 %b
}
As can be seen, the fcmp ole has become an fcmp true - the optimization has correctly noted that the compare will always return true. Note that because this is an ordered compare, we had to be careful with NaNs: I check the fpscevs of both inputs to the fcmp to see whether either of them could be NaN, and don't apply the optimization if so. This is another case where my fpscev analysis, precisely because it is very good at detecting no-NaN situations, allows us to perform a greater set of optimizations across a broader set of input values.
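The ordered/unordered distinction is worth spelling out: an ordered compare is false whenever either operand is NaN, while its unordered counterpart is true in that case. In C++ terms (an illustration of the fcmp semantics, not LLVM code):

#include <cmath>

// fcmp oeq semantics: ordered and equal.
bool fcmp_oeq(float a, float b) {
  return !std::isnan(a) && !std::isnan(b) && a == b;
}

// fcmp ueq semantics: unordered or equal.
bool fcmp_ueq(float a, float b) {
  return std::isnan(a) || std::isnan(b) || a == b;
}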
Intrinsics⌗
The other case where I try and use fpscev analysis to simplify instructions is with intrinsics. LLVM is pretty poor at folding floating-point min/max operations, for instance, and fpscev allows us to perform similar optimizations on these as we did with fcmp.
One thing that sucks with LLVM’s InstVisitor is that it doesn’t have a nice way to iterate through intrinsics. The best method I’ve found is to do something like:
void visitIntrinsicInst(IntrinsicInst &inst) {
  switch (inst.getIntrinsicID()) {
  default:
    break;
  case Intrinsic::sqrt:
    // do something!
    break;
  }
}
But this means you end up with uber-functions containing a ton of implementation detail on each intrinsic. Also the fact that everything inhabits a big switch statement means that it is easy to fall foul of C++’s insane switch fallthrough default.
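For reference, the classic trap looks like this (with hypothetical simplifySqrt/simplifyPow helpers - nothing to do with fpscev, just the language hazard the macro approach below sidesteps):

switch (inst.getIntrinsicID()) {
case Intrinsic::sqrt:
  simplifySqrt(inst);
  // Whoops - no break, so every sqrt also falls through into the pow
  // handling below!
case Intrinsic::pow:
  simplifyPow(inst);
  break;
}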
To get around this I've used some very basic templates to give each intrinsic optimization its own function, with a macro to avoid duplicating the intrinsic ID:
void FPInstSimplifyPass::visitIntrinsicInst(IntrinsicInst &inst) {
  switch (inst.getIntrinsicID()) {
  default:
    return;
#define INTRINSIC_VISIT(x)                                                     \
  case x:                                                                      \
    return visitIntrinsic<x>(inst)
  INTRINSIC_VISIT(Intrinsic::sqrt);
  INTRINSIC_VISIT(Intrinsic::pow);
  INTRINSIC_VISIT(Intrinsic::fabs);
  INTRINSIC_VISIT(Intrinsic::minnum);
  INTRINSIC_VISIT(Intrinsic::maxnum);
  INTRINSIC_VISIT(Intrinsic::minimum);
  INTRINSIC_VISIT(Intrinsic::maximum);
  INTRINSIC_VISIT(Intrinsic::copysign);
  INTRINSIC_VISIT(Intrinsic::floor);
  INTRINSIC_VISIT(Intrinsic::ceil);
  INTRINSIC_VISIT(Intrinsic::trunc);
  INTRINSIC_VISIT(Intrinsic::rint);
  INTRINSIC_VISIT(Intrinsic::nearbyint);
  INTRINSIC_VISIT(Intrinsic::round);
#undef INTRINSIC_VISIT
  }
}
This just lets each of the intrinsics above inhabit its own function, which makes debugging much easier (at least for me). Note - I use this pattern pseudo-regularly in C++, and I always recommend that you do not provide a default implementation of visitIntrinsic. That way you'll get a linker error if you forget to explicitly add the visit method for an intrinsic.
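Concretely, the shape of the trick is to declare the template member but define only explicit specializations (a sketch of the pattern, not a verbatim excerpt from the pass):

class FPInstSimplifyPass final : public FunctionPass {
  // Declared but deliberately never defined. Only the explicit
  // specializations exist, so dispatching to an intrinsic without a
  // matching specialization fails at link time instead of silently
  // compiling to a no-op.
  template <Intrinsic::ID> void visitIntrinsic(IntrinsicInst &inst);
};

// One explicit specialization per supported intrinsic:
template <>
void FPInstSimplifyPass::visitIntrinsic<Intrinsic::sqrt>(IntrinsicInst &inst) {
  // ...
}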
Square Root⌗
For the sqrt intrinsic, the main thing we can exploit is negative inputs: a negative input to sqrt always results in a NaN.
template <>
void FPInstSimplifyPass::visitIntrinsic<Intrinsic::sqrt>(
    IntrinsicInst &inst) {
  FastMathFlags fmf = inst.getFastMathFlags();
  Value *const x = inst.getOperand(0);
  const FPSCEV xFpscev = fpse->getFPSCEV(x)->cloneWithFastMathFlags(fmf);

  // If the input is all negative, the result is NaN.
  if (xFpscev.isAllNegative()) {
    // If the instruction cannot return NaNs, replace it with undef.
    if (inst.hasNoNaNs()) {
      inst.replaceAllUsesWith(UndefValue::get(inst.getType()));
    } else {
      inst.replaceAllUsesWith(ConstantFP::getNaN(inst.getType()));
    }

    toRemoves.push_back(&inst);
  }
}
So the code above gets the fpscev for the input, checks whether every value it could take is negative, and if so removes the sqrt entirely.
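Under the interval model, isAllNegative itself is conceptually tiny - again a sketch rather than the exact fpscev.cpp code, and I'm assuming isNaN() follows the pass's convention of meaning "may be a NaN" (NaN is neither negative nor positive, so such a range must report false):

// Every value in [min, max] is negative iff the top of the range is
// negative - and the range must not admit a NaN.
bool isAllNegative(const FPSCEV &fpscev) {
  return !fpscev.isNaN() && fpscev.max.isNegative();
}

With that in mind, consider: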
define float @sqrt_all_negative(float %f) {
%f2 = call float @llvm.copysign.f32(float %f, float -1.0)
%f3 = call float @llvm.sqrt.f32(float %f2)
ret float %f3
}
In the above example we can see that the input to sqrt is always negative, and so my optimization will turn the code into:
define float @sqrt_all_negative(float %f) {
%f2 = call float @llvm.copysign.f32(float %f, float -1.000000e+00)
ret float 0x7FF8000000000000
}
That is, it always returns a NaN.
There is another case we have to be wary of - what if the sqrt intrinsic has a no-NaNs fast-math flag specified?
define float @sqrt_all_negative_nnan(i4 %i) {
%f = uitofp i4 %i to float
%f2 = fneg float %f
%f3 = call nnan float @llvm.sqrt.f32(float %f2)
ret float %f3
}
In this case the user has specifically said that sqrt cannot produce a NaN, but we know 100% that a NaN is being produced. This results in what LLVM calls a poison value. LLVM does not have a way to explicitly create a poison value, so in the above case I replace the sqrt with an undef instead:
define float @sqrt_all_negative_nnan(i4 %i) {
%f = uitofp i4 %i to float
%f2 = fneg float %f
ret float undef
}
X to the Power of Y⌗
The pow intrinsic is one whose range I couldn't get a great handle on for fpscev, due to the complexity of the function. One thing I can do though is deduce the range of Y - the power we are raising X to. LLVM has a companion intrinsic powi - X to the power of integer Y. I've implemented the pow/powi functions to OpenCL-level ULP requirements in a past life, and I know that one of the complications of pow is that a fractional component in the power increases the complexity of the calculation. Knowing that the power being raised to is entirely integer (as is the case with powi) lets math library writers do some fun optimizations (I vaguely recall powi calls being at least 20% faster than pow when I worked on this many years ago!).
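To see where that speedup comes from: an integer power can be computed with plain multiplies via square-and-multiply, with no log/exp evaluation at all. A minimal sketch (not any particular libm's implementation):

// Exponentiation by squaring: O(log n) multiplies, no transcendental math.
float powiSketch(float x, int n) {
  unsigned exp = (n < 0) ? -static_cast<unsigned>(n) : n;
  float result = 1.0f;
  while (exp != 0) {
    if (exp & 1)
      result *= x;
    x *= x;
    exp >>= 1;
  }
  return (n < 0) ? 1.0f / result : result;
}

With that context, here is the pow visitor: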
template <>
void FPInstSimplifyPass::visitIntrinsic<Intrinsic::pow>(
    IntrinsicInst &inst) {
  FastMathFlags fmf = inst.getFastMathFlags();
  Value *const x = inst.getOperand(0);
  Value *const y = inst.getOperand(1);
  const FPSCEV xFpscev = fpse->getFPSCEV(x)->cloneWithFastMathFlags(fmf);
  const FPSCEV yFpscev = fpse->getFPSCEV(y)->cloneWithFastMathFlags(fmf);

  // If y is an integer.
  if (yFpscev.isInteger) {
    const fltSemantics &semantics = xFpscev.min.getSemantics();

    // 2^24 is the last integer number that we can fully represent in both
    // floating-point and i32.
    const APInt minInt(32, -(1 << 24), true);
    const APInt maxInt(32, 1 << 24, true);

    // If y is between min/max int, we can use powi instead of pow!
    if (yFpscev.isGreaterThanEqual(getFromInt(semantics, minInt, true)) &&
        yFpscev.isLessThanEqual(getFromInt(semantics, maxInt, true))) {
      IRBuilder<> irb(&inst);

      StringRef yName(y->getName());
      Twine name(yName, ".i32cast");
      Value *const yCast = irb.CreateFPToSI(y, irb.getInt32Ty(), name);

      Instruction *const powi = irb.CreateIntrinsic(
          Intrinsic::powi, inst.getType(), {x, yCast}, &inst);
      powi->takeName(&inst);

      // Copy the fpscev information from the original instruction onto the
      // new one.
      fpse->copyFPSCEVFromValue(powi, &inst);

      inst.replaceAllUsesWith(powi);
      toRemoves.push_back(&inst);
    }
  }
}
The two conditions we need to meet to convert pow to powi are that Y must be an integer hiding in a floating-point value (which fpscev tracks), and that this integer must lie within ±2^24. 2^24 is an interesting value because it is the last integer that can be represented without conversion loss in a 32-bit float - beyond it, consecutive integers start to be skipped.
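That 2^24 boundary is easy to demonstrate for yourself - a float has a 24-bit significand (23 stored bits plus the implicit leading bit), so 16777216 converts exactly while 16777217 does not:

#include <cstdio>

int main() {
  // 2^24 = 16777216 is exactly representable in a 32-bit float...
  std::printf("%.1f\n", 16777216.0f); // prints 16777216.0
  // ...but 2^24 + 1 rounds back down to 2^24 on conversion.
  std::printf("%.1f\n", 16777217.0f); // prints 16777216.0
}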
Once these two conditions are met, we can convert Y to a signed integer, and then call the powi intrinsic instead.
define float @pow(float %f, i25 %i2) {
%f2 = sitofp i25 %i2 to float
%f3 = call float @llvm.pow.f32(float %f, float %f2)
ret float %f3
}
After the optimization is applied, the above becomes:
define float @pow(float %f, i25 %i2) {
%f2 = sitofp i25 %i2 to float
%f2.i32cast = fptosi float %f2 to i32
%f3 = call float @llvm.powi.f32(float %f, i32 %f2.i32cast)
ret float %f3
}
Floating-Point Absolute Value⌗
Absolute value can be simplified away if the value we are taking the absolute of is already non-negative.
template <>
void FPInstSimplifyPass::visitIntrinsic<Intrinsic::fabs>(
    IntrinsicInst &inst) {
  FastMathFlags fmf = inst.getFastMathFlags();
  Value *const x = inst.getOperand(0);
  const FPSCEV xFpscev = fpse->getFPSCEV(x)->cloneWithFastMathFlags(fmf);

  // If the input is all non-negative, we can just use the input.
  if (xFpscev.isAllNonNegative()) {
    inst.replaceAllUsesWith(x);
    toRemoves.push_back(&inst);
  }
}
We just check if the input x is all non-negative, and if so replace the fabs intrinsic with the value being fed to it. This means the following example:
define float @fabs_already_positive(float %f) {
%f2 = call float @llvm.copysign.f32(float %f, float 1.0)
%f3 = call float @llvm.fabs.f32(float %f2)
ret float %f3
}
Can be folded to:
define float @fabs_already_positive(float %f) {
%f2 = call float @llvm.copysign.f32(float %f, float 1.0)
ret float %f2
}
Minimums and Maximums⌗
For both minnum/maxnum/minimum/maximum I perform a similar optimization - so I’ll just look at maxnum here for the sake of brevity.
template <>
void FPInstSimplifyPass::visitIntrinsic<Intrinsic::maxnum>(
    IntrinsicInst &inst) {
  FastMathFlags fmf = inst.getFastMathFlags();
  Value *const x = inst.getOperand(0);
  Value *const y = inst.getOperand(1);
  const FPSCEV xFpscev = fpse->getFPSCEV(x)->cloneWithFastMathFlags(fmf);
  const FPSCEV yFpscev = fpse->getFPSCEV(y)->cloneWithFastMathFlags(fmf);

  // If one of the inputs is always less than the other, we fold away the
  // intrinsic.
  if (xFpscev.isGreaterThanEqual(yFpscev)) {
    inst.replaceAllUsesWith(x);
    toRemoves.push_back(&inst);
  } else if (yFpscev.isGreaterThanEqual(xFpscev)) {
    inst.replaceAllUsesWith(y);
    toRemoves.push_back(&inst);
  }
}
For maxnum, if we know that either of the inputs is always greater than or equal to the other (and that neither of them is NaN), we can replace the call to maxnum with the input that is guaranteed to dominate the other.
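The NaN caveat exists because maxnum has fmax-style semantics: given exactly one NaN operand it returns the other operand rather than propagating the NaN, so a NaN-admitting range could flip which input wins. In C++ terms:

#include <cassert>
#include <cmath>

int main() {
  // std::fmax matches llvm.maxnum: a lone NaN operand is ignored.
  assert(std::fmax(NAN, 1.0f) == 1.0f);
  assert(std::fmax(1.0f, NAN) == 1.0f);
}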
This means that the following code:
define float @maxnum_x(i4 %i, i4 %i2) {
%f = uitofp i4 %i to float
%f2 = uitofp i4 %i2 to float
%f3 = fadd float %f2, 16.0
%f4 = call float @llvm.maxnum.f32(float %f3, float %f)
ret float %f4
}
Is folded to:
define float @maxnum_x(i4 %i, i4 %i2) {
%f = uitofp i4 %i to float
%f2 = uitofp i4 %i2 to float
%f3 = fadd float %f2, 1.600000e+01
ret float %f3
}
Removing the call to maxnum entirely! This same optimization technique is applied to all four of the min/max intrinsics as I previously stated.
Copy Sign⌗
The intrinsic copysign has a few interesting optimization avenues.
- If the two inputs always have the same sign, then the intrinsic is a no-op and we can just return X.
- If the two inputs always have opposite signs, then we are really just negating the input X.
- If the input Y from which we are copying the sign is always non-negative, then we are really just calling fabs on the input X (a bit-level sketch of copysign follows this list).
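All three cases fall straight out of what copysign does at the bit level: it glues the magnitude bits of X onto the sign bit of Y. A sketch for 32-bit floats, assuming the standard IEEE 754 single-precision layout (illustrative C++20, not LLVM's lowering):

#include <bit>
#include <cstdint>

// copysign(x, y): the magnitude of x with the sign of y.
float copysignSketch(float x, float y) {
  const std::uint32_t xBits = std::bit_cast<std::uint32_t>(x);
  const std::uint32_t yBits = std::bit_cast<std::uint32_t>(y);
  return std::bit_cast<float>((xBits & 0x7fffffffu) | (yBits & 0x80000000u));
}

And here is how the pass implements the three cases: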
template <>
void FPInstSimplifyPass::visitIntrinsic<Intrinsic::copysign>(
    IntrinsicInst &inst) {
  FastMathFlags fmf = inst.getFastMathFlags();
  Value *const x = inst.getOperand(0);
  Value *const y = inst.getOperand(1);
  const FPSCEV xFpscev = fpse->getFPSCEV(x)->cloneWithFastMathFlags(fmf);
  const FPSCEV yFpscev = fpse->getFPSCEV(y)->cloneWithFastMathFlags(fmf);

  if ((xFpscev.isAllNonNegative() && yFpscev.isAllNonNegative()) ||
      (xFpscev.isAllNegative() && yFpscev.isAllNegative())) {
    inst.replaceAllUsesWith(x);
    toRemoves.push_back(&inst);
  } else if ((xFpscev.isAllNonNegative() && yFpscev.isAllNegative()) ||
             (xFpscev.isAllNegative() && yFpscev.isAllNonNegative())) {
    IRBuilder<> irb(&inst);
    Value *const value = irb.CreateFNeg(x);
    value->takeName(&inst);

    // Copy the fpscev information from the original instruction onto the
    // new one.
    fpse->copyFPSCEVFromValue(value, &inst);

    if (Instruction *const otherInst = dyn_cast<Instruction>(value)) {
      otherInst->setFastMathFlags(inst.getFastMathFlags());
    }

    inst.replaceAllUsesWith(value);
    toRemoves.push_back(&inst);
  } else if (yFpscev.isAllNonNegative()) {
    IRBuilder<> irb(&inst);
    Instruction *const fabs =
        irb.CreateUnaryIntrinsic(Intrinsic::fabs, x, &inst);
    fabs->takeName(&inst);

    // Copy the fpscev information from the original instruction onto the
    // new one.
    fpse->copyFPSCEVFromValue(fabs, &inst);

    inst.replaceAllUsesWith(fabs);
    toRemoves.push_back(&inst);
  }
}
Above we can see the three optimization cases we can apply - which map onto five variants of copysign we could see in the wild. The following example has all five of these cases:
define float @copysign_both_positive(i4 %i, i4 %i2) {
%f = uitofp i4 %i to float
%f2 = uitofp i4 %i2 to float
%f3 = call float @llvm.copysign.f32(float %f, float %f2)
ret float %f3
}
define float @copysign_both_negative(i4 %i, i4 %i2) {
%f = uitofp i4 %i to float
%f2 = uitofp i4 %i2 to float
%f3 = fneg float %f
%f4 = fneg float %f2
%f5 = call float @llvm.copysign.f32(float %f3, float %f4)
ret float %f5
}
define float @copysign_both_opposite_0(i4 %i, i4 %i2) {
%f = uitofp i4 %i to float
%f2 = uitofp i4 %i2 to float
%f3 = fneg float %f
%f4 = call float @llvm.copysign.f32(float %f2, float %f3)
ret float %f4
}
define float @copysign_both_opposite_1(i4 %i, i4 %i2) {
%f = uitofp i4 %i to float
%f2 = uitofp i4 %i2 to float
%f3 = fneg float %f
%f4 = call float @llvm.copysign.f32(float %f3, float %f2)
ret float %f4
}
define float @copysign_y_positive(float %f, i4 %i) {
%f2 = uitofp i4 %i to float
%f3 = call float @llvm.copysign.f32(float %f, float %f2)
ret float %f3
}
And after my optimization is applied:
define float @copysign_both_positive(i4 %i, i4 %i2) {
%f = uitofp i4 %i to float
%f2 = uitofp i4 %i2 to float
ret float %f
}
define float @copysign_both_negative(i4 %i, i4 %i2) {
%f = uitofp i4 %i to float
%f2 = uitofp i4 %i2 to float
%f3 = fneg float %f
%f4 = fneg float %f2
ret float %f3
}
define float @copysign_both_opposite_0(i4 %i, i4 %i2) {
%f = uitofp i4 %i to float
%f2 = uitofp i4 %i2 to float
%f3 = fneg float %f
%f4 = fsub float -0.000000e+00, %f2
ret float %f4
}
define float @copysign_both_opposite_1(i4 %i, i4 %i2) {
%f = uitofp i4 %i to float
%f2 = uitofp i4 %i2 to float
%f3 = fneg float %f
%f4 = fsub float -0.000000e+00, %f3
ret float %f4
}
define float @copysign_y_positive(float %f, i4 %i) {
%f2 = uitofp i4 %i to float
%f3 = call float @llvm.fabs.f32(float %f)
ret float %f3
}
Note: I don’t really understand why LLVM has produced an fsub from -0 instead of an fneg when I used the IRBuilder here - but it should effectively be the same thing (fsub -0.0, x was LLVM's canonical way to negate before the dedicated fneg instruction existed, so presumably the builder still lowers to that form in the version I'm using).
Floating-Point Rounding⌗
The rounding intrinsics floor/ceil/trunc/rint/nearbyint/round can all be folded to their input if the input is already an integer (rounding a whole number does nothing). Since the optimization for each intrinsic is the same, let's just look at round:
template <>
void FPInstSimplifyPass::visitIntrinsic<Intrinsic::round>(
    IntrinsicInst &inst) {
  FastMathFlags fmf = inst.getFastMathFlags();
  Value *const x = inst.getOperand(0);
  const FPSCEV xFpscev = fpse->getFPSCEV(x)->cloneWithFastMathFlags(fmf);

  // If the input is already an integer, round is a no-op.
  if (xFpscev.isInteger) {
    inst.replaceAllUsesWith(x);
    toRemoves.push_back(&inst);
  }
}
The code is simple - if the input is an integer, replace the intrinsic with the input. This will fold code like:
define float @round_already_integer(i4 %i) {
%f = uitofp i4 %i to float
%f2 = call float @llvm.round.f32(float %f)
ret float %f2
}
To:
define float @round_already_integer(i4 %i) {
%f = uitofp i4 %i to float
ret float %f
}
Note these are optimizations that LLVM will do in very simple cases anyway (like the above example) - but LLVM will fall apart once you start to hit more complex variants. As I’ve said before - removing these piecemeal optimizations in LLVM and unifying them around the more correct fpscev analysis seems like a better approach for the long term.
Conclusion⌗
With this pass we’ve started to really get into the fun of fpscev - instruction simplification. Being able to fold complex code away into simpler variants is what I got into the compiler business for in the first place, and with fpscev we have a great extra tool to do that.
In the next post I’ll look at refining the ranges calculated for some of the more complex fpscev intrinsics, using some advice from the rather fantastic Marc B. Reynolds. Look forward to seeing you there!