For convenience, math/flonum re-exports racket/flonum
as well as providing the functions document below.
2.1 Additional Flonum Functions
|> (fl 1/2)|
|> (fl 0.5)|
|> (fl #i0.5)|
Computes (flsqrt (+ (* x x) (* y y)))
in way that overflows only when the
answer is too large.
|> (flsqrt (+ (* 1e+200 1e+200) (* 1e+199 1e+199)))|
|> (flhypot 1e+200 1e+199)|
Like (apply + xs)
, but incurs rounding error only once.
|> (+ 1.0 1e-16)|
|> (+ (+ 1.0 1e-16) 1e-16)|
|> (flsum '(1.0 1e-16 1e-16))|
function does the same for heterogenous lists of reals.
Worst-case time complexity is O(n2), though the pathological
inputs needed to observe quadratic time are exponentially improbable and are hard
to generate purposely. Expected time complexity is O(n log(n)).
See flvector-sums for a variant that computes all the partial sums in xs.
Maximum observed error is 2 ulps, making these functions (currently) much more accurate than their
racket/math counterparts. They also return sensible values on the largest possible domain.
These functions are as robust and accurate as their corresponding inverses.
Like (fl (factorial (fl->exact-integer n)))
and so on, but computed in constant
time. Also, these return +nan.0
instead of raising exceptions.
For factorial-like functions that return sensible values for non-integers, see
gamma and beta.
Like (fllog (flfactorial n))
and so on, but more accurate and without unnecessary overflow.
For log-factorial-like functions that return sensible values for non-integers, see
log-gamma and log-beta.
Like (fllog (+ 1.0 x))
and (- (flexp x) 1.0)
, but accurate when
is small (within 1 ulp
For example, one difficult input for (fllog (+ 1.0 x))
and (- (flexp x) 1.0)
, which fllog1p
|> (fllog (+ 1.0 1e-14))|
|> (fllog1p 1e-14)|
|> (- (flexp 1e-14) 1.0)|
|> (flexpm1 1e-14)|
These functions are mutual inverses:
Notice that both graphs pass through the origin. Thus, inputs close to 0.0,
around which flonums are particularly dense, result in outputs that are also close
to 0.0. Further, both functions are approximately the identity function
near 0.0, so the output density is approximately the same.
Many flonum functions defined in terms of fllog and flexp
become much more accurate when their defining expressions are put in terms of
fllog1p and flexpm1. The functions exported by this module and
by math/special-functions use them extensively.
One notorious culprit is (flexpt (- 1.0 x) y)
, when x
. Computing it directly too often results in the wrong answer:
We should expect that multiplying a number just less than 1.0 by itself
that many times would result in something less than 1.0. The problem
comes from subtracting such a small number from 1.0 in the first place:
Fortunately, we can compute this correctly by putting the expression in terms
, which avoids the error-prone subtraction:
But see flexpt1p
, which is more accurate still.
Like (flexpt (+ 1.0 x) y)
, but accurate for any x
Like (flexpt (+ x1 x2) y)
, but more accurate.
Equivalent to (flexpt 2.0 x)
, but faster when x
is an integer.
Computes the base-2 log of x
more accurately than (/ (fllog x) (fllog 2.0))
In particular, (fllog2 x)
is correct for any power of two x
Maximum observed error is 0.5006 ulps
, but is almost always no more than 0.5 (i.e. it is
almost always correct
Computes the base-b
log of x
more accurately than (/ (fllog x) (fllog b))
and handles limit values correctly.
Maximum observed error is 2.1 ulps, but is usually less than 0.7 (i.e. near rounding error).
Except possibly at limit values (such as 0.0
, and b = 1.0
and except when the inner expression underflows or overflows, fllogb
these identities for b > 0.0
Unlike with flexpt
, there is no standard for fllogb
’s behavior at limit values.
Fortunately, deriving the following rules (applied in order) is not prohibitively difficult.
Most of these rules are derived by taking limits of the mathematical base-b
Except for (fllogb 1.0 x)
, when doing so gives rise to ambiguities, they are resolved using
’s behavior, which follows the IEEE 754 and C99 standards for pow
For example, consider (fllogb 0.0 0.0).
Taking an interated limit, we get ∞ if the outer limit is with respect to x, or 0 if the
outer limit is with respect to b.
This would normally mean (fllogb 0.0 0.0) = +nan.0.
However, choosing +inf.0 ensures that these additional left-inverse and right-inverse
Further, choosing 0.0 does not ensure that any additional identities hold.
Uses the Brent-Dekker method
find a floating-point root of f
(an x : Flonum
for which (f x)
very near a zero crossing) between a
The values (f a)
and (f b)
must have opposite signs, but a
may be in any order.
There is no guarantee that flbracketed-root will find any particular root.
Moreover, future updates to its implementation could make it find different ones.
There is currently no guarantee that it will find the closest x to an exact root.
It currently runs for at most 5000 iterations.
It usually requires far fewer iterations, especially if the initial bounds a and b are tight.
Equivalent to (λ (y) (flexpt x y))
is a flonum, but much more
accurate for large y
cannot be exactly represented
by a flonum.
Suppose we want to compute πy
, where y
is a flonum.
If we use flexpt
with an approximation
of the irrational base π,
the error is low near zero, but grows with distance from the origin:
, the error is near rounding error everywhere:
This example is used in the implementations of zeta
Like (- (flsqrt (+ 1.0 x)) 1.0)
, but accurate when x
Like (- (fllog1p x) x)
, but accurate when x
Like (flexp (* x x))
, but accurate when x
Like (flexp (- (* x x)))
, but accurate when x
Like (flexp (+ 1.0 x))
, but accurate when x
is near a power of 2.
Like (flsin (* pi x))
, (flcos (* pi x))
and (fltan (* pi x))
but accurate near roots and singularities. When x = (+ n 0.5)
for some integer n
(fltanpix x) = +nan.0
Like (/ 1.0 (flsinpix x))
, (/ 1.0 (flcospix x))
and (/ 1.0 (fltanpix x))
respectively, but the first two return +nan.0
at singularities and flcotpix
a double reciprocal.
2.2 Log-Space Arithmetic
It is often useful, especially when working with probabilities and probability
densities, to represent nonnegative numbers in log space, or as the
natural logs of their true values. Generally, the reason is that the smallest
positive flonum is too large.
For example, say we want the probability density of the standard normal distribution
(the bell curve) at 50 standard deviations from zero:
Mathematically, the density is nonzero everywhere, but the density at 50 is less than
. However, its density in log space, or its log-density, is representable:
While this example may seem contrived, it is very common, when computing the density
of a vector of data, for the product of the densities to be too small to represent directly.
In log space, exponentiation becomes multiplication, multiplication becomes addition, and
addition becomes tricky. See lg+ and lgsum for solutions.
Equivalent to (fl+ logx logy)
, (fl- logx logy)
and (flsum logxs)
When logy > logx, lg- returns +nan.0. Both functions correctly treat
-inf.0 as log-space 0.0.
To add more than two log-space numbers with the same guarantees, use lgsum.
Though more accurate than a naive implementation, both functions are prone to catastrophic
cancellation in regions where they output a value close to 0.0 (or log-space 1.0).
While these outputs have high relative error, their absolute error is very low, and when
exponentiated, nearly have just rounding error. Further, catastrophic cancellation is unavoidable
when logx and logy themselves have error, which is by far the common case.
These are, of course, excuses—but for floating-point research generally. There are currently no
reasonably fast algorithms for computing lg+ and lg- with low relative error.
For now, if you need that kind of accuracy, use math/bigfloat.
Like folding lg+
, but more accurate. Analogous to flsum
Equivalent to (lg+ (fllog 1.0) logx)
and (lg- (fllog 1.0) logx)
respectively, but faster.
, returns #t
when (<= 0.0 x 1.0)
, returns #t
when (<= -inf.0 x 0.0)
2.3 Debugging Flonum Functions
The following functions and constants are useful in authoring and debugging flonum functions
that must be accurate on the largest possible domain.
Suppose we approximate flexp
using its Taylor series centered at 1.0
after three terms (a second-order polynomial):
We can use plot
(documented below) to compare its output
to that of flexp
on very small intervals:
Such plots are especially useful when centered at a boundary between two different
For larger intervals, assuming the approximated function is fairly smooth,
we can get a better idea how close the approximation is using flulp-error
We can infer from this plot that our Taylor series approximation has close to
rounding error (no more than an ulp
) near 1.0
, but quickly becomes worse farther away.
To get a ground-truth function such as exp to test against, compute the
outputs as accurately as possible using exact rationals or high-precision bigfloats.
2.3.1 Measuring Floating-Point Error
, or u
nit in l
the magnitude of the least significant bit in x
|> (flulp 1.0)|
|> (flulp 1e-100)|
|> (flulp 1e+200)|
Returns the absolute number of ulps
difference between x
For non-rational arguments such as +nan.0, flulp-error returns 0.0
if (eqv? x r); otherwise it returns +inf.0.
A flonum function with maximum error 0.5 ulps exhibits only rounding error;
it is correct. A flonum function with maximum error no greater than a few ulps
is accurate. Most moderately complicated flonum functions, when implemented
directly, seem to have over a hundred thousand ulps maximum error.
* You can make an exception when the result is to be exponentiated.
If x has small absolute-error, then (exp x)
has small relative-error and small flulp-error.
The last example subtracts two nearby flonums, the second of which had already been
rounded, resulting in horrendous error. This is an example of catastrophic
. Avoid subtracting nearby flonums whenever possible.*
See relative-error for a similar way to measure approximation error when the
approximation is not necessarily represented by a flonum.
2.3.2 Flonum Constants
The nonzero, rational flonums with maximum and minimum magnitude.
The smallest flonum that can be added to 1.0 to yield a larger number,
or the magnitude of the least significant bit in 1.0.
Epsilon is often used in stopping conditions for iterative or additive approximation methods.
For example, the following function uses it to stop Newton’s method to compute square roots.
(Please do not assume this example is robust.)
When (<= (abs dy) (abs (* 0.5 epsilon.0 y)))
, adding dy
rarely results in a different flonum. The value 0.5
can be changed to allow
looser approximations. This is a good idea when the approximation does not have to be
as close as possible (e.g. it is only a starting point for another approximation method),
or when the computation of dy
is known to be inaccurate.
Approximation error is often understood in terms of relative error in epsilons.
Number of epsilons relative error roughly corresponds with error in ulps, except
when the approximation is subnormal.
2.3.3 Low-Level Flonum Operations
Returns the signed ordinal index of x in a total order over flonums.
These properties mean that flonum->ordinal
does not distinguish -0.0
Returns the flonum n
flonums away from x
, according to flonum->ordinal
, returns +nan.0
|> (flstep 0.0 1)|
|> (flstep (flstep 0.0 1) -1)|
|> (flstep 0.0 -1)|
|> (flstep +inf.0 1)|
|> (flstep +inf.0 -1)|
|> (flstep -inf.0 -1)|
|> (flstep -inf.0 1)|
|> (flstep +nan.0 1000)|
Though flonum operations on subnormal numbers are still often implemented
by software exception handling, the situation is improving. Robust
flonum functions should handle subnormal inputs correctly, and reduce error
in outputs as close to zero ulps as possible.
The maximum positive and negative subnormal flonums. A flonum x
is subnormal when
it is not zero and (<= (abs x) +max-subnormal.0)
2.4 Double-Double Operations
For extra precision, floating-point computations may use two nonoverlapping flonums to represent a single number.
Such pairs are often called double-double numbers.
The exact sum of the pair is the number it represents.
(Because they are nonoverlapping, the floating-point sum is equal to the largest.)
For speed, especially with arithmetic operations, there is no data type for double-double numbers.
They are always unboxed: given as two arguments, and received as two values.
In both cases, the number with higher magnitude is first.
Inputs are never checked to ensure they are sorted and nonoverlapping, but outputs are guaranteed to be
sorted and nonoverlapping if inputs are.
Converts a real number or the sum of two flonums into a double-double
|> (fl 1/7)|
|> (relative-error (fl 1/7) 1/7)|
|> (define-values (x2 x1) (fl2 1/7))|
|> (list x2 x1)|
Notice that the exact sum of x2 and x1 in the preceeding example has very low
If x is not rational, fl2 returns (values x 0.0).
Returns the exact sum of x2 and x1 if x2 is rational, x2 otherwise.
is rational, returns #t
when (flabs x2) > (flabs x1)
is not rational, returns (fl= x1 0.0)
This function is quite slow, so it is used only for testing.
Compute the same values as (fl+ x y)
, (fl- x y)
, (fl* x y)
, (fl/ x y)
(fl* x x)
, (flsqrt x)
, (flexp x)
and (flexpm1 x)
, but return the normally
rounded-off low-order bits as the second value.
The result is an unboxed double-double
Use these functions to generate double-double numbers directly from the results of floating-point operations.
For flexp/error and flexpm1/error, the largest observed error is 3 ulps.
For the rest, the largest observed error is 0.5 ulps.
For arithmetic, error is less than 8 ulps. (See fl2ulp.)
For fl2sqr and fl2sqrt, error is less than 1 ulp, and fl2abs is exact.
For fl2exp and fl2expm1, error is less than 3 ulps. (See fl2ulp.)
For fl2log and fl2log1p, error is less than 2 ulps.
2.4.1 Debugging Double-Double Functions
The unit in last place of a double-double is that of the higher-order of the pair, shifted 52 bits right.
|> (fl2ulp 1.0 0.0)|
The high-order flonum of the maximum-magnitude, subnormal double-double
Try to avoid computing with double-doubles in the subnormal range in intermediate computations.
2.4.2 Low-Level Double-Double Operations
The following syntactic forms are fast versions of functions like fl+/error.
They are fast because they make assumptions about the magnitudes of and relationships between their arguments,
and do not handle non-rational double-double flonums properly.
Return two values: (fl+ x y)
or (fl- x y)
, and its rounding error.
Both assume (flabs x) > (flabs y)
The values are unspecified when x
is not rational.
Returns nonoverlapping (values y2 y1)
, each with 26 bits precision, with (flabs y2) > (flabs y1)
such that (fl+ y2 y1) = x
For (flabs x) > 1.3393857490036326e+300
, returns (values +nan.0 +nan.0)
Used to implement double-double multiplication.
2.5 Additional Flonum Vector Functions
Creates a length-n
flonum vector by applying proc
to the indexes
to (- n 1)
. Analogous to build-vector
, but always inlined. This increases speed at the expense of code size.
to the corresponding elements of xs
. Analogous to
The proc is meant to accept the same number of arguments as the number of its following
flonum vector arguments. However, a current limitation in Typed Racket requires proc
to accept any number of arguments. To map a single-arity function such as fl+
over the corresponding number of flonum vectors, for now, use inline-flvector-map.
Convert between lists and flonum vectors, and between vectors and flonum vectors.
Arithmetic lifted to operate on flonum vectors.
Computes the partial sums of the elements in xs in a way that incurs rounding error only
once for each partial sum.
Compare the same example computed by direct summation:
'(1.0 1.0 1.0 1.0 1.0 1e+100 0.0)