Playing Around with Floating Point Numbers with MATLAB

Computers do not work with real numbers, of which there are infinitely many, but with a finite set of floating-point numbers. Most real numbers cannot be represented in a computer because they fall between two floating-point numbers. Everything we calculate with a computer therefore has only finite accuracy. Here are some MATLAB examples.

In the computer we describe numbers by a finite number (e.g. 32 or 64) of zeros and ones, the so-called binary system. The number of representable values is therefore finite, in contrast to the real numbers. A 64-bit system can distinguish at most 2^64, or about 1.8447e+19, different numbers. These floating-point numbers are not evenly spaced: their density is greatest around zero and decreases toward minus infinity and plus infinity.

To demonstrate the difference between real numbers and floating-point numbers, consider the value x = 4/3, which can be written as 1.33333… with the digit 3 repeating infinitely. Try 4/3 in MATLAB:

format long, 4/3

which yields

ans =
   1.333333333333333

MATLAB rounds the value of 4/3 to the nearest floating-point number. Real numbers are continuous whereas floating-point numbers are not. The result of a calculation can therefore fall into the gap between two floating-point numbers; it is then rounded to the nearest floating-point number and is no longer exact.
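You can expose this rounding by printing more digits than format long shows, for example with fprintf (the 20 digits requested here go well beyond the roughly 16 significant decimal digits a double can hold):

fprintf('%.20f\n',4/3)

The printed digits deviate from the repeating 3 after about 16 significant digits, revealing the stored, rounded value.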

Converting decimals to binaries (and back) is not difficult. As an example, let us convert 25.1 to a binary representation. This number has an integer part of 25 and a fractional part of 0.1 to be converted separately. First let us convert the integer part:

2^7 2^6 2^5 2^4 2^3 2^2 2^1 2^0   (powers of 2)
128 64  32  16  8   4   2   1     (decimal values)
0   0   0   1   1   0   0   1     (binary number)
            16  8           1     16 + 8 + 1 = 25 (decimal number)

You can calculate the binary representation by using

25/2 = 12 remainder 1
12/2 =  6 remainder 0
 6/2 =  3 remainder 0
 3/2 =  1 remainder 1
 1/2 =  0 remainder 1 (quotient is zero, therefore stop)

The remainders are the binary digits, arranged in reverse order: the first remainder becomes the least significant bit (LSB) and the last remainder becomes the most significant bit (MSB). The result is therefore 11001, or, as a normalized floating-point number (mantissa between 1 and 2, e.g. 1.fffff), 1.1001 x 2^4.
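You can verify this result with the built-in functions bin2dec and log2:

bin2dec('11001')
floor(log2(25))

which yields 25 and 4, the decimal value and the exponent of the normalized representation.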

Let us now convert 0.1 into a binary representation. Unfortunately, 0.1 is one of the many examples with an infinitely repeating binary representation; 0.1 therefore falls between two floating-point numbers and cannot be stored exactly. To the base of 10 it is 1.0×10^(–1), but to the base of 2 it is

.0001100110011001100110011001100110011001100110011001101...

which is calculated by repeatedly multiplying by two: take the integer part of each product as the next binary digit, use the fractional part as the starting point for the next step, and repeat until you either reach zero or the digits become periodic; then read the binary number from top to bottom:

0.1 * 2 = 0.2 -> 0
0.2 * 2 = 0.4 -> 0
0.4 * 2 = 0.8 -> 0
0.8 * 2 = 1.6 -> 1
0.6 * 2 = 1.2 -> 1
0.2 * 2 = 0.4 -> 0
0.4 * 2 = 0.8 -> 0
0.8 * 2 = 1.6 -> 1
0.6 * 2 = 1.2 -> 1

The result is .00011(0011), with the group 0011 repeating, or, as a normalized floating-point number, 1.10011001… x 2^(-4). Merging the integer and fractional parts we get

11001.00011001100110011001100110011001100110011...

as the binary representation of 25.1. Here is a simple MATLAB script that does the same. First we create the decimal representation of the floating point number:

clear, clc
x = 25.1;

Then we separate the integer and fractional parts of 25.1:

xi = floor(x);
xf = x - xi

Here we see some floating-point issues already, since the difference between 25.1 and 25 is not exactly 0.1 if we switch to format long:

xf =
    0.100000000000001

We therefore use

xf = 0.1

instead. First, we convert the integer part,

i = 0;
bi = 0;
while xi ~= 0
    i = i + 1;
    xi = xi / 2;
    bi(i) = 2*(xi - floor(xi));
    xi = xi - 0.5*bi(i);
end
bi = fliplr(bi);
bistr = num2str(bi,'%u');

then the fractional part,

i = 0;
bf = 0;
while xf ~= 0
    i = i + 1;
    xf = xf * 2;
    bf(i) = floor(xf);
    xf = xf - bf(i);
end
bfstr = num2str(bf,'%u');

before we merge the two character strings into a single binary representation of 25.1:

bstr = [bistr,'.',bfstr]

which yields

bstr =
'11001.0001100110011001100110011001100110011001100110011001101'

The second while loop should in principle never terminate, since 0.1 has no finite binary representation. Because of the limited precision of floating-point arithmetic, however, xf eventually becomes exactly zero and the while loop comes to an end.

Ready-to-use functions exist for this conversion, such as MATLAB's dec2bin, which unfortunately converts only the integer part. Others can be found in the MATLAB File Exchange.
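For the integer part, dec2bin reproduces our result:

dec2bin(25)

which yields

ans =
    '11001'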

We can calculate the round-off error made when replacing a real number by a floating-point number using the function eps. As an example, you can calculate the distance from 1 to the next larger double-precision number by typing

eps(1)

which yields

ans =
    2.220446049250313e-16

and for 5 we type

eps(5)

which yields

ans =
    8.881784197001252e-16

nicely demonstrating the increasing spacing of floating-point numbers with increasing distance from zero. The range of normalized floating-point numbers in a computer is limited by realmin and realmax, the smallest and largest positive normalized double-precision numbers. Results whose magnitude exceeds realmax overflow to -Inf or +Inf. Using

realmin
realmax

yields

ans =
   2.225073858507201e-308

ans =
   1.797693134862316e+308
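You can provoke overflow and underflow easily:

2*realmax
realmin/2^60

The first result overflows to Inf; the second falls below even the smallest subnormal number and underflows to exactly 0.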

There are problems associated with working with floating-point numbers. Commutativity of addition [a+b=b+a] and multiplication [a*b=b*a] still holds, but associativity [a+(b+c)=(a+b)+c] is violated, as this example demonstrates:

(0.5 + 0.2) - 0.7
0.5 + (0.2 - 0.7)

which yields

ans =
   0
ans =
   5.551115123125783e-17

but both would be the same (zero) with real numbers. Distributivity [a*(b+c)=a*b+a*c] is also violated, as this example demonstrates,

0.2 * (0.5 - 0.7) + 0.04
0.2 * 0.5 - 0.2 * 0.7 + 0.04

which yields

ans =
   6.938893903907228e-18

ans =
   2.081668171172169e-17

In a more complex example, b is halved at every step. With real numbers, this would never stop, but with floating-point numbers it stops as soon as a + b is no longer distinguishable from a, that is, when b falls below about eps(a)/2.

i = 1;
a = 1;
b = 1;
while a + b ~= a
    b = b/2;
    i = i+1;
end
b
i

which yields

b =
   1.110223024625157e-16
i =
   54

suggesting that the while loop indeed terminates, here after 53 halvings of b (i = 54). You think that this doesn't really affect your work if the differences are so small? Here is another classic example where the violation of associativity is combined with cancellation: when the nearly equal numbers 1+x and 1 are subtracted, almost all significant digits cancel. The result of ((1+x)-1)/x should of course be 1, but you get an 11% larger value!

x = 1.e-15;
((1+x)-1)/x

which yields

ans =
   1.110223024625157

instead of one.

Back to the problem from the beginning: there is no exact representation of 4/3 in MATLAB. Therefore you get 2.220446049250313e-16 instead of zero (example from the MATLAB documentation):

1 - 3*(4/3 - 1)

which yields

ans =
   2.220446049250313e-16

instead of zero. Similarly, the sine of pi is not exactly zero, since pi in a computer is only a floating-point approximation of the real number π:

sin(pi)

which yields

ans =
   1.224646799147353e-16

again instead of zero.
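The practical lesson from these examples: never compare floating-point results for exact equality, but test against a small tolerance instead. The tolerance of 1e-10 used here is an arbitrary choice that should be adapted to the magnitude of the numbers involved:

abs(sin(pi)) < 1e-10
abs(1 - 3*(4/3 - 1)) < 1e-10

Both comparisons yield logical 1 (true), while sin(pi) == 0 would yield logical 0 (false).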

References

Quarteroni, A., Saleri, F., Gervasio, P. (2014) Scientific Computing with MATLAB and Octave, Fourth Ed., Springer, ISBN: 978-3-642-45367-0.

Trauth, M.H. (2015) MATLAB Recipes for Earth Sciences – Fourth Edition. Springer, 427 p., Supplementary Electronic Material, Hardcover, ISBN: 978-3-662-46244-7.