# Playing Around with Floating Point Numbers with MATLAB

Computers do not work with real numbers, of which there are infinitely many, but with a finite set of floating point numbers. Most real numbers cannot be represented in a computer because they fall between two floating point numbers. Everything we calculate with a computer therefore has only finite accuracy. Here are some MATLAB examples.

In a computer we represent numbers by a finite number (e.g. 32 or 64) of zeros and ones, the so-called binary system. The number of representable numbers is therefore finite, in contrast to the real numbers. A 64 bit system offers at most 2^64 or ~1.8447e+19 different bit patterns. These floating point numbers are arranged so that their density is greatest around zero and decreases in the direction of minus infinity and plus infinity.
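This uneven spacing is easy to probe. The following sketch uses Python rather than MATLAB, since IEEE 754 double precision behaves identically in both; `math.ulp(x)` returns the gap between x and the next larger double:

```python
import math

# The gap between x and the next representable double grows
# roughly in proportion to the magnitude of x.
for x in [1.0, 1000.0, 1e15]:
    print(x, math.ulp(x))

# Near 1 the gap is about 2.2e-16; near 1e15 it is already 0.125,
# so fractions below 1/8 can no longer be resolved there.
```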

To demonstrate the difference between real numbers and floating point numbers, consider the value x = 4/3, which in decimal notation is 1.3333333… with an infinitely repeating 3. Try 4/3 in MATLAB:

`format long, 4/3`

which yields

```
ans =
1.333333333333333
```

MATLAB rounds the value of 4/3 to the nearest floating-point number. Real numbers are continuous whereas floating-point numbers are not. The result of a calculation can therefore fall into the gap between two floating-point numbers; it is then rounded to the nearest floating-point number and is no longer exact.
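The rounded value can be written out exactly. Here is a quick cross-check in Python (doubles are IEEE 754 in both languages), where `Decimal` and `Fraction` expose the exact stored value:

```python
from decimal import Decimal
from fractions import Fraction

x = 4/3  # rounded to the nearest double, slightly less than 4/3
print(Decimal(x))                    # the stored value, written out exactly
print(Fraction(4, 3) - Fraction(x))  # the (tiny, positive) rounding error
```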

Converting decimals to binaries (and back) is not difficult. As an example, let us convert 25.1 to a binary representation. This number has an integer part of 25 and a fractional part of 0.1 to be converted separately. First let us convert the integer part:

```
2^7  2^6  2^5  2^4  2^3  2^2  2^1  2^0   (powers of 2)
128   64   32   16    8    4    2    1   (decimal values)
  0    0    0    1    1    0    0    1   binary number
           16 +  8            +  1 = 25  decimal number
```

You can calculate the binary representation by using

```
25 / 2 = 12 remainder 1
12 / 2 =  6 remainder 0
 6 / 2 =  3 remainder 0
 3 / 2 =  1 remainder 1
 1 / 2 =  0 remainder 1   (quotient is zero, stop)
```

The remainders are the binary digits, arranged in reverse order so that the first remainder becomes the least significant bit (LSB) and the last remainder becomes the most significant bit (MSB). This gives 11001, or, as a normalized floating point number (mantissa between 1 and 2, i.e. 1.fffff), 1.1001 x 2^4.
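The repeated-division scheme can be written down in a few lines. A sketch in Python (the algorithm itself is language independent; `int_to_binary` is a hypothetical helper name):

```python
def int_to_binary(n):
    """Binary digit string of a positive integer, obtained by
    repeated division by 2, collecting the remainders."""
    digits = []
    while n != 0:
        n, r = divmod(n, 2)    # quotient and remainder
        digits.append(str(r))
    # The first remainder is the least significant bit,
    # so the collected digits are reversed.
    return ''.join(reversed(digits))

print(int_to_binary(25))   # -> 11001
```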

Let us now convert 0.1 into a binary representation. Unfortunately, 0.1 is one of the many numbers with an infinitely repeating binary representation, which is why it falls between two floating point numbers. To the base of 10 it is 1.0×10^(–1), but to the base of 2 it is

`.0001100110011001100110011001100110011001100110011001101...`

which is calculated by repeatedly multiplying by two: take the integer part of the product as the next binary digit and the fractional part as the starting point for the next step; repeat until you either reach 0 or a periodic pattern, then read the binary digits from top to bottom:

```
0.1 * 2 = 0.2 -> 0
0.2 * 2 = 0.4 -> 0
0.4 * 2 = 0.8 -> 0
0.8 * 2 = 1.6 -> 1
0.6 * 2 = 1.2 -> 1
0.2 * 2 = 0.4 -> 0
0.4 * 2 = 0.8 -> 0
0.8 * 2 = 1.6 -> 1
0.6 * 2 = 1.2 -> 1
```
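This doubling scheme can be sketched just as compactly. The following Python version (`frac_to_binary` is a hypothetical helper name) uses exact rational arithmetic so that the periodic pattern stays visible:

```python
from fractions import Fraction

def frac_to_binary(f, ndigits):
    """First ndigits binary digits of a fraction 0 < f < 1,
    obtained by repeated doubling."""
    digits = []
    for _ in range(ndigits):
        f *= 2
        d = int(f)           # the integer part is the next binary digit
        digits.append(str(d))
        f -= d               # continue with the fractional part
    return ''.join(digits)

print(frac_to_binary(Fraction(1, 10), 12))   # -> 000110011001
```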

The result is .00011(0011) with the group 0011 repeating periodically, or, as a normalized floating point number, 1.10011001… x 2^(-4). Merging the integer and fractional parts we get

`11001.00011001100110011001100110011001100110011...`

as the binary representation of 25.1. Here is a simple MATLAB script that does the same. First we define the floating point number:

```
clear, clc
x = 25.1;
```

Then we separate the integer and fractional part of 25.1.

```
xi = round(x,0,'decimals');
xf = x - xi
```

Here we already see some floating point issues, since the difference between 25.1 and 25 is not exactly 0.1 if we switch to format long:

```
xf =
0.100000000000001
```

We therefore use

`xf = 0.1`

instead. First, we convert the integer part,

```
i = 0;
bi = 0;
while xi ~= 0
    i = i + 1;
    xi = xi / 2;                  % divide by 2
    bi(i) = 2*(xi - floor(xi));   % the remainder becomes the next bit
    xi = xi - 0.5*bi(i);          % continue with the quotient
end
bi = fliplr(bi);                  % first remainder is the LSB
bistr = num2str(bi,'%u');
```

then the fractional part,

```
i = 0;
bf = 0;
while xf ~= 0
    i = i + 1;
    xf = xf * 2;          % multiply by 2
    bf(i) = floor(xf);    % the integer part becomes the next bit
    xf = xf - bf(i);      % continue with the fractional part
end
bfstr = num2str(bf,'%u');
```

before we merge the two character strings into a single binary representation of 25.1.

`bstr = [bistr,'.',bfstr]`

which yields

```
bstr =
'11001.0001100110011001100110011001100110011001100110011001101'
```

The second while loop should actually never stop, since 0.1 does not have a finite binary representation. However, xf is itself a double with limited precision, so after finitely many doublings it reaches exactly zero and the while loop comes to an end.
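The reason is that xf never held 0.1 in the first place, but the nearest double, which is an exact binary fraction. A quick check in Python makes this explicit:

```python
from fractions import Fraction

x = Fraction(0.1)   # the exact value stored for the double 0.1
print(x)
# The denominator is a power of two, so repeated doubling reaches
# an integer, and the fractional part reaches zero, after 55 steps.
print(x.denominator == 2**55)
```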

Ready-to-use functions exist for this conversion, such as MATLAB's dec2bin, which unfortunately only converts the integer part. Others can be found on the MATLAB File Exchange.

The function eps returns the spacing of the floating-point numbers, i.e. the maximum round-off error made when a real number is replaced by the nearest floating point number. As an example, you can calculate the distance from 1 to the next larger double-precision number by typing

`eps(1)`

which yields

```
ans =
2.220446049250313e-16
```

and for 5 we type

`eps(5)`

which yields

```
ans =
8.881784197001252e-16
```

nicely demonstrating that the spacing of the floating point numbers increases with the distance from zero. The range of the available floating-point numbers in a computer is limited by realmin and realmax, the smallest and largest positive normalized double-precision numbers. Results exceeding realmax in magnitude overflow and are displayed as -Inf or +Inf. Using

```
realmin
realmax
```

yields

```
ans =
2.225073858507201e-308

ans =
1.797693134862316e+308
```
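Python exposes the same double-precision limits, which makes for an easy cross-check of these values:

```python
import math
import sys

print(sys.float_info.min)      # 2.2250738585072014e-308, MATLAB's realmin
print(sys.float_info.max)      # 1.7976931348623157e+308, MATLAB's realmax
print(sys.float_info.max * 2)  # overflows to inf
print(math.ulp(1.0))           # 2.220446049250313e-16, MATLAB's eps(1)
```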

There are further problems associated with floating-point arithmetic. Commutativity of addition [a+b=b+a] and multiplication [a*b=b*a] still holds, but associativity [a+(b+c)=(a+b)+c] is violated, as this example shows,

```
(0.5 + 0.2) - 0.7
0.5 + (0.2 - 0.7)
```

which yields

```
ans =
0
ans =
5.551115123125783e-17
```

but would be the same (zero) with real numbers. Distributivity [a*(b+c)=a*b+a*c] is violated as well, as this example demonstrates,

```
0.2 * (0.5 - 0.7) + 0.04
0.2 * 0.5 - 0.2 * 0.7 + 0.04
```

which yields

```
ans =
6.938893903907228e-18

ans =
2.081668171172169e-17
```
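Where do these tiny residues come from? With exact rational arithmetic the per-step rounding becomes visible; a sketch in Python:

```python
from fractions import Fraction

# The doubles closest to 0.2 and 0.7 are not exactly 0.2 and 0.7:
print(Fraction(0.2) - Fraction(2, 10))   # tiny positive error
print(Fraction(0.7) - Fraction(7, 10))   # tiny negative error

# Every intermediate result is rounded again, so the order of the
# operations decides which rounding errors cancel and which survive:
print((0.5 + 0.2) - 0.7 == 0.5 + (0.2 - 0.7))   # -> False
```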

In a more complex example, b is halved at every step. With real numbers this would never stop, but with floating-point numbers the loop terminates as soon as b is so small that a + b is rounded back to a, which happens once b drops to about eps/2.

```
i = 1;
a = 1;
b = 1;
while a + b ~= a
    b = b/2;
    i = i + 1;
end
b
i
```

which yields

```
b =
1.110223024625157e-16
i =
54
```

suggesting that the while loop indeed terminates, here with i = 54 after 53 halvings (i starts at 1). You think that this does not really affect your work if the differences are so small? Here is another classic example where a tiny rounding error is magnified by cancellation: the nearly equal numbers 1+x and 1 are subtracted, and the small difference is then divided by the tiny x. The result of ((1+x)-1)/x should of course be 1, but you get an 11% larger value!

```
x = 1.e-15;
((1+x)-1)/x
```

which yields

```
ans =
1.110223024625157
```
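The same cancellation can be reproduced in any language with IEEE 754 doubles; in Python:

```python
x = 1e-15
print(((1 + x) - 1) / x)   # about 1.11, not 1

# 1 + x is rounded to the nearest double; subtracting 1 leaves only
# that rounded remainder, and dividing by the tiny x magnifies the
# rounding error by 15 orders of magnitude.
```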

Back to the problem from the beginning: there is no exact representation of 4/3 in MATLAB. Therefore you get 2.220446049250313e-16 instead of zero (example from the MATLAB documentation):

`1 - 3*(4/3 - 1)`

which yields

```
ans =
2.220446049250313e-16
```

instead of zero. Similarly, the sine of pi is not exactly zero, since pi cannot be represented exactly in a computer:

`sin(pi)`

which yields

```
ans =
1.224646799147353e-16
```