Principal Component Analysis in 5 Steps

The Principal Component Analysis (PCA) is equivalent to fitting an n-dimensional ellipsoid to the data, where the eigenvectors of the covariance matrix of the data set are the axes of the ellipsoid. The eigenvalues represent the distribution of the variance among each of the eigenvectors. To understand the method, it is helpful to know something about matrix algebra, eigenvectors, and eigenvalues. Here is a n=2 dimensional example to perform a PCA without the use of the MATLAB function pca, but with the function of eig for the calculation of eigenvectors and eigenvalues.

Assume a data set that consists of measurements of p variables on n samples, stored in an n-by-p array. As an example we are creating a bivariate data set of two vectors, 30 data points each, with a strong linear correlation, overlain by normally distributed noise.

clear, clc, close all

rng(0)
data(:,1) = rand(30,1);
data(:,2) = 3.4 + 1.2 * data(:,1);
data(:,2) = data(:,2) + 0.2*rand(size(data(:,1)));
data = sortrows(data,1);

figure
axes('LineWidth',0.6,...
    'FontName','Helvetica',...
    'FontSize',8,...
    'XAxisLocation','Origin',...
    'YAxisLocation','Origin');
line(data(:,1),data(:,2),...
    'LineStyle','None',...
    'Marker','o');
axis equal

The task is to find the unit vector pointing into the direction with the largest variance within the bivariate data set data. The solution of this problem is to calculate the largest eigenvalue D of the covariance matrix C and the corresponding eigenvector V

C * V = λ*V

where

C = data’ * data

which is equivalent to

(C – D * E) V = 0

where E is the identity matrix, which is a classic eigenvalue problem: it is a linear system of equations with the eigenvectors V as the solution.

Step 1

First we have to mean center the data, i.e. substract the univariate means from the two columns, i.e. the two variables of data. In the following steps we therefore study the deviations from the mean(s) only.

data(:,1) = data(:,1)-mean(data(:,1));
data(:,2) = data(:,2)-mean(data(:,2));
Step 2

Next we calculate the covariance matrix of data. The covariance matrix contains all necessary information to rotate the coordinate system.

C = cov(data)

which yields

C =
   0.0987   0.1173
   0.1173   0.1425
Step 3

The rotation helps to create new variables which are uncorrelated, i.e. the covariance is zero for all pairs of the new variables. The decorrelation is achieved by diagonalizing the covariance matrix C. The eigenvectors V belonging to the diagonalized covariance matrix are a linear combination of the old base vectors, thus expressing the correlation between the old and the new time series. We find the eigenvalues of the covariance matrix C by solving the equation

det(C – D * E) = 0

The eigenvalues D of the covariance matrix, i.e. the diagonalized version of C, gives the variance within the new coordinate axes, i.e. the principal components. We now calculate the eigenvectors V and eigenvalues D of the covariance matrix C. As you can see the second eigenvalue of 0.24 is larger than the first eigenvalue of 0.0013. Therefore we need to change the order of the new variables in newdata after the transformation.

[V,D] = eig(C)

which yields

V =
   -0.7693 0.6389
    0.6389 0.7693

D =
    0.0013 0
    0 0.2400

Display the data together with the eigenvectors representing the new coordinate system.

figure
axes('LineWidth',0.6,...
    'FontName','Helvetica',...
    'FontSize',8,...
    'XAxisLocation','Origin',...
    'YAxisLocation','Origin');
line(data(:,1),data(:,2),...
    'LineStyle','None',...
    'Marker','o');
line([0 V(1,1)],[0 V(2,1)],...
    'Color',[0.8 0.5 0.3],...
    'LineWidth',0.75);
line([0 V(1,2)],[0 V(2,2)],...
    'Color',[0.8 0.5 0.3],...
    'LineWidth',0.75);
axis equal

The eigenvectors are unit vectors and orthogonal, therefore the norm is one and the inner (scalar, dot) product is zero.

norm(V(:,1))
norm(V(:,2))
dot(V(:,1),V(:,2))

which yields

ans = 1
ans = 1
ans = 0
Step 4

Calculating the data set in the new coordinate system. We need to flip newdata left/right since the second column is the one with the larges eigenvalue.

newdata = V * data';
newdata = newdata';
newdata = fliplr(newdata)

which yields

newdata =
   -0.8863  0.0283
   -0.7676  0.0465
   -0.8057 -0.0234
   -0.7497  0.0038
   -0.7330 -0.0028
   -0.7448 -0.0303
   -0.5047  0.0296
   -0.3816 -0.0159
   -0.3193 -0.0026
   -0.1661  0.0419
     .
     .
     .
Step 5

The eigenvalues of the covariance matrix (the diagonals of the diagonalized covariance matrix) indicate the variance in this (new) coordinate direction. We can us this information to calculate the relative variance for each new variable by dividing the variances according to the eigenvectors by the sum of the variances. In our example, the 1st and 2nd principal component contains 99.45 and 0.55 percent of the total variance in data.

var(newdata)
var(newdata)/sum(var(newdata))

which yields

ans =
   0.2400 0.0013
ans =
   0.9945 0.0055

Display the newdata together with the new coordinate system.

figure
axes('LineWidth',0.6,...
    'FontName','Helvetica',...
    'FontSize',8,...
    'XAxisLocation','Origin',...
    'YAxisLocation','Origin')
line(newdata(:,1),newdata(:,2),...
    'LineStyle','None',...
    'Marker','o');
axis equal

Do the same experiment using the MATLAB function PCA. We get same values for newdata and variance.

[coeff,newdatapca,latend,tsquared,variance] = pca(data);
newdatapca
variance

which yields

newdatapca =
   -0.8863  0.0283
   -0.7676  0.0465
   -0.8057 -0.0234
   -0.7497  0.0038
   -0.7330 -0.0028
   -0.7448 -0.0303
   -0.5047  0.0296
   -0.3816 -0.0159
   -0.3193 -0.0026
   -0.1661  0.0419
     .
     .
     .

and

variance =
   99.4520
    0.5480

The new data are decorrelated.

corrcoef(newdata)

which yields

ans =
   1.0000    0.0000
   0.0000    1.0000

An earlier post to this blog demonstrated linear unmixing variables using the PCA with MATLAB. A second post explained the use of the principal component analysis (PCA) to decipher the statistically independent contribution of the source rocks to the sediment compositions in the Santa Maria Basin, NW Argentine Andes.

References

Pearson, K. (1901) On Lines and Planes of Closest Fit to Systems of Points in Space. Philosophical Magazine. 2 (11): 559–572.
Hotelling, H. (1933) Analysis of a complex of statistical variables into principal components. Journal of Educational Psychology, 24, 417–441, and 498–520.
Hotelling, H. (1936) Relations between two sets of variates. Biometrika, 28, 321–377.
Wikipedia (2017) Article on Principal Component Analysis, Weblink.