The Principal Component Analysis (PCA) is equivalent to fitting an n-dimensional ellipsoid to the data, where the eigenvectors of the covariance matrix of the data set are the axes of the ellipsoid. The eigenvalues represent the distribution of the variance among each of the eigenvectors. To understand the method, it is helpful to know something about matrix algebra, eigenvectors, and eigenvalues. Here is a n=2 dimensional example to perform a PCA without the use of the MATLAB function pca, but with the function of eig for the calculation of eigenvectors and eigenvalues.
Assume a data set that consists of measurements of p variables on n samples, stored in an n-by-p array. As an example we are creating a bivariate data set of two vectors, 30 data points each, with a strong linear correlation, overlain by normally distributed noise.
clear, clc, close all rng(0) data(:,1) = randn(30,1); data(:,2) = 3.4 + 1.2 * data(:,1); data(:,2) = data(:,2) + 0.2*randn(size(data(:,1))); data = sortrows(data,1); figure axes('LineWidth',0.6,... 'FontName','Helvetica',... 'FontSize',8,... 'XAxisLocation','Origin',... 'YAxisLocation','Origin'); line(data(:,1),data(:,2),... 'LineStyle','None',... 'Marker','o'); axis equal
The task is to find the unit vector pointing into the direction with the largest variance within the bivariate data set data. The solution of this problem is to calculate the largest eigenvalue D of the covariance matrix C and the corresponding eigenvector V
C * V = λ*V
C = data’ * data
which is equivalent to
(C – D * E) V = 0
where E is the identity matrix, which is a classic eigenvalue problem: it is a linear system of equations with the eigenvectors V as the solution.
First we have to mean center the data, i.e. substract the univariate means from the two columns, i.e. the two variables of data. In the following steps we therefore study the deviations from the mean(s) only.
data(:,1) = data(:,1)-mean(data(:,1)); data(:,2) = data(:,2)-mean(data(:,2));
Next we calculate the covariance matrix of data. The covariance matrix contains all necessary information to rotate the coordinate system.
C = cov(data)
C = 1.6901 2.0186 2.0186 2.4543
The rotation helps to create new variables which are uncorrelated, i.e. the covariance is zero for all pairs of the new variables. The decorrelation is achieved by diagonalizing the covariance matrix C. The eigenvectors V belonging to the diagonalized covariance matrix are a linear combination of the old base vectors, thus expressing the correlation between the old and the new time series. We find the eigenvalues of the covariance matrix C by solving the equation
det(C – D * E) = 0
The eigenvalues D of the covariance matrix, i.e. the diagonalized version of C, gives the variance within the new coordinate axes, i.e. the principal components. We now calculate the eigenvectors V and eigenvalues D of the covariance matrix C. As you can see the second eigenvalue of 0.24 is larger than the first eigenvalue of 0.0013. Therefore we need to change the order of the new variables in newdata after the transformation.
[V,D] = eig(C)
V = -0.7701 0.6380 0.6380 0.7701 D = 0.0177 0 0 4.1266
Display the data together with the eigenvectors representing the new coordinate system.
figure axes('LineWidth',0.6,... 'FontName','Helvetica',... 'FontSize',8,... 'XAxisLocation','Origin',... 'YAxisLocation','Origin'); line(data(:,1),data(:,2),... 'LineStyle','None',... 'Marker','o'); line([0 V(1,1)],[0 V(2,1)],... 'Color',[0.8 0.5 0.3],... 'LineWidth',0.75); line([0 V(1,2)],[0 V(2,2)],... 'Color',[0.8 0.5 0.3],... 'LineWidth',0.75); axis equal
The eigenvectors are unit vectors and orthogonal, therefore the norm is one and the inner (scalar, dot) product is zero.
norm(V(:,1)) norm(V(:,2)) dot(V(:,1),V(:,2))
ans = 1 ans = 1 ans = 0
Calculating the data set in the new coordinate system. We need to flip newdata left/right since the second column is the one with the larges eigenvalue.
newdata = V * data'; newdata = newdata'; newdata = fliplr(newdata)
newdata = -4.5374 0.1091 -2.9687 -0.0101 -2.6655 -0.2065 -2.7186 -0.0324 -1.9021 -0.1632 -1.4716 -0.0606 -1.2612 -0.0659 -1.1692 -0.0142 . . .
The eigenvalues of the covariance matrix (the diagonals of the diagonalized covariance matrix) indicate the variance in this (new) coordinate direction. We can us this information to calculate the relative variance for each new variable by dividing the variances according to the eigenvectors by the sum of the variances. In our example, the 1st and 2nd principal component contains 99.45 and 0.55 percent of the total variance in data.
ans = 4.1266 0.0177 ans = 0.9957 0.0043
Display the newdata together with the new coordinate system.
figure axes('LineWidth',0.6,... 'FontName','Helvetica',... 'FontSize',8,... 'XAxisLocation','Origin',... 'YAxisLocation','Origin') line(newdata(:,1),newdata(:,2),... 'LineStyle','None',... 'Marker','o'); axis equal
Do the same experiment using the MATLAB function PCA. We get same values for newdata and variance.
[coeff,newdatapca,latend,tsquared,variance] = pca(data); newdatapca variance
newdatapca = -4.5374 0.1091 -2.9687 -0.0101 -2.6655 -0.2065 -2.7186 -0.0324 -1.9021 -0.1632 -1.4716 -0.0606 -1.2612 -0.0659 -1.1692 -0.0142 . . .
variance = 99.5720 0.4280
The new data are decorrelated.
ans = 1.0000 0.0000 0.0000 1.0000
An earlier post to this blog demonstrated linear unmixing variables using the PCA with MATLAB. A second post explained the use of the principal component analysis (PCA) to decipher the statistically independent contribution of the source rocks to the sediment compositions in the Santa Maria Basin, NW Argentine Andes.
Pearson, K. (1901) On Lines and Planes of Closest Fit to Systems of Points in Space. Philosophical Magazine. 2 (11): 559–572.
Hotelling, H. (1933) Analysis of a complex of statistical variables into principal components. Journal of Educational Psychology, 24, 417–441, and 498–520.
Hotelling, H. (1936) Relations between two sets of variates. Biometrika, 28, 321–377.
Wikipedia (2017) Article on Principal Component Analysis, Weblink.