Classical Linear Regression of Log-Transformed Data

A common error in regression analysis is to log-transform bivariate data with an exponential relationship, calculate a best-fit line using classical least-squares linear regression, and then back-transform the result. Here is a MATLAB example showing how to do it better.

Log-transforming the y-values has two important consequences that can influence the result. First, if the relationship between the data is of the type y = a0 + a1*e^x, then taking the logarithm of the y-values does not completely linearize the data, because the additive parameter a0 remains inside the logarithm. Second, classical regression assumes that y responds to x and that the entire dispersion in the data set is contained in the y-values (see Section 4.3). This means that x is the independent variable, defined by the experimenter and regarded as being free of errors. Linear regression minimizes the deviations Δy between the data points (x,y) and the values predicted by the best-fit line y = b0 + b1*x using a least-squares criterion. The classical linear regression method makes two assumptions about the data: (1) there is a linear relationship between x and y, and (2) the unknown errors around the means have a normal (Gaussian) distribution with a similar variance for all data points. Log-transforming the y-values violates the second assumption: the errors then have a log-normal distribution, and the regression therefore places less weight on the larger y-values.
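The first point is easy to verify numerically. As a small sketch (the parameter values here are illustrative assumptions, matching the synthetic data below): if a0 were zero, log(y) = log(a1) + x would be exactly linear in x, but with a nonzero a0 the slope of log(y) changes with x.

```
% Sketch: log(y) for y = a0 + a1*exp(x) is only linear in x if a0 = 0,
% since then log(y) = log(a1) + x. With a0 = 3 (as in the example below)
% the slope of log(y) varies with x.
x = 0.5 : 0.5 : 3;
y = 3 + 0.2*exp(x);               % nonzero intercept a0 = 3
slopes = diff(log(y))./diff(x)    % not constant, so log(y) is not linear in x
```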

To see the difference in the results we first create a synthetic data set. The y-values, stored in data(:,2), have an exponential relationship with the x-values, stored in data(:,1). We add Gaussian noise to the x-values before computing the y-values, then add Gaussian noise to the y-values in data(:,2) as well, and finally sort the data by the x-values.

```
clear
rng(0)
data(:,1) = 0.5 : 0.1 : 3;
data(:,1) = data(:,1) + 0.2*randn(size(data(:,1)));
data(:,2) = 3 + 0.2*exp(data(:,1));
data(:,2) = data(:,2) + 0.5*randn(size(data(:,2)));
data = sortrows(data,1);
```

Here is the linear fit of the log-transformed data using polyfit:

`[pl,s] = polyfit(data(:,1),log(data(:,2)),1)`
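The coefficients in pl describe the straight line fitted in log space, log(y) = pl(2) + pl(1)*x. Back-transforming therefore yields a model of the form y = exp(pl(2))*exp(pl(1)*x), a pure exponential without the additive constant a0 of the true model. As a sketch (the variable names here are ours, not part of the original example):

```
% Back-transform the polyfit coefficients: the fitted line in log space is
% log(y) = pl(2) + pl(1)*x, i.e. y = exp(pl(2)) * exp(pl(1)*x).
a1_log = exp(pl(2))   % plays the role of a1, but absorbs part of a0
b1_log = pl(1)        % slope in the exponent
```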

Then we use the function nlinfit from the Statistics and Machine Learning Toolbox to calculate the nonlinear fit without transforming the data:

```
model = @(phi,t) (phi(1)*exp(t) + phi(2));
p0 = [0 0];
pn = nlinfit(data(:,1),data(:,2),model,p0)
```
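Since the data are synthetic, we can compare the estimated parameters directly with the true values a1 = 0.2 and a0 = 3 used above. The match will not be exact because of the added noise, but it should be close:

```
% Compare the nonlinear estimates with the true parameters of the
% synthetic data; pn(1) should be close to 0.2 and pn(2) close to 3.
disp([pn; 0.2 3])
```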

We can also calculate the true (noise-free) line using the exponential equation from above, before any noise is added:

```
trueline(:,1) = 0.5 : 0.01 : 3;
trueline(:,2) = 3 + 0.2*exp(trueline(:,1));
```

Displaying the data clearly shows the difference. The yellow curve, calculated by linear regression of the log-transformed data, has a much lower curvature than the nonlinear fit. The dotted black line is the noise-free curve. Of course the result of the nonlinear regression and the true line do not match perfectly, because of the noise in the data passed to nlinfit. The yellow line, however, is statistically incorrect because it results from applying the wrong method.

```
figure1 = figure(...
    'Position',[200 200 800 600],...
    'Color',[1 1 1]);
axes1 = axes(...
    'Box','on',...
    'Units','Centimeters',...
    'Position',[2 2 10 6],...
    'LineWidth',0.6,...
    'FontName','Helvetica',...
    'FontSize',8);
hold on
line(data(:,1),data(:,2),...
    'LineStyle','none',...
    'LineWidth',0.75,...
    'Color',[0 0.4453 0.7383],...
    'Marker','o',...
    'MarkerFaceColor',[0 0.4453 0.7383],...
    'MarkerEdgeColor',[0 0 0]);
line(data(:,1),pn(1)*exp(data(:,1)) + pn(2),...
    'Color',[0.8477 0.3242 0.0977],...
    'LineWidth',0.75);
line(data(:,1),exp(polyval(pl,data(:,1),s)),...
    'Color',[0.9258 0.6914 0.1250],...
    'LineWidth',0.75);
line(trueline(:,1),trueline(:,2),...
    'Color',[0 0 0],...
    'LineStyle',':',...
    'LineWidth',0.75);
legend('Data','Nonlinear Fit',...
    'Linear Fit of Log Data (wrong)',...
    'True Noise-Free Line',...
    'Location','Northwest')
xlabel('x-Values',...
    'FontName','Helvetica',...
    'FontSize',8);
ylabel('y-Values',...
    'FontName','Helvetica',...
    'FontSize',8);
```
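To put a number on the visual difference, one possibility (not part of the original example) is to compare the root-mean-square deviation of each fitted curve from the noise-free relationship at the data's x-values. With this synthetic data set the nonlinear fit should yield the smaller value:

```
% RMS deviation of each fit from the noise-free curve y = 3 + 0.2*exp(x),
% evaluated at the x-values of the data.
ytrue    = 3 + 0.2*exp(data(:,1));
rmse_nl  = sqrt(mean((pn(1)*exp(data(:,1)) + pn(2) - ytrue).^2))
rmse_log = sqrt(mean((exp(polyval(pl,data(:,1))) - ytrue).^2))
```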