Classical Linear Regression of Log-Transformed Data

A common error in regression analysis is to log-transform bivariate data that have an exponential relationship, fit a straight line to the transformed data by classical least-squares linear regression, and then back-transform the result. Here is a MATLAB example showing how to do it better.

Classical regression assumes that y responds to x and that the entire dispersion in the data set is contained within the y-values. This means that x is the independent variable (also known as the predictor variable, or the regressor). The values of x are defined by the experimenter and are often regarded as being free of errors. Linear regression minimizes the deviations Δy between the data points (x,y) and the values of y predicted by the best-fit line y=b0+b1x, using a least-squares criterion.
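Written out, the least-squares criterion described above chooses the intercept b0 and the slope b1 that minimize the sum of the squared vertical deviations. The index i and the sample size n are notation introduced here for illustration only:

```latex
\[
(\hat{b}_0,\hat{b}_1) \;=\; \arg\min_{b_0,\,b_1} \; \sum_{i=1}^{n} \bigl( y_i - b_0 - b_1 x_i \bigr)^2
\]
```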

The classical linear regression method makes two assumptions about the data: (1) there is a linear relationship between x and y, and (2) the unknown errors around the means have a normal (Gaussian) distribution with a similar variance for all data points. Taking the logarithm of the y-values violates the second assumption: fitting a line to log(y) implicitly treats the errors in y as log-normal rather than Gaussian with a similar variance, and the regression therefore places less weight on the larger y-values. See Chapter 4 of MRES if you wish to learn more about the MATLAB functions polyfit and nlinfit.
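A first-order approximation may help to illustrate why the log transformation changes the weighting. Assume, purely for illustration, a y-value with a positive mean μ and additive noise ε of standard deviation σ that is small compared with μ (these symbols are introduced here and are not part of the original text):

```latex
\[
\ln(\mu + \varepsilon) \;\approx\; \ln\mu + \frac{\varepsilon}{\mu}
\qquad\Longrightarrow\qquad
\operatorname{sd}\bigl[\ln y\bigr] \;\approx\; \frac{\sigma}{\mu}
\]
```

Under this approximation a deviation of a given size in y shrinks in log space as μ grows, so a least-squares fit of log(y) is less sensitive to misfit at large y-values.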

To see the difference in the results we first create a synthetic data set. The y-values, stored in data(:,2), have an exponential relationship with the x-values, stored in data(:,1). The x-values are jittered with Gaussian noise before the y-values are computed. We then add Gaussian noise to the y-values in data(:,2) and sort the data.

clear
rng(0)
data(:,1) = 0.5 : 0.1 : 3;
data(:,1) = data(:,1) + 0.2*randn(size(data(:,1)));
data(:,2) = 3 + 0.2 * exp(data(:,1));
data(:,2) = data(:,2) + 0.5*randn(size(data(:,2)));
data = sortrows(data,1);

Here is the linear fit of the logarithmized data using polyfit:

[pl,s] = polyfit(data(:,1),log(data(:,2)),1)
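To make the back-transformation step explicit, here is a minimal sketch on hypothetical noise-free data. The variable names x, y, p, a, b and yfit and the values a = 2 and b = 0.5 are assumptions chosen for illustration, not taken from the example above. polyfit returns the coefficients in descending powers, so the first element is the slope and the second is the intercept of the line fitted to log(y):

```matlab
% Hypothetical noise-free data following y = a*exp(b*x), with a = 2, b = 0.5
x = (0.5 : 0.1 : 3)';
y = 2 * exp(0.5 * x);

% Straight-line fit to log(y): log(y) = p(1)*x + p(2)
p = polyfit(x,log(y),1);

% Back-transform the line coefficients to the exponential model
b = p(1);        % rate in the exponent
a = exp(p(2));   % multiplicative factor

% Evaluate the back-transformed curve in the original y units
yfit = a * exp(b * x);
```

Note that the model implied by this procedure, a*exp(b*x), has no additive constant, unlike the equation used to create the synthetic data set.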

Then we use the function nlinfit from the Statistics and Machine Learning Toolbox to calculate the nonlinear fit without transforming the data:

model = @(phi,t)(phi(1)*exp(t) + phi(2));
p0 = [0 0];
pn = nlinfit(data(:,1),data(:,2),model,p0)
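Because the model used here happens to be linear in its two parameters phi(1) and phi(2), essentially the same least-squares estimate can also be obtained without the toolbox by solving a linear system with the backslash operator. The following is a minimal sketch on hypothetical noise-free data (the names x, y, G and phi are chosen here for illustration); it is not a replacement for nlinfit in the general nonlinear case:

```matlab
% Hypothetical noise-free data following y = 0.2*exp(x) + 3
x = (0.5 : 0.1 : 3)';
y = 0.2 * exp(x) + 3;

% Design matrix: one column per parameter of phi(1)*exp(x) + phi(2)
G = [exp(x) ones(size(x))];

% Least-squares solution of G*phi = y
phi = G \ y;
```

With noise-free data the recovered values in phi should be very close to 0.2 and 3.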

We can also calculate the true (noise-free) curve using the exponential equation from above, without the added noise:

trueline(:,1) = 0.5 : 0.01 : 3;
trueline(:,2) = 3 + 0.2 * exp(trueline(:,1));

Displaying the data clearly shows the difference. The yellow curve, calculated by linear regression of the log-transformed data, has a much lower curvature than the nonlinear fit. The dotted black line is the noise-free curve. Of course, the result of the nonlinear regression and the true curve do not match perfectly because of the noise in the data used with nlinfit. The yellow curve, however, is statistically incorrect because it results from using the wrong method.

figure1 = figure(...
    'Position',[200 200 800 600],...
    'Color',[1 1 1]);
axes1 = axes(...
    'Box','on',...
    'Units','Centimeters',...
    'Position',[2 2 10 6],...
    'LineWidth',0.6,...
    'FontName','Helvetica',...
    'FontSize',8);
hold on
line(data(:,1),data(:,2),...
    'LineStyle','none',...
    'LineWidth',0.75,...
    'Color',[0 0.4453 0.7383],...
    'Marker','o',...
    'MarkerFaceColor',[0 0.4453 0.7383],...
    'MarkerEdgeColor',[0 0 0]);
line(data(:,1),pn(1)*exp(data(:,1)) + pn(2),...
    'Color',[0.8477 0.3242 0.0977],...
    'LineWidth',0.75);
line(data(:,1),exp(polyval(pl,data(:,1),s)),...
    'Color',[0.9258 0.6914 0.1250],...
    'LineWidth',0.75);
line(trueline(:,1),trueline(:,2),...
    'Color',[0 0 0],...
    'LineStyle',':',...
    'LineWidth',0.75);
legend('Data','Nonlinear Fit',...
    'Linear Fit of Log Data (wrong)',...
    'True Noisefree Line',...
    'Location','Northwest')
xlabel('x-Values',...
    'FontName','Helvetica',...
    'FontSize',8);
ylabel('y-Values',...
    'FontName','Helvetica',...
    'FontSize',8);
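One way to put a number on the visual difference between the two fitted curves is to compare their sums of squared residuals in the original y units. The sketch below is self-contained and recreates the synthetic data in the same way as above. It uses the backslash operator instead of nlinfit, which is possible only because this model is linear in its parameters. The names x, y, yLog, yDirect, sseLog and sseDirect are chosen here for illustration, and the code assumes that all y-values are positive so that log(y) is real:

```matlab
% Recreate the synthetic data (same seed and same order of random draws)
rng(0)
x = (0.5 : 0.1 : 3)';
x = x + 0.2*randn(size(x));
y = 3 + 0.2*exp(x) + 0.5*randn(size(x));

% Fit 1: straight line through log(y), then back-transformed
pl = polyfit(x,log(y),1);
yLog = exp(polyval(pl,x));

% Fit 2: least-squares fit of phi(1)*exp(x) + phi(2) to the untransformed y
% (solved with backslash because the model is linear in phi)
phi = [exp(x) ones(size(x))] \ y;
yDirect = phi(1)*exp(x) + phi(2);

% Sums of squared residuals, both measured in the original y units
sseLog = sum((y - yLog).^2);
sseDirect = sum((y - yDirect).^2);
fprintf('SSE, linear fit of log data: %.3f\n',sseLog)
fprintf('SSE, direct fit:             %.3f\n',sseDirect)
```

The two fits use different model equations, so this comparison describes only how well each curve follows the data in the original units.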

Do you have a better example? I am currently thinking about including this topic in a new edition of the MRES book, but I am not sure whether this example is a good one. Comments via email are very welcome!