Residuals: A crucial tool for understanding the representativeness of statistical data and the goodness-of-fit of models

  

Definition of residual: Random error after mean removal

  A residual is the difference between a single observation and the corresponding group "level mean" fitted by a statistical model, where the "level mean" is the model's expected (estimated) value for that observation (for example, the mean of a treatment group in a one-way analysis of variance, or the value of the dependent variable predicted from the independent variables in a regression model). In essence, a residual is the error that remains after the variation explainable by the model has been stripped away; it reflects the random fluctuation in the data that the model does not capture.
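This "observation minus level mean" definition can be sketched directly; a minimal example with made-up numbers for two hypothetical treatment groups:

```python
import numpy as np

# Hypothetical one-way layout: two treatment groups (all numbers made up).
group_a = np.array([70.2, 68.5, 71.1, 69.8])
group_b = np.array([75.0, 76.3, 74.1, 75.6])

# The "level mean" of each observation is its own group's mean;
# subtracting it strips the explainable (group-level) variation away.
resid_a = group_a - group_a.mean()
resid_b = group_b - group_b.mean()
residuals = np.concatenate([resid_a, resid_b])

# By construction, the residuals sum to zero within each group.
print(np.allclose(resid_a.sum(), 0))  # True
print(np.allclose(resid_b.sum(), 0))  # True
```

In a regression model the same idea applies, except the "level mean" is the fitted value predicted from the independent variables.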

  

Association between residual analysis and data representativeness: "Normality test" of random errors

  The core assumption of the statistical model is that the residuals follow a normal distribution — the essence of this assumption is that "the random errors in the sample must be a microcosm of the overall random errors." If the residual analysis shows non-normality (such as skewness or excessive kurtosis), or there are extreme outliers, it indicates that "non-random biases" have been mixed into the sample:

  - For example, if you want to study the weights of ordinary adults but the sample includes several obese patients, these outliers will pull the residuals away from a normal distribution.

  - Or a certain type of individual is over-selected during sampling (e.g., only the heights of young women are measured), producing systematic skewness in the residuals.

  In such cases, the non-normality of the residuals directly indicates that the sample fails to represent the population. This is because normality of the residuals is equivalent to saying that the errors unexplained by the model conform to the random fluctuation characteristics of the population; if they are non-normal, non-random sampling biases have been mixed into the random errors, and the sample loses its representativeness.
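This "contamination shows up as skewness" effect is easy to simulate; a minimal sketch (all populations and sample sizes are made up) using the sample skewness of the residuals as the diagnostic:

```python
import numpy as np

def skewness(x):
    # Third standardized moment of the deviations from the mean.
    r = x - x.mean()
    return (r ** 3).mean() / (r ** 2).mean() ** 1.5

rng = np.random.default_rng(0)
clean = rng.normal(70, 5, size=200)                      # target population only
mixed = np.concatenate([clean, rng.normal(130, 5, 10)])  # outsiders mixed in

print(round(skewness(clean), 2))  # near 0: symmetric random error
print(round(skewness(mixed), 2))  # strongly positive: a representativeness red flag
```

In practice one would also apply a formal normality test (e.g. Shapiro-Wilk) to the residuals, but the skewness statistic already shows how a handful of off-population observations distorts the error distribution.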

  

Two core functions of residuals

  

1. Evaluate data representativeness: "Overall test" for random errors

  The residual is the pure error left after removing the mean. If the sample is representative, these errors should mirror the random fluctuations of the population: they follow a normal distribution, contain no extreme values, and have a mean of zero. Otherwise:

  - Residual skewness → There are characteristics that do not belong to the population in the sample (e.g., sampling bias).

  - Extreme values in residuals → "outliers from non-target groups" have been mixed into the sample (for example, when studying students' grades, test scores of people outside the school are included).

  In short, residuals that are normal and free of anomalies are the "touchstone" of sample representativeness.

  

2. Judge the central tendency of the data: "Consistency of fluctuations" around the model expectation

  "Central tendency" here means whether the data fluctuate randomly around the model's expected level. A fitted statistical model forces the overall mean of the residuals to 0 by design. However, if the residuals in actual data show a systematic deviation (for example, runs of predominantly positive or predominantly negative residuals over part of the data), it indicates:

  - The data do not fluctuate around the model's expected level (for example, a key independent variable is omitted from the regression model, so the predictions are systematically too low for part of the data and the residuals there are persistently positive).

  - Or the model's expected estimate itself deviates from the true level of the population (for example, omitting a key variable during sampling biases the model's estimate of the average level).

  The "zero mean property" and "absence of systematic trends" of residuals directly reflect whether the data "tend to the expected level of the model".
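The zero-mean property can be checked directly; a minimal sketch with made-up numbers, contrasting residuals around the fitted mean with residuals around a hypothetical biased expectation:

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.normal(100, 10, size=50)  # hypothetical observations, true level ~100

# Residuals around the fitted mean: forced to average zero by construction.
r_model = y - y.mean()
print(abs(r_model.mean()) < 1e-10)  # True

# Residuals around a biased "expected level" (say the model estimated 90
# instead of the true level of about 100): systematically positive.
r_biased = y - 90
print(round(r_biased.mean(), 2))  # clearly above 0
```

The first check passes for any fitted mean, which is exactly why a nonzero average (or a long run of same-sign residuals in part of the data) signals that the expectation being used is not the model's own fit but a biased one.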

  

Residual graphical analysis: "Visual diagnosis" of model fit

  Through the scatter plot of residuals and predicted values, we can quickly determine whether the model meets the assumptions. The core conclusions are as follows:

  

1. The residuals scatter within a horizontal band: The ideal state of model fitting

  If the scatter plot of residuals and predicted values shows a banded area with a consistent width (no obvious change in width) and no extreme outliers, it indicates that:

  - The variance of the residuals is constant (satisfying the "homoscedasticity" assumption);

  - No systematic trend (the model has captured all interpretable variation);

  - No outliers (no bias in the sample).

  At this point the model fits well, and the unexplained errors are only random fluctuations.
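The three properties above can be checked numerically as well as visually; a minimal sketch on simulated data from a correctly specified linear model (all parameters made up):

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.uniform(0, 10, size=300)
y = 3 + 2 * x + rng.normal(0, 1, size=300)  # truly linear, constant variance

slope, intercept = np.polyfit(x, y, 1)
fitted = intercept + slope * x
resid = y - fitted

# For a least-squares fit the residuals are orthogonal to the fitted values,
# so a healthy residual plot shows no trend and a roughly constant band.
print(abs(np.corrcoef(resid, fitted)[0, 1]) < 1e-6)  # True: no trend
print(abs(resid.mean()) < 1e-6)                      # True: zero mean
print(np.abs(resid / resid.std()).max())             # largest standardized residual
```

Plotting `resid` against `fitted` for this data would give exactly the ideal banded scatter the section describes.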

  

2. The residuals show a funnel shape: A signal of non-constant variance

  The funnel shape means that the fluctuation range of the residuals widens (or narrows) as the predicted value increases. For example, when studying "the impact of income on consumption", the consumption residuals of low-income groups are very small (consumption is concentrated), while those of high-income groups are very large (consumption differs widely). This situation violates the "homoscedasticity" assumption:

  - The model assumes that the error variances of all observations are the same, but in reality, the variance changes with the predicted values.

  - It leads to inaccurate standard errors for the parameter estimates (the estimates in high-variance regions are even less reliable).

  At this point the model is not well fitted and needs adjustment (for example, apply a square-root transformation to the dependent variable to stabilize the variance).
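Both the funnel and the effect of the square-root fix can be simulated; a minimal sketch (the data-generating process is made up) comparing residual spread in the lower vs upper half of the fitted values:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(4, 100, size=500)
# Error spread grows with the level of x: the classic funnel shape.
y = 10 * x + rng.normal(0, 2 * np.sqrt(x))

def spread_ratio(resid, fitted):
    # Residual spread in the upper half of fitted values vs the lower half;
    # near 1 under homoscedasticity, well above 1 for a funnel.
    cut = np.median(fitted)
    return resid[fitted >= cut].std() / resid[fitted < cut].std()

# Raw scale: linear fit, residuals fan out as the fitted values grow.
b, a = np.polyfit(x, y, 1)
print(round(spread_ratio(y - (a + b * x), a + b * x), 2))

# Square-root scale: refit, and the variance is roughly stabilized.
sx, sy = np.sqrt(x), np.sqrt(y)
b2, a2 = np.polyfit(sx, sy, 1)
print(round(spread_ratio(sy - (a2 + b2 * sx), a2 + b2 * sx), 2))
```

The first ratio is well above 1 (funnel), the second close to 1, which is what the residual plot would show as a band of roughly constant width after transformation.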

  

3. The residuals show a curvilinear shape: A sign of a wrong model form

  If the scatter plot of residuals and predicted values shows an obvious curvilinear trend (e.g., negative first, then positive, and negative again), it indicates that the functional form of the model is incorrect. For example, the actual relationship is a quadratic curve (such as the relationship between study time and grades: grades first increase with time and then decline after exceeding a threshold), but a linear model is used, resulting in the residuals retaining an unexplained non-linear trend.

  At this time, the model is completely unable to capture the real patterns of the data, and the model form must be reset (e.g., adding a quadratic term).
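The study-time example can be simulated directly; a minimal sketch (coefficients and noise level are made up) showing the negative/positive/negative residual pattern and how the quadratic term removes it:

```python
import numpy as np

rng = np.random.default_rng(3)
hours = np.linspace(0, 10, 200)
# Hypothetical "study time vs grade" data: rises, peaks, then declines.
grades = 60 + 12 * hours - hours ** 2 + rng.normal(0, 2, size=200)

# Misspecified linear fit: the leftover curvature shows up as the
# negative / positive / negative residual pattern described above.
r_lin = grades - np.polyval(np.polyfit(hours, grades, 1), hours)
print(r_lin[hours < 2].mean() < 0)                  # True
print(r_lin[(hours > 4) & (hours < 6)].mean() > 0)  # True
print(r_lin[hours > 8].mean() < 0)                  # True

# Adding the quadratic term absorbs the trend and shrinks the residuals.
r_quad = grades - np.polyval(np.polyfit(hours, grades, 2), hours)
print(round(r_lin.std() / r_quad.std(), 1))  # well above 1
```

The residual spread drops by several times once the model form matches the data, which is the numerical counterpart of the curvilinear residual plot straightening out.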

  

4. The residuals show an "elliptical shape": The variance problem of probability value variables

  When the response variable is a probability or proportion (a value between 0 and 1), the variance of the residuals changes with the predicted value: for a proportion, the variance is proportional to p(1 - p), so it shrinks as the probability approaches 0 or 1 and peaks near 0.5. In this case, the residual scatter plot takes on an "elliptical" shape, which again violates the "homoscedasticity" assumption. A data transformation is needed: the logit transformation maps the probability values onto the real number line (the basis of logistic regression), and modeling on that scale accounts for the changing variance.

  In essence, the ellipse reflects the non-constant variance specific to probability-valued responses and is a typical signal of model misfit.

  

Residuals are the "bridge" between the model and the data

  The core value of residuals lies in connecting the sample with the population, and the model with the data through the "random error after mean removal".

  - The normality of residuals tests the representativeness of the sample;

  - The trends in residuals test the goodness of the model's fit;

  - The graphical analysis of residuals intuitively reveals problems in the model.

  In short, residuals are the "key tool for diagnosing data quality and model reliability" in statistical analysis. By understanding residuals, one can understand the "fit" between data and models.