An in-depth analysis of statistical learning, from concepts to methods: cracking the code of the population distribution

  

I. The core goal of statistical learning

  The underlying logic of statistical analysis is "inferring the whole from its parts". Each of the four learning objectives below supports one step of this logic, and none of them can be omitted:

  1. Master the concepts of population, individual, sample, and statistic: This is the language system of statistics. If you can't distinguish between the population (the whole to be studied) and the sample (the part being studied), subsequent analyses will fall into the logical fallacy of using fragments to represent the whole picture.

  2. Be familiar with data organization methods: Raw data are scattered "information fragments" (such as the lifespan values of 100 light bulbs). Organization (sorting, grouping, tabulating) is the necessary path to transform these fragments into an "analyzable structure" — without organization, it is impossible to extract effective information.

  3. Master the concepts and calculations of the sample mean and median: They are the core indicators of central tendency. To answer the question "Around which value do the data fluctuate" (e.g., "What is the average lifespan of the light bulbs"), we must use the mean (arithmetic mean) or the median (the representative value at the middle position).

  4. Master the concepts and calculations of the range, variance, and standard deviation of a sample: They are the core indicators of dispersion. To answer the question "How large is the fluctuation of the data?" (e.g., "Is there a large difference in the lifespans of the bulbs?"), one must use the range (the maximum value minus the minimum value), the variance (the averaged squared deviation from the mean, computed for a sample with the divisor \(n - 1\)), or the standard deviation (the square root of the variance).

  

II. Sampling: Why not study the entire population directly?

  

1. The essence of sampling

  Sampling is the process of drawing some individuals from the population according to fixed rules. Its core purpose is to "infer the characteristics of the population from sample information". We sample not because we do not want to study the whole, but because studying the whole is impossible or unnecessary.

  

2. Four main reasons why sampling is necessary

  Destructive testing defeats the research purpose: When testing the lifespan of light bulbs, each bulb is used up by the test. If all bulbs were tested, all of them would become waste, which goes against the original intention of "studying the lifespan to improve production".

  Objective impossibility: When studying the "pH value of the global oceans", it is impossible to collect samples of every drop of seawater.

  Unaffordable cost or time: A full national census that registers every person requires millions of workers and can take several years, while a 1% population sample survey takes only a few months and cuts the number of people surveyed by 99%.

  Tolerable margin of error: In a public opinion poll with a sample size of 1,000, the margin of error is about ±3% (at the usual 95% confidence level). This is sufficient for "predicting election results" or "formulating policies", and there is no need to spend roughly 10 times the cost to reach an accuracy of ±1%.
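The ±3% figure can be checked with the usual normal-approximation formula for a sample proportion, \(z\sqrt{p(1-p)/n}\). The sketch below assumes the worst case \(p = 0.5\) and a 95% confidence level (\(z = 1.96\)); neither value is stated in the text, and the function names are illustrative.

```python
import math

def margin_of_error(n, p=0.5, z=1.96):
    """Approximate margin of error for a sample proportion.

    Assumes a simple random sample and the normal approximation;
    p=0.5 is the worst case, z=1.96 corresponds to 95% confidence.
    """
    return z * math.sqrt(p * (1 - p) / n)

def required_sample_size(target_moe, p=0.5, z=1.96):
    """Smallest sample size whose margin of error is at most target_moe."""
    return math.ceil(z ** 2 * p * (1 - p) / target_moe ** 2)

print(f"n = 1000 gives about ±{margin_of_error(1000):.1%}")
print(f"±1% needs about {required_sample_size(0.01)} respondents")
```

Under these assumptions, tightening the margin from ±3% to ±1% takes roughly ten times as many respondents, which matches the cost argument above.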

  

III. Population and Individual: What on earth is the research object of statistics?

  Many people think of the "population" only as a "physical collection" (such as "a batch of light bulbs" or "the population of a city"), but the population in statistics is a "collection of quantitative indicators": what we care about is "a certain characteristic of the individual", rather than "the individual itself".

  

1. Precise definition

  Statistical population: The entirety of a certain quantitative indicator of the individuals under investigation (such as "the entirety of the lifespan values of this batch of light bulbs", "the entirety of the age values of the population in this city"), denoted as the random variable \(X\) (because the indicator value of a randomly chosen individual is random).

  Individual: Each quantitative indicator value that makes up the population (e.g., "the lifespan of a certain light bulb is 1000 hours", "the age of a certain person is 25 years old").

  

2. Example illustration

  - When studying the "lifespan of bulbs", the physical population is "a batch of bulbs", but the statistical population is "the set of lifespan values of this batch of bulbs" (e.g., \(\{800, 1200, 950, ...\}\)).

  - When studying the "population age", the statistical population is the "set of age values of the urban population", and the individual is the "age value of each person".

  

IV. Simple random sample: The key to making the sample represent the population

  To enable the sample to effectively infer the population, two conditions, randomness and independence, must be met. Such a sample is called a simple random sample (referred to as "sample" for short):

  

1. Two core conditions

  Randomness: Each sample observation \(X_i\) has the same distribution as the population \(X\) (for example, if the population follows a normal distribution, each \(X_i\) follows that same normal distribution).

  Independence: There is no correlation between sample individuals (for example, if light bulb A is selected, it does not affect the probability of light bulb B being selected, nor does it affect the lifespan value of B).

  

2. Warning from Counterexamples

  - Only sample the "first box of light bulbs" (which might be produced earlier and have more stable quality): The sample does not meet the requirement of randomness - the lifespan distribution of the first box is different from that of the whole batch.

  - Only conduct opinion polls among "one's own friends": The sample does not meet the independence requirement, and it is not random either. Friends tend to have highly similar ages and occupations, so their answers are correlated and they cannot represent the entire population.
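The first counterexample can be made concrete with a small simulation. Everything in this sketch is hypothetical: the batch of 10 boxes, the assumption that box 1 lasts longer, and all the numbers are invented for illustration. Note that `random.sample` draws without replacement, so the draws are only approximately independent when the sample is small relative to the batch.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the sketch is reproducible

# Hypothetical batch: 10 boxes of 100 bulbs each.
# Box 1 is assumed to have longer, more stable lifespans than the rest.
boxes = [[rng.gauss(1200, 50) for _ in range(100)]]
boxes += [[rng.gauss(1000, 100) for _ in range(100)] for _ in range(9)]
population = [life for box in boxes for life in box]

population_mean = statistics.mean(population)

# Convenience sample: only the first box.
first_box_mean = statistics.mean(boxes[0])

# Simple random sample: 100 bulbs, each equally likely to be chosen.
srs_mean = statistics.mean(rng.sample(population, 100))

print(f"population mean:    {population_mean:.1f}")
print(f"first box only:     {first_box_mean:.1f}")
print(f"simple random draw: {srs_mean:.1f}")
```

In this setup the first-box mean overshoots the batch mean by a wide margin, while the random sample lands close to it.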

  

V. The core task of statistics: Crack the "distribution code" of the population

  The essence of the population is a probability distribution (for example, the lifespan of a light bulb is often modeled by an exponential distribution, and height by a normal distribution). The task of statistics is to solve two problems:

  1. What distribution does the population follow? (For example, "Is the tensile strength of rubber parts normally distributed?" "Is the distribution of television lifespans skewed?")

  2. What are the parameters of the distribution? (For example, in a normal distribution, the mean \(\mu\) represents the average tensile strength, and the variance \(\sigma^2\) represents the degree of fluctuation of the tensile strength.)

  

Example: The value of knowing the distribution

  If it is known that "the lifespan of the light bulb follows the exponential distribution \(Exp(\lambda)\)" with \(\lambda = 0.001\) (so the average lifespan is \(1/\lambda = 1000\) hours), then the "probability that the lifespan exceeds 1500 hours" can be calculated:

  \[ P(X > 1500) = e^{-\lambda \cdot 1500} = e^{-1.5} \approx 0.2231 \]
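For an exponential population the probability of exceeding \(t\) is \(P(X > t) = e^{-\lambda t}\). The sketch below evaluates it for \(\lambda = 0.001\) and \(t = 1500\), and cross-checks the value against simulated lifespans. The function name and the simulation size are illustrative choices.

```python
import math
import random

def exponential_survival(t, lam):
    """P(X > t) for X ~ Exp(lam), valid for t >= 0."""
    return math.exp(-lam * t)

lam = 0.001  # rate from the text; mean lifespan is 1/lam = 1000 hours
exact = exponential_survival(1500, lam)

# Monte Carlo cross-check with simulated lifespans.
rng = random.Random(0)
draws = [rng.expovariate(lam) for _ in range(100_000)]
simulated = sum(x > 1500 for x in draws) / len(draws)

print(f"exact:     {exact:.4f}")   # e^{-1.5}, about 0.2231
print(f"simulated: {simulated:.4f}")
```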

  

VI. Distribution cases of common populations

  

1. Product qualification status: Binomial distribution population

  When investigating whether a product is qualified or not (denote a qualified product as 0 and an unqualified one as 1), the population is the set of the qualification statuses of all products. If the proportion of unqualified products is \(p\), then the population follows a Bernoulli distribution (a binomial distribution with \(n = 1\)) \(b(1,p)\), and its probability mass function is as follows:

  \(X\) (qualification status)    0 (qualified)    1 (unqualified)
  Probability \(P\)               \(1 - p\)        \(p\)

  - For Factory A with \(p = 0.01\): The population distribution is \(b(1, 0.01)\) (the probability that a product is qualified is 99%).

  - For Factory B with \(p = 0.08\): The population distribution is \(b(1, 0.08)\) (the probability that a product is qualified is 92%).
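A minimal sketch of the two Bernoulli populations follows. It encodes the probability mass function and draws a simulated sample from each factory to show that the sample proportion of unqualified products approaches \(p\). The helper names and the sample size of 20,000 are assumptions made for illustration.

```python
import random

def bernoulli_pmf(x, p):
    """P(X = x) for X ~ b(1, p): x = 1 (unqualified) or x = 0 (qualified)."""
    if x == 1:
        return p
    if x == 0:
        return 1 - p
    return 0.0

def draw_sample(p, n, rng):
    """Draw n independent observations from b(1, p)."""
    return [1 if rng.random() < p else 0 for _ in range(n)]

rng = random.Random(3)
for name, p in [("Factory A", 0.01), ("Factory B", 0.08)]:
    sample = draw_sample(p, 20_000, rng)
    proportion = sum(sample) / len(sample)
    print(f"{name}: P(qualified) = {bernoulli_pmf(0, p):.2f}, "
          f"sample proportion unqualified = {proportion:.4f}")
```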

  

2. Tensile strength of rubber parts: Normal population

  The tensile strength of rubber parts is a nonnegative real number and is commonly modeled by a normal distribution \(N(\mu, \sigma^2)\). The tensile strength is affected by many independent factors such as "raw material purity, vulcanization time, and temperature"; when such effects add up, the central limit theorem says that the distribution of their sum is approximately normal.
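The central limit theorem argument can be illustrated by summing independent non-normal effects and checking that the result behaves like a normal variable. This is a sketch under stated assumptions: 12 independent Uniform(0, 1) effects stand in for the real factors, and the normal reference value 0.6827 is the probability of falling within one standard deviation of the mean.

```python
import random
import statistics

rng = random.Random(0)

def summed_effect(rng, n_factors=12):
    """Sum of n_factors independent Uniform(0, 1) effects, centred at zero.

    Each uniform has variance 1/12, so with 12 factors the sum has variance 1.
    """
    return sum(rng.random() for _ in range(n_factors)) - n_factors / 2

values = [summed_effect(rng) for _ in range(20_000)]
mean = statistics.fmean(values)
sd = statistics.pstdev(values)

# For a normal distribution about 68.27% of values lie within one sd of the mean.
within_one_sd = sum(abs(v - mean) <= sd for v in values) / len(values)

print(f"mean = {mean:.3f}, sd = {sd:.3f}")
print(f"share within one sd = {within_one_sd:.3f} (normal: 0.683)")
```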

  

3. Television lifespan: Skewed population

  The population distribution of the lifespan of televisions is skewed (e.g., right-skewed): most of the lifespans are around the average, but a small number of televisions have very long lifespans (e.g., having been used for 10 years), causing the distribution curve to extend to the right.

  - Mixed skewness case: The parts produced by two operators are mixed together (the mean of those produced by Operator A is 5, and the mean of those produced by Operator B is 10). The overall distribution can then have two peaks (bimodal) and look asymmetric. The causes (such as differences in operators or machines) need to be found to control the quality.
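Right skew can be detected numerically: the mean is pulled above the median, and the moment-based skewness coefficient \(m_3 / m_2^{3/2}\) is positive. The sketch below uses simulated lifespans with an exponential shape and a mean of 6 years; both choices are assumptions for illustration only, not a claim about real televisions.

```python
import random
import statistics

def sample_skewness(xs):
    """Moment-based skewness m3 / m2**1.5 (zero for symmetric data)."""
    m = statistics.fmean(xs)
    m2 = sum((x - m) ** 2 for x in xs) / len(xs)
    m3 = sum((x - m) ** 3 for x in xs) / len(xs)
    return m3 / m2 ** 1.5

rng = random.Random(1)
# Hypothetical right-skewed lifespans in years (exponential shape assumed).
lifetimes = [rng.expovariate(1 / 6) for _ in range(20_000)]

print(f"mean     = {statistics.fmean(lifetimes):.2f}")
print(f"median   = {statistics.median(lifetimes):.2f}")
print(f"skewness = {sample_skewness(lifetimes):.2f}")  # positive for right skew
```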

  

VII. Statistic: "Process" the sample into useful information

  The sample \(X_1, X_2, ..., X_n\) consists of scattered random variables, whose observed values are numbers such as \(x_1 = 1000, x_2 = 1200, ..., x_n = 950\). It is necessary to use statistics to condense the scattered information and reflect the characteristics of the population.

  

1. Definition of a statistic

  A statistic is a function of the sample that does not contain unknown parameters (such as the sample mean \(\bar{X} = \frac{1}{n}\sum_{i=1}^n X_i\)). It is a random variable: since the sample is random, the value of the statistic is also random (different samples have different means).

  

2. Key judgments

  - If the population \(X \sim N(\mu, \sigma^2)\), where \(\mu\) is known and \(\sigma^2\) is unknown:

    - \(\bar{X}\) is a statistic (it contains no unknown parameters).

    - \(\frac{X_1 - \mu}{\sigma}\) is not a statistic (it contains the unknown parameter \(\sigma\)).

  

3. Sampling distribution: The "probability law" of statistics

  The distribution of a statistic is called the sampling distribution, which is the basis of statistical inference. For example:

  - If the population \(X \sim N(\mu, \sigma^2)\), then the sample mean \(\bar{X} \sim N(\mu, \frac{\sigma^2}{n})\). The variance of \(\bar{X}\) shrinks as \(n\) grows, so the larger the sample size, the more tightly \(\bar{X}\) concentrates around the population mean \(\mu\).
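The sampling distribution of \(\bar{X}\) can be checked by simulation: draw many samples of size \(n\), compute the mean of each, and compare the spread of those means with \(\sigma^2 / n\). The parameter values (\(\mu = 1000\), \(\sigma = 100\), \(n = 25\)) and the number of repetitions are hypothetical choices for this sketch.

```python
import random
import statistics

rng = random.Random(7)
mu, sigma, n = 1000.0, 100.0, 25  # hypothetical population parameters and sample size

def one_sample_mean(rng):
    """Draw one sample of size n from N(mu, sigma^2) and return its mean."""
    return statistics.fmean([rng.gauss(mu, sigma) for _ in range(n)])

# Repeat the sampling many times to approximate the distribution of the mean.
means = [one_sample_mean(rng) for _ in range(5000)]

print(f"mean of sample means:     {statistics.fmean(means):.2f} (theory: {mu})")
print(f"variance of sample means: {statistics.variance(means):.1f} "
      f"(theory: {sigma ** 2 / n})")
```

Raising `n` in this sketch makes the variance of the sample means fall in proportion to \(1/n\).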

  

VIII. Classification of Commonly Used Statistics

  Commonly used statistics describe two core characteristics of the population: central tendency and dispersion. Several of them are defined on the ordered sample, obtained by sorting the sample as \(X_{(1)} \leq X_{(2)} \leq \dots \leq X_{(n)}\):

  

1. Measures of central tendency (reflecting the "degree of concentration")

  Sample mean: \(\bar{X} = \frac{1}{n}\sum_{i = 1}^n X_i\) (Arithmetic mean, most commonly used);

  Sample median: \(M_e = \begin{cases} X_{(\frac{n+1}{2})} & \text{when } n \text{ is odd} \\ \frac{X_{(\frac{n}{2})} + X_{(\frac{n}{2}+1)}}{2} & \text{when } n \text{ is even} \end{cases}\) (The value at the middle position, resistant to the interference of extreme values).
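The piecewise definition of the median translates directly into code once the sample is sorted. In the sketch below the function name and the data are illustrative; the second data set adds one extreme value to show that the median barely moves while the mean does.

```python
import statistics

def sample_median(xs):
    """Median from the ordered sample X_(1) <= ... <= X_(n)."""
    ordered = sorted(xs)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty sample is undefined")
    mid = n // 2
    if n % 2 == 1:
        # n odd: X_((n+1)/2), which is zero-based index n // 2
        return ordered[mid]
    # n even: average of X_(n/2) and X_(n/2 + 1)
    return (ordered[mid - 1] + ordered[mid]) / 2

typical = [950, 1000, 1050, 1100, 1150]        # hypothetical lifespans in hours
with_outlier = [950, 1000, 1050, 1100, 5000]   # one extreme value

print(sample_median(typical), statistics.mean(typical))            # 1050 1050
print(sample_median(with_outlier), statistics.mean(with_outlier))  # 1050 1820
```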

  

2. Measures of dispersion (reflecting the "degree of fluctuation")

  Sample range: \(R = X_{(n)} - X_{(1)}\) (the maximum value minus the minimum value, which is simple but easily affected by extreme values);

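  Sample variance: \(S^2 = \frac{1}{n-1}\sum_{i=1}^n (X_i - \bar{X})^2\) (the sum of squared deviations from \(\bar{X}\) divided by \(n - 1\) rather than \(n\), which makes \(S^2\) an unbiased estimator of \(\sigma^2\); used to measure fluctuations);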

  Sample standard deviation: \(S = \sqrt{S^2}\) (the square root of the variance, consistent with the unit of the original data).
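The three dispersion formulas can be sketched in a few lines. The five lifespan values are hypothetical, and the function names are illustrative; the results are compared with Python's `statistics` module, whose `variance` and `stdev` also use the \(n - 1\) divisor.

```python
import math
import statistics

def sample_range(xs):
    """R = X_(n) - X_(1)."""
    return max(xs) - min(xs)

def sample_variance(xs):
    """S^2 with the n - 1 divisor."""
    n = len(xs)
    if n < 2:
        raise ValueError("sample variance needs at least two observations")
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)

def sample_std(xs):
    """S = sqrt(S^2), in the same unit as the data."""
    return math.sqrt(sample_variance(xs))

lifespans = [800, 1200, 950, 1000, 1050]  # hypothetical lifespans in hours

print(sample_range(lifespans))            # 400
print(sample_variance(lifespans))         # 21250.0
print(round(sample_std(lifespans), 2))    # 145.77
```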

  With these concepts laid out, the logical chain of statistical analysis becomes clear: define the population → select a simple random sample → calculate the statistic → infer the population distribution and parameters using the sampling distribution. This is the core methodology of statistics.