Foundations of Statistical Analysis
Statistics is the mathematical science concerned with the collection, organization, summarization, analysis, and interpretation of data.
Core Branches of Statistics
- Descriptive Statistics: Focuses on organizing, summarizing, and presenting data using tables, charts, and summary values.
- Inferential Statistics: Focuses on drawing conclusions about a larger population based on data collected from a representative sample. This includes testing hypotheses and making predictions.
Key Terms
- Population: The complete set of all elements, items, or individuals under analysis.
- Sample: A subset selected from a population to be analyzed in detail.
- Simple Random Sample: A sample chosen such that every possible subset of a given size has an equal probability of selection.
Sampling Frameworks
Probability Sampling (Random Methods)
- Simple Random Sampling: Every member of the population, and every possible sample of a given size, has an equal chance of selection.
- Systematic Sampling: Members are selected at regular intervals from an ordered list, starting from a randomly chosen initial point (e.g., picking every $k$-th item).
- Stratified Random Sampling: The population is divided into distinct, non-overlapping subgroups (strata), and a random sample is drawn from each subgroup in proportion to its size.
- Cluster Sampling: The population is divided into geographic or operational groups (clusters). A random sample of clusters is chosen, and all members within those selected clusters are analyzed.
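The four probability methods above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names, the toy population of 100 numbered members, the two strata, and the ten clusters are all assumptions made for the example.

```python
import random

def simple_random_sample(population, n, rng):
    """Draw n members without replacement; every subset of size n is equally likely."""
    return rng.sample(population, n)

def systematic_sample(population, k, rng):
    """Take every k-th member after a random start among the first k positions."""
    start = rng.randrange(k)
    return population[start::k]

def stratified_sample(strata, n, rng):
    """Proportional allocation: each stratum contributes round(n * its share).
    Rounding can make the total differ slightly from n for uneven shares."""
    total = sum(len(members) for members in strata.values())
    sample = []
    for members in strata.values():
        quota = round(n * len(members) / total)
        sample.extend(rng.sample(members, quota))
    return sample

def cluster_sample(clusters, m, rng):
    """Choose m whole clusters at random and keep every member of each one."""
    chosen = rng.sample(list(clusters), m)
    return [member for name in chosen for member in clusters[name]]

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical data: members 1..100, two strata (60/40), ten clusters of ten.
population = list(range(1, 101))
strata = {"A": list(range(1, 61)), "B": list(range(61, 101))}
clusters = {f"c{i}": list(range(i * 10 + 1, i * 10 + 11)) for i in range(10)}

srs = simple_random_sample(population, 10, rng)
sys_sample = systematic_sample(population, 10, rng)
strat = stratified_sample(strata, 10, rng)
clus = cluster_sample(clusters, 2, rng)
```

With these inputs the stratified sample takes 6 members from stratum A and 4 from stratum B, and the cluster sample contains all 20 members of the two chosen clusters.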
Non-Probability Sampling (Non-Random Methods)
- Convenience (Accidental) Sampling: Samples are chosen based on ease of access rather than random selection.
- Purposive Sampling: The researcher uses personal judgment to select individuals who seem best suited for the study's goals.
- Quota Sampling: Participants are sampled within specific categories until a predetermined target number is reached.
- Snowball Sampling: Initial participants recruit additional subjects from their networks, a technique often used to study hard-to-reach groups.
Sample Size Formulations
1. Slovin's Formula
Used to estimate the required sample size ($n$) when analyzing a finite population ($N$) given a maximum allowable margin of error ($e$):
$$n = \frac{N}{1 + N e^2}$$
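As a quick check of the arithmetic, here is a minimal sketch of Slovin's formula in Python. The function name and the example inputs (a population of 10,000 and a 5% margin of error) are assumptions chosen for illustration. The result is rounded up, since a sample size must be a whole number.

```python
import math

def slovin_sample_size(population_size, margin_of_error):
    """Slovin's formula, n = N / (1 + N * e**2), rounded up to a whole unit."""
    raw = population_size / (1 + population_size * margin_of_error ** 2)
    return math.ceil(raw)

# 10000 / (1 + 10000 * 0.0025) = 10000 / 26 = 384.6..., rounded up to 385
n = slovin_sample_size(10_000, 0.05)
print(n)  # 385
```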
2. Estimation for a Population Mean
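To estimate a population mean within a specific margin of error ($E$) at a chosen confidence level ($1 - \alpha$), the minimum required sample size is:
$$n = \left( \frac{z_{\alpha/2} \, \sigma}{E} \right)^2$$
Where $\sigma$ represents the known population standard deviation and $z_{\alpha/2}$ is the standard normal z-score corresponding to the confidence level. The result is rounded up to the next whole number.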
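The calculation can be sketched as follows, assuming the usual form of the formula, n = (z * sigma / E)^2, with a two-sided z-score. The function name and the example values (sigma = 15, E = 2, 95% confidence) are hypothetical. The z-score comes from `statistics.NormalDist` in the Python standard library.

```python
import math
from statistics import NormalDist

def sample_size_for_mean(sigma, margin_of_error, confidence=0.95):
    """Minimum n so the mean is estimated within margin_of_error.

    Uses n = (z * sigma / E)**2, where z is the two-sided standard
    normal critical value for the given confidence level.
    """
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for 95%
    return math.ceil((z * sigma / margin_of_error) ** 2)

# (1.96 * 15 / 2)**2 is roughly 216.08, so 217 observations are needed.
n = sample_size_for_mean(sigma=15, margin_of_error=2, confidence=0.95)
print(n)  # 217
```

Halving the margin of error roughly quadruples the required sample size, because $E$ appears squared in the denominator.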
3. Cochran's Formula for Proportions
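To estimate a population proportion within a target margin of error ($E$), the required sample size is calculated as:
$$n = \frac{z_{\alpha/2}^2 \, p(1 - p)}{E^2}$$
Where $p$ is the estimated baseline proportion and $z_{\alpha/2}$ is the standard normal z-score for the chosen confidence level. If no prior estimate is available, setting $p = 0.5$ provides the most conservative sample size estimate, because $p(1 - p)$ is largest at that value.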
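A minimal sketch of this calculation, assuming the common form n = z^2 * p * (1 - p) / E^2 with a two-sided z-score. The function name, the default of p = 0.5, and the example inputs are assumptions made for illustration.

```python
import math
from statistics import NormalDist

def cochran_sample_size(margin_of_error, p=0.5, confidence=0.95):
    """Sample size for estimating a proportion within margin_of_error.

    p defaults to 0.5, the most conservative choice when no prior
    estimate of the proportion is available.
    """
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return math.ceil(z ** 2 * p * (1 - p) / margin_of_error ** 2)

conservative = cochran_sample_size(0.05)       # p = 0.5
with_prior = cochran_sample_size(0.05, p=0.2)  # prior estimate of 20%
print(conservative, with_prior)  # 385 246
```

The comparison shows why p = 0.5 is called conservative: any other value of p yields a smaller required sample for the same margin of error.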
Levels of Data Measurement
Data is classified into four levels of measurement, which dictate the types of statistical analysis that can be performed.
[Highest Level] Ratio -> True zero point, meaningful ratios (e.g., Weight, Age)
|
Interval -> Consistent intervals, no true zero (e.g., Temperature in Celsius)
|
Ordinal -> Meaningful ranking, unequal intervals (e.g., Customer Ratings)
|
[Lowest Level] Nominal -> Categorical labeling only, no order (e.g., Eye Color)
- Nominal: Categorical classification with no inherent order or numerical ranking (e.g., gender, nationality, or text labels).
- Ordinal: Categorical data that can be logically ranked or ordered, though the mathematical differences between ranks cannot be quantified (e.g., performance ratings like excellent, good, or poor).
- Interval: Numerical data with consistent differences between values, but without a true, meaningful zero point (e.g., temperature scales like Celsius or Fahrenheit).
- Ratio: Numerical data featuring both consistent intervals and a true zero point, which allows for direct comparison of ratios (e.g., height, weight, salary, or age).
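To illustrate how the level of measurement constrains analysis, the sketch below encodes the conventional rule of thumb that each summary operation needs data at or above a minimum level. The table and the function name are assumptions made for the example, and the list of operations is deliberately short rather than exhaustive.

```python
# Levels ordered from lowest to highest.
LEVELS = ["nominal", "ordinal", "interval", "ratio"]

# Lowest level at which each operation is conventionally meaningful.
MINIMUM_LEVEL = {
    "mode": "nominal",                # needs only category labels
    "median": "ordinal",              # needs a ranking
    "mean": "interval",               # needs consistent differences
    "ratio of two values": "ratio",   # needs a true zero point
}

def is_meaningful(operation, level):
    """Return True if the operation is meaningful for data at this level."""
    return LEVELS.index(level) >= LEVELS.index(MINIMUM_LEVEL[operation])

print(is_meaningful("mean", "ordinal"))                   # False
print(is_meaningful("median", "ratio"))                   # True
print(is_meaningful("ratio of two values", "interval"))   # False
```

The last call reflects the Celsius example: 20 degrees is not "twice as hot" as 10 degrees, because zero on that scale is not a true zero.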
Measures of Central Tendency and Dispersion
Central Tendency Measures (Averages)
- Arithmetic Mean ($\bar{x}$): The sum of all values divided by the total number of observations, $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$.
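- Weighted Mean: Used when observations carry varying degrees of importance or frequency, with $w_i$ denoting the weight attached to observation $x_i$:
$$\bar{x}_w = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$$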
- Median: The middle value when data points are arranged in ascending or descending order. For an even number of observations, it is the average of the two central values.
- Mode: The most frequently occurring value in a data set.
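The measures above can be computed with Python's standard `statistics` module. The scores and weights below are made-up values for illustration. The weighted mean is written out by hand so that the formula stays visible.

```python
from statistics import mean, median, multimode

scores = [70, 85, 85, 90, 100]   # hypothetical exam scores
weights = [1, 2, 2, 3, 2]        # hypothetical importance of each score

arithmetic_mean = mean(scores)   # (70 + 85 + 85 + 90 + 100) / 5 = 86

# Weighted mean: sum of weight * value, divided by the sum of the weights.
weighted_mean = sum(w * x for w, x in zip(weights, scores)) / sum(weights)
# (70 + 170 + 170 + 270 + 200) / 10 = 88.0

middle = median(scores)                   # odd count: the central value, 85
even_middle = median([70, 85, 90, 100])   # even count: (85 + 90) / 2 = 87.5

modes = multimode(scores)        # list of all most frequent values: [85]
```

`multimode` returns a list because a data set can have more than one mode.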
Dispersion Measures (Variability)
- Range: The absolute difference between the largest and smallest values in a data set.
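- Sample Variance ($s^2$): Measures the average squared deviation of data points from the sample mean $\bar{x}$, using $n - 1$ in the denominator:
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$
- Sample Standard Deviation ($s$): The square root of the sample variance, providing a measure of spread in the original units of the data:
$$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$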
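A short worked example of the dispersion measures, assuming the usual sample formulas with n - 1 in the denominator. The data values are invented for illustration. The hand computation is cross-checked against `statistics.variance` and `statistics.stdev`, which use the same n - 1 convention.

```python
import math
from statistics import stdev, variance

data = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical observations

data_range = max(data) - min(data)   # 9 - 2 = 7

n = len(data)
x_bar = sum(data) / n                # 40 / 8 = 5.0

# Squared deviations from the mean: 9, 1, 1, 1, 0, 0, 4, 16 (sum = 32)
sample_variance = sum((x - x_bar) ** 2 for x in data) / (n - 1)   # 32 / 7
sample_sd = math.sqrt(sample_variance)

# Cross-check against the standard library.
matches_library = (
    math.isclose(sample_variance, variance(data))
    and math.isclose(sample_sd, stdev(data))
)
```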