BSIT Survival Kit

Foundations of Statistical Analysis

Statistics is the mathematical science concerned with collecting, organizing, summarizing, analyzing, and interpreting data.

Core Branches of Statistics

Descriptive Statistics: Focuses on organizing, summarizing, and presenting data using tables, charts, and summary values.
Inferential Statistics: Focuses on drawing conclusions about a larger population based on data collected from a representative sample. This includes testing hypotheses and making predictions.

Key Terms

Population: The complete set of all elements, items, or individuals under analysis.
Sample: A subset selected from a population to be analyzed in detail.
Simple Random Sample: A sample chosen such that every possible subset of a given size has an equal probability of selection.

Sampling Frameworks

Probability Sampling (Random Methods)

Simple Random Sampling: Every member of the population has an equal chance of selection, and every possible sample of the chosen size is equally likely to be drawn.
Systematic Sampling: Members are selected at regular intervals from an ordered list, starting from a randomly chosen initial point (e.g., picking every $k$-th item).
Stratified Random Sampling: The population is divided into distinct, non-overlapping subgroups (strata), and a random sample is drawn from each subgroup, typically in proportion to the subgroup's size.
Cluster Sampling: The population is divided into geographic or operational groups (clusters). A random sample of clusters is chosen, and all members within those selected clusters are analyzed.
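
The four probability methods above can be sketched in Python using only the standard library. This is a minimal illustration, not a reference implementation: the student ID list, the strata and section names, the 10% sampling fraction, and the fixed seed are all hypothetical values chosen for the example.

```python
import random

def simple_random_sample(population, n, rng):
    """Draw n members without replacement; every size-n subset is equally likely."""
    return rng.sample(population, n)

def systematic_sample(population, k, rng):
    """Take every k-th member after a random start within the first k positions."""
    start = rng.randrange(k)
    return population[start::k]

def stratified_sample(strata, fraction, rng):
    """Draw the same fraction from each stratum (proportional allocation)."""
    sample = []
    for members in strata.values():
        size = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, size))
    return sample

def cluster_sample(clusters, n_clusters, rng):
    """Pick whole clusters at random and keep every member of each chosen cluster."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [member for name in chosen for member in clusters[name]]

rng = random.Random(42)          # fixed seed so the illustration is repeatable
students = list(range(1, 101))   # hypothetical ID numbers 1..100

srs = simple_random_sample(students, 10, rng)
sys_sample = systematic_sample(students, 10, rng)

# Hypothetical strata: 60 first-year and 40 second-year students.
strata = {"first_year": students[:60], "second_year": students[60:]}
strat = stratified_sample(strata, 0.10, rng)

# Hypothetical clusters: five class sections of 20 students each.
clusters = {f"section_{i}": students[i * 20:(i + 1) * 20] for i in range(5)}
clus = cluster_sample(clusters, 2, rng)
```

With these inputs, the stratified sample contains 6 first-year and 4 second-year students, mirroring the 60/40 split, while the cluster sample contains all 40 students from the two chosen sections.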

Non-Probability Sampling (Non-Random Methods)

Convenience (Accidental) Sampling: Samples are chosen based on ease of access rather than random selection.
Purposive Sampling: The researcher uses personal judgment to select individuals who seem best suited for the study's goals.
Quota Sampling: Participants are selected non-randomly within specific categories until a predetermined target number (quota) for each category is reached.
Snowball Sampling: Initial participants recruit additional subjects from their networks, a technique often used to study hard-to-reach groups.

Sample Size Formulations

1. Slovin's Formula

Used to estimate the required sample size ( $n$ ) for a finite population of size ( $N$ ) given a maximum allowable margin of error ( $E$ ), expressed as a decimal (e.g., 0.05 for 5%):

$n = \frac{N}{1 + N E^2}$
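
As a quick check of Slovin's formula, the sketch below computes the sample size for a hypothetical population of 1,000 with a 5% margin of error. The function name is an assumption made for this example, and rounding up is the usual convention since a fraction of a respondent cannot be sampled.

```python
import math

def slovin_sample_size(population_size, margin_of_error):
    """Slovin's formula: n = N / (1 + N * E^2), rounded up to a whole unit."""
    n = population_size / (1 + population_size * margin_of_error ** 2)
    return math.ceil(n)

# Hypothetical inputs: N = 1000, E = 0.05
# 1000 / (1 + 1000 * 0.0025) = 1000 / 3.5 = 285.71..., rounded up to 286
n = slovin_sample_size(1000, 0.05)
print(n)  # 286
```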

2. Estimation for a Population Mean

To estimate a population mean within a specific margin of error ( $E$ ) at a chosen confidence level ( $1-\alpha$ ), the minimum required sample size is:

$n = \frac{(z_{\alpha/2})^2 \sigma^2}{E^2}$

Where $\sigma$ represents the known population standard deviation and $z_{\alpha/2}$ is the standard normal z-score corresponding to the confidence level (e.g., about 1.96 for 95% confidence). Round the result up to the next whole number.
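
The formula for a mean can be sketched as follows. The values $\sigma = 15$ and $E = 3$ are hypothetical, and the function name is an assumption for this example. The z-score is taken from `statistics.NormalDist` in the Python standard library rather than a printed table.

```python
import math
from statistics import NormalDist

def sample_size_for_mean(sigma, margin_of_error, confidence=0.95):
    """n = (z_{alpha/2} * sigma / E)^2, rounded up to a whole unit."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)   # about 1.96 when confidence = 0.95
    return math.ceil((z * sigma / margin_of_error) ** 2)

# Hypothetical inputs: sigma = 15, E = 3, 95% confidence
# (1.96 * 15 / 3)^2 = 9.8^2 = 96.04, rounded up to 97
n_mean = sample_size_for_mean(sigma=15, margin_of_error=3)
print(n_mean)  # 97
```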

3. Cochran's Formula for Proportions

To estimate a population proportion within a target margin of error ( $E$ ), the required sample size is calculated as:

$n_0 = \frac{(z_{\alpha/2})^2 \hat{p}(1-\hat{p})}{E^2}$

Where $\hat{p}$ is the estimated baseline proportion. If no prior estimate is available, setting $\hat{p} = 0.5$ provides the most conservative (largest) sample size estimate, because the product $\hat{p}(1-\hat{p})$ reaches its maximum of 0.25 at that value.
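
A short sketch of Cochran's formula, comparing the conservative default $\hat{p} = 0.5$ with a hypothetical prior estimate of $\hat{p} = 0.2$. The function name and the 5% margin of error are assumptions for this example.

```python
import math
from statistics import NormalDist

def cochran_sample_size(margin_of_error, p_hat=0.5, confidence=0.95):
    """n0 = z^2 * p_hat * (1 - p_hat) / E^2, rounded up to a whole unit."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    n0 = z ** 2 * p_hat * (1 - p_hat) / margin_of_error ** 2
    return math.ceil(n0)

n_conservative = cochran_sample_size(0.05)             # p_hat = 0.5
n_with_prior = cochran_sample_size(0.05, p_hat=0.2)    # hypothetical prior estimate

print(n_conservative)  # 385
print(n_with_prior)    # 246
```

The smaller result for $\hat{p} = 0.2$ shows why 0.5 is the safe choice when nothing is known about the proportion: any other value requires fewer respondents.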

Levels of Data Measurement

Data is classified into four levels of measurement, which dictate the types of statistical analysis that can be performed.

[Highest Level]   Ratio      -> True zero point, meaningful ratios (e.g., Weight, Age)
                    |
                  Interval   -> Consistent intervals, no true zero (e.g., Temperature in Celsius)
                    |
                  Ordinal    -> Meaningful ranking, intervals not quantifiable (e.g., Customer Ratings)
                    |
[Lowest Level]    Nominal    -> Categorical labeling only, no order (e.g., Eye Color)

Nominal: Categorical classification with no inherent order or numerical ranking (e.g., gender, nationality, or text labels).
Ordinal: Categorical data that can be logically ranked or ordered, though the mathematical differences between ranks cannot be quantified (e.g., performance ratings like excellent, good, or poor).
Interval: Numerical data with consistent differences between values, but without a true, meaningful zero point (e.g., temperature scales like Celsius or Fahrenheit).
Ratio: Numerical data featuring both consistent intervals and a true zero point, which allows for direct comparison of ratios (e.g., height, weight, salary, or age).
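
The difference between interval and ratio data can be seen with a small temperature example. The two readings below are hypothetical. Kelvin is used here only as an illustration of a temperature scale with a true zero; it does not appear in the notes above.

```python
celsius_a, celsius_b = 10.0, 20.0   # hypothetical readings

# Differences are meaningful on an interval scale.
difference = celsius_b - celsius_a   # 10 degrees warmer

# A ratio on an interval scale depends on where the arbitrary zero sits.
naive_ratio = celsius_b / celsius_a  # 2.0, but 20 °C is not "twice as hot" as 10 °C

# Kelvin has a true zero, so the ratio is meaningful there.
kelvin_a = celsius_a + 273.15
kelvin_b = celsius_b + 273.15
true_ratio = kelvin_b / kelvin_a     # about 1.035

print(difference, naive_ratio, round(true_ratio, 3))
```

The 10-degree difference is the same on both scales, but only the Kelvin ratio describes a real physical relationship.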

Measures of Central Tendency and Dispersion

Central Tendency Measures (Averages)

Arithmetic Mean ( $\bar{x}$ ): The sum of all values divided by the total number of observations.
Weighted Mean: Used when observations carry varying degrees of importance or frequency:

$\bar{x} = \frac{\sum w x}{\sum w}$

Median ( $\tilde{x}$ ): The middle value when data points are arranged in ascending or descending order. For an even number of observations, it is the average of the two central values.
Mode ( $\hat{x}$ ): The most frequently occurring value in a data set. A data set can have more than one mode, or no mode at all if no value repeats.
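
The averages above can be computed with Python's standard `statistics` module. The scores, grades, and credit units below are hypothetical, and the `weighted_mean` helper is written for this example (it is not a library function).

```python
from statistics import mean, median, multimode

scores = [60, 70, 80, 90, 90, 96]   # hypothetical quiz scores

avg = mean(scores)         # 486 / 6 = 81
mid = median(scores)       # even count: (80 + 90) / 2 = 85.0
modes = multimode(scores)  # [90], returned as a list because ties are possible

def weighted_mean(values, weights):
    """Sum of w * x divided by the sum of w."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)

grades = [1.5, 2.0, 1.25]   # hypothetical course grades
units = [3, 5, 2]           # hypothetical credit units used as weights

# (3 * 1.5 + 5 * 2.0 + 2 * 1.25) / (3 + 5 + 2) = 17 / 10 = 1.7
gwa = weighted_mean(grades, units)

print(avg, mid, modes, gwa)
```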

Dispersion Measures (Variability)

Range: The absolute difference between the largest and smallest values in a data set.
Sample Variance ( $s^2$ ): Measures how far data points spread from the sample mean, computed as the sum of squared deviations divided by $n - 1$ (one less than the number of observations):

$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$

Sample Standard Deviation ( $s$ ): The square root of the sample variance, providing a measure of spread in the original units of the data:

$s = \sqrt{s^2}$
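
A short worked example of the dispersion measures, using a hypothetical five-value data set. The manual calculation follows the formulas above step by step, and the standard library's `statistics.variance` and `statistics.stdev` (which also divide by $n - 1$) are used as a cross-check.

```python
import math
from statistics import mean, stdev, variance

data = [4, 8, 6, 5, 7]   # hypothetical observations

data_range = max(data) - min(data)   # 8 - 4 = 4

x_bar = mean(data)                   # 30 / 5 = 6
squared_devs = [(x - x_bar) ** 2 for x in data]   # [4, 4, 0, 1, 1]

s2_manual = sum(squared_devs) / (len(data) - 1)   # 10 / 4 = 2.5
s_manual = math.sqrt(s2_manual)                   # about 1.581

# Cross-check against the standard library's sample statistics.
s2_library = variance(data)
s_library = stdev(data)

print(data_range, s2_manual, round(s_manual, 3))
```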

Unit IV: Data Analytics and Statistical Management