Formula Sheet for Statistics and Probability
Understanding statistics and probability is essential for analyzing data, making informed decisions, and solving real-world problems. Whether you’re a student, researcher, or professional, having a quick reference for key formulas can save time and improve accuracy. This formula sheet covers the most important equations in descriptive statistics, probability theory, and inferential statistics, along with brief explanations to help you apply them effectively.
Descriptive Statistics Formulas
Descriptive statistics summarize and describe the features of a dataset. These formulas help calculate central tendencies, variability, and positions within data.
Measures of Central Tendency
- Mean (Arithmetic Mean):
  $ \bar{x} = \frac{\sum x_i}{n} $
  The average of all data points.
- Median:
  The middle value when the data are sorted in ascending order; with an even number of values, the average of the two middle values.
- Mode:
  The value that appears most frequently in a dataset.
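As a quick illustration, the three measures can be computed with Python's standard `statistics` module. The sample values are made up for the example.

```python
import statistics

data = [2, 3, 3, 5, 7, 10]  # made-up sample

mean = statistics.mean(data)      # sum of values divided by n
median = statistics.median(data)  # averages the two middle values when n is even
mode = statistics.mode(data)      # most frequent value

print(mean, median, mode)
```

With an even number of observations, as here, the median falls between the two middle values (3 and 5).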
Measures of Variability
- Variance (Sample):
  $ s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} $
  Measures how spread out the data are around the mean.
- Standard Deviation (Sample):
  $ s = \sqrt{s^2} $
  The square root of the variance; indicates dispersion in the same units as the data.
- Range:
  $ \text{Range} = \max(x) - \min(x) $
  Difference between the largest and smallest values.
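The sketch below applies the three formulas directly to a small made-up sample, so each step of the arithmetic is visible. The variable names are illustrative only.

```python
import math

data = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up sample
n = len(data)
x_bar = sum(data) / n

# Sample variance: squared deviations from the mean, divided by n - 1
s2 = sum((x - x_bar) ** 2 for x in data) / (n - 1)
s = math.sqrt(s2)                    # sample standard deviation
data_range = max(data) - min(data)   # largest minus smallest value

print(round(s2, 3), round(s, 3), data_range)
```

Dividing by `n - 1` rather than `n` is what distinguishes the sample variance from the population variance.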
Position and Distribution
- Z-Score:
  $ z = \frac{x - \bar{x}}{s} $
  Indicates how many standard deviations a data point is from the mean.
- Coefficient of Variation (CV):
  $ CV = \frac{s}{\bar{x}} \times 100\% $
  Relative measure of variation, useful for comparing datasets with different units.
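A minimal sketch of both formulas on a made-up sample: z-scores for every observation, then the coefficient of variation as a percentage.

```python
import math

data = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up sample
n = len(data)
x_bar = sum(data) / n
s = math.sqrt(sum((x - x_bar) ** 2 for x in data) / (n - 1))

# z-score of each observation: distance from the mean in standard deviations
z_scores = [(x - x_bar) / s for x in data]

# Coefficient of variation, expressed as a percentage
cv_percent = s / x_bar * 100

print([round(z, 2) for z in z_scores], round(cv_percent, 2))
```

Because the z-scores are deviations from the sample mean, they sum to zero (up to rounding).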
Probability Formulas
Probability quantifies the likelihood of events. These foundational formulas are used in risk assessment, prediction models, and statistical inference.
Basic Probability Rules
- Probability of an Event:
  $ P(A) = \frac{\text{Number of favorable outcomes}}{\text{Total number of outcomes}} $
  Applies when all outcomes are equally likely.
- Complement Rule:
  $ P(A') = 1 - P(A) $
  The probability that event A does not occur.
- Addition Rule for Two Events:
  $ P(A \cup B) = P(A) + P(B) - P(A \cap B) $
  Used to find the probability of either A or B occurring.
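These rules can be checked on a small example. The sketch below assumes a single roll of a fair six-sided die; the events `A` and `B` are hypothetical choices for illustration, and `Fraction` keeps the arithmetic exact.

```python
from fractions import Fraction

# Hypothetical experiment: one roll of a fair six-sided die
sample_space = {1, 2, 3, 4, 5, 6}
A = {2, 4, 6}  # event: the roll is even
B = {4, 5, 6}  # event: the roll is greater than 3

def prob(event):
    """Favorable outcomes over total outcomes (equally likely outcomes assumed)."""
    return Fraction(len(event), len(sample_space))

p_a = prob(A)
p_not_a = 1 - p_a                           # complement rule
p_a_or_b = prob(A) + prob(B) - prob(A & B)  # addition rule

print(p_a, p_not_a, p_a_or_b)
```

Subtracting `prob(A & B)` avoids double-counting the outcomes 4 and 6, which belong to both events.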
Conditional Probability and Independence
- Conditional Probability:
  $ P(A|B) = \frac{P(A \cap B)}{P(B)} $
  Probability of A given that B has occurred (defined when $ P(B) > 0 $).
- Multiplication Rule:
  $ P(A \cap B) = P(A) \times P(B|A) $
  Used to find the probability of both A and B happening.
- Independent Events:
  If A and B are independent, then:
  $ P(A \cap B) = P(A) \times P(B) $
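A short sketch of conditional probability and an independence check, again assuming one roll of a fair die. The events and helper names (`cond_prob`, `independent`) are made up for this example.

```python
from fractions import Fraction

sample_space = {1, 2, 3, 4, 5, 6}  # one roll of a fair die (assumed)
A = {2, 4, 6}  # even
B = {4, 5, 6}  # greater than 3
C = {1, 2}     # at most 2

def prob(event):
    return Fraction(len(event), len(sample_space))

def cond_prob(a, b):
    """P(a | b) = P(a and b) / P(b)."""
    return prob(a & b) / prob(b)

def independent(a, b):
    """True when P(a and b) equals P(a) * P(b)."""
    return prob(a & b) == prob(a) * prob(b)

p_a_given_b = cond_prob(A, B)          # conditional probability
p_a_and_b = prob(A) * cond_prob(B, A)  # multiplication rule

print(p_a_given_b, p_a_and_b, independent(A, B), independent(A, C))
```

Here `A` and `B` are dependent (knowing the roll exceeds 3 makes an even roll more likely), while `A` and `C` happen to satisfy the product rule exactly.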
Bayes' Theorem
$
P(A|B) = \frac{P(B|A) \times P(A)}{P(B)}
$
Updates the probability of a hypothesis based on new evidence.
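A classic use is a diagnostic test. In this sketch all numbers are hypothetical, and $ P(B) $ is expanded with the law of total probability, $ P(B) = P(B|A)P(A) + P(B|A')P(A') $, a step the formula above leaves implicit.

```python
# Hypothetical diagnostic-test numbers (assumptions for illustration)
prior = 0.01           # P(A): prevalence of the condition
sensitivity = 0.95     # P(B|A): positive test given the condition
false_positive = 0.05  # P(B|A'): positive test without the condition

# P(B) via the law of total probability
evidence = sensitivity * prior + false_positive * (1 - prior)

# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
posterior = sensitivity * prior / evidence

print(round(posterior, 3))
```

Even with an accurate test, the posterior stays modest when the condition is rare, because false positives from the large healthy group dominate.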
Inferential Statistics Formulas
Inferential statistics let us make predictions or inferences about a population based on sample data.
Sampling and Estimation
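- Standard Error of the Mean (SEM):
  $ SEM = \frac{s}{\sqrt{n}} $
  Estimates the variability of the sample mean.
- Confidence Interval for the Mean:
  $ \bar{x} \pm z \times \frac{s}{\sqrt{n}} $
  Provides a range likely to contain the population mean at a chosen confidence level, where $ z $ is the standard normal critical value (e.g., 1.96 for 95%). This z-based form suits large samples; for small samples, a critical value from the t-distribution with $ n - 1 $ degrees of freedom is the usual substitute.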
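A minimal sketch of the interval using `statistics.NormalDist` from the standard library to obtain the critical value. The summary statistics are made up, and the z-based form assumes a reasonably large sample.

```python
import math
from statistics import NormalDist

# Made-up summary statistics
x_bar, s, n = 50.0, 10.0, 100
confidence = 0.95

sem = s / math.sqrt(n)  # standard error of the mean

# Two-sided critical value: the upper (1 - confidence) / 2 quantile
z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)

lower, upper = x_bar - z * sem, x_bar + z * sem
print(round(sem, 2), round(z, 3), round(lower, 2), round(upper, 2))
```

Raising the confidence level widens the interval, while a larger `n` narrows it through the standard error.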
Hypothesis Testing
- Test Statistic (Z-test):
  $ z = \frac{\bar{x} - \mu_0}{SEM} $
  Used to determine whether to reject the null hypothesis.
- P-Value:
  The probability of observing results at least as extreme as the current data, assuming the null hypothesis is true.
- Type I and Type II Errors:
  - Type I Error: Rejecting a true null hypothesis (False Positive).
  - Type II Error: Failing to reject a false null hypothesis (False Negative).
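The following sketch runs a two-sided z-test on made-up summary statistics. The significance level `alpha` and the two-sided alternative are assumptions of the example, not requirements of the formula.

```python
import math
from statistics import NormalDist

# Made-up summary statistics and null value
x_bar, mu_0, s, n = 52.0, 50.0, 10.0, 100
alpha = 0.05  # assumed significance level (the accepted Type I error rate)

sem = s / math.sqrt(n)
z = (x_bar - mu_0) / sem

# Two-sided p-value: probability of a |Z| at least as large as the observed |z|
p_value = 2 * (1 - NormalDist().cdf(abs(z)))
reject_null = p_value < alpha

print(round(z, 2), round(p_value, 4), reject_null)
```

A one-sided test would drop the factor of 2 and use only the tail in the direction of the alternative hypothesis.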
Correlation and Regression
- Pearson Correlation Coefficient (r):
  $ r = \frac{n\sum xy - (\sum x)(\sum y)}{\sqrt{[n\sum x^2 - (\sum x)^2][n\sum y^2 - (\sum y)^2]}} $
  Measures the strength and direction of the linear relationship between two variables.
- Simple Linear Regression:
  $ y = a + bx $
  Where:
  $ b = \frac{n\sum xy - (\sum x)(\sum y)}{n\sum x^2 - (\sum x)^2}, \quad a = \bar{y} - b\bar{x} $
  Models the relationship between a dependent and independent variable.
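The computational formulas for $ r $, $ b $, and $ a $ share most of their terms, which the sketch below makes explicit. The paired data are made up for illustration.

```python
import math

# Made-up paired observations
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)

sum_x, sum_y = sum(x), sum(y)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))
sum_x2 = sum(xi ** 2 for xi in x)
sum_y2 = sum(yi ** 2 for yi in y)

# Terms shared by the correlation and slope formulas
cov_term = n * sum_xy - sum_x * sum_y
var_x_term = n * sum_x2 - sum_x ** 2
var_y_term = n * sum_y2 - sum_y ** 2

r = cov_term / math.sqrt(var_x_term * var_y_term)  # Pearson correlation
b = cov_term / var_x_term                          # slope
a = sum_y / n - b * sum_x / n                      # intercept: y_bar - b * x_bar

print(round(r, 4), round(b, 2), round(a, 2))
```

The slope and the correlation always have the same sign, since both share the numerator `cov_term`.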
Probability Distributions
Normal Distribution
- Standard Normal Variable (Z):
  $ Z = \frac{X - \mu}{\sigma} $
  Converts any normal variable to the standard normal scale (mean 0, standard deviation 1).
- Empirical Rule:
  - About 68% of data lies within 1 standard deviation of the mean.
  - About 95% within 2 standard deviations.
  - About 99.7% within 3 standard deviations.
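The empirical-rule percentages can be recovered from the standard normal CDF. This sketch uses `statistics.NormalDist`; the N(100, 15) variable in the standardization step is a hypothetical example.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1

def within(k):
    """Probability that a normal variable falls within k standard deviations of its mean."""
    return std_normal.cdf(k) - std_normal.cdf(-k)

coverage = {k: within(k) for k in (1, 2, 3)}

# Standardizing a value from a hypothetical N(100, 15) variable
mu, sigma, x = 100, 15, 130
z = (x - mu) / sigma

print({k: round(v, 4) for k, v in coverage.items()}, z)
```

The exact values are slightly different from the rounded 68, 95, and 99.7 figures, which is why the rule is an approximation.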
Binomial Distribution
- Probability Mass Function:
$ P(X = k) = C(n, k) \times p^k \times (1-p)^{n-k} $
Where $ C(n, k) $ is the combination of n items taken k at a time.
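A minimal sketch of the mass function using `math.comb` for $ C(n, k) $. The coin-flip scenario (10 flips, fair coin) is an assumption for illustration.

```python
import math

def binom_pmf(k, n, p):
    """P(X = k) for a binomial variable with n trials and success probability p."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

# Hypothetical scenario: exactly 3 heads in 10 flips of a fair coin
prob_3_heads = binom_pmf(3, 10, 0.5)

# Sanity check: the probabilities over k = 0..n sum to 1
total = sum(binom_pmf(k, 10, 0.5) for k in range(11))

print(prob_3_heads, total)
```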
Conclusion
This formula sheet provides a solid foundation for tackling problems in statistics and probability. By mastering these equations, you can analyze data more effectively, interpret results accurately, and build a strong analytical skill set. Regular practice using these formulas will enhance your problem-solving abilities and lead to more informed decisions in research, business, and everyday life. Whether you're analyzing survey data, testing a new drug, or predicting trends, these tools provide the framework to turn numbers into insights.
Understanding these concepts also builds a foundation for advanced topics like machine learning, econometrics, and data science. As you apply these formulas, remember that statistics is not just about computation; it is about asking the right questions, interpreting results critically, and communicating findings clearly.
By mastering these essentials, you're not just solving problems—you're developing a data-driven mindset that empowers you to deal with an increasingly quantitative world. Keep practicing, stay curious, and let these tools guide you toward deeper understanding and smarter decisions.
Advanced Topics Worth Adding to Your Toolkit
1. Confidence Intervals for Proportions
When dealing with categorical data, the confidence interval for a population proportion $ p $ is often estimated using the normal approximation:
$ \hat{p} \pm z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}} $
where
- $ \hat{p} = \dfrac{x}{n} $ is the sample proportion,
- $ z_{\alpha/2} $ is the critical value from the standard normal distribution (e.g., 1.96 for a 95% confidence level), and
- $ n $ is the sample size.
If the sample size is small or $ \hat{p} $ is near 0 or 1, the Wilson or Agresti–Coull adjustments give more accurate intervals.
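The normal-approximation interval can be sketched in a few lines. The counts below are made up, and the code implements only the plain (Wald) form shown above, not the Wilson or Agresti–Coull adjustments.

```python
import math
from statistics import NormalDist

# Made-up survey result: 40 successes out of 100
x, n = 40, 100
confidence = 0.95

p_hat = x / n
z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # critical value z_{alpha/2}
margin = z * math.sqrt(p_hat * (1 - p_hat) / n)

lower, upper = p_hat - margin, p_hat + margin
print(round(lower, 3), round(upper, 3))
```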
2. Chi‑Square Tests
- Goodness‑of‑Fit Test: Checks whether observed frequencies $ O_i $ match expected frequencies $ E_i $ under a hypothesized distribution.
  $ \chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i} $
- Test of Independence (contingency tables): Assesses whether two categorical variables are independent.
  $ \chi^2 = \sum_{i=1}^{r}\sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}},\qquad E_{ij} = \frac{(R_i)(C_j)}{N} $
  where $ R_i $ and $ C_j $ are the marginal totals for row $ i $ and column $ j $, respectively, and $ N $ is the overall sample size.
Both tests rely on the chi‑square distribution, with $ k-1 $ degrees of freedom for the goodness‑of‑fit test and $ (r-1)(c-1) $ for the test of independence.
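The sketch below computes both statistics and their degrees of freedom from made-up counts. The function names are hypothetical, and turning the statistic into a p-value (which needs the chi-square distribution, e.g., from a statistics package) is left out.

```python
def chi_square_gof(observed, expected):
    """Goodness-of-fit statistic: sum of (O - E)^2 / E over the categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def chi_square_independence(table):
    """Statistic and degrees of freedom for an r x c table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / n  # expected count E_ij = R_i * C_j / N
            stat += (o - e) ** 2 / e
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, df

# Made-up data: 60 die rolls tested against a uniform expectation of 10 per face
gof_stat = chi_square_gof([5, 8, 9, 8, 10, 20], [10] * 6)
gof_df = 6 - 1

# Made-up 2 x 2 contingency table
ind_stat, ind_df = chi_square_independence([[10, 20], [30, 40]])

print(round(gof_stat, 2), gof_df, round(ind_stat, 4), ind_df)
```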
3. ANOVA (Analysis of Variance)
ANOVA extends the t‑test to compare means across three or more groups. The core idea is to partition total variability into between‑group and within‑group components.
$ F = \frac{\text{Mean Square Between}}{\text{Mean Square Within}} = \frac{SS_{\text{B}}/(k-1)}{SS_{\text{W}}/(N-k)} $
- $ SS_{\text{B}} $ – sum of squares due to the group means,
- $ SS_{\text{W}} $ – sum of squares within groups,
- $ k $ – number of groups,
- $ N $ – total number of observations.
If the computed $ F $ exceeds the critical value from the $ F $-distribution (with $ (k-1, N-k) $ degrees of freedom), we reject the null hypothesis that all group means are equal.
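To make the partition of variability concrete, this sketch computes the F statistic for three small made-up groups. The function name `one_way_anova_f` is hypothetical, and the comparison against a critical value is omitted.

```python
def one_way_anova_f(groups):
    """Return the F statistic and its (between, within) degrees of freedom."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total

    # Between-group sum of squares: group size times squared distance
    # of each group mean from the grand mean
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)

    # Within-group sum of squares: squared distance of each value from its group mean
    ss_within = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    return ms_between / ms_within, (k - 1, n_total - k)

# Made-up data: three groups of three observations each
f_stat, dfs = one_way_anova_f([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
print(round(f_stat, 2), dfs)
```

A large F indicates that the group means differ by more than the within-group scatter would suggest.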
4. Non‑Parametric Alternatives
When assumptions of normality or equal variances are violated, consider:
| Problem | Parametric Test | Non‑Parametric Counterpart |
|---|---|---|
| One‑sample location | One‑sample t | Sign test / Wilcoxon signed‑rank |
| Two independent samples | Independent t | Mann‑Whitney U |
| Paired samples | Paired t | Wilcoxon signed‑rank |
| More than two groups | ANOVA | Kruskal‑Wallis |
| Correlation | Pearson $ r $ | Spearman $ \rho $ or Kendall $ \tau $ |
These methods rely on ranks rather than raw data, preserving validity under weaker distributional assumptions.
5. Bayesian Basics
While the sheet focuses on frequentist inference, a quick Bayesian reminder can be useful:
$ \text{Posterior } p(\theta | \text{data}) = \frac{\text{Likelihood } p(\text{data} | \theta) \times \text{Prior } p(\theta)}{\text{Evidence } p(\text{data})} $
Key concepts:
- Prior – belief about $ \theta $ before seeing data.
- Likelihood – probability of observing the data given $ \theta $.
- Posterior – updated belief after incorporating the data.
Credible intervals (the Bayesian analogue of confidence intervals) are derived directly from the posterior distribution.
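A minimal sketch of the update over a small discrete set of candidate values for $ \theta $. The coin-flip setup, the three candidate values, and the uniform prior are all assumptions chosen for illustration; the binomial coefficient is omitted from the likelihood because it cancels in the normalization.

```python
# Hypothetical setup: theta is a coin's probability of heads,
# restricted to three candidate values with a uniform prior
thetas = [0.25, 0.50, 0.75]
prior = [1 / 3, 1 / 3, 1 / 3]
heads, flips = 3, 4  # made-up data

# Likelihood of the data under each theta (up to a constant factor)
likelihood = [t ** heads * (1 - t) ** (flips - heads) for t in thetas]

unnormalized = [l * p for l, p in zip(likelihood, prior)]
evidence = sum(unnormalized)  # normalizing constant, the role played by p(data)
posterior = [u / evidence for u in unnormalized]

print([round(p, 3) for p in posterior])
```

After seeing mostly heads, the posterior shifts weight toward the larger values of $ \theta $.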
6. Time‑Series Essentials
If your data are ordered in time, the following quick formulas are handy:
- Moving Average (MA) (order $ q $):
  $ \hat{y}_t = \frac{1}{q}\sum_{i=0}^{q-1} y_{t-i} $
- Exponential Smoothing (simple version):
  $ \hat{y}_{t+1} = \alpha y_t + (1-\alpha)\hat{y}_t,\qquad 0<\alpha<1 $
- AR(1) Model (autoregressive of order 1):
  $ y_t = \phi y_{t-1} + \varepsilon_t,\qquad |\phi|<1 $
These tools lay the groundwork for more sophisticated models such as ARIMA, state‑space, or GARCH.
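The first two formulas can be sketched as follows. The series is made up, the function names are hypothetical, and initializing the smoothed forecast with the first observation is an assumption of this example (other initializations are common).

```python
def moving_average(y, q):
    """Average of the q most recent observations."""
    return sum(y[-q:]) / q

def exp_smooth_forecast(y, alpha):
    """One-step-ahead forecast from simple exponential smoothing."""
    forecast = y[0]  # assumption: start from the first observation
    for obs in y:
        # New forecast = alpha * latest observation + (1 - alpha) * previous forecast
        forecast = alpha * obs + (1 - alpha) * forecast
    return forecast

series = [10, 12, 14, 16, 18]  # made-up time series

ma_3 = moving_average(series, q=3)
next_forecast = exp_smooth_forecast(series, alpha=0.5)

print(ma_3, next_forecast)
```

A larger `alpha` makes the forecast react faster to recent observations, while a smaller one smooths more heavily.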
7. Sample Size Determination
Before collecting data, a rough estimate of the required sample size helps ensure adequate power.
For estimating a mean with margin of error $ E $:
$ n = \left(\frac{z_{\alpha/2}\sigma}{E}\right)^2 $
For comparing two proportions with desired power $ 1-\beta $ (sample size per group):
$ n = \frac{ \bigl[ z_{\alpha/2}\sqrt{2\bar{p}(1-\bar{p})} + z_{\beta}\sqrt{p_1(1-p_1)+p_2(1-p_2)} \bigr]^2 }{(p_1-p_2)^2} $
where $ \bar{p} = (p_1+p_2)/2 $.
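Both formulas translate directly into code. In this sketch the function names and input values are hypothetical, results are rounded up to the next whole observation, and $ z_{\beta} $ is taken as the standard normal quantile at the desired power.

```python
import math
from statistics import NormalDist

def n_for_mean(sigma, margin, confidence=0.95):
    """Sample size to estimate a mean within the given margin of error."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return math.ceil((z * sigma / margin) ** 2)

def n_for_two_proportions(p1, p2, alpha=0.05, power=0.80):
    """Per-group sample size to detect a difference between two proportions."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)  # quantile matching power = 1 - beta
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(numerator / (p1 - p2) ** 2)

# Made-up planning inputs
n_mean = n_for_mean(sigma=15, margin=3)
n_prop = n_for_two_proportions(0.5, 0.4)

print(n_mean, n_prop)
```

Halving the margin of error or the detectable difference roughly quadruples the required sample size, since both appear squared in the denominator.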
Practical Tips for Using the Sheet
- Keep It Visible – Print a single‑sided version and tape it near your workspace. The visual cue reinforces recall.
- Annotate on the Fly – When you encounter a problem that tweaks a standard formula (e.g., a weighted mean), jot a quick note in the margin. Over time you’ll build a personalized “cheat‑layer.”
- Cross‑Check Units – A common source of error is mismatched measurement units; always verify that the numerator and denominator share the same scale before plugging numbers into a formula.
- Validate Assumptions – Before applying a test, perform a quick diagnostic (e.g., Shapiro‑Wilk for normality, Levene’s test for equal variances). If assumptions fail, switch to the appropriate non‑parametric alternative.
- Use Software as a Calculator – Modern statistical packages (R, Python’s SciPy, Stata, SPSS) implement these formulas under the hood. Knowing the underlying mathematics lets you interpret the output correctly and troubleshoot unexpected results.
Final Thoughts
Statistical formulas are more than a collection of symbols; they embody a disciplined way of thinking about uncertainty, variation, and evidence. By internalizing the core equations—means, variances, hypothesis‑testing mechanics, correlation, regression, and the major probability distributions—you acquire a versatile analytical language that translates raw numbers into meaningful narratives.
Remember that mastery comes from application, not memorization alone. Work through real‑world datasets, test each method, and reflect on why a particular test succeeded or fell short. As you expand into advanced areas such as multivariate analysis, Bayesian inference, or machine‑learning pipelines, these fundamentals will remain your anchor.
In short, treat this formula sheet as a launchpad: let it accelerate your learning, guide your investigations, and, most importantly, keep you asking the right questions. With practice, the symbols will become second nature, and you’ll be equipped to turn data into decisive, data‑driven action. Happy analyzing!
Conclusion
Working with statistical formulas calls for a balance of precision and adaptability in the face of uncertainty. As you apply these tools, you’ll encounter scenarios where rigid formulas fall short and methods must be refined or results interpreted with nuance. While equations provide the scaffolding for analysis, their true value emerges when paired with critical thinking and a willingness to question assumptions. This iterative process of testing, refining, and re-evaluating is where statistical literacy truly deepens.
Beyond technical proficiency, these formulas cultivate a mindset of evidence-based reasoning. They remind us that data is not merely numbers to crunch but a narrative to decode, requiring clarity on what we seek to learn and why. Whether designing experiments, validating hypotheses, or communicating findings, the principles embedded in these equations guide us to ask: *What does this data reveal? What uncertainties remain?*
As statistical methodologies evolve, integrating machine learning, AI, or novel computational techniques, the foundational formulas remain relevant. They underpin algorithms, validate models, and inform decisions in an increasingly data-driven world. Mastery of these basics ensures you can work through both traditional and modern approaches with a grounded understanding of their strengths and limitations.
Ultimately, the formula sheet is not an endpoint but a starting point. It equips you to move beyond rote calculation, fostering a habit of analytical rigor and intellectual curiosity. By applying formulas with purpose while remaining open to innovation, you position yourself to tackle complex problems with confidence. As you continue your exploration, let these tools inspire not just accuracy, but insight. In the end, statistics is less about memorizing symbols and more about harnessing them to transform uncertainty into knowledge. Keep building, keep questioning, and let data illuminate your path forward.