Summary of Probability

Axiom, condition, IID, DRV, distribution...

Posted by Hao Xu on September 21, 2018

1. Axioms of Probability


The set of all possible outcomes of an experiment is known as the sample space of the experiment, denoted by $S$.


Any subset $E$ of the sample space is known as an event.


If $E_1, E_2, …$ are events, the union of these events, denoted by $\bigcup_{n=1}^\infty E_n$, is defined to be the event consisting of all outcomes that are in $E_n$ for at least one value of $n=1, 2, …$.

The intersection of the events $E_n$, denoted by $\bigcap _{n=1}^\infty E_n$, is defined to be the event consisting of those outcomes which are in all of the events $E_n, n=1, 2, …$.


The complement of $E$, denoted by $E^c$, consists of all outcomes in the sample space $S$ that are not in $E$.

  • $E^c$ occurs iff $E$ does not occur.
  • $E \cup E^c = S$
  • $S^c = \emptyset$

Theorem The DeMorgan’s Laws

\[(\bigcup_{i=1}^n E_i)^c = \bigcap_{i=1}^n E_i^c\] \[(\bigcap_{i=1}^n E_i)^c = \bigcup^n_{i=1} E_i^c\]
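Both laws can be checked mechanically on a small finite sample space. The sketch below is only an illustration: the sample space `S` and the three events are hypothetical values chosen for the example.

```python
# Hypothetical finite sample space and events, chosen only for illustration.
S = set(range(10))
events = [{0, 1, 2, 3}, {2, 3, 4, 5}, {3, 5, 7, 9}]


def complement(E):
    """Outcomes of S that are not in E."""
    return S - E


union_all = set().union(*events)
intersection_all = set(S).intersection(*events)

# (E_1 ∪ ... ∪ E_n)^c equals E_1^c ∩ ... ∩ E_n^c
first_law = complement(union_all) == set(S).intersection(
    *(complement(E) for E in events)
)
# (E_1 ∩ ... ∩ E_n)^c equals E_1^c ∪ ... ∪ E_n^c
second_law = complement(intersection_all) == set().union(
    *(complement(E) for E in events)
)

print(first_law, second_law)  # True True
```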


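The probability of the event $E$ can be defined as the limiting relative frequency \(P(E) = \lim_{n \rightarrow \infty} \cfrac{n(E)}{n}\), where $n(E)$ is the number of times $E$ occurs in the first $n$ repetitions of the experiment. In the axiomatic approach, for each event $E$ of the sample space $S$, we instead assume that a number $P(E)$ is defined and satisfies the following three axioms: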

Axiom 1

\[0 \leq P(E) \leq 1\]

Axiom 2

\[P(S) = 1\]

Axiom 3

For any sequence of mutually exclusive events $E_1, E_2, …$ (that is, events for which $E_iE_j = \emptyset$ when $i \neq j$),

\[P(\bigcup_{i=1}^\infty E_i) = \sum_{i=1}^\infty P(E_i)\]


\[P(E^c) = 1 - P(E)\]


If $E \subset F$, then $P(E) \leq P(F)$.


\[P(E \cup F) = P(E) + P(F) - P(EF)\]


\[\begin{eqnarray*} P(E_1 \cup E_2 \cup \cdots \cup E_n) &=& \sum_{i=1}^n P(E_i) - \sum_{i_1<i_2} P(E_{i_1}E_{i_2}) + \cdots \\ &+& (-1)^{r+1} \sum_{i_1<i_2<\cdots<i_r}P(E_{i_1}E_{i_2} \cdots E_{i_r}) \\ &+& \cdots + (-1)^{n+1}P(E_1E_2 \cdots E_n) \end{eqnarray*}\]
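The inclusion-exclusion identity can be verified numerically. The following is a minimal sketch assuming a finite sample space with equally likely outcomes; the sets and the helper names `prob` and `inclusion_exclusion` are hypothetical, introduced only for this example.

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical sample space of 12 equally likely outcomes and three events.
S = set(range(12))
events = [{0, 1, 2, 3, 4}, {3, 4, 5, 6}, {4, 6, 8, 10}]


def prob(E):
    """P(E) under the equally-likely-outcomes assumption."""
    return Fraction(len(E), len(S))


def inclusion_exclusion(events):
    """Alternating sum over all r-fold intersections, r = 1, ..., n."""
    total = Fraction(0)
    for r in range(1, len(events) + 1):
        sign = (-1) ** (r + 1)
        for subset in combinations(events, r):
            total += sign * prob(set.intersection(*subset))
    return total


direct = prob(set().union(*events))
via_formula = inclusion_exclusion(events)
print(direct, via_formula)  # 3/4 3/4
```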

2. Conditional Probability


The conditional probability that $E$ occurs given that $F$ has occurred is denoted by $P(E|F)$. If $P(F)>0$, then

\[P(E|F)= \cfrac{P(EF)}{P(F)}\]

Theorem The multiplication rule

\[P(E_1E_2E_3...E_n) = P(E_1)P(E_2|E_1)P(E_3|E_1E_2)...P(E_n|E_1...E_{n-1})\]


Let $E$ and $F$ be events. We can express $E$ as

\[E = EF \ \cup \ EF^c\]

Since $EF$ and $EF^c$ are mutually exclusive, by Axiom 3 we have

\[P(E) = P(E|F)P(F)\ +\ P(E|F^c)[1\ -\ P(F)]\]


The odds of an event $A$ are defined by

\[\cfrac{P(A)}{P(A^c)} = \cfrac{P(A)}{1\ -\ P(A)}\]

This tells how much more likely it is that the event $A$ occurs than that it does not occur. If the odds are equal to $\alpha$, then it is common to say that the odds are “$\alpha$ to 1” in favor of the hypothesis.

Consider a hypothesis $H$ that is true with probability $P(H)$. The new odds of $H$ after the evidence $E$ has been observed are

\[\cfrac{P(H|E)}{P(H^c|E)} = \cfrac{P(H)}{P(H^c)} \cfrac{P(E|H)}{P(E|H^c)}\]

Theorem Bayes’s formula

\[P(F_j|E) = \cfrac{P(EF_j)}{P(E)}= \cfrac{P(E|F_j)P(F_j)}{\sum_{i=1}^n P(E|F_i)P(F_i)}\]

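Here $F_1, …, F_n$ are mutually exclusive events whose union is $S$. Bayes’s formula shows us how to use new evidence to modify existing opinions:

  • $P(F_j|E)$: the posterior probability of $F_j$, that is, the probability that $F_j$ occurred given that $E$ is true.
  • $P(E|F_j)$: the probability of observing the evidence $E$ given that $F_j$ occurred.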
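A small numeric sketch may make the update concrete. All numbers below are hypothetical: two hypotheses $F_1, F_2$ with priors $0.01$ and $0.99$, and evidence $E$ with $P(E|F_1) = 0.95$ and $P(E|F_2) = 0.05$. The helper `bayes_posterior` is a name introduced for this example.

```python
from fractions import Fraction

# Hypothetical priors P(F_1), P(F_2) and likelihoods P(E|F_1), P(E|F_2).
prior = [Fraction(1, 100), Fraction(99, 100)]
likelihood = [Fraction(95, 100), Fraction(5, 100)]


def bayes_posterior(prior, likelihood):
    """Return [P(F_j|E)] using P(E) = sum_i P(E|F_i) P(F_i)."""
    joint = [l * p for l, p in zip(likelihood, prior)]  # P(E F_j)
    p_e = sum(joint)                                    # P(E)
    return [j / p_e for j in joint]


posterior = bayes_posterior(prior, likelihood)

# Odds form: posterior odds = prior odds * likelihood ratio.
prior_odds = prior[0] / prior[1]
posterior_odds = prior_odds * likelihood[0] / likelihood[1]

print(posterior[0], posterior_odds)  # 19/118 19/99
```

Even with strong evidence, the posterior probability of $F_1$ stays below $1/5$ here because the prior is so small.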

3. Independence


Two events $E$ and $F$ are said to be independent if the following equation holds.

\[P(EF) = P(E)P(F)\]

Two events $E$ and $F$ that are not independent are said to be dependent.


If $E$ and $F$ are independent, then so are $E$ and $F^c$.


The events $E_1, E_2, …, E_n$ are said to be independent if, for every subset $E_{1'}, E_{2'}, …, E_{r'}$, $r \leq n$, of these events,

\[P(E_{1'}E_{2'} \cdots E_{r'}) = P(E_{1'})P(E_{2'}) \cdots P(E_{r'})\]
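The requirement that the product rule hold for every subset matters: events can be pairwise independent without being independent. The sketch below uses a standard illustration, two fair coin tosses, with event names `A`, `B`, `C` chosen for this example.

```python
from fractions import Fraction
from itertools import combinations, product
from math import prod

# Sample space of two fair coin tosses; all four outcomes are equally likely.
S = set(product("HT", repeat=2))


def prob(E):
    return Fraction(len(E), len(S))


A = {s for s in S if s[0] == "H"}          # first toss is heads
B = {s for s in S if s[1] == "H"}          # second toss is heads
C = {s for s in S if s.count("H") == 1}    # exactly one head


def independent(events):
    """Check the product rule for every subset of size >= 2."""
    for r in range(2, len(events) + 1):
        for subset in combinations(events, r):
            joint = prob(set.intersection(*subset))
            if joint != prod(prob(E) for E in subset):
                return False
    return True


pairwise = all(independent(list(pair)) for pair in combinations([A, B, C], 2))
mutual = independent([A, B, C])
print(pairwise, mutual)  # True False
```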


Conditional probabilities satisfy all of the properties of ordinary probabilities.

(a) $0 \leq P(E|F) \leq 1$

(b) $P(S|F) = 1$

(c) If $E_i,\ i=1, 2, …$, are mutually exclusive events, then

\[P(\bigcup_1^\infty E_i|F) = \sum_1^\infty P(E_i|F)\]

4. Discrete Random Variables


Definition A random variable X is a function from the sample space S to the set of real numbers R:

\[X: S \rightarrow R\]


For a discrete random variable $X$, we define the probability mass function $p(a)$ of $X$ by

\[p(a) = P(X=a)\]

$X$ must take on one of the values $x_i$ for $i=1, 2, …$, and we have

\[\begin{eqnarray*} p(x_i) &\geq& 0 \qquad \text{for } i=1, 2, \ldots \\ p(x) &=& 0 \qquad \text{for all other values of } x \\ \sum_{i=1}^\infty p(x_i) &=& 1 \end{eqnarray*}\]


If $X$ is a discrete random variable having a probability mass function $p(x)$, then the expectation, or the expected value, of $X$, denoted by $E[X]$, is defined by

\[E[X] = \sum_{x:p(x)>0}xp(x)\]

$E[X]$ is also referred to as the mean or the first moment of $X$. The quantity $E[X^n], n \geq 1$, is called the $n$th moment of $X$.


Suppose that the sample space $S$ is either finite or countably infinite. For a random variable $X$, let $X(s)$ denote the value of $X$ when $s \in S$ is the outcome of the experiment, and let $p(s)$ denote the probability of the outcome $s$. Then

\[E[X] = \sum_{s \in S} X(s) p(s)\]

It follows that the expected value of a sum of random variables is equal to the sum of their expectations.


We say that $I$ is an indicator variable for the event $A$ if

\[I=\begin {cases} 1, & if\ A\ occurs \\ 0, & if\ A^c\ occurs \end {cases}\]

Then $E[I] = P(A)$.


If $X$ is a discrete random variable that takes on one of the values $x_i, i \geq 1$, with respective probabilities $p(x_i)$, then, for any real-valued function $g$,

\[E[g(X)]=\sum_i g(x_i)p(x_i)\]


If $a$ and $b$ are constants, then

\[E[aX + b] = aE[X] + b\]


If $X$ is a random variable with mean $\mu$, then the variance of $X$, denoted by $Var(X)$, is defined by

\[Var(X) = E[(X − \mu)^2]\]

An alternative formula for $Var(X)$ is derived as follows:

\[Var(X) = E[(X - \mu)^2] = E[X^2 - 2\mu X + \mu^2] = E[X^2] - 2\mu E[X] + \mu^2 = E[X^2] - (E[X])^2\]


For any constants $a$ and $b$

\[Var(aX + b) = a^2Var(X)\]
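The identities for expectation and variance can be checked exactly on a small example. The sketch below assumes $X$ is the outcome of a fair six-sided die; the helper names `expectation` and `variance` and the constants `a`, `b` are chosen for illustration.

```python
from fractions import Fraction

# Probability mass function of a fair die (an assumed example distribution).
pmf = {x: Fraction(1, 6) for x in range(1, 7)}


def expectation(pmf, g=lambda x: x):
    """E[g(X)] = sum_i g(x_i) p(x_i)."""
    return sum(g(x) * p for x, p in pmf.items())


def variance(pmf):
    """Var(X) = E[(X - mu)^2]."""
    mu = expectation(pmf)
    return expectation(pmf, lambda x: (x - mu) ** 2)


mean = expectation(pmf)                                # 7/2
var = variance(pmf)                                    # 35/12
var_alt = expectation(pmf, lambda x: x ** 2) - mean ** 2

# Distribution of aX + b for hypothetical constants a = 3, b = 2.
a, b = 3, 2
pmf_lin = {a * x + b: p for x, p in pmf.items()}

print(mean, var, var_alt)                              # 7/2 35/12 35/12
print(expectation(pmf_lin), variance(pmf_lin))         # 25/2 105/4
```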

Definition 4.5

The square root of the $Var(X)$ is called the standard deviation of $X$, and we denote it by $SD(X)$. That is,

\[SD(X) = \sqrt{Var(X)}\]


Analogous to the means being the center of gravity of a distribution of mass, the variance represents, in the terminology of mechanics, the moment of inertia.

5. The Bernoulli and Binomial Random Variables


Suppose now that $n$ independent trials, each of which results in a success with probability $p$ and in a failure with probability $1 − p$, are to be performed. If $X$ represents the number of successes that occur in the $n$ trials, then $X$ is said to be a binomial random variable with parameters $(n, p)$, and its probability mass function is given by

\[p(i) = \left(\begin{matrix} n\\i \end{matrix} \right) p^i(1\ -\ p)^{n-i} \qquad i=0, 1, ..., n\]


A random variable $X$ is said to be a Bernoulli random variable if its probability mass function is given by the following equations for some $p \in (0, 1)$:

\[\begin{eqnarray*} p(0) &=& P\{X = 0\} = 1 - p \\ p(1) &=& P\{X = 1\} = p \end{eqnarray*}\]

A Bernoulli random variable is just a binomial random variable with parameters $(1, p)$.


The moments of a binomial random variable $X$ with parameters $n$ and $p$ satisfy

\[E[X^k] = \sum_{i=0}^n i^k \left(\begin{matrix} n\\i \end{matrix}\right) p^i (1 - p)^{n-i} = npE[(Y + 1)^{k-1}]\]

where $Y$ is a binomial random variable with parameters $n-1$ and $p$. Setting $k=1$ gives $E[X] = np$, and setting $k=2$ gives $E[X^2] = np[(n-1)p + 1]$, so that

\[E[X] = np\] \[Var(X)= E[X^2] - (E[X])^2 = np(1 - p)\]


If $X$ is a binomial random variable with parameters $(n, p)$, where $0 < p < 1$, then as $k$ goes from $0$ to $n$, $P\{X = k\}$ first increases monotonically and then decreases monotonically, reaching its largest value when $k$ is the largest integer less than or equal to $(n + 1)p$. This follows from the relation

\[P\{X = k + 1\} = \cfrac{p}{1 - p} \cfrac{n - k}{k + 1} P\{X = k\}\]


The binomial distribution function is

\[P\{X\leq i\} = \sum_{k=0}^i \left(\begin{matrix} n\\k \end{matrix}\right) p^k (1 − p)^{n−k} \qquad i = 0, 1,... , n\]
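The binomial formulas in this section can be checked exactly with rational arithmetic. The sketch below uses hypothetical parameters $n = 10$, $p = 3/10$; the function names are introduced for this example.

```python
from fractions import Fraction
from math import comb, floor


def binomial_pmf(n, p):
    """List of P{X = i} for i = 0, ..., n."""
    return [comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(n + 1)]


def binomial_cdf(pmf, i):
    """P{X <= i} as a partial sum of the mass function."""
    return sum(pmf[: i + 1])


# Hypothetical parameters chosen for illustration.
n, p = 10, Fraction(3, 10)
pmf = binomial_pmf(n, p)

mean = sum(i * q for i, q in enumerate(pmf))
var = sum(i * i * q for i, q in enumerate(pmf)) - mean ** 2
mode = max(range(n + 1), key=lambda k: pmf[k])

# Recursion P{X = k+1} = p/(1-p) * (n-k)/(k+1) * P{X = k}
recursion_ok = all(
    pmf[k + 1] == p / (1 - p) * Fraction(n - k, k + 1) * pmf[k] for k in range(n)
)

print(sum(pmf), mean, var)             # 1 3 21/10
print(mode, floor((n + 1) * p))        # 3 3
print(recursion_ok)                    # True
```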

6. Continuous Random Variable


We say that $X$ is a continuous random variable if there exists a nonnegative function $f$, defined for all real $x \in (-\infty, \infty)$, having the property that, for any set $B$ of real numbers,

\[P\{X ∈ B\} = \int_B f(x)\ dx\]

Since X must assume some value, f, called the probability density function of the random variable X, must satisfy

\[1 = P\{X ∈ (−\infty, \infty)\} = \int_{−\infty}^\infty f(x)\ dx\]

7. Distribution Function


If $X$ is a random variable, its distribution function is a function $F_X: \mathbb{R} \rightarrow [0, 1]$ such that \(F_X(x) = P(X \leq x) \qquad \forall x \in \mathbb{R}\) where $P(X \leq x)$ is the probability that $X$ is less than or equal to $x$.


Every distribution function enjoys the following four properties:

  • Increasing
\[F_X(x_1) \leq F_X(x_2) \qquad if\ x_1 < x_2\]
  • Right-continuous
\[\lim_{t \rightarrow x^+} F_X(t) = F_X(x)\]
  • Limit at minus infinity
\[\lim_{x \rightarrow -\infty} F_X(x) = 0\]
  • Limit at plus infinity
\[\lim_{x \rightarrow \infty} F_X(x) = 1\]


If $X$ is continuous, then its distribution function $F$ will be differentiable and

\[\cfrac{d}{dx} F(x) = f(x)\]


The expected value of $X$ is defined by

\[E[X] = \int_{−\infty}^\infty xf(x)\ dx\]


For any real-valued function $g$

\[E[g(X)] = \int_{−\infty}^\infty g(x)f(x)\ dx\]


For a nonnegative random variable Y,

\[E[Y] = \int_0^\infty P(Y>y)\ dy\]


If a and b are constants, then

\[E[aX + b] = aE[X] + b\]


The variance of random variable $X$ with expected value μ is defined by

\[Var(X) = E[(X − μ)^2] = E[X^2] − (E[X])^2\]

9. The Uniform Random Variable


A random variable is said to be uniformly distributed over the interval (0, 1) if its probability density function is given by

\[f(x)=\begin {cases} 1, & 0 < x < 1 \\ 0, & otherwise \end {cases}\]

For any $0 < a < b < 1$,

\[P\{a \leq X \leq b\} = \int_a^b f (x) dx = b − a\]


We say that $X$ is a uniform random variable on the interval $(\alpha, \beta)$ if the probability density function of $X$ is given by

\[f(x)=\begin {cases} \cfrac{1}{\beta - \alpha}, & \alpha < x < \beta \\ 0, & otherwise \end {cases}\]


The (cumulative) distribution function of a uniform random variable on the interval $(\alpha, \beta)$ is given by

\[F(x)=\begin {cases} 0, & x \leq \alpha \\ \cfrac{x - \alpha}{\beta - \alpha}, & \alpha < x < \beta \\ 1, & x \geq \beta \end {cases}\]


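The expected value and variance of a uniform random variable on $(\alpha, \beta)$ are

\[\begin {eqnarray*} E[X] &=& \cfrac{\alpha+\beta}{2} \\ Var(X) &=& \cfrac{(\beta-\alpha)^2}{12} \end {eqnarray*}\]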
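These two values follow directly from the density $1/(\beta - \alpha)$ on $(\alpha, \beta)$. The derivation, sketched here for completeness, uses $E[g(X)] = \int g(x)f(x)\ dx$ and $Var(X) = E[X^2] - (E[X])^2$:

\[\begin{eqnarray*} E[X] &=& \int_\alpha^\beta \cfrac{x}{\beta - \alpha}\ dx = \cfrac{\beta^2 - \alpha^2}{2(\beta - \alpha)} = \cfrac{\alpha + \beta}{2} \\ E[X^2] &=& \int_\alpha^\beta \cfrac{x^2}{\beta - \alpha}\ dx = \cfrac{\beta^3 - \alpha^3}{3(\beta - \alpha)} = \cfrac{\alpha^2 + \alpha\beta + \beta^2}{3} \\ Var(X) &=& \cfrac{\alpha^2 + \alpha\beta + \beta^2}{3} - \cfrac{(\alpha + \beta)^2}{4} = \cfrac{(\beta - \alpha)^2}{12} \end{eqnarray*}\]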

10. Normal Random Variable


We say that $X$ is a normal random variable, or simply that $X$ is normally distributed, with parameters $\mu$ and $\sigma^2$ if the density of $X$ is given by

\[f(x) = \cfrac{1}{\sqrt{2\pi}\, \sigma}e^{-(x-\mu)^2 / 2\sigma^2} \qquad -\infty<x<\infty\]


If $X$ is normally distributed with parameters $\mu$ and $\sigma^2$, then $Y = aX + b$ (with $a \neq 0$) is normally distributed with parameters $a\mu+b$ and $a^2\sigma^2$. In particular,


$Z = (X − \mu)/\sigma$ is normally distributed with parameters $0$ and $1$. Such a random variable is said to be a standard (unit) normal random variable.


It is customary to denote the cumulative distribution function of a standard normal random variable by $\Phi(x)$. That is,

\[\Phi(x) = \cfrac{1}{\sqrt{2\pi}} \int_{-\infty}^x e^{-y^2/2} \ dy\]


\[\begin{eqnarray*} E(X) &=& \mu \\ Var(X) &=& \sigma^2 \end{eqnarray*}\]

The DeMoivre-Laplace Limit Theorem

If $S_n$ denotes the number of successes that occur when $n$ independent trials, each resulting in a success with probability $p$, are performed, then, for any $a<b$,

\[P \left\{ a \leq \cfrac{S_n - np}{\sqrt{np(1-p)}} \leq b \right\} \rightarrow \ \Phi(b) - \Phi(a)\]

as $n \rightarrow \infty$.

In other words, the probability distribution function of a binomial random variable with parameters $n$ and $p$ can be approximated by that of a normal random variable having mean $np$ and variance $np(1 − p)$.
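The quality of the approximation can be inspected numerically. The sketch below is an illustration under assumed parameters ($n = 400$, $p = 0.5$, $a = -1$, $b = 1$); it computes $\Phi$ from the error function via $\Phi(x) = \frac{1}{2}[1 + \mathrm{erf}(x/\sqrt{2})]$ and compares it with the exact binomial sum. The function names are hypothetical.

```python
from math import ceil, comb, erf, floor, sqrt


def phi(x):
    """Standard normal distribution function, via the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))


def binomial_prob(n, p, lo, hi):
    """Exact P{lo <= S_n <= hi} for a binomial (n, p) random variable."""
    return sum(comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(lo, hi + 1))


# Assumed parameters for the illustration.
n, p = 400, 0.5
mu = n * p
sigma = sqrt(n * p * (1 - p))
a, b = -1.0, 1.0

# a <= (S_n - np)/sigma <= b  is the same event as  mu + a*sigma <= S_n <= mu + b*sigma
lo, hi = ceil(mu + a * sigma), floor(mu + b * sigma)

exact = binomial_prob(n, p, lo, hi)
approx = phi(b) - phi(a)
print(f"exact = {exact:.4f}, normal approximation = {approx:.4f}")
```

For finite $n$ the two values differ slightly because the binomial distribution is discrete; the theorem only asserts that the gap vanishes as $n \rightarrow \infty$.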

11. The Distribution of a Function of a Random Variable


Let $X$ be a continuous random variable having probability density function $f_X$. Suppose that $g(x)$ is a strictly monotonic (increasing or decreasing), differentiable (and thus continuous) function of $x$. Then the random variable $Y$ defined by $Y = g(X)$ has a probability density function given by

\(f_Y(y) = \begin{cases} f_X[g^{-1}(y)] \left| \cfrac{d}{dy} g^{-1}(y) \right|, & y = g(x)\ for\ some\ x \\ 0, & y \neq g(x)\ for\ all\ x \end{cases}\) where $g^{-1}(y)$ is defined to equal that value of $x$ such that $g(x)=y$
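As a worked illustration (the choice of $X$ and $g$ is an assumption made for this example), let $X$ be uniformly distributed over $(0, 1)$ and let $g(x) = x^2$, which is strictly increasing and differentiable on $(0, 1)$. Then $g^{-1}(y) = \sqrt{y}$, and the formula gives

\[f_Y(y) = f_X(\sqrt{y}) \left| \cfrac{d}{dy} \sqrt{y} \right| = \cfrac{1}{2\sqrt{y}} \qquad 0 < y < 1\]

and $f_Y(y) = 0$ otherwise. As a check, $\int_0^1 \cfrac{1}{2\sqrt{y}}\ dy = 1$, as required of a density.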

12. Joint Distribution Function


For any two random variables $X$ and $Y$, the joint cumulative probability distribution function of $X$ and $Y$ is defined by

\[F(a, b) = P\{X \leq a, Y \leq b \} \qquad -\infty<a, b< \infty\]


The marginal distributions of $X$ and $Y$ are defined by

\[\begin{eqnarray*} &F_X(x)& = P\{X \leq x \} = \lim_{y \rightarrow \infty} F(x, y) \\ &F_Y(y)& = P\{Y \leq y \} = \lim_{x \rightarrow \infty} F(x, y) \end{eqnarray*}\]


\[P\{ a_1 < X \leq a_2, b_1 < Y \leq b_2 \} = F(a_2, b_2) + F(a_1, b_1) - F(a_1, b_2) - F(a_2, b_1)\]


In the case when $X$ and $Y$ are both discrete random variables, it is convenient to define the joint probability mass function of $X$ and $Y$ by

\[p(x,y) = P\{X=x, Y=y \}\]


The probability mass functions of $X$ and of $Y$ can be obtained from $p(x, y)$ by

\[\begin{eqnarray*} &p_X&(x) = P\{X=x \} = \sum_{y:p(x,y)>0} p(x, y) \\ &p_Y&(y) = P\{Y=y \} = \sum_{x:p(x,y)>0} p(x, y) \end{eqnarray*}\]

13. Independent Random Variables


The random variables $X$ and $Y$ are said to be independent if, for any two sets of real numbers $A$ and $B$,

\[P\{X \in A, Y \in B\} = P\{X \in A\}P\{Y \in B\}\]

When $X$ and $Y$ are discrete random variables, the condition of independence is equivalent to

\[p(x, y) = p_X(x)p_Y(y) \qquad for\ all\ x, y\]

For continuous random variables $X$ and $Y$, the condition of independence is equivalent to

\[F(a, b) = F_X(a)F_Y(b) \qquad for\ all\ a, b\]

or, in terms of the densities,

\[f(x, y) = f_X(x)f_Y(y) \qquad for\ all\ x, y\]

Random Variables that are not independent are said to be dependent.
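For discrete random variables, both the marginal mass functions and the independence condition $p(x, y) = p_X(x)p_Y(y)$ are easy to compute from a joint table. The sketch below assumes two fair coin tosses; the dictionaries and helper names are hypothetical, introduced for this example.

```python
from collections import defaultdict
from fractions import Fraction


def marginals(joint):
    """Return (p_X, p_Y) by summing the joint mass function over the other variable."""
    p_x, p_y = defaultdict(Fraction), defaultdict(Fraction)
    for (x, y), p in joint.items():
        p_x[x] += p
        p_y[y] += p
    return dict(p_x), dict(p_y)


def is_independent(joint):
    """Check p(x, y) == p_X(x) p_Y(y) for all pairs (x, y)."""
    p_x, p_y = marginals(joint)
    return all(
        joint.get((x, y), Fraction(0)) == p_x[x] * p_y[y]
        for x in p_x
        for y in p_y
    )


quarter = Fraction(1, 4)
# X = heads on the first toss (0 or 1), Y = total number of heads in two tosses.
joint_dependent = {(0, 0): quarter, (0, 1): quarter, (1, 1): quarter, (1, 2): quarter}
# X = heads on the first toss, Z = heads on the second toss.
joint_independent = {(x, z): quarter for x in (0, 1) for z in (0, 1)}

p_x, p_y = marginals(joint_dependent)
print(p_y[0], p_y[1], p_y[2])              # 1/4 1/2 1/4
print(is_independent(joint_dependent))     # False
print(is_independent(joint_independent))   # True
```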


The continuous (discrete) random variables $X$ and $Y$ are independent if and only if their joint probability density (mass) function can be expressed as

\[f_{X,Y}(x, y) = h(x)g(y) \qquad -\infty < x, y < \infty\]


Independence is a symmetric relation. To say that $X$ is independent of $Y$ is equivalent to saying that $Y$ is independent of X, or just that $X$ and $Y$ are independent.

14. Sums of Independent Random Variables


Suppose that $X$ and $Y$ are independent, continuous random variables having probability density function $f_X$ and $f_Y$. The cumulative distribution function of $X+Y$ is obtained as follows:

\[F_{X+Y}(a) = P\{X + Y \leq a \} = \int_{-\infty}^\infty F_X(a-y)f_Y(y)dy\]

$F_{X+Y}$ is called the convolution of the distributions $F_X$ and $F_Y$.

The probability density function $f_{X+Y}$ of $X+Y$ is given by

\[f_{X+Y}(a) = \cfrac{d}{da} F_{X+Y}(a) = \int_{-\infty}^\infty f_X(a-y)f_Y(y)dy\]

Identically Distributed Uniform Random Variables

Suppose $X$ and $Y$ are independent random variables, both uniformly distributed on $(0, 1)$. The probability density of $X + Y$ is

\[f_{X+Y}(a) = \begin{cases} a & 0 \leq a \leq 1 \\ 2-a & 1 < a < 2 \\ 0 & otherwise \end{cases}\]
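The triangular shape can be confirmed by evaluating the convolution integral $\int f_X(a-y)f_Y(y)\ dy$ numerically. The sketch below is a rough check using a midpoint Riemann sum, assuming both variables are uniform on $(0, 1)$; the function names and the step count are choices made for this example.

```python
def uniform_pdf(x):
    """Density of a uniform random variable on (0, 1)."""
    return 1.0 if 0.0 < x < 1.0 else 0.0


def convolution_pdf(a, f_x=uniform_pdf, f_y=uniform_pdf, lo=0.0, hi=1.0, steps=10_000):
    """Midpoint-rule approximation of the integral of f_X(a - y) f_Y(y) over [lo, hi].

    [lo, hi] should cover the support of f_Y; it defaults to (0, 1).
    """
    h = (hi - lo) / steps
    total = 0.0
    for k in range(steps):
        y = lo + (k + 0.5) * h
        total += f_x(a - y) * f_y(y)
    return h * total


def triangular_pdf(a):
    """Closed-form density of X + Y stated in the text."""
    if 0.0 <= a <= 1.0:
        return a
    if 1.0 < a < 2.0:
        return 2.0 - a
    return 0.0


for a in (0.25, 0.5, 1.0, 1.5, 1.75, 2.5):
    print(a, round(convolution_pdf(a), 3), triangular_pdf(a))
```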