
MSC Classification: 60 (Probability theory) 
Prerequisites: combinatorics
Getting Oriented
Rough Guides to Probability 
Probability is the study of chance…
Combinatorics
Before one can ask “how likely?” one must ask “how many?”. The answer to that question is given by combinatorics. The basic question is: how many ways are there to choose $k$ objects out of a set of $n$? Well, the answer depends on whether order matters, whether they are distinct objects, and so on:
 There are $n!$ ways to order a set of $n$ objects;
 Combination: there are $C(n,k)=\binom{n}{k}=\frac{n!}{(n-k)!\,k!}$ ways to select $k$ objects out of $n$ when order does not matter;
 Permutation: there are $P(n,k)=\frac{n!}{(n-k)!}$ ways to make this selection when order matters;
 Multinomial coefficient: there are $\binom{n}{n_1,\ldots,n_r}=\frac{n!}{n_1!\cdots n_r!}$ ways to partition $n$ objects into labelled sets of sizes $n_1, n_2, \ldots, n_r$, where $n_1+\cdots+n_r=n$.
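The counts listed above can be checked numerically. The following is a minimal sketch, assuming Python 3.8 or later (for `math.comb` and `math.perm`); the helper name `multinomial` and the values $n=5$, $k=2$ are illustrative choices, not part of the original notes.

```python
from math import comb, factorial, perm

def multinomial(*sizes):
    """Ways to split sum(sizes) objects into labelled groups of the given sizes."""
    result = factorial(sum(sizes))
    for size in sizes:
        result //= factorial(size)  # exact: every partial quotient is an integer
    return result

n, k = 5, 2
orderings = factorial(n)            # n! = 120
combinations = comb(n, k)           # n! / ((n-k)! k!) = 10
permutations = perm(n, k)           # n! / (n-k)! = 20
partitions = multinomial(2, 2, 1)   # 5! / (2! 2! 1!) = 30
```

With two groups, `multinomial(k, n - k)` reduces to the ordinary binomial coefficient `comb(n, k)`.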
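These counts are related to powers of polynomials by the Binomial Theorem
(1) $(x+y)^n=\sum_{k=0}^{n}\binom{n}{k}x^k y^{n-k}$
and the Multinomial Theorem
(2) $(x_1+\cdots+x_r)^n=\sum_{n_1+\cdots+n_r=n}\binom{n}{n_1,\ldots,n_r}x_1^{n_1}\cdots x_r^{n_r}$.
Basic Probability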
Probability theory uses the notion of an experiment and the outcomes of that experiment. The sample space $S$ is the set of all possible outcomes, and an event is a set of outcomes, that is, a subset of $S$. We may, of course, talk about unions, intersections (denoted $EF$ rather than $E \cap F$), and complements of events. The probability of an event is given by the probability measure:
Definition
A probability measure is a function $P$ that assigns to each event $E\subseteq S$, where $S$ is the sample space, a number $P(E)\in[0,1]$ such that: (1) $P(S)=1$; and (2) $P(\cup_i E_i)=\sum_i P(E_i)$ if the events $E_i$ are pairwise disjoint.
For example, if all outcomes in a finite sample space are equally likely, then the probability of event $A$ occurring is $P(A)=\frac{|A|}{|S|}$.
One can conclude from these axioms that $P(E^c)=1-P(E)$ and that if $E\subset F$ then $P(E)\leq P(F)$. We can also show that for an increasing (decreasing) sequence of events $E_n$, we have $\lim_{n\to\infty}P(E_n)=P(\lim_{n\to\infty} E_n)$. The Principle of Inclusion and Exclusion, or PIE, states that $P(E\cup F)=P(E)+P(F)-P(EF)$, and can easily be extended to any number of sets.
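As a small illustration of the equally likely case, the complement rule, and PIE, the sketch below enumerates the 36 outcomes of two fair dice. The events `E` and `F` and the helper `prob` are hypothetical choices made for this example.

```python
from fractions import Fraction
from itertools import product

# Sample space: ordered pairs of faces of two fair dice (36 equally likely outcomes).
sample_space = set(product(range(1, 7), repeat=2))

def prob(event):
    """P(A) = |A| / |S| when all outcomes are equally likely."""
    return Fraction(len(event), len(sample_space))

E = {o for o in sample_space if o[0] == 6}           # first die shows 6
F = {o for o in sample_space if o[0] + o[1] >= 10}   # total is at least 10

p_union = prob(E | F)                        # computed directly from the union
p_pie = prob(E) + prob(F) - prob(E & F)      # computed by inclusion and exclusion
p_complement = prob(sample_space - E)        # P(E^c)
```

Both `p_union` and `p_pie` come out to $1/4$, and `p_complement` equals $1-P(E)=5/6$.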
Conditional Probability
If an experiment is repeated, the first outcome may affect the probability of the second. In an extreme example, if someone draws two balls out of a bag containing one black ball and one white ball, then the color of the second ball is completely determined by the color of the first. This is an example of conditional probability.
Definition
Conditional probability is the probability of an event $E$ given that an event $F$ has already occurred, given by $P(E\mid F)=\frac{P(EF)}{P(F)}=\frac{P(E)P(F\mid E)}{P(F)}$, defined when $P(F)>0$.
Note that, for fixed $F$, the conditional probability $P_F(E)=P(E\mid F)$ is itself a probability measure.
The above formulae give rise to several computational rules: for two events, we have $P(EF)=P(E)P(F\mid E)$ and $P(E)=P(EF)+P(EF^c)=P(E\mid F)P(F)+P(E\mid F^c)P(F^c)$. Both of these may be generalized for several events.
If the first outcome does not affect the second, then the two events are independent and $P(EF)=P(E)P(F)$, in which case $P(E\mid F)=P(E)$.
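The rules above can be checked by enumeration. This is a sketch only, again using two fair dice; the event names and the helper `cond_prob` are assumptions introduced for the example.

```python
from fractions import Fraction
from itertools import product

sample_space = set(product(range(1, 7), repeat=2))

def prob(event):
    return Fraction(len(event), len(sample_space))

def cond_prob(event, given):
    """P(E | F) = P(EF) / P(F); requires P(F) > 0."""
    return prob(event & given) / prob(given)

first_is_six = {o for o in sample_space if o[0] == 6}
second_is_even = {o for o in sample_space if o[1] % 2 == 0}
total_at_least_ten = {o for o in sample_space if sum(o) >= 10}

# Independent pair: conditioning on the first die does not change the second.
p_even_given_six = cond_prob(second_is_even, first_is_six)

# Dependent pair: a six on the first die makes a large total more likely.
p_big_given_six = cond_prob(total_at_least_ten, first_is_six)

# Law of total probability: P(E) = P(E|F)P(F) + P(E|F^c)P(F^c).
not_six = sample_space - first_is_six
p_big_total = (cond_prob(total_at_least_ten, first_is_six) * prob(first_is_six)
               + cond_prob(total_at_least_ten, not_six) * prob(not_six))
```

Here `p_even_given_six` equals the unconditional $1/2$, while `p_big_given_six` is $1/2$ against an unconditional probability of $1/6$.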
The Odds Ratio
The odds ratio of an event $H$ is given by
(3) $\frac{P(H)}{P(H^c)}=\frac{P(H)}{1-P(H)}$,
which uniquely determines the probability. If an event $E$ has already occurred, the odds ratio becomes $\frac{P(H\mid E)}{P(H^c\mid E)}=\frac{P(H)}{P(H^c)}\frac{P(E\mid H)}{P(E\mid H^c)}$, so it is just multiplied by the factor $\frac{P(E\mid H)}{P(E\mid H^c)}$.
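A short numerical sketch of the odds update follows. The prior and the two conditional probabilities are hypothetical numbers chosen only to make the arithmetic visible, and the function names are likewise assumptions.

```python
from fractions import Fraction

def odds(p):
    """Odds P(H) / P(H^c) corresponding to a probability p < 1."""
    return p / (1 - p)

def prob_from_odds(o):
    """Invert the odds: p = o / (1 + o)."""
    return o / (1 + o)

# Hypothetical inputs, for illustration only.
p_h = Fraction(1, 100)               # prior P(H)
p_e_given_h = Fraction(9, 10)        # P(E | H)
p_e_given_not_h = Fraction(1, 10)    # P(E | H^c)

# Prior odds multiplied by the factor P(E|H) / P(E|H^c).
posterior_odds = odds(p_h) * p_e_given_h / p_e_given_not_h
posterior = prob_from_odds(posterior_odds)

# Cross-check against the conditional probability computed directly.
direct = p_h * p_e_given_h / (p_h * p_e_given_h + (1 - p_h) * p_e_given_not_h)
```

The prior odds of $1/99$ are multiplied by $9$ to give posterior odds of $1/11$, that is, a posterior probability of $1/12$, which matches the direct computation.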
Random Variables
Discrete Random Variables
A random variable $X$ may encode the outcomes of an experiment if they are numerical. For discrete outcomes, the probability mass function is $p(x)=P(X=x)$, and the cumulative distribution function is $F(x)=P(X \leq x)=\sum_{a\leq x} p(a)$. Note that $\lim_{x\to-\infty} F(x)=0$, $\lim_{x\to\infty} F(x)=1$, and that $F$ is right continuous. Given a random variable, we have:
 expected value: the linear function $E[X]=\sum_x xp(x)$;
 expected value of a function $g(X)$: $E[g(X)]=\sum_x g(x)p(x)$;
 variance: $\mathsf{Var}(X)=E[(X-E[X])^2]=E[X^2]-(E[X])^2$;
 standard deviation: $\mathsf{SD}(X)=\sqrt{\mathsf{Var}(X)}$.
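These definitions translate directly into code. The sketch below uses a fair six-sided die as an illustrative mass function; the helper name `expectation` is an assumption of this example.

```python
from fractions import Fraction
from math import sqrt

# Probability mass function of a fair six-sided die (an illustrative choice).
pmf = {x: Fraction(1, 6) for x in range(1, 7)}

def expectation(pmf, g=lambda x: x):
    """E[g(X)] = sum over x of g(x) p(x)."""
    return sum(g(x) * p for x, p in pmf.items())

mean = expectation(pmf)                                           # E[X] = 7/2
var_by_definition = expectation(pmf, lambda x: (x - mean) ** 2)   # E[(X - E[X])^2]
var_by_shortcut = expectation(pmf, lambda x: x * x) - mean ** 2   # E[X^2] - (E[X])^2
std_dev = sqrt(var_by_shortcut)
```

Both variance formulas give $35/12$, as the identity requires.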
Examples:
Many experiments have just two outcomes: success and failure. This simple experiment gives rise to several random variables:
 Bernoulli random variable $X$: a single experiment is performed, with probability $p$ of success;
 Binomial random variable $X^n$: $n$ independent Bernoulli trials are performed, with probability of $i$ successes given by $P(X^n=i)=\binom{n}{i}p^i(1-p)^{n-i}$. This has expected value $E[X^n]=np$ and variance $\mathsf{Var}(X^n)=np(1-p)$;
 Poisson random variable $X^\infty$: an approximation of $X^n$ for $n$ large and $p$ small, with $\lambda=np$, given by $P(X^\infty=i)=\frac{\lambda^i}{i!}e^{-\lambda}$. The parameter $\lambda$ is both the expected value and the variance;
 Geometric random variable $\hat{X}$: measures the number of trials up to and including the first success in independent Bernoulli trials, given by $P(\hat{X}=n)=(1-p)^{n-1}p$. Here, $E[\hat{X}]=\frac{1}{p}$ and $\mathsf{Var}(\hat{X})=\frac{1-p}{p^2}$;
 Negative binomial random variable $\hat{X}^r$: measures the number of trials needed to obtain $r$ successes, given by $P(\hat{X}^r=n)=\binom{n-1}{r-1}p^r(1-p)^{n-r}$. We have $E[\hat{X}^r]=\frac{r}{p}$ and $\mathsf{Var}(\hat{X}^r)=\frac{r(1-p)}{p^2}$.
When taking $k$ objects from a collection of $m$ white and $n-m$ black objects, the probability of getting $i$ white objects is given by the hypergeometric random variable $X_g$, with $P(X_g=i)=\frac{\binom{m}{i}\binom{n-m}{k-i}}{\binom{n}{k}}$. The expected value is $E[X_g]=\frac{km}{n}$.
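The binomial formulas and the Poisson approximation can be checked numerically. This is a rough sketch in floating point; the parameter values $n=1000$, $p=0.003$ and the function names are illustrative assumptions.

```python
from math import comb, exp, factorial

def binomial_pmf(i, n, p):
    """P(i successes in n independent trials) = C(n, i) p^i (1-p)^(n-i)."""
    return comb(n, i) * p**i * (1 - p) ** (n - i)

def poisson_pmf(i, lam):
    """P(X = i) = e^(-lam) lam^i / i!."""
    return exp(-lam) * lam**i / factorial(i)

n, p = 1000, 0.003   # illustrative values: n large, p small
lam = n * p

total = sum(binomial_pmf(i, n, p) for i in range(n + 1))
mean = sum(i * binomial_pmf(i, n, p) for i in range(n + 1))
variance = sum((i - mean) ** 2 * binomial_pmf(i, n, p) for i in range(n + 1))

# Largest gap between the binomial and Poisson probabilities over small counts.
max_gap = max(abs(binomial_pmf(i, n, p) - poisson_pmf(i, lam)) for i in range(20))
```

Up to rounding, `total` is $1$, `mean` is $np$, and `variance` is $np(1-p)$, while `max_gap` stays small for these parameters.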
Continuous Random Variables
Random variables may also be continuous (take the speeds of cars along a highway for example), in which case the probability density function $f(x)$ is defined on a continuous set. Rather than exact values, we usually speak of the random variable lying in some range of values $B$: $P(X\in B)=\int_B f(x)dx$. When the cumulative distribution function $F$ is continuous, the probability of a precise value being taken is always zero, but if $F$ has a jump discontinuity at $x$, then the probability of $X$ taking the value $x$ is just the size of the jump. We have:
 cumulative distribution function: $F(x)=P(X\leq x)=\int_{-\infty}^{x}f(u)du$ (so $f(x)=F'(x)$);
 expected value of $X$: $E[X]=\int_{-\infty}^{\infty} xf(x)dx$;
 expected value of a function $g(X)$: $E[g(X)]=\int_{-\infty}^{\infty} g(x)f(x)dx$;
 variance: as before, $\mathsf{Var}(X)=E[(X-E[X])^2]=E[X^2]-(E[X])^2$.
A function $Y=g(X)$ of a continuous random variable $X$, with $g$ strictly monotonic and differentiable, has its own probability density function: $f_Y(y)=f_X(g^{-1}(y))\left|\frac{d}{dy}g^{-1}(y)\right|$.
Examples:
 Uniform Distribution: all outcomes in an interval $[a,b]$ are equally probable, giving a density function $f(x)=\frac{1}{b-a}$ for $a\leq x\leq b$;
 Normal Distribution: given by $f(x)=\frac{1}{\sigma\sqrt{2\pi}}e^{-(x-\mu)^2/2\sigma^2}$, with parameters $\mu$ (the mean) and $\sigma$ (the standard deviation). It is an amazing fact that this distribution approximates the behavior of sums and averages of a large number of independent trials, for almost any underlying distribution (see the Central Limit Theorem);
 Exponential Distribution: given by $f(x)=\lambda e^{-\lambda x}$ for $x\geq 0$, with parameter $\lambda$. Its mean and variance are $1/\lambda$ and $1/\lambda^2$;
 Gamma Distribution: given by $f(x)=\lambda e^{-\lambda x}(\lambda x)^{t-1}/\Gamma(t)$ for $x\geq 0$, where $\Gamma(t)=\int_0^\infty e^{-x}x^{t-1}dx$ is the standard gamma function. It has expected value $t/\lambda$ and variance $t/\lambda^2$;
 Beta Distribution: given by $f(x)=x^{a-1}(1-x)^{b-1}/B(a,b)$ for $0\leq x\leq 1$, where $B(a,b)=\int_0^1 x^{a-1}(1-x)^{b-1}dx$. Its mean and variance are $\frac{a}{a+b}$ and $\frac{ab}{(a+b)^2(a+b+1)}$.
The exponential distribution is closely related to the hazard or failure rate function $\lambda(t)=\frac{f(t)}{1-F(t)}$. Here, $\lambda(t)dt$ represents the probability that an item which has survived to age $t$ will fail within an additional time $dt$. For the exponential distribution, $\lambda(t)=\lambda$ is constant, so failure does not become more likely over time. In fact, the exponential distribution is the only one having a constant failure rate.
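The constant failure rate can be seen numerically. The sketch below evaluates $f(t)/(1-F(t))$ for an exponential distribution and, for contrast, for a uniform distribution on $[0,1]$; the rate `lam = 0.5`, the sample times, and the function names are illustrative assumptions.

```python
from math import exp

def exp_pdf(t, lam):
    return lam * exp(-lam * t)

def exp_cdf(t, lam):
    return 1 - exp(-lam * t)

def hazard(pdf, cdf, t):
    """Failure rate lambda(t) = f(t) / (1 - F(t))."""
    return pdf(t) / (1 - cdf(t))

lam = 0.5  # illustrative rate parameter

# Exponential: the hazard equals lam at every age t.
exp_rates = [hazard(lambda t: exp_pdf(t, lam), lambda t: exp_cdf(t, lam), t)
             for t in (0.0, 1.0, 5.0, 20.0)]

# Uniform on [0, 1]: f(t) = 1 and F(t) = t, so the hazard 1 / (1 - t) grows with age.
uniform_rates = [hazard(lambda t: 1.0, lambda t: t, t) for t in (0.0, 0.5, 0.9)]
```

The exponential rates all agree with `lam` up to rounding, whereas the uniform rates increase as $t$ approaches $1$.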
Jointly Distributed Random Variables
This section generalizes both random variables and conditional probability. When two random variables (discrete or continuous) are considered together, and in particular when they are not independent, we have a joint probability mass function given by $f(x,y)=P(X=x,Y=y)$ (a joint density function in the continuous case) and a corresponding cumulative distribution function $F(x,y)=P(X\leq x,Y\leq y)$. Note that the cumulative distribution functions of $X$ and $Y$ can be found from $F(x,y)$ by $F_X(x)=\lim_{y\to\infty}F(x,y)$ and $F_Y(y)=\lim_{x\to\infty}F(x,y)$.
The random variables $X$ and $Y$ are independent iff their joint mass function factors into separate functions of $x$ and $y$: $f(x,y)=g(x)h(y)$. In general, we have a conditional probability mass function given by $f_{X\mid Y}(x\mid y)=\frac{f(x,y)}{f_Y(y)}$, defined wherever $f_Y(y)>0$.
The distribution of the sum of two independent random variables is given by a convolution: $F_{X+Y}(a)=\int_{-\infty}^{\infty} F_X(a-y)f_Y(y)dy$, and correspondingly $f_{X+Y}(a)=\int_{-\infty}^{\infty} f_X(a-y)f_Y(y)dy$. This allows us to show that the sum of several independent normal random variables with parameters $(\mu_i,\sigma_i^2)$ is again a normal random variable with parameters $(\sum_i\mu_i,\sum_i\sigma_i^2)$. Parameters also add for the sum of independent Poisson variables.
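The Poisson claim can be checked with the discrete analogue of the convolution formula, in which the integral becomes a sum over the values of one variable. This is a sketch under that assumption; the parameters $1.5$ and $2.5$ and the helper `pmf_of_sum` are illustrative.

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    return exp(-lam) * lam**k / factorial(k)

def pmf_of_sum(pmf_x, pmf_y, a):
    """Discrete convolution for independent X, Y >= 0:
    P(X + Y = a) = sum over k of P(X = a - k) P(Y = k)."""
    return sum(pmf_x(a - k) * pmf_y(k) for k in range(a + 1))

lam1, lam2 = 1.5, 2.5  # illustrative parameters

# Compare the convolution with a single Poisson of parameter lam1 + lam2.
gaps = [abs(pmf_of_sum(lambda k: poisson_pmf(k, lam1),
                       lambda k: poisson_pmf(k, lam2), a)
            - poisson_pmf(a, lam1 + lam2))
        for a in range(15)]
```

Every entry of `gaps` is zero up to floating-point rounding, consistent with the parameters adding.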
Properties of Expectation
The expected value of a function $g(X,Y)$ is $E[g(X,Y)]=\sum_{x,y} g(x,y)p(x,y)$ in the discrete case, with the sum replaced by an integral in the continuous case. In particular, expectation is additive: $E[X+Y]=E[X]+E[Y]$.
The covariance of $X$ and $Y$ is defined as $\mathsf{Cov}(X,Y)=E[(X-E[X])(Y-E[Y])]$, which can be shown to equal $E[XY]-E[X]E[Y]$. Note that $\mathsf{Cov}(\sum_i X_i,\sum_j Y_j)=\sum_{i,j} \mathsf{Cov}(X_i,Y_j)$. Covariance is related to variance by the identity $\mathsf{Var}(\sum_i X_i)=\sum_i\mathsf{Var}(X_i)+2\sum_{i<j}\mathsf{Cov}(X_i,X_j)$. We define the correlation between $X$ and $Y$ to be $\rho(X,Y)=\frac{\mathsf{Cov}(X,Y)}{\sqrt{\mathsf{Var}(X)\mathsf{Var}(Y)}}$.
Assuming that $Y=y$, we can define the conditional expected value of $X$. This is given by $E[X\mid Y=y]=\sum_x xP(X=x\mid Y=y)$ in the discrete case and $\int_{-\infty}^{\infty} x f_{X\mid Y}(x\mid y)dx$ in the continuous case. Note that $E[X]=E[E[X\mid Y]]$; as a consequence we have $E[X]=\sum_y E[X\mid Y=y]P(Y=y)$ in the discrete case, and a corresponding formula in the continuous case.
We can also define the conditional variance: $\mathsf{Var}(X\mid Y=y)=E[(X-E[X\mid Y=y])^2\mid Y=y]$, giving the formula $\mathsf{Var}(X)=E[\mathsf{Var}(X\mid Y)]+\mathsf{Var}(E[X\mid Y])$.
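The two conditioning identities above can be verified exactly on a small example. The joint mass function below is hypothetical, chosen only so that the arithmetic is easy to follow; the helper `cond_moment` is likewise an assumption.

```python
from fractions import Fraction

# Hypothetical joint pmf p(x, y) on a small grid, for illustration only.
joint = {
    (0, 0): Fraction(1, 8), (1, 0): Fraction(3, 8),
    (0, 1): Fraction(1, 4), (2, 1): Fraction(1, 4),
}

ys = {y for _, y in joint}
p_y = {y: sum(p for (_, b), p in joint.items() if b == y) for y in ys}

def cond_moment(y, power):
    """E[X^power | Y = y] = sum over x of x^power P(X = x | Y = y)."""
    return sum(x**power * p / p_y[y] for (x, b), p in joint.items() if b == y)

cond_mean = {y: cond_moment(y, 1) for y in ys}
cond_var = {y: cond_moment(y, 2) - cond_mean[y] ** 2 for y in ys}

mean_x = sum(x * p for (x, _), p in joint.items())
var_x = sum(x * x * p for (x, _), p in joint.items()) - mean_x**2

# E[X] = sum over y of E[X | Y = y] P(Y = y).
tower = sum(cond_mean[y] * p_y[y] for y in ys)

# Var(X) = E[Var(X | Y)] + Var(E[X | Y]).
expected_cond_var = sum(cond_var[y] * p_y[y] for y in ys)
var_of_cond_mean = sum(cond_mean[y] ** 2 * p_y[y] for y in ys) - tower**2
```

For this table, `tower` and `mean_x` are both $7/8$, and `expected_cond_var + var_of_cond_mean` equals `var_x`, namely $39/64$.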
Given a random variable $X$, the moment generating function is given by $M_X(t)=E[e^{tX}]$. The moments of $X$ can then be found by differentiating $M_X(t)$ and evaluating the result at $t=0$: $E[X^n]=M_X^{(n)}(0)$. When it is finite in a neighborhood of $t=0$, the moment generating function uniquely determines the distribution of the random variable. We also have $M_{X+Y}(t)=M_X(t)M_Y(t)$, if $X$ and $Y$ are independent.
A set of random variables has a multivariate normal distribution if each of them is a linear combination of a finite set of independent standard normal random variables. Given independent identically distributed random variables $X_1,\ldots,X_n$ with mean $\mu$ and variance $\sigma^2$, the sample mean is $\bar X=\sum_i X_i/n$ and the sample variance is $S^2=\sum_i\frac{(X_i-\bar X)^2}{n-1}$. Both $\bar X$ and $S^2$ are random variables, and when the $X_i$ are normal they are independent. In that case the variable $\bar X$ has a normal distribution, with mean $\mu$ and variance $\sigma^2/n$.
Going Further
Limit Theorems
The main idea in this section is that, as the number of trials of an experiment grows very large, the suitably normalized sum of the outcomes follows a normal distribution.
First, we have two inequalities. The Markov Inequality states that $P(X\geq a)\leq E[X]/a$ for nonnegative random variables $X$ and all $a>0$. The Chebyshev Inequality states that $P(|X-\mu|\geq k\sigma)\leq 1/k^2$ for all positive $k$, where $\mu$ and $\sigma^2$ are the mean and variance of $X$. These bounds are the basic tools behind the limit theorems that follow:
Theorem (Central Limit Theorem)
Given independent identically distributed random variables $X_i$ with mean $\mu$ and variance $\sigma^2$, we have that $P\left(\frac{X_1+\cdots+X_n-n\mu}{\sigma\sqrt{n}}\leq a\right)$ approaches $\frac{1}{\sqrt{2\pi}}\int_{-\infty}^a e^{-x^2/2}dx$ as $n\to\infty$. This means that the standardized sum of the $X_i$ is approximately normally distributed for large $n$.
A related result is the Strong Law of Large Numbers, which states that the average of $n$ such variables converges to their mean $\mu$ with probability 1 as $n\to\infty$.
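The Central Limit Theorem can be illustrated without simulation when the $X_i$ are Bernoulli trials, since the sum is then binomial and its distribution can be computed exactly. The following is a sketch under that assumption; the values $p=0.3$ and $a=1$ and the function names are illustrative.

```python
from math import comb, erf, sqrt

def std_normal_cdf(a):
    """Phi(a) = (1 / sqrt(2 pi)) * integral from -infinity to a of exp(-x^2 / 2) dx."""
    return 0.5 * (1 + erf(a / sqrt(2)))

def standardized_sum_cdf(a, n, p):
    """Exact P((X_1 + ... + X_n - n mu) / (sigma sqrt(n)) <= a) for Bernoulli(p) trials."""
    mu, sigma = p, sqrt(p * (1 - p))
    cutoff = n * mu + a * sigma * sqrt(n)
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i)
               for i in range(n + 1) if i <= cutoff)

p, a = 0.3, 1.0  # illustrative choices

# Gap between the exact probability and the normal limit for several n.
errors = {n: abs(standardized_sum_cdf(a, n, p) - std_normal_cdf(a))
          for n in (10, 100, 1000)}
```

The entries of `errors` show how close the exact probability is to the normal limit; for $n=1000$ the gap is well under $0.02$.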