Definition
The standard Gaussian function is defined as \[ \bbox[5px,border:2px solid red] { g_{0,1}(x)=g(x)=\frac{1}{\sqrt{2\pi}}e^{-x^2/2} } \] This is by definition the probability density function of a standard normal random variable. It has zero mean and unit variance. A general Gaussian function (with mean \(\mu\) and variance \(\sigma^2\) is defined as \[ \bbox[5px,border:2px solid red] { g_{\mu,\sigma^2}(x)=\frac{1}{\sigma}g\left ( \frac{x-\mu}{\sigma} \right )=\frac{1}{\sqrt{2\pi\sigma^2}}e^{-\tfrac{1}{2\sigma^2}\left ( x-\mu \right )^2} } \] This is the probability density function of a general normal distribution. Note that if \(Z\) is a standard normal random variable, then \(X=\mu+\sigma Z\) will be a normal random variable with mean \(\mu\) and variance \(\sigma^2\).
Historical Timeline
1600 | In astronomy, discrepancies in measurements demanded some form of resolution to settle on a single number. Different astronomers had different ways of inferring the true value from a set of measurements, some using medians, some means, but varying widely in their techniques for calculation. Notably, Tycho Brahe and Johannes Kepler use unclear methods to obtain representative values from multiple measurements. |
1632 | Galileo Galilei is concerned with the ambiguity in obtaining a single value from multiple measurements, and notes that there must be a true value, and that errors are symmetrically distributed around this true value, with smaller errors being more likely than larger ones. |
1654 | Fermat and Pascal develop the modern theory of probability. Pascal develops his triangle (the binomial coefficients), and the probabilities of the binomial distribution for success probability \(p=1/2\). |
1710 | James Bernoulli finds the binomial distribution for general success probability. |
1712 | The Dutch mathematician Willem ’s Gravesande uses Pascal's calculation of probability to investigate birthrate discrepancies of male and female children. |
1722 | Roger Cotes suggests the true value of a set of measurements be at the "center of mass" of the observed values (the modern definition of the mean). This is the same as the value with the minimum square deviation from the measurements. |
1733 | Abraham De Moivre found that \[\binom{N}{k}\approx \frac{2^{N+1}}{\sqrt{2\pi N}}e^{-\tfrac{2}{N}\left (k-\tfrac{N}{2} \right )^2}\] and used it in relation to the binomial distribution. This is very clearly the first instance of the Gaussian function in a probability context. |
1756 | Thomas Simpson proved that the expected error is bounded when the error is distributed by what would today be called a two-sided geometric distribution, and in the case where it has a rectangular or triangular distribution. |
1774 | Pierre-Simon, marquis de Laplace argued the distribution of errors should take on a distribution of the form \(p(x)=\frac{m}{2}e^{-m|x|}\). |
1777 | Daniel Bernoulli becomes concerned about the center-of-mass mean being universally accepted without justification. He favored a semicircular distribution of errors. He also proved that with fixed errors, the accumulated error tends toward a Gaussian-like distribution. |
1801 | Giuseppe Piazzi observes a celestial object proposed to be a planet (later identified as Ceres). It goes behind the sun and astronomers try to predict where it will re-emerge. Karl Friedrich Gauss guesses correctly while most others guess incorrectly. |
1809 | Gauss proposes the method of least squares, which leads him to conclude that the error curve must be a Gaussian function. |
1810 | Laplace proves the Central Limit Theorem, which concludes that the Gaussian is the limiting distribution of averages from almost any distribution. |
1846 | Adolphe Quetelet applies the Gaussian distribution to sociology and anthropometry, particularly to the distribution of chest sizes of Scottish soldiers. |
1860 | James Clark Maxwell shows that the Gaussian distribution applies to the velocities of gas particles. |
1869 | Sir Francis Galton applies Quetelet's work and the central limit theorem more broadly, to try to prove that intelligence is hereditary, and spends considerable time and energy investigating heredity and statistics. However, one of his primary motivations was to apply these to eugenics. |
1873 | C. S. Peirce, Galton, and Wilhelm Lexis refer to the Gaussian distribution as the "normal" distribution. |
1900 | Karl Pearson popularizes "normal distribution" as a shorter and uncredited name for the "Laplace-Gauss curve". |
1947-50 | P.G. Hoel and A.M. Mood publish popular textbooks that refer to the distribution with zero mean and unit variance as the "standard normal" distribution. |
Normalization
We wish to verify that the functions above are normalized, i.e. that they are proper probability density functions. To do this, let's evaluate the integral \[ I=\int_{-\infty}^{\infty}e^{-x^2}dx \] Recall the limit definition of the exponential: \(e^x=\underset{n \to \infty}{\lim}\left ( 1+\tfrac{x}{n} \right )^n\). Using this, the integral becomes the following limit: \[ I=\underset{n \to \infty}{\lim}\int_{-\sqrt{n}}^{\sqrt{n}}\left ( 1-\frac{x^2}{n} \right )^n dx =\underset{n \to \infty}{\lim}\sqrt{n}\int_{-1}^{1}\left ( 1-x^2 \right )^n dx \] Let's define \[ I_p=\int_{-1}^{1}\left ( 1-x^2 \right )^p dx \] Using integration by parts: \[ \begin{matrix} I_p & = & \left.\ x\left ( 1-x^2 \right )^p\right|_{x=-1}^{1}+2p\int_{-1}^{1}x^2\left ( 1-x^2 \right )^{p-1}dx \\ \\ & = &2p\left [\int_{-1}^{1}\left ( 1-x^2 \right )^{p-1}dx-\int_{-1}^{1}\left ( 1-x^2 \right )^{p}dx \right ] \end{matrix} \] It follows that \(I_p = 2p\left [ I_{p-1}-I_p \right ] \), and so \(I_p=\frac{2p}{2p+1}I_{p-1}\). As clearly \(I_0=2\), and \(I_{1/2}=\pi/2\), we have the two formulas, where m is some natural number: \[ I_m=2\prod_{k=1}^{m}\frac{2k}{2k+1} \\ I_{m+\tfrac{1}{2}}=\frac{\pi}{2(m+1)}\prod_{k=1}^{m}\frac{2k+1}{2k} \] Which reveals the relationship \[ I_{m+\tfrac{1}{2}}=\frac{\pi}{(m+1)I_m} \] Substituting this back in the original limit: \[ I=\underset{n \to \infty}{\lim}\sqrt{n+\tfrac{1}{2}}I_{n+\tfrac{1}{2}}= \underset{n \to \infty}{\lim}\frac{\pi\sqrt{n+\tfrac{1}{2}}}{(n+1)I_n}= \pi\frac{\underset{n \to \infty}{\lim}\frac{\sqrt{(n+\tfrac{1}{2})n}}{n+1}}{\underset{n \to \infty}{\lim}\sqrt{n}I_n}=\frac{\pi}{I} \] It follows that \( \bbox[5px,border:2px solid red] {I=\sqrt{\pi} }\). Note that we also obtain the following limit as a byproduct: \[ \bbox[5px,border:2px solid red] { \underset{n \to \infty}{\lim}2\sqrt{n}\prod_{k=1}^{n}\frac{2k}{2k+1}=\sqrt{\pi} } \] which can be seen as equivalent to the Wallis product. This suffices to show that the Gaussian functions defined above are indeed normalized, as expected.
General Integral and Corollaries
Using the result above, let us evaluate \[ \int_{-\infty}^{\infty}e^{-ax^2+bx+c}dx \] This is easily done by completing the square: \(-ax^2+bx+c=-a\left ( x-\tfrac{b}{2a} \right )^2+\tfrac{b^2}{4a}+c\). This immediately gives \[ \int_{-\infty}^{\infty}e^{-ax^2+bx+c}dx=\int_{-\infty}^{\infty}e^{-a\left ( x-\tfrac{b}{2a} \right )^2+\tfrac{b^2}{4a}+c}dx=e^{\tfrac{b^2}{4a}+c}\int_{-\infty}^{\infty}e^{-au^2}du \] \[ \bbox[5px,border:2px solid red] { \int_{-\infty}^{\infty}e^{-ax^2+bx+c}dx=e^{\tfrac{b^2}{4a}+c}\sqrt{\pi/a} } \] Based on this, we can easily find: \[ \bbox[5px,border:2px solid red] { \int_{-\infty}^{\infty}g_{\mu,\sigma^2}(x)e^{t x}dx=e^{\mu t+\tfrac{1}{2}\sigma^2t^2} } \] This is the same as the moment generating function for a Gaussian distribution. Several results can be deduced from this. For instance, the Fourier transform of a Gaussian function: \[ \bbox[5px,border:2px solid red] { \int_{-\infty}^{\infty}g_{\mu,\sigma^2}(x)e^{-i\omega x}dx=e^{-\tfrac{1}{2}\sigma^2\omega^2-i\mu\omega} } \] The Fourier transform of a Gaussian is another Gaussian, with variance equal to one over the input variance. By taking the real and imaginary parts, we find that: \[ \int_{-\infty}^{\infty}g_{\mu,\sigma^2}(x)\cos(\omega x)dx=e^{-\tfrac{1}{2}\sigma^2\omega^2}\cos(\mu\omega) \\ \int_{-\infty}^{\infty}g_{\mu,\sigma^2}(x)\sin(\omega x)dx=e^{-\tfrac{1}{2}\sigma^2\omega^2}\sin(\mu\omega) \] Substituting a complex number for \(a\) gives us: \[ \int_{0}^{\infty}e^{-\alpha e^{-i\theta}x^2}dx=\frac{1}{2}\sqrt{\frac{\pi}{\alpha e^{-i\theta}}}=\frac{1}{2}\sqrt{\frac{\pi}{\alpha }}e^{i\theta/2} \\ \\ \int_{0}^{\infty}e^{-\alpha \cos(\theta)x^2}\cos\left (\alpha\sin(\theta)x^2\right ) dx= \frac{1}{2}\sqrt{\frac{\pi}{\alpha }}\cos(\theta/2) \\ \\ \int_{0}^{\infty}e^{-\alpha \cos(\theta)x^2}\sin\left (\alpha\sin(\theta)x^2\right ) dx= \frac{1}{2}\sqrt{\frac{\pi}{\alpha }}\sin(\theta/2) \] Particularly, when \(\theta=\pi/2\), we have \[ \bbox[5px,border:2px solid red] { \int_{0}^{\infty}\cos\left (\alpha x^2\right ) dx= \int_{0}^{\infty}\sin\left (\alpha x^2\right ) dx= \frac{1}{2}\sqrt{\frac{\pi}{2\alpha }} } \] Let us evaluate the integrals \[ I_n=\int_{0}^{\infty}x^ne^{-\tfrac{1}{2\sigma^2}x^2}dx \] By integrating by parts we find \[ I_n=\frac{1}{\sigma^2(n+1)}\int_{0}^{\infty}x^{n+2}e^{-\tfrac{1}{2\sigma^2}x^2}dx=\frac{I_{n+2}}{\sigma^2(n+1)} \] Which can be written as \(I_{n+2}=\sigma^2(n+1)I_n\). As \(I_0=\sigma\sqrt{\pi}/2\) and \(I_1=\sigma^2\), we find that \[ \bbox[5px,border:2px solid red] { \int_{0}^{\infty}x^ne^{-\tfrac{1}{2\sigma^2}x^2}dx=\sigma^{n+1}\cdot\left\{\begin{matrix} \sqrt{\tfrac{\pi}{2}}1\cdot3\cdot5\cdots (n-1) & \: \: \: n\: \: \mathrm{even}\\ 2\cdot4\cdot6\cdots (n-1) & \: \: \: n\: \: \mathrm{odd}\\ \end{matrix}\right. } \] Finally, let us examine the following parametrized integral \[ I(a,b)=\int_{0}^{\infty}e^{-a^2x^2-\tfrac{b^2}{x^2}}dx \] Let us differentiate with respect to \(b\) \[ \frac{\partial }{\partial b}I(a,b)=\int_{0}^{\infty}\frac{\partial }{\partial b}e^{-a^2x^2-\tfrac{b^2}{x^2}}dx=-2b\int_{0}^{\infty}\frac{1}{x^2}e^{-a^2x^2-\tfrac{b^2}{x^2}}dx \] We can then make the substitution \(y=\tfrac{b}{ax}\) to find \[\frac{\partial }{\partial b}I(a,b)=-2a\int_{0}^{\infty}e^{-a^2y^2-\tfrac{b^2}{y^2}}dy=-2aI(a,b)\] Given that \(I(a,0)=\tfrac{\sqrt{\pi}}{2a}\), it follows that \[ \bbox[5px,border:2px solid red] { \int_{0}^{\infty}e^{-a^2x^2-\tfrac{b^2}{x^2}}dx=\frac{\sqrt{\pi}}{2|a|}e^{-2|ab|} } \] Additionally, if we set \(b\to b\sqrt{i}\), we find \[ \int_{0}^{\infty}e^{-ax^2-\tfrac{bi}{x^2}}dx=\frac{1}{2}\sqrt{\frac{\pi}{a}}e^{-2\sqrt{abi}} =\frac{1}{2}\sqrt{\frac{\pi}{a}}e^{-\sqrt{2ab}}e^{-i\sqrt{2ab}} \\ \\ \int_{0}^{\infty}e^{-ax^2}\cos\left ( \frac{b}{x^2} \right )dx=\frac{1}{2}\sqrt{\frac{\pi}{a}}e^{-\sqrt{2ab}}\cos({\sqrt{2ab}}) \\ \\ \int_{0}^{\infty}e^{-ax^2}\sin\left ( \frac{b}{x^2} \right )dx=\frac{1}{2}\sqrt{\frac{\pi}{a}}e^{-\sqrt{2ab}}\sin({\sqrt{2ab}}) \] Using the last expression and taking the limit as \(a\) goes to zero, we find \[ \bbox[5px,border:2px solid red] { \int_{0}^{\infty}\sin\left ( \frac{b}{x^2} \right )dx=\sqrt{\frac{b\pi}{2}} } \] Similar results follow using complex substitutions in the above general cases.
A remarkably related integral that we can evaluate with these results is \[ I(a,b)=\int_{-\infty}^{\infty}\frac{\cos(ax)}{b^2+x^2}dx \] We use the elementary fact that \(t^{-1}=\int_{0}^{\infty}e^{-xt}dx\). Namely, we write: \[ I(a,b)=\int_{-\infty}^{\infty}\int_{0}^{\infty}\cos(ax)e^{-t(b^2+x^2)}dtdx \] Interchanging the order of integration and using the general formula we derived above, we find \[ I(a,b)=\int_{0}^{\infty}e^{-tb^2}\int_{-\infty}^{\infty}\cos(ax)e^{-tx^2}dxdt =\int_{0}^{\infty}e^{-tb^2}\sqrt{\frac{\pi}{t}}e^{-\frac{a^2}{4t}}dt \] Letting \(t=u^2\), we put it into the form above \[ I(a,b)=2\sqrt{\pi}\int_{0}^{\infty}e^{-u^2b^2}e^{-\frac{a^2}{4u^2}}du \] \[ \bbox[5px,border:2px solid red] { \int_{-\infty}^{\infty}\frac{\cos(ax)}{b^2+x^2}dx=\frac{\pi}{|b|}e^{-|ab|} } \] Particularly, if \(a=b=1\) \[ \bbox[5px,border:2px solid red] { \int_{-\infty}^{\infty}\frac{\cos(x)}{1+x^2}dx=\frac{\pi}{e} } \]
Maximum Likelihood
Suppose we wish to find a distribution with two parameters (\(\mu\) and \(\sigma^2\)) with the property that the maximum likelihood estimators for \(\mu\) and \(\sigma^2\) are the sample mean and variance, respectively. Additionally, we require that the distribution be symmetric about the mean. We will suppose that \[ f(x;\mu,\sigma^2)=\tfrac{1}{\sigma}g\left ( \tfrac{x-\mu}{\sigma} \right ) \] Where \(g(x)\) is a distribution with zero mean and unit variance, such that \(g(-x)=g(x)\). Let us also define \(\eta(x)=g'(x)/g(x)\), which satisfies \(\eta(-x)=-\eta(x)\). Then the log-likelihood for samples \(x_1,x_2,...,x_N\) is given by: \[ \ell(\mu,\sigma^2;\mathbf{x})=\sum_{k=1}^{N}\ln\left (f(x_k;\mu,\sigma^2) \right )=\sum_{k=1}^{N}\ln\left (\tfrac{1}{\sigma}g\left ( \tfrac{x_k-\mu}{\sigma} \right ) \right ) \] The maximization constraints for each parameter are: \[ \frac{\partial }{\partial \mu}\ell(\mu,\sigma^2;\mathbf{x})= \tfrac{-1}{\sigma}\sum_{k=1}^{N}\eta\left (\tfrac{x_k-\mu}{\sigma} \right ) =0 \\ \frac{\partial }{\partial \sigma^2}\ell(\mu,\sigma^2;\mathbf{x})= -\frac{N}{2\sigma^2}-\frac{1}{2\sigma^3}\sum_{k=1}^{N}(x_k-\mu) \eta\left ( \tfrac{x_k-\mu}{\sigma} \right )=0 \] By the supposition, these conditions hold when \[ \mu=\tfrac{1}{N}\sum_{k=1}^{N}x_k=\overline{\mathbf{x}} \\ \sigma^2=\tfrac{1}{N}\sum_{k=1}^{N}(x_k-\overline{\mathbf{x}})^2=\overline{\mathbf{x}^2}-\overline{\mathbf{x}}^2=\mathrm{var}(\mathbf{x}) \] Now, suppose we set \(x_1=a\) and \(x_2=x_3=....=x_N=a-Nb\). Then \(\overline{\mathbf{x}}=a-(N-1)b\). We then have from the first condition: \[ 0=\sum_{k=1}^{N}\eta\left (\tfrac{x_k-(a-(N-1)b)}{\sigma} \right )=\eta\left (\tfrac{(N-1)b}{\sigma} \right )+(N-1)\eta\left (\tfrac{-b}{\sigma} \right ) \] So that \(\eta\left (\tfrac{(N-1)b}{\sigma} \right )=(N-1)\eta\left (\tfrac{b}{\sigma} \right )\). As \(\eta\) is continuous and must satisfy this for any choice of variables, it must be the case that \(\eta(x)=\frac{g'(x)}{g(x)}=Bx\). This differential equation can be easily solved to give \(g(x)=g(0) \cdot e^{\tfrac{B}{2}x^2}\). Requiring that this be a normalized function puts it in the form \(g(x)=\frac{b}{\sqrt{\pi}} \cdot e^{-b^2x^2}\) for some \(b>0\), which makes \(\eta(x)=-2b^2x\). Using the fact that \(g(x)\) has unit variance gives \(b=\tfrac{1}{\sqrt{2}}\). Now we turn to the second constraint \[ 0=-\frac{N}{2\sigma^2}+\frac{1}{2\sigma^4}\sum_{k=1}^{N}(x_k-\mu)^2 \] Which easily gives: \[ \sigma^2=\frac{1}{N}\sum_{k=1}^{N}(x_k-\overline{\mathbf{x}})^2=\mathrm{var}(\mathbf{x}) \] Thus the normal distribution has the required properties. Moreover, the sample mean and variance are the maximum likelihood estimators for the normal distribution. This is the original method Carl Friedrich Gauss used to derive the distribution.
Convolution
Recall that the convolution of two functions is defined as \[ a(x)*b(x)=\int_{-\infty}^{\infty}a(t)b(x-t)dt \] For the case of two general Gaussians, we use the general integral above: \[ \\ g_{\mu_1,\sigma_1}*g_{\mu_2,\sigma_2} =\frac{1}{\sqrt{2\pi\sigma_1^2}}\frac{1}{\sqrt{2\pi\sigma_2^2}} \int_{-\infty}^{\infty}e^{-\tfrac{1}{2\sigma_1^2}\left ( t-\mu_1 \right )^2}e^{-\tfrac{1}{2\sigma_2^2}\left ( x-t-\mu_2 \right )^2}dt \\ \\ a=\tfrac{1}{2\sigma_1^2}+\tfrac{1}{2\sigma_2^2},\: \: b=\mu_1\tfrac{1}{\sigma_1^2}+(x-\mu_2)\tfrac{1}{\sigma_2^2},\: \: c=\tfrac{\mu_1^2}{2\sigma_1^2}+\tfrac{(x-\mu_2)^2}{2\sigma_2^2} \\ \\ g_{\mu_1,\sigma_1^2}*g_{\mu_2,\sigma_2^2}=\frac{1}{\sqrt{2\pi\sigma_1^2}}\frac{1}{\sqrt{2\pi\sigma_2^2}}e^{\tfrac{b^2}{4a}+c}\sqrt{\pi/a} \\ \\ g_{\mu_1,\sigma_1^2}*g_{\mu_2,\sigma_2^2}=\frac{1}{\sqrt{2\pi(\sigma_1^2+\sigma_2^2)}}e^{-\tfrac{1}{2(\sigma_1^2+\sigma_2^2)}\left ( x-(\mu_1+\mu_2) \right )^2} \] Thus: \[ \bbox[5px,border:2px solid red] { g_{\mu_1,\sigma_1^2}*g_{\mu_2,\sigma_2^2}=\frac{1}{\sqrt{2\pi(\sigma_1^2+\sigma_2^2)}}e^{-\tfrac{1}{2(\sigma_1^2+\sigma_2^2)}\left ( x-(\mu_1+\mu_2) \right )^2}=g_{\mu_1+\mu_2,\sigma_1^2+\sigma_2^2} } \] That is, the convolution of two Gaussians is another Gaussian with mean and variance equal to the sum of the convolved means and variances. This could be much more easily seen from the convolution property of the Fourier transform.
Entropy
The entropy of a probability distribution \(f(x)\) is defined as \[ \bbox[5px,border:2px solid red] { H=-\int_{-\infty}^{\infty}f(x)\ln\left ( f(x) \right )dx } \] We wish to find a function that maximizes this value subject to the constraints that it is normalized, that it has mean \(\mu\) and variance \(\sigma^2\). This can be done using the method of Lagrange multipliers. Namely, we define the function \[ \begin{align*} L(f,a,b,c)=& -\int_{-\infty}^{\infty}f(x)\ln(f(x))dx+a\left [ 1-\int_{-\infty}^{\infty}f(x)dx \right ]\\ & +b \left[\mu-\int_{-\infty}^{\infty}xf(x)dx \right ]+c\left[ \sigma^2-\int_{-\infty}^{\infty}(x-\mu)^2f(x)dx \right ] \end{align*} \\ \\ L(f,a,b,c)=a+b\mu+c\sigma^2-\int_{-\infty}^{\infty}f(x)\left [\ln(f(x))+a+bx+c(x-\mu)^2 \right ]dx \] Using the Euler-Lagrange Equation, we find that \(f(x)\) must satisfy \[ \frac{\partial }{\partial f}f(x)\left [\ln(f(x))+a+bx+c(x-\mu)^2 \right ]=0 \\ \ln(f(x))+a+bx+c(x-\mu)^2+1=0 \\ f(x)=e^{-1-a-bx-c(x-\mu)^2} \] Combining this with the conditions from the other Lagrange factors, we immediately find: \[ f(x)=\frac{1}{\sqrt{2\pi\sigma^2}}e^{-\tfrac{1}{2\sigma^2}(x-\mu)^2}=g_{\mu,\sigma^2}(x) \] That is, a normal distribution has the maximal entropy of all distributions with a given mean and variance. The entropy is then given by \[ \bbox[5px,border:2px solid red] { H=\frac{1}{2}\ln \left( 2\pi e\sigma^2 \right ) } \] This can be seen as a way of understanding the central limit theorem, described below, as wel las the stability under convolution: summing independent random variables increases the entropy, and already maximum-entropy distributions can only add their variances.
Central Limit Theorem
Suppose X is a random variable with a well-defined mean and variance. That is, \(\mathrm{E}(X)=\mu\) and \(\mathrm{E}((X-\mu)^2)=\sigma^2\) both exist. We want to find the distribution of \[ \bbox[5px,border:2px solid red] { Z_n=\frac{\overline{X_n}-\mu}{\sigma/\sqrt{n}} } \] Where \[ \overline{X_n}=\frac{1}{n}\sum_{k=1}^{n}X_k \] and the \(X_k\) are all independent variables distributed in the same way as \(X\). It is easy to show that if \(A\) is a random variable with distribution \(a(x)\) and \(B\) is a random variable with distribution \(b(x)\), independent to \(A\). Then the distribution of \(A+B\) is \(a(x)*b(x)\). Given this, it is useful to use the Fourier transforms of the distributions, or something similar to it. In fact, it is most advantageous to use the moment generating function, where the moment generating function of \(X\) is defined as \(M_X(t)=\mathrm{E}(e^{Xt})\). This has the property that \(M_{A+B}(t)=M_{A}(t)M_{B}(t)\). Another useful property is \[ M_{aX+b}(t)=\mathrm{E}(e^{t(aX+b)})=e^{tb}\mathrm{E}(e^{atX})=e^{tb}M_X(at) \] Using this, let us find the moment generating function of \(Z_n\). \[ M_{Z_n}(t)=\mathrm{E}\left ( e^{tZ_n} \right ) =\mathrm{E}\left ( e^{t\frac{\overline{X_n}-\mu}{\sigma/\sqrt{n}}} \right ) \\ \\ M_{Z_n}(t)=e^{-t\sqrt{n}\frac{\mu}{\sigma}}\mathrm{E}\left ( e^{\frac{t}{\sigma\sqrt{n}}\sum_{k=1}^{n}X_k} \right )=e^{-t\sqrt{n}\frac{\mu}{\sigma}}M_X^n\left (\frac{t}{\sigma\sqrt{n}} \right ) \] Based on the definition of the moment generating function, we find \[ M_X(t)=\mathrm{E}\left ( e^{tX} \right )=\mathrm{E}\left ( 1+\frac{t}{1!}X+\frac{t^2}{2!}X^2+\frac{t^3}{3!}X^3+... \right ) \\ M_X(t)=1+t\mathrm{E}(X)+\frac{t^2}{2!}\mathrm{E}(X^2)+\frac{t^3}{3!}\mathrm{E}(X^3)+... \] Using this in our expression above, we get \[ M_{Z_n}(t)=e^{-t\sqrt{n}\frac{\mu}{\sigma}}\left ( 1+\frac{t}{\sigma\sqrt{n}}\mu+\frac{t^2}{2n\sigma^2}\left ( \mu^2+\sigma^2 \right )+O\left ( \sqrt{n^3} \right ) \right )^n \] Taking the limit as \(n\) goes to infinity, we get \[ \bbox[5px,border:2px solid red] { \underset{n \to \infty}{\lim} M_{Z_n}(t)=e^{t^2/2} } \] That is, the scaled mean is asymptotically a standard normal. Generally, the mean of \(n\) i.i.d. random variables with mean \(\mu\) and variance \(\sigma^2\) approaches a normal distribution with mean \(\mu\) and variance \(\sigma^2/n\).
Gaussian Limits of Other Distributions
For several families of probability distributions, both discrete and continuous, with both finite and infinite support, in the limit of certain parameters, and scaling for the mean and variance, the distribution converges to a Gaussian.
The general approach will be as follows: suppose the random variable is \(x\) and the distribution \(f(x;\mathbf{p})\) is parametrized by some list of parameters \(\mathbf{p}\). We will turn these parameters into a functions of \(n\) (which we will let tend to infinity), \(\mathbf{p}(n)\) and find the mean \(\mu(n)\) and variance \(\sigma^2(n)\) as functions of \(n\). We then rescale to obtain the new distribution \[ g(z;n)=\sigma(n)\cdot f(\mu(n)+z\cdot\sigma(n);\mathbf{p}(n)) \] This distribution will have zero mean and unit variance. Clearly if \(f\) is normalized, \(g\) will be as well. All we will be looking for is how \(g\) varies with \(z\). To this end, we will take \[ \underset{n \to \infty}{\lim}\frac{\partial }{\partial z} \ln \left (g(z;n) \right ) \] As we expect \(g\) to converge to a standard normal, we expect this limit to be \(-z\).
Another approach may be to use the moment generating function. In particular, we expect \[ \underset{n \to \infty}{\lim} e^{-\tfrac{\mu(n)}{\sigma(n)}t}M_{X;\mathbf{p}(n)}\left ( \frac{t}{\sigma(n)} \right )=e^{t^2/2} \]
Binomial Distribution
\[ f(x;n,p)=\binom{n}{x}p^x(1-p)^{n-x} \\ \mu(n)=np,\: \: \sigma(n)=\sqrt{np(1-p)} \\ M_{X;n}(t)=\left ( 1-p+pe^t \right )^n \] Using the distribution method is rather involved and a discussion of it can be seen in the article on Stirling's approximation. However, the moment generating function method is much simpler: \[ M_{Z;n}(t)=\left ( 1+p\left [e^{t/\sqrt{np(1-p)}}-1 \right ] \right )^ne^{-t\sqrt{\frac{np}{1-p}}} \\ M_{Z;n}(t)=\left ( 1+\frac{pt}{\sqrt{np(1-p)}}+\frac{t^2}{2n(1-p)}+O(n^{-3/2}) \right )^ne^{-t\sqrt{\frac{np}{1-p}}} \] Thus, in the limit of large n \[ \underset{n \to \infty}{\lim}M_{Z;n}(t)=e^{\frac{t^2}{2(1-p)}}e^{-\frac{pt^2}{2(1-p)}}=e^{t^2/2} \]
Poisson Distribution
\[ f(x;\lambda)=e^{-\lambda}\frac{\lambda^x}{x!} \\ \lambda(n)=n,\: \: \mu(n)=n,\: \: \sigma(n)=\sqrt{n} \\ M_{X;n}(t)=e^{n\left ( e^t-1 \right )} \] We will make use of the fact that \[ \ln(x!)=(x+\tfrac{1}{2})\ln(x)-x+O(1) \] Which follows from Stirling's approximation. It follows that \(\ln(g)\) and its derivative take the form \[ \ln(g(z;n))=(n+z\sqrt{n})\ln(n)-\left (n+z\sqrt{n}+\tfrac{1}{2} \right )\ln(n+z\sqrt{n})+z\sqrt{n}+O(1) \\ \frac{\partial }{\partial z}\ln(g(z;n))=\sqrt{n}\ln(n)-\sqrt{n}\ln\left (n+z\sqrt{n} \right )-\frac{1}{2\left (n+z\sqrt{n} \right )} \] And so it follows that \[ \underset{n \to \infty}{\lim}\frac{\partial }{\partial z}\ln(g(z;n))=-z \] This analysis is much more simple when approached with the moment generating function method \[ M_{Z;n}(t)=e^{n\left ( e^\frac{t}{\sqrt{n}}-1 \right )}e^{-t\frac{n}{\sqrt{n}}} =e^{\frac{t^2}{2}+O(n^{-1/2})} \] Clearly then, in the limit \[ \underset{n \to \infty}{\lim}M_{Z;n}(t)=e^{t^2/2} \]
Gamma Distribution
\[ f(x;n,\theta)=\frac{x^{n-1}}{(n-1)!\theta^n}e^{-x/\theta} \\ \mu(n)=n\theta,\: \: \sigma(n)=\sqrt{n}\theta \\ M_{X;n}(t)=\left ( 1-\theta t \right )^{-n} \] Proceeding as above \[ \ln(g(z;n))=(n-1)\ln(n\theta+z\sqrt{n}\theta)-(n\theta+z\sqrt{n}\theta)/ \theta \\ \frac{\partial }{\partial z}\ln(g(z;n))=\frac{n-1}{\sqrt{n}+z}-\sqrt{n} \\ \underset{n \to \infty}{\lim}\frac{\partial }{\partial z}\ln(g(z;n))=-z \] However, it is simpler to use \[ M_{Z;n}(t)=\left ( 1-\frac{t}{\sqrt{n}} \right )^{-n}e^{-t\sqrt{n}} \\ \underset{n \to \infty}{\lim}M_{Z;n}(t)=e^{t^2/2} \]
Beta Distribution
\[ f(x;a,b)=\frac{x^{a-1}(1-x)^{b-1}}{\mathrm{B}(a,b)} \\ a=\alpha n,\: \: b=\beta n,\: \: \mu(n)=\frac{a}{a+b}=\frac{\alpha}{\alpha+\beta}, \nu=\frac{\beta}{\alpha+\beta}, \\ \sigma(n)=\sqrt{\frac{ab}{(a+b)^2(a+b+1)}}=\frac{1}{\sqrt{n}}\sqrt{\frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+\tfrac{1}{n})}}\approx\frac{1}{\sqrt{n}}s \] Where \(s=\sqrt{\tfrac{\alpha\beta}{(\alpha+\beta)^3}}\). This approximation is not significant since we will be taking the limit of \(n\), and the approximation is quite good even for moderately sized \(n\). \[ \ln(g(z;n))=(\alpha n-1)\ln\left ( \mu+\tfrac{s}{\sqrt{n}}z \right )+(\beta n-1)\ln\left ( \nu-\tfrac{s}{\sqrt{n}}z \right )-\ln(\mathrm{B}(\alpha n,\beta n)) \\ \frac{\partial }{\partial z}\ln(g(z;n))=s\left [\frac{\alpha n-1}{\mu\sqrt{n}+sz}-\frac{\beta n-1}{\nu\sqrt{n}-sz} \right ] \\ \underset{n \to \infty}{\lim}\frac{\partial }{\partial z}\ln(g(z;n))=-s^2z\frac{\alpha+\beta}{\mu\nu}=-z \]
Correlated Normal Variables
Often we encounter variables that are not independent. These can be normally distributed or closely modeled as normally distributed. Suppose \(X,Y,Z\) are independent standard normal random variables. Let us define: \[ V=aX+bY+c \\ W=dX+fZ+g \] As sums of Gaussian distributions, \(V\) and \(W\) will each be themselves Gaussians. Since \(X,Y,Z\) are independent with zero mean, it follows that \(\mathrm{E}(XY)=\mathrm{E}(XZ)=\mathrm{E}(YZ)=0\). We then find the means and variances of \(V\) and \(W\): \[ \mu_V=\mathrm{E}(V)=c,\: \:\: \: \: \sigma_V^2=\mathrm{E}((V-\mu_V)^2)=a^2+b^2 \\ \mu_W=\mathrm{E}(W)=g,\: \:\: \: \: \sigma_W^2=\mathrm{E}((W-\mu_W)^2)=d^2+f^2 \] The covariance of \(V\) and \(W\) is given by \[\sigma_{VW}=\mathrm{E}((V-\mu_V)(W-\mu_W))=ad\] Thus the two variables are correlated, and we can achieve any desired correlation. We can achieve the same effect using only two variables: \[ \begin{bmatrix} V\\ W \end{bmatrix}=\begin{bmatrix} \sigma_V\sqrt{1-\rho^2} & \rho\sigma_V\\ 0 & \sigma_W \end{bmatrix} \begin{bmatrix} X\\ Y \end{bmatrix}+\begin{bmatrix} \mu_V\\ \mu_W \end{bmatrix} \] Where \(\rho\) is the correlation coefficient between \(V\) and \(W\). When defined this way, the joint probability density function is given by \[ f(v,w)=\frac{1}{2\pi\sigma_V\sigma_W\sqrt{1-\rho^2}}e^{-\tfrac{1}{2(1-\rho^2)}\left [ \tfrac{(v-\mu_V)^2}{\sigma_V^2}+\tfrac{(w-\mu_W)^2}{\sigma_W^2}-2\rho\tfrac{(v-\mu_V)(w-\mu_W)}{\sigma_V\sigma_W} \right ]} \] More generally, suppose that we have \(n\) correlated Gaussian variables represented in a column matrix: \(\mathbf{x}\). Let \(\boldsymbol{\mu}=\mathrm{E}(\mathbf{x})\).The covariance matrix is defined as \(\mathbf{\Sigma}=\mathrm{E}\left (\mathbf{(x-\boldsymbol{\mu})}\mathbf{(x-\boldsymbol{\mu})}^T \right )\), so that \(\mathbf{\Sigma}_{a,b}=\mathrm{E}\left ((x_a-\mu_a)(x_b-\mu_b) \right ) \). Then the probability density function is given by: \[ \bbox[5px,border:2px solid red] { f(\mathbf{x})=\frac{1}{\sqrt{(2\pi)^n\left | \mathbf{\Sigma} \right |}}e^{-\tfrac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^T\mathbf{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})} } \] Let \(\mathbf{z}\) be a column matrix of \(n\) independent standard normal random variables. Let \(M\) be an \(n\) by \(n\) matrix such that \(MM=\mathbf{\Sigma}\). Then if we take \(\mathbf{x}=M\mathbf{z}+\boldsymbol{\mu}\), then \(\mathbf{x}\) will be distributed with the density function given just above.
Wiener Processes and Stochastic Differential Equations
Let \(z_k\) be independent standard normal random variables for each \(k\). We define a random variable \(W\) as a function of \(t \geq 0\) to be \[ \bbox[5px,border:2px solid red] { W(t)=\underset{n \to \infty}{\lim}\frac{1}{\sqrt{n}}\sum_{1 \leq k \leq nt} z_k } \] It can be seen from the proceeding sections that at each \(t\), \(W(t)\) is a normal random variable with zero mean and variance \(t\). Moreover, \(W(s+t)-W(s)\) can be seen to have the same distribution and is independent of \(W(s)\). This is the definition of a Wiener process, also known as Brownian motion. Note that the summed variables don't even need to be normal, but just zero-mean and unit-variance, and the central limit theorem will guarantee that the process still tend to the same distribution.
One way to understand this is using the notation \(dW=z_t\sqrt{dt}\) where \(z_t\) is understood as an independent standard normal random variable for each \(t\). That way \(dW\) is a normal random variable with variance \(dt\). Then \[W(t)=\int_{0}^{t}dW\] It is clear from the definition that, for \(c>0\): \(\frac{1}{\sqrt{c}}W(ct)\) is distributed in exactly the same way and has all the same properties as \(W(t)\) (It's another Wiener process). Thus \(W(t)\) exhibits self-similarity. It is also evident that, for the same process, the covariance of \(W(s)\) and \(W(t)\) is \(\min(s,t)\). Suppose that \(V(t)=W(f(t))\) where \(f(t)\) is a non-decreasing function. Then it follows that \(dV=\sqrt{f'(t)}dW\).
If we wish to simulate a Wiener process at the times \(t_k\). This can be done incrementally as: \[ W(t_{k+1})=W(t_k)+\sqrt{t_{k+1}-t_k}\cdot z_k \] Where the \(z_k\) are independent standard normal variables.
The Wiener process and its stochastic differential \(dW\) is fundamental to the study of stochastic differential equations. Stochastic differential equations have many applications in physics, finance, population growth, and econometrics. One common general form for these equations is \[ \bbox[5px,border:2px solid red] { dX=\mu(X,t)dt+\sigma(X,t)dW } \] In this equation, \(\mu(X,t)\) is the drift and \(\sigma(X,t)\) is the spread or diffusion. The density function for \(X\), \(f(x,t)\), satisfies the Fokker-Planck equation: \[ \frac{\partial }{\partial t}f(x,t)=-\frac{\partial }{\partial x}\left [ \mu(x,t)f(x,t) \right ]+\frac{1}{2}\frac{\partial^2 }{\partial x^2}\left [\sigma^2(x,t)f(x,t) \right ] \] Suppose we wish to find what happens to the new variable \(Y=g(X,t)\), where \(g(x,t\) is a smooth, multiply differentiable function. We can expand \(dY\) in a Taylor series: \[ dY=\frac{\partial g}{\partial t}dt+\frac{\partial g}{\partial x}dX+\frac{1}{2}\frac{\partial^2 g}{\partial x^2}dX^2+... \] Subsituting in the expansion for \(dX\), and taking advantage of the fact that \(dW^2=dt\) in the proper statistical sense: \[ \bbox[5px,border:2px solid red] { dY=\left [\frac{\partial g}{\partial t}+\mu\frac{\partial g}{\partial x}+\frac{\sigma^2}{2}\frac{\partial^2 g}{\partial x^2} \right ]dt+\sigma\frac{\partial g}{\partial x}dW } \] All the remaining terms are higher than first order and hence will not be significant. This is Ito's lemma and allows us to make changes of variable for various stochastic processes, which allows us to solve several types of stochastic differential equations.
General Integrated Brownian Motion
Let us define: \[ V(t)=\int_{0}^{t}f'(s)W(s)ds=\int_{0}^{t}(f(t)-f(s))dW \] A fact that follows easily from the definition of the Wiener process, known as Ito isometry, is: \[ \bbox[5px,border:2px solid red] { \mathrm{E}\left ( \left [\int_0^t F(s)dW \right ]^2 \right )=\mathrm{E}\left ( \int_0^t F(s)^2ds \right ) } \] Given this, it follows that (for \(a>0\)): \[ \mathrm{Var}\left ( V(t) \right)=\int_0^t \left ( f(t)-f(s) \right )^2ds \\ \mathrm{cov}\left ( V(t+a),V(t) \right)=\int_0^t \left ( f(t+a)-f(s) \right )\left ( f(t)-f(s) \right )ds \] For instance, when \(f(t)=t\), \(\mathrm{cov}\left ( V(t+a),V(t) \right)=t^2\tfrac{3a+2t}{6}\). Note that as the sum of zero-mean Gaussians, all of the random variables are Gaussian as well.
A way to simulate such a process at times \(t_k\) is to use the above to determine the covariance matrix \(\mathbf{\Sigma}\) of all the \(V(t_k)\). Then the process can be simulated by the method described in the section on correlated normal variables. Namely, we take a vector of iid standard normal variables \(\mathbf{z}\), determine a matrix \(M\) such that \(MM=\mathbf{\Sigma}\), then \(V(\mathbf{t})=M\mathbf{z}\). This technique applies to any correlated process.
Note also, from the note above, we easily derive the formula, for \(f(t)\) some function and constant \(A \geq 0\): \[ \bbox[5px,border:2px solid red] { \int_0^t A\cdot f(s)dW=A\cdot W\left ( \int_0^t f^2(s)ds \right ) } \]
Brownian Motion with Drift
Suppose that \[ \bbox[5px,border:2px solid red] { dX=\mu dt+\sigma dW } \] Where \(\mu\) and \(\sigma\) are constants. Let us define \(Y=g(X,t)\), where \(g(x,t)=\frac{x-\mu t}{\sigma}\). By Ito's lemma, the differential becomes: \[ dY=\left [\frac{-\mu}{\sigma}+\mu\frac{1}{\sigma}+0 \right ]dt+\sigma\frac{1}{\sigma}dW=dW \] Thus \(\bbox[5px,border:2px solid red] {X(t)=\sigma W(t)+\mu t+X_0}\). It also follows that X is a normal random variable with mean \(\mu t+X_0\) and variance \(\sigma^2 t\).
Geometric Brownian Motion
Suppose that \[ \bbox[5px,border:2px solid red] { dX=X\mu dt+X\sigma dW } \] Where \(\mu\) and \(\sigma\) are constants. This is a process where the percentage change \(\frac{dX}{X}\) follows a Brownian motion with drift. Let us define \(Y=\ln(X)\). By Ito's lemma, the differential becomes: \[ dY=\left [0+\mu X\frac{1}{X}-\frac{\sigma^2 X^2}{2}\frac{1}{X^2} \right ]dt+\sigma X\frac{1}{X}dW=\left ( \mu-\frac{\sigma^2}{2} \right )dt+\sigma dW \] But this has the same form as the case of Brownian motion with drift. Thus \(Y(t)=\sigma W(t)+\left (\mu-\frac{\sigma^2}{2} \right ) t+Y_0\), from which it follows that \[ \bbox[5px,border:2px solid red] { X(t)=X_0 \cdot e^{\left (\mu-\frac{\sigma^2}{2} \right ) t+\sigma W(t)} } \] Moreover, \(X\) will be log-normally distributed with \(\mu\)-parameter equal to \(\ln(X_0)+(\mu-\tfrac{\sigma^2}{2})t\) and \(\sigma^2\)-parameter equal to \(\sigma^2 t\).
Ornstein–Uhlenbeck process
Suppose that \[ \bbox[5px,border:2px solid red] { dX=-X\theta dt+\sigma dW } \] Where \(\theta\) and \(\sigma\) are constants. Let us define \(Y=g(X,t)\), where \(g(x,t)=x e^{\theta t}\). By Ito's lemma, the differential becomes: \[ dY=\left [\theta X e^{\theta t}-X\theta e^{\theta t}+0 \right ]dt+\sigma e^{\theta t}dW=\sigma e^{\theta t} dW \] It follows that \[ Y=Y_0+\frac{\sigma}{\sqrt{2\theta}} W(e^{2\theta t}-1) \] And so \[ \bbox[5px,border:2px solid red] { X=X_0e^{-\theta t}+\frac{\sigma}{\sqrt{2\theta}}e^{-\theta t} W(e^{2\theta t}-1) } \] Another common formulation has \[ \bbox[5px,border:2px solid red] { dX=(\xi-X)\theta dt+\sigma dW } \] Which has the solution \[ \bbox[5px,border:2px solid red] { X(t)=X_0e^{-\theta t}+(1-e^{-\theta t})\xi +\frac{\sigma}{\sqrt{2\theta}}e^{-\theta t} W(e^{2\theta t}-1) } \] This shows that \(X\) is at each time a Gaussian with mean \(X_0e^{-\theta t}+(1-e^{-\theta t})\xi\), and variance \(\tfrac{\sigma^2}{2\theta}(1-e^{-2\theta t})\). It follows that, asymptotically, the distribution tends toward a Gaussian with mean \(\xi\) and variance \(\tfrac{\sigma^2}{2\theta}\). Thus the process is mean-reverting. The covariance of \(X\) between times \(a\) and \(b\) is given by \[ \mathrm{cov}\left ( X(a),X(b) \right )=\frac{\sigma^2}{2\theta}\left ( e^{-\theta|a-b|}-e^{-\theta(a+b)} \right ) \]
Generating Gaussian Random Samples
Often we wish to simulate independent samples from a standard normal distribution. By the central limit theorem, one option is to simulate \(N\) uniform random variables, and take the normalized mean. However, this is quite inefficient. We descibe two methods below:
2D-Distribution Based
The typical method of evaluating the Gaussian integral is to square the integral and evaluate it as a two-dimensional integral by an advantageous change of variables. For example: \[ I=\int_{-\infty}^{\infty}e^{-x^2/2}dx \\ I^2=\int_{-\infty}^{\infty}\int_{-\infty}^{\infty}e^{-(x^2+y^2)/2}dxdy =\int_{0}^{2\pi}\int_{0}^{\infty}e^{-r^2/2}rdrd\theta \\ I^2=\left.\begin{matrix} \theta \end{matrix}\right|_0^{2\pi}\cdot \left [ -e^{-r^2/2} \right ]_0^\infty =2\pi \] This implies that \(\theta\) is uniformly distributed between \(0\) and \(2\pi\), and \(r\) is distributed with a CDF given by \(F(r)=1-e^{-r^2/2}\). Thus, if \[ \begin{align} x &=r\cos(\theta), &\: \: y &= r\sin(\theta)\nonumber\\ \theta &= 2\pi U_1, &\: \: r &= \sqrt{-2\ln(U_2)}\nonumber\\ \nonumber \end{align} \] Where \(U_1\) and \(U_2\) are uniform random variables on \((0,1]\), then \(x\) and \(y\) are independent standard normal variables. This is called the Box-Muller Method.
An equivalent method, Marsaglia's Polar Method, samples \((u,v)\) uniformly over the unit disk, then returns \[ x=u\sqrt{\frac{-2\ln(u^2+v^2)}{u^2+v^2}} \\ y=v\sqrt{\frac{-2\ln(u^2+v^2)}{u^2+v^2}} \] as two independent standard Gaussian variables.
Distribution-Geometry Based
This method, known as the Ziggurat Method is quite general: it can be applied to any random variable with a monotone decreasging density function, or one with a a symmetric one that is monotone decreasing around a central mode. The main principle we use is that the x-coordinate of a point uniformly selected under a given probability density function is distributed with that probability density. The Ziggurat algorithm cuts the distribution into N layers, stacked in the way a ziggurat appears. The less area there is in the tail and in the ziggurat but outside the curve, the more efficient the algorithm will be, as fewer resamplings will be needed.
For the specific case of a normal distribution, the algorithm proceeds as follows:
- Pre-compute layer limits and probabilities . We select N y-limits such that \(0 = y_0 < y_1 < y_2 < ... < y_{N-1} < y_N=1\). We then get the probabilities \[ p_k={\frac{2}{\sqrt{\pi}}}\int_{y_{k}}^{y_{k+1}}\sqrt{-\ln(x)}dx \] We also define \(x_k=-2\ln(y_k)\). Once we have the limits and probabilities, they can be stored and used for all further calculations. Note that the first layer includes the tail (\(x_0=\infty\)), which will have to be dealt with differently than the other layers.
- Randomly select a layer. From the N layers, randomly pick one with probability corresponding to the layer probability computed above: save the selected index \(k\) (layer \(k\) spans from \(y_k\) to \(y_{k+1}\)).
- Sample layer. If \(k>0\), pick a uniform random value \(u\) between 0 and 1. Then \(x=u \cdot x_{k}\). If \(k=0\), use the tail algorithm.
- Test sampled value. If \(x < x_{k+1}\), return \(y\). Otherwise, pick a uniform random value \(v\) between 0 and 1. define \(y=y_k+v \cdot (y_{k+1}-y_k)\). If \(y < e^{-x^2/2}\), return x. Otherwise, resample the layer (return to step 2).
- Tail algorithm. If \(k=0\), pick uniform random values \(u_1,u_2,u_3,u_4\) between 0 and 1. If \(u_1 < x_1\cdot y_1\), return \(x=x_1 \cdot u_2\). Otherwise, let \(\xi=-\ln(u_3)/x_1\) and \(y=-\ln(u_4)\). If \(2y > \xi^2\) return \(x=x_1+\xi\). Otherwise, resample the layer (return to step 2).
- Sign of value. Pick a random bit \(b\) which is equally likely to be 0 or 1. Set \(x \to (-1)^b x\).
Gaussian Kernel Smoothing and Density Estimation
Gaussian functions serve as very convenient and useful kernels for a number of applications. A kernel is a function that depends on a location and a value at that location and spreads that value to nearby locations. This is useful for smoothing, interpolating, and making discrete sets continuous. The Gaussian is an attractive kernel because it is normalized, it has natural location (\(\mu\)) and width (\(\sigma\)) parameters, tails off quickly, and is not abitrarily constrained to a finite window. Applying the kernel generally operates like a convolution: for discrete data \(\left \{ (x_1,y_1),(x_2,y_2),...,(x_N,y_N) \right \}\) we define a function as \[ F(x)=\nu(x)\cdot\sum_{k=1}^{N}y_k\cdot g_{0,\sigma^2}(x-x_k) \] Where \(\nu(x)\) is a normalizing or scaling function. For continuous data, \(f(x)\) defined for all \(x\), then the application of the kernel gives: \[ F(x)=\nu(x)\cdot [f*g_{0,\sigma^2}(x)] \] The parameter \(\sigma\) is the only one left unspecified, and allows us to have control over the degree of smoothing. In higher dimensions, it may be asymmetric, including off-diagonal terms. It is possible to vary it for different \(x_k\).
Gaussian Blur
Given an image, often it is desirable to remove noise, soften sharp edges, or remove detail. This is useful, for example, to make backgrounds draw less attention, or, in edge-detection, to reduce the number of detected features. Several processes tend to produce noise or sharp edges and so it is advantageous to follow them with a process to reduce such unwanted features. By using a Gaussian kernel, we can achieve this desired type of smoothing, with the degree of smoothing controlled by \(\sigma\) (in units of pixels). One major advantage of the Gaussian kernel, particularly for image processing, is that, for diagonal covariance matrices, it can be decomposed into sequential one-dimensional convolutions, and hence is generally much more efficient than other kernels. Note that, for color images, the process is done one each of the different RGB channels.
For practical purposes, the Gaussian kernels used are rarely infinite in extent, and need not be continuous. Rather, a \((2n+1)\times(2n+1)\) pixel kernel is used, where \(n\geq 3\sigma\), which is just as effective, nearly indistinguishable, and far more efficient. This kernel, for \(0\leq i,j\leq 2n\), can be given by: \[ K(i,j)=\frac{1}{\kappa^2}e^{-\tfrac{1}{2\sigma^2}\left [ \left (i-n \right )^2+\left (j-n \right )^2 \right ]} \] Where \[ \kappa=\sum_{j=-n}^{n}e^{-\tfrac{j^2}{2\sigma^2}} \]
Gaussian Kernel Smoothing
Given a discrete set of x-values and their corresponding y-values, we define a function as: \[ F(x)=\frac{\sum_{k=1}^{N}y_k \cdot g_{0,\sigma^2}(x-x_k)}{\sum_{k=1}^{N}g_{0,\sigma^2}(x-x_k)} \] This function has the property that points near \(x_k\) will have values close to \(y_k\). In fact, in the limit as \(\sigma^2\) goes to zero, the function converges to the nearest-neighbor or Voronoi map. By adjusting \(\sigma^2\), we can adjust how quickly the function transitions between values. This is most easily seen from a one-dimensional example: suppose we have the two data points \(\{(x_1,y_1),(x_2,y_2)\}\). Then the function above can be equivalently written as: \[ F(x)=\frac{y_1+y_2}{2}+\frac{y_2-y_1}{2}\tanh\left ( \left [ \frac{x_2-x_1}{2\sigma^2} \right ]\left ( x-\tfrac{x_1+x_2}{2} \right ) \right ) \] The rise span of this function is proportional to \(\frac{\sigma^2}{x_2-x_1}\). This behavior is somewhat opposite to what may be desired, namely, that more widely separated points have shorter rise spans. This can either be simply accepted as a feature of the method, or different \(\sigma\) can be used at different points, depending on the distance to their nearest-neighbors.
This type of smoothing has a great number of applications, as it can readily be used in multivariate or vector data (for both \(x\) and \(y\)). It even permits easy adaptation to non-euclidean spaces, e.g. over the surface of a sphere.
The method however has a few drawbacks. One is that care needs to be taken in cases where the numerator and denominator are very small. Another is that evaluating the function can be rather computationally intensive when needed to be evaluated at a large number of locations or when \(N\) is large. However, the method is quite powerful and admits for much flexibility to provide smooth interpolations and extrapolation for discrete data.
Kernel Density Estimation
Suppose we sample \(N\) times from an unknown probability distribution, and then wish to estimate the density function. If we had a guess as tothe form of the distribution, we could use date to estimate the underlying parameters. However, often the distribution is not of a known kind so this is not pragmatic. Instead we can estimate the density from the sample values \(x_1,x_2,...,x_N\) as \[ f(x)=\frac{1}{N}\sum_{k=1}^{N}g_{0,\sigma^2}\left ( x-x_k \right ) \] Note that this density function has mean \(\frac{1}{N}\sum_{k=1}^{N}x_k=\overline{x}\) and variance \(\sigma^2+\frac{1}{N}\sum_{k=1}^{N}(x_k-\overline{x})^2\).
This can be extended to multiple dimensions in a straightforward way, although in that case the variance may be replaced either by a matrix proportional to the sample covariance matrix, or by a diagonal matrix. For the one-dimensional case, a general rule of thumb for picking the variance is \(\sigma=\sqrt{\mathrm{var}\left ( \mathbf{x} \right )}\left ( \frac{4}{3n} \right )^{1/5}\). Another, rather aggressive approximation is \(\sigma=\tfrac{1}{2}\max\left ( \Delta x \right )\) where \(\Delta x\) is the difference between successive sorted sample values.
Normal Distributions in Non-Euclidean Spaces
The distributions we have so far looked at exist in a Euclidean space. That is, the space in which the random variables exist and interact is taken to be flat, having zero curvature everywhere. In general, for two vectors in an (N+1)-dimensional space, with constant curvature \(K\) (\(K=0\) means a flat or Euclidean space, \(K>0\) means a positively curved or elliptical space, and \(K<0\) means a negatively curved or hyperbolic space), we can define a sort of inner product, called a bilinear form: \[ \bbox[5px,border:2px solid red] { \mathbf{a}\cdot\mathbf{b}=a_0 b_0+K \sum_{j=1}^N a_jb_j } \] We will be looking exclusively as unit vectors, i.e. vectors that satisfy \(\mathbf{v}\cdot\mathbf{v}=|\mathbf{v}|^2=v_0^2+K \sum_{j=1}^N v_j^2=1\).
In general, the distance between the points corresponding to the unit vectors \(\mathbf{a}\) and \(\mathbf{b}\) is \[ \bbox[5px,border:2px solid red] { d(\mathbf{a},\mathbf{b})=\tfrac{1}{\sqrt{K}}\cos^{-1}\left ( \mathbf{a}\cdot\mathbf{b} \right )=\tfrac{1}{\sqrt{-K}}\cosh^{-1}\left ( \mathbf{a}\cdot\mathbf{b} \right ) } \] Where the second form is more easily applicable for \(K<0\). Note that Euclidean space must be approached as the limit as \(K \to 0\). Moreover, the zeroth dimension in Euclidean space is just a bookkeeping device: as only the zeroth dimension contributes to the magnitude of the vector, the other components are free to take on any magnitude. However, note that the Euclidean distance is indeed given as the limit of the general distance formula: \[ d(\mathbf{a},\mathbf{b})=\underset{K \to 0}{\lim} \frac{1}{\sqrt{K}}\cos^{-1}\left ( \sqrt{1-K\sum_{j=1}^{N}a_j^2} \sqrt{1-K\sum_{j=1}^{N}b_j^2}+K \sum_{j=1}^N a_jb_j \right ) \\ d(\mathbf{a},\mathbf{b})=\sqrt{\sum_{j=1}^N (a_j-b_j)^2} \] A more useful form for differential geometry is to impose the unit-vector condition directly, and use generalized polar coordinates use the parametrization \[ r=d(\mathbf{x},[1,\mathbf{0}]) \\ d\boldsymbol{\psi}^2=d\theta_1^2+\sin^2(\theta_1)d\theta_2^2+\sin^2(\theta_1)\sin^2(\theta_2)d\theta_3^2+... \] Where the \(\theta\)s are generalized angles. Then the distance metric (the differential distance between nearby points as a function of their coordinates) is given by \[ \bbox[5px,border:2px solid red] { ds^2=dr^2+\left [ \frac{\sin(r\sqrt{K})}{\sqrt{K}} \right ]^2 d\boldsymbol{\psi}^2 } \] For \(N=2\), if the meric is expressed as \(ds^2=A(v,w) dv^2+B(v,w)dw^2\), the Gaussian curvature \(G\) is given by \[ \bbox[5px,border:2px solid red] { G=\frac{-1}{2\sqrt{AB}}\left ( \frac{\partial }{\partial v}\left [ \frac{1}{\sqrt{AB}}\frac{\partial B}{\partial v} \right ]+\frac{\partial }{\partial w}\left [ \frac{1}{\sqrt{AB}}\frac{\partial A}{\partial w} \right ] \right ) } \] It is easy to verify that for our distance metric, we do indeed get \(G=K\).
Let us take the following probability distribution over unit vectors \(\mathbf{x}\): \[ \bbox[5px,border:2px solid red] { \bbox[5px,border:2px solid red] { f(\mathbf{x})=C\cdot \exp\left (\tfrac{1}{K \sigma^2}\left [\boldsymbol{\mu}\cdot\mathbf{x}-1 \right ] \right ) } } \] Where \(C\) is a normalization constant, and \(\boldsymbol{\mu}\) is a constant unit vector, and \(\sigma^2\) is some positive constant. This is the generalized Von Mises–Fisher distribution, which extends the normal distribution to non-Euclidean geometries. In the limit as \(K \to 0\): \[ f(\mathbf{x})=\frac{1}{(2\pi\sigma^2)^{N/2}} \exp\left (\frac{-1}{2 \sigma^2}\sum_{j=1}^{N}(x_j-\mu_j)^2 \right ) \] Which is the expected Euclidean distribution. The more general multivariate distribution could be recovered by modifying the definition of the bilinear form. In one dimension, the distributions take the form \[ \bbox[5px,border:2px solid red] { f(r)= \left\{\begin{matrix} \frac{\sqrt{K}}{2\pi e^{-\tfrac{1}{K\sigma^2}} I_0\left (\tfrac{1}{K\sigma^2} \right )} \exp\left ( \frac{\cos(r\sqrt{K})-1}{K\sigma^2} \right ) & \: \: \: \: \: \: \: \: K > 0\\ \\ \frac{1}{\sqrt{2\pi\sigma^2} }\exp\left ( -\frac{r^2}{2\sigma^2} \right ) & \: \: \: \: \: \: \: \: K = 0\\ \\ \frac{\sqrt{-K}}{2e^{-\tfrac{1}{K\sigma^2}} K_0 \left (\tfrac{1}{-K\sigma^2} \right )}\exp\left ( \frac{\cosh(r\sqrt{-K})-1}{K\sigma^2} \right ) & \: \: \: \: \: \: \: \: K < 0 \end{matrix}\right. } \] Where \(I_0(x)\) and \(K_0(x)\) are the zero-order modified bessel functions of the first and second kind. It is clear from this that \[ \underset{x\to \infty}{\lim}I_0(x)e^{-x}\sqrt{x}=\frac{1}{\sqrt{2\pi}} \\ \underset{x\to \infty}{\lim}K_0(x)e^{x}\sqrt{x}=\sqrt{\frac{\pi}{2}} \] Here we show two animations, one showing a centered, symmetric distribution with unit variance for different curvatures. Then we also show the same distribution, but off-center.
In two dimensions: Let \(\boldsymbol{\mu}=[1,0,0]\).
Let us find the probability \(\mathrm{P}(r=d(\mathbf{x},\boldsymbol{\mu}) < t)\). It follows from the above that \(\boldsymbol{\mu}\cdot\mathbf{x}=\cos\left (r\sqrt{K} \right )\). From the distance metric, it can easily be seen that a circle with a radius \(r\) will have circumference \(2\pi\tfrac{\sin(r\sqrt{K})}{\sqrt{K}}\). From this the probability can be seen to be given by: \[ \mathrm{P}(r < t)=2\pi C\int_{0}^{t}\frac{\sin(r\sqrt{K})}{\sqrt{K}} \exp \left (\frac{\cos(r\sqrt{K})-1}{K\sigma^2} \right )dr \\ \mathrm{P}(r < t)=2\pi C \sigma^2\left [ 1-\exp \left (\frac{\cos(t\sqrt{K})-1}{K\sigma^2} \right ) \right ] \] For \(t\) as large as possible, this must be one, and so the normalization constant is given by \[ C=\frac{1}{2\pi\sigma^2}\left\{\begin{matrix} \left ( 1-e^{-\tfrac{2}{K\sigma^2}} \right )^{-1} & K>0\\ 1 & K \leq 0 \end{matrix}\right. \] Which makes the probability density function \[ \bbox[5px,border:2px solid red] { f(\mathbf{x})=\frac{\exp\left (\tfrac{1}{K \sigma^2}\left [\boldsymbol{\mu}\cdot\mathbf{x}-1 \right ] \right )}{2\pi\sigma^2}\left\{\begin{matrix} \left ( 1-e^{-\tfrac{2}{K\sigma^2}} \right )^{-1} & K>0\\ 1 & K \leq 0 \end{matrix}\right. } \] The radial CDF is then given by \[ \mathrm{P}(r < t)=\left [ 1-\exp \left (\frac{\cos(t\sqrt{K})-1}{K\sigma^2} \right ) \right ]\left\{\begin{matrix} \left ( 1-e^{-\tfrac{2}{K\sigma^2}} \right )^{-1} & K>0\\ 1 & K \leq 0 \end{matrix}\right. \] If \(K=0\), this is just the usual 2-dimensional Gaussian distribution with equal variances and zero covariance.
If \(K>0\), this is an elliptical Von Mises-Fisher distribution.
If \(K<0\), this is a hyperbolic generalized Von Mises-Fisher distribution.
On the left we show a 2-D Generalized Von Mises-Fisher distribution with \(\mu=1\), \(\sigma^2=1\), for \(|K| \leq 3\).
For the special case of \(K=-1\), we can use the Poincare disk model: on the right we show the 2 dimensional generalized Von Mises-Fisher distribution with \(|\mu|\leq3\) and \(\sigma=0.65\).