Hyperphronesis: July 2014

There are many interesting problems that can be studied with probability theory. Here I will discuss a few of my favorites.

Maxima in Data

Suppose we have a sequence of values of random variables \(\left(X_{k} \right )_{k=1}^{n}\)that are independent and identically distributed with density function \(f_{X}(x)\). We wish to find the expected number of local maxima in the data, that is, values of \(X_{k}\) such that \(X_{k-1} < X_{k} > X_{k+1}\). Let \(y=X_k\). Then, the probability \(p\) that a certain value y is a maximum is that of having \(X_{k-1} < y > X_{k+1}\). As all the \(X\)s are independent and identically distributed, we can calculate this probability by finding \[p= \int_{-\infty}^{\infty}f_{X}(x)F_{X}^2(x)dx\] Where \(F_{X}(x)=\int_{-\infty}^{x}f_{X}(y)dy\) is the cumulative distribution function of \(f_{X}\). The form above is like \(\sum_y P(X_k=y)P(X_{k-1} < y)P(X_{k+1} < y)\), except we use the continuous formulation. However, clearly \(F_{X}'(x)=f_X(x)\), and so the formula becomes \[p= \int_{-\infty}^{\infty}F_{X}'(x)F_{X}^2(x)dx\] But, by elementary calculus, we can change this to \[p= \int_{0}^{1}F^2 dF=\frac{1}{3}\] (the change in bounds arises from the fact that \(F_{X}(-\infty)=0\) and \(F_{X}(+\infty)=1\)). Thus, we expect one-third of the values in the sequence to be local maxima. Likewise, we expect one-third to be local minima, and the remaining third to be neither.

By the same method, we can look for other patterns. For instance, the fraction of data points that are higher than all four of their closest neighbors is \(\frac {1}{5}\). The fraction of data points that are bigger than their closest neighbors and smaller than their next-closest neighbors is \(\frac {1}{30}\). In fact, all the calculations can be made by evaluating integrals of the form \( \int_{0}^{1} x^m (1-x)^n dx\). We can also use results like this to test for non-independent or non-identically distributed data. It may even be possible to use it in fraud or bias detection. Based on next-to-nothing-at-all, I would expect human generated data to fail some of these tests.

We can also find that the distribution of the number of maxima in \(n\) data points, regardless of the probability distribution \(f_{X}\), is approximately normally distributed with mean \(\frac{n}{3}\) and variance \(\frac{2 n}{45}\). Thus, if we found fewer than 2960 or more than 3040 maxima in a list of 9000 data points, we could be 95% confident that the list was not of independent and identically distributed values. We can also run the same test for minima, but for non-maxima-non-minima, the variance is instead \(\frac{2 n}{15}\).
The values for the variances were found empirically. I don't really know how one would go about finding them analytically.

We can also find the distribution of the values of the maxima, which is easily found to be \[g(x)=3 f_{X}(x)F_{X}^2\] Other distributions are similarly found.

Joint Lives

Suppose we stat with \(N\) couples (\(2N\) people), and at a later time, \(M\) of the original \(2N\) people remain. We want to find the expected number of intact couples remaining. Let \(C(M)\) and \(W(M)\) be the expected remaining number of couples and widows respectively when M total people are left. We then note that, as any remaining person is equally likely to be eliminated next, we have: \[ C(M-1)=C(M)-2 \frac{C(M)}{M} \\ W(M-1)=W(M)-\frac{W(M)}{M}+2 \frac{C(M)}{M}\] We can solve this recurrence relation, subject to the constraints \(W(M)+2 C(M)=M\) and \(C(2N)=N, W(2N)=0\), and find that \[ C(M)=\frac{M(M-1)}{2(2N-1)} \\ W(M)=\frac{M(2N-M)}{2N-1} \] If we express M as a fraction of the total starting population: \(M=2xN\), and express \(C\) and \(W\) as fractions of the total population, we find, for \(N\) big: \[ C(x)=x^2 \\ W(x)=x(1-x) \] Also, for the general case of starting out with \(kN\) \(k\)-tuples, the expected number of intact \(k\)-tuples when \(M\) individuals remain is given by: \[K(M)=N \frac{\binom{M}{k}}{\binom{kN}{k}}\] For the case of triples, we have the number of triples, doubles and singles when M individuals remain is given by: \[K_3 (M)= \frac{M(M-1)(M-2)}{3(3N-1)(3N-2)} \\ K_2 (M)= \frac{M(M-1)(3N-M)}{(3N-1)(3N-2)} \\ K_1 (M)= \frac{M(3N-M)(3N-M-1)}{(3N-1)(3N-2)} \] Generally, with the same sense as discussed above, the fraction of the population in a m-tuple, beginning with only k-tuples, when fraction \(x\) of the population remains, is given by: \[K_m (x)=\binom{k-1}{m-1} x^m (1-x)^{k-m}\] In fact, the general form for the expected number can be given as \[ K_m (M)= N \frac{\binom{M}{m}\binom{kN-M}{k-m}}{\binom{kN}{k}} \]

Random Finite Discrete Distribution

Suppose we have a discrete random variable that can take on the values \(1,2,3,...,n\) with probabilities \(p_1,p_2,p_3,...p_n\) respectively, subject to the constraint \(\sum_{k=1}^n p_k=1\). Let \(p\) be an arbitrary value among the \(p\)s. We will take any combination of values for the \(p\)s as equally likely. By looking at the cases of n=2 and n=3, we find that the probability density function of \(p\) is given by \[ f_P(p)=(n-1)(1-p)^{n-2} \] And the cumulative distribution function is given by \[ F_P(p)=1-(1-p)^{n-1} \] The average value of \(p\) is then \[\int_{0}^{1} p(n-1)(1-p)^{n-2}dp=\frac{1}{n}\] And the variance is \[\int_{0}^{1} p^2 (n-1)(1-p)^{n-2}dp-\frac{1}{n^2}=\frac{n-1}{n^2 (n+1)}\] We thus find that the chance that \(p\) is above the average value is \[P\left ( p > \frac{1}{n} \right )=\left ( 1-\frac{1}{n} \right )^{n-1}\] In the limit as n becomes large, this value tends to \(\frac{1}{e}\).
A confidence interval containing fraction x of the total probability, for large n, is given by: \[ \frac{1}{n} \ln \left(\frac{e}{e x +1-x} \right) \leq p \leq \frac{1}{n} \ln \left(\frac{e}{1-x} \right) \] For instance, a \(50 \%\) confidence interval is given by \(\frac{1}{n}\ln \left(\frac{2 e}{1+e}\right) \leq p \leq \frac{1}{n}\ln(2 e)\).

We can also extend this to continuous distributions with finite support if we only consider the net probability of landing in equally-sized bins. While the calculation may break down if the number of possible values is actually infinite, it can be used to get some information about distributions with an arbitrarily large number of possible values.

Maximum of Exponential Random Variables

Suppose we have \(N\) independent and identically distributed exponential random variables \(X_1,X_2,...X_N\) with means \(\mu\). That is, \(f_X (x_k)=\frac{1}{\mu} e^{-\frac{x}{\mu}}\) when \(x \geq 0\) and zero otherwise. Let us interpret the random values as lifetimes for \(N\) units. The exponential distribution has the interesting property of memorylessness, which means that \(P(x > a+b| x > b)=P(x > a)\). We can show this by using the definition: \[ P(x>a+b|x>b)=\frac{P(x>a+b \cap x>b)}{P(x>b)}=\frac{P(x>a+b)}{P(x>b)} \\ P(x>a+b|x>b)=\frac{\int_{a+b}^{\infty}e^{-\frac{x}{\mu}}dx}{\int_{b}^{\infty}e^{-\frac{x}{\mu}}dx}=\frac{e^{-\frac{a+b}{\mu}}}{e^{-\frac{b}{\mu}}}=e^{-\frac{a}{\mu}}=P(x>a) \] In other words, given that a unit lasted \(b\) minutes, the chance that it will last another \(a\) minutes is the same as that it would last \(a\) minutes. We now calculate the probability distribution of the minimum of the \(N\) random variables. The probability that the minimum of \(X_1,X_2,...X_N\) is no less than \(x\) is the same as the probability that \(X_1 \geq x \cap X_2 \geq x \cap...X_N \geq x \). As all the \(X\)s are independent, this can be simplified to a product, and as all the \(X\)s are identically-distributed, we can simplify this further: \[ P(\min(X_1,X_2,...)\geq x)=\left ( P(X_1 \geq x) \right )^N= \left ( \int_{x}^{\infty}\frac{1}{\mu} e^{-\frac{x}{\mu}}\right )^N \\ P(\min(X_1,X_2,...)\geq x)= e^{-\frac{xN}{\mu}} \\ P(\min(X_1,X_2,...)\leq x)= 1-e^{-\frac{xN}{\mu}} \\ f_{\min(X)}(x)=\frac{N}{\mu}e^{-\frac{xN}{\mu}} \] Thus, the average of the minimum of \(X_1,X_2,...X_N\) is \(\frac{\mu}{N}\). We now combine these two facts, the mean minimum vale and the memorylessness. We start with all units operational, and we have to wait an average of \(\frac{\mu}{N}\) until the first one fails. However, given that the first one fails, the expected additional wait time until the next one fails is just \(\frac{\mu}{N-1}\), that is, the expected minimum of \(N-1\) units. Thus, the expected time that the \(m\)th unit fails is given by \[\mu\sum_{k=0}^{m-1}\frac{1}{N-k}\] Thus, the expected maximum time, when the \(N\)th unit fails is \[\mu\sum_{k=1}^{N}\frac{1}{k}\]

More generally, we can look at the distributions of the kth order statistic of \(X_1,X_2,...X_N\). The kth order statistic, denoted \(X_{(k)}\), is defined as the kth smallest value, so that \(X_{(1)}\) is the smallest (minimum) value, and \(X_{(N)}\) is the largest (maximum) value. The pdf is easily found to be: \[ f_{X_{(k)}}(x)=k {N \choose k} F_X^{k-1}(x)\left[1-F_X(x)\right]^{N-k}f_X(x) \] Where \(F_X(x)\) is the cdf of X, and \(f_X(x)\) is the pdf of X. So, in this case, \[ f_{X_{(k)}}(x)=\frac{k}{\mu} {N \choose k} e^{-(N-k+1)x/\mu}\left[1-e^{-x/\mu}\right]^{k-1} \] Thus, the moment generating function is given by: \[ g(t)=\frac{k}{\mu} {N \choose k} \int _0 ^\infty e^{-(N-k+1-\mu t)x/\mu}\left[1-e^{-x/\mu}\right]^{k-1} dx \] By a simple transformation, we find that: \[ g(t)=k {N \choose k} \int _0 ^1 u^{N-k-\mu t}(1-u)^{k-1} du \] This puts the integral in a well-known form, which has the value \[ g(t)=\frac{N!}{\Gamma(N+1-\mu t)}\frac{\Gamma(N-k+1-\mu t)}{(N-k)!} \] By a simple calculation, the cumulants are then given by the surprisingly simple form: \[ \kappa_n=\mu^n(n-1)!\sum_{j=N-k+1}^{N} \frac{1}{j^n} \] Several interesting results follow from this:

For the Nth order statistic (the maximum), we already know that the mean value goes as \(\sum_{j=1}^{N} \frac{1}{j}\). But now we see that the other cumulants go as \((n-1)!\sum_{j=1}^{N} \frac{1}{j^n}\). Thus, the variance converges, in the limit, to \(\mu^2 \frac{\pi^2}{6}\). The skewness converges, in the limit, to \(\frac{12 \sqrt{6}}{\pi^3}\zeta(3)\), and the excess kurtosis converges to \(\frac{12}{5}\). In fact, if we shift to take into account the unbounded mean, the distribution of the maximum converges to a Gumbel distribution. This is a special case of a fascinating result known as the extreme value theorem.
For any given, fixed, finite \(k\geq 0\), \(X_{(N-k)}\) converges, as N goes to infinity, to a non-degenerate distribution with finite, positive variance, if we shift it to account for the unbounded mean.
For k of the form \(k=\alpha N\) (or the nearest integer thereto), for some fixed alpha between 0 and 1, for \(\alpha \neq 1\), in the limit at N goes to infinity, the distribution of \(X_{(\alpha N)}\) become degenerate distributions with all the probability density located at \(\mu\ln\left(\frac{1}{1-\alpha}\right)\). These are, of course, the locations of the \(100\alpha \%\) quantiles, and so \(X_{(\alpha N)}\) is a consistent estimator for the \(100\alpha \%\) quantile.

As a more general result, let us find the cdf of \(X_{(\alpha N)}\) for an arbitrarily distributed X, in the limit as N goes to infinity. The cdf of \(X_{(\alpha N)}\) is given by: \[ F_{X_{(\alpha N)}}(y)=\alpha N {N \choose \alpha N} \int _{-\infty} ^{y} F_X^{\alpha N-1}(x)\left[1-F_X(x)\right]^{N-\alpha N}f_X(x) dx \] As \(f_X(x)=\frac{d}{dx}F_X(x)\), we then have, by a simple substitution: \[ F_{X_{(\alpha N)}}(y)=\alpha N {N \choose \alpha N} \int _{0} ^{F_X(y)} u^{\alpha N-1}\left[1-u\right]^{N-\alpha N} du \] This is the cdf of a Beta distributed random variable, with mean \(\mu=\frac{\alpha N}{N+1}\) and variance \(\sigma^2=\frac{\alpha N (N-\alpha N+1)}{(N+1)^2(N+2)}\). Thus, as N goes to infinity, this will converge in distribution to a degenerate distribution with all the density at \(y=F_X^{-1}(\alpha)\), that is, at the \(100\alpha \%\) quantile of the distribution.

Choosing a Secretary

Suppose we need to hire a secretary. We have \(N\) applicants arrive and we interview them sequentially: once we interview and dismiss an applicant, we cannot hire her. The applicants all have differing skill levels, and we want to pick as qualified an applicant as we can. We want to find the optimal strategy for choosing whom to hire. We easily see that the optimal strategy is something like the following. We consider and reject the first \(K\) applicants. We then choose the first applicant who is better than all the preceding ones. Thus, our problem reduces to finding the optimal value for \(K\). We will do so in a way that maximizes the probability that the most qualified secretary is selected. We thus have the probability: \[ P(\mathrm{best\, is\, chosen})=\sum_{n=1}^{N}P(\mathrm{n^{th}\, is\, chosen} \cap \mathrm{n^{th}\, is\, best}) \\ P(\mathrm{best\, is\, chosen})=\sum_{n=1}^{N}P(\mathrm{n^{th}\, is\, chosen}| \mathrm{n^{th}\, is\, best})P(\mathrm{n^{th}\, is\, best}) \] We then note that each applicant in line is the best applicant with equal probability. That is, \(P(\mathrm{n^{th}\, is\, best})=\frac{1}{N}\). Also, we can find the conditional probabilities. If \(M \leq K\), then \(P(\mathrm{M^{th}\, is\, chosen}| \mathrm{M^{th}\, is\, best})=0\). If the \((K+1)\)th applicant is best, she will certainly be chosen, that is \(P(\mathrm{(K+1)^{th}\, is\, chosen}| \mathrm{(K+1)^{th}\, is\, best})=1\). Also, we find that \(P(\mathrm{(K+m)^{th}\, is\, chosen}| \mathrm{(K+m)^{th}\, is\, best})=\frac{K}{K+m}\), as that is the chance that the second-best applicant among the first \(K+m\) applicants is in the first \(K\) applicants. We thus have \[ P(\mathrm{best\, is\, chosen})=\frac{K}{N}\sum_{n=K+1}^{N}\frac{1}{n} \] Let us assume we are dealing with a relatively large number of applicants. In that case, we can approximate \(\sum_{n=A+1}^{B}\frac{1}{n} \approx \ln \left(\frac{B}{A} \right )\). Thus \[ P(\mathrm{best\, is\, chosen})=\frac{K}{N}\ln \left(\frac{N}{K}\right )=-\frac{K}{N}\ln \left(\frac{K}{N}\right ) \] If we then let \(x=\frac{K}{N}\), we just need to maximize \(-x\ln(x)\), which happens at \(x=e^{-1}\). From this, we find that \(P(\mathrm{best\, is\, chosen})=e^{-1}\). Thus, the best strategy is to interview and reject the first \(36.8 \%\) of the applicants, and then choose the next applicant who is better than all the preceding ones. This will get us the best applicant with a probability of \(36.8 \%\).

A related problem involves finding a strategy that minimizes the expected rank of the selected candidate (the best candidate has rank 1, the second best rank 2, etc.). Chow, Moriguti, Robbins and Samuels have found that the optimal strategy involves the following (in the limit of large \(N\)): skip the first \(c_0 N\) applicants, then, for all applicants before the number \(c_1 N\), we stop looking if the applicant is the best so far. If we have not yet selected an applicant, we choose the best or second best so far before the number \(c_2 N\). If we have not yet selected an applicant, we choose the best or second best or third best so far before the number \(c_3 N\). And so on. By choosing the \(c_n\) optimally, we can get an expected rank of \(3.8695\). This is quite surprising: we can expect an applicant in the top 4, even among billions of applicants!
The optimal values for the \(c_n\) are \[ c_0=0.2584... \\ c_1=0.4476... \\ c_2=0.5639... \] The general formula for \(c_n\) is \[ c_n=\prod_{k=n+2}^{\infty}\left ( \frac{k-1}{k+1} \right )^{1/k}=\frac{1}{3.86951924...}\prod_{k=2}^{n+1}\left ( \frac{k+1}{k-1} \right )^{1/k} \]

Hyperphronesis

Tuesday, July 1, 2014

Probability Problems

Maxima in Data

Joint Lives

Random Finite Discrete Distribution

Maximum of Exponential Random Variables

Choosing a Secretary