Tuesday, July 1, 2014

Probability Problems

      There are many interesting problems that can be studied with probability theory. Here I will discuss a few of my favorites.

Maxima in Data

Suppose we have a sequence of values of random variables \(\left(X_{k} \right )_{k=1}^{n}\)that are independent and identically distributed with density function \(f_{X}(x)\). We wish to find the expected number of local maxima in the data, that is, values of \(X_{k}\) such that \(X_{k-1} < X_{k} > X_{k+1}\). Let \(y=X_k\). Then, the probability \(p\) that a certain value y is a maximum is that of having \(X_{k-1} < y > X_{k+1}\). As all the \(X\)s are independent and identically distributed, we can calculate this probability by finding \[p= \int_{-\infty}^{\infty}f_{X}(x)F_{X}^2(x)dx\] Where \(F_{X}(x)=\int_{-\infty}^{x}f_{X}(y)dy\) is the cumulative distribution function of \(f_{X}\). The form above is like \(\sum_y P(X_k=y)P(X_{k-1} < y)P(X_{k+1} < y)\), except we use the continuous formulation. However, clearly \(F_{X}'(x)=f_X(x)\), and so the formula becomes \[p= \int_{-\infty}^{\infty}F_{X}'(x)F_{X}^2(x)dx\] But, by elementary calculus, we can change this to \[p= \int_{0}^{1}F^2 dF=\frac{1}{3}\] (the change in bounds arises from the fact that \(F_{X}(-\infty)=0\) and \(F_{X}(+\infty)=1\)). Thus, we expect one-third of the values in the sequence to be local maxima. Likewise, we expect one-third to be local minima, and the remaining third to be neither.

By the same method, we can look for other patterns. For instance, the fraction of data points that are higher than all four of their closest neighbors is \(\frac {1}{5}\). The fraction of data points that are bigger than their closest neighbors and smaller than their next-closest neighbors is \(\frac {1}{30}\). In fact, all the calculations can be made by evaluating integrals of the form \( \int_{0}^{1} x^m (1-x)^n dx\). We can also use results like this to test for non-independent or non-identically distributed data. It may even be possible to use it in fraud or bias detection. Based on next-to-nothing-at-all, I would expect human generated data to fail some of these tests.

We can also find that the distribution of the number of maxima in \(n\) data points, regardless of the probability distribution \(f_{X}\), is approximately normally distributed with mean \(\frac{n}{3}\) and variance \(\frac{2 n}{45}\). Thus, if we found fewer than 2960 or more than 3040 maxima in a list of 9000 data points, we could be 95% confident that the list was not of independent and identically distributed values. We can also run the same test for minima, but for non-maxima-non-minima, the variance is instead \(\frac{2 n}{15}\).
The values for the variances were found empirically. I don't really know how one would go about finding them analytically.

We can also find the distribution of the values of the maxima, which is easily found to be \[g(x)=3 f_{X}(x)F_{X}^2\] Other distributions are similarly found.

Joint Lives

Suppose we stat with \(N\) couples (\(2N\) people), and at a later time, \(M\) of the original \(2N\) people remain. We want to find the expected number of intact couples remaining. Let \(C(M)\) and \(W(M)\) be the expected remaining number of couples and widows respectively when M total people are left. We then note that, as any remaining person is equally likely to be eliminated next, we have: \[ C(M-1)=C(M)-2 \frac{C(M)}{M} \\ W(M-1)=W(M)-\frac{W(M)}{M}+2 \frac{C(M)}{M}\] We can solve this recurrence relation, subject to the constraints \(W(M)+2 C(M)=M\) and \(C(2N)=N, W(2N)=0\), and find that \[ C(M)=\frac{M(M-1)}{2(2N-1)} \\ W(M)=\frac{M(2N-M)}{2N-1} \] If we express M as a fraction of the total starting population: \(M=2xN\), and express \(C\) and \(W\) as fractions of the total population, we find, for \(N\) big: \[ C(x)=x^2 \\ W(x)=x(1-x) \] Also, for the general case of starting out with \(kN\) \(k\)-tuples, the expected number of intact \(k\)-tuples when \(M\) individuals remain is given by: \[K(M)=N \frac{\binom{M}{k}}{\binom{kN}{k}}\] For the case of triples, we have the number of triples, doubles and singles when M individuals remain is given by: \[K_3 (M)= \frac{M(M-1)(M-2)}{3(3N-1)(3N-2)} \\ K_2 (M)= \frac{M(M-1)(3N-M)}{(3N-1)(3N-2)} \\ K_1 (M)= \frac{M(3N-M)(3N-M-1)}{(3N-1)(3N-2)} \] Generally, with the same sense as discussed above, the fraction of the population in a m-tuple, beginning with only k-tuples, when fraction \(x\) of the population remains, is given by: \[K_m (x)=\binom{k-1}{m-1} x^m (1-x)^{k-m}\] In fact, the general form for the expected number can be given as \[ K_m (M)= N \frac{\binom{M}{m}\binom{kN-M}{k-m}}{\binom{kN}{k}} \]

Random Finite Discrete Distribution

Suppose we have a discrete random variable that can take on the values \(1,2,3,...,n\) with probabilities \(p_1,p_2,p_3,...p_n\) respectively, subject to the constraint \(\sum_{k=1}^n p_k=1\). Let \(p\) be an arbitrary value among the \(p\)s. We will take any combination of values for the \(p\)s as equally likely. By looking at the cases of n=2 and n=3, we find that the probability density function of \(p\) is given by \[ f_P(p)=(n-1)(1-p)^{n-2} \] And the cumulative distribution function is given by \[ F_P(p)=1-(1-p)^{n-1} \] The average value of \(p\) is then \[\int_{0}^{1} p(n-1)(1-p)^{n-2}dp=\frac{1}{n}\] And the variance is \[\int_{0}^{1} p^2 (n-1)(1-p)^{n-2}dp-\frac{1}{n^2}=\frac{n-1}{n^2 (n+1)}\] We thus find that the chance that \(p\) is above the average value is \[P\left ( p > \frac{1}{n} \right )=\left ( 1-\frac{1}{n} \right )^{n-1}\] In the limit as n becomes large, this value tends to \(\frac{1}{e}\).
A confidence interval containing fraction x of the total probability, for large n, is given by: \[ \frac{1}{n} \ln \left(\frac{e}{e x +1-x} \right) \leq p \leq \frac{1}{n} \ln \left(\frac{e}{1-x} \right) \] For instance, a \(50 \%\) confidence interval is given by \(\frac{1}{n}\ln \left(\frac{2 e}{1+e}\right) \leq p \leq \frac{1}{n}\ln(2 e)\).

We can also extend this to continuous distributions with finite support if we only consider the net probability of landing in equally-sized bins. While the calculation may break down if the number of possible values is actually infinite, it can be used to get some information about distributions with an arbitrarily large number of possible values.

Maximum of Exponential Random Variables

Suppose we have \(N\) independent and identically distributed exponential random variables \(X_1,X_2,...X_N\) with means \(\mu\). That is, \(f_X (x_k)=\frac{1}{\mu} e^{-\frac{x}{\mu}}\) when \(x \geq 0\) and zero otherwise. Let us interpret the random values as lifetimes for \(N\) units. The exponential distribution has the interesting property of memorylessness, which means that \(P(x > a+b| x > b)=P(x > a)\). We can show this by using the definition: \[ P(x>a+b|x>b)=\frac{P(x>a+b \cap x>b)}{P(x>b)}=\frac{P(x>a+b)}{P(x>b)} \\ P(x>a+b|x>b)=\frac{\int_{a+b}^{\infty}e^{-\frac{x}{\mu}}dx}{\int_{b}^{\infty}e^{-\frac{x}{\mu}}dx}=\frac{e^{-\frac{a+b}{\mu}}}{e^{-\frac{b}{\mu}}}=e^{-\frac{a}{\mu}}=P(x>a) \] In other words, given that a unit lasted \(b\) minutes, the chance that it will last another \(a\) minutes is the same as that it would last \(a\) minutes. We now calculate the probability distribution of the minimum of the \(N\) random variables. The probability that the minimum of \(X_1,X_2,...X_N\) is no less than \(x\) is the same as the probability that \(X_1 \geq x \cap X_2 \geq x \cap...X_N \geq x \). As all the \(X\)s are independent, this can be simplified to a product, and as all the \(X\)s are identically-distributed, we can simplify this further: \[ P(\min(X_1,X_2,...)\geq x)=\left ( P(X_1 \geq x) \right )^N= \left ( \int_{x}^{\infty}\frac{1}{\mu} e^{-\frac{x}{\mu}}\right )^N \\ P(\min(X_1,X_2,...)\geq x)= e^{-\frac{xN}{\mu}} \\ P(\min(X_1,X_2,...)\leq x)= 1-e^{-\frac{xN}{\mu}} \\ f_{\min(X)}(x)=\frac{N}{\mu}e^{-\frac{xN}{\mu}} \] Thus, the average of the minimum of \(X_1,X_2,...X_N\) is \(\frac{\mu}{N}\). We now combine these two facts, the mean minimum vale and the memorylessness. We start with all units operational, and we have to wait an average of \(\frac{\mu}{N}\) until the first one fails. However, given that the first one fails, the expected additional wait time until the next one fails is just \(\frac{\mu}{N-1}\), that is, the expected minimum of \(N-1\) units. Thus, the expected time that the \(m\)th unit fails is given by \[\mu\sum_{k=0}^{m-1}\frac{1}{N-k}\] Thus, the expected maximum time, when the \(N\)th unit fails is \[\mu\sum_{k=1}^{N}\frac{1}{k}\]

More generally, we can look at the distributions of the kth order statistic of \(X_1,X_2,...X_N\). The kth order statistic, denoted \(X_{(k)}\), is defined as the kth smallest value, so that \(X_{(1)}\) is the smallest (minimum) value, and \(X_{(N)}\) is the largest (maximum) value. The pdf is easily found to be: \[ f_{X_{(k)}}(x)=k {N \choose k} F_X^{k-1}(x)\left[1-F_X(x)\right]^{N-k}f_X(x) \] Where \(F_X(x)\) is the cdf of X, and \(f_X(x)\) is the pdf of X. So, in this case, \[ f_{X_{(k)}}(x)=\frac{k}{\mu} {N \choose k} e^{-(N-k+1)x/\mu}\left[1-e^{-x/\mu}\right]^{k-1} \] Thus, the moment generating function is given by: \[ g(t)=\frac{k}{\mu} {N \choose k} \int _0 ^\infty e^{-(N-k+1-\mu t)x/\mu}\left[1-e^{-x/\mu}\right]^{k-1} dx \] By a simple transformation, we find that: \[ g(t)=k {N \choose k} \int _0 ^1 u^{N-k-\mu t}(1-u)^{k-1} du \] This puts the integral in a well-known form, which has the value \[ g(t)=\frac{N!}{\Gamma(N+1-\mu t)}\frac{\Gamma(N-k+1-\mu t)}{(N-k)!} \] By a simple calculation, the cumulants are then given by the surprisingly simple form: \[ \kappa_n=\mu^n(n-1)!\sum_{j=N-k+1}^{N} \frac{1}{j^n} \] Several interesting results follow from this:
  • For the Nth order statistic (the maximum), we already know that the mean value goes as \(\sum_{j=1}^{N} \frac{1}{j}\). But now we see that the other cumulants go as \((n-1)!\sum_{j=1}^{N} \frac{1}{j^n}\). Thus, the variance converges, in the limit, to \(\mu^2 \frac{\pi^2}{6}\). The skewness converges, in the limit, to \(\frac{12 \sqrt{6}}{\pi^3}\zeta(3)\), and the excess kurtosis converges to \(\frac{12}{5}\). In fact, if we shift to take into account the unbounded mean, the distribution of the maximum converges to a Gumbel distribution. This is a special case of a fascinating result known as the extreme value theorem.
  • For any given, fixed, finite \(k\geq 0\), \(X_{(N-k)}\) converges, as N goes to infinity, to a non-degenerate distribution with finite, positive variance, if we shift it to account for the unbounded mean.
  • For k of the form \(k=\alpha N\) (or the nearest integer thereto), for some fixed alpha between 0 and 1, for \(\alpha \neq 1\), in the limit at N goes to infinity, the distribution of \(X_{(\alpha N)}\) become degenerate distributions with all the probability density located at \(\mu\ln\left(\frac{1}{1-\alpha}\right)\). These are, of course, the locations of the \(100\alpha \%\) quantiles, and so \(X_{(\alpha N)}\) is a consistent estimator for the \(100\alpha \%\) quantile.

As a more general result, let us find the cdf of \(X_{(\alpha N)}\) for an arbitrarily distributed X, in the limit as N goes to infinity. The cdf of \(X_{(\alpha N)}\) is given by: \[ F_{X_{(\alpha N)}}(y)=\alpha N {N \choose \alpha N} \int _{-\infty} ^{y} F_X^{\alpha N-1}(x)\left[1-F_X(x)\right]^{N-\alpha N}f_X(x) dx \] As \(f_X(x)=\frac{d}{dx}F_X(x)\), we then have, by a simple substitution: \[ F_{X_{(\alpha N)}}(y)=\alpha N {N \choose \alpha N} \int _{0} ^{F_X(y)} u^{\alpha N-1}\left[1-u\right]^{N-\alpha N} du \] This is the cdf of a Beta distributed random variable, with mean \(\mu=\frac{\alpha N}{N+1}\) and variance \(\sigma^2=\frac{\alpha N (N-\alpha N+1)}{(N+1)^2(N+2)}\). Thus, as N goes to infinity, this will converge in distribution to a degenerate distribution with all the density at \(y=F_X^{-1}(\alpha)\), that is, at the \(100\alpha \%\) quantile of the distribution.

Choosing a Secretary

Suppose we need to hire a secretary. We have \(N\) applicants arrive and we interview them sequentially: once we interview and dismiss an applicant, we cannot hire her. The applicants all have differing skill levels, and we want to pick as qualified an applicant as we can. We want to find the optimal strategy for choosing whom to hire. We easily see that the optimal strategy is something like the following. We consider and reject the first \(K\) applicants. We then choose the first applicant who is better than all the preceding ones. Thus, our problem reduces to finding the optimal value for \(K\). We will do so in a way that maximizes the probability that the most qualified secretary is selected. We thus have the probability: \[ P(\mathrm{best\, is\, chosen})=\sum_{n=1}^{N}P(\mathrm{n^{th}\, is\, chosen} \cap \mathrm{n^{th}\, is\, best}) \\ P(\mathrm{best\, is\, chosen})=\sum_{n=1}^{N}P(\mathrm{n^{th}\, is\, chosen}| \mathrm{n^{th}\, is\, best})P(\mathrm{n^{th}\, is\, best}) \] We then note that each applicant in line is the best applicant with equal probability. That is, \(P(\mathrm{n^{th}\, is\, best})=\frac{1}{N}\). Also, we can find the conditional probabilities. If \(M \leq K\), then \(P(\mathrm{M^{th}\, is\, chosen}| \mathrm{M^{th}\, is\, best})=0\). If the \((K+1)\)th applicant is best, she will certainly be chosen, that is \(P(\mathrm{(K+1)^{th}\, is\, chosen}| \mathrm{(K+1)^{th}\, is\, best})=1\). Also, we find that \(P(\mathrm{(K+m)^{th}\, is\, chosen}| \mathrm{(K+m)^{th}\, is\, best})=\frac{K}{K+m}\), as that is the chance that the second-best applicant among the first \(K+m\) applicants is in the first \(K\) applicants. We thus have \[ P(\mathrm{best\, is\, chosen})=\frac{K}{N}\sum_{n=K+1}^{N}\frac{1}{n} \] Let us assume we are dealing with a relatively large number of applicants. In that case, we can approximate \(\sum_{n=A+1}^{B}\frac{1}{n} \approx \ln \left(\frac{B}{A} \right )\). Thus \[ P(\mathrm{best\, is\, chosen})=\frac{K}{N}\ln \left(\frac{N}{K}\right )=-\frac{K}{N}\ln \left(\frac{K}{N}\right ) \] If we then let \(x=\frac{K}{N}\), we just need to maximize \(-x\ln(x)\), which happens at \(x=e^{-1}\). From this, we find that \(P(\mathrm{best\, is\, chosen})=e^{-1}\). Thus, the best strategy is to interview and reject the first \(36.8 \%\) of the applicants, and then choose the next applicant who is better than all the preceding ones. This will get us the best applicant with a probability of \(36.8 \%\).

A related problem involves finding a strategy that minimizes the expected rank of the selected candidate (the best candidate has rank 1, the second best rank 2, etc.). Chow, Moriguti, Robbins and Samuels have found that the optimal strategy involves the following (in the limit of large \(N\)): skip the first \(c_0 N\) applicants, then, for all applicants before the number \(c_1 N\), we stop looking if the applicant is the best so far. If we have not yet selected an applicant, we choose the best or second best so far before the number \(c_2 N\). If we have not yet selected an applicant, we choose the best or second best or third best so far before the number \(c_3 N\). And so on. By choosing the \(c_n\) optimally, we can get an expected rank of \(3.8695\). This is quite surprising: we can expect an applicant in the top 4, even among billions of applicants!
The optimal values for the \(c_n\) are \[ c_0=0.2584... \\ c_1=0.4476... \\ c_2=0.5639... \] The general formula for \(c_n\) is \[ c_n=\prod_{k=n+2}^{\infty}\left ( \frac{k-1}{k+1} \right )^{1/k}=\frac{1}{3.86951924...}\prod_{k=2}^{n+1}\left ( \frac{k+1}{k-1} \right )^{1/k} \]

Monday, June 16, 2014

Some Set Theory

      Set theory is a fundamental branch of mathematics that studies sets, that is, collections of objects that can be treated as wholes.
"A set is a Many that allows itself to be thought of as a One."
- Georg Cantor.

A set has various elements or members, which may either be individual objects or themselves sets. An extension to this is the inclusion of the empty set (the set containing no objects, symbolized \(\emptyset\) or \(\left \{\right\}\)) as an allowable set. If set A has a, b and c as elements, we symbolize that as A={a,b,c}. Note that, in set theory, order and multiplicity don't matter, so A is also equal to {b,c,a,a,b}.

Definitions and Notation

\( x \in A \)         x is an element/member of A.
\( x \notin A \)                 x is not an element/member of A.
\( A \subseteq B \)                 Set A is a subset of B. That is, for all x, if x is an element of A, then x is an element of B. Note that "A is a subset of B" is equivalent to "B is a superset of A".
\( A = B \)                 A is equivalent to B. That is, for all x, x is an element of A if and only if x is an element of B.
\( A \neq B \)                 A is not equivalent to B.
\( A \subset B \)                 \( A \subseteq B \) and \( A \neq B \). If \( A \subset B \), we say A is a proper subset of B.
\(A \cap B \)                 The intersection of sets A and B. That is, the set that contains all of the elements that are in both A and B. Sets A and B are said to be disjoint if \(A \cap B = \emptyset\).
\(A \cup B \)                 The union of sets A and B. That is, the set that contains all of the elements that are in either A or B.
\(A \setminus B \)                 The set difference of sets A and B. That is, the set that contains all of the elements in A that are not in B.
\(\mathbb{Z} =\left \{ ...,-2,-1,0,1,2,... \right \} \)                 The set of all integers. That is, the set that includes 0 and every other number that can be achieved by successively adding or subtracting 1 from 0.
\(\mathbb{N} =\left \{1,2,3,... \right \} \)                 The set of all natural numbers. That is, the set that includes 1 and every other number that can be achieved by successively adding 1 to 1.
\(\mathbb{W} =\left \{0,1,2,3,... \right \} \)                 The set of all whole numbers. That is, the set that includes 0 and every other number that can be achieved by successively adding 1 to 0.
\( \left\{\begin{matrix} x \end{matrix}\right|\left.\begin{matrix} Px \end{matrix}\right\} \)                 The set of all elements of the form \(x\) such that \(Px\) is true of x. For instance \(\mathbb{N}=\left\{\begin{matrix} x \end{matrix}\right|\left.\begin{matrix} x \in \mathbb{Z} \wedge x>0 \end{matrix}\right\} \).
\( \mathbb{Q}= \left\{\begin{matrix} \frac{x}{y} \end{matrix}\right|\left.\begin{matrix} x \in \mathbb{Z} \wedge y \in \mathbb{N} \end{matrix}\right\} \)                 The set of all rational numbers. That is, the set of all real numbers expressible as a ratio of integers with non-zero denominator.
\( \mathbb{N}_{n}=\left\{1,2,3,...,n\right\} \)                 The truncated set of the natural numbers up to \(n\). Another way of expressing it would be: \( \mathbb{N}_{n}=\left\{\begin{matrix} x \end{matrix}\right|\left.\begin{matrix} x \in \mathbb{N} \wedge x \leq n \end{matrix}\right\} \).
\( \mathbb{R}=\left (-\infty, \infty \right) \)                 The set of all real numbers. As a note, we often see the shorthand \((a,b)\) (not to be confused with an ordered pair!) to mean \( \left\{\begin{matrix} x \end{matrix}\right|\left.\begin{matrix} x \in \mathbb{R} \wedge a < x \wedge x < b \end{matrix}\right\} \). We may also see the shorthand \([a,b]\) to mean \( \left\{\begin{matrix} x \end{matrix}\right|\left.\begin{matrix} x \in \mathbb{R} \wedge a \leq x \wedge x \leq b \end{matrix}\right\} \). The combined notation \((a,b]\) should be clear enough.
\(\mathbb{W} =\left \{0,1,2,3,... \right \} \)                 The set of all whole numbers. That is, the set that includes 0 and every other number that can be achieved by successively adding 1 to 0.
\(\mathbb{P} =\left \{2,3,5,7,11,... \right \} \)                 The set of all prime numbers. That is, the subset of the natural numbers that have exactly two factors: 1 and themselves. By a well-known theorem, we know that there are infinitely many prime numbers. They are often represented as a sequence of integers in ascending order: \(\mathbb{P}=\{p_{0},p_{1},p_{2},p_{3},...\}\).
\(\mathbb{A} \)                 The set of all algebraic numbers. That is, the set of numbers that solve polynomials with integer coefficients. We can also introduce the notation \(\mathbb{A}_{n}\) which is the set of all algebraic numbers of order \(n\), that is the set of numbers that solve polynomials of order \(n\) with integer coefficients. In other words: \( \mathbb{A}_n=\left\{\begin{matrix} x \\ \; \end{matrix}\right|\left.\begin{matrix} x \in \mathbb{R} \wedge \exists A = \{a_{0},a_{1},a_{2},...a_{n}\} \sim \mathbb{N}_{n+1} \\ \wedge c \in A\Rightarrow c \in \mathbb{Z} \wedge 0=\sum_{k=0}^{n} a_{k} x^{k} \end{matrix}\right\} \).
We then take the union: \(\mathbb{A}=\mathbb{A}_{1} \cup \mathbb{A}_{2} \cup \mathbb{A}_{3} \cup ... \). A number in the set \(\mathbb{R} \setminus \mathbb{A}\) is called a transcendental number.
\( (a,b) \)                 An ordered pair, that is a pairing of mathematical objects in which the order matters. We can define them in the following way, for technical purposes: \( (a,b) = \left \{ \left \{ a \right \}, \left \{ a,b \right \} \right \} \).
\( A \times B \)                 The cartesian product of sets A and B. That is, the set of all ordered pairs \( (a,b) \) such that \( a \in A \) and \( b \in B \).
\( \mathcal{P}(A) \)                 The power set of set A. That is, the set of all subsets of /(A/). We can write this as \( \mathcal{P}(A)=\left\{\begin{matrix} X \end{matrix}\right|\left.\begin{matrix} X \subseteq A \end{matrix}\right\} \).
\( f: A \rightarrow B \)                 f is a function from A to B. That is, f takes an element of A and returns an element of B. Strictly speaking, a function from A to B is a subset of \( A \times B \) with the properties
  • \( \left\{\begin{matrix} a \end{matrix}\right|\left.\begin{matrix}(a,b)\in f \end{matrix}\right\} = A \)
  • If \((a,b)\in f \) and \((a,c)\in f \), then \(b=c \).
A is called the domain of f and B is called the range or codomain of f. Often we will see the notation \(y=f(x)\), which means that f associates y with x, or \((x,y)\in f\). In this case, y is called the image of x under f. More generally, if \(X \subseteq A \), then \(f(X)=Y= \left\{\begin{matrix} y \end{matrix}\right|\left.\begin{matrix} y=f(x) \wedge x \in X \end{matrix}\right\}\). Y is likewise called the image of X under f. Clearly \(Y \subseteq B \).
\( f: A \rightarrowtail B \)                 f is an injective function from A to B. That is, the function is one-to-one, or, if \(f(a)=f(b)\) then \(a=b\). Every injective function f has an inverse function, \(f^{-1}: f(A) \rightarrow A \). We can say \(f^{-1}=\left\{\begin{matrix} (b,a) \end{matrix}\right|\left.\begin{matrix} (a,b) \in f \end{matrix}\right\} \). Clearly the inverse of an injective function is injective. We can also see that \(f(f^{-1}(a))=a\), and \(f^{-1}(f(a))=a\).
\( f: A \twoheadrightarrow B \)                 f is a surjective function from A to B. That is, the function is onto. In other words, \(f(A)=B\), i.e. \( \left\{\begin{matrix} b \end{matrix}\right|\left.\begin{matrix}(a,b)\in f \end{matrix}\right\} = B \).
\( f: A \leftrightarrow B \)                 f is a bijective function from A to B. That is, f is both injective and surjective. The inverse of a bijective function is also bijective.
\( A \preceq B \)                 There exists an injective function from \(A\) to \(B\).
\( A \sim B \)                 There exists a bijective function from \(A\) to \(B\).
\( A \nsim B \)                 There does not exist a bijective function from \(A\) to \(B\).
\( A \prec B \)                 There exists an injective function from \(A\) to \(B\), but no bijective function from \(A\) to \(B\). That is, \( A \preceq B \) and \( A \nsim B \).
\( \left | A \right | \)                 The cardinality of \(A\). That is, a value associated with a set that can be used to compare sets with respect to some sense of magnitude.
\( \left | A \right | \leq \left | B \right |\)                 \( A \preceq B \). We say this as "the cardinality of A is less than or equal to the cardinality of B", though this is only something of a manner of speaking.
\( \left | A \right | = \left | B \right |\)                 The cardinality of \(A\) is the same as the cardinality of \(B\), meaning there exists a bijective function from \(A\) to \(B\). In other words, \( A \sim B \). When this is the case, it is often convenient to say "A and B are equinumerous", but this is a technical term, and should not be naively interpreted.
\(\aleph_{0}\)                 \(\aleph_{0}=\left | \mathbb{N} \right |\). \(\aleph_{0}\) is the "smallest" infinite cardinal. In other words, if \( \left | A \right |< \aleph_{0}\), then A is a finite set. Also, a set A is said to be countable iff \( \left | A \right | \leq \aleph_{0}\). If \(A \sim \mathbb{N} \), then A is said to be countably infinite.
\(\mathbb{S}_{n}\)                 \(\mathbb{S}_{n}=\left | \mathbb{N}_{n}\right |\). A is a finite set iff for some \(n\), \(\left | A \right |= \mathbb{S}_{n} \), where \(n \in \mathbb{N} \). Typically, if \(\left | A \right |= \mathbb{S}_{n} \), where \(n \in \mathbb{N} \), we say "A has n elements", though we will here avoid all notions of the "number of elements".
\( \left | A \right |+\left |B \right |\)                 \( \left | A \right |+\left |B \right |=\left |A \cup \tilde{B} \right |\), where \(\tilde{B} \sim B \) and \(A \cap \tilde{B}=\emptyset \).
\( \left | A \right |\times\left |B \right |\)                 \( \left | A \right |\times\left |B \right |=\left | A \times B \right |\).
\( \max \left (\left | A \right |,\left |B \right | \right)\)                 \( \max \left (\left | A \right |,\left |B \right | \right) = \left\{\begin{matrix} \left | A \right |, \\ \left | B \right |, \end{matrix}\right. \begin{matrix} B \preceq A \\ A \preceq B \end{matrix}\) .
\( \min \left (\left | A \right |,\left |B \right | \right)\)                 \( \min \left (\left | A \right |,\left |B \right | \right) = \left\{\begin{matrix} \left | A \right |, \\ \left | B \right |, \end{matrix}\right. \begin{matrix} A \preceq B \\ B \preceq A \end{matrix}\) .

Axioms of Set Theory

The most typical axiomatization of set theory is that of Zermelo-Fraenkel with the axiom of choice (ZFC). We will here list these axioms:
Extensionality                 Sets A and B are equivalent (symbolized \(A=B\)), iff, for all x, \(x \in A\) iff \(x \in B\).
Regularity                 If A is a non-empty set, then, for some element x of A, \(A \cap x=\emptyset\).
Schema of Specification                 If, for any x, we can determine whether or not Px is true of x (P being some predicate), then we can always construct the set \( \left\{\begin{matrix} x \end{matrix}\right|\left.\begin{matrix} Px \end{matrix}\right\} \).
Pairing                 If A and B are sets, then there exists a set C such that \(A \in C\) and \(B \in C\).
Union                 If A is a set, then there exists a set B such that, if \(Y \in A\) and \(x \in Y\) then \(x \in B\). More generally, the union of any collection of sets exists.
Schema of Replacement                 If \(f\) is a definable function, and A is a set in the domain of \(f\), then, \(f(A)\) is a set.
Infinity                 There exists a set X such that \(\emptyset \in X\), and, if \(y \in X\) then \(\left \{y \right \} \in X\). That is, \(X=\left \{\emptyset, \left \{\emptyset\right \}, \left \{\left \{\emptyset\right \}\right \},...\right \} \). In particular, X contains infinitely many elements.
Power Set                 If A is a set, then \(\mathcal{P}(A) \) is also a set.
Choice                 If \(A_{1}\), \(A_{2}\), \(A_{3}\), ... is a sequence of non-empty sets, then there exists a set \(X=\left \{x_{1},x_{2},x_{3},...\right \}\), such that \(x_{n} \in A_{n}\).

Some Elementary Theorems

We will now discuss a few theorems. Theorems stated without proof are beyond the level of our discussion or are too involved. The reader is directed to a more technical discussion if more detail is desired.
Theorem: If \(A \subseteq B \) then \(A \preceq B \).
Proof: Let \(f(x)=x\). Then \(f\) is injective from A to B.

Theorem: If \(f: A \rightarrowtail B\), \(C \subseteq A\), \(D \subseteq A\) and \(C \cap D=\emptyset\), then \(f(C) \cap f(D)=\emptyset\).
Proof: Assume \(C \cap D=\emptyset\) and \(f(C) \cap f(D)=F\neq\emptyset\). Let \(\phi \in F\) then, for some \(c \in C\) and some \(d \in D\), \(f(c)=f(d)=\phi\). However, \(f\) is injective, and therefore, if \(f(c)=f(d)\), then \(c=d\). Therefore, \(c \in C \cap D\), and therefore \(C \cap D \neq \emptyset\), which contradicts our assumption. Therefore \(f(C) \cap f(D)=\emptyset\).
Corollary: If \(f: A \rightarrowtail B\), then \(f(C \cap D)=f(C) \cap f(D)\).
Corollary: If \(f: A \rightarrowtail B\), then \(f(C \cup D)=f(C) \cup f(D)\).

Theorem: \(A \sim A \).
Proof: The identity function \(f(x)=x\) is a bijection from A to A.

Theorem: If \(A \sim B \) then \(B \sim A \).
Proof: Let \(f\) be a bijective function from A to B. As all bijective functions have inverses, \(f^{-1}\) is a bijection from B to A.

Theorem: If \(A \sim B \) and \(B \sim C \) then \(A \sim C \).
Proof: Let \(f\) be a bijective function from A to B, and \(g\) be a bijective function from B to C. Then the composition \(g \circ f\) is a bijective function from A to C.

Theorem: If \(a \notin A \) and \(A \sim \mathbb{N}_{n} \) then \(A \cup \left\{a \right\} \sim \mathbb{N}_{n+1}\).
Proof: Let \(f\) be a bijective function from A to \(\mathbb{N}_{n}\). Let \(g(x)=\left\{\begin{matrix} f(x), \;\; \;\; \\ n+1, \;\; \;\; \end{matrix}\right. \begin{matrix} x \in A \\ x=a \end{matrix}\).
Then g is bijective from \(A \cup \left\{a \right\}\) to \(\mathbb{N}_{n+1}\).
Corollary: If \(B \cap A=\emptyset \) and \(B \sim \mathbb{N}_{m} \) and \(A \sim \mathbb{N}_{n} \) then \(A \cup B \sim \mathbb{N}_{n+m}\).

Theorem: If \(a \in A \) and \(A \sim \mathbb{N}_{n} \) then \(A \setminus \left\{a \right\} \sim \mathbb{N}_{n-1}\).
Proof: Let \(f\) be a bijective function from \(\mathbb{N}_{n}\) to A. Suppose \(a=f(m)\). Let \(g(x)=\left\{\begin{matrix} f(x), \;\; \;\; \\ f(x+1), \;\; \;\; \end{matrix}\right. \begin{matrix} x < m \\ x \geq m \end{matrix}\).
Then g is bijective from \(\mathbb{N}_{n-1}\) to \(A \setminus \left\{a \right\}\).
Corollary: If \(B \subseteq A \) and \(B \sim \mathbb{N}_{m} \) and \(A \sim \mathbb{N}_{n} \) then \(A \setminus B \sim \mathbb{N}_{n-m}\).

Theorem: If \(A \sim B \) and \(a \in A \) and \(a \in B \) then \(A \setminus \left\{a \right\} \sim B \setminus \left\{a \right\} \).
Proof: Let \(f\) be a bijective function from A to B. Suppose \(f(c)=a\). Define \(g(x)=\left\{\begin{matrix} a, \\ c, \\ x, \;\; \end{matrix}\right. \begin{matrix} x=c \;\;\;\; \;\; \\ x=a \;\;\;\; \;\; \\ x \neq a \wedge x \neq c \end{matrix}\).
Clearly g is bijective (in fact, it is its own inverse) and \(g(A)=A\). Also, \((f \circ g)(a)=a\), and \(f \circ g\) is bijective. Therefore, \(f \circ g\) is a bijection from \(A \setminus \left\{a \right\}\) to \(B \setminus \left\{a \right\}\).
Corollary: If \(A \sim B \), and \(C \sim \mathbb{N}_{n}\) for some \(n \in \mathbb{N}\), where \(C \subseteq A \) and \(C \subseteq B \), then \(A \setminus C \sim B \setminus C \).
Proof: Apply the theorem above \(n\) times. Note that the set subtracted must be finite. The theorem is inapplicable for infinite sets. In fact, as a counterexample for infinite sets: \(\mathbb{Z} \setminus \mathbb{N} \nsim \mathbb{W} \setminus \mathbb{N}\) even though \(\mathbb{Z} \sim \mathbb{W}\).

Theorem: \(\mathbb{N}_{n} \sim \mathbb{N}_{m} \) if and only if \( {n = m} \).
Proof: Suppose that \({n \neq m}\) and \(\mathbb{N}_{n} \sim \mathbb{N}_{m} \). Without loss of generality, \(n < m \). By a theorem proved above, \(\mathbb{N}_{n} \setminus \mathbb{N}_{n} \sim \mathbb{N}_{m} \setminus \mathbb{N}_{n} \). But \(\mathbb{N}_{n} \setminus \mathbb{N}_{n} = \emptyset \). However, \(m \in \mathbb{N}_{m} \setminus \mathbb{N}_{n} \), so \(\mathbb{N}_{m} \setminus \mathbb{N}_{n} \neq \emptyset\). Clearly, then, it is impossible that \(\mathbb{N}_{n} \setminus \mathbb{N}_{n} \sim \mathbb{N}_{m} \setminus \mathbb{N}_{n} \), and so our assumption that \({n \neq m}\) must be false. Conversely, if \({n=m} \) then \(\mathbb{N}_{n}= \mathbb{N}_{m} \), and thus \(\mathbb{N}_{n} \sim \mathbb{N}_{m} \).

Theorem: For any sets \(A\) and \(B\), there is an injection from \(A\) to \(B\) if and only if there is a surjection from \(B\) to \(A\).
Proof: Suppose \(f: A \rightarrowtail B\). Then, let \(a \in A\) be any element in \(A\). Now, define \(g(x)=\left\{\begin{matrix} f^{-1}(x), \;\; \;\; \\ a, \;\; \;\; \end{matrix}\right. \begin{matrix} x \in f(B) \\ x \notin f(B) \end{matrix}\)
Clearly \(g\) is surjective from \(B\) to \(A\).
Now suppose \(f: A \twoheadrightarrow B\). That means that, for every \(b\in B\) there is at least one \(a \in A\) such that \(f(a)=b\). Now define \(g(x)=a\), where \(a\) is arbitrarily chosen for any \(x\), from among those \(a \in A\) such that \(f(a)=x\). Clearly \(g\) is injective from \(B\) to \(A\).
Corollary: "There is a surjection from \(A\) to \(B\) " \( \Leftrightarrow B \preceq A\).

Theorem: For any sets \(A\) and \(B\), there exists either an injection from \(A\) to \(B\) or a surjection from \(A\) to \(B\).
Proof (Informal): If there is no function from \(A\) to \(B\) that is either an injection or an injection, then, for any function \(f\), there are \(a_1,a_2 \in A\) such that \(a_1 \neq a_2\) and \(f(a_1)=f(a_2)\) and a \(b \in B\) such that \(b \notin f(A)\). But clearly we can construct a new function such that \(f(a_2)=b\). Moreover, we can continue this for an arbitrary set of \(\left\{\begin{matrix} b \end{matrix}\right|\left.\begin{matrix} b \in B \wedge b \notin f(A) \end{matrix}\right\}\) and/or \(\left\{\begin{matrix} a \end{matrix}\right|\left.\begin{matrix} a \in A \wedge \exists a' \in A, a' \neq a, f(a)=f(a') \end{matrix}\right\}\). Thus, we can always exhaust at least one set, as each set has a set of larger cardinality, and we can continue this up to any cardinality. Thus, we can exhaust either the domain (which would make the function injective) or the co-domain (which would make the function surjective).
Corollary: For any sets A and B, either \( A \preceq B \) or \( B \preceq A \).
Corollary: For any sets A and B, exactly one of the following is true:
  • \(A \prec B\)
  • \(B \prec A\)
  • \(A \sim B\)

Theorem: If \( A \preceq B \) and \( B \preceq A \) then \(A \sim B \).
Proof: Let \( f: A \rightarrowtail B \) and \( g: B \rightarrowtail A \) and \(h=g \circ f\). Let us introduce the notation \(h^{k}(x)\) defined by \(h^{0}(x)=x\) and \(h^{n+1}(x)=h(h^{n}(x))\). We now let \( C=A \setminus g(B)\) and \( X=h^{0}(C) \cup h^{1}(C) \cup h^{2}(C) \cup ...\).
We then define \(k(x)=\left\{\begin{matrix} f(x), \\ g^{-1}(x), \end{matrix}\right. \begin{matrix} x \in X \;\;\;\; \\ x \notin X \;\;\;\; \end{matrix}\).
We can then verify that \(k\) is bijective from \(A\) to \(B\):
  • Let \(b \in B\). If \(b \in f(X)\), then, for some \(a \in X \subseteq A\), \(k(a)=b\). Otherwise, we can find an \(a\) such that \(a=g(b)\). By construction, \(a \notin C\), and \(b \notin f(X)\), and since \(f(h^{n}(X)) \subseteq f(X)\), we find \(a \notin X\) and so \(k(a)=b\). Thus, for any \(b \in B\), \(b=k(a)\) for some \(a\in A\). Therefore \(k\) is surjective.
  • Assume, for some \(c \in X\) and some \(d \notin X\), \(f(c)=g^{-1}(d)\), or \(d=g(f(c))\). Therefore, for some \(n \in \mathbb{W}\), \(c \in h^{n}(C)\). But then \(d \in g(f(h^{n}(C)))=h^{n+1}(C) \subseteq X\). But that contradicts our assumption that \(d\notin X\), and thus, if \(k(c)=k(d)\), then \(c=d\). Therefore, \(k\) is injective.

Theorem: For any set A, \(A \prec \mathcal{P}(A) \).
Proof: Clearly \(f(x)=\left\{x \right\}\) is an injection from A to \(\mathcal{P}(A)\). Let \(f: A \rightarrow \mathcal{P}(A)\) be any function. Define \(S= \left\{\begin{matrix} x \end{matrix}\right|\left.\begin{matrix} x \notin f(x) \end{matrix}\right\} \). Clearly, then, \(x \in S \Leftrightarrow x \notin f(x)\). Assume that, for some \(k \in A\), \(S=f(k)\). In that case, \(k \in f(k) \Leftrightarrow k \notin f(k)\), which is clearly impossible. Thus, our assumption fails, and there is no k such that \(k \in A\), \(S=f(k)\). Therefore, as \(S \in \mathcal{P}(A)\), \(f\) is not surjective, and thus not bijective.
Corollary: For any set A, \(A \prec \mathcal{P}(A) \prec \mathcal{P}(\mathcal{P}(A)) \prec ... \).
Corollary: \(\mathbb{N} \prec \mathcal{P}(\mathbb{N}) \prec \mathcal{P}(\mathcal{P}(\mathbb{N})) \prec ... \).

Theorem: \(\mathbb{N} \sim \mathbb{Z}\).
Proof: Let \(f(n)=\left\{\begin{matrix} \frac{n-1}{2} \;\; n \textrm{ odd} \\ -\frac{n}{2} \;\; n \textrm{ even} \end{matrix}\right.\). Then \(f\) can be seen to be bijective from \(\mathbb{N}\) to \(\mathbb{Z}\).

Theorem: \(\mathbb{N} \sim \mathbb{Q}\).
Proof: As \(\mathbb{N} \subset \mathbb{Q}\), we have \(\mathbb{N} \preceq \mathbb{Q}\). Let us now take \(f \left( \frac{a}{b} \right)= 2^{\textrm{sgn}(\frac{a}{b})}3^{a}5^{b}\). Clearly \(f(\mathbb{Q}) \subset \mathbb{N}\), and so \(f\) is an injection from \(\mathbb{Q}\) to \(\mathbb{N}\), and thus \(\mathbb{Q} \preceq \mathbb{N}\). Therefore \(\mathbb{N} \sim \mathbb{Q}\).

Theorem: \(\mathbb{N} \sim \mathbb{A}\).
Proof: As \(\mathbb{N} \subset \mathbb{A}\), we have \(\mathbb{N} \preceq \mathbb{A}\). Let us now take \(f \left( x \right)\) to be defined as \(f(x)=p_{1}^{k}p_{2}^{a_{0}}p_{3}^{a_{1}}...p_{n+2}^{a_{n}}\) where \(x\) is the kth real solution of \(0=\sum_{k=0}^{n} a_{k} x^{k}\), sorted according to increasing absolute magnitude, and \(p_{k}\) is the kth prime number. Then, \(f(\mathbb{A}) \subset \mathbb{Q} \sim \mathbb{N}\), and \(f\) is injective. Thus \(\mathbb{A} \preceq \mathbb{N}\), and thus \(\mathbb{N} \sim \mathbb{A}\).

Theorem: \((0,1) \sim \mathbb{R}\).
Proof: The function \(f=\frac{2 x-1}{x-x^{2}}\) is bijective from \((0,1)\) to \(\mathbb{R}\).

Theorem: \((0,1) \sim [0,1]\).
Proof: Let \(f(x)=\left\{\begin{matrix} \frac{1}{2}, \\ \frac{1}{2^{n-2}}, \\ x, \;\; \end{matrix}\right. \begin{matrix} x=0 \;\;\;\;\;\;\;\;\;\;\;\;\; \\ x=\frac{1}{2^{n}}, n \in \mathbb{W} \\ x \neq 0, x \neq \frac{1}{2^{n}} \;\; \end{matrix}\). Then f is bijective from \([0,1]\) to \((0,1)\).

Theorem: \([0,1) \sim [0,1]\).
Proof: Let \(f(x)=\left\{\begin{matrix} \frac{1}{2^{n-1}}, \\ x, \;\; \end{matrix}\right. \begin{matrix} x=\frac{1}{2^{n}}, n \in \mathbb{W} \\ x \neq \frac{1}{2^{n}} \;\; \end{matrix}\). Then f is bijective from \([0,1]\) to \([0,1)\).

Theorem: \([0,1) \sim (0,1]\).
Proof: Let \(f(x)= 1-x \). Then f is bijective from \((0,1]\) to \([0,1)\).

Theorem: \((a,b) \sim (0,1)\).
Proof: Let \(f(x)= \frac{x-a}{b-a} \). Then f is bijective from \((a,b)\) to \((0,1)\). Relations of the form \([a,b) \sim [0,1) \) can be identically proven.

Theorem: \((a,\infty) \sim (0,1)\).
Proof: Let \(f(x)=\frac{|x-a|}{|x-a|+1}\). Then f is bijective from \((a,\infty)\) to \((0,1)\). This function can also be used to prove \([a,\infty) \sim [0,1)\), \((-\infty, a) \sim (0,1)\) and \((-\infty, a] \sim [0,1)\).

Theorem: If \(a \in (0,1)\), then \([0,a)\cup (a,1]=[0,1]\setminus \left \{a \right \} \sim [0,1]\).
Proof: Let \(f(x)=\left\{\begin{matrix} a^{n}, \\ x, \;\; \end{matrix}\right. \begin{matrix} x=a^{n+1}, n \in \mathbb{N} \\ x \neq a^{n+1} \;\;\;\; \;\; \end{matrix}\).
Then f is bijective from \([0,1]\setminus \left \{a \right \}\) to \([0,1)\).

Theorem: \(\mathbb{R} \sim \mathbb{R} \setminus \mathbb{A}\)
Proof: Let \( P= \left\{\begin{matrix} \frac{1}{p} \end{matrix}\right|\left.\begin{matrix} p \in \mathbb{P} \end{matrix}\right\}\). Clearly \(P \sim \mathbb{N}\), as \(P \sim \mathbb{P}\) and \(\mathbb{P} \sim \mathbb{N}\). Moreover, as \(\mathbb{N} \sim \mathbb{A}\), we have \(P \sim \mathbb{A}\), and since \((0,1) \sim \mathbb{R}\), we just need to show that \((0,1) \sim (0,1) \setminus P \). Let
\(f(x)=\left\{\begin{matrix} \frac{1}{p_k^{n+1}}, \\ x, \;\; \end{matrix}\right. \begin{matrix} x=\frac{1}{p_k^{n}}, n,k \in \mathbb{N} \\ x \neq \frac{1}{p_k^{n}} \;\;\;\;\;\;\;\;\;\;\; \end{matrix}\).
Then \(f\) is bijective from \((0,1)\) to \((0,1) \setminus P \). Therefore, it follows that \(\mathbb{R} \sim \mathbb{R} \setminus \mathbb{A}\).

Theorem: \(\mathcal{P}(\mathbb{N}_{n})\sim \mathbb{N}_{2^{n}}\).
Proof: We use the fact that every number has a unique binary representation. Let \(A \in \mathcal{P}(\mathbb{N}_{n}) \). We then let \(f(A)=1+\sum_{x \in A} 2^{x-1}\) and \(f(\emptyset)=1\). Then f is bijective from \(\mathcal{P}(\mathbb{N}_{n})\) to \(\mathbb{N}_{2^{n}}\).

Theorem: If \(A \sim B\) then \(\mathcal{P}(A) \sim \mathcal{P}(B)\).
Proof: Let \(f\) be a bijective function from A to B. Let \(X \in \mathcal{P}(A)\). Then \(Y= f(X)\) will be a bijective function from \(\mathcal{P}(A)\) to \(\mathcal{P}(B)\). However, by Easton's Theorem, the converse is only true if we assume the generalized continuum hypothesis, which is the proposition that, if \(X\) is an infinite set, there is no set \(Y\) such that \(X \prec Y \prec \mathcal{P}(X)\).

Theorem: \(\mathbb{N}_{m} \times \mathbb{N}_{n} \sim \mathbb{N}_{mn}\).
Proof: Let \((a,b)\in \mathbb{N}_{m} \times \mathbb{N}_{n}\). Let \(f((a,b))= (a-1)n+b\), then f is bijective from \(\mathbb{N}_{m} \times \mathbb{N}_{n}\) to \(\mathbb{N}_{mn}\).

Theorem: \(\mathcal{P}(\mathbb{N}) \sim (0,1)\).
Proof: We use the fact that every number has a binary representation. However, there are equivalent representations for numbers with finite binary representations. For instance, \(\frac{3}{4}=0.11_{2}=0.101111..._{2}\). Let \(A\in \mathcal{P}(\mathbb{N})\). We the define
\(f(A)=\left\{\begin{matrix} \frac{1}{8}+\frac{1}{4}\sum_{k \in A} 2^{-k}, \\ \frac{1}{2}+\frac{1}{4}\sum_{k \in A} 2^{-k}, \end{matrix}\right. \begin{matrix} A \prec \mathbb{N} \\ A \sim \mathbb{N} \end{matrix}\).
Then we can see that \(f\) is injective from \(\mathcal{P}(\mathbb{N})\) to \((0,1)\). We then define \(g(x)= \left\{\begin{matrix} n \end{matrix}\right|\left.\begin{matrix} n\in \mathbb{N} \wedge \left \lfloor 2^n x \right \rfloor \equiv 1 \pmod 2 \end{matrix}\right\}\). Then we can see that \(g\) is injective from \((0,1)\) to \(\mathcal{P}(\mathbb{N})\). Thus, as \(\mathcal{P}(\mathbb{N}) \preceq (0,1)\) and \((0,1) \preceq \mathcal{P}(\mathbb{N})\), we find that \((0,1) \sim \mathcal{P}(\mathbb{N})\).
Corollary: \(\mathcal{P}(\mathbb{N}) \sim \mathbb{R}\).

Theorem: \(\mathbb{N} \times \mathbb{N} \sim \mathbb{N}\).
Proof: The function \(f(n)=(n,1)\) is injective from \(\mathbb{N}\) to \(\mathbb{N} \times \mathbb{N}\). Therefore \(\mathbb{N} \preceq \mathbb{N} \times \mathbb{N}\). The function \(f((a,b))=2^{a}3^{b}\) is injective from \(\mathbb{N} \times \mathbb{N}\) to \(\mathbb{N}\). Therefore \(\mathbb{N} \times \mathbb{N} \preceq \mathbb{N}\). Therefore \(\mathbb{N} \times \mathbb{N} \sim \mathbb{N}\).

Theorem: \([0,1] \times [0,1] \sim [0,1]\).
Proof: We use the fact that ever number has a unique binary representation. Let \(a,b \in [0,1]\), and \(a=\sum_{k=1}^{\infty} a_{k} 2^{-k}\) and \(b=\sum_{k=1}^{\infty} b_{k} 2^{-k}\), where \( a_k , b_k \in \left \{ 0,1 \right \} \). Then, let \(f((a,b))=\sum_{k=1}^{\infty} \left ( 2 a_k+b_k \right ) 4^{-k}\) and then we can see that \(f\) is bijective from \([0,1] \times [0,1]\) to \([0,1]\).
Corollary: \(\mathbb{R} \times \mathbb{R} \sim \mathbb{R}\).

Theorem: If \(\mathbb{N} \preceq A\) or \(\mathbb{N} \preceq B\), then \(|A|+|B|=|A|\times|B|=\max(|A|,|B|)\).

Theorem: \(\mathbb{S}_{m}+\mathbb{S}_{n}=\mathbb{S}_{m+n} \) and \(\mathbb{S}_{m}\times\mathbb{S}_{n}=\mathbb{S}_{mn} \).

Theorem: The following proposition is undecidable (cannot be proved or disproved) in ZFC: "There does not exist a set \(X \subseteq \mathbb{R}\) such that \(\mathbb{N} \prec X \prec \mathbb{R}\)."

Friday, May 2, 2014

Syllogisms and Probability

A syllogism is a systematic way of laying out a sequence of logical steps. They are very useful in formulating arguments, constructing proofs, or checking the coherence of a certain train of thought. A syllogism is a systematic way of laying out an argument. A syllogism consists of three parts: two premises (which can sometimes be construed as major and minor) and a conclusion. A syllogism is valid if the conclusion follows logically from the premises, whereas is it invalid if it does not. A syllogism is sound if both its premises are true, or at least plausibly true, and the syllogism is valid. Note that a syllogism may be invalid and yet have a true conclusion: it may even have true premises and a true conclusion and still be invalid.

We can also draw the distinction between a propositional and a categorical syllogism. In a propositional syllogism, we deal only with bare propositions connected by logical operators like "or", "and" "implies" etc. In a categorical syllogism, we deal only with statements of inclusion or exclusion in certain categories that are quantified, using such terms as "all" "some" and "no/none".

For example:
1) If Socrates is a man then Socrates is mortal.
2) Socrates is a man.
3) Therefore Socrates is mortal.

This is an example of a propositional syllogism. We can also write this in a categorical form:
1) All men are mortal.
2) Socrates is a man
3) Therefore Socrates is mortal.
We will henceforth focus on the propositional variety.

There are a number of standard propositional syllogisms (also called propositional forms)
  • Modus Ponens: If A, then B. A. Therefore B.
  • Modus Tollens: If A then B. Not B. Therefore, not A.
  • Hypothetical Syllogism: If A then B. If B then C. Therefore, if A then C.
  • Disjunctive Syllogism: Either A or B. Not A. Therefore B.
  • Conjunctive Syllogism: A. B. Therefore A and B.

We can further pare these down by noting that modus ponens and modus tollens are both actually disjunctive syllogisms when we state them using material conditionals. Remember that, in a material conditional, "if A then B" is equivalent to "Either not A, or B".

In a valid argument, if we grant that the premises are true, we must grant that the conclusion is also true. However, we typically are required to evaluate premises of which we are not certain. Namely, the premises have epistemic probabilities that are less than one. In that case, we can use our knowledge of probabilities to give bounds on the probability of the conclusion, and possibly even an estimate of the probability of the conclusion. That is, we can give limits to how plausible we are to take the conclusion to be. We will merely state it as a chart. In each case x is the probability of the first premise (major premise), y is the probability of the second premise (minor premise), and z is the probability of the conclusion. That is, \(P(p1)=x\), \(P(p2)=y\), \(P(c)=z\).

Bounds on Syllogism Probabilities

Name Form Bounds on Conclusion Estimate of Conclusion
Disjunctive Syllogism \(p1)\> A \cup B\)
\(p2)\> \sim A\)
\(c)\>\> \therefore B\)
\(x+y-1 \leq z \leq x\) \(z=x+\frac {y-1}{2}\)
Conjunctive Syllogism \(p1)\> A\)
\(p2)\> B\)
\(c)\>\> \therefore A \cap B\)
\(x+y-1 \leq z \leq \min(x,y)\) \(z=xy\)
Hypothetical Syllogism \(p1)\> A \Rightarrow B\)
\(p2)\> B \Rightarrow C\)
\(c)\>\> \therefore A \Rightarrow C\)
\(x+y-1 \leq z \leq 1\) \(z= \frac {x+y}{2}\)

For example, suppose we say "If it is raining, then Bob will bring an umbrella. It is raining. Therefore Bob will bring an umbrella." We are 80% sure that Bob will bring an umbrella if it is raining and 70% sure that it is raining. Thus, the probability that Bob does bring an umbrella is somewhere between 50% and 80%, with an estimate of 56%. That is not very confident.
Suppose we have two propositions of which we are 70% certain of each. Then the probability that both are true can be anywhere from 40% to 70% with an estimated value of 49%. Thus it is not enough that each of the premises be more plausible than their opposites: we must demand more in order that an argument be a good one.

Sometimes when people say "if A then B", they mean something different than the material conditional ("Either not A, or B"). Frequently, they mean something like "Given A, B will happen/be true". Thus the probability of "if A, then B" wouldn't be \(P(\sim A \cup B)\), but rather \(P(B|A)\). We will here give the table for the conditional probability case. That is, in every case that a conditional statement is given a probability, that probability is of the probability of the consequent given the antecedent.

Bounds on Syllogism Probabilities

Name Form Bounds on Conclusion Estimate of Conclusion
Modus Ponens \(p1)\> A \Rightarrow B\)
\(p2)\> A\)
\(c)\>\> \therefore B\)
\(xy \leq z \leq xy+1-y\) \(z=xy+\frac {1-y}{2}\)
Modus Tollens \(p1)\> A \Rightarrow B\)
\(p2)\> \sim B\)
\(c)\>\> \therefore \sim A\)
\(\frac {x+y-1}{x} \leq z < 1\) \(z=\frac {2x+y-1}{2x}\)

For a hypothetical syllogism with conditional probabilities, we can give no bounds or estimates.

Wednesday, April 23, 2014

Rosmarus et Naupegus

The Walrus and the Carpenter is another poem from Lewis Carroll's Through the Looking-Glass. It is the semi-nonsensical story of a walrus, a carpenter and some oysters. The poem is one of my favorites, and thus an obvious candidate to be translated into Latin. A literal de-translation is given below the parallel translation.

Rosmarus et Naupegus

          A Luviso Carollo
          Latinatum a Nadavo Cravito

Sol nitebat in mare,
Nitens summa cum vire
Fluctus conans facere
Blandos et clarosque
Visu mirabilissime
In media nocte

Maeste luna nitebat,
Putabat nam soli
Haud opus esse adesse
Post finem quin diei.
Dicit "venisse, irruere
Improba sunt mihi,"

Mare umet udissimum,
Litus aridissimum
Nimbus nullus videatur
Abest nam videndum:
Super volat avis nulla—
Abest nam volandum.

Rosmarus et Naupegus
Juxtim ambulabant;
Moles harenae videre
Quam misere flebant
"Si modo hoc depurgatum,
Quam bonum." dicebant

"Si servae septem scopis quot
Averrent quot menses,
"Num possent" dicit Rosmarus
"Purgarene putes?"
Naupegus dicit "Dubito,"
Flens lacrimas tristes.

"O Ostreae, sequamini"
Obsecrat Rosmarus
“Loquamur salsam per actam
atque ambulemus:
Sed summum quattuor sumamus,
Manum cuique demus.”

Ostrea maxima spicit,
Nulla verba dicit:
Ostrea maxima nictat,
Grave caput quatit—
Linquere ostrearium
Videlicet nolit.

Minores quattuor festinant,
Muneri stundentes:
Aliclae tersae, frontes lautae,
Calcei elegantes—
Sane mirabilissime,
Non pedes habentes.

Ostreae quattuor sequuntur,
Et quattuor aliae.
Veniunt iam et gregatim,
Et crescenti multae.
Per undas spumas saliunt,
Petunt portum harenae.

Rosmarus et Naupegus
Stadium gradiuntur
Tunc commode in scopulo
Demisso nituntur
Ordineque ostreulae
Nunc opperiuntur.

"Adest tempus," dicit Rosmarus
"Semonis de multis:
Naves, calcei, cera sigilli,
Et reges et caulis
Ac porci num alarentur,
Causa fervoris maris."

“Mora sodes,” testae clamant
“Antequam loquaris;
Exanimatae et pingues
Sunt namque e nobis!”
“Festina lente!” Naupegus
Dicit valde gratis.

“Massae panis” dicit Rosmarus
“Nobis opus maximum”
“Enimvero optimi sunt
Piper et acetum—
Ostreulae, si paratae,
Inchoamus comesum.”

Ostreae “Atqui non nostrum!”
Clamant livescentes.
“Factum atrum sit nobis
Post gratias tales!”
Rosmarus dicit “Bella nox,
Miramini species?”

“Quam comes sunt quod venisitis,
Et estis quam belli!”
Tacebat Naupegus nisi
“Seca frustum mihi:
Opto ut minus surdus sis—
Iam bis te poposci!”

“Turpe eas dolo capere
Est” dicit Rosmarus
“Post tam longe eduximus
Tam cursim egimus!”
Tacebat Naupegus nisi
“Butyrum crassius!”

“Vos lacrimo” ait Rosmarus
“Vostrum me miseret.”
Singultibus et lacrimis
Maximas diribet
Mucinium ante oculos
Fundentes praetendet.

“Ostreae,” dicit Naupegus,
“Bene cucurrerunt!
Debemus domum redire?”
At voces nullae reddunt—
Et haud mirabilissime,
Quod quamque ederunt.
The Walrus and the Carpenter

          By Lewis Carroll

The sun was shining on the sea,
Shining with all his might:
He did his very best to make
The billows smooth and bright--
And this was odd, because it was
The middle of the night.

The moon was shining sulkily,
Because she thought the sun
Had got no business to be there
After the day was done--
"It's very rude of him," she said,
"To come and spoil the fun!"

The sea was wet as wet could be,
The sands were dry as dry.
You could not see a cloud, because
No cloud was in the sky:
No birds were flying overhead--
There were no birds to fly.

The Walrus and the Carpenter
Were walking close at hand;
They wept like anything to see
Such quantities of sand:
"If this were only cleared away,"
They said, "it would be grand!"

"If seven maids with seven mops
Swept it for half a year.
Do you suppose," the Walrus said,
"That they could get it clear?"
"I doubt it," said the Carpenter,
And shed a bitter tear.

"O Oysters, come and walk with us!"
The Walrus did beseech.
"A pleasant walk, a pleasant talk,
Along the briny beach:
We cannot do with more than four,
To give a hand to each."

The eldest Oyster looked at him,
But never a word he said:
The eldest Oyster winked his eye,
And shook his heavy head--
Meaning to say he did not choose
To leave the oyster-bed.

But four young Oysters hurried up,
All eager for the treat:
Their coats were brushed, their faces washed,
Their shoes were clean and neat--
And this was odd, because, you know,
They hadn't any feet.

Four other Oysters followed them,
And yet another four;
And thick and fast they came at last,
And more, and more, and more--
All hopping through the frothy waves,
And scrambling to the shore.

The Walrus and the Carpenter
Walked on a mile or so,
And then they rested on a rock
Conveniently low:
And all the little Oysters stood
And waited in a row.

"The time has come," the Walrus said,
"To talk of many things:
Of shoes--and ships--and sealing-wax--
Of cabbages--and kings--
And why the sea is boiling hot--
And whether pigs have wings."

"But wait a bit," the Oysters cried,
"Before we have our chat;
For some of us are out of breath,
And all of us are fat!"
"No hurry!" said the Carpenter.
They thanked him much for that.

"A loaf of bread," the Walrus said,
"Is what we chiefly need:
Pepper and vinegar besides
Are very good indeed--
Now if you're ready, Oysters dear,
We can begin to feed."

"But not on us!" the Oysters cried,
Turning a little blue.
"After such kindness, that would be
A dismal thing to do!"
"The night is fine," the Walrus said.
"Do you admire the view?

"It was so kind of you to come!
And you are very nice!"
The Carpenter said nothing but
"Cut us another slice:
I wish you were not quite so deaf--
I've had to ask you twice!"

"It seems a shame," the Walrus said,
"To play them such a trick,
After we've brought them out so far,
And made them trot so quick!"
The Carpenter said nothing but
"The butter's spread too thick!"

"I weep for you," the Walrus said:
"I deeply sympathize."
With sobs and tears he sorted out
Those of the largest size,
Holding his pocket-handkerchief
Before his streaming eyes.

"O Oysters," said the Carpenter,
"You've had a pleasant run!
Shall we be trotting home again?'
But answer came there none--
And this was scarcely odd, because
They'd eaten every one.

The Walrus and the Carpenter (De-translated)

The sun was shining on the sea
shining with [his] greatest might.
Trying to make the billows
Both smooth and bright
[A sight] most astonishing to see
in the middle of the night

The moon was shining gloomily
For she thought, for the sun
it was not [his] business to be present
Indeed, after the end of the day.
She says “To have come and to intrude
are rude [acts] to me.”

The very wet sea is wet
And the shore is most dry
No cloud might be seen
For anything to be seen is absent
No bird flies above—
For anything to be flown is absent.

The Walrus and the [ship-wright] Carpenter
Were walking together
To see such masses of sand,
how wretchedly they wept.
“If only this might be cleaned away
How good [it would be]” they said.

“If seven maids with as many brooms
were to sweep for as many months,
don’t you think” says the Walrus
“they couldn’t clean [it]?”
The Carpenter says “I doubt [it]”,
Weeping sad tears.

“O oysters, let you follow [us]”
Entreats the Walrus
“Let us talk along the briny beach
and also let us walk:
But we might only select at most four,
that we might give a hand to each.”

The oldest oyster looks
[and] says no words:
The oldest oyster winks,
[and] gravely shakes [her] head—
To leave the oyster-bed
evidently, she would be unwilling.

Four younger [oysters] hurry,
eager for [the] gift
[Their] child’s-cloaks [were] wiped, [their] faces washed
[And their] shoes were handsome—
Certainly most astonishingly,
[as they] had no feet.

Four [more] oysters follow,
and four others.
Now they come even in flocks
and to an increasing[ly] many [oysters].
Through waves [and] foam they leap
Seeking the refuge of the sand.

The Walrus and the Carpenter
walk for a furlong
Then lean upon a rock
conveniently low.
And in a line, the little oysters
now wait.

“The time has arrived” says the Walrus
“of speaking concerning many things:
Ships, shoes, wax of a seal,
both kings and cabbage
And whether pigs be winged,
and the cause of the boiling of the sea.”

“A pause, please” the shellfish cry
Before you might speak;
For exhausted and fat
are there among us!”
“Hasten slowly!” said the Carpenter
To the greatly thankful [oysters].

“A lump of bread,” the Walrus says
“is our greatest need:
Certainly, pepper and vinegar
are also very good—
Little oysters, if [you are] ready,
we begin the eating.”

The oysters cry “But not of us!”
becoming blue.
“[That] would be a terrible deed,
after so great kindnesses!”
The Walrus says “The night [is] pretty,
Do you admire the sights?”

How kind you are that you came
And you are so charming!”
The Carpenter was silent except [for saying]
“Cut a scrap for me:
I wish that you might be less deaf—
I have already asked you twice!”

“A disgraceful [thing] it is
to catch them with a trick” says the Walrus
“After we have led them out so far
and drove them so swiftly.”
The Carpenter was silent except [for saying]
“The butter is too thick!”

“I weep for you” says the Walrus
“I feel sorry for you.”
With sobbing and with tears
he sorted out the biggest ones.
He extended a handkerchief
before his pouring eyes.

“Oysters,” says the Carpenter
“You have run well!
Ought we return home?”
But no voices return—
And hardly is it most astonishing
because they had eaten every one.

Monday, April 14, 2014


      Probability is a concept that is intuitively fairly easy to understand, yet difficult to give a comprehensive, universally acceptable interpretation. In general, probabilities are given with respect to events or propositions and give a way of quantitatively answering such questions as “How certain are we that X will occur/is true?”. Probabilities range from 0, (almost*) certain not to happen/be true, to 1, (almost*) certain to happen/be true.

There is division as to whether probabilities are objective facts or only subjective. Some say the probability of an event is a measure of the propensity of a certain situation to yield a certain outcome, while others say that the probability of an event is the relative frequency of that event in the limit of a large number of relevantly identical cases, or trials. Those who say it is subjective give, for instance, the conception that the probability of an event can be defined as “the price (in arbitrary units) at which you would buy or sell a bet that paid 1 unit if the event occurred and 0 if it did not occur”.

One way to circumvent all of these is to leave probability somewhat vague and give it a thorough mathematical basis. This can readily be done. We will deal with the probability of some event which will occur as a result of some experiment in individual trials. What is needed for a probabilistic model are two things:
  • The sample space \(\Omega\): the set of all possible outcomes of the experiment.
  • The probability law \(P\). This is a function that takes a subset of the sample space and returns a real number. This law, to qualify as a proper probability law, must satisfy three conditions. Let \(A\) be some subset of \(\Omega\).
    1. Non-negativity: For any set \(A\) subset of \(\Omega\), \(P(A) \geq 0\).
    3. Countable Additivity: Let \(A_{1}, A_{2}, ...\) be a countable sequence of mutually disjoint sets (that is, no element in one set is in any other set), each a subset of \(\Omega\). Then \(P(A_{1} \cup A_{2} \cup ...)=P(A_{1})+P(A_{2})+...\).
    5. Normalization: the probability of some event in the space is unity, that is \(P(\Omega)=1\).
If the model satisfies these conditions, it is at least admissible, though typically we have other considerations that help us choose a model, such as simplicity. These conditions imply that the empty set, that is, the set containing no elements, has probability zero.

Very typical in probability theory is the use of set-theoretic or logical-operator notation. While notation varies, the fundamental concepts remain consistent. When we want the probability that events \(A\) and \(B\) will both happen (e.g. a die lands on an even number and on a number above three), we ask for the probability of their conjunction, represented as \(P(A \cap B)\) or \(P(A \& B)\) or \(P(A \wedge B)\). When we want the probability that at least one event of the events \(A\) and \(B\) will happen (e.g. a die lands on an even number or on a number above three, or both) we ask for the probability of their disjunction, represented as \(P(A \cup B)\) or \(P(A \vee B)\). When we want the probability that some event will not happen (e.g. a die does not land on an even number), we ask for the probability of the complement of the event, represented \(P(\sim A)\) or \(P(\bar{A})\) or \(P(A^{c})\) or \(P(\neg A)\). The empty set is symbolized as \(\varnothing\) and represents the set with no elements. Thus, taking the union of \(\varnothing\) with any other set gives the latter set, and taking the intersection yields the empty set. In addition, we can say \(\varnothing=\sim \Omega \) and \(\sim \varnothing= \Omega \). Lastly, a partition of set \(C\) is a countable sequence of sets such that no two sets in the partition share an element (the sets are mutually exclusive) and every element in \(C\) is in some set (the collection is collectively exhaustive).

Also important in the study of probability is the concept of conditional probability. Thus is the measure of probability based on some information: assuming that something is the case, what is the chance that some event will occur? For instance, we could ask what the chance is that a die landed six, given that it landed on an even number. While a more thorough discussion of conditional probability can be found elsewhere, we will here merely give the formula. \(P(A|B)\), the probability that \(A\) will occur given \(B\) (read “the probability of \(A\) given \(B\)” or “the probability of \(A\) on \(B\)”), is given by the expression \[ P(A|B)=\frac{P(A \cap B)}{P(B)}\] whenever \(P(B) \neq 0\). Sometimes it is possible to assign a meaningful value to \(P(A|B)\) when \(P(B)=0\). For instance, suppose we ask “what is the probability that a homogeneous, spherical marble, when rolled, will land on point A, given that it landed either on point A or point B?” The answer then seems clearly to be 0.5. A good interpretation of the conditional is a change in the sample space: when we condition on \(B\), we are changing the sample space from \(\Omega\) to \(B\). We find that all the axioms and theorems are consistent with this view. We can also mention here the notion of independence. Two events \(A\) and \(B\) are independent iff \(P(A \cap B)=P(A) P(B)\). This implies that \(P(A|B)=P(A)\) and \(P(B|A)=P(B)\). This means that, given the one, we gain no information about the other: it remains just as probable.

While the probability of events is relatively easy to understand, the probability of propositions is not as easy, as propositions can have only two values: true and false. How is it that we can say “The probability that you are female is 51%” when you are either definitely male or definitely female? This is where the notion of epistemic probability comes into play. Epistemic probability has to do with how likely something seems to us, or some other (rational) person, given some set of background information. For instance, in some murder, given that we see Joe’s fingerprints on the murder weapon, we deem it likely that Joe committed the murder. Though it is very difficult to give a good account, a rough way to quantify it would be in the following sense:
\(X\) has epistemic probability \(p\) given background information \(B\) (i.e. \(P(X|B)=p\)) iff the following is true: supposing we encountered this set of information \(B\) in many scenarios, we would expect \(X\) to be true in fraction \(p\) of those scenarios.
Again, this may not be a perfect analysis, but it does give a rough way to understand it. However, we must note that epistemic probability is of a significantly inferior sort than, say, experimental probability (observing that \(X\) happens in fraction p of experimental cases), or even a good theoretical probability (theory predicts that a homogeneous cube of material will, when haphazardly tossed, land on any given face with equal probability). There is a principle called the principle of indifference that says one should assign equal epistemic probabilities to two events or propositions when we have no justification to prefer one to the other. That may be a good principle as far as epistemic probability goes, but it is very deeply restricted by background information (clearly: lacking any background information to prefer one possibility to another, we are to assign them equal probabilities), and at least somewhat subjective. It is thus greatly limited by what we know: in fact, what we think is a possibility, based on our background information, may not be a possibility at all (it could be what is called an epistemic possibility). Thus, while epistemic probability may be the best we can do, given our background information, it may not be very good at all.

Statistical probability is of the epistemic sort: suppose that fraction p of population S has property X. We then come across a member M of S. Suppose we have no way to tell immediately whether M has property X, but we know M comes from S. We therefore say that M has property X with (epistemic) probability p. This is a statistical probability: based on facts about the population, we deduce a probability as regards a given individual, even though, if we had more information, we could say that M had X with probability either zero or one. This is to be contrasted with what we might call stochastic probability. If we have a perfect coin, and flip it fairly, before we do so, there is no information anywhere, even possibly, as to what its outcome will be. We don't know what will happen when we flip it, not because we aren't privy to some information, but because there is no information to be had. This will be the case with any genuinely indeterministic event. We might demonstrate the difference between statistical and stochastic probabilities as between a coin that was flipped but is hidden from view and a coin yet to be flipped, respectively. Most physicists believe many quantum processes are genuinely stochastic, and some philosophers believe free will is also stochastic in some sense ("You will probably choose X" does not mean that based on what I know now, there is a pretty high epistemic probability that you will choose X, but if I knew more, I would be able to predict with certainty whether you will choose X or not (e.g. you chose X most of the time when you are in certain circumstances). Instead, it is that you are more disposed to choose X).

We will here give a few theorems of probability theory. We will try to present them such that their derivation is clear, but if not, then any introductory text on probability theory can give a more thorough exposition. \(A\) and \(B\) are some subsets of \(\Omega\):
\[P(\Omega)=1;\;\;\;\ P(\varnothing)=0\] \[P(A \cap \Omega)=P(A);\;\;\;\ P(A \cap \varnothing)=P(\varnothing)=0\] \[P(A \cup \Omega)=P(\Omega)=1;\;\;\;\ P(A \cup \varnothing)=P(A)\] \[0 \leq P(A) \leq 1\] \[0 \leq P(A|B) \leq 1\] \[P(A \cup \sim A)=P(A)+P(\sim A)=P(\Omega)=1\] \[P(A \cup B)+P(A \cap B)=P(A)+P(B)\] \[P(A \cap B)\leq \min(P(A),P(B))\] \[P(A\cup B) \leq P(A)+P(B)\] \[P(A \cup B) \geq \max(P(A),P(B))\] \[P(A \cap B)+P(A \cap \sim B)=P(A)\] \[P(A \cap B)=P(A)P(B|A)\] \[P(A|B)=\frac{P(B|A)P(A)}{P(B)}\] \[\frac{P(A|B)}{P(A)}=\frac{P(B|A)}{P(B)}\]

Let \(B_{1},B_{2},…\) be a partition of \(C\), then: \[P(A \cap B_{1})+P(A \cap B_{2})+…=P(A \cap C) \] \[P(A \cap C)=P(A|B_{1})P(B_{1})+ P(A|B_{2})P(B_{2})+…\] \[P(C|A)=P(B_{1}|A)+ P(B_{2}|A)+…\] Particularly, \[1=P(B|A)+P(\sim B|A)\]

De Morgan’s Laws

\[P(\sim A \cap \sim B)+P(A \cup B)=1\]\[P(\sim A \cup \sim B)+P(A \cap B)=1\]

Bayes’ theorem

Let \(H_{1}, H_{2}, …\) be a partition of \(\Omega\). Then: \[P(H_{m}|E)=\frac{P(H_{m}) P(E|H_{m})}{P(H_{1})P(E|H_{1})+ P(H_{2})P(E|H_{2})+…}\] This is typically applied to choosing a hypothesis to explain a certain fact, or given a certain set of evidence. \(P(E|H_{m})\) is the (epistemic) probability that we would get evidence \(E\) supposing hypothesis \(H_{m}\) is true, and \(P(H_{m}|E)\) is the (epistemic) probability that hypothesis \(H_{m}\) is true, given evidence \(E\). Thus, hypothesis \(H_{m}\) becomes more likely on evidence \(E\) the more probable it is without the evidence, the more likely the evidence would be on that hypothesis, the less likely the evidence would be on alternate hypotheses, and the less likely the alternate hypotheses are without the evidence.

Probability of a Union

We can here give a useful formula for determining the probability of the union of events, which we can deduce from DeMorgan’s laws: suppose we want to find the probability of the union of some events \(Q=P(A_{1} \cup A_{2} \cup …)\).
We take the product \(Q'=1-\prod_{n}(1-A_{n})\)
We then replace every occurrence of \(A_{m_{1}}A_{m_{2}}…\) with \(P(A_{m_{1}} \cap A_{m_{2}}…) \). For instance, to find \(P(A \cup B \cup C) \), we take \(1-(1-A)(1-B)(1-C)=A+B+C-AB-AC-BC+ABC\)
We then make replacements as described to get \[P(A \cup B \cup C)=P(A)+P(B)+P(C)-P(A \cap B) -P(A \cap C) -P(B \cap C)+ P(A \cap B \cap C) \] If the events are all independent, we can simplify the formula to: \[Q=1-\prod \nolimits_{n}(1-P(A_{n}))\] If \(P(A_{m})=p\) for all m, we can further simplify: \[Q=1-(1-p)^{N}\] Where \(N\) is the number of events. For p fairly small, we can approximate this as \[Q \approx 1-e^{-pN}\] And from this we can rearrange to get \[N \approx \frac{-\ln(1-Q)}{p}\] This gives the number of independent trials necessary to get a probability Q of at least one success, if the probability of success in each trial is p.
As an application, we can ask what is the probability that an event will happen on a given day if it has a 50% probability of happening in a year? In this case, we want to solve for \(p\) given \(Q=0.5\) and \(N=365\). We find that \(p \approx \frac{-\ln(1-Q)}{N}=0.19\%\).

Using this, we can also show that improbable events are likely in a collection of many trials. Suppose we have \(N\) trials, in each of which X happens with probability \(p\). We then have the probability that X never happens as given by \((1-p)^{N} \approx e^{-pN}\). We thus see that, as N increases, the probability of no X occurring tends to zero; in fact, it tends to zero exponentially. Thus, given enough trials we would expect to see the individually improbable: long strings of all heads while flipping a coin, the same person winning the lottery multiple times, someone has two unrelated rare diseases, etc. Coincidences will always crop up given enough opportunities. These coincidences combined with confirmation bias--remembering the hits and forgetting the misses--result in muddled thinking. A coincidence happens and it is interpreted as a sign from on high, even though they ignored the hundreds of other times no coincidence happened. It is important to remember that coincidences are basically inevitable in large enough samples: if something has a one in a billion chance of happening to any given person on any given day, we can expect it will happen seven times per day worldwide.

Implication and Conditional Probability

We can also prove the following interesting theorem. Note that “if \(A\), then \(B\)” or “\(A\rightarrow B\)” is logically equivalent to “\( \sim A \) or \(B\)” or “\( \sim A \cup B\)”. Thus \(P(A \rightarrow B)=P(\sim A \cup B)\). We then have
\(1.\;\; 1 \geq P(A)\)
\(2.\;\; P(\sim B|A) \geq P(\sim B|A)P(A)=P(\sim B \cap A)\)
\(3.\;\; 1-P(\sim B|A) \leq 1-P(A \cap \sim B)\)
\(4.\;\; P(B|A) \leq P(\sim A \cup B)=P(A \rightarrow B)\)
That is, the probability of “if \(A\) then \(B\)” is not less than that of “\(B\), given \(A\)”.

We can also note that, as \(P(A)=P(A|B)P(B)+P(A|\sim B)(1-P(B))\), then \[\min(P(A|B),P(A|\sim B)) \leq P(A) \leq \max(P(A|B),P(A|\sim B))\]

Conditional Changes in Probability and How it Relates to Evidence

We can demonstrate the following:
Suppose \(P(A|B)>P(A)\). Then \(P(A \cap B)>P(A)P(B)\) and \(P(B|A)>P(B)\). In fact, all three are equivalent. In that case:
\( 1.\;\; P(A \cap B)> P(A)P(B) \)
\( 2.\;\; P(A) - P(A \cap B)< P(A)- P(A)P(B)=P(A)(1-P(B)) \)
\( 3.\;\; P(A \cap \sim B) < P(A)P(\sim B) \)
\( 4.\;\; P(A | \sim B) < P(A) \)
We can easily prove the greater-than case in the same way.
In English: "if A is more probable on B, B is more probable on A" and "if A is more probable on B, A is less probable on not-B".

An important consequent of this theorem is in discerning what counts as evidence. In a loose sense, we can say that \(A\) provides evidence for \(B\) if \(P(B|A)>P(B)\). We thus see that a necessary and sufficient condition for \(A\) providing evidence for \(B\) is that \(\sim A\) would need to provide evidence against \(B\). Thus, if we do some experiment to test a claim, we must be willing to accept failure as evidence against the claim if we would be willing to accept success as evidence for the claim, and vice versa. We must be willing to accept the possibility of weakening the claim if we are willing to accept the possibility of strengthening it by some test. It is often said that "absence of evidence is not evidence of absence", but this needs some qualification. Suppose we want to test the claim that there is life on Mars. We then do some test, like looking at a sample of martian soil under a microscope, and it comes up negative: is that evidence against life on Mars? Certainly, albeit very weak evidence. If we had found microbes in that sample, we would certainly have said that was evidence for life on Mars, therefore we must necessarily admit that the lack of microbes is evidence against life on Mars. It may only reduce the (epistemic) probability that there is life on Mars by something like a millionth of a percent, but if we do a million tests, that amounts to about a whole percent. If we do a hundred million tests, that amounts to over 60%.

In short, absence of evidence does count as evidence of absence in any and every instance where a presence of evidence would count as evidence of presence.

"Extraordinary Claims Require Extraordinary Evidence"

This phrase is declared nearly as often as it is denounced. However, it is clearly not specific enough to be definitively evaluated. One way of interpreting it is to say "Initially improbable hypotheses require improbable evidence to make them probable". This formulation is relatively easy to demonstrate as being true: \[ P(E)=\frac{P(H \cap E)}{P(H|E)} \leq \frac{P(H)}{P(H|E)} \] For example, if \(P(H)=1 \%\) and \(P(H|E)=75 \%\) then \(P(E) \leq 1.33 \%\).
If \(P(H|E) \geq 0.5\), then \(P(E) \leq 2 P(H)\). Thus, it is clear that the evidence required to make an initially improbable hypothesis probable must be comparably improbable.

Inscrutable Probabilities, Meta-probabilities and Widening Epistemic Probability

Sometimes, in cases of certain probabilities, we cannot estimate the probabilities, either at all, or to an adequate degree. We call such probabilities inscrutable. For all we know, these probabilities could have any value. We can use the concept of inscrutable probabilities to improve the descriptive accuracy of our epistemic probability judgements. For instance, suppose we have a die, and we are \(90\%\) sure that it is fair. We then want to find the probability that a six will be rolled. We make use of the formula: \[P(A)=P(A|B)P(B)+P(A|\sim B)P(\sim B)\] In this case, A is the event "a six is rolled" and B is the event "the die is fair". In this case, \(P(A|\sim B)\) is inscrutable: given that the dies is not fair, we cannot predict what the outcome will be. However, we do know that this probability, like all probabilities, is between zero and 1. Thus: \[P(A|B)P(B) \le P(A) \le P(A|B)P(B)+P(\sim B)\] In this case, we find \[P(6|\text{fair})P(\text{fair}) \le P(6) \le P(6|\text{fair})P(\text{fair})+P(\sim \text{fair})\] \[\frac{1}{6} \cdot 0.9 \le P(6) \le \frac{1}{6} \cdot 0.9 + 0.1\] \[0.15 \le P(6) \le 0.25\] Here we may introduce the concept of meta-probabilities. These take the form of the probability that something is true about a probability, for instance \(P(P(X) \ge \alpha)\) is the probability that the probability of X is not less than \(\alpha\). Returning to our example, suppose we are only \(80\%\) confident that the probability that the die is fair is \(90\%\). Applying the above formula: \[P(\text{fair}|P(\text{fair})=0.9)P(P(\text{fair})=0.9) \le P(\text{fair}) \le P(\text{fair}|P(\text{fair})=0.9)P(P(\text{fair})=0.9)+P(P(\text{fair}) \neq 0.9)\] \[0.9 \cdot 0.8 \le P(\text{fair}) \le 0.9 \cdot 0.8+0.2\] \[0.72 \le P(\text{fair}) \le 0.92\] This then implies \(0.08 \le P(\sim \text{fair}) \le 0.28\).
Returning to our former equation, we then have: \[P(6|\text{fair})P(\text{fair}) \le P(6) \le P(6|\text{fair})P(\text{fair})+P(\sim \text{fair})\] \[\frac{1}{6} \cdot 0.72 \le P(6) \le \frac{1}{6} \cdot 0.92 + 0.28\] \[0.12 \le P(6) \le 0.433...\] We thus see that adding in our meta-probabilistic uncertainty in our estimate for \(P(\text{fair})\) has further widened our uncertainty in the likelihood of rolling a six. This highlights the importance of both accounting for and minimizing any potential sources of uncertainty. We must factor in our confidence in a model in assessing the results it predicts to be likely or unlikely, if we are to use that model to form our epistemic probabilities of the predicted results.

*       \(P(X)=0\) does not mean that X can and will never happen. If you roll a marble, the chance of it landing on any given point is zero (ideally), and yet you will have it land on some point. What \(P(X)=0\) means is specific: it means that the measure of the space in which X holds is zero relative to the measure of \(\Omega\). There may still be a possibility that X happens, just that the region in which X happens is of zero "area", compared to \(\Omega\) (e.g. it is a point, and \(\Omega\) is a line segment). If \(P(X)=0\) we say that X will almost surely not happen, as opposed to \(\varnothing\), which will surely not happen.

Tuesday, March 18, 2014

Language as a Tool for Contemplation and Communication

I. Language Development and Conventions

You awake one day in a new and strange place. You have no memories of how you got there, indeed, few or no memories. All you can tell is that you are having certain experiences: experiences that do not yet make much sense to you. You feel certain instinctive urges. You are confused and scared and you do not understand what is happening.

You are an infant.

We have all gone through something like this in the earliest stages of our lives. Our life begins as a submersion into the new and unfamiliar. What we are forced to do, to make sense of all of our experiences, is something like the following: from our experiences, we form perceptions; from these perceptions, we get impressions; and from these impressions, we form concepts. As an example, from a certain visual experience, I form an image (of a tree), the perception, from this image, I get certain impressions (it has a certain texture and color, it has various dimensions, some parts seem farther away than others, etc.). Finally, from all these impressions, I form the concept (of a tree standing before me).

Once we have concepts, what we need is some way to understand them, some way to handle them so as to get something from them. We thus embark on a project of organization and systematization. What we want, or what would be useful, is a way to categorize, generalize, simplify and synthesize these concepts into something meaningful.

This system is the seed of a language, which we use to handle the concepts we have obtained, and derive new concepts. Were we never introduced to formal, communicative languages (were we, for instance, stranded from birth on a desert island, what philosophers call a lifelong Crusoe), we might possibly develop a rudimentary language for our own use.

Shakespeare, through Juliet, famously said: “What’s in a name? that which we call a rose/ By any other name would smell as sweet”. We can take this to mean that a name is an arbitrary label we attach to a thing. The thing a name refers to is some set of particular set or class of objects, descriptions or designations: whatever its name, the thing remains constant, as the name itself is arbitrary and variable. In some sense, the name is almost abstract.

Every language has a system by which it attaches names to various concepts, and these names constitute the units of the language: the words. We call this system of naming the nominal convention of the language, which comprises a large part, if not most, of it. Let us call the entirety of all the concepts a person knows his concipulary, as contrasted with and similar to a vocabulary. A nominal convention is then a two-way mapping from the concipulary to a vocabulary: this concept of a tree corresponds to the word “tree”, and vice versa. However, not every concept might be mapped to a name, some concepts might be mapped to more than one name, and some names to more than one concept. It is then clear that for a nominal convention to be fully grasped, the user must have a certain concipulary as well, from which and to which to map. This instantiation of the requisite concipulary is one of the important aspects of learning a language, and learning in general.

Using a nominal convention has a great number of advantages and uses. It permits us to identify something and refer back to it: we label it, and can then keep it for later consideration. The name itself can serve as a mnemonic, as with onomatopoetic names. Names also allows us easily to refer to and handle various concepts, consider how they relate to each other, construct new ideas, and, importantly, open up the possibility of communication.

The rest of a language can be loosely broken up into two more components: a syntactical convention and an expressive/social convention. The syntactical convention delineates how the concepts are to be arranged into propositions, including grammar and punctuation. It describes how the various names and designations can be put together into a complete idea. These units of ideas, composed of concepts, are sentences or propositions. The syntactical convention is then the system by which words are formed into sentences, or by which concepts are formed into propositions. Note that a given nominal convention can have multiple syntactic conventions: for example, we can consider standard English and Yoda English (“John is a boy” vs. “A boy John is”).

Lastly, the expressive/social convention encompasses the various ways of expressing or conveying the ideas. This also covers the various social norms expected in communication. For instance, this would include spelling and pronunciation, tone and accent, but also social aspects, such as context-dependent vocabularies, honorifics, euphemisms, etc. Punctuation falls somewhere between this convention and the syntactical convention. Thus, this convention is an important component of language, as it constitutes the system whereby it is applied: it establishes the language as a communal effort. In addition, it establishes the language in the context of a culture, in which language has always played an important role.

Note that the various conventions have different roles in conveying meaning. The nominal convention specifies the referents in the communication, that is, what is being talked about. The syntactical convention specifies the content of the communication, that is, what relation or information is being conveyed. The expressive/social convention specifies how it is being communicated: what the context is and the means by which communication is taking place.

II. Communication

Communication is one of the main purposes of a language: language is the means by which the concepts in one head can appear in another head, so to speak. We will call the “speaking” party the sender and the “listening” party the recipient. There are essentially three components in communication: composition, transmission, and interpretation. Composition involves the translation of concepts by the sender into a message via a language. Transmission involves moving the message from the sender to the recipient. Interpretation involves the recipient translating the message back into concepts using a language. Note that the interpretation we speak of is only first level; it is only the translation of words into concepts and sentences into propositions and ideas, not the full extraction of implications. (For instance, “John washed his hands before lunch” could be interpreted to mean that John takes care to have sterile hands before having a meal, or it could be interpreted to mean that John had dirty hands sometime before lunch, or that John wanted to get something incriminating off his hands before a social occasion, etc. All the interpretation stage pertains to is the literal, first-level meaning of the sentence.) All of these steps are necessary and are found in any type of communication we recognize.

It is thus key that both communicators have a mutual agreement on the nominal, syntactical and expressive/social conventions being used. If the nominal convention is not fixed, the recipient may get a message in characters she recognizes, with a syntax that she can comprehend, but full of words that do not mean anything to her. If the syntactical convention is not fixed, the recipient may get a message with words she understands, in characters she recognizes, but in a structure she can’t understand or is ambiguous (“Tom saw Joe” means something different than “Joe saw Tom”). If the expressive/social convention is not fixed, the recipient may get a message in a way she can’t understand (mispronounced, written in strange characters, using obscure jargon, etc.).

The conventions being unspecified leads, then, mainly to difficulty for the recipient. If, however, the recipient is versatile enough, she may be able to infer some of the conventions (a new word, an unusual sentence structure, a new euphemism or honorific, etc.), but some background knowledge is essential, and a thorough knowledge of all the conventions makes communication much easier. We may here differentiate between coreferential communication and areferential communication. The former is distinguished from the latter by the use of a common reference or experience. For instance, in coreferential communication, the sender can point out something with just a demonstrative (“that thing in the sky” “that sound”), while in areferential communication, whatever is specified must be done so using only language (“the bright, yellow orb in the sky” “the harsh, ringing clang”). One of the goals of language is to move away from coreferential communication toward areferential communication. We will return to this later.

III. Errors and Obstacles to Communication

In attempts to communicate and achieve mutual understanding, many problems can arise, both in language and in communication. In a previous section, we divided communication into three stages: composition, transmission, and interpretation. Errors can occur at every stage, but there are even problems with the idea of communication.

The trouble with communication is that before communication can even take place, there has to be some base understanding of what will be communicated and how. For instance, before I tell someone something in English, there must be an implicit understanding that the communication will be done using spoken English (I can’t walk up to a Russian and begin speaking English, assuming that he will understand). Before I use a word, I have to assume that the word will be understood, and if the recipient does not, she may be able to infer the meaning from context, or else I will have to provide an explanation. This requisite preceding step we will call priming and poses some serious difficulties for communication, as it seems impossible to prime the conversation without communication. Since priming precedes communication, it seems priming must precede itself. As we will discuss later, priming serves as one of the main difficulties of communication. For now, we will assume priming has taken place.

An error in composition is one of the more trivial ones. It amounts to an incompetence on the part of the sender, in which the sender has either a malformed thought or has mistranslated that thought into language. A similar error can occur in interpretation due to incompetent translation of the message into concepts. This error is easily dealt with through diligence and care. We will thus henceforth assume both communicators are both competent once primed.

An error in transmission is simply where the message cannot get from sender to recipient either wholly or in part. It amounts to nothing more than an engineering problem: setting up an adequate system for relaying the message, intact, from sender to recipient. An error in this case would be a downed telegraph line, wax in one’s ear, a crummy postal service: something at least in principle correctable. There may be other, complex cases, but these are mere physical limitations. The problem is either physically irresolvable, which would preclude communication altogether, or it is resolvable, in which case a suitable apparatus would be sufficient.

Lastly, if priming is successful, the only thing that can impede interpretation is competence, and thus we can move straight into priming. Clearly, a proper priming will prevent any misunderstandings in the various conventions. But how can we prime at all? It seems that to prime, we need to communicate something, and to communicate anything, we first need to prime. So communication seems impossible, unless there is some sort of priming that comes standard, and allows further communication to be built up from it. As we shall see, there is indeed such a priming.

IV. Priming

As noted above, priming is the required step preceding any communication of establishing the various conventions that will be used. In effect, priming is the declaration of the language in which the rest of communication will be conducted. But how can we do this at all? I can’t say to someone who doesn’t understand English “this conversation will be conducted in English” as the recipient won’t even understand this establishing message: in order to understand the message, the recipient must already know that we are communicating in English, and, moreover, have an adequate knowledge of the English language.

So how can we prime at all? It seems that to prime, we need to communicate something, and to communicate anything, we first need to prime. Communication from the ground up thus seems impossible, unless there is some sort of priming that comes “built-in”, and allows further communication to be built up from it. As it happens, there is such a priming: it is the sort of fundamental understanding we would have of certain sorts of communication, which we shall call the natural priming. For instance, when someone points to something (or holds up an object, or draws a picture of something) and articulates a sound, we take that to mean that they are referring to the thing by the name given by the sound. Or, if the sender points to a certain sort of action, or imitates it, we can understand the name to refer to the action. The nominal convention is of gestures (gesturing in such and such a way means this), and seems innate: children and even animals seem to have such an understanding. The syntactical convention is merely in connecting the uttered noise or attached characters to the gesture.

The basis of natural priming is repetition and induction. Given that the sender does something repeatedly, the recipient infers that such an action means a certain thing. This is a sort of pavlovian response: the sender repeatedly says “tree” while pointing at a tree, and the recipient learns that, when the sender says “tree”, he is still referring to a physical tree, though he does not point to it. In essence, this is the installation of a certain nominal convention, whereby a meaning (the natural understanding of what is meant by pointing at a tree) is attached to a word (the association of the word/sound “tree” with the pointing to a tree).

Thus, clearly, this sort of communication is wholly coreferential: there must be some shared experience for establishing this sort of understanding. Moreover, this sort of priming is mostly if not only of the nominal convention, and is restricted by particular instances. For example, I can point to an object and say “tree”, but the recipient cannot understand that I mean “this, and everything like it in certain respects, is called ‘tree’,” but only “this particular thing is called ‘tree’.” We can perhaps ameliorate this by giving several different instances and using the same word to describe them all, thereby setting the stage for a generalization, but we cannot directly impart this generalization from examples. The generalization must be made by the recipient. Once a nominal convention basis has been established, the other conventions can be established pretty easily, either by imitation or instruction.

We can also establish a primitive basis, that is, a basis of concepts and their nominal designations derived from experience alone that themselves do not depend on other concepts. From this primitive basis we can construct other concepts and attach other names to them. This is the way language and its understanding develop, and it will feature importantly in our discussion of definitions.

Thus, as far as priming goes, if we have a reliable way to engage in coreferential communication, using nothing but natural priming, we can then build up communication to be quite widely useful. The goal is to be able to communicate areferentially, such that the sender and receiver can communicate with language alone. However, as we will see, even coreferential communication faces some problems.

V. Problems with Coreferential Communication

It seems one would think that this coreferential communication based solely on natural priming would be a sound foundation to begin establishing communication. We convey to the recipient the name of certain objects or actions by gestures, imitation or depiction, and thence to generalizations. This can be used to generate the primitive basis, from which higher levels of complexity and abstraction can be developed, leading to all concepts and all language. Everything seems to fall into place.

However, let us take a specific example: the sender points to a tree, says “tree” and the recipient from this learns that the thing pointed to is called a tree. But all this means is that the concept formed of the thing experienced has the same name according to both, but how can we be sure that both experiences are the same and that both concepts are the same? It certainly seem intuitive to think that they would be, and, lacking any reason to think otherwise, it seems reasonable to think they are, but how can we be sure? This is the classic problem of qualia, the individual instances of subjective experience. Two men can both agree that a thing is green (they both agree to call it that), but how can the one man be sure that the other man’s green is identical to his own? Perhaps what is red through the eyes of one is green through the eyes of the other, though the two both agree to give it the same name. Though it may in theory be possible to transmit or check qualia directly (some sort of device to transmit qualia as-is from one mind to another), it seems that it is a far cry in practice. This difficulty, which rarely causes problems, is thus correctable in theory, though intractable in practice.

But there is also trouble with the fact that each individual must (either by necessity or practicality), or at least often does, make her own generalizations or categorizations. A generalization or categorization is the procedure of taking a group of things all called by the same name and extracting common features so that new things can be included. This entails equating the categorization to a list of necessary and sufficient criteria, whether vague or specific. This set of criteria is used as a definition of what it means to be in that group, and will be very important in our later discussion of definitions.

However, often there is no specific set of criteria that is transmitted or even one that has widespread acceptance. This can allow even an individual’s understanding of a category to be vague: based on unknown or uncertain criteria. This often makes it difficult to discern the boundaries of a category (the necessary or sufficient conditions to qualify for inclusion). An example of this is the famous sorites problem of determining what a heap is. Namely, it asks what is the number of grains in the smallest heap. Other cases are the height of the tallest “short” man, or the hairiness of the hairiest “bald” man. This vagueness also leads to many disputes over border cases, in which the vagueness of common understanding allows for multiple conceptualizations that cover the same typical cases, but differ with respect to bordering cases. This shows why priming is so important, and also why it can be so difficult.

Some coreferential communication can be quite reliable in that we can pretty accurately check some concepts. It seems mathematical or geometric concepts can be generally well checked based on the exactness of what they correspond to and the precision that must hold for the many theorems that result from their application. In particular, numbers seem to be perfectly communicable (or at least nearly so), and thus anything that can be quantified can be perfectly well communicated as well. This plays into the importance of quantification in specificity, exactitude and understanding. But nevertheless there are, it seems quite many qualitative properties that cannot be exactly communicated.

VI. Definitions

As we have been discussing, prior to any communication, a stage of priming must take place, wherein the stage is set for communication. This involves specifying the various conventions (nominal, syntactical, expressive/social) such that the message sent by the sender can be properly understood by the recipient. A very important part of priming is the introduction of definitions, that is, explanations in other terms of what a word or term means, what it will be used to mean, or how it will be used. We will call the term being defined the definiendum, and the collection of other terms used to define it the definiens. The terms used in the definiens we will call subterms (note how even now I am priming the discussion by introducing definitions!). There are various sorts of definitions that we will discuss, but to begin, we will introduce a basic theory of definitions.

For a definitions to be useful, all the subterms in the definiens must be understood: if we know the subterms, we can know the new term, given the definition. Thus, for the definition to be useful, its subterms cannot include the definiendum (as then we would need to know the term before we could understand the term, which is obviously not possible). Sometimes even subterms will need to be defined, and so on. But clearly this cannot go on forever: there must be some point of understanding at which we can stop. If not, then either this presents a vicious infinite regress, wherein to understand a term we first need to understand another term, and so on ad infintum, or becomes circular, wherein we need to understand A to understand B, B to understand C, and C to understand A.

Thus, there must be terms that do not require other terms to be understood. These are called primitives or primitive notions, and serve as the axioms of the definition system (examples typically given are color, being, number, and duration). Typically, however, it is unnecessary to reduce all definitions to the level of primitives, and so we merely must reduce the definitions to the level of acceptability: the point at which the meaning of the terms is mutually accepted as understood. It is possible that two conversers could hold some mutual level of acceptability and then a third comes along with a lower level and definitions would need to be added to get everyone on the same page (this often happens when a new person is introduced into a technical field). In whole, our terminology forms, as it were, a pedigree, with each of the various terms tracing its lineage back to some combination of primitives.

Let us now distinguish between various sorts of definitions. The classic distinction is between intensive and extensive definitions. An intensive definition gives necessary and sufficient conditions: what, categorically, would include something under the term. For instance, we can define a “square” to be “a closed planar figure of four congruent straight line segments and all angles equal”. In contrast, rather than provide the conditions for inclusion and exclusion, an extensive definition gives an exhaustive list of all the things in the specified category. For example, we can define a continent to be any one of the landmasses Africa, Asia, Europe, Antarctica, Australia, North America, or South America. Ideally, an intensive definition is always preferred, as it allows us to understand what it means to be in a category, and is in no way contingent upon extant examples. A third kind of definition is called an ostensive definition which is formed through coreferential communication. Essentially, it is what we have already described: we gesture toward an object and say a name and thereby define that name by that object (or name that object by that word).

We may also distinguish between definitions with different functions. The first is stipulative definition, in which a term is applied to a novel concept as a way to refer to it. This is typical in such fields as mathematics and philosophy (“stipulative definition”, for instance, is here given a stipulative definition). Another sort is a descriptive definition, which attempts to clarify or specify a common term that does not have a robust definition. This sort of definition can be disproven by showing that there is something the definition includes or excludes that the common notion does not. A precising definition is something of an intersection of the two, in which we take a subset of what is commonly considered part of the meaning of a term and specify that we will use the term to mean that subset. For instance, though “ball” can often be used to describe many sorts of round objects (e.g. footballs), we can say that, in our discussion, we will only use them to refer to the spherical variety.

The general form of a definition is a genus with differentia, that is, the class to which it belongs and what distinguishes it within that class. For example, we earlier defined a square as “a closed planar figure of four congruent straight line segments and all angles equal”. In this case, the genus is closed planar figures, and the differentia are that it has four congruent straight sides. We can also define terms as the intersection of genera. For instance, we can define squares as the intersection of regular polygons and quadrilaterals, or as the intersection of rectangles and rhombi, etc. (this is still basically using the genus/differentia approach, as the differens in this case would be the inclusion in the other genus) In this way, we can keep generalizing, using classes that take more and more elements, but we will somewhere need to stop, and this is arguably the point at which basic concepts come in (such as “entity” or “class”, etc.).

In forming definitions, there are several precepts that serve as generally if not universally helpful. As we have discussed, whenever possible, it is best to define terms using simpler or more well-understood subterms. If the subterms are as complex or obscure (or more complex or obscure) than the definiendum, then the definition is clearly counterproductive, and doing so is said to be obscurum per obscuris ([clarifying] the obscure through the obscure). The definition should not be circular, either individually, or globally. The definition should be intensive wherever possible (that is, it should delineate the essence of the definiendum). The definition should be adequate, in the sense of including all the things to which the term applies, and exclude all the things to which the term does not apply. It is typically preferable, where possible, not merely to recast a definition in terms of a synonym or the negation of an antonym.

As communication and discussion hinges on mutual understanding, defining terms is always necessary (unless the terms fall into the “acceptable” category already discussed). If mutual understanding is the goal, definitions are indispensable. Thus, in discussion, we should always be able and willing to define our terms down to the acceptable level if needed, and we should preemptively define new, vague, or unusual terms, and communicate how they will be used. A failure to do so is a failure to prime the communication, and leaves communication gravely at risk.

VII. Problems with Poorly Primed Communication

As we have seen, priming is an essential prerequisite for successful communication. Priming mainly involves, in practice, introducing the requisite definitions (mainly stipulative, descriptive and precising) so that both communicators are on the same page. But too often it happens that priming does not take place, or it is inadequate (or even that it is impossible or impracticable), and this can lead to serious misunderstanding and miscommunication.

As we have mentioned, an important part of priming is the fixing of the nominal convention that will be used, which is mainly done through definitions. Also as we have mentioned, the nominal convention is the two-way mapping between the concipulary (set of concepts) and the vocabulary (set of names). A definition is merely the introduction or specification of the correspondence of one concept to one name and vice versa, that is, a word (this may be a nonstandard usage, in that “word” typically refers to the name with potentially multiple meanings, but I will use it to mean a name-meaning pair. Thus “can” meaning a “cylindrical metal container” is a different word than “can” meaning “be able to”).

Let us examine the case of one word used in communication: the sender has a concept, attaches to it a name, and sends this (via some medium) to the recipient, who takes the name and finds what concept it maps to in her vocabulary. All we will assume is that the two communicators are not using the same nominal convention. Let us call the concept of the sender’s word the intended concept. We can distinguish five cases, not all of which are mutually exclusive: (1) recipient connects the name to the intended concept (2) the recipient connects the name to a different concept (3) the recipient connects the name to no concept (4) the recipient attaches a different name to the intended concept (5) the recipient does not have the intended concept in her concipulary.

Case (1) is clearly a case of successful communication, priming notwithstanding.

Case (2) is a conceptual confusion, in which a name is mistaken to mean something other than the intended concept. This is a language glut: the same name is doing too much. We will call such a case a homonymopathy (a mistaken use of the same name to mean something unintended: the name is a pathological homonym).

Case (3) is a nominal failure, in which the recipient receives what is to her nonsense. Perhaps she then tries to assign it a concept by guessing, in which case it would become (1) or (2) if she guessed rightly or wrongly respectively.

Case (4) is a nominal confusion, in which, had the user used a different name, communication could have been successful. This may easily be combined with the preceding three cases, but the point is that communication was unsuccessful because of a poor choice of words. This is a language gap: the same concept can be reached, but the two users are separated by different names for it (both an Englishman and a Frenchman know what water is, but “water” means nothing to one and “eau” means nothing to the other). We will call such a case a synonymopathy (a mistaken attempt to communicate the a concept by an unknown name: the two names are pathologically synonymous).

Case (5) is merely a case of conceptual failure: priming and education must precede in order to get the recipient to apprehend the intended concept, much less recognize it by name.

We thus see that the five cases amount to one success, two failures and two confusions. The success obviously needs no mending, and the failures are ameliorable, in that they serve to halt communication, not lead it astray. However, even a synonymopathy can halt communication, so the major troublemaker is a homonymopathy. As I will later argue, I think homonymopathy is one of the greatest impediments to successful communication.

As discussed, nominal and conceptual failures can be remedied by definitions and explanations respectively. But what about nominal and conceptual confusions? The mythical case of the tower of Babel in the Old Testament is a clear case of nominal confusion: in order to halt the progress of the construction of a tower, god says “let us go down, and there confound their language, that they may not understand one another's speech.” That is, formerly, they had well-synchronized nominal conventions, but god perturbs these conventions, so that, though they are referring to the same thing (the building of the tower) they cannot understand one another. Thus, we may say that a name for such a case of nominal confusion is a Babel, or a Babel phenomenon. The case of conceptual confusion is somewhat similar, and so we will give it the inventive name of inverse Babel, or an inverse Babel phenomenon. Again, an inverse Babel is a case in which the users use names that one another recognizes, but the meaning to one is different than the meaning to another. For example, both an Englishman and Frenchman recognize the word “pain”, but the meaning to one is not the meaning to another: if one said “pain” to another, the second might give the first a baguette when he wanted an aspirin.

Before we address solutions to each, it is important to identify how we can recognize each sort of confusion. A Babel can be identified if two people point to the same things and say something different, or if the explanation for the term in other language is the same “here, we call that…”. Another, simpler way is if we have some background knowledge: if you are an architect and you go to a painting conference in a foreign country, you will expect the people there to be discussing architecture, even if you don’t know what their language means when you encounter them. Much more simply, if you believe the sender is competent, but can’t understand what he is sending you, there is probably a language gap (though there may also be a conceptual failure). The ways to correct for this are well-studied and well-documented (it basically involves developing a method of translation, or choosing a standard nominal convention to use). It is possible that the nominal conventions are so dissimilar that coreferential priming is required.

An inverse Babel is much trickier to discern. Typically the sender will send you something but the way he uses a term will be jarring or strange. This can be verified by asking “what do you mean?” and seeing if his reply matches your understanding. The lower down in the conceptual hierarchy the homonymopathy lies, the more difficult it is to remedy: this was raised in the section on problems with priming and coreferential priming. It may be the case, as with a vague or contested concept, that his concept and yours will be similar, but nevertheless, you will be using the same word to mean different things. Sometimes a loose, imperfect definition will do, as we have discussed, but one should not expect these vague conceptions to overlap everywhere. In general, once a homonymopathy has been identified, all that is needed is to assign each concept a different, mutually agreed-upon name. If your conception of a “pile” is different than mine, we need not call them by the same name.

We can here identify a sort of fallacy which I will call the nominal fallacy (note that there is already a fallacy of this name but of different meaning). One commits a nominal fallacy when one thinks that two words must mean different things because they have different names (synonymopathy) or two words must mean the same thing because they have the same name (homonymopathy). The fallacy can be easily demonstrated by showing that “Little” means the same thing as “small” and “bow (archery)” has the same name as “bow (knot)”. Though this fallacy can be easily addressed, it can also be quite tenacious, especially when the difference in meaning is slight or when personal stock is put in the names.

Priming is a vital and important step which cannot be foregone. Failing to prime adequately leads to such problems as synonymopathies and homonymopathies, in which the communicators fail to use language the other understands, or worse, uses language the other thinks he understands. Avoiding these takes care but whatever the cost in preventing them greatly outweighs the hazards of ignoring them. In the final section, we will look at the relevance and applications of the preceding discussion.

VIII. Relevance and Applications to Discussion

Discussion and dialogue depend upon communication. Communication, as we have been discussing, depends crucially on having both communicators first agreed on the conventions that will be used, particularly the nominal convention. If communicators fail to prime the conversation by establishing the requisite conventions, the discussion can devolve into synonymopathies and homonymopathies, such that the communicators are merely speaking past one another. In many areas, these can manifest as major controversies.

Armed with our previous discussion, we can distinguish two different sorts of disputes that arise: terminological disputes: disputes over the meaning and use of terminology; and factual disputes: disputes over what is the case, as described by the agreed-upon terminology (some may add to this other disputes, such as normative disputes). Clearly any (or at least most) terminological dispute can be dealt with through proper priming and clarification, or at least any dispute arising.

Philosophy is the study of ideas generally, though even this definition is not uncontroversial. It is known more for providing questions and perspectives than hard answers, for broadening perspectives rather than narrowing them. Indeed, there is just about no area of philosophy that is not at least somewhat controversial. Philosophy is also known as being home to many disputes which have persisted since antiquity. I hypothesize, however, that many of these disputes are terminological, perhaps even a majority, or at least arise therefrom.

Vagueness is a common source of terminological dispute. Two people have two different criteria for what qualifies for inclusion in a certain category, with the vast majority of cases mutually agreed upon. However, certain marginal cases are included in one but excluded in another. Each thinks he is correct, and in a sense both are: they are just speaking about different things—things that are very similar, and yet subtly different. There is no shortage of vague or essentially contested concepts in philosophy: knowledge, self, consciousness, explanation, cause, abstract, existence, good, obligation, nature, reason, beauty, purpose, essence, time, meaning, experience, person, to name a few. Vagueness also breeds uncertainty, which can be loosely characterized as a dispute with oneself: part of you thinks it should be interpreted one way, another part thinks it should be another way. If we are to resolve both disputes and uncertainty, we must find a way to eliminate or at least reduce vagueness.

We can thus suggest ways in which to present and defend an argument. Before anything, the first step must be to define the relevant terms initially, or else be careful to define them as they come up. Definitions cannot in practice be perfect, but one must endeavor to make them as clear and complete as necessary and possible. If one is presenting an argument, one define to acceptability all important terms, making sure to limit the senses of vague terms especially. If one is rebutting an argument, one must examine the possible senses in which the terms of one’s opponent’s argument can be taken. If there are multiple, interacting terms, for charity, one must examine the various combinations of meanings and address each. This is especially important in order to avoid and demonstrate equivocation: one must be sure that the terms are being used in the same way throughout the argument, or else that the various meanings all conspire to reach the same conclusion. If one is merely engaged in a less argumentative discussion, one must be sure to consider, evaluate and decide on the terms before the discussion has proceeded too far. It is no use discussing something for hours only to find out that the entire point could have been avoided had you recognized an inverse Babel early on.

This is perhaps the most important point: fixing definitions is an indispensable part of any discussion. Sometimes, an entire discussion can be wholly or mostly forgone if the definitions are made clear at the outset. I am convinced that a significant part of current philosophical discussion results from little more than vagueness and homonymopathy. It may be just as difficult to get another to understand what you mean as it is to get another to agree with you. What is most needed is clarity and mutual understanding: conversation must always precede discussion.

IX. Glossary

Lifelong Crusoe
    A person who, for his whole life, has never come into contact with others or society.
    The set of names in a certain language an individual can recognize as corresponding to certain concepts or which an individual can use to convey certain concepts.
    The set of concepts an individual understands, can recognize as corresponding to certain names in a language, or can convey using names in a certain language.
Nominal Convention
    An agreed upon system of attaching names to various concepts (and concepts to certain names) which comprises one of the main components of a language.
Syntactical Convention
    An agreed upon system of forming sentences from words, that is, forming propositions or ideas from concepts. This typically involves a grammar, punctuation, syntax and various constructions.
Expressive/Social Convention
    An agreed upon system of expressing ideas, typically including situation- dependent vocabulary (e.g. honorifics, jargon, complexity level) methods of communication (e.g. spelling, script, diction, enunciation, accent, intonation).
Steps of Communication:
    The step of communication that involves the sender translating the idea to be communicated into a message via a language.
    The step of communication that involves moving the message from sender to recipient.
    The step of communication that involves the recipient translating the received message into ideas via a language.
Coreferential Communication
    Communication in which there is a common reference or experience that both communicators can directly experience.
Areferential Communication
    Communication in which there is no common reference or experience that both communicators can directly experience.
    The necessary prerequisite for successful communication that involves establishing the various conventions that will be used.
Natural Priming
    Priming one has naturally that allows one to infer meaning by gestures, repetition or imitation.
Primitive Basis
    The fundamental concepts which themselves cannot be formed or described using more basic concepts, generally derived from raw experiences or even a priori.
    A definition that specifies the necessary or sufficient conditions for qualifying for inclusion.
    A definition that gives an exhaustive listing of all things that qualify for inclusion.
    A definition in which, by pointing to a certain thing or set of things (either figuratively or literally), one either gives instances which can be generalized or determines what qualifies for inclusion.
    A definition in which one specifies by assertion what a term means or will be used to mean.
    A definition in which one attempts to give a meaning to a term that reflects how it is commonly understood.
    A definition in which one narrows the scope of a term from how it is commonly understood to reflect how it will be used.
    The word or term to be defined.
    Whatever is used to define a word or a term, typically composed of a set of composed subterms.
Primitive (notion)
    Terms that do not require other terms to be understood that function as axioms of conceptual understanding.
    The super-category into which the thing being defined falls.
    The property that sets apart the thing being defined from other things in its genus.
    One of the terms in the definition of a term.
Obscurum per Obscuris
    Defining a term using more difficult to understand subterms.
    A failure of communication in which a term with an intended meaning is mistaken to mean something else instead.
    A failure of communication in which a term is not understood to have the intended meaning, but a different term could have been used to achieve an understanding of the intended meaning.
Babel (phenomenon)
    A case in which communicators are trying to communicate the same meaning using different terms or languages.
Inverse Babel (phenomenon)
    A case in which the same term is interpreted to mean different things by different communicators, or in which the same term is being used in different ways.
Nominal Fallacy
    Any of the following fallacious lines of reasoning (X and Y being words or terms):
    1. X and Y have different names.
    2. Therefore X and Y refer to different things.
    1. X and Y refer to different things.
    2. Therefore X and Y have different names.
    1. X and Y have the same name.
    2. Therefore X and Y refer to the same thing.
    1. X and Y refer to the same thing.
    2. Therefore X and Y have the same name.
    (Not to be confused with a different fallacy of the same name).