## Monday, April 14, 2014

### Probability

Probability is a concept that is intuitively fairly easy to understand, yet difficult to give a comprehensive, universally acceptable interpretation. In general, probabilities are given with respect to events or propositions and give a way of quantitatively answering such questions as “How certain are we that X will occur/is true?”. Probabilities range from 0, (almost*) certain not to happen/be true, to 1, (almost*) certain to happen/be true.

There is division as to whether probabilities are objective facts or only subjective. Some say the probability of an event is a measure of the propensity of a certain situation to yield a certain outcome, while others say that the probability of an event is the relative frequency of that event in the limit of a large number of relevantly identical cases, or trials. Those who say it is subjective give, for instance, the conception that the probability of an event can be defined as “the price (in arbitrary units) at which you would buy or sell a bet that paid 1 unit if the event occurred and 0 if it did not occur”.

One way to circumvent all of these is to leave probability somewhat vague and give it a thorough mathematical basis. This can readily be done. We will deal with the probability of some event which will occur as a result of some experiment in individual trials. What is needed for a probabilistic model are two things:

• The sample space $\Omega$: the set of all possible outcomes of the experiment.
• The probability law $P$. This is a function that takes a subset of the sample space and returns a real number. This law, to qualify as a proper probability law, must satisfy three conditions. Let $A$ be some subset of $\Omega$.

1. Non-negativity: For any set $A$ subset of $\Omega$, $P(A) \geq 0$.
2. Countable Additivity: Let $A_{1}, A_{2}, ...$ be a countable sequence of mutually disjoint sets (that is, no element in one set is in any other set), each a subset of $\Omega$. Then $P(A_{1} \cup A_{2} \cup ...)=P(A_{1})+P(A_{2})+...$.
3. Normalization: the probability of some event in the space is unity, that is $P(\Omega)=1$.

If the model satisfies these conditions, it is at least admissible, though typically we have other considerations that help us choose a model, such as simplicity. These conditions imply that the empty set, that is, the set containing no elements, has probability zero.
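
As a minimal sketch of such a model, assume the experiment is one roll of a fair six-sided die. The names `omega`, `law`, and `prob` are illustrative choices, not part of the discussion above. For a finite sample space the three conditions can be checked directly:

```python
from fractions import Fraction

# Assumed example: a fair six-sided die.
omega = frozenset(range(1, 7))                         # sample space
law = {outcome: Fraction(1, 6) for outcome in omega}   # weight of each outcome

def prob(event):
    """P(event) for an event given as a subset of omega."""
    event = frozenset(event)
    if not event <= omega:
        raise ValueError("event must be a subset of the sample space")
    return sum((law[o] for o in event), Fraction(0))

# Non-negativity, checked on every single-outcome event.
assert all(prob({o}) >= 0 for o in omega)
# Additivity for two disjoint events.
assert prob({1, 2} | {5}) == prob({1, 2}) + prob({5})
# Normalization.
assert prob(omega) == 1
# Consequence of the conditions: the empty set has probability zero.
assert prob(set()) == 0
```

Exact fractions are used so that the equalities hold without floating-point rounding.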

Very typical in probability theory is the use of set-theoretic or logical-operator notation. While notation varies, the fundamental concepts remain consistent. When we want the probability that events $A$ and $B$ will both happen (e.g. a die lands on an even number and on a number above three), we ask for the probability of their conjunction, represented as $P(A \cap B)$ or $P(A \& B)$ or $P(A \wedge B)$. When we want the probability that at least one event of the events $A$ and $B$ will happen (e.g. a die lands on an even number or on a number above three, or both) we ask for the probability of their disjunction, represented as $P(A \cup B)$ or $P(A \vee B)$. When we want the probability that some event will not happen (e.g. a die does not land on an even number), we ask for the probability of the complement of the event, represented $P(\sim A)$ or $P(\bar{A})$ or $P(A^{c})$ or $P(\neg A)$. The empty set is symbolized as $\varnothing$ and represents the set with no elements. Thus, taking the union of $\varnothing$ with any other set gives the latter set, and taking the intersection yields the empty set. In addition, we can say $\varnothing=\sim \Omega$ and $\sim \varnothing= \Omega$. Lastly, a partition of set $C$ is a countable sequence of sets such that no two sets in the partition share an element (the sets are mutually exclusive) and every element in $C$ is in some set (the collection is collectively exhaustive).
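
The die examples above can be written out with ordinary set operations. This is only a sketch assuming a fair die with equally likely faces; the helper `prob` is a hypothetical name introduced here.

```python
from fractions import Fraction

omega = set(range(1, 7))   # fair die, assumed for illustration
A = {2, 4, 6}              # the die lands on an even number
B = {4, 5, 6}              # the die lands on a number above three

def prob(event):
    """Uniform probability of an event (a subset of omega)."""
    return Fraction(len(event & omega), len(omega))

p_and = prob(A & B)          # conjunction: outcomes {4, 6}
p_or = prob(A | B)           # disjunction: outcomes {2, 4, 5, 6}
p_not_a = prob(omega - A)    # complement: outcomes {1, 3, 5}

# The empty set behaves as described: union leaves A unchanged,
# intersection gives the empty set.
assert A | set() == A and A & set() == set()
```

Here the conjunction has probability 1/3, the disjunction 2/3, and the complement 1/2.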

Also important in the study of probability is the concept of conditional probability. This is the measure of probability based on some information: assuming that something is the case, what is the chance that some event will occur? For instance, we could ask what the chance is that a die landed six, given that it landed on an even number. While a more thorough discussion of conditional probability can be found elsewhere, we will here merely give the formula. $P(A|B)$, the probability that $A$ will occur given $B$ (read “the probability of $A$ given $B$” or “the probability of $A$ on $B$”), is given by the expression $P(A|B)=\frac{P(A \cap B)}{P(B)}$ whenever $P(B) \neq 0$. Sometimes it is possible to assign a meaningful value to $P(A|B)$ when $P(B)=0$. For instance, suppose we ask “what is the probability that a homogeneous, spherical marble, when rolled, will land on point A, given that it landed either on point A or point B?” The answer then seems clearly to be 0.5. A good interpretation of the conditional is a change in the sample space: when we condition on $B$, we are changing the sample space from $\Omega$ to $B$. We find that all the axioms and theorems are consistent with this view.

We can also mention here the notion of independence. Two events $A$ and $B$ are independent iff $P(A \cap B)=P(A) P(B)$. When $P(A)$ and $P(B)$ are nonzero, this implies that $P(A|B)=P(A)$ and $P(B|A)=P(B)$. This means that, given the one, we gain no information about the other: it remains just as probable.
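
A short sketch of the ratio formula and the independence test, again assuming a fair die. The function names `cond_prob` and `independent` and the event `low` are hypothetical, chosen only for this illustration.

```python
from fractions import Fraction

omega = set(range(1, 7))   # fair die, assumed for illustration

def prob(event):
    """Uniform probability of an event (a subset of omega)."""
    return Fraction(len(event & omega), len(omega))

def cond_prob(a, b):
    """P(a | b) = P(a ∩ b) / P(b); the ratio formula needs P(b) != 0."""
    if prob(b) == 0:
        raise ZeroDivisionError("P(b) = 0: the ratio formula does not apply")
    return prob(a & b) / prob(b)

def independent(a, b):
    """True when P(a ∩ b) = P(a) P(b)."""
    return prob(a & b) == prob(a) * prob(b)

six, even, low = {6}, {2, 4, 6}, {1, 2, 3, 4}

p_six_given_even = cond_prob(six, even)   # conditioning shrinks omega to `even`
even_low_independent = independent(even, low)
even_six_independent = independent(even, six)
```

Under these assumptions the chance of a six given an even roll is 1/3. "Even" and "at most four" come out independent, while "even" and "six" do not.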

While the probability of events is relatively easy to understand, the probability of propositions is not as easy, as propositions can have only two values: true and false. How is it that we can say “The probability that you are female is 51%” when you are either definitely male or definitely female? This is where the notion of epistemic probability comes into play. Epistemic probability has to do with how likely something seems to us, or some other (rational) person, given some set of background information. For instance, in some murder, given that we see Joe’s fingerprints on the murder weapon, we deem it likely that Joe committed the murder. Though it is very difficult to give a good account, a rough way to quantify it would be in the following sense:
$X$ has epistemic probability $p$ given background information $B$ (i.e. $P(X|B)=p$) iff the following is true: supposing we encountered this set of information $B$ in many scenarios, we would expect $X$ to be true in fraction $p$ of those scenarios.
Again, this may not be a perfect analysis, but it does give a rough way to understand it. However, we must note that epistemic probability is of a significantly inferior sort to, say, experimental probability (observing that $X$ happens in fraction $p$ of experimental cases), or even a good theoretical probability (theory predicts that a homogeneous cube of material will, when haphazardly tossed, land on any given face with equal probability). There is a principle called the principle of indifference that says we should assign equal epistemic probabilities to two events or propositions when we have no justification to prefer one to the other. That may be a good principle as far as epistemic probability goes, but it is very deeply restricted by background information (clearly: lacking any background information to prefer one possibility to another, we are to assign them equal probabilities), and at least somewhat subjective. It is thus greatly limited by what we know: in fact, what we think is a possibility, based on our background information, may not be a possibility at all (it could be what is called a merely epistemic possibility). Thus, while epistemic probability may be the best we can do, given our background information, it may not be very good at all.

Statistical probability is of the epistemic sort: suppose that fraction p of population S has property X. We then come across a member M of S. Suppose we have no way to tell immediately whether M has property X, but we know M comes from S. We therefore say that M has property X with (epistemic) probability p. This is a statistical probability: based on facts about the population, we deduce a probability as regards a given individual, even though, if we had more information, we could say that M had X with probability either zero or one. This is to be contrasted with what we might call stochastic probability. If we have a perfect coin, and flip it fairly, then before we do so there is no information anywhere, even possibly, as to what its outcome will be. We don't know what will happen when we flip it, not because we aren't privy to some information, but because there is no information to be had. This will be the case with any genuinely indeterministic event. We might illustrate the difference between statistical and stochastic probabilities as that between a coin that was flipped but is hidden from view and a coin yet to be flipped, respectively. Most physicists believe many quantum processes are genuinely stochastic, and some philosophers believe free will is also stochastic in some sense. On that view, "You will probably choose X" does not mean that, based on what I know now, there is a fairly high epistemic probability that you will choose X, though if I knew more (e.g. that you choose X most of the time in certain circumstances) I could predict with certainty whether you will choose X or not. Instead, it means that you are more disposed to choose X.

We will here give a few theorems of probability theory. We will try to present them such that their derivation is clear, but if not, then any introductory text on probability theory can give a more thorough exposition. $A$ and $B$ are some subsets of $\Omega$:
$P(\Omega)=1;\;\;\;\ P(\varnothing)=0$
$P(A \cap \Omega)=P(A);\;\;\;\ P(A \cap \varnothing)=P(\varnothing)=0$
$P(A \cup \Omega)=P(\Omega)=1;\;\;\;\ P(A \cup \varnothing)=P(A)$
$0 \leq P(A) \leq 1$
$0 \leq P(A|B) \leq 1$
$P(A \cup \sim A)=P(A)+P(\sim A)=P(\Omega)=1$
$P(A \cup B)+P(A \cap B)=P(A)+P(B)$
$P(A \cap B)\leq \min(P(A),P(B))$
$P(A\cup B) \leq P(A)+P(B)$
$P(A \cup B) \geq \max(P(A),P(B))$
$P(A \cap B)+P(A \cap \sim B)=P(A)$
$P(A \cap B)=P(A)P(B|A)$
$P(A|B)=\frac{P(B|A)P(A)}{P(B)}$
$\frac{P(A|B)}{P(A)}=\frac{P(B|A)}{P(B)}$

Let $B_{1},B_{2},\dots$ be a partition of $C$, then:
$P(A \cap B_{1})+P(A \cap B_{2})+\dots=P(A \cap C)$
$P(A \cap C)=P(A|B_{1})P(B_{1})+ P(A|B_{2})P(B_{2})+\dots$
$P(C|A)=P(B_{1}|A)+ P(B_{2}|A)+\dots$
Particularly, $1=P(B|A)+P(\sim B|A)$

#### De Morgan’s Laws

$P(\sim A \cap \sim B)+P(A \cup B)=1$
$P(\sim A \cup \sim B)+P(A \cap B)=1$

#### Bayes’ theorem

Let $H_{1}, H_{2}, \dots$ be a partition of $\Omega$. Then:
$P(H_{m}|E)=\frac{P(H_{m}) P(E|H_{m})}{P(H_{1})P(E|H_{1})+ P(H_{2})P(E|H_{2})+\dots}$
This is typically applied to choosing a hypothesis to explain a certain fact, or given a certain set of evidence. $P(E|H_{m})$ is the (epistemic) probability that we would get evidence $E$ supposing hypothesis $H_{m}$ is true, and $P(H_{m}|E)$ is the (epistemic) probability that hypothesis $H_{m}$ is true, given evidence $E$. Thus, hypothesis $H_{m}$ becomes more likely on evidence $E$ the more probable it is without the evidence, the more likely the evidence would be on that hypothesis, the less likely the evidence would be on alternate hypotheses, and the less likely the alternate hypotheses are without the evidence.
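
The formula can be sketched as a small function. The name `posterior` and the numbers in the example are hypothetical; they are not drawn from any real data.

```python
def posterior(priors, likelihoods):
    """Return P(H_m | E) for every hypothesis H_m in a partition.

    priors[m]      = P(H_m), assumed to sum to 1
    likelihoods[m] = P(E | H_m)
    """
    joint = [p * l for p, l in zip(priors, likelihoods)]   # P(H_m) P(E | H_m)
    evidence = sum(joint)                                   # P(E), the denominator
    if evidence == 0:
        raise ValueError("P(E) = 0: the evidence is impossible on every hypothesis")
    return [j / evidence for j in joint]

# Hypothetical numbers: three competing hypotheses.
priors = [0.5, 0.3, 0.2]
likelihoods = [0.1, 0.4, 0.7]
post = posterior(priors, likelihoods)
```

With these made-up inputs the third hypothesis, which starts as the least probable, ends up the most probable, because the evidence is much more likely on it than on the alternatives.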

#### Probability of a Union

We can here give a useful formula for determining the probability of the union of events, which we can deduce from De Morgan’s laws. Suppose we want to find the probability of the union of some events, $Q=P(A_{1} \cup A_{2} \cup \dots)$.
We formally expand the expression $Q'=1-\prod_{n}(1-A_{n})$, treating each $A_{n}$ as an algebraic symbol.
We then replace every product $A_{m_{1}}A_{m_{2}}\dots$ with $P(A_{m_{1}} \cap A_{m_{2}} \cap \dots)$. For instance, to find $P(A \cup B \cup C)$, we take
$1-(1-A)(1-B)(1-C)=A+B+C-AB-AC-BC+ABC$
We then make replacements as described to get
$P(A \cup B \cup C)=P(A)+P(B)+P(C)-P(A \cap B) -P(A \cap C) -P(B \cap C)+ P(A \cap B \cap C)$
If the events are all independent, we can simplify the formula to:
$Q=1-\prod \nolimits_{n}(1-P(A_{n}))$
If $P(A_{m})=p$ for all $m$, we can further simplify:
$Q=1-(1-p)^{N}$
where $N$ is the number of events. For $p$ fairly small, we can approximate this as
$Q \approx 1-e^{-pN}$
and from this we can rearrange to get
$N \approx \frac{-\ln(1-Q)}{p}$
This gives the number of independent trials necessary to get a probability $Q$ of at least one success, if the probability of success in each trial is $p$.
As an application, we can ask: what is the probability that an event will happen on a given day if it has a 50% probability of happening in a year? Treating each day as an independent trial with the same probability, we want to solve for $p$ given $Q=0.5$ and $N=365$. We find that $p \approx \frac{-\ln(1-Q)}{N} \approx 0.19\%$.
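
The union formula and the per-day estimate can both be checked numerically. The following is a sketch under stated assumptions: a fair die supplies the example events, and `union_prob`, `at_least_once`, and `per_trial_prob` are hypothetical helper names.

```python
from fractions import Fraction
from itertools import combinations
from math import log

omega = set(range(1, 7))   # fair die, assumed for illustration

def prob(event):
    """Uniform probability of an event (a subset of omega)."""
    return Fraction(len(event & omega), len(omega))

def union_prob(events):
    """P(A_1 ∪ A_2 ∪ ...) as the alternating sum over intersections."""
    total = Fraction(0)
    for k in range(1, len(events) + 1):
        sign = 1 if k % 2 else -1          # +singles, -pairs, +triples, ...
        for group in combinations(events, k):
            total += sign * prob(set.intersection(*group))
    return total

events = [{2, 4, 6}, {4, 5, 6}, {1, 2}]
# The expansion agrees with computing the union directly.
assert union_prob(events) == prob(set.union(*events))

def at_least_once(p, n):
    """Q = 1 - (1 - p)^n for n independent trials."""
    return 1 - (1 - p) ** n

def per_trial_prob(q, n):
    """Exact p that solves 1 - (1 - p)^n = q."""
    return 1 - (1 - q) ** (1 / n)

p_exact = per_trial_prob(0.5, 365)
p_approx = -log(1 - 0.5) / 365           # small-p approximation
print(f"exact: {p_exact:.4%}, approximation: {p_approx:.4%}")
```

The exact and approximate per-day values agree to about two significant figures, which is why the exponential approximation is adequate here.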

Using this, we can also show that improbable events are likely in a collection of many trials. Suppose we have $N$ trials, in each of which X happens with probability $p$. The probability that X never happens is then $(1-p)^{N} \approx e^{-pN}$. We thus see that, as $N$ increases, the probability of no X occurring tends to zero; in fact, it tends to zero exponentially. Thus, given enough trials we would expect to see the individually improbable: long strings of all heads while flipping a coin, the same person winning the lottery multiple times, someone having two unrelated rare diseases, etc. Coincidences will always crop up given enough opportunities. These coincidences combined with confirmation bias (remembering the hits and forgetting the misses) result in muddled thinking. A coincidence happens and is interpreted as a sign from on high, while the hundreds of other times no coincidence happened are ignored. It is important to remember that coincidences are basically inevitable in large enough samples: if something has a one in a billion chance of happening to any given person on any given day, then with roughly seven billion people in the world we can expect it to happen about seven times per day.
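
A quick numerical sketch of the one-in-a-billion example. The population figure of seven billion is an assumption made for the illustration, and the person-days are treated as independent trials.

```python
from math import exp

def prob_never(p, n):
    """Probability that an event of per-trial probability p never occurs in n trials."""
    return (1 - p) ** n

p = 1e-9            # one-in-a-billion chance per person per day
population = 7e9    # assumption: rough world population

expected_per_day = p * population                 # expected number of occurrences
p_at_least_one = 1 - prob_never(p, population)    # chance it happens to someone today
p_at_least_one_approx = 1 - exp(-p * population)  # exponential approximation
```

Under these assumptions the event is expected about seven times a day, and the chance that it happens to nobody at all on a given day is below one in a thousand.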

#### Implication and Conditional Probability

We can also prove the following interesting theorem. Note that “if $A$, then $B$” or “$A\rightarrow B$” is logically equivalent to “$\sim A$ or $B$” or “$\sim A \cup B$”. Thus $P(A \rightarrow B)=P(\sim A \cup B)$. We then have
$1.\;\; 1 \geq P(A)$
$2.\;\; P(\sim B|A) \geq P(\sim B|A)P(A)=P(\sim B \cap A)$
$3.\;\; 1-P(\sim B|A) \leq 1-P(A \cap \sim B)$
$4.\;\; P(B|A) \leq P(\sim A \cup B)=P(A \rightarrow B)$
That is, the probability of “if $A$ then $B$” is not less than that of “$B$, given $A$”.
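
The inequality can also be checked by brute force on a small model. This sketch assumes a fair die and tests every pair of events on it; it illustrates the theorem and is no substitute for the proof.

```python
from fractions import Fraction
from itertools import combinations

omega = frozenset(range(1, 7))   # fair die, assumed for illustration

def prob(event):
    """Uniform probability of an event (a subset of omega)."""
    return Fraction(len(event), len(omega))

def all_events():
    """Yield every subset of omega."""
    items = sorted(omega)
    for k in range(len(items) + 1):
        for combo in combinations(items, k):
            yield frozenset(combo)

violations = 0
for a in all_events():
    if not a:
        continue                               # P(B|A) is undefined when P(A) = 0
    for b in all_events():
        p_b_given_a = prob(a & b) / prob(a)    # P(B | A)
        p_implies = prob((omega - a) | b)      # P(~A ∪ B) = P(A -> B)
        if p_b_given_a > p_implies:
            violations += 1
```

No pair of events on the die violates $P(B|A) \leq P(A \rightarrow B)$.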

We can also note that, as $P(A)=P(A|B)P(B)+P(A|\sim B)(1-P(B))$, then $\min(P(A|B),P(A|\sim B)) \leq P(A) \leq \max(P(A|B),P(A|\sim B))$

#### Conditional Changes in Probability and How it Relates to Evidence

We can demonstrate the following:
Suppose $P(A|B)>P(A)$. Then $P(A \cap B)>P(A)P(B)$ and $P(B|A)>P(B)$. In fact, all three are equivalent. In that case:
$1.\;\; P(A \cap B)> P(A)P(B)$
$2.\;\; P(A) - P(A \cap B)< P(A)- P(A)P(B)=P(A)(1-P(B))$
$3.\;\; P(A \cap \sim B) < P(A)P(\sim B)$
$4.\;\; P(A | \sim B) < P(A)$
The reverse case, in which $P(A|B)<P(A)$ implies $P(A|\sim B)>P(A)$, can be proved in the same way with the inequalities flipped.
In English: "if A is more probable on B, B is more probable on A" and "if A is more probable on B, A is less probable on not-B".
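
As an illustrative check, the equivalences can be tested on every pair of events of a small model. The sketch assumes a fair die and skips the cases where a conditional probability would be undefined.

```python
from fractions import Fraction
from itertools import combinations

omega = frozenset(range(1, 7))   # fair die, assumed for illustration

def prob(event):
    """Uniform probability of an event (a subset of omega)."""
    return Fraction(len(event), len(omega))

def all_events():
    """Yield every subset of omega."""
    items = sorted(omega)
    for k in range(len(items) + 1):
        for combo in combinations(items, k):
            yield frozenset(combo)

checked = 0
for a in all_events():
    for b in all_events():
        # Need P(A) > 0 and 0 < P(B) < 1 for all three conditionals to exist.
        if prob(a) == 0 or prob(b) in (0, 1):
            continue
        not_b = omega - b
        a_up_on_b = prob(a & b) / prob(b) > prob(a)                  # P(A|B) > P(A)
        b_up_on_a = prob(a & b) / prob(a) > prob(b)                  # P(B|A) > P(B)
        a_down_on_not_b = prob(a & not_b) / prob(not_b) < prob(a)    # P(A|~B) < P(A)
        assert a_up_on_b == b_up_on_a == a_down_on_not_b
        checked += 1
```

Every admissible pair gives the same truth value for all three statements, as the theorem requires.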

An important consequence of this theorem is in discerning what counts as evidence. In a loose sense, we can say that $A$ provides evidence for $B$ if $P(B|A)>P(B)$. We thus see that $A$ provides evidence for $B$ if and only if $\sim A$ provides evidence against $B$. Thus, if we do some experiment to test a claim, we must be willing to accept failure as evidence against the claim if we would be willing to accept success as evidence for the claim, and vice versa. We must be willing to accept the possibility of weakening the claim if we are willing to accept the possibility of strengthening it by some test. It is often said that "absence of evidence is not evidence of absence", but this needs some qualification. Suppose we want to test the claim that there is life on Mars. We then do some test, like looking at a sample of Martian soil under a microscope, and it comes up negative: is that evidence against life on Mars? Certainly, albeit very weak evidence. If we had found microbes in that sample, we would certainly have said that was evidence for life on Mars, therefore we must necessarily admit that the lack of microbes is evidence against life on Mars. It may only reduce the (epistemic) probability that there is life on Mars by something like a millionth of a percent, but if we do a million tests, that amounts to about a whole percent. If we do a hundred million tests, that amounts to over 60%.

In short, absence of evidence does count as evidence of absence in any and every instance where a presence of evidence would count as evidence of presence.
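
The arithmetic of the Mars example can be reproduced with a short sketch. It rests on an assumption that is an interpretation rather than something stated above: each negative test independently multiplies the probability by $(1-r)$, with $r$ equal to one millionth of a percent. The name `cumulative_reduction` is hypothetical.

```python
def cumulative_reduction(per_test, n_tests):
    """Fractional reduction in probability after n_tests negative results,
    assuming each one multiplies the probability by (1 - per_test)."""
    return 1 - (1 - per_test) ** n_tests

per_test = 1e-8   # a millionth of a percent

after_million = cumulative_reduction(per_test, 10**6)
after_hundred_million = cumulative_reduction(per_test, 10**8)
print(f"{after_million:.2%} after a million tests")
print(f"{after_hundred_million:.2%} after a hundred million tests")
```

On that assumption the reduction is close to one percent after a million tests and above sixty percent after a hundred million.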

#### "Extraordinary Claims Require Extraordinary Evidence"

This phrase is declared nearly as often as it is denounced. However, it is clearly not specific enough to be definitively evaluated. One way of interpreting it is to say "Initially improbable hypotheses require improbable evidence to make them probable". This formulation is relatively easy to demonstrate as being true: $P(E)=\frac{P(H \cap E)}{P(H|E)} \leq \frac{P(H)}{P(H|E)}$ For example, if $P(H)=1 \%$ and $P(H|E)=75 \%$ then $P(E) \leq 1.33 \%$.
If $P(H|E) \geq 0.5$, then $P(E) \leq 2 P(H)$. Thus, it is clear that the evidence required to make an initially improbable hypothesis probable must be comparably improbable.
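
The bound is simple enough to compute directly. In this sketch the function name `max_evidence_prob` is hypothetical, and the inputs are the numbers from the example above.

```python
def max_evidence_prob(prior, posterior):
    """Upper bound on P(E) implied by P(E) <= P(H) / P(H|E)."""
    if not 0 < posterior <= 1:
        raise ValueError("posterior must lie in (0, 1]")
    return min(1.0, prior / posterior)   # a probability can never exceed 1

bound = max_evidence_prob(0.01, 0.75)         # P(H) = 1%, P(H|E) = 75%
bound_at_half = max_evidence_prob(0.01, 0.5)  # equals 2 * P(H)
```

The lower the prior and the higher the posterior we want to reach, the smaller the bound on how probable the evidence itself can be.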

#### Inscrutable Probabilities, Meta-probabilities and Widening Epistemic Probability

Sometimes we cannot estimate a probability, either at all or to an adequate degree. We call such probabilities inscrutable. For all we know, these probabilities could have any value. We can use the concept of inscrutable probabilities to improve the descriptive accuracy of our epistemic probability judgements. For instance, suppose we have a die, and we are $90\%$ sure that it is fair. We then want to find the probability that a six will be rolled. We make use of the formula:
$P(A)=P(A|B)P(B)+P(A|\sim B)P(\sim B)$
In this case, $A$ is the event "a six is rolled" and $B$ is the event "the die is fair". Here $P(A|\sim B)$ is inscrutable: given that the die is not fair, we cannot predict what the outcome will be. However, we do know that this probability, like all probabilities, is between zero and 1. Thus:
$P(A|B)P(B) \le P(A) \le P(A|B)P(B)+P(\sim B)$
In this case, we find
$P(6|\text{fair})P(\text{fair}) \le P(6) \le P(6|\text{fair})P(\text{fair})+P(\sim \text{fair})$
$\frac{1}{6} \cdot 0.9 \le P(6) \le \frac{1}{6} \cdot 0.9 + 0.1$
$0.15 \le P(6) \le 0.25$
Here we may introduce the concept of meta-probabilities. These take the form of the probability that something is true about a probability, for instance $P(P(X) \ge \alpha)$ is the probability that the probability of X is not less than $\alpha$. Returning to our example, suppose we are only $80\%$ confident that the probability that the die is fair is $90\%$. Applying the above formula:
$P(\text{fair}|P(\text{fair})=0.9)P(P(\text{fair})=0.9) \le P(\text{fair}) \le P(\text{fair}|P(\text{fair})=0.9)P(P(\text{fair})=0.9)+P(P(\text{fair}) \neq 0.9)$
$0.9 \cdot 0.8 \le P(\text{fair}) \le 0.9 \cdot 0.8+0.2$
$0.72 \le P(\text{fair}) \le 0.92$
This then implies $0.08 \le P(\sim \text{fair}) \le 0.28$.
Returning to our former equation, we then have:
$P(6|\text{fair})P(\text{fair}) \le P(6) \le P(6|\text{fair})P(\text{fair})+P(\sim \text{fair})$
$\frac{1}{6} \cdot 0.72 \le P(6) \le \frac{1}{6} \cdot 0.92 + 0.28$
$0.12 \le P(6) \le 0.433...$
This upper bound takes the largest value of each term separately; since $P(\text{fair})$ and $P(\sim \text{fair})$ must sum to one, it can be tightened to $\frac{1}{6} \cdot 0.72 + 0.28 = 0.40$. Either way, we see that adding in our meta-probabilistic uncertainty in our estimate for $P(\text{fair})$ has further widened our uncertainty in the likelihood of rolling a six. This highlights the importance of both accounting for and minimizing any potential sources of uncertainty. We must factor in our confidence in a model in assessing the results it predicts to be likely or unlikely, if we are to use that model to form our epistemic probabilities of the predicted results.
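
The interval arithmetic can be sketched as one function applied twice. The name `bounds` is hypothetical. The function returns two upper bounds: a loose one that takes the extreme of each term separately (giving the 0.433 figure) and a tight one that uses the constraint that $P(B)$ and $P(\sim B)$ sum to one.

```python
def bounds(p_a_given_b, b_low, b_high):
    """Bounds on P(A) = P(A|B)P(B) + P(A|~B)P(~B) when P(A|~B) is
    inscrutable (anywhere in [0, 1]) and P(B) lies in [b_low, b_high].

    Returns (lower, tight_upper, loose_upper).
    """
    lower = p_a_given_b * b_low                          # P(A|~B) = 0, P(B) = b_low
    # With P(A|~B) = 1, P(A) = 1 - (1 - P(A|B)) P(B), largest at P(B) = b_low.
    tight_upper = p_a_given_b * b_low + (1 - b_low)
    # Extremes of each term taken separately, ignoring P(B) + P(~B) = 1.
    loose_upper = p_a_given_b * b_high + (1 - b_low)
    return lower, tight_upper, loose_upper

# 90% sure the die is fair, with no meta-uncertainty.
six_simple = bounds(1 / 6, 0.9, 0.9)

# Only 80% confident that P(fair) = 0.9: first bound P(fair) itself.
fair_low, fair_high, _ = bounds(0.9, 0.8, 0.8)

# Then propagate that interval into the bound on rolling a six.
six_meta = bounds(1 / 6, fair_low, fair_high)
```

The first call reproduces the interval from 0.15 to 0.25, the second the interval from 0.72 to 0.92 for $P(\text{fair})$, and the third a lower bound of 0.12 with upper bounds of 0.40 (tight) and about 0.433 (loose).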

\* $P(X)=0$ does not mean that X cannot happen or never will. If you roll a marble, the chance of it landing on any given point is zero (ideally), and yet it will land on some point. What $P(X)=0$ means is specific: the measure of the space in which X holds is zero relative to the measure of $\Omega$. There may still be a possibility that X happens; it is just that the region in which X happens has zero "area" compared to $\Omega$ (e.g. it is a point, and $\Omega$ is a line segment). If $P(X)=0$ we say that X will almost surely not happen, as opposed to $\varnothing$, which will surely not happen.