# A Generalized Evidence Function for Widget Testing

*[Context: E.T. Jaynes’ Probability Theory: The Logic of Science, sections 4.2-4.4]*

The example scenario Jaynes uses to introduce the evidence function and multiple hypothesis testing is extremely contrived. We have a bunch of widgets that are all either good or bad. And we have some prior information that tells us the widgets came either from a “good” machine, which produces one defective widget for every five good ones, or a “bad” machine, which produces one defective widget for every two good ones. We then add a third hypothesis: the information we got about good and bad machines is inaccurate, and our widgets actually came from a horrible machine that produces 99 defective widgets for every good one. We are supposed to determine which of these cases is true by testing one widget at a time and evaluating the posterior evidence function for each hypothesis.

Contrived as it is, the example does a good job of introducing the material.

But what about after having been introduced?

It seems to me that in a reasonable “real world” situation, you would not have prior information about the possible defective rates of your widget-producing machines. Rather, if you were doing quality control on the machines, your job might be to determine, ex nihilo, the real defective rate. With a small extension, we can use the math from the contrived scenario to solve this problem.

First off, let’s define the new problem.

Your job is quality control at the widget factory. Today you have a box of widgets that came from machine M-1. You have no information about the possible defective rate, \(f_d\), of machine M-1, other than by definition \(f_d\) must be a real number between 0 and 1. You have a test that will determine whether a given widget is defective or not. Find \(f_d\) to within a reasonable margin of error.

Now, knowing nothing about the underlying rate of defectiveness, how do we come
up with a set of hypotheses to test? A first thought might be to just test *all*
the hypotheses! That is, *every* real number between 0 and 1. A second thought
might be that we hardly have that kind of time on our hands. But our problem
only asks for \(f_d\) to within some reasonable margin of error. If a
“reasonable” margin of error is, say, 5%, then we can just test fractions at
intervals of 0.05.

And so, we come up with a set of hypotheses to test:

$$ \begin{equation} \begin{split} H_0 & \equiv \textrm{“0 widgets are defective”} \newline H_1 & \equiv \textrm{“} \tfrac{5}{100} \textrm{ widgets are defective”} \newline H_2 & \equiv \textrm{“} \tfrac{10}{100} \textrm{ widgets are defective”} \newline & \dots \newline H_{19} & \equiv \textrm{“} \tfrac{95}{100} \textrm{ widgets are defective”} \newline H_{20} & \equiv \textrm{“All widgets are defective.”} \end{split} \end{equation} $$
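To make the spacing concrete, the whole candidate set can be generated in one line (a sketch; the simulation code later in this post builds these same fractions into hypothesis objects):

```python
# The 21 candidate defect rates, spaced at the 0.05 margin of error.
hypothesis_fractions = [i * 5 / 100 for i in range(21)]

assert len(hypothesis_fractions) == 21
assert hypothesis_fractions[0] == 0.0    # H_0: no widgets defective
assert hypothesis_fractions[-1] == 1.0   # H_20: all widgets defective
```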

For each hypothesis, we need an evidence function. That is what tells us how strongly to consider the hypothesis relative to all the others, given our testing data.

## Deriving the evidence function

Each evidence function will take the form

$$ e(H_i|DX) = e(H_i|X) + 10\log_{10}\bigg[ \frac{P(D|H_iX)}{P(D|\overline{H_i}X)} \bigg]. $$
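Both terms in this expression are log-odds measured in decibels. A minimal helper for the conversion (the name `evidence_db` is mine, not Jaynes’):

```python
import math

def evidence_db(p):
    """Evidence for a proposition of probability p, in decibels:
    10 * log10 of the odds p / (1 - p)."""
    return 10 * math.log10(p / (1 - p))

# Even odds carry zero evidence; 10:1 odds carry about +10 dB.
print(evidence_db(0.5))        # 0.0
x = evidence_db(10 / 11)       # ≈ +10 dB (10:1 odds)
```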

In my last post I showed how to expand the numerator and
denominator of the term inside the logarithm. That was for the three hypothesis
case. For the \(n\) hypothesis case, the numerator is the same, since it just
depends on \(H_i\). Some work has to be done to generalize the denominator,
however, since it depends on all of the hypotheses *except* \(H_i\).

$$ \begin{equation} \begin{split} P(D|\overline{H_i}X) & = P(D|(H_0 + \dots + H_{i-1} + H_{i+1} + \dots + H_n)X) \newline & = P(D|X) \frac{P(H_0 + \dots + H_{i-1} + H_{i+1} + \dots + H_n|DX)}{P(H_0 + \dots + H_{i-1} + H_{i+1} + \dots + H_n|X)} \end{split} \end{equation} \tag{1}\label{1} $$

In *Logic of Science* 2.2, equation 5, we saw that if the propositions \(\{H_0,\dots,H_n\}\) are mutually exclusive, then the term in the denominator is equal to the sum of the probabilities of the individual propositions given the prior \(X\).

$$ P(H_0 + \dots + H_{i-1} + H_{i+1} + \dots + H_n|X) = \sum_{j\neq i}^n P(H_j|X) \tag{2}\label{2} $$

Due to the same mutual exclusivity, the numerator will come out to a similar sum.

$$ P(H_0 + \dots + H_{i-1} + H_{i+1} + \dots + H_n|DX) = \sum_{j\neq i}^n P(H_j|DX) \tag{3}\label{3} $$

Distributing the prior probability of the data, and carrying out a single application of the product rule,

$$ \begin{equation} \begin{split} P(D|X) \cdot \sum_{j\neq i}^n P(H_j|DX) & = \sum_{j\neq i}^n P(D|X)P(H_j|DX) \newline & = \sum_{j\neq i}^n P(H_j|X)P(D|H_jX). \end{split} \end{equation} \tag{4}\label{4} $$

Substituting these results into \(\eqref{1}\), we have

$$ P(D|\overline{H_i}X) = \frac{\sum_{j\neq i} P(H_j|X)P(D|H_jX)}{\sum_{j\neq i} P(H_j|X)} $$

And so, we have an expanded evidence function, which looks like

$$ \begin{equation} \begin{split} e(H_i|DX) & = e(H_i|X) + 10\log_{10}\bigg[ P(D|H_iX) \cdot \frac{\sum_{j\neq i} P(H_j|X)}{\sum_{j\neq i} P(H_j|X)P(D|H_jX)} \bigg]. \end{split} \end{equation} $$

This evidence function is correct for the \(n\)-hypothesis case, but only under the assumption that the hypotheses in \(\{H_0,\dots,H_n\}\) are mutually exclusive.
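As a sanity check, the expanded form should agree with applying Bayes’ theorem directly and converting the posterior to decibels. A sketch of that check (the function name and argument layout are mine; it assumes the priors sum to 1):

```python
import math

def posterior_evidence_db(i, priors, likelihoods):
    """Evidence e(H_i|DX) in dB, from priors P(H_j|X) and likelihoods
    P(D|H_j X), for mutually exclusive, exhaustive hypotheses H_j."""
    rest = [j for j in range(len(priors)) if j != i]
    prior_ev = 10 * math.log10(priors[i] / (1 - priors[i]))
    update = (likelihoods[i] * sum(priors[j] for j in rest)
              / sum(priors[j] * likelihoods[j] for j in rest))
    return prior_ev + 10 * math.log10(update)

# Cross-check against Bayes' theorem applied directly.
priors = [0.25, 0.25, 0.5]
likelihoods = [0.10, 0.30, 0.05]
posterior = (priors[0] * likelihoods[0]
             / sum(p * l for p, l in zip(priors, likelihoods)))
direct = 10 * math.log10(posterior / (1 - posterior))
assert abs(posterior_evidence_db(0, priors, likelihoods) - direct) < 1e-9
```

The two routes agree because, with priors summing to 1, \(1 - P(H_i|X)\) equals \(\sum_{j\neq i} P(H_j|X)\), so the prior-odds factor cancels against the sum in the numerator.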

## Mapping to our problem

Equipped with a general enough evidence function, we can almost go about the business of determining the real fraction of defective widgets produced by machine M-1. But first, we have to specify how the elements of our problem map to the elements in the function.

### Data Set

The evidence function requires a parameter \(D\), on which our belief of each hypothesis is conditioned. The data set in this problem consists only of two pieces of information:

- The number of widgets that have been tested, \(N\), and
- The number of widgets that have been found to be defective, \(N_d\).

### Prior probabilities

The term \(P(H_j|X)\) occurs twice in the evidence function, and represents
the probability that hypothesis \(H_j\) is true, before we see a data set.
Since we don’t have any information that would lead us to favor one hypothesis
over another, it makes sense to give the same prior probability to each
one.^{1} There are 21 mutually exclusive hypotheses, which means

$$ P(H_i|X) = \frac{1}{21}, \quad 0 \leq i \leq 20. $$

### Likelihood of data

The remaining term in the evidence function is \(P(D|H_iX)\), and represents
how probable it is to observe the data set if the hypothesis \(H_i\) is true.
Jaynes explains why we can use the binomial sampling distribution to compute
this.^{2} If we let \(y\) equal the fraction of bad widgets specified
by \(H_i\), \(N_d\) equal the number of tested widgets that are defective,
and \(N\) equal the number of widgets tested so far, then

$$ P(D|H_iX) = \binom{N}{N_d} y^{N_d}(1-y)^{N - N_d}. $$
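This likelihood is a one-liner in code; a small sketch (the function name is mine, and `math.comb` requires Python 3.8+):

```python
import math

def likelihood(y, N, Nd):
    """P(D|H_i X): probability of finding Nd defective widgets in
    N tests, if the true defect rate is y."""
    return math.comb(N, Nd) * y**Nd * (1 - y)**(N - Nd)

# e.g. under H_2 (y = 0.10), finding 1 defect in 10 tests:
p = likelihood(0.10, 10, 1)   # = 10 * 0.1 * 0.9**9 ≈ 0.387
```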

## The widget testing process

At this point we can actually evaluate the evidence for each hypothesis, conditional on a data set. Given that, how do we find the real defective rate of machine M-1, and therefore solve the problem?

Since we are not actually *in* a widget factory,^{3} and it would take
way too long to do the calculations by hand, I’ve written some code to model the
scenario.^{4}

```python
#!/usr/bin/env python3
import random
from enum import Enum

import numpy as np
import matplotlib.pyplot as plt
from scipy.special import binom


class Proposition:
    def __init__(self, name, desc, prior, defect_rate):
        self.name = name
        self.desc = desc
        self.prior = prior
        self.defect_rate = defect_rate

    def posterior_evidence(self, D):
        # Replaced per-instance once the other hypotheses are known.
        raise NotImplementedError


class DataSet:
    def __init__(self, n, nb):
        self.n = n    # number of widgets tested
        self.nb = nb  # number of defective widgets found


class WidgetQuality(Enum):
    GOOD = "good"
    BAD = "bad"


def b(n, r, f):
    """Binomial sampling distribution: P(r defectives in n tests | rate f)."""
    return binom(n, r) * f**r * (1 - f)**(n - r)


def evidence(x):
    """Probability -> evidence in decibels (guarded against log of zero)."""
    return 10 * np.log10((x + 1e-300) / (1 - x + 1e-300))


def posterior_evidence_fn(Hi, Hrest):
    def posterior_evidence(data):
        def f(H, data):
            return H.prior * b(data.n, data.nb, H.defect_rate)
        prior_evidence = evidence(Hi.prior)
        posterior_odds = b(data.n, data.nb, Hi.defect_rate) \
            * sum(h.prior for h in Hrest) \
            / (sum(f(h, data) for h in Hrest) + 1e-300)
        return prior_evidence + 10 * np.log10(posterior_odds + 1e-300)
    return posterior_evidence


def test_widget(real_defective_rate):
    if random.random() < real_defective_rate:
        return WidgetQuality.BAD
    return WidgetQuality.GOOD


# ---------------------------------------------------------------------- #
# number of hypotheses
M = 21
# prior probability of each H_i
uniform_prior = 1.0 / M
# create the hypotheses: H_i says that i*5% of widgets are defective
hypotheses = []
for i in range(M):
    P_i = Proposition(str(i), '{}/100 of widgets are bad'.format(i * 5),
                      uniform_prior, (i * 5) / 100)
    hypotheses.append(P_i)
# set the posterior evidence functions
for h in hypotheses:
    h_rest = [h_other for h_other in hypotheses if h_other.name != h.name]
    h.posterior_evidence = posterior_evidence_fn(h, h_rest)

real_defective_rate = 11.0 / 100
# track the number of observed defective widgets
nd = 0
# track the evidence for each hypothesis after each test
evidence_histories = [[] for _ in hypotheses]
N = 200
xs = range(N)
for i in range(N):
    if test_widget(real_defective_rate) == WidgetQuality.BAD:
        nd += 1
    data = DataSet(i + 1, nd)  # i + 1 widgets have been tested so far
    for j, h in enumerate(hypotheses):
        evidence_histories[j].append(h.posterior_evidence(data))

for history in evidence_histories:
    plt.plot(xs, history)
for i, h in enumerate(hypotheses):
    plt.annotate(h.name, xy=(N, evidence_histories[i][-1]))
plt.ylim(-60, 20)
plt.axhline(y=0, color='k')
plt.title('Evidence for {{H_0, ..., H_20}} over {} tests'.format(N))
plt.xlabel('Number of widgets tested')
plt.ylabel('Evidence (dB)')
plt.show()
```

The code simulates testing 200 widgets when the real defective rate of the machine is \(\frac{11}{100}\). It records the evidence for each hypothesis at each step of the way. Here is the plot of one run.

Since pulling a widget out of the box and testing it is a random sampling process, we are not guaranteed to always have the most evidence for the hypothesis that specifies the real defective rate. Exactly that happens in the plot above, where \(H_3\) has more evidence after 200 tests than \(H_2\), the hypothesis that most closely represents the real defective rate.

In the limit, however, the correct hypothesis will always prevail.^{5}

1: I understand there are more sophisticated ways of assigning priors, but I don’t know them yet.

2: It’s because we are assuming the number of widgets tested is far fewer than the number of widgets in the box.

3: At least, I’m not. I don’t know about you.

4: I took some general concept, and specifically the binomial distribution function from this github gist.

5: This is a result which Jaynes leaves to the reader to prove, and I haven’t worked it out yet.