Chapter 4 Statistical Pattern Recognition

4.1 Introduction
units of statistical pattern recognition: image regions, projected segments
each unit: has associated measurement vector
decision rule: designed optimally to assign unit to class or category
statistical pattern recognition techniques include

  1. feature selection and extraction techniques
  2. decision rule construction techniques
  3. techniques for estimating decision rule error




4.2 Bayes Decision Rules: Maximum Utility Model for Pattern Discrimination
In the simple pattern discrimination or pattern identification process, a unit is observed or measured, and a category assignment is made that names or classifies the unit as a type of object.
unit category assignment: made solely on observed measurement (pattern)




distinct facts characterizing the situation

$(t, a, d)$: event that the observed unit has true category $t$, is assigned category $a$, and gives rise to measurement $d$
$P(t, a, d)$: probability of the event $(t, a, d)$




4.2.1 Economic Gain Matrix
making category assignments: carries economic consequences that depend on the true and the assigned category
$e(t,a)$: economic gain or utility with true category $t$, assigned category $a$
decision rule: so resulting expected utility is highest




identity gain matrix: the optimal decision rule is correct the largest possible fraction of the time

\begin{displaymath}
e(t,a) = \left\{
\begin{array}{l l}
1 & {\rm\ if \ } t = a \\
0 & {\rm\ otherwise}
\end{array}\right.
\end{displaymath}




economic gain matrix for jet fan blade inspection:
the test itself costs $10
decision positive (crack detected): the $500 fan blade has to be discarded
really a crack but decision negative: possible $50,000,000 airplane crash
=====Table 4.1=====
decision rule optimized with this matrix: different from the rule optimized with the identity matrix




automatic defect-inspection machine: good, bad objects
$P(g, g)$: probability of true good, assigned good
$P(g, b)$: probability of true good, assigned bad
$P(b, g)$: probability of true bad, assigned good
$P(b, b)$: probability of true bad, assigned bad
=====Table 4.2=====
fraction of good objects manufactured $P(g) = P(g,g)+P(g,b)$
fraction of bad objects manufactured $P(b)=P(b,g)+P(b,b)$




$e$ positive: profit consequence
$e$ negative: loss consequence
=====Table 4.3=====




Expected profit per object

\begin{displaymath}
E = P(g,g)e(g,g) + P(g,b)e(g,b)+P(b,g)e(b,g)+P(b,b)e(b,b)
\end{displaymath}
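A minimal sketch in Python of this computation; the joint probabilities and gains below are illustrative assumptions, not the entries of Table 4.3:

\begin{verbatim}
# Expected profit per object: E = sum over (t, a) of P(t, a) * e(t, a).
# Joint probabilities P(true, assigned) and gains e(true, assigned) are
# illustrative assumptions.
P = {('g', 'g'): 0.90, ('g', 'b'): 0.05,
     ('b', 'g'): 0.01, ('b', 'b'): 0.04}
e = {('g', 'g'): 10.0, ('g', 'b'): -2.0,
     ('b', 'g'): -50.0, ('b', 'b'): -2.0}

expected_profit = sum(P[ta] * e[ta] for ta in P)
print(expected_profit)   # 8.32 with these assumed numbers
\end{verbatim}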




inspection machine performance specified by conditional probabilities
given object good, probability detected as good

\begin{displaymath}
P(g\vert g) = \frac{P(g,g)}{P(g,g)+P(g,b)}
\end{displaymath}

given object good, probability detected as bad

\begin{displaymath}
P(b\vert g) = \frac{P(g,b)}{P(g,g)+P(g,b)}
\end{displaymath}

given object bad, probability detected as good

\begin{displaymath}
P(g\vert b) = \frac{P(b,g)}{P(b,g)+P(b,b)}
\end{displaymath}

given object bad, probability detected as bad

\begin{displaymath}
P(b\vert b) = \frac{P(b,b)}{P(b,g)+P(b,b)}
\end{displaymath}

$P(b\vert g)$: false-detection rate: false-alarm rate
$P(g\vert b)$: misdetection rate

\begin{displaymath}
P(g\vert g) = 1 - P(b\vert g)
\end{displaymath}


\begin{displaymath}
P(b\vert b) = 1 - P(g\vert b)
\end{displaymath}




$P(g)$: characterizes manufacturing environment
$P(b\vert g), P(g\vert b)$: characterizes inspection machine performance
$e(g,g), e(g,b), e(b,g), e(b,b)$: economic consequences
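Combining these three ingredients, the expected profit per object can be written directly in terms of the prior, the machine's error rates, and the gains, using $P(g,g)=P(g\vert g)P(g)$, $P(g,b)=P(b\vert g)P(g)$, $P(b,g)=P(g\vert b)P(b)$, and $P(b,b)=P(b\vert b)P(b)$:

\begin{displaymath}
E = P(g)[(1-P(b\vert g))e(g,g) + P(b\vert g)e(g,b)]
  + P(b)[P(g\vert b)e(b,g) + (1-P(g\vert b))e(b,b)]
\end{displaymath}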




Discussion
automatic defect-detection machine operating curve: false-alarm rate versus misdetection rate
possible to trade a lower false-alarm rate for a higher misdetection rate, or vice versa




trade-off between misdetection and false-alarm rate
=====Example 4.1=====
=====Table 4.4=====
=====Table 4.5=====
=====Example 4.2=====
=====Table 4.6=====




4.2.2 Decision Rule Construction
$P(t, a, d)$: probability true category $t$, assigned category $a$, measurement $d$
average economic gain

\begin{displaymath}
E[e]=\sum_{t \in C} \sum_{a \in C} e(t,a) P(t,a)
\end{displaymath}

=====Fig. 4.1=====




when economic gain matrix is identity matrix

\begin{displaymath}
E[e]=\sum_{t \in C} \sum_{a \in C} e(t,a) P(t,a)=\sum_{t \in C} P(t,t)
= P({\rm correct \ assignment})
\end{displaymath}

the expected gain is the probability of correct assignment




fair game assumption: decision rule uses only measurement data in assignment
$P(a\vert t,d)$: conditional probability assignment $a$ given true $t$, observed $d$

\begin{displaymath}
P(a\vert t,d) = \frac{P(t,a,d)}{P(t,d)}
\end{displaymath}

conditional probability of making assignment $a$ given measurement $d$

\begin{displaymath}
P(a\vert d) = \frac{P(a,d)}{P(d)}
\end{displaymath}

fair game assumption says

\begin{displaymath}
P(a\vert t,d)=P(a\vert d)
\end{displaymath}

$f(a\vert d)$: the probability of assigning category $a$ given measurement $d$; completely defines the decision rule
deterministic decision rule

\begin{displaymath}
f(a\vert d) = \left\{
\begin{array}{l l}
1 & {\rm\ for \ some \ particular \ } a \\
0 & {\rm\ for \ all \ other \ } a' \neq a
\end{array}\right.
\end{displaymath}

decision rules that are not deterministic: called probabilistic, nondeterministic, or stochastic




expected value of economic gain depending on decision rule $f(a\vert d)$

\begin{displaymath}
E[e; f] = \sum_{d \in D} \{ \sum_{a \in C} f(a\vert d) [ \sum_{t \in C} e(t,a)P(t,d)]\}
\end{displaymath}

Bayes decision rules: maximize expected economic gain
Bayes decision rule $f$ satisfies

\begin{displaymath}
E[e;f] \geq E[e; g] {\rm\ for \ any \ decision \ rule \ } g
\end{displaymath}
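A minimal sketch in Python of how a deterministic Bayes rule can be constructed: for each measurement $d$, choose the assignment $a$ that maximizes $\sum_{t} e(t,a)P(t,d)$. The categories, measurements, joint probabilities, and gain matrix below are illustrative assumptions:

\begin{verbatim}
# Construct a deterministic Bayes rule: for each measurement d, pick the
# assignment a maximizing sum_t e(t, a) * P(t, d).  Numbers are illustrative.
categories = ['c1', 'c2']
measurements = ['d1', 'd2', 'd3']
P_td = {('c1', 'd1'): 0.25, ('c1', 'd2'): 0.15, ('c1', 'd3'): 0.10,
        ('c2', 'd1'): 0.05, ('c2', 'd2'): 0.15, ('c2', 'd3'): 0.30}
e = {('c1', 'c1'): 1.0, ('c1', 'c2'): 0.0,
     ('c2', 'c1'): 0.0, ('c2', 'c2'): 1.0}     # identity gain matrix

def bayes_rule():
    """Return the deterministic assignment d -> a and its expected gain."""
    rule, expected_gain = {}, 0.0
    for d in measurements:
        gains = {a: sum(e[(t, a)] * P_td[(t, d)] for t in categories)
                 for a in categories}
        rule[d] = max(gains, key=gains.get)    # best assignment for this d
        expected_gain += gains[rule[d]]
    return rule, expected_gain

print(bayes_rule())
\end{verbatim}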

=====Fig. 4.2=====
=====Fig. 4.3=====
continuous measurement space rather than discrete
=====p. 108, p. 109=====




4.3 Prior Probability

\begin{displaymath}
P(t,d) = P(d\vert t)P(t)
\end{displaymath}

$P(d\vert t)$: conditional probability of measurement $d$ given true $t$
$P(t)$: prior probability true category is $t$
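Substituting $P(t,d)=P(d\vert t)P(t)$ into the quantity maximized by the Bayes rule shows that only the class-conditional probabilities and the prior probabilities are needed:

\begin{displaymath}
\sum_{t \in C} e(t,a)P(t,d) = \sum_{t \in C} e(t,a)P(d\vert t)P(t)
\end{displaymath}

with the identity gain matrix this reduces to assigning the category $t$ that maximizes $P(d\vert t)P(t)$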




4.4 Economic Gain Matrix and the Decision Rule
identity economic gain matrix: maximizes correct classification probability

\begin{displaymath}
\left(
\begin{array}{c c c}
1 & 0 & 0 \\
0 & 1 & 0 \\
0 & 0 & 1
\end{array}\right)
\end{displaymath}

correct assignment gains 0, incorrect loses 1: same optimal rule

\begin{displaymath}
\left(
\begin{array}{c c c}
0 & -1 & -1 \\
-1 & 0 & -1 \\
-1 & -1 & 0
\end{array}\right)
\end{displaymath}
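The second matrix equals the identity gain matrix minus the all-ones matrix, so for any decision rule $f$ the two expected gains differ only by the constant $\sum_t \sum_a P(t,a)=1$; a rule maximizing one therefore maximizes the other:

\begin{displaymath}
E[e_I - {\bf 1};f] = E[e_I;f] - \sum_{t \in C}\sum_{a \in C}P(t,a) = E[e_I;f] - 1
\end{displaymath}

where $e_I$ is the identity gain matrix and ${\bf 1}$ the all-ones matrix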




4.5 Maximin Decision Rule
maximin decision rule: maximizes the expected gain under the least favorable (worst-case) prior probability
$c^1, c^2$: two categories
$d^1, d^2, d^3$: three possible measurements
$P(d\vert c)$: conditional probability of measured $d$, given category $c$
$f_1, f_2, ..., f_8$: eight possible decision rules
$e$: gain matrix with 0s off the diagonal and 1s on the diagonal
$E[e\vert c^1;f]$: conditional gain
$E[e\vert c^1;f_2]=0.5=0.2+0.3; \ \ \ E[e\vert c^2;f_2]=0.1$
=====Example 4.3=====
$f_5, f_7$: yield minimum expected conditional gain of 0.5
$f_5, f_7$: deterministic maximin rules
$f_5, f_7$: maximize the minimum expected conditional gain
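A minimal sketch in Python of the enumeration behind a maximin rule: with two categories and three measurements there are $2^3=8$ deterministic rules; for each one, compute the expected conditional gains under the identity gain matrix and keep the rules whose smaller conditional gain is largest. The $P(d\vert c)$ table below is an illustrative assumption, not the table of Example 4.3:

\begin{verbatim}
from itertools import product

# Class-conditional probabilities P(d | c); illustrative numbers only.
P_d_given_c = {'c1': {'d1': 0.2, 'd2': 0.3, 'd3': 0.5},
               'c2': {'d1': 0.1, 'd2': 0.6, 'd3': 0.3}}
measurements = ['d1', 'd2', 'd3']

best_min, maximin_rules = -1.0, []
# Each deterministic rule assigns one of the two categories to every d.
for assignment in product(['c1', 'c2'], repeat=len(measurements)):
    rule = dict(zip(measurements, assignment))
    # With the identity gain matrix, E[e | c; f] is the probability of
    # assigning category c given that c is the true category.
    cond_gain = {c: sum(P_d_given_c[c][d]
                        for d in measurements if rule[d] == c)
                 for c in ('c1', 'c2')}
    worst = min(cond_gain.values())
    if worst > best_min:
        best_min, maximin_rules = worst, [rule]
    elif worst == best_min:
        maximin_rules.append(rule)

print(best_min, maximin_rules)
\end{verbatim}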
expected gain that the eight possible decision rules yield
=====Fig. 4.4=====
convex set of the expected conditional gains possible for any decision rule
=====Fig. 4.5=====
$3\frac{2}{3}=3\frac{2}{3}(\frac{3}{4}+\frac{1}{4}); \ \ \
2\frac{1}{2}=3\frac{2}{3}\cdot\frac{3}{4}+(-1)\frac{1}{4}$
=====Example 4.4=====
$f_2$: maximin rule yielding minimum expected gain of $2\frac{1}{2}$
expected gains as a function of prior probability for various decision rules
=====Fig. 4.6=====
convex set of expected conditional gains =====Fig. 4.7=====
$P(c^1\vert d,f_1)=0.3+0.4+0.3; \ \ \ P(c^1\vert d,f_2)=0.3+0.4$
=====Example 4.5=====
expected gain as a function of prior probability for various decision rules
=====Fig. 4.8=====
convex set of the conditional gains possible for any decision rule
=====Fig. 4.9=====




4.6 Decision Rule Error: Misidentification/False Identification
misidentification error for category $c^k$ when true $c^k$ assigned $c^j \neq
c^k$

\begin{displaymath}
\alpha_k = P({\rm\ unit \ not \ assigned \ to \ } c^k\vert {\rm\ true \ category
\ of \ unit \ is \ } c^k)
\end{displaymath}

false-identification error for category $c^k$ when assigned $c^k$ true $c^j \neq
c^k$

\begin{displaymath}
\beta_k = P({\rm\ unit \ assigned \ to \ } c^k\vert {\rm\ true \ category
\ of \ unit \ is \ } c^j, c^j \neq c^k)
\end{displaymath}
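A minimal sketch in Python of computing these two error rates from a joint probability table over (true, assigned) categories; the table entries below are illustrative assumptions:

\begin{verbatim}
# Misidentification rate  alpha_k = P(assigned != c_k | true = c_k)
# False-identification rate  beta_k = P(assigned = c_k | true != c_k)
# Joint P(true, assigned); illustrative numbers only.
P = {('c1', 'c1'): 0.40, ('c1', 'c2'): 0.10,
     ('c2', 'c1'): 0.05, ('c2', 'c2'): 0.45}
categories = ['c1', 'c2']

def alpha(k):
    true_k = sum(P[(k, a)] for a in categories)
    return sum(P[(k, a)] for a in categories if a != k) / true_k

def beta(k):
    not_true_k = sum(P[(t, a)] for t in categories
                     for a in categories if t != k)
    return sum(P[(t, k)] for t in categories if t != k) / not_true_k

print(alpha('c1'), beta('c1'))   # 0.10/0.50 = 0.2, 0.05/0.50 = 0.1
\end{verbatim}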

$\alpha_1(f)=\frac{\sum_{j=1,j\neq 1}^2 \sum_{d \in D} f(c^j\vert d)P(d,c^1)}{\sum_{d \in D}P(d,c^1)}
=\frac{0.25}{0.2+0.1+0.05+0.05+0.05+0.05}=\frac{0.25}{0.5}=0.5$
$\beta_1(f)=0.2=\frac{0.05+0.05}{0.5}; \ \ \ \alpha_1(g)=0.8=\frac{0.4}{0.5}$
=====p. 129=====




4.7 Reserving Judgement
reserved judgement: important technique to control error rate
reserved judgement: the decision rule may withhold judgement for some measurements
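One common way to implement reserved judgement, sketched in Python under the assumption that the rule withholds judgement whenever the largest posterior probability falls below a chosen threshold (the threshold and posteriors below are illustrative):

\begin{verbatim}
# Reserve judgement when the winning posterior is not confident enough.
# Threshold value and posteriors are illustrative assumptions.
RESERVE_THRESHOLD = 0.9

def classify_or_reserve(posteriors):
    """posteriors: dict mapping category -> P(category | measurement)."""
    best = max(posteriors, key=posteriors.get)
    if posteriors[best] < RESERVE_THRESHOLD:
        return None          # reserve judgement: no category assigned
    return best

print(classify_or_reserve({'good': 0.95, 'bad': 0.05}))   # 'good'
print(classify_or_reserve({'good': 0.60, 'bad': 0.40}))   # None (reserved)
\end{verbatim}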
=====ITRI, MIRL, plate=====




4.8 Nearest Neighbor Rule
nearest neighbor rule: assigns pattern $x$ the category of the closest vector in the training set
chief difficulty: the brute-force nearest neighbor algorithm has computational
complexity proportional to the number of patterns in the training set
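A minimal brute-force sketch in Python; the training vectors, labels, and the Euclidean distance are illustrative assumptions:

\begin{verbatim}
import math

# Brute-force nearest neighbor: O(n) distance computations per query,
# where n is the number of training patterns.
training_set = [((0.0, 0.0), 'c1'), ((1.0, 1.0), 'c1'),
                ((4.0, 4.0), 'c2'), ((5.0, 4.5), 'c2')]

def nearest_neighbor(x):
    """Assign pattern x the category of the closest training vector."""
    _, label = min(training_set,
                   key=lambda pair: math.dist(x, pair[0]))
    return label

print(nearest_neighbor((0.5, 0.2)))   # 'c1'
print(nearest_neighbor((4.2, 5.0)))   # 'c2'
\end{verbatim}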




4.9 A Binary Decision Tree Classifier
decision tree classifier: assigns by hierarchical decision procedure
typical binary decision tree classifier
=====Fig. 4.10=====
three major problems in constructing a decision tree classifier

five decision rules at each nonterminal node:
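A minimal sketch in Python of a binary decision tree classifier that uses a single-feature threshold test at each nonterminal node; the tree structure, features, and thresholds below are illustrative assumptions:

\begin{verbatim}
# Each nonterminal node tests one feature against a threshold and sends the
# measurement vector left or right; each terminal node names a category.
# The structure and numbers below are illustrative assumptions.
tree = {'feature': 0, 'threshold': 2.5,
        'left':  {'feature': 1, 'threshold': 1.0,
                  'left': 'c1', 'right': 'c2'},
        'right': 'c3'}

def classify(x, node=tree):
    """Follow threshold decisions down to a terminal node (a category)."""
    while isinstance(node, dict):
        branch = 'left' if x[node['feature']] <= node['threshold'] else 'right'
        node = node[branch]
    return node

print(classify((1.0, 0.5)))   # 'c1'
print(classify((1.0, 3.0)))   # 'c2'
print(classify((4.0, 0.0)))   # 'c3'
\end{verbatim}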




4.10 Decision Rule Error Estimation
decision rule constructed: important to characterize performance by errors
training data set: must be independent of testing data set
hold-out method: one common error estimation technique
hold-out method: divide total data set in half
hold-out method: one half to construct decision rule, other half to test it
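A minimal sketch in Python of the hold-out method; the data and the trivial decision-rule constructor below are illustrative placeholders:

\begin{verbatim}
import random

# Hold-out error estimation: split the labeled data in half, construct the
# decision rule on one half, and estimate its error rate on the other half.
def holdout_error(data, construct_rule):
    """data: list of (measurement, true_category);
    construct_rule: builds a classifier from a training list and returns a
    function measurement -> category."""
    shuffled = data[:]
    random.shuffle(shuffled)
    half = len(shuffled) // 2
    train, test = shuffled[:half], shuffled[half:]
    rule = construct_rule(train)
    errors = sum(1 for x, t in test if rule(x) != t)
    return errors / len(test)

# Illustrative rule: always predict the most common training label.
def majority_rule(train):
    labels = [t for _, t in train]
    most_common = max(set(labels), key=labels.count)
    return lambda x: most_common

data = [((i, i % 3), 'c1' if i % 2 == 0 else 'c2') for i in range(20)]
print(holdout_error(data, majority_rule))
\end{verbatim}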




4.11 Neural Networks
neural network: set of units taking linear combination of input values
linear combination: passed through a nonlinear function, e.g., a threshold
neural network: trained by presenting inputs and observing its responses
neural network: reinforcement algorithms/backpropagation used to adjust the weights
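A minimal sketch in Python of a single threshold unit and a perceptron-style weight update (the learning rate, number of epochs, and training data are illustrative assumptions; backpropagation extends the same weight-adjustment idea to multilayer networks):

\begin{verbatim}
# One unit: a linear combination of inputs passed through a threshold.
def unit(weights, bias, x):
    s = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if s > 0 else 0

# Perceptron-style training: nudge the weights whenever the response is wrong.
def train(samples, n_inputs, rate=0.1, epochs=50):
    w, b = [0.0] * n_inputs, 0.0
    for _ in range(epochs):
        for x, target in samples:
            error = target - unit(w, b, x)
            w = [wi + rate * error * xi for wi, xi in zip(w, x)]
            b += rate * error
    return w, b

# Learn the logical AND function (illustrative data).
samples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w, b = train(samples, n_inputs=2)
print([unit(w, b, x) for x, _ in samples])   # expect [0, 0, 0, 1]
\end{verbatim}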
neural network literature extensive, course by Prof. Cheng-Yuan Liou




4.12 Summary
pattern recognition literature extensive, course by Prof. Yi-Ping Hung


