Fisher's Natural Breaks Classification

From ObjectVision

Jump to: navigation, search

An \( O(k \times n \times \log(n)) \) algorithm is presented here, with proof of its validity and complexity, for the classification of an array of \( n \) numeric values into \( k \) classes such that the sum of the squared deviations from the class means is minimal, known as Fisher's Natural Breaks Classification. This algorithm is an improvement of Jenks' Natural Breaks Classification Method, which is a (re)implementation of the algorithm described by Fisher within the context of choropleth maps, which has time complexity \( O(k \times n^2) \).

Finding breaks for 15 classes for a data array of 7 million unique values now takes 20 seconds on a desktop PC.

Contents

Definitions

Given a sequence of strictly increasing values \(v_i\) and positive weights \(w_i\) for \(i \in \{1..n\}\), and a given number of classes \(k\) with \(k \le n\),

choose classification function \( I(x): \{1..n\} \to \{1..k\} \), such that \(SSD_{n,k}\), the sum of the squares of the deviations from the class means, is minimal, where:

$$ \begin{align} SSD_{n,k} &:= \min\limits_{I: \{1..n\} \to \{1..k\}} \sum\limits^k_{j=1} ssd(S_j) & & \text{minimal sum of sum of squared deviations } \\ S_j &:= \{i\in \{1..n\}|I(i) = j\} & & \text{set of indices that map to class}\,j\\ ssd(S) &:= { \sum\limits_{i \in S} {w_i (v_i - m(S))^2} } & & \text{the sum of squared deviations of the values of any index set}\,S\\ m(S) &:= { s(S) \over w(S) } & & \text{the weighted mean of the values of any index set}\,S\\ s(S) &:= { \sum\limits_{i \in S} {w_i v_i} } & & \text{the weighted sum of the values of any index set}\,S\\ w(S) &:= \sum\limits_{i \in S} {w_i} & & \text{the sum of the value-weights of any index set}\,S \end{align} $$

Note that any array of n unsorted and not necessarily unique values can be sorted and made into unique \( v_i \) by removing duplicates with \(w_i\) representing the occurrence frequency of each value in \( O(n \log(n)) \) time.

Characteristics

One can derive the following:

  • \(ssd(S) = sqr(S) - wsm(S)\) with

$$ \begin{align} sqr(S) &:= \sum\limits_{i \in S} {w_i {v_i}^2} \\ wsm(S) &:= {w(S) {m(S)}^2} = s(S) m(S) & &\text{the weighted square of the mean of S} \end{align} $$

  • \(SSD = SSV - WSM\) with

$$ \begin{align} SSV &:= \sum\limits_{j} {sqr(S_j)} = sqr(\{1..n\}) \\ WSM &:= \sum\limits_{j} {wsm(S_j)} = \sum\limits_{j} {w(S_j) {m(S_j)}^2} \end{align} $$

  • Since \(SSV\) is independent of \(S_j\), minimizing \(SSD\) is equivalent to maximizing:

$$ WSM = \sum\limits_{j} { {[\sum\limits_{i \in S_j} {w_i v_i}]^2} \over {\sum\limits_{i \in S_j} {w_i}} } $$

  • Since the sequence \(v\) is sorted, all \(v_i\)'s that belong to one class \( S_j \) must be consecutive and if equal values would be allowed, these would end up in the same class provided that at least \(k\) unique values are given (for proof, see: Fisher, On Grouping for Maximum Homogeneity, 1958). Consequently, the \(S_j\)'s can be 1-to-1 related to a strictly increasing sequence of class-break-indices \(i_j := \min S_j\) for \(j = 1 .. k\) with \(i_1 = 1\) and also to a strictly increasing sequence of class-break-values \(v_{i_j}\).


  • With \(n\) values and \(k\) classes, one can choose \(\binom{n-1}{k-1} = \frac{(n-1)!}{(k-1)! (n-k)!} \) elements as first value of a class (the minimum value must be the minimum of the lowest class).
  • given indices \( i_1 \le i_2 \) and \( i_3 \le i_4 \): \( i_1 \le i_3 \wedge i_2 \le i_4 \implies m(i_1..i_2) \le m(i_3..i_4) \)

Dynamic programming approach

Take \(SSM_{i,j}\) as the maximum sum of squared deviations for the first i values classified to j classes and \(CB_{i,j}\) as the value index of the last class-break for such classification. Since a maximal sum is a sum, it is easy to see that \( SSM_{i,j} \) is equal to the maximum value for \( p \in \{j..i\} \)of \( SSM_{p-1, j-1}+ssm(\{p..i\}) \) . Thus, the \( SSM_{p-1, j-1} \) provide Overlapping Sub Problems with an Optimal Sub Structure and the following dynamic program can be used to find a solution.

Dynamic rule for \(i \ge j \gt 1 \):

$$ \begin{align} SSM_{i,j} &:= \max\limits_{p \in \{j..i\}} SSM_{p-1, j-1}+ssm(\{p..i\}) \\ CB_{i,j} &:= arg\max\limits_{p \in \{j..i\}} SSM_{p-1, j-1}+ssm(\{p..i\}) \end{align} $$

Start conditions: $$ \begin{align} SSM_{i, 1} &:= ssm(\{1..i\}) \\ CB_{i,1} &:= 1 \\ \end{align} $$ Thus $$ \begin{align} SSM_{i,i} &:= sqr(\{1..i\}) \\ CB_{i,i} &:= i \\ \end{align} $$ Note that \( CB_{i,j} \) and \( SSM_{i,j} \) are only defined for \( j \le i \) thus \( i - j \ge 0 \). Furthermore, for finding values with indices \( (n,k) \), only elements with \( i-j \le n-k \) are relevant.

  • If \(i = CB_{n,k}\) then the corresponding previous class-break index has been stored as \(CB_{i-1,k-1}\).
  • Fisher described an algorithm that goes through all i and for each i finding and storing all \(SSD_{i,j}\) and \(CB_{i,j}\) for each subsequent j, which is the approach taken in the found implementations (see links). This requires \( O(k \times n) \) working memory and \(O(k \times n^2)\) time.
  • Dynamically going through all j and for each j finding all \(SSM_{i,j}\) for each i only requires \(O(n-k)\) working memory and \(O(k \times n \times \log(n))\) time by exploiting the no-crossing-paths characteristic, as proven below, but only provides the last ClassBreak for each i.
  • no crossing paths property: \( CB_{i,j} \)is monotonous increasing with i, thus \( i1 \le i2 \implies CB_{i1,j} \le CB_{i2,j} \) (see proof below).
  • To obtain a chain of \(CB_{i,j}\) back from \(CB_{n,k}\), one needs to maintain \( (n-k+1) \times (k-2) \) intermediate CB's, unless the \( O(k \times n \times \log(n)) \) algorithm is reapplied k times.

Complexity

  • Each \(ssm(i_1..i_2)\) can be calculated in constant time as \( {{\left(WV[i_2] - WV[i_1-1]\right)^2} \over {W[i_2] - W[i_1-1]}} \) after a single initialization of \(O(n)\) time and keeping a buffer of \(O(n)\) size containing series of cumulative weights, \( W \), and cumulative weighted values, \( WV \).
  • During the calculation of the \(n-k+1\) relevant values of SSM and CB for a row \(j \in\{2..k-1\}\) using the \(n-k+1\) values that were calculated for row \(j-1\), one can divide and conquer recursively by calculating \( CB_{i,j} \) for i in the middle of a range for which a left and right boundary are given and using the "no crossing paths" property to cut the remaining number of options in half by sectioning the lower indices to the left or equal options and the higher indices to the greater or equal options. The required processing time budget \( C(n,m) \) for processing \(n\) class-breaks using \(m\) values for previous class breaks is \( c \times m + \max i: [ C(n/2,i) + C(n/2,m+1-i) ] \). A budget of \( C(n,m):=c \times (m+n) \times (\log_2(n)+1) \) is sufficient, thus the calculation of \( CB_{i,j} \) given all \( CB_{i,j-1} \) requires processing time of \( C(n, n) \), which is \( O(n \times \log n) \).

Proof of the no crossing paths property

  • Given indices \( p1, p2, i1, i2 \) such that \( p1 < p2 \le i1 < i2 \),

\[ \begin{align} m_p &:= m(\{p1 .. p2-1\}) &w_p &:= w(\{p1 .. p2-1\}) \\ m_j &:= m(\{p2 .. i1 \}) &w_j &:= w(\{p2 .. i1 \}) \\ m_n &:= m(\{i1+1 .. i2\}) &w_n &:= w(\{i1+1 .. i2\}) \\ m_{pj} &:= m(\{p1 .. i1 \}) &w_{pj} &:= w(\{p1 .. i1\}) \\ m_{jn} &:= m(\{p2 .. i2 \}) &w_{jn} &:= w(\{p2 .. i2\}) \\ m_{pjn} &:= m(\{p1 .. i2 \}) &w_{pjn} &:= w(\{p1 .. i2\}) \\ \end{align} \]

  • it follows that

\[ v_{p1} < v_{p2} \le v_{i1} < v_{i2} \\ m_p < m_{pj} < m_j < m_{jn} < m_n \\ m_{pj} < m_{pjn} < m_{jn} \] \[ \begin{align} w_{pj} &= w_p + w_j \\ w_{jn} &= w_j + w_n \\ w_{pjn} &= w_p + w_j + w_n = w_{pj}+w_{jn} - w_j \\ \end{align} \] \[ \begin{align} w_{pj} \times m_{pj} &= w_p \times m_p + w_j \times m_j \\ w_{jn} \times m_{jn} &= w_j \times m_j + w_n \times m_n \\ w_{pjn} \times m_{pjn} &= w_j \times m_j + w_n \times m_n + w_n \times m_n \\ w_{pj} \times m_{pj}+w_{jn} \times m_{jn} &= w_j \times m_j + w_{pjn} \times m_{pjn} \end{align} \] \[ \begin{align} ssm(\{p1 .. i1\}) &= w_{pj} \times m_{pj}^2 \\ ssm(\{p2 .. i1\}) &= w_{j} \times m_{j}^2 \\ ssm(\{p1 .. i2\}) &= w_{pjn} \times m_{pjn}^2 \\ ssm(\{i1 .. i2\}) &= w_{jn} \times m_{jn}^2 \end{align} \]

  • and it follows that the following quantities are positive:

\[ \begin{align} w_{pj1} &:= w_j \times (m_{jn} - m_{j}) / (m_{jn} - m_{pj}) &> 0 \\ w_{jn1} &:= w_j \times (m_{j} - m_{pj}) / (m_{jn} - m_{pj}) &> 0 \\ w_{pj3} &:= w_{pj} - w_{pj1} &> 0\\ w_{jn3} &:= w_{jn} - w_{jn1} &> 0\\ \end{align} \]

  • note that

\[ \begin{align} w_{pj1}+w_{jn1} &= w_j \\ w_{pj3}+w_{jn3} &= w_{pjn} \\ w_{pj1}+w_{pj3} &= w_{pj} \\ w_{jn1}+w_{jn3} &= w_{jn} \\ \\ w_{pj1} \times m_{pj} + w_{jn1} \times m_{jn} &= w_j \times m_j \\ w_{pj3} \times m_{pj} + w_{jn3} \times m_{jn} &= w_{pj} \times m_{pj}+w_{jn} \times m_{jn}-w_j \times aj = w_{pjn} \times m_{pjn} \end{align} \]

  • and therefore

\[ \begin{align} w_{pj1} \times (m_{pj} - m_j )^2 &+ w_{jn1} \times (m_{jn} - m_j )^2 &= w_{pj1} \times m_{pj}^2 &+ w_{jn1} \times m_{jn}^2 - w_{j} \times m_{j}^2 &> 0 \\ w_{pj3} \times (m_{pj} - m_{pjn})^2 &+ w_{jn3} \times (m_{jn} - m_{pjn})^2 &= w_{pj3} \times m_{pj}^2 &+ w_{jn3} \times m_{jn}^2 - w_{pjn} \times m_{pjn}^2 &> 0 \end{align} \]

  • summation of these two inequalities gives:

\[ w_{pj} \times m_{pj}^2 + w_{jn} \times m_{jn}^2 - w_j \times m_j^2 - w_{pjn} \times m_{pjn}^2 > 0 \]

  • thus

\[ w_{jn} \times m_{jn}^2 - w_j \times m_j^2 > w_{pjn} \times m_{pjn}^2 - w_{pj} \times m_{pj}^2 \]

  • which can be restated as

\[ ssm(\{p2..i2\}) - ssm(\{p2..i1\}) > ssm(\{p1..i2\}) - ssm(\{p1..i1\}) \]

  • Given that : \( SSM_{i1,j} = SSM_{p2-1,j-1} + ssm(\{p2..i1\}) \ge SSM_{p1-1,j-1} + ssm(\{p1..i1\}) \text{ for all } p1 \lt p2 \) when \( p2 \) is taken as \( CB_{i1,j} \),
  • adding the last two inequalities results in:

\[ SSM_{p2-1,j-1} + ssm(\{p2..i2\}) \gt SSM_{p1-1,j-1} + ssm(\{p1..i2\}) \text{ for all } n \ge i \]

  • given the definition for \( CB \), this implies that \( CB_{i2,j} \) cannot be \( p1 \) and therefore \( CB_{i2,j} \ge p2 = CB_{i1,j} \), QED.

Further improvements

Note that

  • for each \( i \), and \( j < i \): \( SSM_{i,j} < SSM_{i,j+1} \) since splitting up any class would make the total \( SSM \) larger.
  • it is not always true that \( SSM_{i,j} <= SSM_{i+1,j} \), for example, take \( v_1 = -10 \) and \( v_2 = 0 \), then \( SSM_{1,1} = 1 \times (-10)^2 = 100 \) and \( SSM_{2,1} = 2 \times (-5)^2 = 50 \)
  • From the previous paragraph, it follows that:

\[ ssm(\{p2..i2\}) - ssm(\{p1..i2\}) > ssm(\{p2..i1\}) - ssm(\{p1..i1\}) \]

  • for each \( i1 \), \( i2 > i1 \),and \( j < i1 \): \( SSM_{i2,j+1} - SSM_{i2,j} \ge SSM_{i1,j+1} - SSM_{i1,j} \), which can be proven by induction.

following the lines of the latter induction proof it can be shown (elaborate!) that for each \( i \) and \( j < i \): \( CB_{i,j} \le CB_{i,j+1} \), thus \( CB_{i,j} \) can be used as a lower bound when calculating \( CB_{i,j+1} \).

Previously known Algorithms

  • Iteratively moving values from classes with high variation to classes with low variation is discussed on Wikipedia, which does not always result in an optimal solution. Start for example with Class 1 with (1 4) and Class 2 with (99 100). The described algorithms prescribes that 4 needs to go to Class 2 since Class 1 has the highest variation.
  • Local moves as long as the total sum of squared deviations decreases (as described in the ESRI FAQ). Note that this has the risk of sticking to a local optimum. For example when starting with Class 1 = {1, 8, 9, 10} and Class 2 = {16}, the SSD = 50. When moving the 10 to class 2, the SSD increases to 56, but the global minimum is reached when Class 1 = (1) and Class 2 = (8, 9, 10, 16) with an SSD of 38.75.

Source Code

Source code is part of the [GeoDms] and can be found [->here<-] or in our Subversion Repository, look for the function ClassifyJenksFisher() .

Acknowledgements

This algorithm and proof of its validity and complexity is found by Maarten Hilferink, also being the author of this page. © Object Vision BV. The work for developing this method has partly been made possible by a software development request from the [PBL Netherlands Environmental Assessment Agency].

References

  • Fisher, W. D. (1958). On grouping for maximum homogeneity. American Statistical Association Journal, 53, 789-798.
  • Jenks, G. F. (1977). Optimal data classification for choropleth maps, Occasional paper No. 2. Lawrence, Kansas: University of Kansas, Department of Geography.
  • Evans, I.S. (1977). The selection of class intervals. Transactions of the Institute of British Geographers, 2:98-124.

Links

Personal tools