
# Qualitative variation

Article Id: WHEBN0009252619
Reproduction Date:

 Title: Qualitative variation Author: World Heritage Encyclopedia Language: English Subject: Collection: Publisher: World Heritage Encyclopedia Publication Date:


An index of qualitative variation (IQV) is a measure of statistical dispersion in nominal distributions. There are a variety of these, but they have been relatively little-studied in the statistics literature. The simplest is the variation ratio, while more complex indices include the information entropy.

## Properties

There are several types of indexes used for the analysis of nominal data. Several are standard statistics that are used elsewhere - range, standard deviation, variance, mean deviation, coefficient of variation, median absolute deviation, interquartile range and quartile deviation.

In addition to these, several statistics have been developed with nominal data in mind. A number have been summarized and devised by Wilcox (Wilcox 1967), (Wilcox 1973), who requires the following standardization properties to be satisfied:

• Variation varies between 0 and 1.
• Variation is 0 if and only if all cases belong to a single category.
• Variation is 1 if and only if cases are evenly divided across all categories.[1]

In particular, the value of these standardized indices does not depend on the number of categories or number of samples.

For any index, the closer the distribution is to uniform, the larger the variation; the larger the differences in frequencies across categories, the smaller the variation.

Indices of qualitative variation are then analogous to information entropy, which is minimized when all cases belong to a single category and maximized in a uniform distribution. Indeed, information entropy can be used as an index of qualitative variation.

One characterization of a particular index of qualitative variation (IQV) is as a ratio of observed differences to maximum differences.

## Wilcox's indexes

Wilcox gives a number of formulae for various indices of qualitative variation (Wilcox 1973). The first, which he designates DM for "Deviation from the Mode", is a standardized form of the variation ratio and is analogous to variance as deviation from the mean.

### ModVR

The formula for the variation around the mode (ModVR) is derived as follows:

M = \sum_{ i = 1 }^K ( f_m - f_i )

where fm is the modal frequency, K is the number of categories and fi is the frequency of the ith group.

This can be simplified to

M = Kf_m - N

where N is the total size of the sample.

Freeman's index (or variation ratio) is[2]

v = 1 - \frac{ f_m }{ N }

This is related to M as follows:

\frac{ ( \frac{ f_m }{ N } ) - \frac{ 1 }{ K } }{ \frac{ N }{ K }\frac{ ( K - 1 )} { N } } = \frac{ M }{ N( K - 1 ) }

The ModVR is defined as

ModVR = 1 - \frac{ Kf_m - N }{ N( K - 1 ) } = \frac{ K( N - f_m ) }{ N ( K - 1 ) } = \frac{ K v }{ K - 1 }

where v is Freeman's index.

Low values of ModVR correspond to small amount of variation and high values to larger amounts of variation.

When K is large, ModVR is approximately equal to Freeman's index v.
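As an illustration, the following minimal Python sketch computes Freeman's index v and ModVR from a list of category frequencies. The function names `variation_ratio` and `mod_vr` are illustrative and do not come from Wilcox; the sketch assumes at least two categories and a non-empty sample.

```python
def variation_ratio(freqs):
    """Freeman's index v = 1 - f_m / N for a list of category frequencies."""
    n = sum(freqs)                 # N, the total sample size
    return 1 - max(freqs) / n      # f_m is the modal (largest) frequency


def mod_vr(freqs):
    """ModVR = K * v / (K - 1), where K is the number of categories."""
    k = len(freqs)
    return k * variation_ratio(freqs) / (k - 1)


# An even split gives the maximum value; a single occupied category gives 0.
print(mod_vr([10, 10, 10, 10]))
print(mod_vr([40, 0, 0, 0]))
```

Categories with zero frequency must be included in the list, because K counts every possible category and not only those observed.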

### RanVR

This is based on the range around the mode. It is defined to be

RanVR = 1 - \frac{ f_m - f_l }{ f_m } = \frac{ f_l }{ f_m }

where fm is the modal frequency and fl is the lowest frequency.

### AvDev

This is an analog of the mean deviation. It is defined as the arithmetic mean of the absolute differences of each value from the mean.

AvDev = 1 - \frac{ 1 }{ 2N } \frac{ K }{ K - 1 } \sum^K_{ i = 1 }| f_i - \frac{ N }{ K } |

### MNDif

This is an analog of the mean difference - the average of the differences of all the possible pairs of variate values, taken regardless of sign. The mean difference differs from the mean and standard deviation because it is dependent on the spread of the variate values among themselves and not on the deviations from some central value.[3]

MNDif = 1 - \frac{1}{ N( K - 1 ) } \sum_{ i = 1 }^{ K - 1 } \sum_{ j = i + 1 }^K | f_i - f_j |

where fi and fj are the ith and jth frequencies respectively.

The MNDif is the Gini coefficient applied to qualitative data.

### VarNC

This is an analog of the variance.

VarNC = 1 - \frac{ 1 }{ N^2 }\frac{ K }{ ( K - 1 ) } \sum( f_i - \frac{ N }{ K } )^2

It is the same index as Mueller and Schuessler's Index of Qualitative Variation[4] and Gibbs' M2 index.

It is distributed as a chi square variable with K - 1 degrees of freedom.[5]

### StDev

Wilcox has suggested two versions of this statistic.

The first is based on AvDev.

StDev_1 = 1 - \sqrt{ \frac{ \sum_{ i = 1 }^K( f_i - \frac{ N }{ K } )^2 }{ ( N - \frac{ N }{ K } )^2 + ( K - 1 ) ( \frac{ N }{ K } )^2 } }

The second is based on MNDif.

StDev_2 = 1 - \sqrt{ \frac{ \sum^{ K - 1 }_{ i = 1 } \sum^K_{j = i + 1 } ( f_i - f_j )^2 }{ N^2 ( K - 1 )} }

### HRel

This index was originally developed by Claude Shannon for use in specifying the properties of communication channels.

HRel = \frac{ - \sum p_i \log_2 p_i }{ \log_2 K }

where pi = fi / N.
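A minimal Python sketch of HRel follows. The function name `h_rel` is illustrative; the sketch assumes that empty categories contribute nothing to the sum (the usual convention that p log p tends to 0 as p tends to 0) and that there are at least two categories.

```python
import math


def h_rel(freqs):
    """Relative entropy HRel = H / log2(K) for a list of category frequencies."""
    n = sum(freqs)
    k = len(freqs)
    # Assumption: categories with zero frequency are skipped (0 * log 0 := 0).
    h = -sum((f / n) * math.log2(f / n) for f in freqs if f > 0)
    return h / math.log2(k)


print(h_rel([5, 5, 5, 5]))    # evenly divided sample
print(h_rel([20, 0, 0, 0]))   # all cases in one category
```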

### B index

Wilcox adapted a proposal of Kaiser[6] based on the geometric mean and created the B index. The B index is defined as

B = 1 - \sqrt{ 1 - \left[ \sqrt[K]{ \prod_{ i = 1 }^K \frac{ f_i K }{ N } } \right]^2 }

### R packages

Several of these indices have been implemented in the R language.[7]

## Gibbs' indices and related formulae

Gibbs et al. proposed six indexes.[8]

### M1

The unstandardized index (M1) (Gibbs 1975, p. 471) is

M1 = 1 - \sum_{ i = 1 }^K p_i^2

where K is the number of categories and p_i = f_i / N is the proportion of observations that fall in a given category i.

M1 can be interpreted as one minus the likelihood that a random pair of samples will belong to the same category (Lieberson 1969, p. 851), so this formula for IQV is a standardized likelihood of a random pair falling in the same category. This index has also been referred to as the index of differentiation, the index of sustenance differentiation and the geographical differentiation index, depending on the context in which it has been used.

### M2

A second index, M2[9] (Gibbs 1975, p. 472), is:

M2 = \frac{ K }{ K - 1 } \left( 1 - \sum_{ i = 1 }^K p_i^2 \right)

where K is the number of categories and p_i = f_i / N is the proportion of observations that fall in a given category i. The factor of \frac{ K }{ K - 1 } is for standardization.

M1 and M2 can be interpreted in terms of variance of a multinomial distribution (Swanson 1976) (there called an "expanded binomial model"). M1 is the variance of the multinomial distribution and M2 is the ratio of the variance of the multinomial distribution to the variance of a binomial distribution.
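The following Python sketch computes M1 and M2 from category frequencies and checks numerically that M2 coincides with Wilcox's VarNC, as stated in the VarNC section. The function names are illustrative only, and the sketch assumes at least two categories.

```python
def m1(freqs):
    """M1 = 1 - sum(p_i^2), with p_i = f_i / N."""
    n = sum(freqs)
    return 1 - sum((f / n) ** 2 for f in freqs)


def m2(freqs):
    """M2 = K / (K - 1) * M1, the standardized form of M1."""
    k = len(freqs)
    return k / (k - 1) * m1(freqs)


def var_nc(freqs):
    """VarNC = 1 - (1 / N^2) * (K / (K - 1)) * sum((f_i - N / K)^2)."""
    n = sum(freqs)
    k = len(freqs)
    squared_dev = sum((f - n / k) ** 2 for f in freqs)
    return 1 - (k / (k - 1)) * squared_dev / n ** 2


freqs = [12, 7, 1]
print(m1(freqs), m2(freqs), var_nc(freqs))  # the last two values agree
```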

### M4

The M4 index is

M4 = \frac{ \sum_{ i = 1 }^K | X_i - m | }{ 2 \sum_{ i = 1 }^K X_i }

where m is the mean.

### M6

The formula for M6 is

M6 = K \left[ 1 - \frac{ \sum_{ i = 1 }^K | X_i - m | }{ 2 N } \right]

where K is the number of categories, Xi is the number of data points in the ith category, N is the total number of data points, || is the absolute value (modulus) and m is the mean number of data points per category:

m = \frac{ \sum_{ i = 1 }^K X_i }{ K } = \frac{ N }{ K }

This formula can be simplified to

M6 = K\left[ 1 - \frac{ \sum_{ i = 1 }^K | p_i - \frac{ 1 }{ K } | }{ 2 } \right]

where pi is the proportion of the sample in the ith category.

In practice M1 and M6 tend to be highly correlated, which militates against their combined use.

### Related indices

The sum

\sum_{ i = 1 }^K p_i^2

has also found application. This is known as the Simpson index in ecology and as the Herfindahl index or the Herfindahl-Hirschman index (HHI) in economics. A variant of this is known as the Hunter–Gaston index in microbiology.[10]

In linguistics and cryptanalysis this sum is known as the repeat rate. The index of coincidence (IC) is an unbiased estimator of this statistic:[11]

IC = \sum \frac{ f_i ( f_i - 1 ) }{ n ( n - 1 ) }

where fi is the count of the ith grapheme in the text and n is the total number of graphemes in the text.
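As a small sketch, the IC formula above can be evaluated in Python as follows. The function name `index_of_coincidence` is illustrative, and treating only alphabetic characters (case-folded) as graphemes is an assumption made for this example; the text must contain at least two such characters.

```python
from collections import Counter


def index_of_coincidence(text):
    """IC = sum(f_i * (f_i - 1)) / (n * (n - 1)) over the graphemes of a text."""
    # Assumption: graphemes are the alphabetic characters, compared case-insensitively.
    counts = Counter(ch for ch in text.lower() if ch.isalpha())
    n = sum(counts.values())
    return sum(f * (f - 1) for f in counts.values()) / (n * (n - 1))


print(index_of_coincidence("aabb"))
print(index_of_coincidence("abcd"))  # no repeated grapheme
```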

#### M1

The M1 statistic defined above has been proposed several times in a number of different settings under a variety of names. These include Gini's index of mutability,[12] Simpson's measure of diversity,[13] Bachi's index of linguistic homogeneity,[14] Mueller and Schuessler's index of qualitative variation,[15] Gibbs and Martin's index of industry diversification,[16] Lieberson's index,[17] and Blau's index in sociology, psychology and management studies.[18] The formulations of all these indices are identical.

Simpson's D is defined as

D = 1 - \sum_{ i = 1 }^K { \frac{ n_i ( n_i - 1 ) }{ n( n - 1 ) } }

where n is the total sample size and ni is the number of items in the ith category.

For large n we have

D \sim 1 - \sum_{ i = 1 }^K p_i^2

Another statistic that has been proposed is the coefficient of unalikeability which ranges between 0 and 1.[19]

u = \frac{ \sum_{ i \ne j } c( x_i, x_j ) }{ n^2 - n }

where n is the sample size and c(xi, xj) = 1 if xi and xj are unalike and 0 otherwise, with the sum taken over all ordered pairs of distinct observations.

For large n we have

u \sim 1 - \sum_{ i = 1 }^K p_i^2

where K is the number of categories.

Another related statistic is the quadratic entropy

H^2 = 2 \left( 1 - \sum_{ i = 1 }^K p_i^2 \right)

which is itself related to the Gini index.

#### M2

Greenberg's monolingual non weighted index of linguistic diversity[20] is the M2 statistic defined above.

#### M7

Another index – the M7 – was created based on the M4 index of Gibbs et al.[21]

M7 = \frac{ \sum_{ i = 1 }^K \sum_{ j = 1 }^L | R_{ ij } - R | }{ 2 \sum_{ i = 1 }^K \sum_{ j = 1 }^L R_{ ij } }

where

R_{ ij } = \frac{ O_{ ij } } { E_{ ij } } = \frac{ O_{ ij } }{ n_i p_j }

and

R = \frac{ \sum_{ i = 1 }^K \sum_{ j = 1 }^L R_{ ij } }{ \sum_{ i = 1 }^K n_i }

where K is the number of categories, L is the number of subtypes, Oij and Eij are the number observed and expected respectively of subtype j in the ith category, ni is the number in the ith category and pj is the proportion of subtype j in the complete sample.

Note: This index was designed to measure women's participation in the work place: the two subtypes it was developed for were male and female.

## Other single sample indices

These indices are summary statistics of the variation within the sample.

### Berger–Parker index

The Berger–Parker index equals the maximum p_i value in the dataset, i.e. the proportional abundance of the most abundant type.[22] This corresponds to the weighted generalized mean of the p_i values when q approaches infinity, and hence equals the inverse of true diversity of order infinity (1/D).

### Brillouin index of diversity

This index is strictly applicable only to entire populations rather than to finite samples. It is defined as

I_B = \frac{ \log( N! ) - \sum_{ i = 1 }^K ( \log( n_i! ) ) }{ N }

where N is total number of individuals in the population, ni is the number of individuals in the ith category and N! is the factorial of N. Brillouin's index of evenness is defined as

E_B = I_B / I_{B( \max )}

where IB(max) is the maximum value of IB.

### Hill's diversity numbers

Hill suggested a family of diversity numbers[23]

N_a = \frac{ 1 }{ \left[ \sum_{ i = 1 }^K p_i^a \right]^{ 1 / ( a - 1 ) } }

For given values of a several of the other indices can be computed

• a = 0: Na = species richness
• a = 1: Na = the exponential of Shannon's index (taken as the limit a → 1)
• a = 2: Na = 1/Simpson's index (without the small sample correction)
• a = ∞: Na = 1/Berger–Parker index (taken as the limit a → ∞)

Hill also suggested a family of evenness measures

E_{ a, b } = \frac{ N_a }{ N_b }

where a > b.

Hill's E4 is

E_4 = \frac{ N_2 } { N_1 }

Hill's E5 is

E_5 = \frac{ N_2 - 1 } { N_1 - 1 }
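The sketch below evaluates Hill's numbers in Python using the standard form N_a = (Σ p_i^a)^(1/(1 − a)), with the cases a = 1 and a = ∞ handled as limits. The function names `hill_number` and `hill_e5` are illustrative, and dropping categories with zero proportion is an assumption of this example.

```python
import math


def hill_number(props, a):
    """Hill number N_a = (sum p_i^a)^(1 / (1 - a)) for category proportions."""
    props = [p for p in props if p > 0]  # assumption: empty categories are dropped
    if a == 1:
        # Limit a -> 1: exponential of the Shannon entropy (natural logarithm).
        return math.exp(-sum(p * math.log(p) for p in props))
    if math.isinf(a):
        # Limit a -> infinity: reciprocal of the largest proportion.
        return 1 / max(props)
    return sum(p ** a for p in props) ** (1 / (1 - a))


def hill_e5(props):
    """Hill's evenness E5 = (N_2 - 1) / (N_1 - 1); undefined when N_1 equals 1."""
    return (hill_number(props, 2) - 1) / (hill_number(props, 1) - 1)


props = [0.5, 0.25, 0.25]
for a in (0, 1, 2, math.inf):
    print(a, hill_number(props, a))
```

For a uniform distribution over K categories every N_a equals K, which gives a quick sanity check of an implementation.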

### Margalef's index

I_\mathrm{Marg} = \frac{ S - 1 }{ \log_e N }

where S is the number of data types in the sample and N is the total size of the sample.[24]

### Menhinick's index

I_\mathrm{Men} = \frac{ S }{ \sqrt{ N } }

where S is the number of data types in the sample and N is the total size of the sample.[25]

In linguistics this index is identical to the Kuraszkiewicz index (Guiraud index), where S is the number of distinct words (types) and N is the total number of words (tokens) in the text being examined.[26][27] This index can be derived as a special case of the Generalised Torquist function.[28]

### Q statistic

This is a statistic invented by Kempton and Taylor[29] and involves the quartiles of the sample. It is defined as

Q = \frac{ \frac{ 1 }{ 2 } ( n_{ R_1 } + n_{ R_2 } ) + \sum_{ j = R_1 + 1 }^{ R_2 - 1 } n_j } { \log( R_2 / R_1 ) }

where R1 and R2 are the 25% and 75% quartiles respectively on the cumulative species curve, nj is the number of species in the jth category and nRi is the number of species in the class where Ri falls (i = 1 or 2).

### Shannon–Wiener index

This is taken from information theory

H = - \sum p_i \log_e( p_i ) = \log_e N - \frac{ 1 }{ N } \sum n_i \log_e( n_i )

where N is the total number in the sample, ni is the number in the ith category and pi = ni / N is the proportion in the ith category.

In ecology where this index is commonly used, H usually lies between 1.5 and 3.5 and only rarely exceeds 4.0.
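As a minimal sketch, the Shannon–Wiener index can be computed in Python either from the proportions, H = −Σ p_i ln p_i, or from the raw counts, H = ln N − (1/N) Σ n_i ln n_i; the two forms are algebraically identical. The function names are illustrative, and skipping empty categories is an assumption of this example.

```python
import math


def shannon_from_proportions(counts):
    """H = -sum(p_i * ln p_i), with p_i = n_i / N."""
    n = sum(counts)
    return -sum((c / n) * math.log(c / n) for c in counts if c > 0)


def shannon_from_counts(counts):
    """H = ln N - (1 / N) * sum(n_i * ln n_i)."""
    n = sum(counts)
    return math.log(n) - sum(c * math.log(c) for c in counts if c > 0) / n


counts = [50, 30, 15, 5]
print(shannon_from_proportions(counts))
print(shannon_from_counts(counts))  # same value up to rounding
```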

An approximate formula for the standard deviation (SD) of H is

SD( H ) = \sqrt{ \frac{ 1 }{ N } \left[ \sum p_i [ \log_e( p_i ) ]^2 - H^2 \right] }

where pi is the proportion made up by the ith category and N is the total in the sample.

A more accurate approximate value of the variance of H(var(H)) is given by[30]

\operatorname{var}( H ) = \frac{ \sum p_i [ \log( p_i ) ]^2 - \left[ \sum p_i \log( p_i ) \right]^2 } { N } + \frac{ K - 1 }{ 2N^2 } + \frac{ -1 + \sum p_i^2 - \sum p_i^{ -1 } \log( p_i ) + \sum p_i^{ -1 }\sum p_i \log( p_i ) }{ 6N^3 }

where N is the sample size and K is the number of categories.

A related index is the Pielou J defined as

J = \frac{ H } {\log_e( S ) }

One difficulty with this index is that S is unknown for a finite sample. In practice S is usually set to the maximum present in any category in the sample.

### Rényi entropy

The Rényi entropy is a generalization of the Shannon entropy to other values of q than unity. It can be expressed:

{}^qH = \frac{ 1 }{ 1 - q } \; \ln\left ( \sum_{ i = 1 }^K p_i^q \right )

which equals

{}^qH = \ln\left( \frac{ 1 }{ \sqrt[ q - 1 ]{ \sum_{ i = 1 }^K p_i^q } } \right)

H_\mathrm{ obs } = \sum \frac{ x_{ ij } + x_{ kj } }{ X + Y } \log \frac{ X + Y }{ x_{ ij } + x_{ kj } }

In these equations xij and xkj are the number of times the jth data type appears in the ith or kth sample respectively.

### Rarefaction index

In a rarefied sample a random subsample of n items is chosen from the total N items. Some groups may be absent from this subsample. Let X_n be the number of groups still present in the subsample of n items. X_n is less than K, the number of categories, whenever at least one group is missing from this subsample.

The rarefaction curve, f_n is defined as:

f_n = E[ X_n ] = K - \binom{ N }{ n }^{ -1 } \sum_{ i = 1 }^K \binom{ N - N_i }{ n }

Here N_i is the number of items in the ith group. Note that 0 ≤ f(n) ≤ K.

Furthermore,

f( 0 )= 0,\ f( 1 ) = 1,\ f( N ) = K .

Despite being defined at discrete values of n, these curves are most frequently displayed as continuous functions.[38]

This index is discussed further in Rarefaction (ecology).
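The rarefaction curve can be evaluated exactly with binomial coefficients, as in the following Python sketch. The function name `rarefaction` is illustrative, N_i is taken to be the number of items in the ith group, and `math.comb` requires Python 3.8 or later.

```python
from math import comb


def rarefaction(group_sizes, n):
    """Expected number of groups f_n present in a random subsample of n items."""
    total = sum(group_sizes)   # N, the total number of items
    k = len(group_sizes)       # K, the number of groups
    # comb(N - N_i, n) / comb(N, n) is the chance that group i is absent;
    # comb returns 0 when n exceeds N - N_i, so no special case is needed.
    absent = sum(comb(total - size, n) for size in group_sizes)
    return k - absent / comb(total, n)


sizes = [5, 3, 2]
print([rarefaction(sizes, n) for n in range(sum(sizes) + 1)])
```

The boundary values f(0) = 0, f(1) = 1 and f(N) = K follow directly from this expression and make a convenient check.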

### Caswell's V

This is a z type statistic based on Shannon's entropy.[39]

V = \frac{ H - E( H ) }{ SD( H ) }

where H is the Shannon entropy, E(H) is the expected Shannon entropy for a neutral model of distribution and SD(H) is the standard deviation of the entropy. The standard deviation is estimated from the formula derived by Pielou

SD( H ) = \sqrt{ \frac{ 1 }{ N } \left[ \sum p_i [ \log_e( p_i ) ]^2 - H^2 \right] }

where pi is the proportion made up by the ith category and N is the total in the sample.

### Lloyd & Ghelardi's index

This is

I_{ LG } = \frac{ K }{ K' }

where K is the number of categories and K' is the number of categories according to MacArthur's broken stick model yielding the observed diversity.

### Average taxonomic distinctness index

This index is used to compare the relationship between hosts and their parasites.[40] It incorporates information about the phylogenetic relationship amongst the host species.

S_{ TD } = 2 \frac{ \sum \sum_{ i < j } \omega_{ ij } }{ s( s - 1 ) }

where s is the number of host species used by a parasite and ωij is the taxonomic distinctness between host species i and j.

## Indices for comparison of two or more data types within a single sample

Several of these indexes have been developed to document the degree to which different data types of interest may coexist within a geographic area.

### Index of dissimilarity

Let A and B be two types of data item. Then the index of dissimilarity is

D = \frac{ 1 }{ 2 } \sum_{ i = 1 }^K \left| \frac{ A_i }{ A } - \frac{ B_i }{ B } \right|

where

A = \sum_{ i = 1 }^K A_i
B = \sum_{ i = 1 }^K B_i

Ai is the number of data type A at sample site i, Bi is the number of data type B at sample site i, K is the number of sites sampled and || is the absolute value.

This index is conventionally denoted D.[41] It is closely related to the Gini index.

This index is biased, as its expectation under a uniform distribution is greater than 0.

A modification of this index has been proposed by Gorard and Taylor.[42] Their index (GT) is

GT = D \left( 1 - \frac{ A }{ A + B } \right)
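A minimal Python sketch of D and of the Gorard and Taylor modification follows. The function names are illustrative; each argument is assumed to be a list of counts per site, in the same site order, with a non-zero total for each data type.

```python
def dissimilarity(a_counts, b_counts):
    """D = 0.5 * sum(|A_i / A - B_i / B|) over the K sites."""
    a_total = sum(a_counts)
    b_total = sum(b_counts)
    return 0.5 * sum(
        abs(a / a_total - b / b_total) for a, b in zip(a_counts, b_counts)
    )


def gorard_taylor(a_counts, b_counts):
    """GT = D * (1 - A / (A + B))."""
    a_total = sum(a_counts)
    b_total = sum(b_counts)
    return dissimilarity(a_counts, b_counts) * (1 - a_total / (a_total + b_total))


print(dissimilarity([10, 0], [0, 10]))   # the two types never share a site
print(dissimilarity([5, 5], [20, 20]))   # identical proportions at every site
print(gorard_taylor([10, 0], [0, 10]))
```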

### Index of segregation

The index of segregation (IS)[43] is

IS = \frac{ 1 }{ 2 }\sum_{ i = 1 }^K \left| \frac{ A_i }{ A } - \frac{ t_i - A_i }{ T - A } \right|

where

A = \sum_{ i = 1 }^K A_i
T = \sum_{ i = 1 }^K t_i

and K is the number of units, Ai is the number of data type A in unit i and ti is the total number of all data types in unit i.

### Hutchen's square root index

This index (H) is defined as[44]

H = 1 - \sum_{ i = 1}^K \sum_{ j = 1 }^i \sqrt{ p_i p_j }

where pi is the proportion of the sample composed of the ith variate.

### Lieberson's isolation index

This index ( Lxy ) was invented by Lieberson in 1981.[45]

L_{ xy } = \frac{ 1 }{ N } \sum_{ i = 1 }^K \frac{ X_i Y_i }{ X_\mathrm{ tot } }

where Xi and Yi are the variables of interest at the ith site, K is the number of sites examined and Xtot is the total number of variate of type X in the study.

### Bell's index

This index is defined as[46]

I_R = \frac{ p_{ xx } - p_x } { 1 - p_x }

where px is the proportion of the sample made up of variates of type X and

p_{ xx } = \frac{ \sum_{ i = 1 }^K x_i p_i }{ N_x }

where Nx is the total number of variates of type X in the study, K is the number of samples in the study and xi and pi are the number of variates and the proportion of variates of type X respectively in the ith sample.

### Index of isolation

The index of isolation is

II = \sum_{ i = 1 }^K \frac{ A_i }{ A } \frac{ A_i }{ t_i }

where K is the number of units in the study, and Ai and ti are the number of units of type A and the number of all units in the ith sample respectively.

A modified index of isolation has also been proposed

MII = \frac{ II - \frac{ A }{ T } }{ 1 - \frac{ A }{ T } }

The MII lies between 0 and 1.

### Gorard's index of segregation

This index (GS) is defined as

GS = \frac{ 1 }{ 2 } \sum_{ i = 1 }^K | \frac{ A_i }{ A } - \frac{ t_i }{ T } |

where

A = \sum_{ i = 1 }^K A_i
T = \sum_{ i = 1 }^K t_i

and Ai and ti are the number of data items of type A and the total number of items in the ith sample.

### Index of exposure

This index is defined as

IE = \sum_{ i = 1 }^K \frac{ A_i }{ A } \frac{ B_i }{ t_i }

where

A = \sum_{ i = 1 }^K A_i

and Ai and Bi are the number of types A and B in the ith category and ti is the total number of data points in the ith category.

### Ochiai index

This is a binary form of the cosine index.[47] It is used to compare presence/absence data of two data types (here A and B). It is defined as

O = \frac{ a }{ \sqrt{ ( a + b )( a + c ) } }

where a is the number of sample units where both A and B are found, b is number of sample units where A but not B occurs and c is the number of sample units where type B is present but not type A.

### Kulczyński's coefficient

This coefficient was invented by Stanisław Kulczyński in 1927[48] and is an index of association between two types (here A and B). It varies in value between 0 and 1. It is defined as

K = \frac{ a }{ 2 } ( \frac{ 1 }{ a + b } + \frac{ 1 }{ a + c } )

where a is the number of sample units where type A and type B are present, b is the number of sample units where type A but not type B is present and c is the number of sample units where type B is present but not type A.

### Yule's Q

This index was invented by Yule in 1900.[49] It concerns the association of two different types (here A and B). It is defined as

Q = \frac{ ad - bc }{ ad + bc }

where a is the number of samples where types A and B are both present, b is where type A is present but not type B, c is the number of samples where type B is present but not type A and d is the sample count where neither type A nor type B are present. Q varies in value between -1 and +1. In the ordinal case Q is known as the Goodman-Kruskal γ.

Because the denominator may be zero, Leinhert and Sporer have recommended adding 1 to each of a, b, c and d.[50]

### Yule's Y

This index is defined as

Y = \frac{ \sqrt{ ad } - \sqrt{ bc } }{ \sqrt{ ad } + \sqrt{ bc } }

where a is the number of samples where types A and B are both present, b is where type A is present but not type B, c is the number of samples where type B is present but not type A and d is the sample count where neither type A nor type B are present.
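The following Python sketch computes Yule's Y from the four cell counts, together with Yule's Q taken in its standard form Q = (ad − bc)/(ad + bc). The function names are illustrative, and no correction for a zero denominator is applied, so the sketch assumes that ad + bc is greater than zero.

```python
import math


def yules_q(a, b, c, d):
    """Yule's Q = (ad - bc) / (ad + bc)."""
    return (a * d - b * c) / (a * d + b * c)


def yules_y(a, b, c, d):
    """Yule's Y = (sqrt(ad) - sqrt(bc)) / (sqrt(ad) + sqrt(bc))."""
    root_ad = math.sqrt(a * d)
    root_bc = math.sqrt(b * c)
    return (root_ad - root_bc) / (root_ad + root_bc)


a, b, c, d = 10, 5, 5, 10
print(yules_q(a, b, c, d))
print(yules_y(a, b, c, d))
```

Both statistics have the same sign and lie between -1 and +1; Y is closer to zero than Q for the same table.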

### Baroni-Urbani-Buser coefficient

This index was invented by Baroni-Urbani and Buser in 1976.[51] It varies between 0 and 1 in value. It is defined as

BUB = \frac{ \sqrt{ ad } + a }{ \sqrt{ ad } + a + b + c }

where a is the number of samples where types A and B are both present, b is where type A is present but not type B, c is the number of samples where type B is present but not type A and d is the sample count where neither type A nor type B are present. When d = 0, this index is identical to the Jaccard index.

### Hamman coefficient

This coefficient is defined as

H = \frac{ ( a + d ) - ( b + c ) }{ a + b + c + d }

where a is the number of samples where types A and B are both present, b is where type A is present but not type B, c is the number of samples where type B is present but not type A and d is the sample count where neither type A nor type B are present.

### Rogers-Tanimoto coefficient

This coefficient is defined as

RT = \frac{ ( a + d ) }{ a + 2( b + c ) + d }

where a is the number of samples where types A and B are both present, b is where type A is present but not type B, c is the number of samples where type B is present but not type A and d is the sample count where neither type A nor type B are present.

### Sokal-Sneath coefficient

This coefficient is defined as

SS = \frac{ 2( a + d ) }{ 2( a + d ) + b + c }

where a is the number of samples where types A and B are both present, b is where type A is present but not type B, c is the number of samples where type B is present but not type A and d is the sample count where neither type A nor type B are present.

### Sokal's binary distance

This coefficient is defined as

SBD = \sqrt{ \frac{ b + c }{ a + b + c + d } }

where a is the number of samples where types A and B are both present, b is where type A is present but not type B, c is the number of samples where type B is present but not type A and d is the sample count where neither type A nor type B are present.

### Russell-Rao coefficient

This coefficient is defined as

RR = \frac{ a }{ a + b + c + d }

where a is the number of samples where types A and B are both present, b is where type A is present but not type B, c is the number of samples where type B is present but not type A and d is the sample count where neither type A nor type B are present.

### Phi coefficient

This coefficient is defined as

\phi = \frac{ ad - bc }{ \sqrt{ ( a + b ) ( a + c ) ( b + d ) ( c + d ) } }

where a is the number of samples where types A and B are both present, b is where type A is present but not type B, c is the number of samples where type B is present but not type A and d is the sample count where neither type A nor type B are present.

### Soergel's coefficient

This coefficient is defined as

S = \frac{ b + c }{ b + c + d }

where b is the number of samples where type A is present but not type B, c is the number of samples where type B is present but not type A and d is the sample count where neither type A nor type B are present.

### Simpson's coefficient

This coefficient is defined as

S = \frac{ a }{ a + \min( b, c ) }

where a is the number of samples where types A and B are both present, b is the number of samples where type A is present but not type B and c is the number of samples where type B is present but not type A.

### Dennis' coefficient

This coefficient is defined as

D = \frac{ ad - bc }{ \sqrt{ ( a + b + c + d ) ( a + b ) ( a + c ) } }

where a is the number of samples where types A and B are both present, b is where type A is present but not type B, c is the number of samples where type B is present but not type A and d is the sample count where neither type A nor type B are present.

### Forbes' coefficient

This coefficient is defined as

F = \frac{ a ( a + b + c + d ) }{ ( a + b ) ( a + c ) }

where a is the number of samples where types A and B are both present, b is where type A is present but not type B, c is the number of samples where type B is present but not type A and d is the sample count where neither type A nor type B are present.

### Simple match coefficient

This coefficient is defined as

SM = \frac{ a + d }{ ( a + b + c + d ) }

where a is the number of samples where types A and B are both present, b is where type A is present but not type B, c is the number of samples where type B is present but not type A and d is the sample count where neither type A nor type B are present.

### Fossum's coefficient

This coefficient is defined as

F = \frac{ ( a + b + c + d ) ( a - 0.5 )^2 }{ ( a + b ) ( a + c ) }

where a is the number of samples where types A and B are both present, b is where type A is present but not type B, c is the number of samples where type B is present but not type A and d is the sample count where neither type A nor type B are present.

### Stile's coefficient

This coefficient is defined as

S = \log \left[ \frac{ n ( | ad - bc | - \frac{ n }{ 2 } )^2 }{ ( a + b ) ( a + c ) ( b + d )( c + d ) } \right]

where a is the number of samples where types A and B are both present, b is where type A is present but not type B, c is the number of samples where type B is present but not type A, d is the sample count where neither type A nor type B are present, n equals a + b + c + d and || is the modulus (absolute value) of the difference.

### Michael's coefficient

This coefficient is defined as

M = \frac{ 4 ( ad - bc ) }{ ( a + d )^2 + ( b + c )^2 }

where a is the number of samples where types A and B are both present, b is where type A is present but not type B, c is the number of samples where type B is present but not type A and d is the sample count where neither type A nor type B are present.

### Peirce's coefficient

In 1884 Peirce suggested the following coefficient

P = \frac{ ab + bc }{ ab + 2bc + cd }

where a is the number of samples where types A and B are both present, b is where type A is present but not type B, c is the number of samples where type B is present but not type A and d is the sample count where neither type A nor type B are present.

### Hawkin-Dotson coefficient

In 1975 Hawkin and Dotson proposed the following coefficient

HD = \frac{ 1 }{ 2 } ( \frac{ a }{ a + b + c } + \frac{ d }{ b + c + d } )

where a is the number of samples where types A and B are both present, b is where type A is present but not type B, c is the number of samples where type B is present but not type A and d is the sample count where neither type A nor type B are present.

## Indices for comparison between two or more samples

### Czekanowski's quantitative index

This is also known as the Bray–Curtis index, Schoener's index, least common percentage index, index of affinity or proportional similarity. It is related to the Sørensen similarity index.

CZI = \frac{ \sum \min( x_i, x_j ) }{ \sum ( x_i + x_j ) }

where xi and xj are the number of species in sites i and j respectively and the minimum is taken over the number of species in common between the two sites.

### Canberra metric

The Canberra distance is a weighted version of the L1 metric. It was introduced in 1966[52] and refined in 1967[53] by G. N. Lance and W. T. Williams. It is used to define a distance between two vectors - here two sites with K categories within each site.

The Canberra distance d between vectors p and q in a K-dimensional real vector space is

d ( \mathbf{ p }, \mathbf{ q } ) = \sum_{ i = 1 }^K \frac{ |p_i - q_i |}{ | p_i| + |q_i | }

where pi and qi are the values of the ith category of the two vectors.
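A short Python sketch of the Canberra distance is given below. The function name `canberra` is illustrative. The formula does not say how to treat a category in which both values are zero; skipping such terms, as done here, is an assumption of this example and not part of the definition.

```python
def canberra(p, q):
    """Canberra distance: sum of |p_i - q_i| / (|p_i| + |q_i|) over categories."""
    total = 0.0
    for p_i, q_i in zip(p, q):
        denominator = abs(p_i) + abs(q_i)
        # Assumption: a 0/0 term (both values zero) contributes nothing.
        if denominator:
            total += abs(p_i - q_i) / denominator
    return total


print(canberra([1, 3], [3, 1]))
print(canberra([1, 0], [0, 1]))  # each term reaches its maximum of 1
```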

### Sorensen's coefficient of community

This is used to measure similarities between communities.

CC = \frac{ 2c } { s_1 + s_2 }

where s1 and s2 are the number of species in community 1 and 2 respectively and c is the number of species common to both areas.

### Jaccard's index

This is a measure of the similarity between two samples:

J = \frac{ A }{ A + B + C }

where A is the number of data points shared between the two samples and B and C are the data points found only in the first and second samples respectively.

This index was invented in 1902 by the Swiss botanist Paul Jaccard.[54]

Under a random distribution the expected value of J is[55]

J = \frac{ 1 }{ A } ( \frac{ 1 }{ A + B + C } )

The standard error of this index with the assumption of a random distribution is

SE( J ) = \sqrt{ \frac{ A ( B + C ) } { N ( A + B + C )^3 } }

where N is the total size of the sample.

### Dice's index

This is a measure of the similarity between two samples:

D = \frac{ 2A }{ 2A + B + C }

where A is the number of data points shared between the two samples and B and C are the data points found only in the first and second samples respectively.
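As a sketch, Jaccard's index and Dice's index can be computed from two samples represented as Python sets. The function names and the helper `_shared_and_unique` are illustrative, and the sketch assumes that the two samples are not both empty.

```python
def _shared_and_unique(sample1, sample2):
    """Return (A, B, C): shared items, items only in sample 1, only in sample 2."""
    return len(sample1 & sample2), len(sample1 - sample2), len(sample2 - sample1)


def jaccard(sample1, sample2):
    """J = A / (A + B + C)."""
    a, b, c = _shared_and_unique(sample1, sample2)
    return a / (a + b + c)


def dice(sample1, sample2):
    """D = 2A / (2A + B + C)."""
    a, b, c = _shared_and_unique(sample1, sample2)
    return 2 * a / (2 * a + b + c)


s1 = {"oak", "ash", "elm"}
s2 = {"ash", "elm", "yew"}
print(jaccard(s1, s2), dice(s1, s2))
```

The two indices are monotonically related through D = 2J / (1 + J), so they always rank pairs of samples in the same order.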

### Match coefficient

This is a measure of the similarity between two samples:

M = \frac{ N - B - C }{ N }

where N is the number of data points in the two samples and B and C are the data points found only in the first and second samples respectively.

### Morisita's index

Morisita’s index of dispersion ( Im ) is the scaled probability that two points chosen at random from the whole population are in the same sample.[56] Higher values indicate a more clumped distribution.

I_m = \frac { n \sum x ( x - 1 ) } { n m ( n m - 1 ) }

An alternative formulation is

I_m = n \frac{ \sum x^2 - \sum x } { \left( \sum x \right)^2 - \sum x }

where n is the total sample size, m is the sample mean and x are the individual values with the sum taken over the whole sample. It is also equal to

I_m = \frac { n\ IMC } { nm - 1 }

where IMC is Lloyd's index of crowding.[57]
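The alternative formulation above can be computed directly from the counts per sample unit, as in this minimal Python sketch. The function name `morisita_index` is illustrative, n is taken to be the number of sample units (an assumption consistent with the large-sample test described for this index), and the total count must be at least 2.

```python
def morisita_index(counts):
    """I_m = n * (sum(x^2) - sum(x)) / ((sum(x))^2 - sum(x))."""
    n = len(counts)                    # number of sample units (assumed meaning of n)
    total = sum(counts)                # sum of x over all units
    sum_sq = sum(x * x for x in counts)
    return n * (sum_sq - total) / (total * total - total)


print(morisita_index([8, 0, 0, 0]))   # fully clumped: the index equals n
print(morisita_index([2, 2, 2, 2]))   # perfectly even: the index is below 1
```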

This index is relatively independent of the population density but is affected by the sample size.

Morisita showed that the statistic[56]

I_m \left( \sum x - 1 \right) + n - \sum x

is distributed as a chi-squared variable with n − 1 degrees of freedom.

An alternative significance test for this index has been developed for large samples.[58]

z = \frac { I_m - 1 } { 2 / n m^2 }

where m is the overall sample mean, n is the number of sample units and z is the normal distribution abscissa. Significance is tested by comparing the value of z against the values of the normal distribution.

### Standardised Morisita’s index

Smith-Gill developed a statistic based on Morisita’s index which is independent of both sample size and population density and bounded by −1 and +1. This statistic is calculated as follows[59]

First determine Morisita's index ( Id ) in the usual fashion. Then let k be the number of units the population was sampled from. Calculate the two critical values

M_u = \frac { \chi^2_{ 0.975 } - k + \sum x } { \sum x - 1 }
M_c = \frac { \chi^2_{ 0.025 } - k + \sum x } { \sum x - 1 }

where χ2 is the chi square value for n − 1 degrees of freedom at the 97.5% and 2.5% levels of confidence.

The standardised index ( Ip ) is then calculated from one of the formulae below

When Id ≥ Mc > 1

I_p = 0.5 + 0.5 \left( \frac { I_d - M_c } { k - M_c } \right)

When Mc > Id ≥ 1

I_p = 0.5 \left( \frac { I_d - 1 } { M_c - 1 } \right)

When 1 > Id ≥ Mu

I_p = -0.5 \left( \frac { I_d - 1 } { M_u - 1 } \right)

When 1 > Mu > Id

I_p = -0.5 + 0.5 \left( \frac { I_d - M_u } { M_u } \right)

Ip ranges between +1 and −1 with 95% confidence intervals of ±0.5. Ip has the value of 0 if the pattern is random; if the pattern is uniform, Ip < 0 and if the pattern shows aggregation, Ip > 0.
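The four cases can be sketched in Python as below. This is an illustration under stated assumptions rather than a reference implementation: the function name is hypothetical, the two chi-squared critical values are passed in by the caller (for example from printed tables) instead of being computed, and the case Mc > Id ≥ 1 divides by Mc − 1, which is the choice that keeps Ip continuous at Id = Mc.

```python
def standardised_morisita(counts, chi2_975, chi2_025):
    """Standardised Morisita index I_p.

    chi2_975 and chi2_025 are chi-squared critical values for
    len(counts) - 1 degrees of freedom, supplied by the caller;
    chi2_975 is the smaller of the two and chi2_025 the larger.
    """
    k = len(counts)                       # number of sample units
    total = sum(counts)                   # sum of x
    sum_sq = sum(x * x for x in counts)   # sum of x^2
    i_d = k * (sum_sq - total) / (total * total - total)

    m_u = (chi2_975 - k + total) / (total - 1)   # lower critical value
    m_c = (chi2_025 - k + total) / (total - 1)   # upper critical value

    if i_d >= m_c > 1:
        return 0.5 + 0.5 * (i_d - m_c) / (k - m_c)
    if m_c > i_d >= 1:
        # Assumption: denominator M_c - 1, so I_p = 0.5 when I_d = M_c.
        return 0.5 * (i_d - 1) / (m_c - 1)
    if 1 > i_d >= m_u:
        return -0.5 * (i_d - 1) / (m_u - 1)
    return -0.5 + 0.5 * (i_d - m_u) / m_u


# Critical values for 3 degrees of freedom, taken from standard tables.
print(standardised_morisita([0, 0, 0, 12], 0.216, 9.348))  # strongly clumped
print(standardised_morisita([3, 3, 3, 3], 0.216, 9.348))   # perfectly even
```

The clumped example returns the upper bound of +1, while the even example falls below −0.5, outside the 95% band for a random pattern.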

### Peet's evenness indices

These indices are a measure of evenness between samples.[60]

E_1 = \frac{ I - I_\min }{ I_\max - I_\min }
E_2 = \frac{ I }{ I_\max }

where I is an index of diversity, Imax and Imin are the maximum and minimum values of I between the samples being compared.

### Loevinger's coefficient

Loevinger has suggested a coefficient H defined as follows:

H = \sqrt{ \frac{ p_{ max } ( 1- p_{ min } ) } { p_{ min } (1-p_{ max } ) } }

where pmax and pmin are the maximum and minimum proportions in the sample.

## Metrics used

A number of metrics (distances between samples) have been proposed.

### Euclidean distance

While this is usually used in quantitative work it may also be used in qualitative work. This is defined as

d_{ jk } = \sqrt { \sum_{ i = 1 }^N ( x_{ ij } - x_{ ik } )^2 }

where djk is the distance between xij and xik.

### Manhattan distance

While this is more commonly used in quantitative work it may also be used in qualitative work. This is defined as

d_{ jk } = \sum_{ i = 1 }^N | x_{ ij } - x_{ ik } |

where djk is the distance between xij and xik and || is the absolute value of the difference between xij and xik.
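A short sketch of the Euclidean and Manhattan distances defined above, assuming each sample is given as a list of N values in the same category order. The function and variable names are illustrative only.

```python
import math


def euclidean_distance(xj, xk):
    """Square root of the summed squared differences between two samples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(xj, xk)))


def manhattan_distance(xj, xk):
    """Sum of the absolute differences between two samples."""
    return sum(abs(a - b) for a, b in zip(xj, xk))


sample_j = [5, 0, 3]
sample_k = [2, 4, 3]
print(euclidean_distance(sample_j, sample_k))  # 5.0
print(manhattan_distance(sample_j, sample_k))  # 7
```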

### Prevosti’s distance

This is related to the Manhattan distance. It was described by Prevosti et al and was used to compare differences between chromosomes.[61] Let P and Q be two collections of r finite probability distributions. Let these distributions have values that are divided into k categories. Then the distance DPQ is

D_{PQ} = \frac{ 1 }{ r } \sum_{ j = 1 }^r \sum_{ i = 1 }^k | p_{ ji } - q_{ ji } |

where r is the number of discrete probability distributions in each population, k is the number of categories in distributions Pj and Qj, and pji (respectively qji) is the theoretical probability of category i in distribution Pj (Qj) in population P (Q).

Its statistical properties were examined by Sanchez et al[62] who recommended a bootstrap procedure to estimate confidence intervals when testing for differences between samples.
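The distance as written above can be sketched as follows. The representation is an assumption made for illustration: each population is a list of r distributions, each distribution is a list of category probabilities, and corresponding distributions in the two populations have the same number of categories.

```python
def prevosti_distance(P, Q):
    """Mean over the r distributions of the summed absolute probability differences."""
    if len(P) != len(Q):
        raise ValueError("P and Q must contain the same number of distributions")
    r = len(P)
    total = 0.0
    for pj, qj in zip(P, Q):
        if len(pj) != len(qj):
            raise ValueError("paired distributions need the same categories")
        total += sum(abs(p - q) for p, q in zip(pj, qj))
    return total / r


# Two hypothetical populations, each with two distributions of two categories.
P = [[1.0, 0.0], [0.5, 0.5]]
Q = [[0.0, 1.0], [0.5, 0.5]]
print(prevosti_distance(P, Q))  # 1.0
```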

### Other metrics

Let

A = \sum x_{ ij }
B = \sum x_{ ik }
J = \sum \min ( x_{ ij }, x_{ ik } )

where min(x,y) is the lesser value of the pair x and y.

Then

d_{ jk } = A + B - 2J

is the Manhattan distance,

d_{ jk } = \frac{ A + B - 2J }{ A + B }

is the Bray−Curtis distance,

d_{ jk } = \frac{ A + B - 2J }{ A + B - J }

is the Jaccard (or Ruzicka) distance and

d_{ jk } = 1 - \frac{ 1 }{ 2 } \left( \frac{ J }{ A } + \frac{ J }{ B } \right)

is the Kulczynski distance.
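Because all four distances share the quantities A, B and J, they can be computed together. The sketch below assumes non-negative values and non-zero totals for both samples; the function name and dictionary keys are illustrative.

```python
def overlap_distances(xj, xk):
    """Manhattan, Bray-Curtis, Jaccard and Kulczynski distances from A, B and J."""
    A = sum(xj)
    B = sum(xk)
    J = sum(min(a, b) for a, b in zip(xj, xk))   # shared part of the two samples
    return {
        "manhattan": A + B - 2 * J,
        "bray_curtis": (A + B - 2 * J) / (A + B),
        "jaccard": (A + B - 2 * J) / (A + B - J),
        "kulczynski": 1 - 0.5 * (J / A + J / B),
    }


d = overlap_distances([5, 0, 3], [2, 4, 3])   # A = 8, B = 9, J = 5
for name, value in d.items():
    print(name, value)
```

For non-negative data the first entry agrees with the direct sum of absolute differences, since |a − b| = a + b − 2 min(a, b).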

## Ordinal data

If the categories are at least ordinal then a number of other indices may be computed.

### Leik's D

Leik's measure of dispersion (D) is one such index.[63] Let there be K categories and let pi be fi/N where fi is the number in the ith category and let the categories be arranged in ascending order. Let

c_a = \sum^a_{ i = 1 } p_i

where a ≤ K. Let da = ca if ca ≤ 0.5 and da = 1 − ca otherwise. Then

D = 2 \sum_{ a = 1 }^K \frac{ d_a }{ K - 1 }
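A minimal sketch of Leik's D, reading the definition as d_a = c_a when c_a ≤ 0.5 and d_a = 1 − c_a otherwise, with the sum running from a = 1 to K. The function name is hypothetical and the input is assumed to be the list of category frequencies in ascending category order.

```python
def leik_d(freqs):
    """Leik's D from frequencies listed in ascending category order."""
    K = len(freqs)
    N = sum(freqs)
    if K < 2 or N == 0:
        raise ValueError("need at least two categories and one case")
    running = 0      # cumulative count up to category a
    total = 0.0      # sum of d_a
    for f in freqs:
        running += f
        c_a = running / N                        # cumulative proportion
        total += c_a if c_a <= 0.5 else 1 - c_a
    return 2 * total / (K - 1)


print(leik_d([10, 0, 0, 0, 0]))  # all cases in one category
print(leik_d([5, 0, 0, 0, 5]))   # cases split between the two extremes
print(leik_d([2, 2, 2, 2, 2]))   # uniform distribution
```

The first example gives 0 and the second gives 1, the two ends of the range; the uniform case lies in between.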

### Normalised Herfindahl measure

This is the square of the coefficient of variation divided by N - 1 where N is the sample size.

H = \frac{ 1 }{ N - 1 } \frac{ s^2 }{ m^2 }

where m is the mean and s is the standard deviation.
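The formula can be sketched as below, under two assumptions that the text does not state: N is taken as the number of values over which m and s are computed, and s is the population standard deviation (dividing by N). With those choices the result coincides with the usual normalised Herfindahl index computed from shares, which the example checks.

```python
def normalised_herfindahl(values):
    """Squared coefficient of variation divided by N - 1."""
    N = len(values)
    m = sum(values) / N
    variance = sum((x - m) ** 2 for x in values) / N   # population variance (assumption)
    return variance / (m * m) / (N - 1)


def normalised_herfindahl_from_shares(values):
    """Same quantity computed as (sum of squared shares - 1/N) / (1 - 1/N)."""
    N = len(values)
    total = sum(values)
    h = sum((x / total) ** 2 for x in values)
    return (h - 1 / N) / (1 - 1 / N)


counts = [6, 3, 1]
print(normalised_herfindahl(counts))
print(normalised_herfindahl_from_shares(counts))
```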

### Potential for Conflict Index

The Potential for Conflict Index (PCI) describes the ratio of scoring on either side of a rating scale’s centre point.[64] This index requires at least ordinal data. This ratio is often displayed as a bubble graph.

The PCI uses an ordinal scale with an odd number of rating points (−n to +n) centred at 0. It is calculated as follows

PCI = \frac{ X_t }{ Z } \left[ 1 - \left| \frac{ \sum_{ i = 1 }^{ r_+ } X_+ }{ X_t } - \frac{ \sum _{ i = 1 }^{ r_- } X_-} { X_t } \right| \right]

where Z = 2n, || is the absolute value (modulus), r+ is the number of responses in the positive side of the scale, r- is the number of responses in the negative side of the scale, X+ are the responses on the positive side of the scale, X- are the responses on the negative side of the scale and

X_t = \sum_{ i = 1 }^{ r_+ } | X_+ | + \sum_{ i = 1 }^{ r_- } | X_- |

Theoretical difficulties are known to exist with the PCI. The PCI can be computed only for scales with a neutral centre point and an equal number of response options on either side of it. Also a uniform distribution of responses does not always yield the midpoint of the PCI statistic but rather varies with the number of possible responses or values in the scale. For example, five-, seven- and nine-point scales with a uniform distribution of responses give PCIs of 0.60, 0.57 and 0.50 respectively.

The first of these problems is relatively minor, as most ordinal scales with an even number of responses can be extended (or reduced) by a single value to give an odd number of possible responses. The scale can usually be recentred if this is required. The second problem is more difficult to resolve and may limit the PCI's applicability.

An extended version of the PCI has also been proposed:[65]

PCI_2 = \frac{ \sum_{ i = 1 }^K \sum_{ j = 1 }^i k_i k_j d_{ ij } }{ \delta }

where K is the number of categories, ki is the number in the ith category, dij is the distance between the ith and jth categories, and δ is the maximum distance on the scale multiplied by the number of times it can occur in the sample. For a sample with an even number of data points

\delta = \frac{ N^2 }{ 2 } d_\max

and for a sample with an odd number of data points

\delta = \frac{ N^2 - 1 }{ 2 } d_\max

where N is the number of data points in the sample and dmax is the maximum distance between points on the scale.

Vaske et al suggest a number of possible distance measures for use with this index.[65]

D_1: d_{ ij } = | r_i - r_j | - 1

if the signs (+ or −) of ri and rj differ. If the signs are the same dij = 0.

D_2: d_{ ij } = | r_i - r_j |
D_3: d_{ ij } = | r_i - r_j |^p

where p is an arbitrary real number > 0.

Dp_{ ij }: d_{ ij } = [ | r_i - r_j | - ( m - 1 ) ]^p

if sign(ri ) ≠ sign(rj ) and p is a real number > 0. If the signs are the same then dij = 0. m is D1, D2 or D3.

The difference between D1 and D2 is that the first does not include neutrals in the distance while the latter does. For example, respondents scoring −2 and +1 would have a distance of 2 under D1 and 3 under D2.

The use of a power (p) in the distances allows for the rescaling of extreme responses. These differences can be highlighted with p > 1 or diminished with p < 1.
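The distance measures D1, D2 and D3 are simple enough to sketch directly. The function names are hypothetical, and one assumption is made explicit in the code: for D1 the signs are treated as differing only when one response is negative and the other positive, so a neutral response of 0 contributes a distance of 0.

```python
def d1(ri, rj):
    """|ri - rj| - 1 when the responses have opposite signs, otherwise 0."""
    if ri * rj < 0:          # assumption: a neutral response (0) never counts
        return abs(ri - rj) - 1
    return 0


def d2(ri, rj):
    """Plain absolute difference, neutral point included."""
    return abs(ri - rj)


def d3(ri, rj, p=1.0):
    """Absolute difference raised to a power p > 0."""
    if p <= 0:
        raise ValueError("p must be positive")
    return abs(ri - rj) ** p


print(d1(-2, 1), d2(-2, 1))   # 2 3, matching the example in the text
print(d3(-2, 1, p=2), d3(-2, 1, p=0.5))
```

The last line illustrates the rescaling: the same pair of responses is 9 units apart with p = 2 and about 1.7 units apart with p = 0.5.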

In simulations with variates drawn from a uniform distribution the PCI2 has a symmetric unimodal distribution.[65] The tails of its distribution are larger than those of a normal distribution.

Vaske et al suggest the use of a t test to compare the values of the PCI between samples if the PCIs are approximately normally distributed.

### van der Eijk's A

This measure is a weighted average of the degree of agreement in the frequency distribution.[66] A ranges from −1 (perfect bimodality) to +1 (perfect unimodality). It is defined as

A = U \left( 1 - \frac{ S - 1 }{ K - 1 } \right)

where U is the unimodality of the distribution, S the number of categories that have nonzero frequencies and K the total number of categories.

The value of U is 1 if the distribution has any of the three following characteristics:

• all responses are in a single category
• the responses are evenly distributed among all the categories
• the responses are evenly distributed among two or more contiguous categories, with the other categories with zero responses

With distributions other than these the data must be divided into 'layers'. Within a layer the responses are either equal or zero. The categories do not have to be contiguous. A value for A for each layer (Ai) is calculated and a weighted average for the distribution is determined. The weights (wi) for each layer are the number of responses in that layer. In symbols

A_\mathrm{overall} = \sum w_i A_i

A uniform distribution has A = 0; when all the responses fall into one category, A = +1.

One theoretical problem with this index is that it assumes that the intervals are equally spaced. This may limit its applicability.

## Related statistics

### Birthday problem

If there are n units in the sample and they are randomly distributed into c categories (n ≤ c), this can be considered a variant of the birthday problem.[67] The probability (p) that no category contains more than one unit is

p = \prod_{ i = 1 }^{ n - 1 } \left( 1 - \frac{ i }{ c } \right)

If c is large and n is small compared with c^(2/3) then to a good approximation

p = \exp\left( \frac{ -n^2 } { 2c } \right)

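This approximation follows from the exact formula by taking logarithms and using

\log_e \left( 1 - \frac{ i }{ c } \right) \approx - \frac{ i }{ c }

Summing these terms over i gives approximately −n^2 / (2c), and exponentiating yields the approximation above.

#### Sample size estimates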

For p = 0.5 and p = 0.05 respectively the following estimates of n may be useful

n = 1.2 \sqrt{ c }
n = 2.448 \sqrt{ c } \approx 2.5 \sqrt{ c }
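The coefficients 1.2 and 2.448 come from solving exp(−n^2 / (2c)) = p for n, which gives n = sqrt(2 c log(1/p)). The sketch below compares the exact probability with the approximation and recovers the coefficients. The function names are hypothetical, and the exact product is taken over i = 1, ..., n − 1, which is the expression for n units all falling in different categories.

```python
import math


def prob_all_distinct(n, c):
    """Exact probability that n units fall into n different categories out of c."""
    p = 1.0
    for i in range(1, n):
        p *= 1 - i / c
    return p


def prob_all_distinct_approx(n, c):
    """Approximation exp(-n^2 / (2c)) for large c."""
    return math.exp(-n * n / (2 * c))


def sample_size_for(p, c):
    """Value of n at which the approximation equals p."""
    return math.sqrt(2 * c * math.log(1 / p))


c = 365   # the classical birthday setting
print(prob_all_distinct(23, c), prob_all_distinct_approx(23, c))
print(sample_size_for(0.5, c))                       # a little under 23
print(sample_size_for(0.5, 1), sample_size_for(0.05, 1))   # the two coefficients
```

The last line prints roughly 1.18 and 2.45, so the 1.2 used for p = 0.5 is a rounded value.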

This analysis can be extended to multiple categories. For p = 0.5 and p = 0.05 we have respectively

n = 1.2 \sqrt{ \frac{ 1 }{ \sum_{ i = 1 }^k \frac{ 1 }{ c_i } } }
n \approx 2.5 \sqrt{ \frac{ 1 }{ \sum_{ i = 1 }^k \frac{ 1 }{ c_i } } }

where ci is the size of the ith category. This analysis assumes that the categories are independent.

If the data are ordered in some fashion, the event of interest becomes at least one pair of units falling in categories that lie within j categories of each other. The sample sizes (n) corresponding to p = 0.5 and p = 0.05 are then, respectively,[68]

n = 1.2 \sqrt { \frac{ c }{ 2j + 1 } }
n \approx 2.5 \sqrt { \frac{ c }{ 2j + 1 } }

where c is the number of categories.

### Birthday-death day problem

Whether or not there is a relation between birthdays and death days has been investigated with the following statistic[69]

- \log_{10} \left( \frac{ 1 + 2 d }{ 365 } \right)

where d is the number of days in the year between the birthday and the death day.

## Evaluation of indices

Different indices give different values of variation, and may be used for different purposes: several are used and critiqued in the sociology literature especially.

If one wishes to simply make ordinal comparisons between samples (is one sample more or less varied than another), the choice of IQV is relatively less important, as they will often give the same ordering.

Where the data is ordinal a method that may be of use in comparing samples is ORDANOVA.

In some cases it is useful not to standardize an index to run from 0 to 1 regardless of the number of categories or samples (Wilcox 1973, p. 338), but one generally does standardize it.

## Notes

1. ^ This can only happen if the number of cases is a multiple of the number of categories.
2. ^ Freeman LC (1965) Elementary applied statistics. New York: John Wiley and Sons pp 40–43
3. ^ Kendall MG, Stuart A (1958) The advanced theory of statistics. Hafner Publishing Company p 46
4. ^ Mueller JH, Schuessler KF (1961) Statistical reasoning in sociology. Boston: Houghton Mifflin Company. pp 177–179
5. ^ Wilcox AR (1967) Indices of qualitative variation
6. ^ Kaiser HF (1968) A measure of the population quality of legislative apportionment. The American Political Science Review 62 (1) 208
7. ^ Joel Gombin (2015). qualvar: Implements Indices of Qualitative Variation Proposed by Wilcox (1973). R package version 0.1.0. http://CRAN.R-project.org/package=qualvar
8. ^ Gibbs JP, Poston Jr, Dudley L (1975) The division of labor: Conceptualization and related measures. Social Forces 53 (3) 468–476 doi:10.2307/2576589
9. ^ IQV at xycoon
10. ^ Hunter PR, Gaston MA (1988) Numerical index of the discriminatory ability of typing systems: an application of Simpson's index of diversity. J Clin Microbiol 26(11): 2465–2466
11. ^ Friedman WF (1925) The incidence of coincidence and its applications in cryptanalysis. Technical Paper. Office of the Chief Signal Officer. United States Government Printing Office.
12. ^ Gini CW (1912) Variability and mutability, contribution to the study of statistical distributions and relations. Studi Economico-Giuricici della R. Universita de Cagliari
13. ^ Simpson EH (1949) Measurement of diversity. Nature 163:688
14. ^ Bachi R (1956) A statistical analysis of the revival of Hebrew in Israel. In: Bachi R (ed) Scripta Hierosolymitana, Vol III, Jerusalem: Magnus press pp 179–247
15. ^ Mueller JH, Schuessler KF (1961) Statistical reasoning in sociology. Boston: Houghton Mifflin
16. ^ Gibbs JP, Martin, WT (1962) Urbanization, technology and division of labor: International patterns. American Sociological Review 27: 667–677
17. ^ Lieberson S (1969) Measuring population diversity. American Sociological Review 34(6) 850–862
18. ^ Blau P (1977) Inequality and Heterogeneity. Free Press, New York
19. ^ Perry M, Kader G (2005) Variation as unalikeability. Teaching Stats 27 (2) 58–60
20. ^ Greenberg JH (1956) The measurement of linguistic diversity. Language 32: 109–115
21. ^ Lautard EH (1978) PhD thesis
22. ^ Berger WH, Parker FL (1970) Diversity of planktonic Foraminifera in deep sea sediments. Science 168:1345–1347
23. ^ a b Hill, M O. 1973. Diversity and evenness: a unifying notation and its consequences. Ecology 54:427–431
24. ^ Margalef R (1958) Temporal succession and spatial heterogeneity in phytoplankton. In: Perspectives in marine biology. Buzzati-Traverso (ed) Univ Calif Press, Berkeley pp 323–347
25. ^ Menhinick EF (1964) A comparison of some species-individuals diversity indices applied to samples of field insects. Ecology 45 (4) 859–861
26. ^ Kuraszkiewicz W (1951) Nakladen Wroclawskiego Towarzystwa Naukowego
27. ^ Guiraud P (1954) Les caractères statistiques du vocabulaire. Presses Universitaires de France, Paris
28. ^ Panas E (2001) The Generalized Torquist: Specification and estimation of a new vocabulary-text size function. J Quant Ling 8(3) 233–252
29. ^ Kempton RA, Taylor LR (1976) Models and statistics for species diversity. Nature 262: 818–820
30. ^ Hutcheson K (1970) A test for comparing diversities based on the Shannon formula. J Theo Biol 29: 151–154
31. ^ Fisher RA, Corbet A, Williams CB (1943) The relation between the number of species and the number of individuals in a random sample of an animal population. Animal Ecol 12: 42–58
32. ^ Anscombe (1950) Sampling theory of the negative binomial and logarithmic series distributions. Biometrika 37: 358–382
33. ^ Strong WL (2002) Assessing species abundance unevenness within and between plant communities. Community Ecology 3: 237–246
34. ^ Camargo JA (1993) Must dominance increase with the number of subordinate species in competitive interactions? J. Theor Biol 161 537–542
35. ^ Smith, Wilson (1996)
36. ^ Bulla L (1994) An index of evenness and its associated diversity measure. Oikos 70:167–171
37. ^ Horn HS (1966) Measurement of 'overlap' in comparative ecological studies. Am Nat 100 (914): 419–423
38. ^ Siegel, Andrew F (2006) Rarefaction curves. Encyclopedia of Statistical Sciences 10.1002/0471667196.ess2195.pub2.
39. ^ Caswell H (1976) Community structure: a neutral model analysis. Ecol Monogr 46: 327–354
40. ^ Poulin R, Mouillot D (2003) Parasite specialization from a phylogenetic perspective: a new index of host specificity. Parasitology 126: 473–480
41. ^ Duncan OD, Duncan B (1955) A methodological analysis of segregation indexes. Am Sociol Review, 20: 210–217
42. ^ Gorard S, Taylor C (2002b) What is segregation? A comparison of measures in terms of 'strong' and 'weak' compositional invariance. Sociology, 36(4), 875–895
43. ^ Massey DS, Denton NA (1988) The dimensions of residential segregation. Social Forces 67: 281–315
44. ^ Hutchens RM (2004) One measure of segregation. International Economic Review 45: 555–578
45. ^ Lieberson S (1981) An asymmetrical approach to segregation. In: Peach C, Robinson V, Smith S (eds) Ethnic segregation in cities. London: Croom Helm. pp 61–82
46. ^ Bell W (1954) A probability model for the measurement of ecological segregation. Social Forces 32:357–364
47. ^ Ochiai A (1957) Zoogeographic studies on the soleoid fishes found in Japan and its neighbouring regions. Bull Jpn Soc Sci Fish 22: 526–530
48. ^ Kulczynski S (1927) Die Pflanzenassoziationen der Pieninen. Bulletin International de l'Academie Polonaise des Sciences et des Lettres, Classe des Sciences
49. ^ Yule GU (1900) On the association of attributes in statistics. Philos Trans Roy Soc
50. ^ Lienert GA and Sporer SL (1982) Interkorrelationen seltner Symptome mittels Nullfeldkorrigierter YuleKoeffizienten. Psychologische Beitrage 24: 411–418
51. ^ Baroni-Urbani C, Buser MW (1976) Similarity of binary data. Systematic Biology 25: 251–259
52. ^
53. ^
54. ^ Jaccard P (1902) Lois de distribution florale. Bulletin de la Socíeté Vaudoise des Sciences Naturelles 38:67-130
55. ^ Archer AW and Maples CG (1989) Response of selected binomial coefficients to varying degrees of matrix sparseness and to matrices with known data interrelationships. Mathematical Geology 21: 741-753
56. ^ a b Morisita M (1959) Measuring the dispersion and the analysis of distribution patterns. Memoires of the Faculty of Science, Kyushu University Series E. Biol 2:215–235
57. ^ Lloyd M (1967) Mean crowding. J Anim Ecol 36: 1–30
58. ^ Pedigo LP & Buntin GD (1994) Handbook of sampling methods for arthropods in agriculture. CRC Boca Raton FL
59. ^ Smith-Gill S J (1975) Cytophysiological basis of disruptive pigmentary patterns in the leopard frog Rana pipiens. II. Wild type and mutant cell specific patterns. J Morphol 146, 35–54
60. ^ Peet (1974) The measurements of species diversity. Ann Rev Ecol System 5: 285–307
61. ^ Prevosti A, Ribo, G, Serra L, Aguade M, Balanya J, Monclus M, Mestres F (1988) Colonization of America by Drosophila subobscura: experiment in natural populations that supports the adaptive role of chromosomal inversion polymorphism. Proc Natl Acad Sci USA 85: 5597–5600
62. ^ Sanchez A, Ocana J, Utzetb F, Serrac L (2003) Comparison of Prevosti genetic distances. Journal of Statistical Planning and Inference 109 (2003) 43–65
63. ^ Leik R (1966) A measure of ordinal consensus. Pacific sociological review 9 (2): 85–90
64. ^ Manfredo M, Vaske JJ, Teel TL (2003) The potential for conflict index: A graphic approach to practical significance of human dimensions research. Human Dimensions of Wildlife 8: 219–228
65. ^ a b c Vaske JJ, Beaman J, Barreto H, Shelby LB (2010) An extension and further validation of the potential for conflict index. Leisure Sciences 32: 240–254
66. ^ Van der Eijk C (2001) Measuring agreement in ordered rating scales. Quality and quantity 35(3): 325–341
67. ^ Von Mises R (1939) Über Aufteilungs- und Besetzungs-Wahrscheinlichkeiten. Revue de la Faculté des Sciences de l'Université d'Istanbul NS 4: 145−163
68. ^ Sevast'yanov BA (1972) Poisson limit law for a scheme of sums of dependent random variables. (trans. S. M. Rudolfer) Theory of probability and its applications, 17: 695−699
69. ^ Hoaglin DC, Mosteller, F and Tukey, JW (1985) Exploring data tables, trends, and shapes, New York: John Wiley
