4.2.4. Mathematical statistics

4.2.4.1. Probability distribution

The minimum concepts required to use statistical analysis properly.

4.2.4.1.1. Parameters

Most probability distributions, discrete or continuous, can be uniquely described by a few numbers, commonly written as Greek letters. These numbers are called the parameters of the distribution. For example, the uni-variate gaussian is expressed as \(\mathcal{N}(\mu, \sigma)\), and the beta distribution is expressed as \(\mathrm{Beta}(\alpha, \beta)\).

Generally, these parameters are unknown. By convention a hat, for example \(\hat{a}\), denotes a calculation rule that yields an approximate value of the parameter \(a\) from some data points (samples). In terminology, this is called parameter estimation. For example, assume some data are sampled from a certain uni-variate gaussian \(\mathcal{N}(\mu, \sigma)\). The parameter \(\mu\) is actually unknown, but we can estimate it via \(\hat{\mu} = (\sum_{i=1}^n x_i)/n\).
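As a minimal sketch of parameter estimation, the snippet below applies the rule \(\hat{\mu} = (\sum_{i=1}^n x_i)/n\) to a small sample. The data values and the helper name `estimate_mu` are illustrative assumptions, not part of any library.

```python
def estimate_mu(samples):
    """Estimate the location parameter mu by the sample mean."""
    return sum(samples) / len(samples)

# Hypothetical observations, assumed to be drawn from some N(mu, sigma)
samples = [4.8, 5.1, 5.3, 4.9, 4.9]
mu_hat = estimate_mu(samples)
```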

4.2.4.1.2. Kernel and scale

Functions that analytically express a distribution are called probability mass functions (in the discrete case) or probability density functions (in the continuous case). Such a function can be factorized into several multiplied items. The minimal item that contains the variable and all required parameters of the distribution is called the kernel. The scale refers to the parameters in the non-kernel items. For instance, the probability density function of the uni-variate gaussian is:

(4.3)\[f(x|\mu,\sigma) = \frac{1}{\sqrt{2\pi}\sigma} \exp{[-\frac{(x-\mu)^2}{2\sigma^2}]}\]

The kernel item \(k = \exp{[-(x-\mu)^2/(2\sigma^2)]}\) contains \(x\), \(\mu\) and \(\sigma\). The remaining item \(s = \frac{1}{\sqrt{2\pi}\sigma}\) contains only the parameter \(\sigma\), therefore \(\sigma\) is also called the scale of the uni-variate gaussian.

Additionally, the kernel \(k = k(x)\) itself does not need to sum or integrate to 1. The product \(s \cdot k\), however, must satisfy \(\sum_{i} s \cdot k(x_i) = 1\) in the discrete case or \(\int_{x} s \cdot k(x) dx = 1\) in the continuous case.
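The normalisation requirement can be checked numerically. The sketch below integrates the Gaussian kernel \(\exp{[-(x-\mu)^2/(2\sigma^2)]}\) with a simple midpoint rule and then multiplies by the item \(1/(\sqrt{2\pi}\sigma)\). The parameter values, the integration range and the helper names `kernel` and `integrate` are assumptions chosen for illustration.

```python
import math

def kernel(x, mu, sigma):
    """Unnormalised Gaussian kernel k(x)."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))

def integrate(f, lo, hi, steps=20_000):
    """Midpoint-rule approximation of the integral of f over [lo, hi]."""
    width = (hi - lo) / steps
    return sum(f(lo + (i + 0.5) * width) for i in range(steps)) * width

mu, sigma = 1.0, 2.0  # illustrative parameter values
scale_term = 1.0 / (math.sqrt(2 * math.pi) * sigma)

# The kernel alone does not integrate to 1 ...
kernel_area = integrate(lambda x: kernel(x, mu, sigma), mu - 20 * sigma, mu + 20 * sigma)
# ... but the kernel multiplied by the scale item does.
density_area = scale_term * kernel_area
```

Here `kernel_area` is close to \(\sqrt{2\pi}\sigma\), while `density_area` is close to 1.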

4.2.4.1.3. Parametric and non-parametric

In statistics, methods that rely on a strong assumption about the underlying probability distribution are called parametric methods. For example, if we believe the data are drawn from a uni-variate gaussian \(\mathcal{N}(\mu, \sigma)\), we use \(\hat{\mu} = \mathbb{E}[X]\) and \(\hat{\sigma^2} = \mathbb{E}[X^2] - (\mathbb{E}[X])^2\) as estimates. This works well only if the prior belief actually holds.

Non-parametric methods, correspondingly, focus on the ranks of values rather than on the values themselves. Consider two treatments with observations \(A = \{1, 2, 3, 4, 5\}\) and \(B = \{1, 2, 3, 4, 5000\}\). A parametric method can report misleading information, for example that treatment \(B\) significantly improves something, simply because the two mean values differ widely (3 versus 1002). However, if we use the median, the non-parametric counterpart of the mean, both groups have a median of 3 and the conclusion is the opposite.
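The contrast between the two summaries of \(A\) and \(B\) can be reproduced with Python's standard `statistics` module. This is only an illustration of the example above, not a statistical test.

```python
import statistics

A = [1, 2, 3, 4, 5]
B = [1, 2, 3, 4, 5000]

# The mean is dominated by the single extreme value in B
mean_a, mean_b = statistics.mean(A), statistics.mean(B)
# The median depends only on the ordering of the values
median_a, median_b = statistics.median(A), statistics.median(B)
```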

Data does not lie. People do.

—Lee Baker, Truth, Lies & Statistics: How to Lie with Statistics

4.2.4.1.4. Marginal probability

For an \(n\)-variate distribution \(f(X_1, X_2, \dots, X_n)\), denote the index set \(U = \{1, 2, \dots, n\}\) and let \(m\) be a positive integer satisfying \(0 < m < n\). A marginal probability density of \(f(X_1, X_2, \dots, X_n)\) can be denoted as \(f(X_{s_1}, X_{s_2}, \dots, X_{s_m})\), where the set \(A = \{s_1, s_{2}, \dots, s_{m}\}\) is a proper subset of \(U\) (\(A \subset U\)).

Formally, for complement \(A^C = \{k_1, k_2, \dots, k_{n-m}\}\), the calculation for \(f(X_{s_1}, X_{s_2}, \dots, X_{s_m})\) is:

(4.4)\[\begin{split}f({X}_{s_1}, {X}_{s_2}, \dots, {X}_{s_m}) &= \int_{{X}_{k_1}} \int_{{X}_{k_2}} \cdots \int_{{X}_{{k}_{n-m}}} f(X_1, X_2, \dots, X_n) d{X}_{k_1} d{X}_{k_2} \cdots d{X}_{{k}_{n-m}} \\ &= {E}_{{X}_{k_1}, {X}_{k_2}, \cdots, {X}_{{k}_{n-m}}} [f({X}_{s_1}, {X}_{s_2}, \dots, {X}_{s_m} \mid {X}_{k_1}, {X}_{k_2}, \dots, {X}_{{k}_{n-m}})]\end{split}\]

The marginal probability density \(f(X_{s_1}, X_{s_2}, \dots, X_{s_m})\) therefore lives in an \(m\)-dimensional subspace. In particular, if \(n - m = 1\), the marginal density is degenerated by one dimension from the original \(n\)-dimensional space.
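For a discrete distribution the integrals become sums. The sketch below marginalises a hypothetical two-variable probability table; the table values are assumptions chosen only so that they sum to 1.

```python
# Hypothetical joint probability table p(X1, X2); rows index X1, columns index X2
joint = [
    [0.10, 0.20],
    [0.30, 0.40],
]

# Marginal of X1: sum out X2 (the discrete counterpart of integrating over X2)
marginal_x1 = [sum(row) for row in joint]
# Marginal of X2: sum out X1
marginal_x2 = [sum(col) for col in zip(*joint)]
```

Each marginal is a one-dimensional table, i.e. one dimension fewer than the joint table.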

Note

Degeneracy describes the case where a class of objects changes its nature under some constraints. For example, for an ellipse \(g(a, b)\), if \(a = b\), it degenerates into a circle; if \(a \cdot b = 0,\ a+b \neq 0\), it degenerates into a line segment; if \(a \cdot b = 0,\ a+b = 0\), it degenerates into a point.

Degeneracy also occurs in probability distributions. A one-point distribution is the degenerate case of a uni-variate Gaussian \(g(x|\mu, s)\) when \(s = 0\); a beta distribution is the degenerate case of a Dirichlet distribution \(\mathrm{Dir}(\alpha_1, \dots, \alpha_m)\) when \(m = 2\). However, for a multivariate Gaussian, both its marginal and its conditional distributions are always Gaussian, even though the number of dimensions is reduced.

4.2.4.2. Hypothesis testing

Statistical hypothesis testing was developed and enriched by Karl Pearson, William Sealy Gosset, Ronald Fisher, Jerzy Neyman, and Egon Pearson [Fisher1955, Neyman1933, Goodman1999, Heyde2001]. It is a method to decide whether the collected data sufficiently support a certain statistical hypothesis.

Every hypothesis test starts from an assumption called the null hypothesis \(H_0\); its complement \(H_1 = H_0^C\) is the alternative hypothesis, where the complement is taken with respect to the full probability space \(C\) (\(p(C) = 1\)). Most tests report a statistic, commonly a scalar indicator devised to describe some property of the data, and a \(p\)-value, the probability of obtaining data at least as extreme as those collected in one study if \(H_0\) is true.

Because the \(p\)-value is a probability, it ranges from 0 to 1. Practically, the smaller the \(p\)-value, the stronger the evidence in the tested data for rejecting the null hypothesis \(H_0\).

There are two types of \(H_0\): the equality hypothesis, whose test is called two-tailed, and the inequality hypothesis, whose test is called single-tailed (one-tailed). For example, if \(\mu_1\) and \(\mu_2\) are the mean values of two populations \(X_1\) and \(X_2\), the hypothesis \(\mu_1 = \mu_2\) is two-tailed, while \(\mu_1 > \mu_2\) and \(\mu_1 < \mu_2\) are single-tailed. Note two key facts: 1) this concept only exists in two-group comparisons; 2) the choice of alternative generally changes the final \(p\)-value, but not the statistic.

4.2.4.2.1. One-way ANOVA test

One-way ANOVA is designed to test whether the means of two or more samples are significantly different, using the \(F\) distribution [Lowry2014, Heiman2001]. For one-way ANOVA:

\(H_0\):

Samples of all groups are drawn from the populations with the same mean

\(H_1\):

Samples of all groups are not drawn from the populations with the same mean

Statistic:

(4.5)\[s = \frac{{MS}_{B}}{{MS}_{W}} \sim \mathcal{F}\]

Where \(MS_{B}\) and \(MS_{W}\) are the mean squares between and within groups respectively. This statistic \(s\) follows a certain \(\mathcal{F}\) distribution.

More specifically, \(MS_{B} = S_{B}/f_{B}\), where \(S_{B}\) is the sum of squared differences and \(f_{B}\) is the degrees of freedom, both between groups. \(MS_{W}\) is defined in the same way, but within groups.
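The statistic can be computed directly from these definitions. The following is a minimal sketch that assumes the usual degrees of freedom, \(g - 1\) between groups and \(N - g\) within groups, for \(g\) groups and \(N\) observations in total. The function name `one_way_anova_f` and the data are illustrative.

```python
def one_way_anova_f(groups):
    """Return the statistic MS_B / MS_W for a list of groups."""
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]

    # Sums of squared differences between and within groups
    s_b = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    s_w = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)

    f_b = len(groups) - 1        # degrees of freedom between groups
    f_w = n_total - len(groups)  # degrees of freedom within groups
    return (s_b / f_b) / (s_w / f_w)

groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_stat = one_way_anova_f(groups)
```

For this data \(S_B = 26\) and \(S_W = 6\), so the statistic equals \((26/2)/(6/6) = 13\).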

4.2.4.2.2. Student’s T test

Student’s T test is designed to evaluate whether the population mean of one group is equal to, greater than, or less than a specific value (one-sample in statistical terminology) or the mean of another group (two-sample). It takes its name from the pseudonym Student, under which William Sealy Gosset published the method [Lehmann1992]. For the two-tailed independent T test:

\(H_0\):

For population mean values \(\mu_1\) and \(\mu_2\) in two groups, \(\mu_1 = \mu_2\)

\(H_1\):

\(\mu_1 \neq \mu_2\)

statistic:

(4.6)\[s = \frac{\bar{x}_1 - \bar{x}_2}{{s}_{\Delta}}\]

Here \(\bar{x}_1\) and \(\bar{x}_2\) are the sample means, which estimate \(\mu_1\) and \(\mu_2\). The form of \(s_{\Delta}\) depends on whether the two groups have similar variances. Assume \(n_1\) and \(n_2\) are the numbers of samples, and \(s_1^2\) and \(s_2^2\) are the unbiased estimators of variance, for the 1st and 2nd group respectively. For similar variances:

(4.7)\[{s}_{\Delta} = \sqrt{\frac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1+n_2-2}} \cdot \sqrt{\frac{1}{n_1}+\frac{1}{n_2}}\]

For two groups with very different variances, Welch’s T test is used instead. In this case:

(4.8)\[{s}_{\Delta} = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}\]

In particular, if \(n_1 = n_2 = n\), both forms of \(s_{\Delta}\) reduce to \(\sqrt{s_1^2 + s_2^2}/\sqrt{n}\). Writing \(s^{\prime} = \sqrt{s_1^2 + s_2^2}\), \(s_1\) and \(s_2\) can be viewed as lying on two orthogonal axes, which is why the test is called the independent T test. For the dependent (paired) case the two groups are correlated, and \(s^{\prime}\) becomes the standard deviation of the paired differences, \(s_d = \sqrt{s_1^2 + s_2^2 - 2 r s_1 s_2}\), where \(r\) is the sample correlation between the groups. When \(r = 1\), i.e. \(s_1\) and \(s_2\) lie on the same axis, this reduces to \(\mid s_1 - s_2 \mid\). With sample means \(\bar{x}_1\) and \(\bar{x}_2\), the statistic in this circumstance is:

(4.9)\[s = \frac{\bar{x}_1 - \bar{x}_2}{s_d/\sqrt{n}}\]
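The pooled and Welch forms of \(s_{\Delta}\) (Equations 4.7 and 4.8) can be compared numerically. The sketch below uses the standard library's `statistics.variance`, which is the unbiased estimator. The helper names `pooled_se` and `welch_se` and the data are assumptions made for illustration.

```python
import math
import statistics

def pooled_se(x, y):
    """s_Delta for similar variances (pooled form, Equation 4.7)."""
    n1, n2 = len(x), len(y)
    v1, v2 = statistics.variance(x), statistics.variance(y)  # unbiased estimates
    pooled = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
    return math.sqrt(pooled) * math.sqrt(1 / n1 + 1 / n2)

def welch_se(x, y):
    """s_Delta for unequal variances (Welch form, Equation 4.8)."""
    return math.sqrt(statistics.variance(x) / len(x) + statistics.variance(y) / len(y))

x = [5.1, 4.9, 5.6, 5.8, 6.0]
y = [4.1, 4.5, 4.4, 4.9, 4.0]
diff = statistics.mean(x) - statistics.mean(y)
t_pooled = diff / pooled_se(x, y)
t_welch = diff / welch_se(x, y)
```

Because both groups have the same size here, the two forms of \(s_{\Delta}\) coincide and so do the two statistics. With unequal group sizes they generally differ.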

4.2.4.2.3. Shapiro-Wilk test

The Shapiro-Wilk test was proposed by Shapiro and Wilk [Shapiro1965] to assess the normality of data, where:

\(H_0\):

The data was drawn from a normal distribution

\(H_1\):

The data was not drawn from a normal distribution

Statistic:

(4.10)\[s = \frac{(\sum_{i=1}^n a_i {x}_{(i)})^2}{\sum_{i=1}^n (x_i - \bar{x})^2}\]

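Where \(x_{(i)}\) is the \(i\)-th smallest observation, \(\bar{x}\) is the sample mean, and \(a_i\) is the \(i\)-th element of the coefficient vector \(\boldsymbol{a}\), as defined in [Davis1977].

Note that there is no analytical formula for the distribution of this statistic, so the corresponding \(p\)-value is obtained via Monte Carlo (MC) simulation.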

4.2.4.2.4. Omnibus Normality test

The omnibus test for normality was proposed mainly by D’Agostino [Agostino1971, Agostino1973] to determine how far the sample distribution departs from a uni-variate gaussian:

\(H_0\):

The data was drawn from a normal distribution

\(H_1\):

The data was not drawn from a normal distribution

statistic:

(4.11)\[s = s_s^2 + s_k^2\]

Where the \(s_s\) and \(s_k\) are statistics returned from skew test and kurtosis test.

4.2.4.2.5. Kolmogorov-Smirnov test

The Kolmogorov-Smirnov test is a non-parametric method to quantify the distance from one empirical distribution function to a cumulative distribution function (one-sample), or to another empirical distribution function (two-sample). It is generally used to test goodness of fit. For the two-tailed Kolmogorov-Smirnov test:

\(H_0\):

For cumulative distribution function \(F(x)\) and \(F^\prime(x)\), \(F(x) = F^\prime(x)\)

\(H_1\):

\(F(x) \neq F^\prime(x)\)

statistic:

(4.12)\[s = \mathrm{sup}_x \mid F(x) - F^\prime(x) \mid\]

Where:

(4.13)\[F(x) = F_n(x) = \frac{1}{n} \sum_{i=1}^{n} {I}_{(-\infty, x]}(X_i)\]

In the one-sample test, \(F^\prime(x)\) is a pre-defined distribution; in the two-sample test, \(F^\prime(x) = F_m(x)\), which is defined in the same way as \(F_n(x)\) but from another dataset.
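A minimal sketch of the two-sample statistic follows. It evaluates both empirical distribution functions at every observed value, which is sufficient because both are step functions that only change at observations. The helper names `ecdf` and `ks_two_sample` and the data are illustrative assumptions.

```python
def ecdf(sample, x):
    """Empirical distribution function: fraction of observations <= x."""
    return sum(1 for v in sample if v <= x) / len(sample)

def ks_two_sample(a, b):
    """Largest absolute gap between the two empirical distribution functions."""
    # Both functions are step functions, so the supremum is reached at an observed value
    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in set(a) | set(b))

a = [1, 2, 3, 4]
b = [3, 4, 5, 6]
d = ks_two_sample(a, b)
```

For these samples the largest gap is 0.5, reached for example at \(x = 2\), where the first function equals 0.5 and the second is still 0.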

4.2.4.2.6. Cramér-von Mises test

The Cramér-von Mises test is proposed by Harald Cramér and Richard Edler von Mises as a criterion to measure the distance from one empirical distribution function to a cumulative distribution function (one-sample), or to another empirical distribution function (two-sample).

\(H_0\):

For empirical distribution and cumulative distribution function \(F_{n}(x)\) and \(F^\prime(x)\), \(F_{n}(x) = F^\prime(x)\)

\(H_1\):

\(F_{n}(x) \neq F^\prime(x)\)

statistic:

(4.14)\[s = \frac{1}{12n} + \sum_{i=1}^{n} [\frac{2i-1}{2n} - F^\prime(x_i)]^2\]

Here the observations \(x_1 \leq x_2 \leq \dots \leq x_n\) are sorted in ascending order. In particular, if \(F^\prime(x)\) comes from another empirical distribution \(F_{m}(y)\), the test becomes a two-sample test with the following statistic:

(4.15)\[s = \frac{n\sum_{i=1}^{n}({r}_{x_i, a}-i)^2+m\sum_{j=1}^{m}({r}_{y_j, a}-j)^2 }{mn(m+n)}-\frac{4mn-1}{6(m+n)}\]

Where \(r_{v, a}\) is the rank of \(v\) in the pooled series \(a = \{x_1, x_2, \dots, x_n, y_1, y_2, \dots, y_m\}\), with both samples sorted in ascending order.

4.2.4.2.7. Alexander Govern test

An alternative to the one-way ANOVA test, proposed by Alexander and Govern for multi-group data with heterogeneous variances. As in one-way ANOVA:

\(H_0\):

Samples of all groups are drawn from the populations with the same mean

\(H_1\):

Samples of all groups are not drawn from the populations with the same mean

Statistic:

(4.16)\[s = \sum_{j=1}^{J} Z_j^2\]

Where \(J\) is the number of groups, \(Z_j\) is the standard normal deviate for each group. For more details, see description summarized by Ochuko.

4.2.4.2.8. Tukey’s range test

Tukey’s honestly significant difference (HSD) test, a comprehensive test proposed by John Tukey [Tukey1949] compares all possible pairs of means.

\(H_0\):

Samples of all groups are drawn from the populations with the same mean

\(H_1\):

Samples of all groups are not drawn from the populations with the same mean

Statistic:

(4.17)\[{s}_{i, j} = \frac{\mu_{i} -\mu_{j}}{SE} \sim Q\]

Where \(\mu_{i}\) and \(\mu_{j}\) are means of group \(i\) and \(j\); \(SE\) is the standard error of the sum of means. \(Q\) is a certain studentized range distribution.

4.2.4.2.9. Kruskal-Wallis H-test

A non-parametric method proposed by William Kruskal and W. Allen Wallis [Kruskal1952] to measure whether samples originate from the identical distribution. It can be seen as the non-parametric alternative for one-way ANOVA test.

\(H_0\):

Samples of all groups are drawn from the populations with the same median

\(H_1\):

Samples of all groups are not drawn from the populations with the same median

Statistic:

(4.18)\[s = (N-1)\frac{\sum_{i=1}^{g} n_i (\bar{r}_{i\cdot}-\bar{r})^2}{\sum_{i=1}^g \sum_{j=1}^{n_i} ({r}_{ij} - \bar{r})^2}\]

Where \(N\) is the number of all observations; \(n_i\) is the number of observations in the \(i\)-th group; \(g\) is the number of groups; \(r_{ij}\) is the global rank of the \(j\)-th observation in the \(i\)-th group, while \(\bar{r}_{i\cdot}\) is calculated from \((\sum_{j=1}^{n_i} r_{ij})/n_i\), and \(\bar{r}\) is calculated from \(0.5\cdot(N+1)\).
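Equation 4.18 can be evaluated in a few lines. The sketch below assigns tied values their average rank, a common convention that is assumed here rather than taken from the text. The helper names `average_ranks` and `kruskal_wallis_h` are illustrative.

```python
def average_ranks(values):
    """1-based ascending ranks; tied values share their average rank."""
    ordered = sorted(values)
    # first 0-based position of v plus half of (tie count + 1)
    return [ordered.index(v) + (ordered.count(v) + 1) / 2 for v in values]

def kruskal_wallis_h(groups):
    """Statistic of Equation 4.18 for a list of groups."""
    pooled = [x for g in groups for x in g]
    ranks = average_ranks(pooled)
    n_total = len(pooled)
    r_bar = 0.5 * (n_total + 1)

    between, offset = 0.0, 0
    for g in groups:
        group_ranks = ranks[offset:offset + len(g)]
        r_bar_i = sum(group_ranks) / len(g)
        between += len(g) * (r_bar_i - r_bar) ** 2
        offset += len(g)

    total = sum((r - r_bar) ** 2 for r in ranks)
    return (n_total - 1) * between / total

h = kruskal_wallis_h([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
```

For this data the mean ranks of the groups are 2, 5 and 8 around \(\bar{r} = 5\), which gives \(8 \cdot 54 / 60 = 7.2\).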

4.2.4.2.10. Mood’s test

Mood’s test can measure median and scale on multi-grouped data. Mood’s median test is a non-parametric alternative to one-way ANOVA test, and also a special case of Pearson’s Chi-Squared Test.

\(H_0\):

Samples of all groups are drawn from the populations with the same median

\(H_1\):

Samples of all groups are not drawn from the populations with the same median

statistic:

(4.19)\[s = \sum_{i=1}^{g} \sum_{j=0}^{1} \frac{({A}_{i,j}-{B}_{i,j})^2}{{B}_{i,j}}\]

Assume the grand median is \(\bar{m}\). \(A_{i,0}\) is the count of observations less than or equal to \(\bar{m}\) in the \(i\)-th group, and \(A_{i,1}\) is the count of those greater than \(\bar{m}\). \(B_{i, j}\) is the expected count, defined as \((\sum_{i^\prime=1}^{g} A_{i^\prime,j} \cdot \sum_{j^\prime=0}^{1} A_{i,j^\prime})/\sum_{i^\prime=1}^{g} \sum_{j^\prime=0}^{1} A_{i^\prime,j^\prime}\).

Scale is the parameter describing the spread of a distribution (see scale). For the pair-wise Mood’s scale test, the underlying model assumes that the two samples are drawn from distributions \(f(x-l)\) and \(f((x-l)/m)/m\) respectively, where \(l\) is the location and \(m\) is the scale. The null hypothesis in this case is \(m = 1\).

4.2.4.2.11. Bartlett’s test

A statistical approach proposed by Maurice Stevenson Bartlett for testing homoscedasticity, i.e. whether samples are drawn from populations with equal variances.

\(H_0\):

Samples of all groups are of the same variance

\(H_1\):

Samples of all groups are not of the same variance

statistic:

(4.20)\[s = \frac{(N-g)\mathrm{ln}S_p^2 - \sum_{i=1}^g (n_i - 1)\mathrm{ln}S_i^2}{1 + \frac{1}{3(g-1)} [\sum_{i=1}^g \frac{1}{n_i-1} - \frac{1}{N-g}]} \sim \chi^2\]

Where \(n_i\) is the number of observations in the \(i\)-th group among \(g\) groups; \(S_i^2\) is the variance of group \(i\); \(N=\sum_{i=1}^g n_i\); and \(S_p^2 = (N-g)^{-1} \sum_{i=1}^g (n_i-1)S_i^2\). This statistic follows a \(\chi^2\) distribution with \(g-1\) degrees of freedom.
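A minimal sketch of this statistic is given below. It takes \(S_i^2\) to be the unbiased sample variance and uses a correction term in which \(1/(N-g)\) is subtracted once from the sum of \(1/(n_i-1)\). The function name `bartlett_statistic` and the data are illustrative assumptions.

```python
import math
import statistics

def bartlett_statistic(groups):
    """Bartlett's statistic for a list of groups."""
    g = len(groups)
    sizes = [len(x) for x in groups]
    n_total = sum(sizes)
    variances = [statistics.variance(x) for x in groups]  # S_i^2, unbiased

    # Pooled variance S_p^2
    pooled = sum((n - 1) * v for n, v in zip(sizes, variances)) / (n_total - g)
    numerator = (n_total - g) * math.log(pooled) - sum(
        (n - 1) * math.log(v) for n, v in zip(sizes, variances)
    )
    correction = 1 + (sum(1 / (n - 1) for n in sizes) - 1 / (n_total - g)) / (3 * (g - 1))
    return numerator / correction

same_spread = [[1, 2, 3, 4], [11, 12, 13, 14], [21, 22, 23, 24]]
mixed_spread = [[1, 2, 3, 4], [10, 20, 30, 40], [21, 22, 23, 24]]
stat_same = bartlett_statistic(same_spread)
stat_mixed = bartlett_statistic(mixed_spread)
```

Groups that differ only by a shift have identical variances, so `stat_same` is zero up to rounding, while `stat_mixed` is clearly positive.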

4.2.4.2.12. Levene test

A statistical approach proposed by Levene for testing homoscedasticity, i.e. whether samples are drawn from populations with equal variances. It is used as an alternative to Bartlett’s test because of its robustness.

\(H_0\):

Samples of all groups are of the same variance

\(H_1\):

Samples of all groups are not of the same variance

statistic:

(4.21)\[s = \frac{N-g}{g-1}\cdot\frac{\sum_{i=1}^g n_i ({z}_{i,\cdot}-{z}_{\cdot,\cdot})^2}{\sum_{i=1}^g \sum_{j=1}^{n_i} ({z}_{i,j}-{z}_{i,\cdot})^2}\]

Where \(z_{i,j}\) is the absolute distance from the \(j\)-th observation of the \(i\)-th group to the mean (trimmed or not) or the median of that group. \(n_i\) is the number of observations in the \(i\)-th group, among \(g\) groups. The group mean \(z_{i,\cdot}\) is defined as \(n_i^{-1} \sum_{j=1}^{n_i} z_{i,j}\). The grand mean \(z_{\cdot, \cdot}\) is defined as \((\sum_{i=1}^g n_i)^{-1} \cdot \sum_{i=1}^{g} \sum_{j=1}^{n_i} z_{i,j}\).

4.2.4.2.13. Fligner-Killeen test

A non-parametric alternative to Bartlett’s test for testing homoscedasticity. It performs well when observations are not normally distributed or when outliers exist.

\(H_0\):

Samples of all groups are of the same variance

\(H_1\):

Samples of all groups are not of the same variance

statistic:

(4.22)\[s = \frac{\sum_{i=1}^g n_i (\bar{z}_i - \bar{z})^2}{s^2}\]

Where \(\bar{z}_i\) is the mean of \(z\) scores of \(i\)-th group among \(g\) groups. \(\bar{z}\) and \(s^2\) are grand mean and variance of all \(z\) scores. \(n_i\) is the number of observations for \(i\)-th group.

4.2.4.2.14. Anderson-Darling test

A statistical approach proposed by Theodore Wilbur Anderson and Donald A. Darling [Anderson1952] to determine whether a given set of observations is drawn from a given probability distribution. The K-sample Anderson-Darling test measures whether several groups of observations come from a single distribution. For the K-sample Anderson-Darling test:

\(H_0\):

Samples of all groups are drawn from the same population

\(H_1\):

Samples of all groups are not drawn from the same population

statistic:

(4.23)\[s = \frac{1}{N} \sum_{i=1}^{g} \frac{1}{n_i} \sum_{j=1}^{N-1} \frac{(N\cdot{M}_{i,j}-j \cdot n_i)^2}{j(N-j)}\]

Where \(n_i\) is the number of observations in the \(i\)-th group, among \(g\) groups. \(N\) is the total number of observations (\(N = \sum_{i=1}^g n_i\)). \(M_{i,j}\) is the number of observations in the \(i\)-th group that are less than or equal to \(x_{(j)}\), the \(j\)-th smallest value of the pooled set \(\{x_1, x_2, \dots, x_N\}\).

4.2.4.2.15. Wilcoxon rank test

Frank Wilcoxon first proposed in 1945 [Wilcoxon1992] to use ranks instead of the values themselves to compare groups. The approach includes unpaired and paired methods: the unpaired one is known as the rank sum test, while the paired one is known as the signed-rank test. For the rank sum test:

\(H_0\):

Samples of two groups are drawn from the same distribution

\(H_1\):

Samples of two groups are not drawn from the same distribution

statistic:

(4.24)\[s = \sum_{j=1}^{n_1} {r}_{1, j}\]

Where \(n_1\) is the number of observations in the first group, and \(r_{1, j}\) is the rank of the \(j\)-th observation of the first group in the pooled set of the two groups.

Additionally, for the signed-rank version, the statistic is:

(4.25)\[s = \sum_{i=1}^{n} \mathrm{sgn}(x_i - y_i) r_i\]

Where \(r_i\) is the rank of \(i\)-th item in the set of \(\{|x_1 - y_1|, |x_2 - y_2|, \dots, |x_n - y_n|\}\).
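Both statistics (Equations 4.24 and 4.25) can be sketched with a small ranking helper. Tied values receive their average rank, which is an assumption of this sketch. Zero differences are kept and simply contribute nothing, because their sign is 0. The names `average_ranks`, `rank_sum` and `signed_rank` are illustrative.

```python
def average_ranks(values):
    """1-based ascending ranks; tied values share their average rank."""
    ordered = sorted(values)
    return [ordered.index(v) + (ordered.count(v) + 1) / 2 for v in values]

def sign(v):
    """Return -1, 0 or 1 according to the sign of v."""
    return (v > 0) - (v < 0)

def rank_sum(x, y):
    """Unpaired case: sum of the pooled ranks of the first group."""
    ranks = average_ranks(list(x) + list(y))
    return sum(ranks[:len(x)])

def signed_rank(x, y):
    """Paired case: sum of the signed ranks of the differences x_i - y_i."""
    diffs = [a - b for a, b in zip(x, y)]
    ranks = average_ranks([abs(d) for d in diffs])
    return sum(sign(d) * r for d, r in zip(diffs, ranks))

w_unpaired = rank_sum([1, 3, 5], [2, 4, 6])
w_paired = signed_rank([10, 12, 15], [8, 13, 11])
```

In the unpaired example the first group holds the pooled ranks 1, 3 and 5, giving 9. In the paired example the differences 2, -1 and 4 have ranks 2, 1 and 3, giving \(2 - 1 + 3 = 4\).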

4.2.4.2.16. Epps-Singleton test

The method suggested by Epps and Singleton [Epps1986] uses the empirical characteristic function \(g\) instead of the observed distribution \(F\) for the test. It does not require the underlying distributions to be continuous, so it applies to both continuous and discrete distributions.

\(H_0\):

Samples of two groups are of the same underlying distribution; \(g_1 = g_2\)

\(H_1\):

Samples of two groups are not of the same underlying distribution; \(g_1 \neq g_2\)

statistic:

(4.26)\[s = \sqrt{n_1+n_2} \cdot (g_1-g_2) \sim \mathcal{N}(\boldsymbol{0}, \boldsymbol{\Omega})\]

Where \(n_1\) and \(n_2\) are the numbers of observations in the two groups. \(g\) is the characteristic function, defined as the Fourier transform of the observed distribution \(F\) (\(g_i = \int_{-\infty}^{\infty} e^{itx} dF_{n_i}(x) = n_i^{-1} \sum_{j=1}^{n_i} e^{itX_{ij}}\)). The item \(g(X_{ij}) = e^{itX_{ij}}\) is expanded via Euler’s formula into a 4-dimensional real vector. This statistic asymptotically follows a multivariate gaussian \(\mathcal{N}(\boldsymbol{0}, \boldsymbol{\Omega})\). \(\boldsymbol{\Omega}\) can be estimated as \(\hat{\boldsymbol{\Omega}}=\sum_{i=1}^{2}[(n_{i}-1)(\sum_{k=1}^{2}n_k)/n_{i}^2]\mathrm{cov}\{g(X_{ij})\}\).

4.2.4.2.17. Mann–Whitney U test

A non-parametric method proposed by Henry B. Mann and Donald R. Whitney [Mann1947] to measure the difference between two i.i.d. samples drawn from two populations.

\(H_0\):

Samples of two groups are drawn from the same distribution

\(H_1\):

Samples of two groups are not drawn from the same distribution

statistic:

(4.27)\[s = \sum_{i=1}^{n} \sum_{j=1}^{m} S(x_i, y_j)\]

Where \(n\) and \(m\) are numbers of observations of two groups. \(S(x_i, y_j)\) is 1 if \(x_i > y_j\); 0.5 if \(x_i = y_j\); and 0 if \(x_i < y_j\).
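Equation 4.27 translates almost literally into code. The following sketch counts the pairs directly; the function name `mann_whitney_u` and the data are illustrative assumptions.

```python
def mann_whitney_u(x, y):
    """U statistic of the first sample: pairs with x_i > y_j count 1, ties 0.5."""
    def score(a, b):
        if a > b:
            return 1.0
        return 0.5 if a == b else 0.0
    return sum(score(a, b) for a in x for b in y)

u = mann_whitney_u([1, 3, 5], [2, 4, 6])
u_reverse = mann_whitney_u([2, 4, 6], [1, 3, 5])
```

The two statistics always add up to the number of pairs \(n \cdot m\); here \(3 + 6 = 9\).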

4.2.4.2.18. Brunner-Munzel test

A test with power equal to or even greater than that of the Mann–Whitney U test, proposed by Brunner and Munzel [Brunner2000, Karch2021].

\(H_0\):

for observations from two populations \(X\) and \(Y\), \(P(X > Y) = P(X < Y)\)

\(H_1\):

\(P(X > Y) \neq P(X < Y)\)

statistic:

(4.28)\[s = \frac{U}{n_1 \cdot n_2}\]

Where \(U\) is the statistic of the U test, and \(n_1\) and \(n_2\) are the numbers of observations in the two groups. This ratio estimates \(P(X > Y) + 0.5 \cdot P(X = Y)\).

4.2.4.2.19. Ansari-Bradley test

Also known as the dispersion test, first proposed by Ansari and Bradley [Ansari1960] to measure the difference in scale between two groups of samples.

\(H_0\):

for two populations with scales \(\sigma_x\) and \(\sigma_y\), \(\sigma_x = \sigma_y\)

\(H_1\):

\(\sigma_x \neq \sigma_y\)

statistic:

(4.29)\[s = \sum_{i=1}^{n_x} {r}_{x_i}\]

Where \(n_x\) is the number of observations for \(x\), \(r_{x_i}\) is the rank assigned to \(x_i\) in pooled set of \(x\) and \(y\).

4.2.4.2.20. Skew test

A test, suggested by D’Agostino et al., to quantify the extent to which the skewness of the data distribution departs from that of a standard uni-variate gaussian:

\(H_0\):

skewness of data is of the same as that of standard uni-variate gaussian

\(H_1\):

skewness of data is not of the same as that of standard uni-variate gaussian

statistic:

(4.30)\[s = \delta \cdot \log{[\frac{y}{\alpha} + \sqrt{(\frac{y}{\alpha})^2 + 1}]}\]

Where, with \(n\) the number of samples, \(\delta\), \(y\) and \(\alpha\) are:

(4.31)\[\delta = \frac{1}{\sqrt{0.5 \cdot \log{W_2}}}\]
(4.32)\[y = \frac{b_2}{\sqrt{\frac{6(n-2)}{(n+1)(n+3)}}}\]
(4.33)\[\alpha = \sqrt{\frac{2}{W_2 -1}}\]

And for \(W_2\) and \(b_2\) (the skewness, i.e. the mean of \(z^3\) over the \(z\) scores):

(4.34)\[\begin{split}W_2 &= -1 + \sqrt{2(\beta_2 - 1)} \\ \mathrm{for}\ \beta_2 &= \frac{3(n^2+27n-70)(n+1)(n+3)}{(n-2)(n+5)(n+7)(n+9)}\end{split}\]
(4.35)\[b_2 = \frac{\sum_{i=1}^{n} z_i^3}{n}\]
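The chain of definitions in Equations 4.31 to 4.35 can be followed step by step in code. The sketch below assumes that the \(z\) scores use the population standard deviation (dividing by \(n\)) and that \(\delta\) multiplies the logarithm, as in D’Agostino's transformation. The function name `skew_test_statistic` and the data are illustrative, and the approximation is intended for samples of at least 8 observations.

```python
import math

def skew_test_statistic(data):
    """Skew test statistic built from Equations 4.31 to 4.35."""
    n = len(data)
    mean = sum(data) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in data) / n)  # population std (assumption)
    b2 = sum(((x - mean) / std) ** 3 for x in data) / n      # skewness of the z scores

    y = b2 / math.sqrt(6 * (n - 2) / ((n + 1) * (n + 3)))
    beta2 = (3 * (n ** 2 + 27 * n - 70) * (n + 1) * (n + 3)
             / ((n - 2) * (n + 5) * (n + 7) * (n + 9)))
    w2 = -1 + math.sqrt(2 * (beta2 - 1))
    delta = 1 / math.sqrt(0.5 * math.log(w2))
    alpha = math.sqrt(2 / (w2 - 1))
    # delta scales the log term, which equals asinh(y / alpha)
    return delta * math.log(y / alpha + math.sqrt((y / alpha) ** 2 + 1))

symmetric = [1, 2, 3, 4, 5, 6, 7, 8, 9]
right_skewed = [1, 1, 1, 2, 2, 3, 4, 8, 20]
z_symmetric = skew_test_statistic(symmetric)
z_skewed = skew_test_statistic(right_skewed)
```

Symmetric data has zero skewness, so its statistic is zero up to rounding, while the right-skewed sample yields a positive statistic.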

4.2.4.2.21. Kurtosis test

A test, suggested by Anscombe et al., to quantify the extent to which the kurtosis of the data distribution departs from that of a standard uni-variate gaussian:

\(H_0\):

kurtosis of data is of the same as that of standard uni-variate gaussian

\(H_1\):

kurtosis of data is not of the same as that of standard uni-variate gaussian

statistic:

(4.36)\[s = (T_1 - T_2) \cdot \sqrt{\frac{9A}{2}}\]

Where for \(T_1\), \(T_2\) and \(A\):

(4.37)\[T_1 = 1 - \frac{2}{9A}\]
(4.38)\[T_2 = \mathrm{sgn}(D) \cdot [\frac{(1-\frac{2}{A})}{\mid D \mid}]^{\frac{1}{3}}\]
(4.39)\[A = 6 + \frac{8}{\surd b_1} (\frac{2}{\surd b_1} + \sqrt{1 + \frac{4}{{\surd b_1}^2}})\]

Here \(n\) is the number of samples, and \(D\), \(\surd b_1\), and the intermediate variable \(x\) are:

(4.40)\[D = 1 + x \cdot \sqrt{\frac{2}{A-4}}\]
(4.41)\[\surd b_1 = \frac{6(n^2-5n+2)}{(n+7)(n+9)} \cdot \sqrt{\frac{6(n+3)(n+5)}{n(n-2)(n-3)}}\]
(4.42)\[x = \frac{b_2 - E}{\sqrt{V}}\]

Where \(b_2\) is the kurtosis of the \(z\) scores; \(E = 3(n-1)/(n+1)\); and \(V = [24n(n-2)(n-3)]/[(n+1)^2(n+3)(n+5)]\).

4.2.4.2.22. Jarque-Bera test

A statistical approach proposed by Carlos Jarque and Anil K. Bera [Jarque1980] to test the goodness of fit of samples to a standard uni-variate gaussian. It works well only with a large number of observations.

\(H_0\):

sample has the skewness and kurtosis matching the standard uni-variate gaussian

\(H_1\):

sample has the skewness and kurtosis not matching the standard uni-variate gaussian

statistic:

(4.43)\[s = \frac{n}{6} [S^2 + \frac{1}{4} (K-3)^2]\]

Where \(S\) is the skewness (\(b_2\) in Equation 4.35) and \(K\) is the kurtosis of the samples (\(b_2\) in Equation 4.42).
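A minimal sketch of Equation 4.43 follows. It assumes that skewness and kurtosis are computed from central moments that divide by \(n\). The function name `jarque_bera` and the data are illustrative.

```python
def jarque_bera(data):
    """Jarque-Bera statistic from the sample skewness S and kurtosis K."""
    n = len(data)
    mean = sum(data) / n
    # Central moments, each dividing by n (assumption of this sketch)
    m2 = sum((x - mean) ** 2 for x in data) / n
    m3 = sum((x - mean) ** 3 for x in data) / n
    m4 = sum((x - mean) ** 4 for x in data) / n
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2
    return n / 6 * (skewness ** 2 + 0.25 * (kurtosis - 3) ** 2)

jb = jarque_bera([1, 2, 3, 4, 5])
```

For this sample \(S = 0\) and \(K = 1.7\), so the statistic is \(\frac{5}{6} \cdot \frac{1}{4} \cdot 1.3^2 \approx 0.352\). Five observations are used only to keep the arithmetic visible; as stated above, the test itself needs a large sample.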

4.2.4.2.23. Cressie-Read power divergence test

Test proposed by Read and Cressie [Read1988] to determine whether the samples match the given categorical frequencies.

\(H_0\):

observations match the given categorical frequencies

\(H_1\):

observations do not match the given categorical frequencies

statistic:

(4.44)\[s = \frac{2}{\lambda (\lambda + 1)} \sum_{i=1}^k o_i [(\frac{o_i}{e_i})^\lambda - 1]\]

Where \(\lambda\) is a user-defined real-valued parameter, \(k\) is the number of categories of the categorical distribution, and \(o_i\) and \(e_i\) are the observed and expected frequencies of the \(i\)-th category, respectively.
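The following sketch evaluates Equation 4.44 and compares it with Pearson's chi-squared statistic, which it reproduces for \(\lambda = 1\) when observed and expected frequencies have the same total. The values \(\lambda = 0\) and \(\lambda = -1\) require limiting forms and are not handled here. The function names and the data are illustrative assumptions.

```python
def power_divergence(observed, expected, lam):
    """Cressie-Read statistic for lam not in {0, -1}."""
    return 2 / (lam * (lam + 1)) * sum(
        o * ((o / e) ** lam - 1) for o, e in zip(observed, expected)
    )

def pearson_chi2(observed, expected):
    """Pearson's chi-squared statistic for one-way frequencies."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

observed = [10, 20, 30]
expected = [20, 20, 20]
cr = power_divergence(observed, expected, lam=1)
chi2 = pearson_chi2(observed, expected)
```

Both statistics equal 10 for this data.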

4.2.4.2.24. Chi-Squared test

Pearson’s chi-squared test [Pearson1900] determines whether the samples match the given categorical frequencies. If prior frequencies are not given, the expected frequencies are calculated from the data.

\(H_0\):

observations match the expected categorical frequencies

\(H_1\):

observations do not match the expected categorical frequencies

statistic:

(4.45)\[s = \sum_{i=1}^{n} \sum_{j=1}^{m} \frac{({o}_{i,j}-{e}_{i,j})^2}{{e}_{i,j}} \sim \chi^2\]

Assume there are \(n\) options of variable 1 coupled with \(m\) options of variable 2. \(o_{i,j}\) is the observed frequency of \(i\)-th option in variable 1 and \(j\)-th option in variable 2. \(e_{i,j}\) is calculated from:

(4.46)\[{e}_{i,j} = \frac{\sum_{i=1}^{n} {o}_{i,j} \cdot \sum_{j=1}^{m} {o}_{i,j}}{\sum_{i=1}^{n} \sum_{j=1}^{m} {o}_{i,j}}\]

This statistic obeys a \(\chi^2\) distribution with degree of freedom \((n-1)\cdot(m-1)\).
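Equations 4.45 and 4.46 can be sketched for a two-way table as follows. Each expected frequency is the product of its row total and column total divided by the grand total. The function name `chi_squared_independence` and the table are illustrative assumptions.

```python
def chi_squared_independence(table):
    """Return (statistic, degrees of freedom, expected table) for a two-way table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)

    # Expected frequency: row total times column total over the grand total
    expected = [[r * c / total for c in col_totals] for r in row_totals]
    stat = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, dof, expected

stat, dof, expected = chi_squared_independence([[10, 20], [30, 40]])
```

For this table the expected frequencies are 12, 18, 28 and 42, the statistic is \(50/63\), and there is 1 degree of freedom.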

4.2.4.2.25. Pearson correlation coefficient

The Pearson correlation coefficient (PCC, or PPMCC) is a statistical approach to measure the linear relation between two factors in observed samples.

\(H_0\):

two factors of observed samples are uncorrelated

\(H_1\):

two factors of observed samples are not uncorrelated

statistic:

(4.47)\[s = \frac{\mathrm{cov} (X, Y)}{ \rho_X \rho_Y} = \frac{\mathbb{E}[(X-\mu_X)(Y-\mu_Y)]}{\rho_X \rho_Y}\]

Note that \(\rho_X\) and \(\rho_Y\) are the standard deviations of the two factors \(X\) and \(Y\); \(\mathbb{E}[(X-\mu_X)(Y-\mu_Y)]\) can be calculated as \(\mathbb{E}[XY]-\mathbb{E}[X]\mathbb{E}[Y]\).

4.2.4.2.26. Spearman correlation coefficient

Non-parametric version of Pearson correlation coefficient using ranks of values instead of values themselves when computing coefficient.

\(H_0\):

factors of observed samples are uncorrelated

\(H_1\):

factors of observed samples are not uncorrelated

statistic:

(4.48)\[s = \frac{\mathrm{cov} (R(X), R(Y))}{ \rho_{R(X)} \rho_{R(Y)}} = \frac{\mathbb{E}[(R(X)-\mu_{R(X)}) (R(Y)-\mu_{R(Y)})]}{\rho_{R(X)} \rho_{R(Y)}}\]

\(R(X)\) and \(R(Y)\) are the rank of \(X\) and \(Y\) series, respectively.

4.2.4.2.27. Kendall’s tau correlation coefficient

The statistic \(\tau\) measures the difference between the numbers of concordant and discordant pairs as a fraction of the total number of pairs. A pair \((x_i, y_i)\) and \((x_j, y_j)\) with \(i < j\) is concordant when \(\mathrm{sgn} (x_i - x_j) \mathrm{sgn} (y_i - y_j) > 0\). This method was developed by Kendall in 1938 [Kendall1938].

\(H_0\):

factors of observed samples are uncorrelated

\(H_1\):

factors of observed samples are not uncorrelated

statistic:

(4.49)\[s = \frac{2}{n(n-1)} \sum_{i < j} \mathrm{sgn} (x_i - x_j) \mathrm{sgn} (y_i - y_j)\]
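The statistic can be sketched by visiting every pair \(i < j\) once and dividing the signed count by the number of pairs, \(n(n-1)/2\). Ties are not treated specially in this sketch. The function names and the data are illustrative assumptions.

```python
def sign(v):
    """Return -1, 0 or 1 according to the sign of v."""
    return (v > 0) - (v < 0)

def kendall_tau(x, y):
    """Concordant minus discordant pairs, divided by the number of pairs."""
    n = len(x)
    total = sum(
        sign(x[i] - x[j]) * sign(y[i] - y[j])
        for i in range(n) for j in range(i + 1, n)
    )
    return total / (n * (n - 1) / 2)

tau = kendall_tau([1, 2, 3, 4], [1, 3, 2, 4])
```

Of the 6 pairs in this example, 5 are concordant and 1 is discordant, so the statistic is \((5 - 1)/6 = 2/3\).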

4.2.4.2.28. Friedman test

A test to measure whether repeated samples of the same individuals have the same distribution. This method was first proposed by Milton Friedman [Friedman1937].

\(H_0\):

repeated samples have the same distribution

\(H_1\):

repeated samples do not have the same distribution

statistic:

(4.50)\[s = \frac{12n}{g(g+1)} \sum_{j=1}^{g} (\bar{r}_{\cdot j} - \frac{g+1}{2})^2 \sim \chi_{g-1}^2\]

Assume the data are organized in \(\boldsymbol{X} \in \mathbb{R}^{n \times g}\), where \(n\) is the number of observations and \(g\) is the number of factors. There is also a rank matrix \(\boldsymbol{R} \in \mathbb{Z}^{+\ n \times g}\) where \(r_{i j}\) is the rank of \(x_{i j}\) in \(x_{i \cdot} = \{x_{i 1}, x_{i 2}, \dots, x_{i g}\}\). \(\bar{r}_{\cdot j}\) is defined as \(n^{-1}\sum_{i=1}^{n} r_{i j}\). This statistic follows a \(\chi^2\) distribution with \(g-1\) degrees of freedom.
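Equation 4.50 can be sketched as follows. Each row is ranked separately and the column means of the ranks are compared with \((g+1)/2\). The sketch assumes there are no ties within a row. The function name `friedman_statistic` and the data are illustrative.

```python
def friedman_statistic(rows):
    """rows[i][j] is the observation of individual i under factor j."""
    n, g = len(rows), len(rows[0])
    # Rank the values within each row (no ties assumed in this sketch)
    ranks = [[sorted(row).index(v) + 1 for v in row] for row in rows]
    mean_ranks = [sum(r[j] for r in ranks) / n for j in range(g)]
    return 12 * n / (g * (g + 1)) * sum((m - (g + 1) / 2) ** 2 for m in mean_ranks)

consistent = [[1.0, 2.0, 3.0], [1.5, 2.5, 3.5], [0.9, 1.9, 2.9]]
balanced = [[1, 2, 3], [2, 3, 1], [3, 1, 2]]
q_consistent = friedman_statistic(consistent)
q_balanced = friedman_statistic(balanced)
```

When every individual orders the factors identically, the mean ranks are 1, 2 and 3 and the statistic is \(\frac{12 \cdot 3}{3 \cdot 4} \cdot 2 = 6\). When the ranks are balanced across factors, the statistic is 0.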

4.2.4.2.29. Multiscale Graph Correlation test

The method proposed by Vogelstein et al. to quantify the correlation between two high-dimensional observations. Refer to the section Multi Graph Correlation for algorithm details.

\(H_0\):

two high-dimensional data \(\boldsymbol{X}\) and \(\boldsymbol{Y}\) are independent

\(H_1\):

two high-dimensional data \(\boldsymbol{X}\) and \(\boldsymbol{Y}\) are not independent

statistic:

(4.51)\[s = f(M_X, M_Y)\]

Where \(M_X\) and \(M_Y\) are distance matrices for \(X\) and \(Y\) respectively.

4.2.4.2.30. Monte Carlo hypothesis test

Tests whether data vary significantly from a specified distribution by comparing them to a pseudo data set generated by simulation.

\(H_0\):

test data are randomly sampled from specified distribution

\(H_1\):

test data are not randomly sampled from specified distribution

statistic:

(4.52)\[s = f({s}_{agg}(X), {s}_{agg}({X}_{d}^\prime))\]

Where the aggregation function \(s_{agg}\) is the statistic predefined by the user, and \(X_{d}^\prime\) is the random sample generated from the user-defined distribution \(d\).

4.2.4.2.31. Permutation test

A statistical simulation to test whether two or more groups of data have the same underlying distribution.

\(H_0\):

test \(n\ (n \geq 2)\) groups of data are randomly sampled from the same distribution

\(H_1\):

test \(n\ (n \geq 2)\) groups of data are not randomly sampled from the same distribution

statistic:

(4.53)\[s = f({s}_{agg}(X_1), {s}_{agg}(X_2), \dots, {s}_{agg}(X_n))\]

Where the aggregation function \(s_{agg}\) is predefined by user.
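As a minimal sketch for two groups, the code below takes the mean as the aggregation function \(s_{agg}\) and the absolute difference as \(f\), and enumerates every possible regrouping of the pooled data instead of sampling permutations at random. These choices, the function name `permutation_p_value` and the data are assumptions made for illustration. Exhaustive enumeration is only practical for small samples.

```python
from itertools import combinations

def permutation_p_value(x, y):
    """Exact two-group permutation test with |mean(x) - mean(y)| as the statistic."""
    pooled = list(x) + list(y)
    n = len(x)

    def stat(first, second):
        return abs(sum(first) / len(first) - sum(second) / len(second))

    observed = stat(x, y)
    hits = total = 0
    # Every way of choosing which pooled observations form the first group
    for idx in combinations(range(len(pooled)), n):
        chosen = set(idx)
        first = [pooled[i] for i in chosen]
        second = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        total += 1
        # Small tolerance guards against floating-point rounding
        if stat(first, second) >= observed - 1e-12:
            hits += 1
    return hits / total

p = permutation_p_value([1, 2, 3], [4, 5, 6])
```

Of the 20 possible regroupings, only the observed split and its mirror image reach a difference of 3, so the \(p\)-value is \(2/20 = 0.1\).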


Authors:

Chen Zhang

Version:

0.0.5

Created on:

May 26, 2023