Commit b55a343b by Eric Coissac

First version of the abstract

parent 762805dc
\def\mode{1}% Class bioinfo if 0; simple article otherwise
\if 0\mode
\documentclass{bioinfo}%
\newcommand\corresp[1]{}
\newcommand\authorname[2]{\author{#2}}
\newcommand\maintitle[2]{\title{#2}}
\newcommand\processtable[3]{\caption{#1}%
#2}
\usepackage{hyperref}
\usepackage{natbib}
\usepackage{amssymb}
\usepackage{amsmath}
\usepackage{xcolor}
\usepackage{graphicx}
\usepackage{lineno}
\newenvironment{methods}{}{}
\newenvironment{knitrout}{}{}
\definecolor{fgcolor}{rgb}{0.345, 0.345, 0.345}
\newcommand\maxwidth{0.8\textwidth}
\fi
\usepackage{amsmath}
\usepackage{lipsum}
\usepackage{multirow}
\DeclareMathOperator{\rpearson}{R}
\maketitle
\fi
\abstract{\textbf{Motivation:} Molecular biology and ecology are producing many high-dimensional data sets.
Estimating correlation and shared variation between such data sets is an important step to disentangle the relationships among the different elements of a biological system. Unfortunately, because of the high dimension of the data, classical correlation measures can falsely infer high correlation.\\
\textbf{Results:} Here we propose a corrected version of the Procrustean correlation coefficient that is not sensitive to the high dimension of the data. This allows for a correct estimation of the shared variation between two data sets and of the partial correlation coefficients between a set of matrix data.\\
\textbf{Availability:} The proposed corrected coefficients are implemented in the ProcMod R package available at \url{https://git.metabarcoding.org/lecasofts/ProcMod}\\
\textbf{Contact:} \href{mailto:eric.coissac@metabarcoding.org}{eric.coissac@metabarcoding.org}}
\if 0\mode
\maketitle
<<setup, include=FALSE>>=
# Chunk delimiters reconstructed; the chunk name is a placeholder.
library(doParallel)  # provides registerDoParallel()
library(energy)
library(ProcMod)
library(vegan)
registerDoParallel(4)
@
Multidimensional data and even high-dimensional data, where the number of variables describing each sample is far larger than the sample count, are now regularly produced in functional genomics (\emph{e.g.} transcriptomics, proteomics or metabolomics) and molecular ecology (\emph{e.g.} DNA metabarcoding). Using various techniques, the same sample set can be described by several multidimensional data sets, each of them describing a different facet of the samples. This invites using data analysis methods able to evaluate mutual information shared by these different descriptions. Correlative approaches can be a first and simple way to decipher pairwise relationships of those data sets.
Several coefficients have long been proposed to measure the correlation between two matrices \citep[for a comprehensive review see][]{Ramsay:84:00}. But when applied to high-dimensional data, they suffer from over-fitting, leading them to estimate a high correlation even for unrelated data sets. Modified versions of some of these matrix correlation coefficients have already been proposed to tackle this problem. The $\rv_2$ coefficient \citep{Smilde:09:00} corrects the original $\rv$ coefficient \citep{Escoufier:73:00} for over-fitting. Similarly, a modified version of the distance correlation coefficient $\dcor$ \citep{Szekely:07:00} has been proposed by \cite{SzeKely:13:00}. $\dcor$ has the advantage over the other correlation coefficients of not considering only linear relationships. Here we focus on the Procrustes correlation coefficient $\rls$ proposed by \cite{Lingoes:74:00} and by \cite{Gower:71:00}. Let us define $\trace$, a function summing the diagonal elements of a matrix. For an $n \times p$ real matrix $\X$ and a second $n \times q$ real matrix $\Y$, defining respectively two sets of $p$ and $q$ centered variables characterizing $n$ individuals, we define $\covls(\X,\Y)$, an analog of covariance applicable to vectorial data, following Equation~(\ref{eq:CovLs})
\begin{equation}
\covls(\X,\Y) = \frac{\trace((\mathbf{XX}'\mathbf{YY}')^{1/2})}{n-1}
\label{eq:CovLs}
\end{equation}
and $\varls(\X)$ as $\covls(\X,\X)$. $\rls$ can then be expressed as follows in Equation~(\ref{eq:Rls})
\begin{equation}
\rls(\X,\Y) = \frac{\covls(\X,\Y)}{\sqrt{\varls(\X)\,\varls(\Y)}}
\label{eq:Rls}
\end{equation}
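For illustration, both definitions can be sketched in a few lines of NumPy (a sketch only; the paper's own implementation is the ProcMod R package). The sketch computes $\trace((\mathbf{XX}'\mathbf{YY}')^{1/2})$ through the singular values of $\X'\Y$, since the nonzero eigenvalues of $\mathbf{XX}'\mathbf{YY}'$ are the squared singular values of $\X'\Y$:

```python
import numpy as np

def covls(X, Y):
    # CovLs(X, Y) = Trace((X X' Y Y')^{1/2}) / (n - 1), computed from the
    # singular values of X'Y: the nonzero eigenvalues of X X' Y Y' are the
    # squared singular values of X'Y.
    n = X.shape[0]
    return np.linalg.svd(X.T @ Y, compute_uv=False).sum() / (n - 1)

def varls(X):
    # VarLs(X) = CovLs(X, X)
    return covls(X, X)

def rls(X, Y):
    # Procrustes correlation, by direct analogy with Pearson's R.
    return covls(X, Y) / np.sqrt(varls(X) * varls(Y))
```

For $p = q = 1$ and centered vectors, `rls` reduces to the absolute value of Pearson's correlation coefficient.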
Among the advantages of $\rls$, its similarity with Pearson's correlation coefficient $\rpearson$ \citep{Bravais:44:00} has to be noticed. Considering $\covls(\X,\Y)$ and $\varls(\X)$, corresponding respectively to the covariance of two matrices and the variance of a matrix, Equation~(\ref{eq:Rls}) highlights the analogy between both correlation coefficients. Besides, when $p=1 \text{ and } q = 1,\; \rls = \lvert \rpearson \rvert$. When squared, $\rls$ estimates, like the squared Pearson's $\rpearson$, the amount of variation shared between the two data sets. This property allows for developing analysis of variance of matrix data sets.

Moreover, Procrustean analyses have been proposed as a good alternative to Mantel's statistics for analyzing ecological data summarized by distance matrices \citep{Peres-Neto:01:00}. In such analyses, distance matrices are projected into an orthogonal space using metric or non-metric multidimensional scaling, according to the geometrical properties of the distances used. Correlations are then estimated between these projections.
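The over-fitting effect that motivates the corrected coefficient can be reproduced in a few lines. In this sketch (NumPy and the matrix sizes are assumptions, not taken from the text), two pairs of independent random matrices are compared: with $p \ll n$ the Procrustes correlation stays moderate, while with $p \gg n$ it approaches $1$ even though the data sets are unrelated:

```python
import numpy as np

def rls(X, Y):
    # Procrustes correlation RLs, with CovLs computed from the singular
    # values of X'Y.
    n = X.shape[0]
    covls = lambda A, B: np.linalg.svd(A.T @ B, compute_uv=False).sum() / (n - 1)
    return covls(X, Y) / np.sqrt(covls(X, X) * covls(Y, Y))

rng = np.random.default_rng(42)
n = 10
# Independent matrices in both cases: any correlation is pure over-fitting.
low = rls(rng.normal(size=(n, 2)), rng.normal(size=(n, 2)))       # p, q << n
high = rls(rng.normal(size=(n, 200)), rng.normal(size=(n, 200)))  # p, q >> n
```

For $p, q \gg n$, $\mathbf{XX}'$ and $\mathbf{YY}'$ both approach scalar matrices, so `high` is close to $1$ although the data are unrelated.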
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
This expression illustrates that $\covls(\X,\Y)$ is actually a variance. Its informative counterpart is obtained by subtracting the Monte Carlo estimate $\overline{\rcovls(\X,\Y)}$ of the covariance expected for random matrices of the same dimensions, the difference being floored at zero
\begin{equation}
\icovls(\X,\Y) = \max\left(\covls(\X,\Y) - \overline{\rcovls(\X,\Y)},\, 0\right)
\label{eq:ICovLs}
\end{equation}
\sloppy Similarly, the informative counterpart of $\varls(\X)$ is defined as $\ivarls(\X)=\icovls(\X,\X)$, and $\irls(\X,\Y)$, the informative Procrustes correlation coefficient, as follows.
\begin{equation}
\irls(\X,\Y) = \frac{\icovls(\X,\Y)}{\sqrt{\ivarls(\X)\,\ivarls(\Y)}}
\label{eq:IRLs}
\end{equation}
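A possible sketch of the informative coefficients follows. It is a hypothetical reconstruction: the exact randomization scheme ProcMod uses to estimate $\overline{\rcovls(\X,\Y)}$ is not given in this excerpt, so standard-normal random matrices of the same shapes are assumed, and the subtraction is floored at zero so that $\icovls \geqslant 0$:

```python
import numpy as np

def covls(X, Y):
    # CovLs via the singular values of X'Y.
    n = X.shape[0]
    return np.linalg.svd(X.T @ Y, compute_uv=False).sum() / (n - 1)

def icovls(X, Y, k=100, rng=None):
    # Informative covariance: CovLs minus a Monte Carlo baseline estimated
    # on k pairs of random matrices of the same shapes, floored at zero.
    # Standard-normal draws are an assumption of this sketch.
    rng = np.random.default_rng(0) if rng is None else rng
    n, p = X.shape
    q = Y.shape[1]
    baseline = np.mean([covls(rng.normal(size=(n, p)),
                              rng.normal(size=(n, q)))
                        for _ in range(k)])
    return max(covls(X, Y) - baseline, 0.0)

def irls(X, Y, k=100):
    # Informative Procrustes correlation.
    denom = np.sqrt(icovls(X, X, k) * icovls(Y, Y, k))
    return icovls(X, Y, k) / denom if denom > 0 else 0.0
```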
Like $\rls(\X,\Y)$, $\irls(\X,\Y) \in [0;1]$, with the value $0$ corresponding to no correlation and the maximum value $1$ reached for two strictly homothetic data sets.
The corollary of the $\icovls(\X,\Y)$ and $\ivarls(\X)$ definitions is that $\icovls(\X,\Y) \geqslant 0$ and $\ivarls(\X) > 0$. Therefore, for $M=\{\mathbf{M}_1,\mathbf{M}_2,...,\mathbf{M}_k\}$, a set of $k$ matrices with the same number of rows, the informative covariance matrix $\mathbf{C}$, defined as $\mathbf{C}_{i,j} = \icovls(\mathbf{M}_i,\mathbf{M}_j)$, is positive definite and symmetric. This allows for defining the precision matrix $\mathbf{P}=\mathbf{C}^{-1}$ and the related partial correlation coefficient matrix $\irls_{partial}$ using Equation~(\ref{eq:IRls.partial})
\begin{equation}
\irls_{partial}(\mathbf{M}_i,\mathbf{M}_j) = \frac{\mathbf{P}_{i,j}}{\sqrt{\mathbf{P}_{i,i}\,\mathbf{P}_{j,j}}}
\label{eq:IRls.partial}
\end{equation}
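Equation~(\ref{eq:IRls.partial}) is the usual recipe of reading partial correlations off the inverse of a covariance matrix. A sketch for a small set of matrix data sets, assuming the informative covariance matrix $\mathbf{C}$ has already been computed (the example values of $\mathbf{C}$ are hypothetical):

```python
import numpy as np

def partial_irls(C):
    # Partial correlation matrix from the informative covariance matrix C:
    # P_ij / sqrt(P_ii P_jj), with the precision matrix P = C^{-1}.
    P = np.linalg.inv(C)
    d = np.sqrt(np.diag(P))
    return P / np.outer(d, d)

# Hypothetical 3-matrix example: C_ij would be ICovLs(M_i, M_j).
C = np.array([[1.0, 0.5, 0.2],
              [0.5, 1.0, 0.1],
              [0.2, 0.1, 1.0]])
R_partial = partial_irls(C)
```

Note that the textbook partial correlation carries a minus sign in front of $\mathbf{P}_{i,j}$; the sketch follows the sign convention of Equation~(\ref{eq:IRls.partial}) as printed.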
<<empirical_data_set, message=FALSE, warning=FALSE, include=FALSE>>=
data("MicroCurvulatum")
mc <- procmod.frame(bacteria = vegdist(decostand(MicroCurvulatum$bacteria,
method = "hellinger"),
method = "euclidean"),
fungi = vegdist(decostand(MicroCurvulatum$fungi,
method = "hellinger"),
method = "euclidean"),
plants = vegdist(MicroCurvulatum$plants > 0,
method = "jaccard"),
soil = scale(MicroCurvulatum$soil,
center=TRUE,
scale=TRUE))
@
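For readers outside R, the Hellinger step of the chunk above (the composition `vegdist(decostand(x, "hellinger"), method = "euclidean")`) amounts to Euclidean distances between square roots of row-relative abundances; a NumPy sketch:

```python
import numpy as np

def hellinger_distance(counts):
    # Euclidean distances between rows after the Hellinger standardization
    # sqrt(count / row_total), i.e. what the chunk above obtains with
    # vegdist(decostand(x, "hellinger"), method = "euclidean").
    rel = counts / counts.sum(axis=1, keepdims=True)
    H = np.sqrt(rel)
    diff = H[:, None, :] - H[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))
```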
\vspace*{1pt}
\end{methods}
}{\ } % <- we can add a footnote in the last curly braces
\end{table}
\sloppy Two main parameters can influence the Monte Carlo estimation of $\overline{\rcovls(\X,\Y)}$: the distribution used to generate the random matrices, and $k$, the number of random matrix pairs. Two very different distributions were tested to generate the random matrices: the normal and the exponential distributions. The first is symmetric whereas the second is not, with a high probability for small values and a long tail of large ones. Despite the use of these contrasted distributions, estimates of $\overline{\rcovls(\X,\Y)}$ and of $\sigma(\overline{\rcovls(\X,\Y)})$ are identical if we assume a normal distribution of the $\overline{\rcovls(\X,\Y)}$ estimator and a $0.95$ confidence interval of $\overline{\rcovls(\X,\Y)} \pm 2 \, \sigma(\overline{\rcovls(\X,\Y)})$ (Table~\ref{tab:mrcovls}).
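The robustness of the baseline to the generating distribution can be checked with a quick simulation. This is a sketch with arbitrary sizes ($n=20$, $p=q=10$, $k=100$) and centered unit-variance draws; the paper's exact protocol lives in the ProcMod code:

```python
import numpy as np

def covls(X, Y):
    # CovLs via the singular values of X'Y.
    n = X.shape[0]
    return np.linalg.svd(X.T @ Y, compute_uv=False).sum() / (n - 1)

def baseline(n, p, q, k, draw):
    # Monte Carlo estimate (mean, sd) of CovLs over k pairs of centered
    # random matrices produced by `draw`.
    vals = []
    for _ in range(k):
        X = draw((n, p)); X = X - X.mean(0)
        Y = draw((n, q)); Y = Y - Y.mean(0)
        vals.append(covls(X, Y))
    return float(np.mean(vals)), float(np.std(vals))

rng = np.random.default_rng(0)
m_norm, s_norm = baseline(20, 10, 10, 100, lambda s: rng.normal(size=s))
m_exp, s_exp = baseline(20, 10, 10, 100, lambda s: rng.exponential(size=s))
```

The unit-rate exponential has unit variance, so after centering both samplers feed the baseline on the same scale and the two estimates agree closely.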
\subsection{Relative sensitivity of $IRLs(X,Y)$ to overfitting}
\subsection{$P_{value}$ distribution under the null hypothesis}
As expected, $P_{values}$ of the $CovLs$ test based on the estimation of $\overline{RCovLs(X,Y)}$ are uniformly distributed under $H_0$, whatever the $p$ tested (Table~\ref{tab:alpha_pvalue}). This ensures that the probability of a $P_{value} \leqslant \alpha\text{-risk}$ is equal to the $\alpha\text{-risk}$. Moreover, $P_{values}$ of the $CovLs$ test are strongly linearly correlated with those of both other tests ($R^2=\Sexpr{round(cor(h0_alpha_tibble$Covls.test,h0_alpha_tibble$protest)^2,3)}$ and $R^2=\Sexpr{round(cor(h0_alpha_tibble$Covls.test,h0_alpha_tibble$procuste.rtest)^2,3)}$, respectively, for the correlation with \texttt{vegan::\-protest} and \texttt{ade4::\-procuste.rtest} $P_{values}$). The slopes of the corresponding linear models are respectively $\Sexpr{round(lm(h0_alpha_tibble$Covls.test~h0_alpha_tibble$protest)$coefficients[2],3)}$ and $\Sexpr{round(lm(h0_alpha_tibble$Covls.test~h0_alpha_tibble$procuste.rtest)$coefficients[2],3)}$.
\begin{table}[!t]
\processtable{$P_{values}$ of the Cramer-Von Mises test of conformity
of the distribution of the correlation test $P_{values}$ to $\mathcal{U}(0,1)$
under the null hypothesis.\label{tab:alpha_pvalue}}{
% Table body truncated in this excerpt.
\ }{\ }
\end{table}