Commit 0bbc9c4e by Eric Coissac

First corrections following E. Marcon comments

parent a50115d3
......@@ -4,3 +4,4 @@ main.blg
main.log
oldsty
SimulationRls.R
main_manuscrit.pdf
This is pdfTeX, Version 3.14159265-2.6-1.40.19 (TeX Live 2018) (preloaded format=pdflatex 2019.1.10) 18 APR 2019 22:22
This is pdfTeX, Version 3.14159265-2.6-1.40.19 (TeX Live 2018) (preloaded format=pdflatex 2019.1.10) 26 SEP 2019 17:35
entering extended mode
restricted \write18 enabled.
%&-line parsing enabled.
......@@ -167,12 +167,24 @@ Overfull \hbox (20.0pt too wide) in paragraph at lines 25--26
Package: inputenc 2018/04/06 v1.3b Input encoding file
\inpenc@prehook=\toks14
\inpenc@posthook=\toks15
) (./crop.sty
Package: crop 2001/11/16 v1.6 cropmarks (mf)
\CROP@width=\dimen102
\CROP@height=\dimen103
)
(/usr/local/texlive/2018/texmf-dist/tex/latex/crop/crop.sty
Package: crop 2003/05/20 v1.9 crop marks (mf)
\stockwidth=\dimen102
\stockheight=\dimen103
\CROP@index=\count80
(/usr/local/texlive/2018/texmf-dist/tex/latex/graphics/color.sty
Package: color 2016/07/10 v1.1e Standard LaTeX Color (DPC)
(/usr/local/texlive/2018/texmf-dist/tex/latex/graphics-cfg/color.cfg
File: color.cfg 2016/01/02 v1.6 sample color configuration
)
Package color Info: Driver file: pdftex.def on input line 147.
(/usr/local/texlive/2018/texmf-dist/tex/latex/graphics-def/pdftex.def
File: pdftex.def 2018/01/08 v1.0l Graphics/color driver for pdftex
))
(/usr/local/texlive/2018/texmf-dist/tex/latex/graphics/graphics.sty
Package: graphics 2017/06/25 v1.2c Standard LaTeX Graphics (DPC,SPQR)
......@@ -183,17 +195,13 @@ Package: trig 2016/01/03 v1.10 sin cos tan (DPC)
File: graphics.cfg 2016/06/04 v1.11 sample graphics configuration
)
Package graphics Info: Driver file: pdftex.def on input line 99.
(/usr/local/texlive/2018/texmf-dist/tex/latex/graphics-def/pdftex.def
File: pdftex.def 2018/01/08 v1.0l Graphics/color driver for pdftex
))
)
\CROP@offset=\count81
Package crop Info: Local config file crop.cfg used on input line 344.
Package crop Info: Local config file crop.cfg used on input line 605.
(./crop.cfg))
Package crop Info: using pdf(la)tex graphics driver on input line 31.
(./graphicx.sty
Package: graphicx 1999/02/16 v1.0f Enhanced LaTeX Graphics (DPC,SPQR)
(/usr/local/texlive/2018/texmf-dist/tex/xelatex/xetexconfig/crop.cfg))
(/usr/local/texlive/2018/texmf-dist/tex/latex/graphics/graphicx.sty
Package: graphicx 2017/06/01 v1.1a Enhanced LaTeX Graphics (DPC,SPQR)
(/usr/local/texlive/2018/texmf-dist/tex/latex/graphics/keyval.sty
Package: keyval 2014/10/28 v1.15 key=value parser (DPC)
......@@ -469,20 +477,15 @@ LaTeX Font Info: Redeclaring font encoding OMS on input line 713.
\mathdisplay@stack=\toks21
LaTeX Info: Redefining \[ on input line 2817.
LaTeX Info: Redefining \] on input line 2818.
) (./array.sty
Package: array 1998/05/13 v2.3m Tabular extension package (FMi)
) (/usr/local/texlive/2018/texmf-dist/tex/latex/tools/array.sty
Package: array 2018/04/07 v2.4g Tabular extension package (FMi)
\col@sep=\dimen115
\ar@mcellbox=\box28
\extrarowheight=\dimen116
\NC@list=\toks22
\extratabsurround=\skip44
\backup@length=\skip45
) (./color.sty
Package: color 1999/02/16 v1.0i Standard LaTeX Color (DPC)
(/usr/local/texlive/2018/texmf-dist/tex/latex/graphics-cfg/color.cfg
File: color.cfg 2016/01/02 v1.6 sample color configuration
)
Package color Info: Driver file: pdftex.def on input line 125.
\ar@cellbox=\box29
)
(/usr/local/texlive/2018/texmf-dist/tex/latex/amsfonts/amssymb.sty
Package: amssymb 2013/01/14 v3.01 AMS font symbols
......@@ -1250,10 +1253,10 @@ l.56 ...\leftrightharpoons} {\mathrel}{AMSa}{"0B}
(That makes 100 errors; please try again.)
Here is how much of TeX's memory you used:
1293 strings out of 492649
15663 string characters out of 6129622
79994 words of memory out of 5000000
5217 multiletter control sequences out of 15000+600000
1339 strings out of 492649
16965 string characters out of 6129622
80604 words of memory out of 5000000
5259 multiletter control sequences out of 15000+600000
3640 words of font info for 14 fonts, out of 8000000 for 9000
1141 hyphenation exceptions out of 8191
27i,1n,27p,232b,35s stack positions out of 5000i,500n,10000p,200000b,80000s
......
......@@ -78,8 +78,8 @@ $^{\text{\sf 2}}$Department, Institution, City, Post Code, Country.}
\maketitle
\fi
\abstract{\textbf{Motivation:} Molecular biology and ecology are producing many high dimension data.
Estimating correlation and shared variation between such data sets is an important step to distangle relationships among differents elements of a biological system. Unfortunatly when using classical measure, because of the high dimension of the data, high correlation can be falsly infered. \\
\abstract{\textbf{Motivation:} Molecular biology and ecology produce many high-dimensional data sets.
Estimating correlation and shared variation between such data sets is an important step to disentangle relationships among the different elements of a biological system. Unfortunately, when using classical measures, high correlations can be falsely inferred because of the high dimension of the data. \\
\textbf{Results:} Here we propose a corrected version of the Procrustean correlation coefficient that is not sensitive to the high dimension of the data. This allows for a correct estimation of the shared variation between two data sets and of the partial correlation coefficient between a set of matrix data.\\
\textbf{Availability:} The proposed corrected coefficients are implemented in the ProcMod R package available at \url{https://git.metabarcoding.org/lecasofts/ProcMod}\\
\textbf{Contact:} \href{eric.coissac@metabarcoding.org}{eric.coissac@metabarcoding.org}}
......@@ -131,7 +131,7 @@ set.seed(1)
Multidimensional data and even high-dimensional data, where the number of variables describing each sample is far larger than the sample count, are now regularly produced in functional genomics (\emph{e.g.} transcriptomics, proteomics or metabolomics) and molecular ecology (\emph{e.g.} DNA metabarcoding). Using various techniques, the same sample set can be described by several multidimensional data sets, each of them describing a different facet of the samples. This invites using data analysis methods able to evaluate mutual information shared by these different descriptions. Correlative approaches can be a first and simple way to decipher pairwise relationships of those data sets.
Since a long time ago, several coefficients have been proposed to measure correlation between two matrices \citep[for a comprehensive review see][]{Ramsay:84:00}. But when applied to high-dimensional data, they suffer from the over-fitting effect leading them to estimate a high correlation even for unrelated data sets. Modified versions of some of these matrix correlation coefficients have been already proposed to tackle this problem. The $\rv_2$ coefficient \citep{Smilde:09:00} is correcting the original $\rv$ coefficient \citep{Escoufier:73:00} for over-fitting. Similarly, a modified version of the distance correlation coefficient $\dcor$ \citep{Szekely:07:00} has been proposed by \cite{SzeKely:13:00}. $\dcor$ has the advantage over the other correlation factors for not considering only linear relationships. Here we will focus on the Procrustes correlation coefficient $\rls$ proposed by \cite{Lingoes:74:00} and by \cite{Gower:71:00}. Let define $Trace$, a function summing the diagonal elements of a matrix. For a $ \; n \; \times \; p \; $ real matrix $\X$ and a second $ \; n \; \times \; q \; $ real matrix $\Y$ defining respectively two sets of $p$ and $q$ centered variables caracterizing $n$ individuals, we define $\covls(\X,\Y)$ an analog of covariance applicable to vectorial data following Equation~(\ref{eq:CovLs})
For a long time, several coefficients have been proposed to measure correlation between two matrices \citep[for a comprehensive review see][]{Ramsay:84:00}. But when applied to high-dimensional data, they suffer from an over-fitting effect leading them to estimate a high correlation even for unrelated data sets. Modified versions of some of these matrix correlation coefficients have already been proposed to tackle this problem. The $\rv_2$ coefficient \citep{Smilde:09:00} corrects the original $\rv$ coefficient \citep{Escoufier:73:00} for over-fitting. Similarly, a modified version of the distance correlation coefficient $\dcor$ \citep{Szekely:07:00} has been proposed by \cite{SzeKely:13:00}. $\dcor$ has the advantage over the other correlation coefficients of not considering only linear relationships. Here we will focus on the Procrustes correlation coefficient $\rls$ proposed by \cite{Lingoes:74:00} and by \cite{Gower:71:00}. Define $Trace$ as the function summing the diagonal elements of a matrix. For an $ \; n \; \times \; p \; $ real matrix $\X$ and a second $ \; n \; \times \; q \; $ real matrix $\Y$ defining respectively two sets of $p$ and $q$ centered variables characterizing $n$ individuals, we define $\covls(\X,\Y)$, an analog of covariance applicable to vectorial data, following Equation~(\ref{eq:CovLs})
\begin{equation}
\covls(\X,\Y) = \frac{\trace((\mathbf{XX}'\mathbf{YY}')^{1/2})}{n-1}
......@@ -146,7 +146,7 @@ and $\varls(\X)$ as $\covls(\X,\X)$. $\rls$ can then be expressed as follow in E
\label{eq:Rls}
\end{equation}
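As a minimal sketch, Equations~(\ref{eq:CovLs}) and~(\ref{eq:Rls}) can be written in a few lines of R by using the identity that $\trace((\mathbf{XX}'\mathbf{YY}')^{1/2})$ equals the sum of the singular values of $\X'\Y$; this is only an illustration, and the ProcMod implementation may differ.
<<covls_sketch, eval=FALSE>>=
# Sketch of CovLs, VarLs and RLs from Equations (eq:CovLs) and
# (eq:Rls); illustration only, not the ProcMod implementation.
covls <- function(X, Y) {
  X <- scale(X, center = TRUE, scale = FALSE)   # centered variables
  Y <- scale(Y, center = TRUE, scale = FALSE)
  # Trace((XX'YY')^(1/2)) = sum of the singular values of X'Y
  sum(svd(crossprod(X, Y))$d) / (nrow(X) - 1)
}
varls <- function(X) covls(X, X)
rls   <- function(X, Y) covls(X, Y) / sqrt(varls(X) * varls(Y))
@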
Among the advantages of $\rls$, its similarity with the Pearson's correlation coefficient $\rpearson$ \citep{Bravais:44:00} has to be noticed. Considering $\covls(\X,\Y)$ and $\varls(\X)$, respectively corresponding to the covariance of two matrices and the variance of a matrix, Equation~(\ref{eq:Rls}) highlight the analogy between both the correlation coefficients. Besides, when $p=1 \text{ and } q = 1,\; \rls = \lvert \rpearson \rvert$. When squared $\rls$ is an estimate, like the squared Pearson's $\rpearson$, of the amount of variation shared between the two datasets. This property allows for developing variance analyzing of matrix data sets.
Among the advantages of $\rls$, its similarity with Pearson's correlation coefficient $\rpearson$ \citep{Bravais:44:00} has to be noticed. Considering $\covls(\X,\Y)$ and $\varls(\X)$, respectively corresponding to the covariance of two matrices and the variance of a matrix, Equation~(\ref{eq:Rls}) highlights the analogy between both correlation coefficients. Besides, when $p=1 \text{ and } q = 1,\; \rls = \lvert \rpearson \rvert$. When squared, $\rls$ is, like the squared Pearson's $\rpearson$, an estimate of the amount of variation shared between the two datasets. This property allows for developing variance analysis of matrix data sets.
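Using the sketch above, this $p = q = 1$ property can be checked numerically:
<<rls_pearson_check, eval=FALSE>>=
# Numerical check that rls equals |rpearson| when p = q = 1,
# relying on the covls()/rls() sketch defined earlier.
set.seed(1)
x <- matrix(rnorm(30), ncol = 1)
y <- matrix(x + rnorm(30), ncol = 1)
all.equal(rls(x, y), abs(cor(x, y)[1, 1]))  # TRUE up to rounding
@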
Moreover, Procrustean analyses have been proposed as a good alternative to Mantel's statistics for analyzing ecological data summarized by distance matrices \citep{Peres-Neto:01:00}. In such analyses, distance matrices are projected into an orthogonal space using metric or non-metric multidimensional scaling, according to the geometrical properties of the distances used. Correlations are then estimated between these projections.
......@@ -421,7 +421,7 @@ for (i in seq_along(supplement_vars)) {
\subsection{Empirical assessment of the coefficient of determination}
The coefficient of determination ($\rpearson^2$) represents the part of shared variation between two variables. $\rls^2$ keeps the same meaning when applied to two matrices. But because of over-fitting, $\rls$ and therefore $\rls^2$ are over-estimated.
\subsubsection*{between two matrices}
\subsubsection*{Between two matrices}
<<estimate_r2_setting, cache=TRUE, message=FALSE, warning=FALSE, include=FALSE>>=
n_sim <- 100 # Count of simulations for each condition
......@@ -532,7 +532,7 @@ r2_sims_all = do.call(bind_rows,
@
\subsubsection*{between two vectors}
\subsubsection*{Between two vectors}
<<estimate_r2_vector_setting, cache=TRUE, message=FALSE, warning=FALSE, include=FALSE>>=
n_sim <- 100 # Count of simulations for each condition
......@@ -605,7 +605,7 @@ r2_sims_vec_all = do.call(bind_rows,
@
\subsubsection*{partial determination coefficients}
\subsubsection*{Partial determination coefficients}
\begin{figure}[!tpb]%figure1
......@@ -668,7 +668,7 @@ grid.draw(venn.plot)
\label{fig:nested_shared_variation}
\end{figure}
To evaluate capacity of partial determination coefficient $\irls_{partial}^2$ to distangle nested correlations, four matrices $\mathbf{A},\,\mathbf{B},\,\mathbf{C},\,\mathbf{D}$ of size $n \times p = \Sexpr{n_indivdual} \times \Sexpr{p_q}$ are generated according to the schema: $\mathbf{A}$ shares $\Sexpr{round(r2_AB *100)}\%$ of variation with $\mathbf{B}$, that shares $\Sexpr{round(r2_BC *100)}\%$ of variation with $\mathbf{C}$, sharing $\Sexpr{round(r2_CD *100)}\%$ of variation with $\mathbf{D}$. These direct correlations induce indirect ones spreading the total variation among each pair of matricies according to Figure~\ref{fig:nested_shared_variation}. The simulation is repeadted $\Sexpr{n_sim}$ times, for every simutation $\irls_{partial}^2$ and $\rls_{partial}^2$ are estimated for each pair of matrices.
To evaluate the capacity of the partial determination coefficient $\irls_{partial}^2$ to disentangle nested correlations, four matrices $\mathbf{A},\,\mathbf{B},\,\mathbf{C},\,\mathbf{D}$ of size $n \times p = \Sexpr{n_indivdual} \times \Sexpr{p_q}$ are generated according to the following schema: $\mathbf{A}$ shares $\Sexpr{round(r2_AB *100)}\%$ of variation with $\mathbf{B}$, which shares $\Sexpr{round(r2_BC *100)}\%$ of variation with $\mathbf{C}$, which in turn shares $\Sexpr{round(r2_CD *100)}\%$ of variation with $\mathbf{D}$. These direct correlations induce indirect ones, spreading the total variation among each pair of matrices according to Figure~\ref{fig:nested_shared_variation}. The simulation is repeated $\Sexpr{n_sim}$ times; for every simulation, $\irls_{partial}^2$ and $\rls_{partial}^2$ are estimated for each pair of matrices.
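A hypothetical generator sketching this simulation scheme is given below; the helper name and the mixing strategy are illustrative assumptions, not the exact code used for the simulations.
<<share_variation_sketch, eval=FALSE>>=
# Hypothetical helper: build a matrix sharing a fraction r2 of its
# variation with A by mixing a scaled signal with scaled noise.
share_variation <- function(A, r2) {
  noise <- matrix(rnorm(length(A)), nrow = nrow(A))
  sqrt(r2) * scale(A) + sqrt(1 - r2) * scale(noise)
}
A <- matrix(rnorm(20 * 200), nrow = 20)
B <- share_variation(A, r2 = 0.8)  # B shares ~80% of variation with A
@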
<<estimate_partial_r2, cache=TRUE, message=FALSE, warning=FALSE, include=FALSE, dependson="estimate_partial_r2_setting">>=
......@@ -771,11 +771,11 @@ rbind(partial_r2_sims.irlsp,
@
\subsection{Testing significance of $\irls(\X,\Y)$}
\subsection{Testing the significance of $\irls(\X,\Y)$}
Significance of $\irls(\X,\Y)$ can be tested using permutation test as defined in \cite{Jackson:95:00} or \cite{Peres-Neto:01:00} and implemented respectively in the \texttt{protest} method of the vegan R package \citep{Dixon:03:00} or the \texttt{procuste.rtest} method of the ADE4 R package \cite{Dray:07:00}.
The significance of $\irls(\X,\Y)$ can be tested using permutation tests as defined in \cite{Jackson:95:00} or \cite{Peres-Neto:01:00} and implemented respectively in the \texttt{protest} method of the vegan R package \citep{Dixon:03:00} and the \texttt{procuste.rtest} method of the ADE4 R package \citep{Dray:07:00}.
It is also possible to take advantage of the Monte-Carlo estimation of $\overline{\rcovls(\X,\Y)}$ to test that $\icovls(\X,\Y)$ and therefore $\irls(\X,\Y)$ are greater than expected under random hypothesis. Let counting over the $k$ randomization when $\rcovls(\X,\Y)_k$ greater than $\covls(\X,\Y)$ name this counts $N_{>\covls}$. $\pvalue$ of the test can be estimated following Equation~(\ref{eq:pvalue}).
It is also possible to take advantage of the Monte-Carlo estimation of $\overline{\rcovls(\X,\Y)}$ to test whether $\icovls(\X,\Y)$, and therefore $\irls(\X,\Y)$, is greater than expected under the random hypothesis. Let us count, over the $k$ randomizations, the cases where $\rcovls(\X,\Y)_k$ is greater than $\covls(\X,\Y)$, and name this count $N_{>\covls}$. The $\pvalue$ of the test can be estimated following Equation~(\ref{eq:pvalue}).
\begin{equation}
\pvalue= \frac{N_{>\covls}}{k}
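A hedged sketch of this test is shown below; it assumes independent row permutations of $\X$ and $\Y$ as the randomization scheme and a simple Monte-Carlo correction of the covariance, both of which may differ from the actual ProcMod implementation.
<<covls_test_sketch, eval=FALSE>>=
# Sketch of the randomization test of Equation (eq:pvalue); the
# randomization scheme (row permutations) is an assumption here.
covlsTest <- function(X, Y, k = 1000) {
  obs <- covls(X, Y)                  # covls() from the earlier sketch
  rnd <- replicate(k, covls(X[sample(nrow(X)), , drop = FALSE],
                            Y[sample(nrow(Y)), , drop = FALSE]))
  list(icovls  = obs - mean(rnd),     # Monte-Carlo corrected covariance
       p.value = sum(rnd > obs) / k)  # N_{>CovLs} / k
}
@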
......@@ -791,8 +791,8 @@ p_qs <- c(10,20,50) # Number of variable foreach matrices (Column counts of the
n_rand <- 1000
@
To assess empirically the $\alpha\text{-risk}$ of the procruste test based on the randomisations realized during the estimation of $\overline{\rcovls(\X,\Y)}$, distribution of $P_{value}$ under the $H_0$ is compared to a uniform distribution between $0$ and $1$ ($\mathcal{U}(0,1)$). To estimate such empirical distribution, $k=\Sexpr{n_sim}$ pairs of $n \times p$ random matrices with $n=\Sexpr{n_indivdual}$ and
$p \in \{\Sexpr{p_qs}\}$ are simulated under the null hypothesis of independancy. Procruste correlation between whose matrices is tested based on three tests. Our proposed test ($CovLs.test$), the \texttt{protest} method of the vegan R package and the \texttt{procuste.rtest} method of the ADE4 R package. Conformance of the distribution of each set of $k$ $P_{value}$ to $\mathcal{U}(0,1)$ is assessed using the Cramer-Von Mises test \citep{CSoRgo:96:00} implemented in the \texttt{cvm.test} function of the R package \texttt{goftest}.
To assess empirically the $\alpha\text{-risk}$ of the Procrustes test based on the randomisations realized during the estimation of $\overline{\rcovls(\X,\Y)}$, the distribution of the $P_{value}$ under $H_0$ is compared to a uniform distribution between $0$ and $1$ ($\mathcal{U}(0,1)$). To estimate such an empirical distribution, $k=\Sexpr{n_sim}$ pairs of $n \times p$ random matrices with $n=\Sexpr{n_indivdual}$ and
$p \in \{\Sexpr{p_qs}\}$ are simulated under the null hypothesis of independence. The Procrustes correlation between those matrices is assessed with three tests: our proposed test ($CovLs.test$), the \texttt{protest} method of the vegan R package, and the \texttt{procuste.rtest} method of the ADE4 R package. Conformance of the distribution of each set of $k$ $P_{value}$ to $\mathcal{U}(0,1)$ is assessed using the Cramer-Von Mises test \citep{CSoRgo:96:00} implemented in the \texttt{cvm.test} function of the R package \texttt{goftest}.
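The conformance check itself can be sketched as follows, reusing the hypothetical \texttt{covlsTest} defined earlier; only \texttt{goftest::cvm.test} is taken directly from the text above.
<<alpha_risk_sketch, eval=FALSE>>=
# Under H0 the p-values should follow U(0,1); conformance is
# assessed with the Cramer-Von Mises test of the goftest package.
library(goftest)
set.seed(42)
pvals <- replicate(100, {
  X <- matrix(rnorm(20 * 10), nrow = 20)
  Y <- matrix(rnorm(20 * 10), nrow = 20)
  covlsTest(X, Y, k = 100)$p.value
})
cvm.test(pvals, null = "punif")
@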
<<estimate_alpha, cache=TRUE, message=FALSE, warning=FALSE, include=FALSE, dependson="estimate_alpha_setting">>=
h0_alpha = array(0,dim = c(n_sim,length(p_qs),3))
......@@ -837,7 +837,7 @@ r2s <- c(1:2) * 5/100
n_rand <- 1000
@
To evaluate relative power of the three considered tests, pairs of to random matrices were produced for various $p \in \{\Sexpr{p_qs}\}$, $n \in \{\Sexpr{n_indivduals}\}$ and two levels of shared variations $\rpearson^2 \in \{\Sexpr{r2s}\}$. For each combination of parameters, $k = \Sexpr{n_sim}$ simulations are run. Each test are estimated based on $\Sexpr{n_rand}$ randomizations for the $CovLs$ test, or permutations for \texttt{protest} and \texttt{procuste.rtest}.
To evaluate the relative power of the three considered tests, pairs of random matrices were produced for various $p \in \{\Sexpr{p_qs}\}$, $n \in \{\Sexpr{n_indivduals}\}$ and two levels of shared variation $\rpearson^2 \in \{\Sexpr{r2s}\}$. For each combination of parameters, $k = \Sexpr{n_sim}$ simulations are run. Each test is estimated based on $\Sexpr{n_rand}$ randomizations for the $CovLs$ test, or permutations for \texttt{protest} and \texttt{procuste.rtest}.
<<estimate_power, cache=TRUE, message=FALSE, warning=FALSE, include=FALSE, dependson="estimate_power_setting">>=
......@@ -1090,7 +1090,7 @@ $RLs$, like $RV$ and $dCor$, is sensible to overfitting which increase when $n$
\subsection{Evaluating the shared variation}
\subsubsection*{between two matrices}
\subsubsection*{Between two matrices}
$\rls$ can be considered for matrices as a strict equivalent of Pearson's $\rpearson$ for vectors. Therefore its squared value is an estimator of the shared variation between two matrices. But because of over-fitting, this estimation is inflated. The proposed corrected version ($\irls$) of that coefficient is able to provide a good estimate of the shared variation and is perfectly robust to the over-fitting phenomenon (Figure~\ref{fig:shared_variation}). Only a small over-estimation is observable for low values of simulated shared variation.
......@@ -1125,7 +1125,7 @@ r2_sims_all %>%
\label{fig:shared_variation}
\end{figure*}
\subsubsection*{between two vectors}
\subsubsection*{Between two vectors}
A vector can be considered as a single-column matrix, so the efficiency of $\irls^2$ in estimating shared variation between matrices can also be used to estimate shared variation between two vectors. Other formulas have already been proposed to better estimate shared variation between vectors in the context of linear models. Among them, the one presented in Equation~(\ref{eq:r2adj}) is the most often used and is the one implemented in the R linear model summary function. On simulated data, $\irls^2$ performs better than the simple $R^2$ and its commonly used modified version $R^2_{adj}$ (Figure~\ref{fig:shared_variation_vector}). Whatever the estimator, the bias decreases with the simulated shared variation. Nevertheless, for every tested case, the median bias observed with $\irls^2$ is smaller than with both other estimators, even if the classical estimators perform well for large values of shared variation.
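For reference, the two classical estimators discussed here are directly available from R's linear model summary; the data below are illustrative only.
<<r2adj_sketch, eval=FALSE>>=
# Classical estimators as reported by summary.lm();
# adj.r.squared corresponds to Equation (eq:r2adj).
set.seed(2)
d   <- data.frame(x = rnorm(10))
d$y <- 0.5 * d$x + rnorm(10)
fit <- summary(lm(y ~ x, data = d))
c(r2 = fit$r.squared, r2_adj = fit$adj.r.squared)
@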
......@@ -1161,7 +1161,7 @@ r2_sims_vec_all %>%
\label{fig:shared_variation_vector}
\end{figure}
\subsubsection*{partial coefficient of determination}
\subsubsection*{Partial coefficient of determination}
The simulated correlation network between the four matrices $\mathbf{A},\,\mathbf{B},\,\mathbf{C},\,\mathbf{D}$ induced, beyond the direct simulated correlations, a network of indirect correlations and therefore shared variances (Figure~\ref{fig:nested_shared_variation}). In such a system, the interest of partial correlation coefficients and their associated partial determination coefficients is to measure the correlation between a pair of variables without accounting for the part of that correlation which is explained by other variables, hence extracting the pure correlation between these two matrices. From Figure~\ref{fig:nested_shared_variation}, the expected partial shared variation between $\mathbf{A}$ and $\mathbf{B}$ is $480/(200+480)=0.706$; between $\mathbf{B}$ and $\mathbf{C}$, $64/(480+120)=0.107$; and between $\mathbf{C}$ and $\mathbf{D}$, $120/800=0.150$. All other partial coefficients are expected to be equal to $0$. The effect of the correction introduced in $\irls$ is clearly weaker on the partial coefficients of determination (Figure~\ref{fig:nested_shared}) than on the full coefficients of determination (Figure~\ref{fig:shared_variation}). The spurious random correlations, constituting the over-fitting effect, are distributed over all the pairs of matrices $\mathbf{A},\,\mathbf{B},\,\mathbf{C},\,\mathbf{D}$.
\begin{figure}[!tpb]%figure1
......@@ -1301,7 +1301,7 @@ print(tab,
\section{Discussion}
Correcting the over-adjustment effect on metrics assessing the relationship between high dimension datasets is a constant effort over the past decade. Therefore, $\irls$ can be considered as a continuation of the extension of the toolbox available to biologists for analyzing their omics data. The effect of the proposed correction on the classical $\rls$ coefficient is as strong as the other ones previously proposed for other correlation coefficients measuring relationship between vector data \citep[see Figure~\ref{fig:shared_variation}, e.g.][]{Smilde:09:00,SzeKely:13:00}. When applied to univariate data, $\rls$ is equal to the absolute value of the Pearson correlation coefficient, hence, and despite it is not the initial aim of that coefficient, $\irls$ can also be used to evaluate correlation between two univariate datasets. Using $\irls$ for such data sets is correcting for spurious correlations when the number of individual is small more efficiently than classical correction \citep[see Figure~\ref{fig:shared_variation_vector},][]{Theil:58:00}.
Correcting the over-adjustment effect on metrics assessing the relationship between high-dimensional datasets has been a constant effort over the past decades. Therefore, $\irls$ can be considered as a continuation of the extension of the toolbox available to biologists for analyzing their omics data. The effect of the proposed correction on the classical $\rls$ coefficient is as strong as the corrections previously proposed for other correlation coefficients measuring relationships between vector data \citep[see Figure~\ref{fig:shared_variation}, e.g.][]{Smilde:09:00,SzeKely:13:00}. When applied to univariate data, $\rls$ is equal to the absolute value of the Pearson correlation coefficient; hence, although it is not the initial aim of that coefficient, $\irls$ can also be used to evaluate the correlation between two univariate datasets. Using $\irls$ for such data sets corrects for spurious correlations when the number of individuals is small more efficiently than the classical correction \citep[see Figure~\ref{fig:shared_variation_vector},][]{Theil:58:00}.
The main advantage of $\irls$ over other matrix correlation coefficients is that it allows for estimating the shared variation between two matrices according to the classical definition of variance partitioning used with linear models. This opens the opportunity to develop linear models explaining the variation of a high-dimensional dataset by a set of other high-dimensional data matrices.
......
No preview for this file type
......@@ -128,8 +128,8 @@ $^{\text{\sf 2}}$Department, Institution, City, Post Code, Country.}
\maketitle
\fi
\abstract{\textbf{Motivation:} Molecular biology and ecology are producing many high dimension data.
Estimating correlation and shared variation between such data sets is an important step to distangle relationships among differents elements of a biological system. Unfortunatly when using classical measure, because of the high dimension of the data, high correlation can be falsly infered. \\
\abstract{\textbf{Motivation:} Molecular biology and ecology produce many high-dimensional data sets.
Estimating correlation and shared variation between such data sets is an important step to disentangle relationships among the different elements of a biological system. Unfortunately, when using classical measures, high correlations can be falsely inferred because of the high dimension of the data. \\
\textbf{Results:} Here we propose a corrected version of the Procrustean correlation coefficient that is not sensitive to the high dimension of the data. This allows for a correct estimation of the shared variation between two data sets and of the partial correlation coefficient between a set of matrix data.\\
\textbf{Availability:} The proposed corrected coefficients are implemented in the ProcMod R package available at \url{https://git.metabarcoding.org/lecasofts/ProcMod}\\
\textbf{Contact:} \href{eric.coissac@metabarcoding.org}{eric.coissac@metabarcoding.org}}
......@@ -154,7 +154,7 @@ Estimating correlation and shared variation between such data sets is an importa
Multidimensional data and even high-dimensional data, where the number of variables describing each sample is far larger than the sample count, are now regularly produced in functional genomics (\emph{e.g.} transcriptomics, proteomics or metabolomics) and molecular ecology (\emph{e.g.} DNA metabarcoding). Using various techniques, the same sample set can be described by several multidimensional data sets, each of them describing a different facet of the samples. This invites using data analysis methods able to evaluate mutual information shared by these different descriptions. Correlative approaches can be a first and simple way to decipher pairwise relationships of those data sets.
Since a long time ago, several coefficients have been proposed to measure correlation between two matrices \citep[for a comprehensive review see][]{Ramsay:84:00}. But when applied to high-dimensional data, they suffer from the over-fitting effect leading them to estimate a high correlation even for unrelated data sets. Modified versions of some of these matrix correlation coefficients have been already proposed to tackle this problem. The $\rv_2$ coefficient \citep{Smilde:09:00} is correcting the original $\rv$ coefficient \citep{Escoufier:73:00} for over-fitting. Similarly, a modified version of the distance correlation coefficient $\dcor$ \citep{Szekely:07:00} has been proposed by \cite{SzeKely:13:00}. $\dcor$ has the advantage over the other correlation factors for not considering only linear relationships. Here we will focus on the Procrustes correlation coefficient $\rls$ proposed by \cite{Lingoes:74:00} and by \cite{Gower:71:00}. Let define $Trace$, a function summing the diagonal elements of a matrix. For a $ \; n \; \times \; p \; $ real matrix $\X$ and a second $ \; n \; \times \; q \; $ real matrix $\Y$ defining respectively two sets of $p$ and $q$ centered variables caracterizing $n$ individuals, we define $\covls(\X,\Y)$ an analog of covariance applicable to vectorial data following Equation~(\ref{eq:CovLs})
For a long time, several coefficients have been proposed to measure correlation between two matrices \citep[for a comprehensive review see][]{Ramsay:84:00}. But when applied to high-dimensional data, they suffer from an over-fitting effect leading them to estimate a high correlation even for unrelated data sets. Modified versions of some of these matrix correlation coefficients have already been proposed to tackle this problem. The $\rv_2$ coefficient \citep{Smilde:09:00} corrects the original $\rv$ coefficient \citep{Escoufier:73:00} for over-fitting. Similarly, a modified version of the distance correlation coefficient $\dcor$ \citep{Szekely:07:00} has been proposed by \cite{SzeKely:13:00}. $\dcor$ has the advantage over the other correlation coefficients of not considering only linear relationships. Here we will focus on the Procrustes correlation coefficient $\rls$ proposed by \cite{Lingoes:74:00} and by \cite{Gower:71:00}. Define $Trace$ as the function summing the diagonal elements of a matrix. For an $ \; n \; \times \; p \; $ real matrix $\X$ and a second $ \; n \; \times \; q \; $ real matrix $\Y$ defining respectively two sets of $p$ and $q$ centered variables characterizing $n$ individuals, we define $\covls(\X,\Y)$, an analog of covariance applicable to vectorial data, following Equation~(\ref{eq:CovLs})
\begin{equation}
\covls(\X,\Y) = \frac{\trace((\mathbf{XX}'\mathbf{YY}')^{1/2})}{n-1}
......@@ -169,7 +169,7 @@ and $\varls(\X)$ as $\covls(\X,\X)$. $\rls$ can then be expressed as follow in E
\label{eq:Rls}
\end{equation}
Among the advantages of $\rls$, its similarity with the Pearson's correlation coefficient $\rpearson$ \citep{Bravais:44:00} has to be noticed. Considering $\covls(\X,\Y)$ and $\varls(\X)$, respectively corresponding to the covariance of two matrices and the variance of a matrix, Equation~(\ref{eq:Rls}) highlight the analogy between both the correlation coefficients. Besides, when $p=1 \text{ and } q = 1,\; \rls = \lvert \rpearson \rvert$. When squared $\rls$ is an estimate, like the squared Pearson's $\rpearson$, of the amount of variation shared between the two datasets. This property allows for developing variance analyzing of matrix data sets.
Among the advantages of $\rls$, its similarity with Pearson's correlation coefficient $\rpearson$ \citep{Bravais:44:00} has to be noticed. Considering $\covls(\X,\Y)$ and $\varls(\X)$, respectively corresponding to the covariance of two matrices and the variance of a matrix, Equation~(\ref{eq:Rls}) highlights the analogy between both correlation coefficients. Besides, when $p=1 \text{ and } q = 1,\; \rls = \lvert \rpearson \rvert$. When squared, $\rls$ is, like the squared Pearson's $\rpearson$, an estimate of the amount of variation shared between the two datasets. This property allows for developing variance analysis of matrix data sets.
Moreover, Procrustean analyses have been proposed as a good alternative to Mantel's statistics for analyzing ecological data summarized by distance matrices \citep{Peres-Neto:01:00}. In such analyses, distance matrices are projected into an orthogonal space using metric or non-metric multidimensional scaling, according to the geometrical properties of the distances used. Correlations are then estimated between these projections.
......@@ -277,7 +277,7 @@ For $p=1$ random vectors with $n \in [3,25]$ are generated. As above data are dr
\subsection{Empirical assessment of the coefficient of determination}
The coefficient of determination ($\rpearson^2$) represents the part of shared variation between two variables. $\rls^2$ keeps the same meaning when applied to two matrices. But because of over-fitting, $\rls$ and therefore $\rls^2$ are over-estimated.
\subsubsection*{between two matrices}
\subsubsection*{Between two matrices}
......@@ -287,7 +287,7 @@ To test how the $\irls$ version of the coefficient of determination $\irls^2$ c
\subsubsection*{between two vectors}
\subsubsection*{Between two vectors}
......@@ -304,7 +304,7 @@ To evaluate the strength of that over-estimation and the relative effect of the
\subsubsection*{partial determination coefficients}
\subsubsection*{Partial determination coefficients}
\begin{figure}[!tpb]%figure1
......@@ -319,17 +319,17 @@ To evaluate the strength of that over-estimation and the relative effect of the
\label{fig:nested_shared_variation}
\end{figure}
To evaluate capacity of partial determination coefficient $\irls_{partial}^2$ to distangle nested correlations, four matrices $\mathbf{A},\,\mathbf{B},\,\mathbf{C},\,\mathbf{D}$ of size $n \times p = 20 \times 200$ are generated according to the schema: $\mathbf{A}$ shares $80\%$ of variation with $\mathbf{B}$, that shares $40\%$ of variation with $\mathbf{C}$, sharing $20\%$ of variation with $\mathbf{D}$. These direct correlations induce indirect ones spreading the total variation among each pair of matricies according to Figure~\ref{fig:nested_shared_variation}. The simulation is repeadted $100$ times, for every simutation $\irls_{partial}^2$ and $\rls_{partial}^2$ are estimated for each pair of matrices.
To evaluate the capacity of the partial determination coefficient $\irls_{partial}^2$ to disentangle nested correlations, four matrices $\mathbf{A},\,\mathbf{B},\,\mathbf{C},\,\mathbf{D}$ of size $n \times p = 20 \times 200$ are generated according to the following schema: $\mathbf{A}$ shares $80\%$ of variation with $\mathbf{B}$, which shares $40\%$ of variation with $\mathbf{C}$, which in turn shares $20\%$ of variation with $\mathbf{D}$. These direct correlations induce indirect ones, spreading the total variation among each pair of matrices according to Figure~\ref{fig:nested_shared_variation}. The simulation is repeated $100$ times; for every simulation, $\irls_{partial}^2$ and $\rls_{partial}^2$ are estimated for each pair of matrices.
\subsection{Testing significance of $\irls(\X,\Y)$}
\subsection{Testing the significance of $\irls(\X,\Y)$}
Significance of $\irls(\X,\Y)$ can be tested using permutation test as defined in \cite{Jackson:95:00} or \cite{Peres-Neto:01:00} and implemented respectively in the \texttt{protest} method of the vegan R package \citep{Dixon:03:00} or the \texttt{procuste.rtest} method of the ADE4 R package \cite{Dray:07:00}.
The significance of $\irls(\X,\Y)$ can be tested using permutation tests as defined in \cite{Jackson:95:00} or \cite{Peres-Neto:01:00} and implemented respectively in the \texttt{protest} method of the vegan R package \citep{Dixon:03:00} and the \texttt{procuste.rtest} method of the ADE4 R package \citep{Dray:07:00}.
It is also possible to take advantage of the Monte-Carlo estimation of $\overline{\rcovls(\X,\Y)}$ to test that $\icovls(\X,\Y)$ and therefore $\irls(\X,\Y)$ are greater than expected under random hypothesis. Let counting over the $k$ randomization when $\rcovls(\X,\Y)_k$ greater than $\covls(\X,\Y)$ name this counts $N_{>\covls}$. $\pvalue$ of the test can be estimated following Equation~(\ref{eq:pvalue}).
It is also possible to take advantage of the Monte-Carlo estimation of $\overline{\rcovls(\X,\Y)}$ to test whether $\icovls(\X,\Y)$, and therefore $\irls(\X,\Y)$, is greater than expected under the random hypothesis. Let us count, over the $k$ randomizations, the cases where $\rcovls(\X,\Y)_k$ is greater than $\covls(\X,\Y)$, and name this count $N_{>\covls}$. The $\pvalue$ of the test can be estimated following Equation~(\ref{eq:pvalue}).
\begin{equation}
\pvalue= \frac{N_{>\covls}}{k}
......@@ -340,8 +340,8 @@ It is also possible to take advantage of the Monte-Carlo estimation of $\overlin
To assess empirically the $\alpha\text{-risk}$ of the procruste test based on the randomisations realized during the estimation of $\overline{\rcovls(\X,\Y)}$, distribution of $P_{value}$ under the $H_0$ is compared to a uniform distribution between $0$ and $1$ ($\mathcal{U}(0,1)$). To estimate such empirical distribution, $k=1000$ pairs of $n \times p$ random matrices with $n=20$ and
$p \in \{10, 20, 50\}$ are simulated under the null hypothesis of independancy. Procruste correlation between whose matrices is tested based on three tests. Our proposed test ($CovLs.test$), the \texttt{protest} method of the vegan R package and the \texttt{procuste.rtest} method of the ADE4 R package. Conformance of the distribution of each set of $k$ $P_{value}$ to $\mathcal{U}(0,1)$ is assessed using the Cramer-Von Mises test \citep{CSoRgo:96:00} implemented in the \texttt{cvm.test} function of the R package \texttt{goftest}.
To assess empirically the $\alpha\text{-risk}$ of the Procrustes test based on the randomisations realized during the estimation of $\overline{\rcovls(\X,\Y)}$, the distribution of the $P_{value}$ under $H_0$ is compared to a uniform distribution between $0$ and $1$ ($\mathcal{U}(0,1)$). To estimate such an empirical distribution, $k=1000$ pairs of $n \times p$ random matrices with $n=20$ and
$p \in \{10, 20, 50\}$ are simulated under the null hypothesis of independence. The Procrustes correlation between those matrices is assessed with three tests: our proposed test ($CovLs.test$), the \texttt{protest} method of the vegan R package, and the \texttt{procuste.rtest} method of the ADE4 R package. Conformance of the distribution of each set of $k$ $P_{value}$ to $\mathcal{U}(0,1)$ is assessed using the Cramer-Von Mises test \citep{CSoRgo:96:00} implemented in the \texttt{cvm.test} function of the R package \texttt{goftest}.
......@@ -351,7 +351,7 @@ $p \in \{10, 20, 50\}$ are simulated under the null hypothesis of independancy.
To evaluate relative power of the three considered tests, pairs of to random matrices were produced for various $p \in \{10, 20, 50, 100\}$, $n \in \{10, 15, 20, 25\}$ and two levels of shared variations $\rpearson^2 \in \{0.05, 0.1\}$. For each combination of parameters, $k = 1000$ simulations are run. Each test are estimated based on $1000$ randomizations for the $CovLs$ test, or permutations for \texttt{protest} and \texttt{procuste.rtest}.
To evaluate the relative power of the three considered tests, pairs of random matrices were produced for various $p \in \{10, 20, 50, 100\}$, $n \in \{10, 15, 20, 25\}$ and two levels of shared variation $\rpearson^2 \in \{0.05, 0.1\}$. For each combination of parameters, $k = 1000$ simulations are run. Each test is estimated based on $1000$ randomizations for the $CovLs$ test, or permutations for \texttt{protest} and \texttt{procuste.rtest}.
......@@ -374,7 +374,7 @@ To evaluate relative power of the three considered tests, pairs of to random mat
\begin{table}[!t]
\processtable{Estimation of $\overline{\rcovls(\X,\Y)}$ according to the number of random matrices (k) aligned.\label{tab:mrcovls}}{
% latex table generated in R 3.5.2 by xtable 1.8-4 package
% Thu Sep 12 07:14:26 2019
% Thu Sep 26 17:36:27 2019
\begin{tabular}{rrrrrrr}
\hline
& & \multicolumn{2}{c}{normal} & & \multicolumn{2}{c}{exponential}\\ \cline{3-4} \cline{6-7}p & k &\multicolumn{1}{c}{mean} & \multicolumn{1}{c}{sd} & \multicolumn{1}{c}{ } &\multicolumn{1}{c}{mean} & \multicolumn{1}{c}{sd}\\\hline\multirow{3}{*}{10} & 10 & 0.5746 & $1.3687 \times 10^{-2}$ & & 0.5705 & $1.1714 \times 10^{-2}$ \\
......@@ -424,7 +424,7 @@ $RLs$, like $RV$ and $dCor$, is sensible to overfitting which increase when $n$
\subsection{Evaluating the shared variation}
\subsubsection*{between two matrices}
\subsubsection*{Between two matrices}
$\rls$ can be considered for matrices as a strict equivalent of Pearson's $\rpearson$ for vectors. Therefore its squared value is an estimator of the shared variation between two matrices. But because of over-fitting, this estimation is inflated. The proposed corrected version ($\irls$) of that coefficient is able to provide a good estimate of the shared variation and is perfectly robust to the over-fitting phenomenon (Figure~\ref{fig:shared_variation}). Only a small over-estimation is observable for low values of simulated shared variation.
......@@ -438,7 +438,7 @@ $\rls$ can be considered for matrices as a strict equivalent of Pearson's $\rpea
\label{fig:shared_variation}
\end{figure*}
\subsubsection*{between two vectors}
\subsubsection*{Between two vectors}
A vector can be considered as a single-column matrix, so the efficiency of $\irls^2$ in estimating shared variation between matrices can also be used to estimate shared variation between two vectors. Other formulas have already been proposed to better estimate shared variation between vectors in the context of linear models. Among them, the one presented in Equation~(\ref{eq:r2adj}) is the most often used and is the one implemented in the R linear model summary function. On simulated data, $\irls^2$ performs better than the simple $R^2$ and its commonly used modified version $R^2_{adj}$ (Figure~\ref{fig:shared_variation_vector}). Whatever the estimator, the bias decreases with the simulated shared variation. Nevertheless, for every tested case, the median bias observed with $\irls^2$ is smaller than with both other estimators, even if the classical estimators perform well for large values of shared variation.
......@@ -452,7 +452,7 @@ Vectors can be considered as a single column matrix, and the efficiency of $\irl
\label{fig:shared_variation_vector}
\end{figure}
\subsubsection*{partial coefficient of determination}
\subsubsection*{Partial coefficient of determination}
The simulated correlation network between the four matrices $\mathbf{A},\,\mathbf{B},\,\mathbf{C},\,\mathbf{D}$ induced, beyond the direct simulated correlations, a network of indirect correlations and therefore shared variances (Figure~\ref{fig:nested_shared_variation}). In such a system, the interest of partial correlation coefficients and their associated partial determination coefficients is to measure the correlation between a pair of variables without accounting for the part of that correlation which is explained by other variables, hence extracting the pure correlation between these two matrices. From Figure~\ref{fig:nested_shared_variation}, the expected partial shared variation between $\mathbf{A}$ and $\mathbf{B}$ is $480/(200+480)=0.706$; between $\mathbf{B}$ and $\mathbf{C}$, $64/(480+120)=0.107$; and between $\mathbf{C}$ and $\mathbf{D}$, $120/800=0.150$. All other partial coefficients are expected to be equal to $0$. The effect of the correction introduced in $\irls$ is clearly weaker on the partial coefficients of determination (Figure~\ref{fig:nested_shared}) than on the full coefficients of determination (Figure~\ref{fig:shared_variation}). The spurious random correlations, constituting the over-fitting effect, are distributed over all the pairs of matrices $\mathbf{A},\,\mathbf{B},\,\mathbf{C},\,\mathbf{D}$.
\begin{figure}[!tpb]%figure1
......@@ -476,7 +476,7 @@ whatever the $p$ tested (Table~\ref{tab:alpha_pvalue}). This ensure that the pro
of the distribution of $P_{values}$ correlation test to $\mathcal{U}(0,1)$
under the null hypothesis.\label{tab:alpha_pvalue}} {
% latex table generated in R 3.5.2 by xtable 1.8-4 package
% Thu Sep 12 07:14:29 2019
% Thu Sep 26 17:36:30 2019
\begin{tabular*}{0.98\linewidth}{@{\extracolsep{\fill}}crrr}
\hline
& \multicolumn{3}{c}{Cramer-Von Mises p.value} \\
......@@ -498,7 +498,7 @@ Power of the $CovLs$ test based on the estimation of $\overline{RCovLs(X,Y)}$ is
\begin{table}[!t]
\processtable{Power estimation of the procruste tests for two low level of shared variations $5\%$ and $10\%$.\label{tab:power}} {
% latex table generated in R 3.5.2 by xtable 1.8-4 package
% Thu Sep 12 07:14:29 2019
% Thu Sep 26 17:36:30 2019
\begin{tabular}{lcrrrrrrrrr}
\hline
& $R^2$ & \multicolumn{4}{c}{5\%} & &\multicolumn{4}{c}{10\%} \\
......@@ -531,7 +531,7 @@ Power of the $CovLs$ test based on the estimation of $\overline{RCovLs(X,Y)}$ is
\section{Discussion}
Correcting the over-adjustment effect on metrics assessing the relationship between high dimension datasets is a constant effort over the past decade. Therefore, $\irls$ can be considered as a continuation of the extension of the toolbox available to biologists for analyzing their omics data. The effect of the proposed correction on the classical $\rls$ coefficient is as strong as the other ones previously proposed for other correlation coefficients measuring relationship between vector data \citep[see Figure~\ref{fig:shared_variation}, e.g.][]{Smilde:09:00,SzeKely:13:00}. When applied to univariate data, $\rls$ is equal to the absolute value of the Pearson correlation coefficient, hence, and despite it is not the initial aim of that coefficient, $\irls$ can also be used to evaluate correlation between two univariate datasets. Using $\irls$ for such data sets is correcting for spurious correlations when the number of individual is small more efficiently than classical correction \citep[see Figure~\ref{fig:shared_variation_vector},][]{Theil:58:00}.
Correcting the over-adjustment effect on metrics assessing the relationship between high-dimensional datasets has been a constant effort over the past decades. Therefore, $\irls$ can be considered as a continuation of the extension of the toolbox available to biologists for analyzing their omics data. The effect of the proposed correction on the classical $\rls$ coefficient is as strong as the corrections previously proposed for other correlation coefficients measuring relationships between vector data \citep[see Figure~\ref{fig:shared_variation}, e.g.][]{Smilde:09:00,SzeKely:13:00}. When applied to univariate data, $\rls$ is equal to the absolute value of the Pearson correlation coefficient; hence, although it is not the initial aim of that coefficient, $\irls$ can also be used to evaluate the correlation between two univariate datasets. Using $\irls$ for such data sets corrects for spurious correlations when the number of individuals is small more efficiently than the classical correction \citep[see Figure~\ref{fig:shared_variation_vector},][]{Theil:58:00}.
The main advantage of $\irls$ over other matrix correlation coefficients is that it allows for estimating the shared variation between two matrices according to the classical definition of variance partitioning used with linear models. This opens the opportunity to develop linear models explaining the variation of a high-dimensional dataset by a set of other high-dimensional data matrices.
......