Digest

Chapter I

Microbiome, defined as the collection of microbes that inhabit a given environment, and metagenome, defined as the collection of their genes, have been the focus of growing attention for over a decade. In the human gut, microbial communities are responsible for carbohydrate degradation (Flint et al., 2012; Rowland et al., 2018) and help the immune system regulate inflammation (Blander et al., 2017). However, microbiome deregulation can also lead to moderate to severe disease, like Crohn’s disease (Morgan et al., 2012), depression (Foster & Neufeld, 2013), or necrotizing enterocolitis (Mai et al., 2011). The food and pharmaceutical industries saw the tremendous potential of the gut microbiome and began developing products to take advantage of it. Among many different approaches, probiotics to lower depressive symptoms (Pinto-Sanchez et al., 2017), prebiotics to promote specific bacteria and lower the risk of obesity (Sakwinska et al., 2017), and fecal transplants to fight against lethal forms of diarrhea (Van Nood et al., 2013) have been tested and proved to be highly effective.

Several techniques are available to establish the composition of the microbiome. The marker gene sequencing approach, also called metabarcoding, targets a universal but rapidly evolving gene (usually the 16S rRNA gene when studying bacteria) whose sequence acts as a barcode to identify the species it originates from (Morgan & Huttenhower, 2012). However, it suffers from several drawbacks, such as the restriction to bacteria and archaea or the impossibility to obtain any information on the biological functions present in or expressed by the community. To bypass those limitations, the whole genome shotgun (WGS) sequencing approach amplifies all the genetic material to reconstruct a functional and more complete view of the sample. It comes, however, with added complexity: the sequences must be mapped to preconstructed reference catalogues (Quince et al., 2017).

By construction, metagenomics data consist of counts (numbers of reads per species or per gene) and can be modeled as such. Several models account for overdispersion by using negative binomial models (Zhang et al., 2017) or for the overrepresentation of zeros by using zero-inflated models (Xinyan et al., 2016). However, several authors also argue that the counts are misleading and that metagenomics data are compositional in essence and should be analyzed within an appropriate framework (Gloor & Reid, 2016).
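
As a minimal illustration of the compositional point of view, the centered log-ratio (CLR) transformation replaces each count by its log-ratio to the per-sample geometric mean. The Python sketch below uses a hypothetical count matrix and an arbitrary pseudocount to handle zeros:

```python
import numpy as np

def clr(counts, pseudocount=1.0):
    """Centered log-ratio transform of a samples x taxa count matrix."""
    x = counts + pseudocount                 # pseudocount avoids log(0) on sparse counts
    log_x = np.log(x)
    # subtract the per-sample mean of the logs, i.e. the log of the geometric mean
    return log_x - log_x.mean(axis=1, keepdims=True)

# toy example: 3 samples, 4 taxa
counts = np.array([[10, 0, 5, 85],
                   [20, 3, 7, 70],
                   [ 1, 0, 9, 90]])
print(clr(counts))
```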

Chapter II

In order to detect taxa that are differentially abundant between groups, several procedures have been developed. Among them, the Wilcoxon-Mann-Whitney (Mann & Whitney, 1947; Wilcoxon, 1992) and Kruskal-Wallis (Kruskal & Wallis, 1952) tests are generic rank-based tests whose null hypothesis is a common count distribution in all groups. There are also specific tests tailored to the count or compositional nature of sequencing-based omics data, like edgeR (Robinson et al., 2010), DESeq2 (Love et al., 2014), mbzinb (Chen et al., 2018) or ALDEx2 (Fernandes et al., 2014). The former two were first designed for transcriptomics data and later imported to microbiome data, whereas the latter two were designed from the ground up for microbiome data.
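
As an illustration, the two generic rank-based tests are readily available in SciPy; the sketch below applies them to simulated counts for a single taxon (the distributions, parameters and sample sizes are arbitrary):

```python
import numpy as np
from scipy.stats import mannwhitneyu, kruskal

rng = np.random.default_rng(42)
group_a = rng.negative_binomial(n=2, p=0.10, size=30)  # counts of one taxon in group A
group_b = rng.negative_binomial(n=2, p=0.05, size=30)  # counts of the same taxon in group B
group_c = rng.negative_binomial(n=2, p=0.08, size=30)  # a third group

# two-group comparison (Wilcoxon-Mann-Whitney)
_, p_wmw = mannwhitneyu(group_a, group_b, alternative="two-sided")

# comparison of three or more groups (Kruskal-Wallis)
_, p_kw = kruskal(group_a, group_b, group_c)

print(p_wmw, p_kw)
```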

All the previously mentioned tests are univariate: they test one taxon at a time for differential abundance and require a subsequent step of multiple testing correction to avoid a high number of false discoveries. The Benjamini-Hochberg procedure (Benjamini & Hochberg, 1995) is the best known and most popular of those correction procedures: it controls the expected proportion of false discoveries \(\EE{\frac{FP}{TP + FP}}\) at a certain level, defined a priori.
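
The procedure itself is short; a minimal Python sketch is given below (the p-values at the end are arbitrary and only illustrate the call):

```python
import numpy as np

def benjamini_hochberg(pvalues, alpha=0.05):
    """Boolean rejection mask controlling the FDR at level alpha."""
    p = np.asarray(pvalues, dtype=float)
    m = p.size
    order = np.argsort(p)
    # largest rank k such that p_(k) <= k * alpha / m
    below = p[order] <= np.arange(1, m + 1) * alpha / m
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.nonzero(below)[0].max()       # index of the largest passing rank
        reject[order[:k + 1]] = True         # reject every hypothesis up to that rank
    return reject

print(benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.20, 0.60]))
```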

Recently, new procedures have been proposed to take a hierarchical structure (mainly the phylogeny) into account during differential abundance analyses. TreeFDR (Xiao et al., 2017) is a two-step procedure based on a smoothing of the \(z\)-scores followed by a permutational correction technique. The smoothing is based on a hierarchical model whose hyperparameters control the level of smoothing. Hierarchical FDR, theorized by Yekutieli (2008) and implemented by Sankaran & Holmes (2014), considers instead a top-down approach: it progresses down the tree, from the root to the leaves, and tests increasingly smaller groups of taxa until it cannot reject the null hypothesis anymore. Finally, Huang et al. (2020) developed treeclimbR, which selects differentially abundant clades based on a composite score computed for each internal node of the tree.
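
A heavily simplified sketch of the top-down idea follows: the children of a rejected node are tested as one family, and the descent stops as soon as no child is rejected. The tree encoding is ad hoc, the per-node p-values are assumed precomputed, and the level adjustment of the actual procedure of Yekutieli (2008) is omitted:

```python
def hierarchical_descent(node, pvalues, alpha, family_test):
    """Return the names of rejected nodes below `node`.

    `node` is a nested dict {"name": ..., "children": [...]}, `pvalues` maps
    node names to p-values and `family_test` is a multiple testing procedure
    (e.g. the Benjamini-Hochberg sketch above) returning a rejection mask.
    """
    rejected = []
    children = node.get("children", [])
    if not children:
        return rejected
    mask = family_test([pvalues[c["name"]] for c in children], alpha)
    for child, rej in zip(children, mask):
        if rej:                                   # descend only below rejected nodes
            rejected.append(child["name"])
            rejected += hierarchical_descent(child, pvalues, alpha, family_test)
    return rejected
```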

Chapter III

This chapter addresses the question of which tree to use in hierarchical approaches. The phylogeny reflects the evolutionary history of the taxa and is built from nucleotide sequences of the dataset. The taxonomy is a highly polytomic tree of nested ranks that is available in reference databases (Geer et al., 2010). We introduce the correlation tree, a tree built from the pairwise correlation matrix of the abundance data using a hierarchical clustering procedure.
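
A minimal sketch of such a construction with SciPy is shown below; the abundance table is simulated, and the choice of average linkage on 1 - correlation dissimilarities is only one possible convention:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, to_tree
from scipy.spatial.distance import squareform

rng = np.random.default_rng(0)
abundances = rng.poisson(lam=5, size=(40, 12))        # hypothetical samples x taxa table

corr = np.corrcoef(abundances, rowvar=False)          # taxa x taxa correlation matrix
dissim = 1.0 - corr                                   # turn correlations into dissimilarities
np.fill_diagonal(dissim, 0.0)
Z = linkage(squareform(dissim, checks=False), method="average")
correlation_tree = to_tree(Z)                         # root node of the resulting binary tree
```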

To compare those trees, we consider several distances. Formally, rooted trees are directed acyclic graphs: they can be compared using the Robinson-Foulds distance (Robinson & Foulds, 1981), which focuses solely on topologies (trees without branch lengths), the cophenetic distance (Sokal & Rohlf, 1962), which focuses on path lengths in the tree, or the Billera-Holmes-Vogtmann (BHV) distance (Billera et al., 2001), which embeds trees in a geometric space and considers shortest paths in that space.
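
Robinson-Foulds and BHV distances call for specialized phylogenetics libraries, but the cophenetic distance can be sketched directly: compute the matrix of leaf-to-leaf path lengths in each tree and compare the two matrices. The Python sketch below uses two SciPy linkage trees built on the same simulated leaves; taking the Euclidean norm between cophenetic matrices is one possible convention:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist

rng = np.random.default_rng(1)
profiles = rng.normal(size=(12, 30))                  # hypothetical profiles for 12 leaves

tree_1 = linkage(pdist(profiles), method="average")
tree_2 = linkage(pdist(profiles), method="complete")  # a second tree on the same leaf set

coph_1 = cophenet(tree_1)                             # condensed matrix of leaf-to-leaf path lengths
coph_2 = cophenet(tree_2)
print(np.linalg.norm(coph_1 - coph_2))                # distance between the two trees
```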

We first investigate whether the phylogeny (or the taxonomy) is close to the correlation tree. We compute pairwise BHV distances on a forest of trees comprising the phylogeny, the correlation tree, bootstrapped versions of the correlation tree and random trees, followed by a PCoA. It appears that the phylogeny is neither in the confidence region of the correlation tree, nor closer to the correlation tree than a random tree.
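
The BHV distances themselves require dedicated software, but given the resulting distance matrix, the PCoA step is classical double centering followed by an eigendecomposition. A minimal sketch, with a random symmetric matrix standing in for the pairwise BHV distances:

```python
import numpy as np

def pcoa(D, n_axes=2):
    """Classical PCoA coordinates from a matrix of pairwise distances."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n           # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                   # double-centered Gram matrix
    eigval, eigvec = np.linalg.eigh(B)
    idx = np.argsort(eigval)[::-1][:n_axes]       # keep the axes with the largest eigenvalues
    return eigvec[:, idx] * np.sqrt(np.maximum(eigval[idx], 0.0))

rng = np.random.default_rng(2)
D = rng.uniform(size=(10, 10))                    # placeholder for the BHV distance matrix
D = (D + D.T) / 2
np.fill_diagonal(D, 0.0)
coordinates = pcoa(D)                             # one row of 2D coordinates per tree
```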

We then apply the hierarchical procedures with the different trees. The correlation tree performs better than the phylogeny, as it naturally captures the similarity of abundance profiles and thus groups taxa in a relevant way. However, the classical BH procedure outperforms the hierarchical procedures, no matter which tree is used.

Chapter IV

The Ornstein-Uhlenbeck process with optimal value \(\ou{\optim}\) is defined as the solution of the stochastic differential equation:

\[\begin{equation*} \dx{W_t} = - \ou{\alpha} (W_t - \ou{\optim}) \dx{t} + \ou{\sigma}\dx{B_t}. \end{equation*}\]

It is a well-suited framework to model the evolution of continuous phylogenetic traits (Freckleton et al., 2003), especially because of its Gaussian limit law centered at \(\ou{\optim}\). It has been used as a building block for more complex models, by considering OU processes on a tree and piecewise constant functions for \(\ou{\optim}\) (Bastide et al., 2017).
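
A minimal Euler-Maruyama sketch of a single OU trajectory is given below (a single lineage, no tree, with arbitrary parameter values); for large \(t\) the simulated values are approximately Gaussian, centered at the optimum, with variance \(\ou{\sigma}^2 / (2\ou{\alpha})\):

```python
import numpy as np

def simulate_ou(w0, alpha, sigma, optimum, t_max=10.0, dt=0.01, rng=None):
    """Euler-Maruyama scheme for dW_t = -alpha (W_t - optimum) dt + sigma dB_t."""
    rng = rng or np.random.default_rng()
    n_steps = int(t_max / dt)
    w = np.empty(n_steps + 1)
    w[0] = w0
    for i in range(n_steps):
        drift = -alpha * (w[i] - optimum) * dt
        diffusion = sigma * np.sqrt(dt) * rng.standard_normal()
        w[i + 1] = w[i] + drift + diffusion
    return w

trajectory = simulate_ou(w0=5.0, alpha=1.0, sigma=0.5, optimum=0.0)
```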

We propose a new model, zazou, where the \(z\)-scores arise from an OU process on a tree with shifts on its optimal value. Moreover, we assume that under \(\mathcal{H}_1\), \(\zs_i \sim \normal{\mu_i}{1}\) with \(\mu_i < 0\) (McLachlan & Peel, 2004): finding the alternative hypotheses is thus equivalent to finding the non-zero components of \(\mu\). This can be reframed as a constrained version of the well-known lasso (Tibshirani, 1996) to obtain a point estimator of the \(\mu_i\). We then enhance this estimator using two desparsification procedures from Zhang & Zhang (2014) and Javanmard & Montanari (2014) to debias the lasso estimate, build confidence intervals and in turn compute \(p\)-values for each component of \(\mu\). Those \(p\)-values act as tree-smoothed \(p\)-values. The last step of zazou is the application of a multiple testing correction designed for the desparsified lasso (Javanmard et al., 2019) to the computed \(p\)-values.
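
A heavily simplified numerical sketch of this pipeline on a generic linear model is given below; the design matrix, the tuning parameters and the ridge-regularised approximation of the precision matrix are all placeholders, and the tree-specific constraints of zazou are not reproduced:

```python
import numpy as np
from scipy.stats import norm
from sklearn.linear_model import Lasso

rng = np.random.default_rng(3)
n, p = 100, 50
X = rng.normal(size=(n, p))
mu_true = np.zeros(p)
mu_true[:3] = -2.0                                 # a few truly non-zero (negative) components
y = X @ mu_true + rng.normal(size=n)

# lasso point estimate (unconstrained here, unlike zazou)
mu_hat = Lasso(alpha=0.1, fit_intercept=False).fit(X, y).coef_

# one-step debiasing in the spirit of Zhang & Zhang (2014):
# mu_deb = mu_hat + M X^T (y - X mu_hat) / n, with M an approximate inverse of X^T X / n
gram = X.T @ X / n
M = np.linalg.inv(gram + 0.05 * np.eye(p))         # crude ridge-regularised inverse
mu_deb = mu_hat + M @ X.T @ (y - X @ mu_hat) / n

# asymptotic standard errors and one-sided p-values for the alternative mu_i < 0
sigma_hat = np.sqrt(np.sum((y - X @ mu_deb) ** 2) / n)
se = sigma_hat * np.sqrt(np.diag(M @ gram @ M.T) / n)
pvalues = norm.cdf(mu_deb / se)
```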

We evaluate our procedure on both synthetic and real data. When the tree is informative in the simulations, zazou outperforms BH and TreeFDR in terms of TPR but does not control the FDR at the nominal level. Using a threshold-independent approach, we show that zazou also outperforms competing methods in terms of ROC curves and AUC values. However, when the tree is not informative, forcing an irrelevant constraint leads to a significant loss of AUC.

Chapter V

In this last chapter, we address three numerical optimization problems. The proposed algorithms are iterative and converge to a reasonable approximation of the solution.

The first one is a variation of the lasso with a linear constraint:

\[\begin{equation*} \left\{ \begin{aligned} \param^\star & = \argmin_{\param \in \RR^p} \left\|y - X\param\right\|_2^2 + \lambda \|\param\|_1 \\ & \text{s.t. } U\param \in \RR^q_{-} \end{aligned} \right. \end{equation*}\]

where \(y \in \RR^n\), \(X \in \RR^{n\times p}\) and \(U \in \RR^{q \times p}\).
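The thesis develops a dedicated iterative algorithm for this problem; as a point of comparison, the same program can be handed to a generic convex solver. A sketch with cvxpy and random placeholder data (the regularisation strength is arbitrary):

```python
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(4)
n, p, q = 50, 20, 5
y = rng.normal(size=n)
X = rng.normal(size=(n, p))
U = rng.normal(size=(q, p))
lam = 0.5                                          # arbitrary regularisation strength

beta = cp.Variable(p)
objective = cp.Minimize(cp.sum_squares(y - X @ beta) + lam * cp.norm1(beta))
constraints = [U @ beta <= 0]                      # U beta must lie in the non-positive orthant
cp.Problem(objective, constraints).solve()
beta_star = beta.value
```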

The second one is the minimization of a quadratic form subject to an infinity norm constraint:

\[\begin{equation*} \left\{ \begin{aligned} x^\star & = \argmin_{x \in \mathbb{R}^{n}} \ x^TAx, \\ &\text{s.t. }\|Ax - e\|_{\infty} \leq \gamma, \end{aligned} \right. \end{equation*}\]

where \(A \in \RR^{n\times n}\) is a positive semidefinite matrix, \(e\in \RR^n\) and \(\gamma > 0\).
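Again, a generic convex solver can serve as a reference implementation; a cvxpy sketch with a randomly generated positive semidefinite matrix and arbitrary values for \(e\) and \(\gamma\):

```python
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(5)
n = 15
B = rng.normal(size=(n, n))
A = B @ B.T + 1e-6 * np.eye(n)                     # positive (semi)definite by construction
e = rng.normal(size=n)
gamma = 2.0

x = cp.Variable(n)
objective = cp.Minimize(cp.quad_form(x, A))        # x^T A x
constraints = [cp.norm(A @ x - e, "inf") <= gamma]
cp.Problem(objective, constraints).solve()
x_star = x.value
```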

The final one is a feasibility problem, where the goal is to find an element \(x\) that satisfies:

\[\begin{equation*} \left\{x \in \RR^n : \|Mx-e\|_{\infty} < \gamma\right\}, \end{equation*}\]

with \(M \in \RR^{m\times n}\), \(e \in \RR^m\) and \(\gamma > 0\) sufficiently large.
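
The feasibility problem can likewise be checked with a convex solver by minimizing a constant objective; the strict inequality is replaced here by a non-strict one with a small, arbitrary margin, and the data are placeholders:

```python
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(6)
m, n = 30, 10
M = rng.normal(size=(m, n))
e = rng.normal(size=m)
gamma = 5.0

x = cp.Variable(n)
# constant objective: any point satisfying the constraint is acceptable
problem = cp.Problem(cp.Minimize(0), [cp.norm(M @ x - e, "inf") <= gamma - 1e-6])
problem.solve()
feasible = problem.status == cp.OPTIMAL
x_star = x.value
```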

Conclusions and outlooks

The issue considered in this manuscript is the detection of differentially abundant taxa among several groups. The inverse problem is a prediction problem, where one wants to predict the groups from the taxa. Including hierarchical information in prediction has already been done, in a weak form, for linear regression by Park et al. (2007). There, the hierarchical information was the partition of the variables into groups with high covariance within them. The authors showed that estimating one coefficient per group can be more beneficial than estimating one per variable.
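
The sketch below is not the estimator of Park et al. (2007), but it illustrates the underlying idea of one coefficient per group of correlated variables, with simulated data, an arbitrary number of clusters, and group averages as the new predictors:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(7)
X = rng.normal(size=(60, 20))                      # hypothetical abundances (samples x taxa)
y = rng.normal(size=60)                            # outcome to predict

# group correlated variables by hierarchical clustering, then fit one coefficient per group
corr = np.corrcoef(X, rowvar=False)
dissim = squareform(1.0 - np.abs(corr), checks=False)
groups = fcluster(linkage(dissim, method="average"), t=5, criterion="maxclust")
X_grouped = np.column_stack([X[:, groups == g].mean(axis=1) for g in np.unique(groups)])
model = LinearRegression().fit(X_grouped, y)       # one coefficient per group of taxa
```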

Existing hierarchical procedures take precomputed \(p\)-values or \(z\)-scores as input. In other words, the hierarchical information always comes in as a second step, for smoothing or correction. To our knowledge, there is no test that uses this information from the start.

zazou suffers from a minor problem during model selection. Sometimes, the BIC-selected model has no shifts at all. Alternative model selection procedures (Baudry et al., 2012; Khabbazian et al., 2016) have been tested, but without success. An ideal model selection procedure would select a model with slightly too many shifts, as the superfluous ones would be removed by the subsequent steps.

Some theoretical work is still required: during the desparsification step, to constrain the shifts on the branches so that their sums at the leaves are non-positive, and to check whether the hypotheses of the various combined methods are satisfied.