3 juillet 2014
Salle Mont-Blanc GIPSA-lab, ENSE3, Site Ampère, Batiment D 1er étage
Efficient RNA Isoform Identification and Quantification from RNA-Seq Data with Network Flows.
Several state-of-the-art methods for isoform identification and quantification are based on l1-regularized regression, such as the Lasso. However, explicitly listing the—possibly exponentially—large set of candidate transcripts is intractable for genes with many exons. For this reason, existing approaches using the l1-penalty are either restricted to genes with few exons, or only run the regression algorithm on a small set of pre-selected isoforms.
We introduce a new technique called FlipFlop which can efficiently tackle the sparse estimation problem on the full set of candidate isoforms by using network flow optimization. Our technique removes the need of a preselection step, leading to better isoform identification while keeping a low computational cost. Experiments with synthetic and real RNA-Seq data confirm that our approach is more accurate than alternative methods and one of the fastest available.
Averaging Stochastic Block Models to estimate the graphon function of a W-graph'.
W-graph refers to a general class of random graph models that can be seen as a random graph limit. It is characterized by both its graphon function and its motif frequencies. The stochastic block model is a special case of W-graph where the graphon function is block-wise constant. In this paper, we propose a variational Bayes approach to estimate the W-graph as an average of stochastic block models with increasing number of blocks. We derive a variational Bayes algorithm and the corresponding variational weights for model averaging. In the same framework, we derive the variational posterior frequency of any motif. A simulation study and an illustration on a social network complete our work.
The Random Subgraph Model for the Analysis of an Ecclesiastical Network in Merovingian Gaul.
In the last two decades, many random graph models have been proposed to extract knowledge from networks. Most of them look for communities or more generally clusters of vertices with homogeneous connection profiles. While the first models focused on networks with binary edges only, extensions now allow to deal with valued networks. Recently, new models were also introduced in order to characterize connection patterns in networks through mixed memberships. This work was motivated by the need of analyzing a historical network where a partition of the vertices is given and where edges are typed. A known partition is seen as a decomposition of a network into subgraphs that we propose to model using a stochastic model with unknown latent clusters. Each subgraph has its own mixing vector and sees its vertices associated to the clusters. The vertices then connect with a probability depending on the subgraphs only, while the types of the edges are assumed to be sampled from the latent clusters. A variational Bayes expectation-maximization algorithm is proposed for inference as well as a model selection criterion for the estimation of the cluster number. Experiments are carried out on simulated data to assess the approach. The proposed methodology is then applied to an ecclesiastical network in merovingian Gaul. An R package, called Rambo, implementing the inference algorithm is available on the CRAN.
Ref : C. Bouveyron, L. Jegou, Y. Jernite, S. Lamassé, P. Latouche & P. Rivera, The random subgraph model for the analysis of an ecclesiastical network in merovingian Gaul, The Annals of Applied Statistics, vol. 8(1), pp. 377-405, 2014.
Nowadays, the recent progress in computer science and storage require the management of huge dimensional data, so called big data. However, the processing of such amount of data is still challeng ing but also compulsory in order to be able to extract precise and correct information. The representation of this data using networks (also called graphs in mathematics) is a clever way to visualise and analyse the big data. The social networks where each node represents an individual and each connection is a link netween two individuals are an example of networks that can be of very high dimension. Here the objective is to characterize groups of individuals or communities where individuals act in the same way. The brain networks are an other example where a node corresponds to a brain regions and a connection between two nodes indicate a transfert of information between the pair of regions. The research on brain networks has recently increased in order to be able to link a default in the network to a pathology. These networks usually have topological characteristics such as the presence of hubs, communities, clustering ... The identification of these properties is a major challenge when studying changes in the dynamic evolution of networks. During this project, we propose to study different methods that allow us to simulate generative models for graphs with fixed topological characteristics. This simulated graphs should be close to real networks in so me sense. This will lead us to the construction of a basis of graphs where the graphs will be as different as possible. The objective of the project is to classify, define distance on graphs and decompose the observed graphs on the constructed basis.
One important task of population genetics is to estimate ancestry coefficients, i.e., the proportions of an individual's genotype originating from different ancestral populations. This is commonly solved by fitting Bayesian models and deriving maximum likelihood estimates or estimating coefficients using non-negative matrix factorization (NMF). A new NMF method is proposed for the estimation of ancestry coefficients that exploits the coefficients' local invariance. An underlying weighted graph is imposed which reflects these local similarities between individuals' ancestries. The method determines a factorization providing ancestry estimates with low deviation from the graph's invariance structure. Two algorithms are presented and compared using an extensive simulation study. Additionally, the new method is compared to an existing Bayesian model. The results from the simulation study indicate that this method may provide superior ancestry estimates.
This talk addresses the issue of detecting change-points in multivariate time series. The proposed approach differs from existing counterparts by making only weak assumptions on both the change-points structure across series, and the statistical signal distributions. Specifically change-points are not assumed to occur at simultaneous time instants across series, and no specific distribution is assumed on the individual signals. It relies on the combination of a local robust statistical test acting on individual time segments, with a global Bayesian framework able to optimize configurations from multiple local statistics (from segments of a unique time series or multiple time series). Using an extensive experimental set-up, our algorithm is shown to perform well on Gaussian data, with the same results in term of recall and precision as classical approaches, such as the fused lasso and the Bernoulli Gaussian model. Furthermore, it outperforms the reference models in the case of non normal data with outliers. The control of the False Discovery Rate by an acceptance level is confirmed. In the case of multivariate data, the probabilities that simultaneous change-points are shared by some specific time series are learned. We finally illustrate our algorithm with real datasets from energy monitoring and genomic. Segmentations are compared to state-of-the-art approaches based on fused lasso and group fused lasso.