SANITY (SAmpling Noise corrected Inference of Transcription activitY) is a unique Bayesian procedure for normalizing single-cell RNA-seq data. SANITY estimates log expression values and associated error bars directly from raw UMI counts without any tunable parameters.

SANITY source code and installation instructions are available on GitHub.

We present MotEvo, an integrated suite of Bayesian probabilistic methods for the prediction of TFBSs and inference of regulatory motifs from multiple alignments of phylogenetically related DNA sequences, which incorporates all features just mentioned. In addition, MotEvo incorporates a novel model for detecting unknown functional elements that are under evolutionary constraint, and a new robust model for treating gain and loss of TFBSs along a phylogeny. Rigorous benchmarking tests on ChIP-seq datasets show that MotEvo's novel features significantly improve the accuracy of TFBS prediction, motif inference, and enhancer prediction.
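
As a rough illustration of the basic idea behind weight-matrix-based TFBS prediction, the sketch below scores every window of a single sequence by its log-odds under a weight matrix versus a uniform background. It is not MotEvo's model: it ignores alignments, phylogeny, and site gain and loss. The matrix `WM`, the background, and all function names are hypothetical and chosen only for this example.

```python
import math

# Hypothetical 3-column weight matrix: P(base | position in site).
WM = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
]
# Assumed uniform background base frequencies.
BACKGROUND = {base: 0.25 for base in "ACGT"}


def log_odds(window, wm=WM, bg=BACKGROUND):
    """Log ratio of P(window | weight matrix) to P(window | background)."""
    return sum(math.log(col[base] / bg[base]) for col, base in zip(wm, window))


def scan(sequence, wm=WM, bg=BACKGROUND):
    """Return (start, log-odds score) for every window of site width."""
    width = len(wm)
    return [
        (start, log_odds(sequence[start:start + width], wm, bg))
        for start in range(len(sequence) - width + 1)
    ]


def best_site(sequence, wm=WM, bg=BACKGROUND):
    """Return the (start, score) pair with the highest log-odds score."""
    return max(scan(sequence, wm, bg), key=lambda hit: hit[1])


if __name__ == "__main__":
    start, score = best_site("TTAGCTT")
    print(start, round(score, 3))
```

A positive score means the window is better explained by the weight matrix than by the background; methods that work on alignments extend this comparison across orthologous sequences.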

Download: Source, Linux binary, Mac binary

The DWT-toolbox is a collection of software tools for performing motif finding and transcription factor binding site (TFBS) predictions with Dinucleotide Weight Tensors (DWTs). Besides a motif finder, and a program for predicting TFBSs with a given DWT in a given set of sequences, the toolbox also includes a program for constructing dilogos that visualize DWT motifs.

Download DWT-toolbox, DWT-toolbox online tool

PhyloGibbs is an algorithm for discovering regulatory sites in a collection of DNA sequences, including multiple alignments of orthologous sequences from related organisms. Many existing approaches search either for sequence motifs that are overrepresented in the input data, or for sequence segments that are more evolutionarily conserved than expected. PhyloGibbs combines these two approaches and identifies significant sequence motifs by taking both over-representation and conservation signals into account.

Download PhyloGibbs

Using the assumption that regulatory sites can be represented as samples from weight matrices (WMs), we derive a unique probability distribution for assignments of sites into clusters. Our algorithm, PROCSE (probabilistic clustering of sequences), uses Monte Carlo sampling of this distribution to partition and align thousands of short DNA sequences into clusters. The algorithm internally determines the number of clusters from the data and assigns significance to the resulting clusters.
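
To make the notion of sites as "samples from a weight matrix" concrete, the sketch below computes the probability that a set of equal-length sites was drawn from one unknown WM, integrating each column over a symmetric Dirichlet prior. This is a standard Dirichlet-multinomial calculation offered as an assumption about how such a cluster score can be built; it is not taken from the PROCSE source, and the function name and `pseudocount` parameter are hypothetical.

```python
import math
from collections import Counter

BASES = "ACGT"


def log_cluster_likelihood(sites, pseudocount=1.0):
    """Log probability that all sites come from a single unknown WM.

    Each column is treated independently; the unknown base frequencies
    of that column are integrated out under a symmetric Dirichlet prior
    with the given pseudocount (an assumed, illustrative choice).
    """
    n = len(sites)
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sites must have the same length")

    total = 0.0
    for pos in range(length):
        counts = Counter(site[pos] for site in sites)
        # Normalisation: Gamma(4a) / Gamma(n + 4a)
        total += math.lgamma(4 * pseudocount) - math.lgamma(n + 4 * pseudocount)
        # Per-base terms: Gamma(n_b + a) / Gamma(a)
        for base in BASES:
            total += math.lgamma(counts[base] + pseudocount) - math.lgamma(pseudocount)
    return total


if __name__ == "__main__":
    together = log_cluster_likelihood(["ACGT", "ACGT"])
    apart = 2 * log_cluster_likelihood(["ACGT"])
    # Identical sites score higher in one cluster than in two.
    print(together > apart)
```

Comparing the score of a joint cluster with the summed scores of separate clusters, as in the usage lines, is the kind of quantity a Monte Carlo sampler over partitions can use when deciding whether to merge or split sites.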

Download PROCSE

We develop a computational method that uses Hidden Markov Models and an Expectation Maximization algorithm to detect cis-regulatory modules, given the weight matrices of a set of transcription factors known to work together. Two novel features of our probabilistic model are: (i) correlations between binding sites, known to be required for module activity, are exploited, and (ii) phylogenetic comparisons among sequences from multiple species are made to highlight a regulatory module. The novel features are shown to improve detection of modules, in experiments on synthetic as well as biological data.

Download STUBB

Spa is a computer program for aligning cDNA sequences to a genome. It uses a probabilistic Bayesian model to find the optimal alignment. To keep running times feasible, we use the BLAT gfServer to identify candidate genomic loci and return the best mapping from these loci.

Download SPA