We perform unsupervised classification (clustering) of spoligotyping patterns. Clustering (please do not confuse with finding clusters of two isolates with matching genotypes) is a process of grouping the objects in the data by some similarity measure. "Unsupervised" means that we assume no prior knowledge on what family each spoligotype belongs to. All we start with is our data, a collection of spoligotypes.
The clustering methods can be divided into two major categories: discriminative (distance-based methods) and generative (model-based) methods.
Distance-based methods require a metric of pairwise distance between data points. This distance measure is often difficult to define, especially when the data are complex, for example, biological sequences. Model-based clustering techniques, such as mixture models, have the advantages that they do not require the distance metric.
The model-based approach is the most appropriate for spoligotypes, because the distance measure between spoligotypes has not been determined yet. We cannot precisely determine how far is, say, M. tuberculosis Beijing from M. bovis, given their spoligotyping patterns. The belief is that spoligotypes develop by the deletion of spacers, but it is not clear in what order and how many spacers can be lost simultaneously; therefore, the distance can be probabilistically assessed as a probability of a "child" spoligotype having been evolved from a "parent" spoligotype by mostly losing but not gaining spacers.
We assume that a multivariate Bernoulli mixture model (MBMM) generates the data and that there is a one-to-one correspondence between mixture model components and spoligotype families. Each model component satisfies the Naïve Bayes assumption. Naïve Bayes assumption allows treating all the features (in our case, a feature is a presence/absence of a spacer - 1 or 0) as independent of each other given the class.
Bernoulli distribution has only one parameter, p, the probability of success (spacer in our case) in a trial with two possible outcomes. The probability of the absence of a spacer is (1-p). Thus each position in a spoligotype can be modeled as a Bernoulli distribution. There are 43 positions; therefore, multivariate Bernoulli distribution (naturally, with 43 parameters, each of which is a probability of a spacer) is used. Finally, we know that there are families within the spoligotyping data; therefore, some number of different multivariate Bernoulli distributions, each corresponding to a family, should model the data. Thus we have a mixture of components each corresponding to a distribution.
Now let us formalize our probabilistic framework.
Let
be a set of spoligotypes that we want to classify. Each spoligotype is a binary
43-dimensional vector:
. Let
be a mixture model, which consists of
components:
. Each mixture component
is described by parameters
, that are mixing weight of the
component,
, and 43 Bernoulli distributions. The mixing weights satisfy the constrains:
and |
(1) |
![]() |
(2) |
![]() |
(3) |
![]() |
(4) |
The parameters for finite mixture models are often estimated by the maximum likelihood (ML) approach. Expectation-maximization (EM) is the most commonly used algorithm for finding a local maximum of the likelihood of the observed data.