Biological Sequence Analysis, biologia

[ Pobierz całość w formacie PDF ]
Parametric Inference for Biological Sequence Analysis
Lior Pachter and Bernd Sturmfels
Department of Mathematics, University of California, Berkeley, CA 94720
April 8, 2004
Abstract
One of the major successes in computational biology has been the unication, using the graphi-
cal model formalism, of a multitude of algorithms for annotating and comparing biological sequences.
Graphical models that have been applied towards these problems include hidden Markov models for
annotation, tree models for phylogenetics, and pair hidden Markov models for alignment. A single al-
gorithm, the sum-product algorithm, solves many of the inference problems associated with different
statistical models. This paper introduces the
polytope propagation algorithm
for computing the Newton
polytope of an observation from a graphical model. This algorithm is a geometric version of the sum-
product algorithm and is used to analyze the parametric behavior of maximum a posteriori inference
calculations for graphical models.
1
Inference with Graphical Models for Biological Sequence Analysis
This paper develops a new algorithm for graphical models based on the mathematical foundation for statis-
tical models proposed in [18]. Its relevance for computational biology can be summarized as follows:
(a) Graphical models are a unifying statistical framework for biological sequence analysis.
(b) Parametric inference is important for obtaining biologicallymeaningful results.
(c) The polytope propagation algorithm solves the parametric inference problem.
Thesis (a) states that graphical models are good models for biological sequences. This emerging un-
derstanding is the result of practical success with probabilistic algorithms, and also the observation that
inference algorithms for graphical models subsume many apparently non-statistical methods. A noteworthy
example of the latter is the explanation of classic alignment algorithms such as Needleman-Wunsch and
Smith-Waterman in terms of the Viterbi algorithm for pair hidden Markov models [3]. Graphical models are
now used for many problems including motif detection, gene nding, alignment, phylogeny reconstruction
and protein structure prediction. For example, most gene prediction methods are now hidden Markov model
(HMM) based, and previously non-probabilistic methods now have HMM based re-implementations.
In typical applications, biological sequences are modeled as
observed random variables Y
1
Y
n
in a
graphical model. The observed random variables may correspond to sequence elements such as nucleotides
or amino acids.
Hidden random variables X
1
X
m
encode information of interest that is unknown, but
which one would like to infer. For example, the information could be an annotation, alignment or ances-
tral sequence in a phylogenetic tree. One of the strengths of graphical models is that by virtue of being
probabilistic, they can be combined into powerful models where the hidden variables are more complex.
For example, hidden Markov models can be combined with pair hidden Markov models to simultaneously
1
2
align and annotate sequences [1]. One of the drawbacks of such approaches is that the models have more
parameters and as a result inferences could be less robust.
For a xed observed sequence σ
1
σ
2
σ
n
and
xed parameters
, the standard inference problems are:
1. the calculation of
marginal probabilities
:
∑
h
1
h
m
p
σ
1
σ
n
Prob
X
1
h
1
X
m
h
m
Y
1
σ
1
Y
n
σ
n
2. the calculation of
maximum a posteriori log probabilities
:
δ
σ
1
σ
n
min
h
1
h
m
log
Prob
X
1
h
1
X
m
h
m
Y
1
σ
1
Y
n
σ
n
where the
h
i
range over all the possible assignments for the hidden random variables
X
i
. In practice, it is the
solution to Problem 2 that is of interest, since it is the one that solves the problem of nding the genes in a
genome or the “best” alignment for a pair of sequences. A shortcoming of this approach is that the solution
h
h
1
h
m
may vary considerably with a change in parameters.
Thesis (b) suggests that a
parametric
solution to the inference problem can help in ascertaining the
reliability, robustness and biological meaning of an inference result. By
parametric inference
we mean the
solution of Problem 2 for all model parameters simultaneously. In this way we can decide if a solution
obtained for particular parameters is an artifact or is largely independent of the chosen parameters. This
approach has already been applied successfully to the problem of pairwise sequence alignment in which
parameter choices are known to be crucial in obtaining good alignments [5, 12, 24]. Our aim is to develop
this approach for arbitrary graphical models. In thesis (c) we claim that the polytope propagation algorithm
is efcient for solving the parametric inference problem, and, in certain cases is not much slower than
solving Problem 2 for xed parameters. The algorithm is a geometric version of the sum-product algorithm,
which is the standard tool for inference with graphical models.
The mathematical setting for understanding the polytope propagation algorithm is
tropical geometry
.
The connection between tropical geometry and parametric inference in statistical models is developed in the
companion paper [18]. Here we describe the details of the polytope propagation algorithm (Section 3) in two
familiar settings: the hidden Markov model for annotation (Section 2) and the pair hidden Markov model
for alignment (Section 4). Finally, in Section 5, we discuss some practical aspects of parametric inference,
such as specializing parameters, the construction of single cones which eliminates the need for identifying
all possible maximum a posteriori explanations, and the relevance of our ndings to Bayesian computations.
2 Parametric Inference with Hidden Markov Models
Hidden Markov models play a central role in sequence analysis, where they are widely used to annotate DNA
sequences [2]. A simple example is the CpG island annotation problem [4, §3]. CpG sites are locations in
DNA sequences where the nucleotide cytosine (C) is situated next to a guanine (G) nucleotide (the “p” comes
from the fact that a phosphate links them together). There are regions with many CpG sites in eukaryotic
genomes, and these are of interest because of the action of DNA methyltransferase, which recognizes CpG
sites and converts the cytosine into 5-methylcytosine. Spontaneous deamination causes the 5-methylcytosine
to be converted into thymine (T), and the mutation is not xed by DNA repair mechanisms. This results in
a gradual erosion of CpG sites in the genome.
CpG islands
are regions of DNA with many unmethylated
3
CpG sites. Spontaneous deamination of cytosine to thymine in these sites is repaired, resulting in a restored
CpG site. The computational identication of CpG islands is important, because they are associated with
promoter regions of genes, and are known to be involved in gene silencing.
Unfortunately, there is no sequence characterization of CpG islands. A generally accepted denition due
to Gardiner-Garden and Frommer [8] is that a CpG island is a region of DNA at least 200bp long with a G+C
content of at least 50%, and with a ratio of observed to expected CpG sites of at least 0.6. This arbitrary
denition has since been rened (e.g. [23]), however even analysis of the complete sequence of the human
genome [16] has failed to reveal precise criteria for what constitutes a CpG island. Hidden Markov models
can be used to predict CpG islands [4, §3]. We have selected this application of HMMs in order to illustrate
our approach to parametric inference in a mathematically simple setting.
The CpG island HMM we consider has
n
hidden binary random variables
X
i
,and
n
observed random
variables
Y
i
that take on the values
A
C
G
T
(see Figure 1 in [18]). In general, an HMM can be charac-
terized by the following conditional independence statements for
i
1
n
:
p
X
i
X
1
X
2
X
i
1
p
X
i
X
i
1
p
Y
i
X
1
X
i
Y
1
Y
i
1
p
Y
i
X
i
The CpG island HMM has twelve model parameters, namely, the entries of the transition matrices
s
00
s
01
t
0
A
t
0
C
t
0
G
t
0
T
S
and
T
s
10
s
11
t
1
A
t
1
C
t
1
G
t
1
T
Here the hidden state space has just two states non-CpG
0andCpG
1 with transitions allowed between
them, but in more complicated applications, such as gene nding, the state space is used to model numerous
gene components (such as introns and exons) and the sparsity pattern of the matrix
S
is crucial.
In its
algebraic representation [18, §2], the HMM is given as the image of the polynomial map
f
:
R
12
R
4
n
S
T
∑
h
0
1
n
t
h
1
σ
1
s
h
1
h
2
t
h
2
σ
2
s
h
2
h
3
s
h
n
1
h
n
t
h
n
σ
n
(1)
The inference problem 1 asks for an evaluation of one coordinate polynomial
f
σ
of the map
f
. This can be
done in linear time (in
n
)usingthe
forward algorithm
[13], which recursively evaluates the formula
1
h
n
0
t
h
n
σ
n
1
∑
h
n
1
0
1
h
2
0
t
h
2
h
3
s
h
2
σ
2
1
h
1
0
t
h
1
h
2
s
h
1
σ
1
f
σ
s
h
n
1
h
n
t
h
n
1
σ
n
1
(2)
Problem 2 is to identify the largest term in the expansion of
f
σ
. Equivalently, if we write
u
ij
log
s
ij
and
v
ij
log
t
ij
then Problem 2 is to evaluate the piecewise-linear function
g
σ
min
h
n
v
h
n
σ
n
min
h
2
v
h
2
h
3
u
h
2
σ
2
min
h
1
u
h
1
h
2
v
h
1
σ
1
(3)
This formula can be efciently evaluated by recursively computing the parenthesized expressions. This is
known as the
Viterbi algorithm
in the HMM literature. The Viterbi and forward algorithms are instances of
the more general
sum-product algorithm
[14].
What we are proposing in this paper is to compute the collection of cones in
R
12
on which the piecewise-
linear function
g
σ
is linear. This may be feasible because the number of cones grows polynomially in
n
. Each
cone is indexed by a binary sequence
h
0
1
n
which represents the CpG islands found for any system of
parameters
u
ij
v
ij
in that cone. A binary sequence which arises in this manner is an
explanation for
σ in
the sense of [18, §4]. Our results in [18] imply that the number of explanations scales polynomially with
n
.
min
h
n
1
u
h
n
1
h
n
v
h
n
1
σ
n
1
4
Theorem 1.
For any given DNA sequence
σ
of length n, the number of bit strings
h
0
1
n
which are
explanations for the sequence
σ
in the CpG island HMM is bounded above by a constant times n
5
25
.
Proof.
There are a total of 2
4
4
12 parameters which is the dimension of the ambient space. Note,
however, that for a xed observed sequence the number of times the observation
A
is made is xed, and
similarly for
C
G
T
. Furthermore, the total number of transitions in the hidden states must equal
n
. Together,
these constraints remove ve degrees of freedom. We can thus apply [18, Theorem 7] with
d
12
5
7.
This shows that the total number of vertices of the Newton polytope of
f
σ
is
O
n
7
8
O
n
5
25
.
Figure 1: The Schlegel diagram of the Newton polytope of an observation in the CpG island HMM.
We explain the biological meaning of our parametric analysis with a very small example. Let us consider
the following special case of the CpG island HMM. First, assume that
t
iA
t
iT
and that
t
iC
t
iG
, i.e., the
output probability depends only on whether the nucleotide is a purine or pyrimidine. Furthermore, assume
that
t
0
A
t
0
G
, which means that the probability of emitting a purine or a pyrimidine in the non-CpG island
state is equal (i.e. base composition is uniform in non-CpG islands).
Suppose that the observed sequence is σ
AATAGCGG
.Weaskfor
all
the possible explanations for
σ, that is, for all possible maximum a posteriori CpG island annotations for all parameters. A priori, the
number of explanations is bounded by 2
8
256, the total number of binary strings of length eight. However,
of the 256 binary strings, only 25 are explanations. Figure 1 is a geometric representation of the solution
to this problem: the Newton polytope of
f
σ
is a 4-dimensional polytope with 25 vertices. The gure is a
Schlegel diagram
of this polytope. It was drawn with the software POLYMAKE [9, 10]. The 25 vertices
in Figure 1 correspond to the 25 annotations, which are the explanations for σ as the parameters vary. Two
annotations are connected by an edge if and only if their parameter cones share a wall. From this geometric
representation, we can determine all parameters which result in the same maximum a posteriori prediction.
5
3 Polytope Propagation
The evaluation of
g
σ
for xed parameters using the formulation in (3) is known as the Viterbi algorithm in
the HMM literature. We begin by re-interpreting this algorithm as a convex optimization problem.
Denition 2.
The Newton polytope of a polynomial
n
i
1
c
i
x
a
1
i
x
a
2
i
x
a
d
i
f
x
1
x
d
1
2
d
is dened to be the convex hull of the lattice points in
R
d
corresponding to the monomials in f :
NP
f
conv
a
1
1
a
2
1
a
d
1
a
1
n
a
2
n
a
d
n
Recall that for a xed observation there are natural polynomials associated with a graphical model,
which we have been denoting by
f
σ
. In the CpG island example from Section 2, these polynomials are the
coordinates
f
σ
of the polynomial map
f
in (1). Each coordinate polynomial
f
σ
is the sum of 2
n
monomials,
where
n
σ
. The crucial observation is that even though the number of monomials grows exponentially
with
n
, the number of vertices of the Newton polytope
NP
f
σ
is much smaller. The Newton polytope is
important for us because its vertices represent the solutions to the inference problem 2.
Proposition 3.
The maximum a posteriori log probabilities
δ
σ
in Problem 2 can be determined by minimiz-
ing a linear functional over the Newton polytope of f
σ
.
Proof.
This is nothing but a restatement of the fact that when passing to logarithms, monomials in the
parameters become linear functions in the logarithms of the parameters.
Our main result in this section is an algorithm which we state in the form of a theorem.
Theorem 4 (Polytope propagation).
Let f
σ
be the polynomial associated to a xed observation
σ
from
a graphical model. The list of all vertices of the Newton polytope of f
σ
can be computed efciently by
recursive convex hull and Minkowski sum computations on unions of polytopes.
Proof.
Observe that if
f
1
f
2
are polynomials then
NP
f
1
f
2
NP
f
1
NP
f
2
where the
on the right
hand side denotes the Minkowski sum of the two polytopes.
NP
f
1
NP
f
2
if
f
1
and
f
2
are polynomials with positive coefcients. The recursive description of
f
σ
givenin(2)
can be used to evaluate the Newton polytope efciently. The necessary geometric primitives are precisely
Minkowski sum and convex hull of unions of convex polytopes. These primitives run in polynomial time
since the dimension of the polytopes is xed. This is the case in our situation since we consider graphical
models with a xed number of parameters. We can hence run the sum-product algorithm efciently in the
semiring known as the
polytope algebra
. The size of the output scales polynomially by [18, Thm. 7].
Similarly,
NP
f
1
f
2
conv
Figure 2 shows an example of the polytope propagation algorithm for a hidden Markov model with all
random variables binary and with the following transition and output matrices:
s
00
1
s
00
1
S
and
T
1
s
11
1
s
11
Here we specialized to only two parameters in order to simplify the diagram. When we run polytope propa-
gation for long enough DNA sequences σ in the CpG island HMM of Section 2 with all 12 free parameters,
we get a diagram just like Figure 2, but with each polygon replaced by a seven-dimensional polytope.
[ Pobierz całość w formacie PDF ]

zanotowane.pl

doc.pisz.pl

pdf.pisz.pl

lemansa.htw.pl

Biological Sequence Analysis

Biological Sequence Analysis, biologia

Menu