Once that we have obtained the topology of the tree we can compute the length of the branches in a more accurate way. Recall that in an evolution phylogenetic tree (ept), the length of the branches are proportional to the PAM distance between the nodes. In other words, we want to build a tree such that the branch lengths conform the distance matrix as closely as possible.
Since we do not have information about the internal nodes, all our computations will be based on the external nodes (present day) information.
An ept will be one which reflects the distance constraints accurately. For example,
the distance from node 1 to node Y plus the distance from node Y to node 3 should be as close as possible to the PAM distance between the two sequences 1 and 3 In general, if is the closest common ancestor of the sequences i and j
The crucial observation necessary to compute the distances is that in a correct ept
is a random variable
with expected value 0 and variance 1.
Optimally all the would be zero, but this
is not possible to guarantee since we do not have an
exact measure of the distances but estimates.
Then it is natural to select the internode branch lengths
as the values which minimize the sum of the .
This will give us a least squares fit
of the distances.
This is a non-conventional least squares problem which
does not present any technical difficulties except for
attention to detail.
In Darwin the computation of an ept by the above method
is done by the function
MinSqTree: Usage: MinSqTree( Dist:matrix(numeric), Var:matrix(numeric), La
Returns a binary tree which approximates the given distances and
variances. Labels is an optional array for labeling the nodes.
MinSqTree computes the topology of the tree with
PhyloTree and then it solves the least
squares fit for the internode lengths and constructs the
> tree3 := MinSqTree( Dist, Var, ['1','2','3','4','5'] );
MinLen not set, assumed to be 0.1
tree fitting index 0.12 (poor is >1, good is <1)
tree3 := Tree(Tree(Leaf(3,-6.2611,3),-1.6669,Tree(Leaf(2,-8.3550,2),
The output of the above tree is shown in figure
Phylogenetic tree generated by MinSqTree)
There are three sources of difficulties in the least squares computation. The first difficulty is that one pair of distances is underdetermined. The distances between the root and left and right subtrees are not completely determined. Their sum is, but the precise location of the root between these first two descendants is not. This is due to the fact that every time that one of these distances appears, it appears added with the other. This can be quickly verified in our example where all the occurrences of and are of the form . This is a consequence of our symmetric model of evolution, and it also happens for the alignment of two individual sequences (see chapter 7).
We resolve this difficulty with an arbitrary constraint, that the average height of all the nodes in the right and left subtree be the same. This gives a reasonable shape to the tree, but it has to be remembered that the location of the root between its two first descendants is arbitrary.
The second difficulty arises when we have incomplete data and the least squares fit results underdetermined. This could happen because the all-against-all procedure may fail to align some pairs. When we do the all-against-all alignment we should only accept the matches with a similarity score above certain threshold (this threshold is usually between 80 and 100). As a result of being unable to align some sequences the tree may be disconnected. When the tree is disconnected, the connected components should be handled separately. However it may happen that the tree is connected but the number of successful alignments is not enough to resolve all internode distances. Notice that when we have n sequences, we have n-1 internal nodes which give 2n-2 internode unknown distances. One of the two distances at the root is not determined so we have 2n-3 unknowns. The all-against-all matching provides at most n(n-1)/2 data points. Consequently the internode distance computation
is exactly determined for n=2 and n=3 and overdetermined (if we have all the data points) for . On the other hand a tree could be connected with just n-1 alignments which will not be enough to resolve 2n-3 distances.
We resolve this problem by adding a term to the least squares which minimizes the sum of the square of the lengths of the branches. This sum of squares is multiplied by a very small factor ( ), so when the problem is fully determined, this extra term will not affect the results in any significant way, but when the system of equations is underdetermined this term will force undetermined lengths to be as short as possible. This solution is entirely satisfactory in practice. It should also be noted that this sort of problem happens rarely, and only when the sequences being analyzed are marginally related.
The third difficulty is that the least squares solution
may make some of the internode lengths
A negative length branch does not make sense in evolutionary
Such a situation arises from: (a) a bias in the
estimation of the PAM distances, or (b) alignments which
do not correspond to homology and hence the distance
estimates are incorrect and (c) an incorrect topology
(phase I of the construction failing).
When an internode branch is assigned a negative value,
this means that the least squares are not
attainable at that point, but in the limit of that branch
being of length zero.
A branch of length 0 is not impossible, it corresponds
to an ancestral sequence which split in a very short
period into three descendant families (two new offsprings)
instead of two (one new offspring).
In other words a ternary
as opposed to binary internal node.
The Darwin function
MinSqTree changes negative branch
lengths to very short (0.2 PAM) positive lengths.
The selection of a very small value as opposed to 0 is
inconsequential in the model, but allows better displaying of the
Negative branch lengths do not arise frequently and only happen with data which is difficult to align correctly. It is not a serious problem in practice.
The sum of the residual squares ( ) after the fit gives a rough idea of the quality of the fit. A small number indicates a good fit, a large number indicates a poor fit. Since this value is normalized with respect to its variance, it has an expected value 1. Values lower than 1 indicate that the data was better behaved than expected. A value greater than 1 indicates that either we were unlucky or the alignments are suspicious. In any case a value greater than 1 decreases our confidence on the phylogenetic tree.
A note to the complexity
of these procedures is in order.
PhyloTree function requires a pair of matrices
(double precision numbers) and works in time
proportional to .
Thus we can use the entire memory available
and expect the tree construction to complete quickly.
For example with 16Mb of memory we could resolve phylogenetic
trees up to 1000 sequences.
MinSqTree has to resolve a system of
linear equations of size .
This requires about double
precision numbers in total and
MinSqTree will be able to solve only