module LCS

How Diff Works (by Mark-Jason Dominus)

I once read an article written by the authors of diff; they said that they hard worked very hard on the algorithm until they found the right one.

I think what they ended up using (and I hope someone will correct me, because I am not very confident about this) was the ‘longest common subsequence’ method. In the LCS problem, you have two sequences of items:

a b c d f g h j q z
a b c d e f g i j k r x y z

and you want to find the longest sequence of items that is present in both original sequences in the same order. That is, you want to find a new sequence S which can be obtained from the first sequence by deleting some items, and from the second sequence by deleting other items. You also want S to be as long as possible. In this case S is:

a b c d f g j z

From there it’s only a small step to get diff-like output:

e   h i   k   q r x y
+   - +   +   - + + +

This module solves the LCS problem. It also includes a canned function to generate diff-like output.

It might seem from the example above that the LCS of two sequences is always pretty obvious, but that’s not always the case, especially when the two sequences have many repeated elements. For example, consider

a x b y c z p d q
a b c a x b y c z

A naive approach might start by matching up the a and b that appear at the beginning of each sequence, like this:

a x b y c         z p d q
a   b   c a b y c z

This finds the common subsequence +a b c z+. But actually, the LCS is +a x b y c z+:

      a x b y c z p d q
a b c a x b y c z