Thursday, October 15, 2015

Fun with recreating an evil merge

Sometimes we wish there were good ways to recreate a complex merge, replaying a previously resolved conflict resolution, and reapplying a previously done evil merge, of a side branch to an updated mainline.

For example, you have a side-branch that consists of two commits, A and B, and you create a merge with the mainline that ends with X, like so:

           A---B
          /     \
  ---o---O---X---M

resulting in a new merge commit M. When you created this merge, it could be that changes A and B overlapped (either textually or semantically) with what was done on the mainline since the side branch forked, i.e. what was done by X. Such an overlap, if it is textual, would result in a merge conflict. Perhaps X added a new line at the same place A and/or B added a different line, resulting in something like:

...
original line 1
<<<<<<< HEAD
line added by X
||||||| O (common ancestor)
=======
line added by A
>>>>>>> B
original line 2
...

which you may resolve, when recording M, to:

...
original line 1
line added by A
line added by X
original line 2
...

Expressed in "git show --cc" format, such a merge result would appear this way:

  ...
  original line 1
 +line added by A
+ line added by X
  original line 2
  ...

A line with two leading spaces are common lines that both branches agree with, a line with plus at the first column is from the mainline and a line with plus at the second column is from the side branch.

If the overlap were not just textual but semantic, you may have to further update parts of files that did not textually conflict. For example, X may have renamed an existing function F to newF, while A or B added new callsites of F. Such a change is not likely to overlap textually, but in the merge result M, you would need to change the new calls you added to F to instead call newF. Such a change may look like this:

  ...
  original line 1
 +line added by A
+ line added by X
  original line 2
  ...
 -a new call to F() added by A
++a new call to newF() added by A
  ...

A line with minus at the second column is what was only in the side branch but that does not appear in the result (i.e. the side branch added the line, but the result does not have it). A line with two pluses at the beginning is what appears in the result but does not exist in either branch.

A merge that introduces such a line that did not exist in either branch is called an evil merge. It is something that no automated textual merge algorithm would have produced.

Now, while you were working on producing the merge M, the mainline may have progressed and gained a new commit Y. You would like to somehow take advantage of what you have already done when you created M to merge your side branch to the updated mainline to produce N:

           .---A---B
          /         \
  ---o---O---X---Y---N

A good news is that, when the evil merge is in a file that also has textual conflicts to resolve, "git rerere" will automatically take care of this situation. All you need to do is to set the configuration rerere.enabled to true before attempting the merge between X and B and recording their merge M, and then attempt a new merge between B and Y. Without even having to type "git rerere", the mechanism is invoked by "git merge" to replay the recorded resolution (which is where the name of the machinery "rerere" comes from). A bad news is that when an evil merge has to be made to a file that is not involved in any textual conflict (i.e. imagine the case where we didn't have "line added by A" vs "line added by X" conflict earlier in the same file in the above example), "rerere" does not even kick in. The question is what to do, knowing B, X, and M, to recreate N while keeping the adjustment needed for semantic conflicts to record M.

One naive approach would be to take a difference between X and M and apply it to Y. In the previous example, X would have looked like:

...
original line 1
line added by X
original line 2
...

and the difference between X and M would be (1) addition of "line added by A", (2) addition of "a new call to newF() added by A", and (3) any other change made by A and B that did not overlap with what X did. Implementation-wise, it is unlikely that we would do this as a "diff | patch" pipeline; most likely we would do it as a three-way merge, i.e.

$ git checkout Y^0
$ git merge-recursive X HEAD M

to compute the state we would have obtain by making the same move as going from X to M starting at Y, using the index and the working tree.

While that approach would work in simple case where Y does not do anything interesting, it would not work well in general. The most obvious case is when Y is actually a merge between X and A:

           .---A---B
          /     \   \
  ---o---O---X---Y---N

The difference between X and M would contain all that was done by A and B, in addition to what was done at M to adjust for textual and semantic conflicts. Replaying that on top of Y, which already contains what was done by A but not B, would end up duplicating what A did. At best, we will get a huge and uninteresting merge conflict. At worst, we will get the same code silently duplicated twice.

I think the right approach to recreate the (potentially evil) merge M is to consider M as two steps.
The first step is to merge X and B mechanically, and make a tree out of the mechanical merge result, with conflict markers and all. Call it T. The difference between T and M is what the person who made M did to adjust for textual and semantic conflicts.

           A---B
          /     \
  ---o---O---X---T-M

Then, you can think of the process of recreating N in a way similar to M was made as a similar two step process. The first step is to merge Y and B mechanically, and create a tree out of the mechanical merge result, and call it S. Applying the difference between T and M on top of S would give you the textual and semantic adjustments the same way "git rerere" replays the recorded resolution.

           .---A---B
          /    (\)  \
  ---o---O---X---Y---S-N

This should work better whether Y is a merge with A.

$ git checkout X^0
$ git merge --no-commit B
$ git add -u
$ T=$(git write-tree)
$ git reset --hard Y^0
$ git merge --no-commit B
$ git add -u
$ S=$(git commit-tree $(git write-tree) -p HEAD -m S)
$ git checkout $S
$ git merge-recursive $T HEAD M

would compute the result using the index and the working tree, so after eyeballing the result and making sure it makes sense, the above can be concluded with a

$ git commit --amend

Of course, this article is only about outlining the idea. If this proves to be a viable approach, it would make sense to do these procedures inside "rebase --first-parent" or something.