Merge vs rebase: you really should use both

Considering how widespread usage of git has become, it is quite interesting that there is still a huge variety in how it is actually used in the development flows of different projects. There are various debates on the topic, but here I am going to focus specifically on "merge vs rebase vs squash" and the general topic of integrating changes.

Despite a lot of controversy, I am convinced this is actually a reasonably solved problem and there is a uniform approach that addresses the majority of use cases - and that it is somewhat different from what services like Bitbucket or GitHub encourage.

This article assumes some pre-existing knowledge of how merge and rebase work, as well as general git proficiency. If this is not the case, the already linked Atlassian article explains it with plenty of easy-to-follow visuals.

Goals

In the context of organizing a collaborative development effort I find the pull request model to be the most productive for anything but really huge projects. It revolves around a simple social flow - propose some code changes, discuss them with maintainers, adjust the proposal as necessary, get it integrated with the main code base. There can be various other choices to make (for example, what is the target branch to integrate changes into or who is responsible for doing it) but the essential process remains the same - propose, adjust, integrate.

With that in mind, I think the following goals are most important:

  • It should be easy to review proposed changes.
  • Once integrated, it has to be clear why a specific change was introduced.
  • Upon any regression it needs to be trivial to git bisect to the exact point where it was introduced.
  • It must not screw up your CI system.

I will get back to this list later when considering various existing git flows.

Note: in the text below I use the term "pull request" for a branch with a proposed set of changes, as GitHub remains dominant right now. But the same reasoning also applies to any feature branch developed by a single contributor.

Available tools

In general, any time you have to integrate changes between two git branches, there are 3 main options for how it can be handled: merge, rebase and squash+pick.

Merge

  • Only way in git to group multiple commits together.

    Squashing everything into one commit can of course also be considered a form of grouping, but that doesn't work if you want to preserve individual chunks of changes for any reason. If that is desired, merge has to be used somehow no matter what.

  • Does not affect existing commits.

    Cases like integrating two public release branches or picking a branch with GPG signed commits while preserving signatures - these simply can't be done without a merge.

    Even if there are conflicts between branches, merge will resolve them as part of the resulting merge commit, keeping the original commits untouched. This matters whenever the original commits have to be preserved.

  • Resolves differences between two branches by putting all required changes into actual merge commit.

    But the very same property of separating conflict resolution from actual changes makes it really problematic to track down regressions through history. If each proposed commit had one small conflict and they all get resolved as part of a single merge commit - how can you possibly git bisect after that point? The real patch ends up split between the commit in the original merged branch and the additional conflict resolution (of arbitrary complexity) in the merge commit.

  • Results in non-linear history.

    This seems to be a minor problem when the number of merges is small but easily gets out of hand when combined with the previous problem and merges are used casually. The nightmare case for me is when merges are used to pick changes from master into a feature branch - and each time it happens there are some conflicts to resolve. The simple question of "how/why was this change introduced" becomes an epic quest of navigating through the maze of merge commit relations.
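
The two properties above (original commits stay untouched, conflict resolution lives in the merge commit) can be sketched in a throwaway repository. This is only an illustration: the branch name, file name and commit messages are made up for the demo.

```shell
set -e
# Throwaway repository; all names below are hypothetical.
repo=$(mktemp -d) && cd "$repo"
git init -q . && git symbolic-ref HEAD refs/heads/master
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

echo base > file.txt && git add file.txt && git commit -q -m "Base"

# The feature branch and master both edit the same line.
git checkout -q -b feature
echo feature > file.txt && git commit -q -am "Feature change"
feature_tip=$(git rev-parse HEAD)

git checkout -q master
echo master > file.txt && git commit -q -am "Master change"

# The merge stops on a conflict; the resolution is recorded in the
# merge commit itself, not in any of the original commits.
git merge --no-ff feature >/dev/null 2>&1 || true
echo "master+feature" > file.txt
git add file.txt && git commit -q -m "Merge branch 'feature'"

# The merged branch still points at the very same commit ...
feature_after=$(git rev-parse feature)
# ... and the merge commit has two parents (output is: hash parent parent).
parent_count=$(( $(git rev-list --parents -n 1 HEAD | wc -w) - 1 ))
echo "feature unchanged: $feature_tip == $feature_after, parents: $parent_count"
```

Note that the line "master+feature" exists only in the merge commit, which is exactly the part that git bisect cannot split any further.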

Rebase

  • Rewrites history

    Probably the most important thing to remember about rebase is that it doesn't really integrate changes between branches. Instead it creates a brand new commit history with the actual changeset being similar to the original one. Hashes will change, anyone who has fetched this branch before will need to hard reset it, signing becomes pointless.

  • Resolves conflicts exactly where they appear

    Natural consequence of re-creating integrated branch from a latest state of a base one is that whenever some conflict appears, you will notice it at the point of commit which introduces it - and can resolve it right away. With rebases separate conflict resolution simply ceases to exist - you integrate commit history as if it was written in a compatible manner right from the very start.

  • Doesn't actually imply any branch integration

    When developers talk about a rebase-based git flow, they often mean the "rebase + fast-forward" model where you essentially replace HEAD of the base branch with HEAD of the integrated branch because the latter is a full superset of the former. However this is not the only option - rebase itself is not a branch integration and can be used with the very same merges or even squashes/cherry-picks.
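
A minimal sketch of what "rewrites history" means in practice, again in a throwaway repository with made-up names: after the rebase the commit hash is different, while the patch it carries is the same (compared here with git patch-id).

```shell
set -e
# Throwaway repository; all names below are hypothetical.
repo=$(mktemp -d) && cd "$repo"
git init -q . && git symbolic-ref HEAD refs/heads/master
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

echo base > a.txt && git add a.txt && git commit -q -m "Base"

git checkout -q -b feature
echo feature > b.txt && git add b.txt && git commit -q -m "Add b.txt"
old_tip=$(git rev-parse HEAD)

# master moves on while the feature branch is in review.
git checkout -q master
echo more > c.txt && git add c.txt && git commit -q -m "Add c.txt"

# Re-create the feature commit on top of the latest master.
git checkout -q feature
git rebase -q master
new_tip=$(git rev-parse HEAD)

# Different commit, same patch.
old_patch=$(git show "$old_tip" | git patch-id | cut -d' ' -f1)
new_patch=$(git show "$new_tip" | git patch-id | cut -d' ' -f1)
echo "old: $old_tip"
echo "new: $new_tip"
```

After this, master is an ancestor of feature, so a fast-forward, a --no-ff merge or a squash are all still possible; the rebase itself has not integrated anything.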

Squash

  • Subset of rebase

    Squashing is not something distinct in git - it is simply one of the operations available during rebase. The reason why it is often mentioned as a separate option is that squashing all of the changes into a single commit as part of a rebase removes the notion of a branch completely from the problem - it is reduced to integrating just that one commit, which is not much different from creating one and pushing it by hand.

  • Simplistic

    Commit-based integration of changes is how git was originally designed to work and how things still work in Linux kernel for individual patch contributions. And if you want to mirror that model, actual development process becomes surprisingly simple too - for any particular change you just need to create one giant patch and after it gets accepted this stops being your concern. Of course Linux kernel doesn't use squashing (or pull requests in GitHub sense) but setting process difference aside it all comes down to a case where individual contribution is always just one commit.
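
For illustration, one non-interactive way to get the "everything in one commit" result is git merge --squash, which stages the combined changes without recording a merge. The sketch below uses a throwaway repository and made-up names, and is not the only way to squash.

```shell
set -e
# Throwaway repository; all names below are hypothetical.
repo=$(mktemp -d) && cd "$repo"
git init -q . && git symbolic-ref HEAD refs/heads/master
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

echo base > base.txt && git add base.txt && git commit -q -m "Base"

git checkout -q -b feature
echo one > one.txt && git add one.txt && git commit -q -m "Step 1"
echo two > two.txt && git add two.txt && git commit -q -m "Step 2"

# Stage the combined changes of the branch, then commit them as one commit.
git checkout -q master
git merge -q --squash feature
git commit -q -m "Add feature (squashed)"

commits_on_master=$(git rev-list --count master)
parent_count=$(( $(git rev-list --parents -n 1 HEAD | wc -w) - 1 ))
echo "commits on master: $commits_on_master, parents of tip: $parent_count"
```

The resulting commit has a single parent and the same tree as the feature branch tip, but "Step 1" and "Step 2" are no longer part of master's history.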

To better understand why I felt like this note was worth writing at all, I will go through some popular git models and explain why those don't fit me. The very same list of goals will be used as checklist:

  • easy to review proposed changes
  • easy to track changes in history
  • can pinpoint regressions with git bisect
  • works good with CI

GitHub default

Used to be the only model supported by the web UI and remains the most popular one as far as I can see. Comes down to using merges for pretty much everything - accepting pull requests, updating pull request state from master, integrating branches in the blessed repo.

  • easy to review proposed changes - NO

    The very moment pull request branch receives a merge from a master branch, reasoning about it becomes much harder as commits added before that point don't directly follow into commits coming after that point - and, depending on amount of conflicts, difference can be staggering.

  • easy to track changes in history - NO

    Same problem - just a tiny bit of bi-directional merges is enough to turn it into convoluted mess.

  • can pinpoint regressions with git bisect - MOSTLY

    Because all integrated changes come through a specific merge commit it becomes trivial to bisect through history of integrations - descending into history of individual commits after problematic integration was found. After that it can become harder and if regression comes from conflict resolution you are doomed, but that is manageable scope to investigate overall.

  • works good with CI - YES

    The commit tested by CI tends to be exactly the commit that lands on the base branch, and when you bisect the base branch you can be sure that you will only meet green commits until you find a problematic merge point.

Rebase + fast-forward

Started by a secret society "Against Non-Linear History", this model has become popular enough to become directly supported by GitHub too, though it is still somewhat rare to see. Uses rebases to rebuild pull request branch on top of latest base HEAD and fast-forward merge into base branch once accepted (with no merge commit).

  • easy to review proposed changes - YES

    I don't think one can possibly get a more pleasant reviewing experience than this. All commits have to derive directly from a recent master state, adding new changes in a strictly linear fashion. The hardest part is to convince contributors to follow the rules.

  • easy to track changes in history - MOSTLY

    History itself also becomes more linear and easy to read through. But any original grouping of commits is lost after integration and that means features (contrary to individual changes) can't be listed easily.

  • can pinpoint regressions with git bisect - MAYBE

    See the next point. Will work like a charm if every single commit is tested, will make bisection plain impossible otherwise.

  • works good with CI - NO

    Most commonly CI systems only test pull request head commits and/or potential merge commits. With fast-forward based approach only way to ensure all commits in blessed repository build and pass tests is to test every single commit in the pull request - something that is both hard to configure and can easily destroy CI system with an extra load.
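
To make the cost concrete, here is a rough sketch of what "test every single commit in the pull request" amounts to. The check.sh script is a hypothetical stand-in for a real build and test command, and the repository is a throwaway one.

```shell
set -e
# Throwaway repository; check.sh is a hypothetical stand-in for a test suite.
repo=$(mktemp -d) && cd "$repo"
git init -q . && git symbolic-ref HEAD refs/heads/master
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

echo 'exit 0' > check.sh && git add check.sh && git commit -q -m "Base"

git checkout -q -b feature
echo 'exit 1' > check.sh && git commit -q -am "Break the check"
echo 'exit 0' > check.sh && git commit -q -am "Fix the check"

# The branch head is green, so a CI that only tests the head would pass.
sh check.sh

# Test every commit in master..feature, oldest first.
tested=0
bad_subject=""
for commit in $(git rev-list --reverse master..feature); do
    git checkout -q --detach "$commit"
    tested=$((tested + 1))
    if ! sh check.sh; then
        bad_subject=$(git log -1 --format=%s "$commit")
        echo "check failed at $commit: $bad_subject"
    fi
done
git checkout -q feature
```

git rebase --exec can run a command after each replayed commit and serves a similar purpose locally. Either way the CI load grows with the number of commits in the pull request rather than with the number of pull requests.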

Squash + pick

Also directly supported by GitHub these days. Comes down to "do whatever you want in pull request but upon integration it will all be squashed into a single commit".

  • easy to review proposed changes - MAYBE

    It depends on the presence of any additional requirements on how commits in pull requests are structured. If those use merges casually, it is no different from the merge model mentioned above. If those are small and linear, it can be as good as in the rebase model.

  • easy to track changes in history - NO

    This is probably the biggest problem of the squash-based model. After integrating a pull request, any internal commit history is just plain lost - not a big deal for a small bug fix, really painful for a non-trivial feature branch. Developers try to address that by making individual pull requests as small as possible (i.e. all formatting changes as separate changes, no refactorings as part of a new feature and so on) but this doesn't match well with the actual development flow in my experience.

  • can pinpoint regressions with git bisect - MOSTLY

    Finding wrong commit becomes trivial because every change is an individual commit. Finding problematic change inside that one -1000 +1000 diff is a very different story though.

  • works good with CI - YES

    Not that different from a merge model from this PoV.

Can you have it all?

I used to be in the "rebase + fast-forward" camp but encountering some of the issues listed above in practice made me think it all over again. And after some experimenting I ended up with something that has not disappointed me so far.

Basic principle

The idea is very simple - do everything like in the rebase model, but integrate pull requests strictly using git merge --no-ff, and only when a fast-forward merge would have been possible anyway (i.e. the pull request is already rebased on top of the base branch). The title/description of that merge commit is modified to match the pull request. That way merge commits are only used for grouping and "real" history remains linear - something like folding in a text editor.
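
A minimal sketch of that procedure with plain git commands. The merge_pr helper, the branch names and the PR title are hypothetical and only serve the demo; a real setup would take the title and description from the hosting platform.

```shell
set -e
# Throwaway repository; branch names, file names and the PR title are made up.
repo=$(mktemp -d) && cd "$repo"
git init -q . && git symbolic-ref HEAD refs/heads/master
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# Hypothetical helper: merge BRANCH into the current branch with a merge
# commit titled TITLE, but only if a plain fast-forward would also work.
merge_pr() {
    branch=$1
    title=$2
    if ! git merge-base --is-ancestor HEAD "$branch"; then
        echo "refusing to merge: rebase $branch first" >&2
        return 1
    fi
    git merge -q --no-ff -m "$title" "$branch"
}

echo base > base.txt && git add base.txt && git commit -q -m "Base"

git checkout -q -b feature
echo one > one.txt && git add one.txt && git commit -q -m "Add one.txt"
echo two > two.txt && git add two.txt && git commit -q -m "Add two.txt"

git checkout -q master
echo other > other.txt && git add other.txt && git commit -q -m "Unrelated change"

# feature is stale now, so the helper refuses.
stale=no
merge_pr feature "PR #1: Add feature" || stale=yes

# Rebase the branch, then integrate it with an explicit merge commit.
git rebase -q master feature
git checkout -q master
merge_pr feature "PR #1: Add feature"

# The graph shows a linear run of commits wrapped in one merge commit.
git log --graph --oneline master
```

Because the branch is always rebased first, the merge commit never contains any conflict resolution of its own; it only carries the title and the grouping.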

Checklist:

  • easy to review proposed changes - YES

    At the point of reviewing it is not different from regular rebase model and thus has all the same benefits.

  • easy to track changes in history - YES

    Both linear history and PR-based grouping are present.

  • can pinpoint regressions with git bisect - YES

    Worst case is if problematic merge commit is found but individual commits inside it do not build making further bisection impossible - which is roughly the same as with merge or squash model. But the fact that those commits are linear and do not contain any recursive merges makes further manual search much more feasible.

  • works good with CI - YES

    Not that different from a merge model here.

Additional benefits

One extra thing I like about this approach is that one can get higher level overview of a branch by simply listing merge commits:

$ git log --merges
commit 8823efc9dec20254307a9d797b7641bf764fd74f (HEAD -> master)
Merge: 729632b 462f8eb
Author: Mihails Strasuns
Date:   Thu Apr 5 18:53:58 2018 +0100

    PR #1042: Fix bad bug

    Description

commit 729632b9bc1190513540a39d1142d21d558bb3df
Merge: cdb2131 03bf8f2
Author: Mihails Strasuns
Date:   Thu Apr 5 18:50:30 2018 +0100

    PR #1013: Implement feature 1

    Some abstract explanation

Add a tiny bit of post-processing and this can literally become your release changelog. With the classical GitHub model this won't work because there will be a whole lot of "Merge branch 'master' into 'xxx'" commits you don't care about. And with the classical rebase model there will be a lot of really fine-grained commits like "Fixed formatting in module aaa" without any higher level reasoning.
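
As a sketch of what such post-processing could look like, the snippet below builds a small history with two --no-ff merges (reusing the PR titles from the log above) and prints one changelog line per merge commit. The land helper and the branch/file names are hypothetical.

```shell
set -e
# Throwaway repository; the land helper and branch/file names are made up.
repo=$(mktemp -d) && cd "$repo"
git init -q . && git symbolic-ref HEAD refs/heads/master
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

echo base > base.txt && git add base.txt && git commit -q -m "Base"

# land <branch> <file> <title>: one-commit branch merged with --no-ff.
land() {
    git checkout -q -b "$1" master
    echo "$1" > "$2" && git add "$2" && git commit -q -m "Work on $1"
    git checkout -q master
    git merge -q --no-ff -m "$3" "$1"
}

land feature-1 f1.txt "PR #1013: Implement feature 1"
land bugfix bug.txt "PR #1042: Fix bad bug"

# One line per integrated pull request, newest first.
changelog=$(git log --merges --first-parent --format='- %s' master)
echo "$changelog"
```

The --first-parent option restricts the walk to the base branch itself, so only the merge commits created at integration time are listed.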

Summary

Biggest pain for me so far is that there is no existing tooling to easily support this model. Closest you can get is semi-linear merge feature in GitLab - "merge commit is created for every merge, but the branch is only merged if a fast-forward merge is possible". That sounds exactly like my proposal but last time I tried it, there were several downsides:

  • Only available in Enterprise Edition
  • Still creates that useless "Merge branch aaa into bbb" commit message unless you manually override it each time
  • Will only literally allow merging if fast-forward is possible right here and now - meaning that in repository with high activity pull requests have to be updated all the time to be kept merge-ready. I wish it worked similar to "Rebase + Merge" button in GitHub.

It becomes better if you write your own merge command-line tool which works with the hosting platform API and implements any custom procedure needed - the company I work for also did it previously for the rebase model. I don't have anything worth publishing right now but I hope to get to it at some point.

Still this is quite far from having that kind of experience easily available right via hosting system.