Wednesday, November 27, 2013

Git 1.8.5

The latest release, Git 1.8.5, is out. Among many incremental improvements, there are a handful of changes that are worth mentioning:

  • Magic pathspecs like ":(icase)makefile" (matches both Makefile and makefile) and ":(glob)foo/**/bar" (matches "bar" in "foo" and any subdirectory of "foo") can be used in more places.
  • The "http.*" configuration variables can now be specified for individual URLs. E.g.

    [http]
        sslVerify = true
    [http "https://weak.example.com/"]
        sslVerify = false

    would turn on http.sslVerify for everybody, except when talking with the specified URL.
  • "git mv A B" when moving a submodule has been taught to relocate the submodule's working tree and to adjust the paths in the .gitmodules file.
  • "git blame" can now take more than one -L option to discover the origin of multiple blocks of lines.
  • The http transport clients can optionally ask to save cookies with the http.savecookies configuration variable.
  • "git push" learned a more fine grained control over a blunt "--force" when requesting a non-fast-forward update with the "--force-with-lease=<refname>:<expected object name>" option.
  • "git diff --diff-filter=<classes of changes>" can now take lowercase letters (e.g. "--diff-filter=d") to mean "show everything but these classes".  "git diff-files -q" is now a deprecated synonym for "git diff-files --diff-filter=d".
  • "git gc" exits early without doing any work when it detects that another instance of itself is already running.
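
A few of these can be tried in a throwaway repository. What follows is only a sketch, assuming a reasonably recent Git and a POSIX shell; the file names and the identity are made up for illustration.

```shell
# Scratch repository; every file name below is hypothetical.
cd "$(mktemp -d)" && git init -q .
git config user.name "A U Thor" && git config user.email author@example.com

mkdir -p foo/sub
echo all: >Makefile
echo top >foo/bar
echo deep >foo/sub/bar
git add . && git commit -q -m initial

# ":(icase)" ignores case, so "makefile" finds the tracked "Makefile".
icase=$(git ls-files ':(icase)makefile')
echo "$icase"

# ":(glob)" lets "**/" span zero or more directory levels.
glob=$(git ls-files ':(glob)foo/**/bar')
echo "$glob"

# Delete one file and modify another, then leave deletions out of the listing.
rm foo/bar
echo changed >>Makefile
filtered=$(git diff --name-status --diff-filter=d)
echo "$filtered"
```

With this layout, the two pathspec queries should list Makefile and both bar files, and the filtered diff should mention only the modified Makefile.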

Tuesday, November 26, 2013

The Codebreakers


Every once in a while, I receive gifts from satisfied Git friends, chosen from my Amazon Wish list. And today was such a day. As I have been fairly busy cleaning up the fallout from our recent move and things are finally beginning to get less hectic, it turns out to be a perfect distraction gift for me, too ;-)


I have only read the first few sections so far (it is a big, thick book, and it would take me forever to finish reading it before writing about it and thanking the person). Thanks, MTM!

Friday, November 8, 2013

Git 1.8.4.3

The latest maintenance release Git v1.8.4.3 has been tagged and is available at the usual places (see the list of public repositories). The fixes that have already been merged to the 'master' branch for the upcoming Git v1.8.5 feature release are all there.

Here are the highlights, relative to the previous maintenance release v1.8.4.2:

  • The interaction between use of Perl in our test suite and NO_PERL has been clarified a bit.
  • A fast-import stream expresses a pathname with funny characters by quoting them in C style; remote-hg remote helper (in contrib/) forgot to unquote such a path.
  • One long-standing flaw in the pack transfer protocol used by "git clone" was that there was no way to tell the other end which branch "HEAD" points at, and the receiving end needed to guess. A new capability has been defined in the pack protocol to convey this information, so that cloning from a repository with more than one branch pointing at the same commit as HEAD now reliably sets the initial branch in the resulting repository.
  • We did not handle cases where http transport gets redirected during the authorization request (e.g. from http:// to https://).
  • "git rev-list --objects ^v1.0^ v1.0" gave v1.0 tag itself in the output, but "git rev-list --objects v1.0^..v1.0" did not.
  • The fall-back parsing of commit objects with broken author or committer lines was less robust than ideal in picking up the timestamps.
  • Bash prompting code to deal with an SVN remote as an upstream was coded in a way not supported by older Bash versions (3.x).
  • "git checkout topic", when there is not yet a local "topic" branch but there is a unique remote-tracking branch for a remote "topic" branch, pretended as if "git checkout -t -b topic remote/$r/topic" (for that unique remote $r) was run. This hack however was not implemented for "git checkout topic --".
  • Coloring around octopus merges in "log --graph" output was screwy.
  • We did not generate HTML version of documentation to "git subtree" in contrib/.
  • The synopsis section of "git unpack-objects" documentation has been clarified a bit.
  • An ancient How-To on serving Git repositories on an HTTP server lacked a warning that it has been mostly superseded with a more modern way.
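
The "which branch does HEAD point at" fix can be observed with two local repositories. This is a sketch only, assuming a Git new enough to have the capability on both ends; "src", "dst" and the branch name "maint" are hypothetical.

```shell
# Scratch area; repository and branch names are made up.
cd "$(mktemp -d)"
git init -q src
git -C src config user.name "A U Thor"
git -C src config user.email author@example.com
git -C src commit -q --allow-empty -m initial

# Create a second branch at the same commit and leave HEAD on it, so that
# two branches point at the commit HEAD is at.
git -C src checkout -q -b maint

# The clone is told what HEAD points at instead of guessing between the two.
git clone -q src dst
cloned_head=$(git -C dst symbolic-ref --short HEAD)
echo "$cloned_head"
```

The clone should end up on "maint", the branch that was checked out in the source repository, even though the initial branch points at the very same commit.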


Wednesday, October 30, 2013

v1.8.5-rc0: An early preview of the upcoming release

There are many little changes everywhere.  All of the fixes that have already gone into the 1.8.4.2 maintenance release are also in this preview.

Foreign interfaces, subsystems and ports.

  • "git-svn" used with SVN 1.8.0 when talking over https:// connection dumped core due to a bug in the serf library that SVN uses.  We work around it on our side, even though the SVN side is being fixed.
  • On MacOS X, we detected if the filesystem needs the "pre-composed unicode strings" workaround, but did not automatically enable it.  Now we do.
  • remote-hg remote helper misbehaved when interacting with a local Hg repository relative to the home directory, e.g. "clone hg::~/there".
  • imap-send ported to OS X uses Apple's security framework instead of the OpenSSL one.
  • Subversion 1.8.0 that was recently released breaks older subversion clients coming over http/https in various ways.
  • "git fast-import" treats an empty path given to "ls" as the root of the tree.

UI, Workflows & Features

  • "git grep" and "git show" pay attention to the "--textconv" option when these commands are told to operate on blob objects (e.g. "git grep -e pattern HEAD:Makefile").
  • "git replace" helper no longer allows an object to be replaced with another object of a different type to avoid confusion (you can still manually craft such replacement using "git update-ref", as an escape hatch).
  • "git status" no longer prints dirty status information for submodules for which submodule.$name.ignore is set to "all".
  • "git rebase -i" honours core.abbrev when preparing the insn sheet for editing.
  • "git status" during a cherry-pick shows what original commit is being picked.
  • Instead of typing four capital letters "HEAD", you can say "@" now, e.g. "git log @".
  • "git check-ignore" follows the same rule as "git add" and "git status" in that the ignore/exclude mechanism does not take effect on paths that are already tracked.  With "--no-index" option, it can be used to diagnose which paths that should have been ignored have been mistakenly added to the index.
  • Some irrelevant "advice" messages that are shared with "git status" output have been removed from the commit log template.
  • "update-refs" learnt a "--stdin" option to read multiple update requests and perform them in an all-or-none fashion.
  • Just like "make -C <directory>", "git -C <directory> ..." tells Git to go there before doing anything else.
  • Just like "git checkout -" knows to check out and "git merge -" knows to merge the branch you were previously on, "git cherry-pick" now understands "git cherry-pick -" to pick from the previous branch.
  • "git status" now omits the prefix to make its output a comment in a commit log editor, which is not necessary for human consumption.  Scripts that parse the output of "git status" are advised to use "git status --porcelain" instead, as its format is stable and easier to parse.
  • "foo^{tag}" now peels a tag to itself (i.e. it is a no-op) and fails if "foo" is not a tag.  "git rev-parse --verify v1.0^{tag}" is a more convenient way to say "test $(git cat-file -t v1.0) = tag".
  • "git branch -v -v" (and "git status") did not distinguish among a branch that does not build on any other branch, a branch that is in sync with the branch it builds on, and a branch that is configured to build on some other branch that no longer exists.
  • A packfile that stores the same object more than once is broken and will be rejected by "git index-pack" that is run when receiving data over the wire.
  • Earlier we started rejecting an attempt to add 0{40} object name to the index and to tree objects, but it sometimes is necessary to allow so to be able to use tools like filter-branch to correct such broken tree objects.  "filter-branch" can again be used to do so.
  • "git config" did not provide a way to set or access numbers larger than a native "int" on the platform; it now provides 64-bit signed integers on all platforms.
  • "git pull --rebase" always chose to do the bog-standard flattening rebase.  You can tell it to run "rebase --preserve-merges" by setting "pull.rebase" configuration to "preserve".
  • "git push --no-thin" actually disables the "thin pack transfer" optimization.
  • Magic pathspecs like ":(icase)makefile" that matches both Makefile and makefile can be used in more places.
  • The "http.*" variables can now be specified per URL that the configuration applies.  For example,

       [http]
           sslVerify = true
       [http "https://weak.example.com/"]
           sslVerify = false

    would flip http.sslVerify off only when talking to that specified site.
  • "git mv A B" when moving a submodule A has been taught to relocate its working tree and to adjust the paths in the .gitmodules file.
  • "git blame" can now take more than one -L option to discover the origin of multiple blocks of the lines.
  • The http transport clients can optionally ask to save cookies with http.savecookies configuration variable.
  • "git push" learned a more fine grained control over a blunt "--force" when requesting a non-fast-forward update with the "--force-with-lease=<refname>:<expected object name>" option.
  • "git diff --diff-filter=<classes of changes>" can now take lowercase letters (e.g. "--diff-filter=d") to mean "show everything but these classes".  "git diff-files -q" is now a deprecated synonym for "git diff-files --diff-filter=d".
  • "git fetch" (hence "git pull" as well) learned to check "fetch.prune" and "remote.*.prune" configuration variables and to behave as if the "--prune" command line option was given.
  • "git check-ignore -z" applied the NUL termination to both its input (with --stdin) and its output, but "git check-attr -z" ignored the option on the output side. Make both honor -z on the input and output side the same way.
  • "git whatchanged" may still be used by old timers, but mention of it in documents meant for new users will only waste readers' time wondering what the difference is between it and "git log".  Make it less prominent in the general part of the documentation and explain that it is merely a "git log" with different default behaviour in its own document.
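
Several of the items above are easy to poke at from a shell. The sketch below assumes a recent Git; the repository name "work", the tag names and the example.com URLs are hypothetical, and "--get-urlmatch" is the git-config mode used here to look up per-URL values.

```shell
cd "$(mktemp -d)"
git init -q work

# "git -C <directory>" runs the command as if it had been started there.
git -C work config user.name "A U Thor"
git -C work config user.email author@example.com
git -C work commit -q --allow-empty -m initial
git -C work tag -a -m "first release" v1.0   # annotated tag object
git -C work tag light                        # lightweight tag, points at a commit

# "@" is a shorthand for HEAD.
at_ok=no
[ "$(git -C work rev-parse @)" = "$(git -C work rev-parse HEAD)" ] && at_ok=yes

# "^{tag}" succeeds only when the name refers to a tag object.
if git -C work rev-parse -q --verify 'v1.0^{tag}' >/dev/null 2>&1
then annotated=yes
else annotated=no
fi
if git -C work rev-parse -q --verify 'light^{tag}' >/dev/null 2>&1
then lightweight=yes
else lightweight=no
fi

# Per-URL "http.*" settings: the most specific matching URL wins.
git -C work config http.sslVerify true
git -C work config 'http.https://weak.example.com/.sslVerify' false
weak=$(git -C work config --get-urlmatch http.sslVerify https://weak.example.com/repo.git)
other=$(git -C work config --get-urlmatch http.sslVerify https://strong.example.com/repo.git)

echo "@=$at_ok annotated=$annotated lightweight=$lightweight weak=$weak other=$other"
```

The lightweight tag is expected to fail the "^{tag}" check because it names a commit directly, and only the weak.example.com URL should pick up the "false" setting.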

Performance, Internal Implementation, etc.


  • The HTTP transport will try to use TCP keepalive when able.
  • "git repack" is now written in C.
  • Build procedure for MSVC has been updated.
  • If a build-time fallback is set to "cat" instead of "less", we should apply the same "no subprocess or pipe" optimization as we apply to user-supplied GIT_PAGER=cat.
  • Many commands use a --dashed-option as an operation mode selector (e.g. "git tag --delete"), of which the user can use at most one (e.g. "git tag --delete --verify" is nonsense) and which cannot be negated (e.g. "git tag --no-delete" is nonsense).  The parse-options API learned a new OPT_CMDMODE macro to make it easier to implement such a set of options.
  • OPT_BOOLEAN() in parse-options API was misdesigned to be "counting up" but many subcommands expect it to behave as "on/off". Update them to use OPT_BOOL() which is a proper boolean.
  • "git gc" exits early without doing a double-work when it detects that another instance of itself is already running.
  • Under memory pressure and/or file descriptor pressure, we used to close pack windows that are not used and also close file handles to open but unused packfiles. These are now controlled separately to better cope with the load.


Wednesday, September 18, 2013

Fun with first parent history

If your history is cleanly maintained, the output from "git log --first-parent" will consist only of merges of completed topics and trivially correct updates made directly on top of it. It will give you a bird's-eye view that shows what features and fixes were made during a given period without going into too much detail. A history in which each merge brings in the work done for a specific topic (theme, purpose, objective; use whatever word you prefer) means that whoever made these merges is acting as the integrator, the keeper of the main history. The first-parent view of the history is useful only when the keeper of the main history takes good care of the main history.

People who use the central repository workflow, where a single repository is used for everybody to fetch from and push to, complain that the "git pull" they do merges the history taken from the central repository into their own development history, so the merge is made in the wrong direction. They often wish for an option to flip the order of parents around for this reason, but they do not realize that a first-parent-clean history needs a lot more than that.

When you are using the "central shared repository" workflow, if you had and used such an option to flip the heads of a merge to record what you have done so far as a side branch of what everybody else did, the first-parent view would make a bit more sense than what you currently get. For example, if you worked on a specific topic that required six individual commits to complete since you forked from the mainline, your history in your repository and the project's main history in the central repository may look like this:

     x---x---x---x---x---x     Your history
    /
---X---o---o---o---o---o       Project's history

If you try to "git push" at this point, it will stop you, lest you lose these commits represented with o by overwriting the history. Git will tell you to first integrate the project's history with yours with "git pull", but if you actually pull to merge, the commits x will form the first-parent chain of the resulting merge, and the sequence of commits (most likely, merges of topics unrelated to each other) o will appear as its side branch:

     x---x---x---x---x---x---M     Your history
    /                       /
---X---o---o---o---o---o----       Project's history

This is bad, and "flip the order of parents" may help to produce a history of this shape instead:

     x---x---x---x---x---x     Your history
    /                     \
---X---o---o---o---o---o---M   Project's history

However, there is another half of the problem that is not solved by such an option. People, especially those who work with the centralized workflow, tend to pull too often, just to catch up. Even with such a "flip the order of parents" option, what they would end up with in reality would often look more like this:

     x---x   x---x---x   x     Your history
    /     \ /         \ / \
---X---o---M---o---o---M---M   Project's history

The result fragments an otherwise logical and clean "single strand of pearls to fully address the issue, consisting of 6 commits" into three separate and seemingly unrelated pieces. Imagine that other people are working the same way, and that the commits marked with o are merges of side branches that add their half-way work to the main history, similar to what happened in the illustration above. You would get this history:

     x---x   x---x---x   x     Your history
    /     \ /         \ / \
---X---M---M---M---M---M---M   Project's history
      / \     / \ /
  ---y   y---y   y             Your colleague's history

Now, in "git log --first-parent" of the project's mainline history, there is nothing that links these six commits marked with x together and differentiates them from commits marked with y, and there is nothing that groups the merges M that pull in your disjoint steps toward a single goal and separates them from other merges. Unless people stop doing too many "pull"s that are used only to "catch up", even with the "flip the parents of a merge" option, you will not get a history that yields a good first-parent view.

As I wrote in an earlier entry (Fun with various workflows), when you "pull" and then "push" to the central repository, you are playing the role of the integrator, the keeper of the main history, and you are responsible for taking good care of it yourself. If you make a 2+3+1=6 mess as depicted in the last illustration above, you are failing to do so. People who later read "git log --first-parent" would not be able to see that these six commits you did were to achieve a single coherent goal and should be read together to understand it.

One obvious way to solve it is to use a topic branch workflow: you do your "git pull" from the shared repository while you are on your 'master', which is kept free of your 'x's until that 6-commit series is complete and ready. Then you locally merge that topic branch to your 'master' and push it back for everybody to see, which will give you the third picture in this message.

Incidentally, by doing so, you do not need the "flip the order of parents" option, either.
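
The topic branch workflow described above can be sketched in a scratch repository. This is only an illustration: the branch name "topic" and the commit subjects are made up, empty commits stand in for real work, and an ordinary commit on 'master' stands in for what "git pull" would bring in from the shared repository.

```shell
cd "$(mktemp -d)" && git init -q .
git symbolic-ref HEAD refs/heads/master   # pin the initial branch name for the sketch
git config user.name "A U Thor" && git config user.email author@example.com
git commit -q --allow-empty -m "X: fork point"

# Build the whole six-commit series on a topic branch, with no "catch up" merges.
git checkout -q -b topic
for i in 1 2 3 4 5 6
do
    git commit -q --allow-empty -m "x$i: step $i of the topic"
done

# Meanwhile the mainline advances (stand-in for "git pull" while on 'master').
git checkout -q master
git commit -q --allow-empty -m "o: somebody else's work"

# Merge the completed topic once; the mainline stays the first parent.
git merge -q --no-ff -m "Merge branch 'topic'" topic

first_parent=$(git log --first-parent --format=%s)
echo "$first_parent"
```

The first-parent view should show just the merge, the mainline commit and the fork point. The six x commits stay reachable through the second parent of the merge, so a plain "git log" still shows them as one group.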

Friday, August 23, 2013

Git 1.8.4

The 1.8.4 release has finally been tagged and pushed out to the usual places. It contains 870+ changes from ~100 contributors (among which 33 people are new) since v1.8.3.

Due to regressions discovered at the last minute, two topics that have been in the master branch for a while had to be reverted. They are expected to come back after fixing the regressions in future releases.

Here are some highlights:

  • "git log" learnt the "-Lbegin,end:filename" option. This starts from the specified range and digs through the history. It may still have rough edges and memory leaks, though.
  • "git clean" learnt the interactive mode, modeled after "git add -i" interface.
  • "git check-mailmap" is a new command that lets you inquire your .mailmap file for the canonical username and e-mail address.
  • "git name-rev" learnt to name an annotated tag object name back to its tagname.
  • Various subcommands of "git submodule" now work even from a subdirectory.
  • "git submodule update" can optionally clone the submodule repositories shallowly.
  • The "push.default=simple" mode of "git push" has been updated to behave like "current" when you push to a remote that is different from where you fetch from (e.g. via remote.pushdefault), in order to better support the triangular workflow.
  • "git log" learnt the "--author-date-order" option.
  • The configuration variable color.ui defaults to "auto" now.
  • "git describe" learnt the "--first-parent" option.
  • "git fetch $remote $branch" used to avoid touching the remote-tracking branch (you could always be explicit and say "git fetch $remote $branch:refs/remotes/$remote/$branch"). The command now updates the remote-tracking branch (if configured).
  • Use of the platform fnmatch(3) function (in many places like pathspec matching, .gitignore and .gitattributes) has been replaced with wildmatch, allowing "foo/**/bar" to match "foo/bar", "foo/a/bar", etc.
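
As a small sketch of the new "git check-mailmap" command (both identities below are invented, and the .mailmap entry uses the "canonical name and address, then the address to map from" form):

```shell
cd "$(mktemp -d)" && git init -q .

# Map anything recorded with the old address to the canonical identity.
echo 'Jane Doe <jane@example.com> <jdoe@old.example.com>' >.mailmap

# Ask what a contact as it might appear in old commits maps to.
canonical=$(git check-mailmap 'J. Doe <jdoe@old.example.com>')
echo "$canonical"
```

The command should answer with the canonical "Jane Doe <jane@example.com>" regardless of how the name was spelled in the query.
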
Have fun.

Wednesday, August 14, 2013

Delaying Git 1.8.4 by a week

It appears that we need to revert two topics that cause regressions before the upcoming 1.8.4 release.

  • There is a corner case bug in git stash.  Suppose you have a path that is a regular file (or a symbolic link) in the committed state. You change it to a directory in your working tree, and have various new files in it. Some of them may be tracked, while others may not be. You issue git stash. The command needs to match the path to the committed state, hence it needs to remove the directory to resurrect the path. The new files in the directory you have git added will be in the stash so they are OK, but what happens to the untracked ones? They are killed. The same issue exists if you turned a tracked directory into a file and run the command without first running git add.
    An attempted fix was to ask git ls-files --killed to see if such a path exists that will be lost, but it turns out that this makes the command unusably slow in certain directories with very many untracked files.
  • There was an attempt to save typing four capital letters "H", "E", "A" and "D" by instead allowing you to type "@", e.g. git log @. The idea may have been a good one, but the change was executed poorly and incorrectly triggered when it shouldn't (e.g. having a branch whose name is @/foo made it into HEAD/foo or something insane).
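
The stash corner case can be set up in a scratch repository to see what "git ls-files --killed" has to say about it. This is a rough sketch with made-up file names; it only inspects the situation and does not run "git stash".

```shell
cd "$(mktemp -d)" && git init -q .
git config user.name "A U Thor" && git config user.email author@example.com

# "path" is a regular file in the committed state.
echo content >path
git add path && git commit -q -m "path is a file"

# In the working tree it becomes a directory holding an untracked file.
rm path && mkdir path
echo precious >path/untracked

# Ask which working tree files stand in the way of restoring the committed "path".
killed=$(git ls-files --killed)
echo "$killed"
```

Whatever is listed here is what would have to be removed to resurrect the committed file, which is why untracked files in such a directory are at risk.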

Because we have already passed -rc3, I'd feel safer adding another rc week before the final release. The updated Git Calendar is here.

Both of these changes meant well, and because we are not reverting them due to design mistakes (i.e. we are not saying that "we do not ever want to have such a feature or fix in our system"), hopefully these can be redone properly after the upcoming release is done.

Some leftover bits (I'll add more to this list later).

  • [DONE] Find out where ls-files --killed is unnecessarily wasting time, and fix it. This is a prerequisite to resurrect the stash corner case fix.
    Cf. $gmane/232113
  • Refactor run_hook() interface to be truly reusable by codepath other than git commit, resurrecting a "how about this" patch sent in the past.
    Cf. $gmane/192806, $gmane/212284
  • [IN PROGRESS] Extend the upload-pack protocol to tell what symbolic ref points at which other ref by resurrecting the idea outlined in 2008.
    Cf. $gmane/102039
  • [IN PROGRESS] Rethink how name-hash keeps track of names of directories and actual files to help case insensitive filesystems. Since 2092678c (name-hash.c: fix endless loop with core.ignorecase=true, 2013-02-28), there appears to be no reason why a directory name has to be registered to the hash with a trailing slash, which is the root cause why directory_exists_in_index_icase() reads past the end of the buffer.
    Cf. $gmane/232822
  • [DONE] Look into cvsserver permission bits regression between 1.8.1 and 1.8.3.
    Cf. $gmane/234476
  • Look into pathspec-limited revision traversal regression between 1.8.3 and 1.8.4.
    Cf. $gmane/234462
  • Checking out a branch X that does not have directory D (or worse, has a file D), while you are in the directory D, may want to fail.
    Cf. $gmane/234905
  • Allow extra options to "ssh" invocation made from connect.c, in a way that (ideally) does not break backward compatibility.
    Cf. $gmane/234624
  • Perhaps add a --post-service-hook to the git-daemon that can be used after a service finishes? The exit status from the service process means a totally different thing from what the user of the service perceives, because the former has to say "successfully told the requester that the request is denied", so it may not be as useful a mechanism as one would naïvely expect.
    Cf. $gmane/234706
  • git checkout $commit -- somedir should remove somedir/file that is not in $commit but is in the original index.
    Cf. $gmane/234935

Thursday, August 1, 2013

Git 1.8.4-rc1

The first release candidate, Git v1.8.4-rc1, is available for testing at the usual places.
For highlights, please refer to the previous post on v1.8.4-rc0.

Have fun.

Wednesday, July 24, 2013

Git 1.8.4-rc0

A release candidate preview Git v1.8.4-rc0 is now available for testing at the usual places.

As this cycle is a rather large update, please test this thoroughly. It contains 814 non-merge commits, from 90+ contributors (v1.8.3 consisted of 694 changes from 97 contributors).

Here are some highlights:

  • "git log" learnt the "-Lbegin,end:filename" option. This starts from the specified range and digs through the history. It may still have rough edges and memory leaks, though.
  • "git clean" learnt the interactive mode, modeled after "git add -i" interface.
  • "git check-mailmap" is a new command that lets you inquire your .mailmap file for the canonical username and e-mail address.
  • "git name-rev" learnt to name an annotated tag object name back to its tagname.
  • Various subcommands of "git submodule" now work even from a subdirectory.
  • "git submodule update" can optionally clone the submodule repositories shallowly.
  • The "push.default=simple" mode of "git push" has been updated to behave like "current" when you push to a remote that is different from where you fetch from (e.g. via remote.pushdefault), in order to better support the triangular workflow.
  • "git log" learnt the "--author-date-order" option.
  • The configuration variable color.ui defaults to "auto" now.
  • Instead of typing "HEAD", you can say "@", e.g. "git log @".
  • "git describe" learnt the "--first-parent" option.
  • "git fetch $remote $branch" used to avoid touching the remote-tracking branch (you could always be explicit and say "git fetch $remote $branch:refs/remotes/$remote/$branch"). The command now updates the remote-tracking branch (if configured).
  • Use of the platform fnmatch(3) function (in many places like pathspec matching, .gitignore and .gitattributes) has been replaced with wildmatch, allowing "foo/**/bar" to match "foo/bar", "foo/a/bar", etc.
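
Two of these highlights, the line-range log and the "**" wildcard, can be tried out as follows. This is a sketch in a scratch repository with invented file contents; "git check-ignore" is used only as a convenient way to ask whether a path matches a .gitignore pattern.

```shell
cd "$(mktemp -d)" && git init -q .
git config user.name "A U Thor" && git config user.email author@example.com

printf 'one\ntwo\nthree\n' >file.txt
git add file.txt && git commit -q -m "add file"
printf 'one\ntwo\nTHREE\n' >file.txt
git commit -q -a -m "change line 3"
printf 'ONE\ntwo\nTHREE\n' >file.txt
git commit -q -a -m "change line 1"

# Follow only line 1 of file.txt back through the history; keep just the
# subject lines and drop the diff that "log -L" prints with each commit.
subjects=$(git log -L1,1:file.txt --format='subject: %s' | grep '^subject: ')
echo "$subjects"

# "foo/**/bar" matches "bar" directly in "foo" as well as deeper down.
echo 'foo/**/bar' >.gitignore
shallow=kept; deep=kept; other=kept
git check-ignore -q foo/bar && shallow=ignored
git check-ignore -q foo/a/bar && deep=ignored
git check-ignore -q foo/barn && other=ignored
echo "foo/bar=$shallow foo/a/bar=$deep foo/barn=$other"
```

The commit that only touched line 3 should be absent from the line-range log, and only the two paths ending in "/bar" should count as ignored.
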
Have fun.

Monday, July 22, 2013

Git 1.8.3.4

The latest maintenance release Git v1.8.3.4 is now available at the usual places. This is mostly to propagate documentation fixes and test updates from the master front back to the maintenance track, but there are a handful of minor fixes as well:

  • The bisect log listed incorrect commits when bisection ends with only skipped ones.
  • The test coverage framework was left broken for some time.
  • The test suite for HTTP transport did not run with Apache 2.4.
  • "git diff" used to fail when core.safecrlf is set and the working tree contents had mixed CRLF/LF line endings. Committing such content must be prohibited, but "git diff" should help the user to locate and fix such problems without failing.
These fixes are already on the 'master' branch to be included in upcoming Git 1.8.4. Hopefully we can do its zeroth release candidate preview early this week.

Have fun.

Monday, July 15, 2013

Git 1.8.3.3

The third maintenance release for 1.8.3.x series is now available at the usual places. It contains the following fixes that have already been applied to the 'master' branch for 1.8.4.
  • "git apply" parsed patches that add new files, generated by programs other than Git, incorrectly.  This is an old breakage in v1.7.11.
  • Older versions of cURL wanted the piece of memory we pass to it to stay stable, but we updated the auth material after handing it over in a call.
  • "git pull" into nothing trashed "local changes" that were in the index.
  • Many "git submodule" operations did not work on a submodule at a path whose name is not in ASCII.
  • "cherry-pick" had a small leak in its error codepath.
  • Logic used by git-send-email to suppress cc mishandled names like "A U. Thor" <author@example.xz>, where the human readable part needs to be quoted (the user input may not have the double quotes around the name, and comparison was done between quoted and unquoted strings).  It also mishandled names that need RFC2047 quoting.
  • "gitweb" forgot to clear a global variable $search_regexp upon each request, mistakenly carrying over the previous search to a new one when used as a persistent CGI.
  • The wildmatch engine did not honor WM_CASEFOLD option correctly.
  • "git log -c --follow $path" segfaulted upon hitting the commit that renamed the $path being followed.
  • When a reflog notation is used for implicit "current branch", e.g. "git log @{u}", we did not say which branch and worse said "branch ''" in the error messages.
  • Mac OS X does not like to write(2) more than INT_MAX number of bytes; work around it by chopping write(2) into smaller pieces.
  • Newer MacOS X encourages the programs to compile and link with their CommonCrypto, not with OpenSSL.

Friday, June 28, 2013

Git 1.8.3.2

The second maintenance release for 1.8.3.x series is now available at the usual places. It contains the following fixes that have already been applied to the 'master' branch for 1.8.4.

  • Cloning with "git clone --depth N" while fetch.fsckobjects (or transfer.fsckobjects) is set to true did not tell the cut-off points of the shallow history to the process that validates the objects and the history received, causing the validation to fail.
  • "git checkout foo" DWIMs the intended "upstream" and turns it into "git checkout -t -b foo remotes/origin/foo". This codepath has been updated to correctly take existing remote definitions into account.
  • "git fetch" into a shallow repository from a repository that does not know about the shallow boundary commits (e.g. a different fork from the repository the current shallow repository was cloned from) did not work correctly.
  • "git subtree" (in contrib/) had one codepath whose loose error checks could lose data at the remote side.
  • "git log --ancestry-path A...B" did not work as expected, as it did not pay attention to the fact that the merge base between A and B was the bottom of the range being specified.
  • "git diff -c -p" was not showing a deleted line from a hunk when another hunk immediately begins where the earlier one ends.
  • "git merge @{-1}~22" was rewritten to "git merge frotz@{1}~22" incorrectly when your previous branch was "frotz" (it should be rewritten to "git merge frotz~22" instead).
  • "git commit --allow-empty-message -m ''" should not start an editor.
  • "git push --[no-]verify" was not documented.
  • An entry for "file://" scheme in the enumeration of URL types Git can take in the HTML documentation was made into a clickable link by mistake.
  • zsh prompt script that borrowed from bash prompt script did not work due to slight differences in array variable notation between these two shells.
  • The bash prompt code (in contrib/) displayed the name of the branch being rebased when "rebase -i/-m/-p" modes are in use, but not the plain vanilla "rebase".
  • "git push $there HEAD:branch" did not resolve HEAD early enough, so it was easy to flip it around while push is still going on and push out a branch that the user did not originally intend when the command was started.
  • "difftool --dir-diff" did not copy back changes made by the end-user in the diff tool backend to the working tree in some cases.
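
The "git checkout foo" DWIM mentioned in the list can be seen with a pair of local repositories. A sketch, assuming the default checkout behaviour has not been turned off by configuration; the repository names are hypothetical.

```shell
cd "$(mktemp -d)"
git init -q upstream
git -C upstream config user.name "A U Thor"
git -C upstream config user.email author@example.com
git -C upstream commit -q --allow-empty -m initial
git -C upstream branch foo

git clone -q upstream work

# There is no local "foo" yet, but exactly one remote-tracking branch
# (origin/foo) by that name, so a tracking branch is set up for us.
git -C work checkout -q foo
tracking=$(git -C work rev-parse --abbrev-ref 'foo@{upstream}')
echo "$tracking"
```

The new local branch should report origin/foo as its upstream, as if "git checkout -t -b foo origin/foo" had been typed.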


Friday, June 21, 2013

Fun with various workflows (2)

As I discussed in a separate post, even though Git is a distributed SCM, it supports the centralized workflow well, to help people migrating from traditional SCM systems. But of course, Git serves the distributed workflow well, too. This is the one used in Linux kernel development, where you work based on Linus's or a subsystem maintainer's repository, and publish to your own repository to get it pulled by others (including Linus, if your work is very good).

You would first start by cloning from your upstream:

  $ git clone git://git.kernel.org/.../git/torvalds/linux.git

The only difference from the initial step in the centralized workflow is,... nothing.  You will get a "linux" directory that becomes your working area, where you will have the standard configuration, perhaps not very different from this:

  [remote "origin"]
    url = git://git.kernel.org/.../torvalds/linux.git
    fetch = +refs/heads/*:refs/remotes/origin/*
  [branch "master"]
    remote = origin
    merge = refs/heads/master


And your "master" branch, which was copied from the "master" branch of Linus's repository, is ready for you to build your work on it.

The only difference is that you would not "git push" back to Linus's repository.  The "git://" protocol will not usually let you push, and even if it did, Linus would not let you write into his repository.

After working on your changes on "master", the way you would push out what you did is to say something like this:

  $ git push git@github.com:me/linux.git master

This might get cumbersome to type every time, so you would add another remote, perhaps like this:

  [remote "me"]
    url = git@github.com:me/linux.git

By defining a short-hand for that URL, you can now say:

  $ git push me master

and push out the work you did on your master branch as the master branch of your public repository, so that other people can pull from it.
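Rather than editing the configuration file by hand, the same section can be created with "git remote add". A minimal sketch, run in a scratch repository so that it does not touch a real checkout (the temporary directory and the variable names are only for illustration; the URL is the one used above):

```shell
# Scratch repository; the location is arbitrary.
scratch=$(mktemp -d) &&
git init -q "$scratch" &&
(
  cd "$scratch" &&
  # Creates the [remote "me"] section (a default fetch refspec is added too).
  git remote add me git@github.com:me/linux.git
) &&
# Read the value back from the configuration.
me_url=$(git --git-dir="$scratch/.git" config --get remote.me.url)
echo "$me_url"
```

Nothing is contacted over the network at this point; the command only records the short-hand.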

If you worked on a topic that was forked from Linus's master to enhance a specific feature or fix a specific bug, you may want to say:

  $ git checkout -b fix-tty-bug origin/master
  ... work work work ...
  $ git push me fix-tty-bug

to publish the result in your public repository as a branch.

By the way, do you recall the reason why upstream mode was appropriate when using the centralized workflow from the previous post?

While the purpose of Linus's master branch is to advance the overall state of the Linux kernel to prepare for the next release, the purpose of your topic branch fix-tty-bug is a lot narrower. And you are usually not integrating the work other people did into your work before you push it out. Indeed, you are encouraged to pick one stable point in the official (i.e. Linus's) history, and build on top of it without rebasing or merging things unrelated to what you are trying to achieve yourself.

In the centralized workflow, you tentatively play the role of integrator and change the purpose of your topic branch into "advance the overall project status" (which is compatible with the purpose of the "master" branch you will be updating with your work) immediately before you push it out. In the distributed workflow, by contrast, the purpose of your topic branch stays the same as the original purpose of the topic until and after you push it out.

If you started your topic branch, fix-tty-bug, to fix a bug in the tty subsystem and named it after the purpose of the topic branch, it can and should keep the name in your public repository. There is no reason to publish the result as your master branch. You control the branch names in your public repository, and pushing it out as master will only lose information. The branch name fix-tty-bug told what the branch was about. The name master sounds as if you are trying to make everything better, but that is not what you did.

So in general, you would be pushing out your topic branches to your public repository under the same name. You can use the 'current' mode when you push your work out, like this:

  $ git config push.default current

And then, you can lose that branch name from the command line when you push your work out:

  $ git push me

You run the above command while you have your fix-tty-bug branch checked out, and the current branch is pushed out to the destination repository (i.e. me) to update the branch of the same name.

Recently, we added a mechanism to help those who are too lazy to even type "me", i.e. it lets you say:

  $ git push

To use this, you configure what remote you push to when you do not name one on the command line, with a configuration variable, like this:

  $ git config remote.pushdefault me

This feature is available in Git 1.8.3 and later.
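The two settings can be tried out without any hosting site. The sketch below is only a demonstration under made-up names: two local bare repositories stand in for the upstream and for your public repository, and empty commits stand in for real work.

```shell
# Every path and name below is made up for the demonstration.
tmp=$(mktemp -d) &&
git init -q --bare "$tmp/upstream.git" &&
git init -q --bare "$tmp/me.git" &&
git init -q "$tmp/work" &&
(
  cd "$tmp/work" &&
  git config user.name "A U Thor" &&
  git config user.email author@example.com &&
  git remote add origin "$tmp/upstream.git" &&
  git remote add me "$tmp/me.git" &&
  git config push.default current &&
  git config remote.pushdefault me &&
  git commit -q --allow-empty -m "stand-in for upstream history" &&
  git checkout -q -b fix-tty-bug &&
  git commit -q --allow-empty -m "fix the tty bug" &&
  # Neither a remote nor a branch name on the command line.
  git push -q
) &&
in_me=$(git --git-dir="$tmp/me.git" for-each-ref --format='%(refname)') &&
in_upstream=$(git --git-dir="$tmp/upstream.git" for-each-ref --format='%(refname)')
echo "me: $in_me"
echo "upstream: ${in_upstream:-(nothing)}"
```

The topic branch ends up in the "me" repository under its own name, and nothing is sent to "origin".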

Thursday, June 20, 2013

Fun with various workflows (1)

Even though Git is distributed, you can still use it for projects that employ the centralized workflow, where there is a single central shared repository. Everybody pulls from it to obtain everybody else's work, and after integrating his own work with others' work, everybody pushes into it so that everybody else can enjoy the fruit of his work.

In the simplest workflow, you can start by cloning from the central repository:
  $ git clone our.site.xz:/pub/repo/project.git myproject
and the myproject directory becomes your working area, where you will have the standard configuration, perhaps not very different from this:
  [remote "origin"]
    url = our.site.xz:/pub/repo/project.git
    fetch = +refs/heads/*:refs/remotes/origin/*
  [branch "master"]
    remote = origin
    merge = refs/heads/master
and your "master" branch, which was copied from the "master" branch of the central shared repository, is ready for you to build your work on it.

If you run "git pull --rebase" (without any other argument), the configuration above left for you by "git clone" will tell Git that you would want to obtain the latest work from the central shared repository, and you would want to rebase your own work on top of their master branch.

If you say "git push" (without any other argument), the current default mode of pushing is to look at your local branches, and look at the branches the repository you are pushing to has, and update the matching branches. In this "simplest" case, you only have the 'master' branch, and the central repository does have its 'master' branch, so you will update its 'master' branch with the work you did on your 'master' branch.

In Git 2.0, this default mode will change to 'simple', which will push only the current branch to the branch at the central repository you integrate with, but only when they have the same name (so the example of working on 'master' and pushing it back to 'master' will still work).

If your project employs the centralized workflow, after learning Git enough to be comfortable with it, you may want to do
  $ git config push.default upstream
to choose to always update the branch at the central repository you integrate with, even if the branch names are different.  Note that you can do this (or use 'simple' instead of 'upstream'), and indeed you are encouraged to do so, without waiting for Git 2.0.

That will allow you to work on different things on different branches, e.g.
  $ git checkout -b my-feature -t origin/master
  $ git push
The first "checkout" will create a new "my-feature" branch, that is set to integrate with the master branch from your central repository. When using the upstream mode, you will push "my-feature" back to update the "master" branch over there.
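To watch the upstream mode in action, here is a small sketch that uses a local bare repository as a stand-in for the central one. The paths, the identity and the empty commits are all assumptions made for the demonstration.

```shell
tmp=$(mktemp -d) &&
git init -q --bare "$tmp/central.git" &&
git init -q "$tmp/myproject" &&
(
  cd "$tmp/myproject" &&
  git config user.name "A U Thor" &&
  git config user.email author@example.com &&
  git remote add origin "$tmp/central.git" &&
  # Seed the central repository with a 'master' branch.
  git symbolic-ref HEAD refs/heads/master &&
  git commit -q --allow-empty -m "initial" &&
  git push -q origin master &&
  git config push.default upstream &&
  # Fork a local branch that integrates with their master.
  git checkout -q -b my-feature -t origin/master &&
  git commit -q --allow-empty -m "neat feature" &&
  git push -q
) &&
central_refs=$(git --git-dir="$tmp/central.git" for-each-ref --format='%(refname)') &&
central_master=$(git --git-dir="$tmp/central.git" rev-parse refs/heads/master) &&
local_tip=$(git --git-dir="$tmp/myproject/.git" rev-parse refs/heads/my-feature)
echo "$central_refs"
```

The central repository still has only its master branch, and that branch now points at the tip of the local my-feature branch.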

An interesting thing to notice is that in the centralized workflow, because there is no central project maintainer (aka integrator), everybody is responsible for integrating his own work to advance the mainline of the project. The job of integration is indeed distributed when you use centralized workflow. It is a bit funny when you think about it.

But that is exactly why the upstream mode makes sense. In order to fully appreciate it, you need to realize what it means to have forked the "my-feature" branch out of the "master" branch of the central shared repository.

The purpose of the master branch at the shared central repository is to advance the state of the project in general, but the purpose of your local branch, my-feature, is a lot more specialized. It may be to fix this small bug, or add that neat feature. You would only be working on a small part of the project while on that branch.

But because you are the one who plays the top-level integrator role when you run "git pull --rebase" just before you "git push", when that "git pull --rebase" finishes, the tip of your my-feature branch is no longer about your small fix or neat feature. It temporarily becomes about advancing the state of the overall project. And that is the reason you would "git push" it to update the master branch, not the "my-feature" branch, at the central repository. Of course, if you want to publish it as "my-feature", perhaps because you want to show it to others before really updating the shared master branch, you can explicitly say:
  $ git push origin my-feature
Pushing my-feature that was forked from and still integrates with their master is not usually what you want to do every time in the centralized workflow, though. In fact, it often is the case that administrators of a project with centralized workflow frown upon people making random branches at their shared central repository willy-nilly (exactly because the central shared repository is a common resource and a feature branch like "my-feature" is often not of general interest).

Common things require less typing, and uncommon things are possible but you need to explicitly tell Git to do so.

The Git core itself is very much agnostic to what workflow you use, and you can also use it for projects that use "I publish my work to my public repository, others interested in my work can pull my work from there, and there is an integrator who pulls and consolidates good work from others and publishes the aggregated whole" distributed workflow. That will be a topic for a separate post.

Monday, June 10, 2013

Git 1.8.3.1

The first maintenance release 1.8.3.1 is out.

This is primarily to push out fixes to two regressions that seem to have affected many people recently.  Sorry about that.

  • With Git 1.8.3, an entry "!dir" in .gitignore to say "This directory's contents are not ignored, unless other more specific entries tell us otherwise" did not work correctly. This regression has been fixed.
  • With recent Git since 1.7.12.1 or so, "git daemon", when started by the root user and then switched to an unprivileged user, refused to run when ~root/.gitconfig (and XDG equivalent configuration files under ~root/.config/) cannot be read by the unprivileged user. The right way to start the daemon might be to reset its $HOME (where these configuration files are read from) to somewhere the user the daemon runs as, but it is cumbersome to set up. With 1.8.3.1, failure to access these files with EPERM is treated as if these files do not exist, which is not an error.
The release tarballs are available at the usual places.

Checking the current branch programmatically

The git branch Porcelain command, when run without any argument, lists the local branches, and shows the current branch prefixed with an asterisk, like this:




$ git branch
* master
  next
$ git checkout master^0
$ git branch
* (no branch)
  master
  next

The second one with (no branch) is shown when you are not on any branch at all. It often is used when you are sightseeing the tree of a tagged version, e.g. after running git checkout v1.8.3 or something like that.

To find out what the current branch is, casual/careless users may have scripted around git branch, which is wrong. We actively discourage the use of any Porcelain command, including git branch, in scripts, because the output from the command is subject to change to better serve human readers.

And in fact, since release 1.8.3, the output when you are not on any branch, has become something like this:
$ git checkout v1.8.3
$ git branch
* (detached from v1.8.3)
  master
  next

in order to give you (as a human consumer) better information. If your script depended on the exact phrasing from git branch, e.g.




branch=$(git branch | sed -ne 's/^\* \(.*\)/\1/p')
case "$branch" in
'('?*')') echo not on any branch ;;
*) echo on branch $branch ;;
esac

your script will break.

The right way to programmatically find the name of the current branch, if any, is not to use the Porcelain command git branch that is meant for human consumption, but to use a plumbing command git symbolic-ref instead:




if branch=$(git symbolic-ref --short -q HEAD)
then
  echo on branch $branch
else
  echo not on any branch
fi
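The same check can be wrapped in a small function and exercised in a throw-away repository. This is only a sketch: the function name current_branch, the branch name "topic" and the temporary directory are made up for the illustration.

```shell
# Prints the branch name and succeeds when on a branch;
# prints nothing and fails when HEAD is detached.
current_branch () {
  git symbolic-ref --short -q HEAD
}

repo=$(mktemp -d) &&
git init -q "$repo" &&
(
  cd "$repo" &&
  git config user.name "A U Thor" &&
  git config user.email author@example.com &&
  # Name the initial branch explicitly so the result is predictable.
  git symbolic-ref HEAD refs/heads/topic &&
  git commit -q --allow-empty -m "initial"
) &&
on_branch=$(cd "$repo" && current_branch) &&
# Detach HEAD, as in the "git checkout master^0" example.
( cd "$repo" && git checkout -q HEAD^0 ) &&
if detached=$(cd "$repo" && current_branch)
then
  status="on branch $detached"
else
  status="not on any branch"
fi
echo "$on_branch"
echo "$status"
```

Because the exit status carries the answer, the script never has to parse a phrase meant for humans.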



Friday, May 24, 2013

Git 1.8.3 and even more leftover bits

The 1.8.3 release has finally been tagged and pushed out to the usual places. Also the release tarballs at kernel.org are back.

For a list of highlights, please see the previous post on -rc2; not much has changed since then.

During the last development cycle, including its pre-release feature freeze, a few more interesting topics were discussed, but at this moment there are no actual patches or design work for them.

[Previous list of "leftover bits" is here]
  • "git config", when removing the last variable in a section, leaves an empty section header behind. Anybody who wants to improve this needs to consider ramifications of leaving or removing comments.
    Cf. $gmane/219524
  • [STARTED AND THEN STALLED] Add "git pull --merge" option to explicitly override configured pull.rebase=true. Make "git pull" that does not say how to integrate fail when the result does not fast-forward, and advise the user to say --merge/--rebase explicitly or configure pull.rebase=[true|false]. An unconfigured pull.rebase and pull.rebase that is explicitly set to false would mean different things (the former will trigger the "fast-forward or die" check, the latter does the "pull = fetch + merge").
    Cf. $gmane/225326
  • Teach more commands that operate on branch names about "-" shorthand for "the branch we were previously on", like we did for "git merge -" sometime after we introduced "git checkout -".
    Cf. $gmane/230828
  • Proofread our documentation set, and update to reduce newbie confusion around "remote", "remote-tracking branch", use of the verb "to track", and "upstream".
    Cf. $gmane/230786.

Monday, May 13, 2013

Git 1.8.3-rc2

The second and planned-to-be-the-final release candidate for the upcoming 1.8.3 release was tagged today. Also, the release tarballs at kernel.org are back ;-)

Hopefully we can have the final late next week, but we might end up doing another release candidate. Please help by testing rc2 early, to make sure you get a solid release.

There are numerous little fixes, new features and enhancements that cannot be covered in a single article, so I'll only highlight some selected big-picture changes here. For the full list of changes, please refer to the draft Release Notes.

Preparation for 2.0

A lot of work went into preparing the users for 2.0 release that will come sometime next year, which promises large backward-incompatible UI changes:
  • "git push" that does not say what to push from the command line will not use the "matching" semantics in Git 2.0 (it will use "simple", which pushes your current branch to the branch of the same name only when you have forked from it previously; e.g. "git push origin" done while you are on your "topic" branch that was previously created with "git checkout -t -b topic origin/topic" will push your "topic" branch to origin).

    This default change will hurt old-timers who are used to the traditional "matching" (if you have branches A, B and C, and the other side has branches A and C, then your branches A and C will update their branches A and C, respectively), and they can use "push.default" configuration variable to keep the traditional behaviour. I.e.

    $ git config push.default matching

    Recent releases since 1.8.0 have been issuing warnings when you do not have any push.default configuration set, and this release continues to do so.

  • "git add -u" and "git add -A" that is run inside a subdirectory without any other argument to specify what to add will operate on the entire working tree in Git 2.0, so that the behaviour will be more consistent with "git commit -a" (e.g. "edit file && cd subdir && git commit -a" will commit the change to the file you just edited which is outside the directory you ran "git commit" in).

    Users can say "git add -u ." and "git add -A ." (the "dot" means "the current directory") to limit the operation to the subdirectory the command is run in with the traditional versions of Git (and this will stay the same in Git 2.0 or later), so there will be no configuration variable to change the default.

    The 1.8.3 and later releases do not yet change the behaviour (that waits for Git 2.0) and still limit these operations to the current subdirectory, but they do notice when you have changes outside your current subdirectory and warn, saying that if you were to type the same command to Git 2.0, you would be adding those other files to your index, and encourage you to learn to add that explicit "dot" if you mean to add changed or all files in the current subdirectory only.

  • "git add path" has traditionally been a no-op for removed files (e.g. "rm -f dir/file && git add dir" does not record the removal of dir/file to the index), but Git 2.0 will notice and record removals, too.

    The 1.8.3 and later releases do not yet change the behaviour (that waits for Git 2.0), but they do notice when you have removed files that match the path and warn, saying that if you were to type the same command to Git 2.0, you would be recording their removal, and encourage you to learn to use the --ignore-removal option if you mean to only add modified or new files to the index.

Tightening of command line verification

There are quite a few UI fixes that tie up loose ends. Some commands assumed that the users were perfect and would never throw nonsense command line arguments at them, and some operations that need two parameters were happily carried out even when they got three parameters without diagnosing such a command line as an error (the excess one was simply ignored).

Many of them have been updated to detect and die on such errors.

Helping our friends at Emacs land

We expedited the update of the foreign SCM interface to bzr we have in the contrib/ area since 1.8.2, and included a version that is vastly modified from what we had before, with help from some Emacs folks. This code could be a bit rougher than the rest of Git that usually moves slowly and cautiously, but we decided that, given the circumstance, it is more important to push out some improved version early, in order to help our friends in Emacs land, who have been (reportedly) suffering from less than ideal response to the issues they are having with their SCM of choice.

A beginning of a better triangular workflow support

The recommended workflows to collaborate with others are either:
  • to have your own repository and push your work there while pulling from your upstream to keep up to date, or
  • to have a central repository where everybody pushes to and pulls from.
The latter was primarily to make those who are coming from centralized version control systems feel at ease, and the repository configuration mechanisms such as "remote.origin.url" variable were designed to help that workflow (there is one "origin" you pull from and push to). The former however is also important, and many people on Git hosting sites (e.g. GitHub) employ this workflow (you pull from one place and push to another place, but they are not the same).

A new configuration mechanism "remote.pushdefault" has been introduced to support such a triangular workflow. After you clone from somebody else's project, that upstream repository will still be your 'origin', but you can add the repository you regularly push to in order to publish your work (and presumably then you will throw a "pull request" at the upstream) as another remote, and set it to this configuration variable. E.g.
$ git clone git://example.com/frotz.git frotz
$ cd frotz
$ git remote add publish ssh://myhost.com/myfrotz.git
$ git config remote.pushdefault publish
After this, you can say "git push" and the push does not attempt to push to your origin (i.e. git://example.com/frotz.git)  but to your publish remote (i.e. ssh://myhost.com/myfrotz.git) because of the last configuration.


Tuesday, April 2, 2013

Where do evil merges come from?


A canonical example of where "evil merge" happens in real life (and is a good thing) is to adjust for semantic conflicts. It almost never has anything to do with textual conflicts.

Imagine you and your friend start with the same codebase where a function f() takes no parameter, and has two existing call-sites.

You decide to update the function to take a parameter, and adjust both existing call-sites to pass one argument to the function. Your friend in the meantime added a new call-site that calls f() still without an argument. Then you merge.

It is very likely that you won't see any textual conflict. Your friend added some code to a block of lines you did not touch while you two were forked. However, the end result is now wrong. Your updated f() expects one parameter, while the new call-site your friend added does not pass any argument to it.

In such a case, you would fix the new call-site your friend added to pass an appropriate argument and record that as part of your merge.

Consider the line that has that new call-site you just fixed. It certainly did not exist in your version (it came from your friend's code), but it is not exactly what your friend wrote, either (it did not pass any argument). That makes the merge an "evil merge".

With "git log -c/--cc", such a line will show with double-plus in the multi-way patch output to show that "this did not exist in either parent".
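The scenario can be reproduced in a throw-away repository. The sketch below is an illustration under made-up names: the helper make_evil_merge, the file "code" and its contents are all hypothetical, and a plain text file stands in for real source code.

```shell
# Hypothetical helper: builds the scenario described above in directory $1.
make_evil_merge () {
  (
    cd "$1" && git init -q . &&
    git config user.name "A U Thor" &&
    git config user.email author@example.com &&
    # Common starting point: f() takes no parameter, two call-sites.
    printf '%s\n' 'f() takes no parameter' 'call f()' \
      1 2 3 4 5 6 7 8 'call f()' >code &&
    git add code && git commit -q -m "base" &&
    git branch friend &&
    # You: f() takes a parameter; both existing call-sites adjusted.
    printf '%s\n' 'f(x) takes one parameter' 'call f(1)' \
      1 2 3 4 5 6 7 8 'call f(1)' >code &&
    git commit -q -a -m "f() takes a parameter" &&
    # Your friend: a new call-site, still without an argument.
    git checkout -q friend &&
    printf '%s\n' 'f() takes no parameter' 'call f()' \
      1 2 3 4 'new call f()' 5 6 7 8 'call f()' >code &&
    git commit -q -a -m "add a call-site" &&
    # Merge without textual conflict, then fix the new call-site
    # and record the fix as part of the merge.
    git checkout -q - &&
    git merge -q --no-commit friend &&
    sed 's/new call f()/new call f(1)/' code >code.fixed &&
    mv code.fixed code &&
    git commit -q -a -m "merge friend, adjusting the new call-site"
  ) >/dev/null 2>&1
}

repo=$(mktemp -d) && make_evil_merge "$repo" &&
# Lines marked with double-plus; the "+++ b/code" header is excluded.
evil=$(git --git-dir="$repo/.git" show --cc HEAD | grep '^++[^+]')
echo "$evil"
```

The only line marked with double-plus is the call-site that was adjusted while merging, which is neither what you had nor what your friend wrote.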

Thursday, March 21, 2013

Measuring Project Activities (2)

Continuing from an earlier article, let's see how you can compute some interesting stats on your own projects.

How much change did a release have?

As I said earlier, you can measure the extent of change to your codebase in two ways: a quicker and less precise way, and a more involved but more accurate way.

A quicker way is to ask git diff --numstat to count the deleted and added lines between the release tags, and add them up yourself. If you care about whole-file renames, you can add the -M option to the git diff command:

addremove2 () {
  git diff --numstat "$@" | {
    total=0 &&
    while read add remove path
    do
      total=$(( $total + $add + $remove ))
    done &&
    echo "$total"
  }
}

And with that helper, the main function we introduced in the earlier article can do this to compute the modified2 number for the entire release cycle and per each day:

handle () {
  old="$1^0" new="$2^0"
  ...
  modified2=$(addremove2 "$old" "$new")
  mod2perday=$( echo 2k "$modified2" "$days" / p | dc )
}
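One thing to watch out for with addremove2 is that git diff --numstat shows "-" instead of numbers for binary files, which shell arithmetic cannot add up. A minimal sketch of a variant that skips such entries follows; the name sum_numstat and the sample paths are made up, and the canned input merely imitates the "added, removed, path" shape of the real output.

```shell
# Hypothetical helper: sums the added and removed counts from
# numstat-style input on stdin, skipping binary entries ("-").
sum_numstat () {
  awk '$1 != "-" { total += $1 + $2 } END { print total + 0 }'
}

# Canned sample in the shape "added<TAB>removed<TAB>path".
sample_total=$(
  printf '3\t1\tMakefile\n-\t-\tlogo.png\n10\t0\tfetch-pack.c\n' |
  sum_numstat
)
echo "$sample_total"
```

With real data, it would be fed as: git diff --numstat "$old" "$new" | sum_numstat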

How much real change did a release have?

Counting the number of added and removed lines using git diff --numstat is straightforward, but this tends to over-count changes. For example, when adding a new caller of some existing code, you may have to move that existing code up in the same file (especially if it is a file-local static function) to make the callee come before the caller, or move it to a different, more "library-ish" file, while changing its visibility from static to extern. Both of these kinds of changes unfortunately appear as a bulk deletion of an existing block of lines and a bulk addition of the same contents elsewhere in the codebase.

In order to count the true amount of work that went into the new release, you would want to exclude such changes from your statistics.

This is where git blame can help. In the most basic form, it can trace each and every line of a file in the given commit back to its origin, i.e. which commit it came from. By default, it notices when the whole file gets renamed (e.g. the file hello.c you are running the command on in the current release may have been called goodbye.c in an earlier release), and employs no other fancy tricks, but you can tell it to notice code movement within a file (e.g. moving the callee up in the file) with the -M option, or code moves across files (e.g. moving a static function from a file that an existing caller lives in to a different "library-ish" file, to make it also visible to a new caller in another file) with the -C option. You can also tell it to ignore whitespace changes with the -w option like you can with git diff. For example:

  git blame -M -C -w -s v1.8.0..v1.8.1 -- fetch-pack.c

will show you which commit each and every line in the fetch-pack.c file came from; its output may begin like this:

745f7a8c fetch-pack.c           1) #include "cache.h"
^8c7a786 builtin/fetch-pack.c   2) #include "refs.h"
^8c7a786 builtin/fetch-pack.c   3) #include "pkt-line.h"


The first line is blamed to commit 745f7a8c, while the other lines are attributed to commit 8c7a786 (the leading caret ^ means it is attributed to a commit at the lower boundary of the range), which is the v1.8.0 release. Note that these old lines used to live in a different file builtin/fetch-pack.c in the older release, and would have been counted as additions if you used the approach based on git diff --numstat -M to count them, because there was no file renaming involved between these two releases.

Also notice that these lines may have been untouched since a commit that may be a lot older than v1.8.0, but we told the command to stop at v1.8.0 from the command line, so these are all attributed to that range boundary.

If you count the number of lines in the whole output from the above command, that will show the number of lines in the fetch-pack.c file at the v1.8.1 release. If you count the lines that do not begin with a caret, that counts the lines added in the new release.

added_to_file () {
  old="$1" new="$2" path="$3"
  git blame -M -C -w -s "$old".."$new" -- "$path" |
  grep -v '^^' |
  wc -l
}
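The filtering step can be checked on its own against the three sample lines shown earlier, without running git blame at all. In this sketch the helper name count_new_lines is made up, and grep -c is used in place of wc -l only to get a bare number.

```shell
# Hypothetical helper: counts lines that are not attributed to the
# boundary commit, i.e. lines that do not begin with a caret.
count_new_lines () {
  grep -c -v '^\^'
}

sample_new=$(count_new_lines <<'EOF'
745f7a8c fetch-pack.c           1) #include "cache.h"
^8c7a786 builtin/fetch-pack.c   2) #include "refs.h"
^8c7a786 builtin/fetch-pack.c   3) #include "pkt-line.h"
EOF
)
echo "$sample_new"
```

Only the first sample line counts as new; added_to_file above applies the same kind of filter to the real git blame output.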

This may be sufficient as a starting point, but we are not at all interested in checking each and every commit between the two releases (e.g. the commit 745f7a8c in the above example is not the v1.8.1 release and the only thing we care about is that the line is new in the new release; we do not care where in the development cycle leading to the release it was added), so it is a waste of computational cycles.

Fortunately, you can tell git blame to pretend as if the commit tagged as v1.8.1 release were a direct and sole child of the commit tagged as v1.8.0 release with the -S option. First, you prepare a graft file to describe the parent-child relationship.

added_to_file () {
  old="$1" new="$2" path="$3"
  graft=/tmp/blame.$$.graft
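  # $old and $new must be full object names here, and the graft
  # lines must start at column 1, so the here-document is not indented.
  cat >"$graft" <<EOF
$new $old
$old
EOF
  git blame -S "$graft" -M -C -w -s "$old".."$new" -- "$path" |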
  ...
}


The graft file lists each commit object and its parent. The above snippet says that the $new commit has a single parent, which is $old, and $old commit does not have any parent. This lets us lie to git blame that our history consists of only two commits, and one is a direct child of the other.

With this, we can tell how much new material was introduced to the given path in the new release, but what about the material removed from the old release? We can compute it in a similar way with a twist. You take a path in the old release, and pretend as if the old release were the direct child of the new release. We compute what we have added if we started from release v1.8.1 and development led to the contents of v1.8.0, like this:

removed_from_file () {
  old="$1" new="$2" path="$3"
  graft=/tmp/blame.$$.graft
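  # The parent-child relationship is reversed: $old pretends to be
  # the sole child of $new.
  cat >"$graft" <<EOF
$old $new
$new
EOF
  git blame -S "$graft" -M -C -w -s "$new".."$old" -- "$path" |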
  grep -v '^^' |
  wc -l
}

By tying these two helper functions with a list of paths that existed in the two releases, you can compute the amount of real changes made to reach the new release, but this article is getting a bit too long, so I'll leave it to another installment. We will use the added_to_file helper to construct added_to_commit function like this:

added_to_commit () {
  old=$(git rev-parse "$1^0")
  new=$(git rev-parse "$2^0")
  list_paths_in_commit "$new" |
  while read path
  do
    added_to_file "$old" "$new" "$path"
  done | {
    total=0
    while read count
    do
      total=$(( $total + $count ))
    done
    echo $total
  }
}
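The list_paths_in_commit helper is not shown in this article. One plausible way to write it, offered here only as an assumption and not as the version used for the published numbers, is to ask git ls-tree for every path recorded in the commit:

```shell
# Assumed implementation; the article does not show this helper.
# Lists every file path recorded in the commit named by $1.
list_paths_in_commit () {
  git ls-tree -r --name-only "$1"
}

# Throw-away repository with two made-up files to try it on.
repo=$(mktemp -d) &&
git init -q "$repo" &&
(
  cd "$repo" &&
  git config user.name "A U Thor" &&
  git config user.email author@example.com &&
  mkdir dir &&
  echo one >a.txt &&
  echo two >dir/b.txt &&
  git add a.txt dir/b.txt &&
  git commit -q -m "two files"
) &&
paths=$(cd "$repo" && list_paths_in_commit HEAD)
echo "$paths"
```

Paths with unusual characters are quoted in this output, so a careful version would use the -z option and a reader that understands NUL-terminated records.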

Monday, March 18, 2013

A bit annoyed by LinkedIn Endorsements

A few times a week, I get "X endorsed your skills and expertise" e-mail messages from LinkedIn, listing people from my past and present. One of the embarrassing ones I saw the other day was an endorsement on "Linux Kernel", made by somebody who used to work as a receptionist at a small company I used to be at several years ago. She didn't know (and didn't need to know) what technical work I did back then, I do not think she changed her career to know what technical work I do these days, and most importantly, I do not do the Kernel X-<.

And then today I got endorsements from a few Git people on "Ruby", but I know they know I do not do Ruby (not that I hate the language or its ecosystem; it is just that I never got around to touching it).

I was told by the former receptionist that LinkedIn nags every once in a while to give endorsements to others, and that it is very easy to click on it just to dismiss the nagging message, and end up giving such irrelevant endorsements.

It is mildly annoying. Just as annoying as that big red "unread count" number I see on the right top corner of the Gmail window.

Grumpy I am.

Thursday, March 14, 2013

Measuring Project Activities (1)

Earlier, I showed a handful of metrics to view the level of activity in the Git project, grouped by its release cycle, and promised to explain how you can compute similar numbers for your projects.

This is the first of such posts. This post covers the very basics.

How long did a cycle last?

Each release is given a release tag. The latest I tagged for the Git project was v1.8.2 and the release before that was v1.8.1. The release cycle began when I tagged v1.8.1 and ended when I tagged v1.8.2. As each commit in Git records commit timestamp and author timestamp, we can use the difference between the commit timestamps of the two releases.

We can ask git log to give us the timestamp for one commit:

  git log -1 --format=%ct $commit

The --format= option lets us ask for various pieces of information, and %ct requests the committer timestamp, expressed as number of seconds since midnight of January 1, 1970 (epoch). You can use %ci for the same information but in ISO 8601 format, i.e. YYYY-MM-DD HH:MM:SS; see git log --help and look for "PRETTY FORMATS" for other possibilities.

So the part, given two commits, that computes the number of days between them, becomes something like this:

handle () {
  old="$1^0" new="$2^0"
  oldtime=$(git log -1 --format=%ct "$old")
  newtime=$(git log -1 --format=%ct "$new")
  days=$(( ($newtime - $oldtime) / (60 * 60 * 24) ))
  ...
}

We ask the commit timestamps for the two commits in seconds since epoch, take the difference, and divide that by number of seconds in a day.
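The arithmetic can be tried with made-up numbers. In the sketch below, the two timestamps are invented and are 72 days and 5 hours apart; shell arithmetic does integer division, so the partial day is dropped.

```shell
# Invented timestamps, in seconds since the epoch.
oldtime=1000000000
newtime=$(( oldtime + 72 * 86400 + 5 * 3600 ))   # 72 days and 5 hours later

# Same formula as in the handle function above.
days=$(( (newtime - oldtime) / (60 * 60 * 24) ))
echo "$days"
```

If rounding to the nearest day matters for your statistics, add half a day's worth of seconds before dividing.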

How many commits do we have in the cycle?

This is a single-liner.
git log has a way to list commits in a specified range, and the range we want can be expressed as: "We are interested in commits that are ancestors of v1.8.2, but we are not interested in commits that are ancestors of v1.8.1" (as the latter is the set of commits that happened before v1.8.1).

In a merge-heavy project like Git, however, merge commits make up a significant part of the history. A logical change that consists of three patches may start its life as three commits on a topic branch to be tested, and later when it proves to be sound gets merged to the mainline with a merge commit, at which point the mainline gains 4 commits (the original three plus the merge commit). That means the real change is only 75% of the history in the example.

Of course, merging other people's work is an important part of the work done in the project, so you may want to count merge commits as well. The choice is up to you.

When I counted commits for Git project in the earlier article, I chose not to include merges, so the part that computes the number of commits between two given commits becomes:

handle () {
  old="$1^0" new="$2^0"
  ...
  commits=$(git log --oneline --no-merges "$old..$new" | wc -l)
  ...
}

Drop --no-merges if you want to count your merges. The --oneline option is to show a single line of output per commit; by counting the lines in the output from that command with wc -l, we can count the number of commits.

As we are not interested in the contents of the output (we are just counting the number of lines), we can also use git rev-list that only shows the commit object name, if you want.

  commits=$(git rev-list --no-merges "$old..$new" | wc -l)
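git rev-list can also do the counting itself with its --count option. The sketch below recreates the three-patches-plus-a-merge example from above in a throw-away repository; the tag names "old" and "new", the branch name and the empty commits are made up for the demonstration.

```shell
repo=$(mktemp -d) &&
git init -q "$repo" &&
(
  cd "$repo" &&
  git config user.name "A U Thor" &&
  git config user.email author@example.com &&
  git commit -q --allow-empty -m "previous release" &&
  git tag old &&
  # A topic consisting of three patches...
  git checkout -q -b topic &&
  git commit -q --allow-empty -m "patch 1" &&
  git commit -q --allow-empty -m "patch 2" &&
  git commit -q --allow-empty -m "patch 3" &&
  # ...merged back to the mainline with a merge commit.
  git checkout -q - &&
  git merge -q --no-ff -m "merge topic" topic &&
  git tag new
) &&
without_merges=$(git --git-dir="$repo/.git" rev-list --count --no-merges old..new) &&
with_merges=$(git --git-dir="$repo/.git" rev-list --count old..new)
echo "$without_merges of $with_merges"
```

Three of the four commits in the range are real changes, which is the 75% mentioned above.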

How many contributors did we have in the cycle?

You can list the names and e-mails of people who authored commits in a specified range in two ways.

Using the git log --format we saw earlier, we can ask the name %an and e-mail %ae of the author, i.e.

  git log --no-merges --format="%an <%ae>" "$old..$new"

You can count the unique lines in the output from this command. That is the list of your contributors. The end result will become something like this:

  authors=$(git log ...the same as above... | sort -u | wc -l)

The other way is to use the git shortlog command designed specifically for this purpose.

  git shortlog --no-merges -s -e "$old..$new"

Without the -e option, the command shows only the names; with it, names and e-mails. Without the -s option, it lists the commits made by each author under the author's name; with it, it shows the number of commits and the author's name on a single line. So the number of lines in the output from the above command is the number of your contributors.

  authors=$(git shortlog --no-merges -s -e "$old..$new" | wc -l)

Again, if you want to count merges, drop --no-merges from the command line.

How many new contributors have we added during the cycle?

This is a bit trickier than the previous one. The idea is to list contributors we already had in the entire history before v1.8.1, and subtract that from the list of contributors in the entire history up to v1.8.2. The remainder are the newcomers you want to welcome when writing your release notes.

The contributors in the entire history leading to a commit can be listed with a helper function:

authors () {
  git log --no-merges --format="%an <%ae>" "$@" | sort -u
}

and we can write the entire thing using the helper function like so:

handle () {
  old="$1^0" new="$2^0"
  ...
  authors "$old" >/tmp/old
  authors "$new" >/tmp/new
  new_authors=$(comm -13 /tmp/old /tmp/new | wc -l)
  ...
}

The authors helper function will write the authors for the old history and new history into two temporary files, both in sorted order, and using comm -13, we list lines that only appear in /tmp/new to see who are the new contributors.
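To see what comm -13 does without involving a repository, here is a sketch with two made-up contributor lists. The names, addresses and the temporary directory are invented; LC_ALL=C is set only so that sort and comm agree on the ordering.

```shell
tmp=$(mktemp -d) &&
# Contributors up to the old release.
printf '%s\n' 'Alice <alice@example.com>' 'Bob <bob@example.com>' |
  LC_ALL=C sort -u >"$tmp/old" &&
# Contributors up to the new release, deliberately unsorted.
printf '%s\n' 'Carol <carol@example.com>' 'Alice <alice@example.com>' \
  'Bob <bob@example.com>' |
  LC_ALL=C sort -u >"$tmp/new" &&
# -1 drops lines only in the first file, -3 drops lines in both,
# leaving the lines that appear only in the second file.
newcomers=$(LC_ALL=C comm -13 "$tmp/old" "$tmp/new")
echo "$newcomers"
```

Somebody who changed their e-mail address between the two releases would show up as a newcomer with this approach, so the result may need a quick look by a human.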

In the next installment of this series, let's count the changes made to the codebase by these commits we counted in this article.