Summary of Git Internals

April 23, 2019 · 4 minutes read

A senior engineer recommended that I read Chapter 10, Git Internals - Plumbing and Porcelain, and assigned it to me as a task. I remember him saying that anyone who took the time to read this chapter would understand how git works at its core and become a better engineer for it. I can confirm that he was right. My understanding of git is much stronger now than it was, and I intend to summarize what I learned in this post.

In a nutshell, git is a key-value data store located in the .git folder of your project. The key is the SHA-1 hash of the content being stored, with the first two characters of the hash used as the subdirectory name and the other 38 characters as the filename (ex. .git/objects/d6/70460b4b4aece5915caf5c68d12f560a9fe3e4). Git models its tree objects on the UNIX filesystem. You can think of tree objects as the folders and blob objects as the files. Essentially, as long as you have the .git folder of your project saved elsewhere, you can delete the rest of the project folder and restore every committed file from it.

One thing that is interesting if you have an interest in cryptography is how the content is prepared for the SHA-1 hash. Git first builds a header from the object type, a space, the size of the content in bytes, and a null byte, which for a tree object looks like header = "tree #{content.length}\0" in Ruby. Git then concatenates the header and the content, and the SHA-1 hash is calculated over that combined string; in Ruby this is done with Digest::SHA1.hexdigest() from the SHA1 digest library. Finally, git compresses the header and content with zlib and stores the result in the file I referred to before (the first two characters of the SHA-1 hash being the subdirectory and the last 38 characters being the filename in that directory). A branch in git is a simple pointer or reference to the head of a line of work, aka the SHA-1 value of the latest commit on that line.
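The hashing and storage steps can be sketched in a few lines of Python. This is only an illustration of the scheme described above, not how git itself is implemented; the function names are my own. The sample content is the string "test content" followed by a newline, which should hash to the object ID used in the path example earlier.

```python
import hashlib
import zlib


def git_object_id(obj_type: str, content: bytes) -> str:
    """Return the SHA-1 hex digest git assigns to an object of this type."""
    # Header: object type, a space, the content size in bytes, a null byte.
    header = f"{obj_type} {len(content)}\0".encode("ascii")
    return hashlib.sha1(header + content).hexdigest()


def loose_object_path(object_id: str) -> str:
    """First two hex characters are the directory, the other 38 the filename."""
    return f".git/objects/{object_id[:2]}/{object_id[2:]}"


def encode_loose_object(obj_type: str, content: bytes) -> bytes:
    """Return the zlib-compressed bytes that would be written to that file."""
    header = f"{obj_type} {len(content)}\0".encode("ascii")
    return zlib.compress(header + content)


if __name__ == "__main__":
    oid = git_object_id("blob", b"test content\n")
    print(oid)
    print(loose_object_path(oid))
```

Note that the hash covers the header as well as the content, so the same bytes stored as a different object type get a different ID.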

The HEAD file is normally a symbolic reference to the branch you are currently on (for example, ref: refs/heads/master) rather than a SHA-1 hash itself, which is how git knows which commit to start from when you run git branch <branch>. Running git commit creates a commit object whose parent is the SHA-1 value that the reference in HEAD points to. A tag object contains a tagger, a date, a message, and a pointer. It usually links to a commit, unlike a commit object, which links to a tree. A remote reference is read-only: you can check one out, but git never points HEAD at it, so it is never updated by git commit. Git uses remote references as bookmarks to the last known state of the branches on the server.
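To make the chain from HEAD to a commit concrete, here is a minimal sketch that follows it by reading files directly. It assumes the branch is stored as a loose file under refs/heads/ and deliberately ignores packed-refs, so it is not a replacement for git rev-parse HEAD. The function name resolve_head is hypothetical.

```python
from pathlib import Path
from typing import Optional


def resolve_head(git_dir: str) -> Optional[str]:
    """Follow HEAD to a commit SHA-1, reading only loose ref files."""
    head = (Path(git_dir) / "HEAD").read_text().strip()
    if not head.startswith("ref: "):
        # Detached HEAD: the file holds the SHA-1 directly.
        return head
    # Normal case: HEAD names a branch, e.g. "ref: refs/heads/master".
    ref_file = Path(git_dir) / head[len("ref: "):]
    if not ref_file.exists():
        # No commits on this branch yet, or the ref lives in packed-refs.
        return None
    return ref_file.read_text().strip()
```

Calling resolve_head(".git") from the root of a project returns the hash that the next commit would record as its parent.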

When too many loose objects accumulate, or when you push to a server, git packs objects into a single binary file called a packfile, storing similar versions of a file as deltas to save space. This can be done manually by running the git gc command in the root of the project. The git remote add origin command adds a section to .git/config with a fetch refspec, and a subsequent git fetch origin fetches all the references under refs/heads/ on the server and writes them to refs/remotes/origin/ on the machine that ran it.

$ git log origin/master
$ git log remotes/origin/master
$ git log refs/remotes/origin/master

These are three equivalent ways to accomplish the same thing: viewing the log of the master branch as it was last fetched from the server. Git expands each of them to refs/remotes/origin/master.
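The mapping from refs/heads/ on the server to refs/remotes/origin/ locally is expressed as a refspec, which for a remote named origin defaults to +refs/heads/*:refs/remotes/origin/*. The sketch below shows how such a pattern translates a reference name. It is a simplified model with a hypothetical function name, handling a single * wildcard only, not git's actual matching code.

```python
from typing import Optional


def apply_refspec(refspec: str, remote_ref: str) -> Optional[str]:
    """Map a ref name on the server to its local name, or None if no match."""
    # A leading + only means "update even if not a fast-forward".
    src, dst = refspec.lstrip("+").split(":", 1)
    if "*" not in src:
        return dst if remote_ref == src else None
    prefix, suffix = src.split("*", 1)
    if len(remote_ref) < len(prefix) + len(suffix):
        return None
    if not (remote_ref.startswith(prefix) and remote_ref.endswith(suffix)):
        return None
    # The part matched by * is carried over to the destination pattern.
    matched = remote_ref[len(prefix):len(remote_ref) - len(suffix)]
    return dst.replace("*", matched, 1)


if __name__ == "__main__":
    spec = "+refs/heads/*:refs/remotes/origin/*"
    print(apply_refspec(spec, "refs/heads/master"))
```

A tag such as refs/tags/v1.0 does not match the source side of this pattern, so this refspec alone does not map it anywhere.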

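Git can transfer data using either the dumb protocol or the smart protocol. The dumb protocol runs over plain HTTP and is read-only, while the smart protocol can run over HTTP, SSH, or the git:// protocol. Over the dumb protocol, git clone begins with HTTP GET requests that fetch the list of remote references and their SHA-1 hashes (such as refs/heads/master) before downloading the objects. Over smart HTTP, git push builds a packfile of the objects the server is missing and sends it as the payload of an HTTP POST.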

If you EVER lose a commit, there is no need to panic because there is a way to recover it. Suppose you run git reset --hard <SHA-1 Hash> to move master back to an older commit and lose the two commits that came after it. git log no longer shows those commits, because no branch can reach them. To see where you have been, run git reflog, which prints the commits HEAD has pointed to along with the action that moved it. The reflog can be used to find the hash of the lost commit, and git branch <recover_branch_name> <commit hash> then creates a new branch pointing at that commit, which makes the two lost commits reachable again.
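Git keeps the reflog for HEAD as a plain-text file at .git/logs/HEAD, with one line per movement: the old hash, the new hash, the identity and timestamp, then a tab and a message. As a rough sketch of what git reflog reads, the snippet below parses those lines and lists every commit HEAD has pointed at, most recent first. The names ReflogEntry, parse_reflog_line and visited_commits are my own.

```python
from typing import List, NamedTuple


class ReflogEntry(NamedTuple):
    old_sha: str  # where the ref pointed before the action
    new_sha: str  # where it pointed afterwards
    message: str  # e.g. "commit: ..." or "reset: moving to ..."


def parse_reflog_line(line: str) -> ReflogEntry:
    # Everything before the tab is "<old> <new> <name> <email> <time> <tz>".
    meta, _, message = line.rstrip("\n").partition("\t")
    old_sha, new_sha = meta.split(" ", 2)[:2]
    return ReflogEntry(old_sha, new_sha, message)


def visited_commits(lines: List[str]) -> List[str]:
    """Return every commit the ref has pointed at, most recent first."""
    seen: List[str] = []
    for line in reversed(lines):
        sha = parse_reflog_line(line).new_sha
        if sha not in seen:
            seen.append(sha)
    return seen
```

Any hash in that list that no longer appears in git log is a candidate to pass to git branch <recover_branch_name> <commit hash>.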

Debugging

# GIT_TRACE controls general traces, which don’t fit into any specific category. This includes the expansion of aliases and delegation to other sub-programs. (lga is assumed to be a user-defined alias here, so the trace shows its expansion.)
$ GIT_TRACE=true git lga
# GIT_TRACE_PACK_ACCESS controls tracing of packfile access. The first field is the packfile being accessed, the second is the offset within that file.
$ GIT_TRACE_PACK_ACCESS=true git status
# GIT_TRACE_PACKET enables packet-level tracing for network operations.
$ GIT_TRACE_PACKET=true git ls-remote origin
# GIT_TRACE_PERFORMANCE controls the logging of performance data. The output shows how long each particular git invocation takes.
$ GIT_TRACE_PERFORMANCE=true git gc
# GIT_TRACE_SETUP shows information about what Git is discovering about the repository and environment it’s interacting with.
$ GIT_TRACE_SETUP=true git status