Unravelling Git’s Magic
By understanding how git stores file versions, you might get better insight into how git’s magic works for version control.
I was dealing with some messy merges. I wanted to better understand what happens in a git merge. Some old commits were getting lost.
I remembered that Linus Torvalds refers to his git repositories as “git trees”, like in his discussion about merges. How git merges two “trees” of files and folders is probably a worthwhile concept to try to grasp.
Files, commits, and other data are just real files
Mathieu Martin said the following in a comment on his post “The illustrated guide to recovering lost commits with Git”:
Once you begin to understand how all of those commits are just independent bits of data, and that branches, tags, and other refs (e.g. stash@{0}) are just pointers to those bits, all of that is less scary.
Hmmm… I wondered what he was talking about. I already knew that branches are just pointers to certain commits. But what did he mean by “independent bits of data”?
Git has a folder of objects
Git has an “object store”.
Have a look for yourself in a repo: .git/objects. That folder is the object store.
Inside that folder, there are a range of other folders. Most of them are named with two hexadecimal characters (the first byte of an object’s hash). Inside each of those folders are binary files.
A commit is a file in the object store
Let’s go to your repo and run $ git show.
You’ll be given the details of your last commit.
Make note of the commit’s hash (in my case, the hash was 18fd395d9674885c3012e5cfb102489023fb52a0).
Do you see that my commit’s hash begins with 18? That means the following file was created:
.git/objects/18/fd395d9674885c3012e5cfb102489023fb52a0
You can have a look for yourself for your own commit. (The trick is that the first two characters of the hash define the sub-folder the commit file is stored in, and the rest of the hash is used as the filename.)
That file in the object store is what describes the commit. Unfortunately, it’s compressed (with zlib), so we can’t just have a look at the text in that commit file.
When you run $ git show, git decompresses the commit file for you and shows its contents in a readable form, usually followed by the diff for that commit.
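If you’d like to peek inside one of those files without git’s help, here is a minimal Python sketch. It assumes the object is still stored “loose” (one file per object) and that the repo uses the default SHA-1 hashes; the function names are my own, not part of git.

```python
import os
import zlib


def loose_object_path(git_dir, sha):
    """Where a loose object lives: objects/<first 2 chars>/<remaining 38 chars>."""
    return os.path.join(git_dir, "objects", sha[:2], sha[2:])


def read_loose_object(git_dir, sha):
    """Decompress a loose object and split it into (type, body)."""
    with open(loose_object_path(git_dir, sha), "rb") as f:
        raw = zlib.decompress(f.read())
    # The decompressed data starts with a header such as b"commit 231",
    # then a NUL byte, then the body itself.
    header, _, body = raw.partition(b"\x00")
    obj_type, size = header.split(b" ")
    if int(size) != len(body):
        raise ValueError("size in header does not match body length")
    return obj_type.decode(), body


# Example (run from the top of a repo, with one of your own hashes):
# obj_type, body = read_loose_object(".git", "18fd395d9674885c3012e5cfb102489023fb52a0")
# print(obj_type)
# print(body.decode())
```

For a commit, the body is plain text, which is roughly what $ git show formats for you.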
Your files are also stored in the object store
I knew all along that commits have hashes. But I didn’t realise that files themselves have hashes:
$ touch testing.txt
$ echo "Testing" > testing.txt
$ git add testing.txt
$ git hash-object testing.txt
73709ba6866a30a566a38ca40aa81d5f0928bce0
$ ls .git/objects/73/
709ba6866a30a566a38ca40aa81d5f0928bce0 # A file with compressed contents of testing.txt!
Git names a file object by the hash of the file’s contents (well, a small header plus its contents). The filename isn’t part of it. If the contents don’t change, the hash doesn’t change.
The git hash-object command I ran above gave me git’s hash of the file testing.txt that’s in my repo.
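The “header plus contents” idea is small enough to reproduce by hand. Here’s a sketch in Python, assuming the default SHA-1 object format (repos created with SHA-256 will give different hashes); git_blob_hash is a name I made up.

```python
import hashlib


def git_blob_hash(contents: bytes) -> str:
    """Hash file contents the way git names a file object (a "blob")."""
    # The header is the word "blob", a space, the size in bytes, and a NUL byte.
    header = b"blob " + str(len(contents)).encode("ascii") + b"\x00"
    return hashlib.sha1(header + contents).hexdigest()


# Only the bytes matter, never the filename.
print(git_blob_hash(b""))           # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 (the empty file)
print(git_blob_hash(b"Testing\n"))  # the bytes that `echo "Testing" > testing.txt` writes
```

The second line should match what $ git hash-object testing.txt prints for the same contents.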
You can run $ git show on a file’s hash, just like you can with a commit’s hash:
$ git show 73709ba6866a30a566a38ca40aa81d5f0928bce0
Testing
git show is more general than just describing a commit. It’s for describing any “object” stored in the object store.
The object store, in effect, is a key-value data store.
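To make the key-value idea concrete, here’s a toy in-memory version in Python. It is only an illustration: the ObjectStore class and its methods are hypothetical, and the real store keeps its values as files on disk rather than in a dict.

```python
import hashlib
import zlib


class ObjectStore:
    """Toy stand-in for .git/objects: the key is a hash, the value is compressed bytes."""

    def __init__(self):
        self._objects = {}

    def put(self, obj_type: str, body: bytes) -> str:
        data = obj_type.encode() + b" " + str(len(body)).encode() + b"\x00" + body
        key = hashlib.sha1(data).hexdigest()
        self._objects[key] = zlib.compress(data)
        return key

    def get(self, key: str) -> bytes:
        data = zlib.decompress(self._objects[key])
        return data.split(b"\x00", 1)[1]  # drop the header, keep the body

    def __len__(self):
        return len(self._objects)


store = ObjectStore()
first = store.put("blob", b"Testing\n")
second = store.put("blob", b"Testing\n")  # same contents, so the same key
print(first == second)
print(len(store))
```

Because the key is derived from the value, storing identical contents twice leaves just one entry in the store.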
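Later on, once a repo has some history, it won’t always be so easy to take an object’s hash and find it by looking in the object store under that folder. That isn’t because git stores diffs: every time you commit a change to a file, git stores the complete new contents as a new object, named by the hash of those contents. What does happen is that git occasionally packs its loose objects together into packfiles under .git/objects/pack (for example when $ git gc runs). Inside a packfile, similar objects can be stored as deltas against each other to save space. A packed object no longer has its own file in a two-character folder, but $ git show on its hash still works.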
Trees have feelings too, you know
Just like commit objects and file objects, git has tree objects. This is where it gets interesting for me, since the concept of a tree directly affects your understanding of merges.
A git tree is an object that describes the filesystem at a certain point.
It’s just a list of pointers and filenames. Each file is already stored in the object store. So the tree stores a reference to the hash of the file at this point in time, plus the filename given to that file. A tree will also store the hashes of its sub-trees (which reflect sub-folders).
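Here’s a rough Python sketch of how such a list turns into a tree object’s hash. It’s deliberately limited: it assumes SHA-1, handles only regular files (mode 100644) in a single folder, and leaves out sub-trees, executables and symlinks. The function name is mine.

```python
import hashlib


def git_tree_hash(entries):
    """entries: list of (filename, blob_hash_hex) pairs for regular files in one folder."""
    body = b""
    for name, sha in sorted(entries, key=lambda e: e[0].encode()):
        # Each entry is "<mode> <name>", a NUL byte, then the 20 raw bytes of the hash.
        body += b"100644 " + name.encode() + b"\x00" + bytes.fromhex(sha)
    header = b"tree " + str(len(body)).encode("ascii") + b"\x00"
    return hashlib.sha1(header + body).hexdigest()


empty_blob = "e69de29bb2d1d6434b8b29ae775ad8c2e48c5391"
print(git_tree_hash([]))  # 4b825dc642cb6eb9a060e54bf8d69288fbee4904 (the empty tree)
print(git_tree_hash([("testing.txt", empty_blob)]))
print(git_tree_hash([("renamed.txt", empty_blob)]))  # same file contents, different tree
```

Notice that the filename lives in the tree, not in the file object, so renaming a file changes the tree’s hash while the file’s own hash stays the same.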
When you create a commit, a new git tree is also stored in the object store. The commit file stores a reference to that git tree file. The git tree is referenced by its own hash.
You can see this for yourself:
$ git show HEAD --pretty=raw
commit 4c4fcc9576378ae61ab4cb427b1cee2124bf2ff4
tree 8ec25d16e4e5830b89109794fc7bc68f32aaa51d
parent
[...]
$ git show 8ec25d16e4e5830b89109794fc7bc68f32aaa51d # tree hash
[list of files and directories in this tree]
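The raw commit text is simple enough to pick apart yourself. Below is a small Python sketch that pulls out the tree and parent hashes. The parse_commit helper is hypothetical, and the sample commit is made up for illustration (only the tree hash is the one from my output above; the parent hashes, author and message are placeholders).

```python
def parse_commit(raw: str):
    """Return (tree_hash, [parent_hashes]) from a commit's raw header lines."""
    tree, parents = None, []
    for line in raw.splitlines():
        if not line:  # a blank line ends the headers; the message follows
            break
        key, _, value = line.partition(" ")
        if key == "tree":
            tree = value
        elif key == "parent":
            parents.append(value)
    return tree, parents


# Made-up sample: placeholder parents, author and message.
sample = """\
tree 8ec25d16e4e5830b89109794fc7bc68f32aaa51d
parent 1111111111111111111111111111111111111111
parent 2222222222222222222222222222222222222222
author A. Hacker <a@example.com> 1300000000 +0000
committer A. Hacker <a@example.com> 1300000000 +0000

Merge branch 'topic'
"""

tree, parents = parse_commit(sample)
print(tree)
print(len(parents))
```

A commit always names exactly one tree, and zero or more parents.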
So we’ve covered that the object store holds commits, actual file contents (which git calls blobs) and trees. Annotated tags are the fourth type of object kept in the store.
What this helps explain
- The existence of git tree objects explains why Linus was referring to the contents of his repo as a tree. Any given commit he has on his machine points to a tree object, which can be used to get the contents of all the files at that point in time.
- That commits are just structured descriptions with pointers to other objects in the store
- That merge commits are simply commits that have pointers to more than one parent commit
- That merge commits point to a tree object. The tree object being pointed to is the tree object that resulted from the merge of two or more tree objects as referenced by the merged commits.
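To make that last point a little more concrete, here is a deliberately simplified Python sketch of a three-way merge between trees. This is not git’s actual merge machinery: it assumes flat trees given as {filename: file_hash} dicts, ignores sub-trees and renames, and simply flags a conflict where git would go on to try merging the file contents. The function name and the short fake hashes are my own.

```python
def merge_trees(base, ours, theirs):
    """Three-way merge of flat trees given as {filename: file_hash} dicts.

    Returns (merged_tree, conflicting_filenames). A deleted file is a missing key.
    """
    merged, conflicts = {}, []
    for name in sorted(set(base) | set(ours) | set(theirs)):
        b, o, t = base.get(name), ours.get(name), theirs.get(name)
        if o == t:        # both sides agree (including both deleting the file)
            result = o
        elif o == b:      # only their side touched it
            result = t
        elif t == b:      # only our side touched it
            result = o
        else:             # both sides changed it in different ways
            conflicts.append(name)
            continue
        if result is not None:
            merged[name] = result
    return merged, conflicts


base = {"a.txt": "a1", "b.txt": "b1", "c.txt": "c1"}
ours = {"a.txt": "a2", "b.txt": "b1", "c.txt": "c1"}    # we edited a.txt
theirs = {"a.txt": "a1", "b.txt": "b2", "d.txt": "d1"}  # they edited b.txt, deleted c.txt, added d.txt
print(merge_trees(base, ours, theirs))
# ({'a.txt': 'a2', 'b.txt': 'b2', 'd.txt': 'd1'}, [])
```

The base here stands for the tree of the commit both branches share as an ancestor, and the merged dict is what the merge commit’s tree would describe.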
This leaves me wanting to better understand how a merge of git trees actually works.
I’ll finish with a note from that same Linus discussion linked to above:
You need to understand what the impact of a merge is – and that while git makes merging technically pretty damn trivial most of the time, a merge should still be a big deal, and something you think about.