Git Objects 101: The Fundamentals of Git's Internal Architecture (Part 1 of 3)
Many of us use Git on a daily basis but how many of us know what goes on under the hood? For example:
What happens when we run git commit?
What are Git branches?
What are git trees and blobs?
What is SHA and why is it used?
In this three-part series, I'll try to answer each of these questions briefly.
Introduction
In this blog, I'll cover the three main objects, namely Blob, Tree and Commit. Let us start by thinking of Git as a repository for maintaining a file system, and more specifically snapshots of that file system.
A file system usually begins with a root directory that contains files and other directories; these directories contain further directories, and so on.
Blob
In Git, the contents of files are stored in objects called blobs (Binary Large Objects). The difference between blobs and files is that files carry metadata: a file remembers, for example, when it was created. Blobs, on the other hand, are just content, binary streams of data (for example, the bytes F3 A2 45 9D).
A blob doesn't record its creation date, its name or basically anything but its content. Every blob in Git is identified by its SHA-1 hash (we'll look at SHA-1 and hashing algorithms in Part 2 of this series). A SHA-1 hash consists of 20 bytes, usually represented by 40 characters in hexadecimal form. In this blog, I'll represent each hash only by its first 5 characters.
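To make the idea concrete, here is a minimal Python sketch of how a blob's ID can be computed. It relies on the fact that Git hashes a short header ("blob", the content size and a zero byte) followed by the raw content. The function name git_blob_hash and the sample variables are my own choices for illustration, not part of Git. Also note that the short labels used in this post, such as K81R4, are illustrative; real IDs contain only the characters 0-9 and a-f.

```python
import hashlib

def git_blob_hash(content: bytes) -> str:
    """Return the SHA-1 object ID Git would assign to a blob with this content."""
    # Git hashes a small header ("blob <size>\0") followed by the raw bytes.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()

# The file name never enters the hash: two files with identical
# content map to the same blob ID.
blog_txt = b"Showwcase"
copy_txt = b"Showwcase"

full_hash = git_blob_hash(blog_txt)
print(full_hash)       # 40 hexadecimal characters (20 bytes)
print(full_hash[:5])   # the abbreviated form used in this post
print(git_blob_hash(copy_txt) == full_hash)  # True
```

If you have Git installed, you can compare the result with the output of git hash-object for a file holding the same bytes.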
Tree
In Git, the equivalent of a directory is a tree. A tree is basically a directory listing that refers to blobs as well as other trees. Trees are also identified by their SHA-1 hashes. Any reference to one of these objects (blobs, trees and commits) is made via the object's SHA-1 hash. Note that the tree ABCD5 points to the blob K81R4 as photo.png and to the blob J72I5 as blog.txt.
This diagram is equivalent to a file system whose root directory contains two files, photo.png and blog.txt.
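The tree above can be sketched in Python as well. A tree object is a list of entries, each made of a mode, a name, a zero byte and the 20 raw bytes of the referenced object's hash. The sketch below makes some simplifying assumptions: it handles regular files only (mode 100644), it sorts entries by plain byte order of their names, and the content of photo.png is a stand-in. The helper names hash_object and tree_body are hypothetical.

```python
import hashlib

def hash_object(kind: bytes, body: bytes) -> str:
    """Hash "<kind> <size>\\0" + body, the way Git derives object IDs."""
    header = kind + b" " + str(len(body)).encode("ascii") + b"\0"
    return hashlib.sha1(header + body).hexdigest()

def tree_body(entries: dict) -> bytes:
    """Serialize {file name: blob ID} into a tree body (regular files only)."""
    body = b""
    for name in sorted(entries, key=lambda n: n.encode()):
        # Each entry is "<mode> <name>\0" followed by the 20 raw hash bytes.
        body += b"100644 " + name.encode() + b"\0" + bytes.fromhex(entries[name])
    return body

photo = hash_object(b"blob", b"stand-in image bytes")  # placeholder content
blog = hash_object(b"blob", b"Showwcase")

root = hash_object(b"tree", tree_body({"photo.png": photo, "blog.txt": blog}))
print(root[:5])  # short ID of the root tree
```

Notice that the file names live in the tree, not in the blobs. Renaming a file changes the tree's hash but leaves the blob untouched.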
Now it's time to take a snapshot of that file system and store all the files that existed at that time along with their contents.
Commit
In Git, a snapshot is a commit. A commit object includes a pointer to the main tree, which is the root directory. The commit also stores metadata such as the commit author's name, a commit message and the commit time. Of course, commit objects are also identified by their SHA-1 hashes; these are the hashes we are used to seeing when we use git log.
Note that every commit stores the entire snapshot, not just differences from the previous commit.
How does that work? Wouldn't that mean that we have to store a lot of data on every single commit? Well, let's examine what happens when we change the contents of a file.
Say that we edit blog.txt and add .com to it, that is, we change the content from Showwcase to Showwcase.com.
This change means that we have a new blob with a new SHA-1 hash. This makes sense, as the content of the new blog.txt, Showwcase.com, is different from the previous content, Showwcase. Since we have a new hash, the tree listing must also change: our tree no longer points to J72I5 but to the new blob with hash C75N6. Changing the tree's content changes its hash too, so we now have a new root tree and we are almost ready to create a new commit object. It seems like we are going to store the entire file system once more, but is that really necessary?
Actually, some objects, specifically the blob for photo.png, haven't changed since the last commit. As long as an object doesn't change, we don't store it again. In this case, we don't need to store the blob K81R4 once more.
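Here is a small Python sketch of why unchanged objects are not stored twice. It models the object database as a plain dictionary keyed by hash, which is my simplification and not how Git lays out its storage on disk. The names store and snapshot, as well as the image bytes, are made up for the example.

```python
import hashlib

def blob_id(content: bytes) -> str:
    """Compute a blob's ID from "blob <size>\\0" + content."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()

store = {}  # object ID -> content; a stand-in for Git's object database

def snapshot(files: dict) -> dict:
    """Store each file's content under its hash and return {name: object ID}."""
    listing = {}
    for name, content in files.items():
        oid = blob_id(content)
        store.setdefault(oid, content)  # objects already present are not stored again
        listing[name] = oid
    return listing

first = snapshot({"photo.png": b"stand-in image bytes", "blog.txt": b"Showwcase"})
second = snapshot({"photo.png": b"stand-in image bytes", "blog.txt": b"Showwcase.com"})

print(first["photo.png"] == second["photo.png"])  # True
print(first["blog.txt"] == second["blog.txt"])    # False
print(len(store))                                 # 3
```

Two snapshots of two files each end up as three stored blobs: one photo.png and two versions of blog.txt.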
At this point, we can create our commit object. Since this commit is not the first commit, it has a parent: the commit B4848, which we created earlier.
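As a rough sketch, a commit object is a small piece of text: a tree line, zero or more parent lines, author and committer lines, a blank line, and the message. The Python below builds two such commits so that the second one names the first as its parent. The author identity, the timestamps and the two tree IDs are placeholders I invented for the example, and the helper names are hypothetical.

```python
import hashlib

def hash_object(kind: bytes, body: bytes) -> str:
    """Hash "<kind> <size>\\0" + body, the way Git derives object IDs."""
    header = kind + b" " + str(len(body)).encode("ascii") + b"\0"
    return hashlib.sha1(header + body).hexdigest()

def commit_body(tree: str, parents: list, author: str, when: int, message: str) -> bytes:
    """Build the text of a commit object (timezone fixed to +0000 for simplicity)."""
    lines = ["tree " + tree]
    lines += ["parent " + p for p in parents]  # the first commit has no parent line
    lines.append(f"author {author} {when} +0000")
    lines.append(f"committer {author} {when} +0000")
    return ("\n".join(lines) + "\n\n" + message + "\n").encode()

# Placeholder tree IDs; in a real repository these are the root trees' hashes.
old_tree = hashlib.sha1(b"stand-in for the first root tree").hexdigest()
new_tree = hashlib.sha1(b"stand-in for the second root tree").hexdigest()
author = "A. Author <author@example.com>"  # hypothetical identity

first_body = commit_body(old_tree, [], author, 1700000000, "Add blog and photo")
first = hash_object(b"commit", first_body)

second_body = commit_body(new_tree, [first], author, 1700000600, "Add .com to blog.txt")
second = hash_object(b"commit", second_body)

print(second_body.decode())
print(second[:5])  # short ID of the new commit
```

Because the parent's hash is part of the commit's content, changing anything in an earlier commit would change the hash of every commit that follows it.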
Conclusion
We learned about three Git objects:
Blob - It consists of the contents of a file.
Tree - A directory listing of Blobs and Trees.
Commit - It is a snapshot of the working tree.
In this first part of the series, we covered the basic objects of Git. In the next part, we'll learn about the SHA-1 hash algorithm and understand how it works inside Git. After that, we'll look at branches and how they relate to the terms we covered in this blog.