Git SHA 101: The Fundamentals of Git's Internal Architecture (Part 2 of 3)

In the previous part, we covered objects and got acquainted with the strings of hexadecimal digits known as hashes. Our aim in this post is to explain what a hash is, specifically SHA-1, what it does, and how Git makes use of it.

How it all began

Cryptography is the science of keeping secrets. The word comes from the Greek kryptos ("hidden") and graphein ("to write"), roughly translating to secret writing. To make information secret, you use a cipher: an algorithm that converts plain text into ciphertext, which looks like gibberish to anyone without the key.

Ciphers have been used long before computers showed up. Julius Caesar used what's now called a Caesar Cipher to encrypt private correspondence. He would shift the letters in the message forward by 3 letters. So A became D and the word Crypto became FUBSWR.
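The shift described above is easy to try out. Here is a minimal sketch in Go, assuming plain ASCII letters only; the function name caesarShift is made up for this illustration.

```go
package main

import "fmt"

// caesarShift rotates each ASCII letter forward by shift positions,
// wrapping around the alphabet. Other characters pass through unchanged.
func caesarShift(text string, shift int) string {
	shift = ((shift % 26) + 26) % 26 // normalise so negative shifts work too
	out := make([]rune, 0, len(text))
	for _, r := range text {
		switch {
		case r >= 'a' && r <= 'z':
			r = 'a' + (r-'a'+rune(shift))%26
		case r >= 'A' && r <= 'Z':
			r = 'A' + (r-'A'+rune(shift))%26
		}
		out = append(out, r)
	}
	return string(out)
}

func main() {
	secret := caesarShift("CRYPTO", 3)
	fmt.Println(secret)                  // FUBSWR
	fmt.Println(caesarShift(secret, -3)) // CRYPTO
}
```

Notice that shifting back by 3 recovers the message. A cipher is meant to be reversible by whoever knows the key, which is exactly what a hash is not.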

Hashing is a close relative of encryption. It takes some bytes of data and scrambles them with an algorithm so that it is practically impossible to work out the original data from the result. Unlike a cipher, a hash has no key and is not meant to be reversed.

Hashing Algorithm

The word Hash essentially means Mix or Scramble. Let's understand the term through an analogy:

Imagine a new deck of cards. You write a step-by-step procedure for shuffling them. The end result is a mixed-up deck of cards. If you followed the same procedure on every new, identically ordered deck, you would get the same result every time.

A hash function is like shuffling a deck of cards, except that you start with some input and pass it to the hashing function. The function returns a fixed-length value that looks like alphanumeric gibberish, no matter how long the input was.

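The important thing here is that the same input will always produce the same output, just like the deck of cards. However, it is very difficult for a computer to reverse engineer the hash and find out what the original message was. Hence, developers use hashes to check passwords and other secrets without storing them in plain text. (In practice, password storage relies on salted, deliberately slow algorithms such as bcrypt or Argon2 rather than a bare SHA-1.)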

Using the SHA-1 algorithm in Go:

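package main

import (
    "crypto/sha1"
    "encoding/hex"
    "fmt"
)

func main() {
    password := "Password123"

    // sha1.New returns a hash.Hash; Write feeds it the input bytes.
    hasher := sha1.New()
    hasher.Write([]byte(password))

    // Sum(nil) returns the 20-byte digest, which we hex-encode
    // into the familiar 40-character string.
    sha1Hash := hex.EncodeToString(hasher.Sum(nil))

    fmt.Printf("'Password123' in hashed format is: %s\n", sha1Hash)
}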

You can run the code here: https://go.dev/play/p/OIgkmeDJr1I

SHA-1 in Git

SHA-1 ("Secure Hash Algorithm 1") is a cryptographic hash function that takes an input and produces a 160-bit (20-byte) hash value known as a message digest, usually written as 40 hexadecimal characters. SHA-1 was developed as an improvement on the original algorithm, SHA-0. Hash algorithms like these are well suited to validating data integrity, since even a small change to the data results in a completely different hash. Git uses this digest as the name of every object it stores.
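To see those properties for yourself, here is a small sketch in Go. The helper name sha1Hex is made up for this example; the inputs are arbitrary strings.

```go
package main

import (
	"crypto/sha1"
	"encoding/hex"
	"fmt"
)

// sha1Hex returns the SHA-1 digest of s as a hex string.
func sha1Hex(s string) string {
	sum := sha1.Sum([]byte(s)) // sum is a [20]byte, i.e. 160 bits
	return hex.EncodeToString(sum[:])
}

func main() {
	a := sha1Hex("hello world")
	b := sha1Hex("hello world") // same input again
	c := sha1Hex("hello worle") // one character changed

	fmt.Println(a, len(a)) // the digest and its length in hex characters
	fmt.Println(a == b)    // same input, same digest
	fmt.Println(a == c)    // tiny change, different digest
	fmt.Println(c)
}
```

Compare the first and last lines of the output: changing a single character produces a digest that shares no obvious pattern with the original.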

The following exercise will help you understand SHA in Git:

  • I'll start by opening two different terminal windows to create a directory in one window and watch how Git works in real time in another window.

To follow along, the commands are the same in any UNIX-based terminal. If you are on Windows, you can use Git Bash.

In the left terminal, I created a directory by running mkdir git-sha. In the right terminal, I navigated into that directory with cd git-sha and then ran watch -n 1 -d find . to observe its contents, refreshed every second with changes highlighted.

  • Next, I'll initialize a repository in the directory we just created by running git init. Git basically creates a database in my local project to manage the changes I make to the file contents.

As soon as I run git init, you can see on the right side that it creates the initial layout of the .git directory, including sample hook templates, which I'm going to get rid of by running rm .git/hooks/*.sample

  • Alright, let's see what happens when we create a file.

  • You can see on the right side that Git wrote an object to disk under .git/objects (this happens when the file is staged with git add). Let's take a look at the contents inside that object.

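If you remember what we discussed in the previous part of this series, a blob (a file's contents) is just a stream of binary data, and that's what we see here. Git also compresses every object with zlib before writing it, which is why printing the file directly shows unreadable bytes.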
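To make the "binary stream" less mysterious, here is a sketch in Go of how a loose object is laid out: a short header of the form "<type> <size>", a NUL byte, then the contents, all zlib-compressed. The sketch works entirely in memory rather than reading the repository from this exercise, and the function names and the sample content are made up for illustration.

```go
package main

import (
	"bytes"
	"compress/zlib"
	"fmt"
	"io"
)

// encodeObject builds "<type> <size>\x00<content>" and zlib-compresses it,
// mimicking how a loose object is stored under .git/objects.
func encodeObject(objType string, content []byte) ([]byte, error) {
	var buf bytes.Buffer
	zw := zlib.NewWriter(&buf)
	if _, err := fmt.Fprintf(zw, "%s %d\x00", objType, len(content)); err != nil {
		return nil, err
	}
	if _, err := zw.Write(content); err != nil {
		return nil, err
	}
	if err := zw.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// decodeObject inflates the bytes and splits the header from the content.
func decodeObject(compressed []byte) (string, []byte, error) {
	zr, err := zlib.NewReader(bytes.NewReader(compressed))
	if err != nil {
		return "", nil, err
	}
	defer zr.Close()
	raw, err := io.ReadAll(zr)
	if err != nil {
		return "", nil, err
	}
	header, content, found := bytes.Cut(raw, []byte{0})
	if !found {
		return "", nil, fmt.Errorf("missing NUL after object header")
	}
	return string(header), content, nil
}

func main() {
	stored, err := encodeObject("blob", []byte("hello\n"))
	if err != nil {
		panic(err)
	}
	header, content, err := decodeObject(stored)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%q %q\n", header, content) // "blob 6" "hello\n"
}
```

Pointing decodeObject at the bytes of a real file under .git/objects should give the same kind of result, though that part is left for you to try.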

  • To see the actual content of the file, we have to use git cat-file -p a733b9. "a733b9" is the first six characters of the object's hash; Git accepts any prefix that is unambiguous.

What's interesting is that the object on disk, named only by its SHA, holds just the raw contents. Nowhere does it mention the file name or its metadata. At this point the file name is recorded in .git/index, the file that serves as the staging area.
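Since the object holds nothing but contents, its name depends on nothing but contents. Here is a sketch in Go of how Git derives a blob's ID and where it stores the object: the SHA-1 of "blob <size>", a NUL byte, and the contents, with the first two hex characters used as a directory name. The function names are made up for this example, and the input is an arbitrary string, not the file from the exercise.

```go
package main

import (
	"crypto/sha1"
	"encoding/hex"
	"fmt"
	"path"
)

// blobID returns the object ID Git assigns to a blob with this content:
// the SHA-1 of "blob <size>\x00" followed by the content bytes.
func blobID(content []byte) string {
	h := sha1.New()
	fmt.Fprintf(h, "blob %d\x00", len(content))
	h.Write(content)
	return hex.EncodeToString(h.Sum(nil))
}

// objectPath maps an object ID to its loose-object location:
// the first two hex characters name a directory, the other 38 the file.
func objectPath(id string) string {
	return path.Join(".git", "objects", id[:2], id[2:])
}

func main() {
	id := blobID([]byte("hello world\n"))
	fmt.Println(id)             // 3b18e512dba79e4c8300dd08aeb37f8e728b8dad
	fmt.Println(objectPath(id)) // .git/objects/3b/18e512dba79e4c8300dd08aeb37f8e728b8dad
}
```

You can cross-check the first line with echo 'hello world' | git hash-object --stdin. Two files with identical contents get the same ID, whatever they are called, so Git stores them only once.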

  • For the final step, we'll commit the changes and see what happens.

Now we've got 3 objects on the screen, two of which were created by the commit we just made: a tree object and a commit object. If we check the type of one of the new objects by running git cat-file -t c4b5e1, we see that it is a tree. It represents the top-level directory of the project and contains an entry for the foo.txt file we added earlier.

Conclusion

In the previous post, we learned about a few terms that you should now be able to understand more clearly. We also watched the local Git database in real time and learned how it operates under the hood through an exercise.

You now have an understanding of the key parts of SHA in Git: what it is, where it is used, and how you use it. That should be more than enough for now to show why hashes matter. Now that you have learned some techniques, you are ready to dig deeper on your own! Stay tuned for the next post, where we'll learn about Git branches :)

Did you find this article valuable?

Support Hamees Sayed by becoming a sponsor. Any amount is appreciated!