Building Your Own Git from Scratch in Go

As a professional software engineer, you probably use Git every day, since it is the most widely used version control system. Many people believe that Git is very complex software. I did, too. But it isn’t. It’s a big piece of software with many features at our disposal, but its core is simple. So, how did I come to this conclusion? Well, I recently came across the “Build your own Git” course on Codecrafters.

At first, the challenge seemed difficult. But after reading about the Git internals, I coded through it effortlessly. I am writing this article to share what I have learnt and to demystify a complex topic so that beginners do not feel overwhelmed by Git’s internals.

Introduction

This article is the first part of a series on how to build your own Git in Go. By “building your own Git,” we mean that this article will teach you the internals of some Git commands, then show you how to mimic their behaviour in your own code and verify it with test cases.

If you are a Python enthusiast, I highly suggest you check out Thibault’s guide to “Write Yourself a Git”.

But even if you do not write Go or Python, give it a shot! The logic behind Git is universal, and learning how it works might inspire you to apply the concepts in your preferred language. We are going to cover three basic Git commands in this article:

  • git init

  • git cat-file

  • git hash-object

This article focuses on the tests for the custom implementation of specific Git commands. While the tests for the first two commands (init and cat-file) are available in the free trial of Codecrafters, this article also shows how to write your own tests and implement the functionality in Go for an additional command (hash-object).

We’ll cover more commands in future articles. So, let’s get started!

Prerequisites

  • A basic understanding of Git

  • Familiarity with any programming language (preferably Go, but not mandatory)

  • A GitHub account for creating your profile on Codecrafters

git init

According to the official Git documentation, git init will “create an empty Git repository or reinitialize an existing one.”

When we run git init in our project, it creates a .git directory under the project’s root directory. The name is .git and not git because names that start with a dot (.) are treated as hidden. That means the .git directory usually won’t show up in your code editor, nor when you run a plain ls in your terminal (ls -a does list it).
But if you cd into this directory, you’ll see that it contains subdirectories such as objects, refs/heads, and refs/tags.

The objects directory is Git’s object store. It holds the blob objects that Git creates from the contents of our files (along with other object types that are outside the scope of this article). We’ll learn more about this directory when we discuss the cat-file and hash-object commands.
The other significant directory, refs, has the subdirectories heads and tags, whose files hold references (branches and tags) that point to specific commits in the repository’s history.

Okay, let’s jump into the code!

Start by signing up for this challenge on Codecrafters and follow the steps to set up the repository.
Once you have completed the initial steps of the Codecrafters challenge, you should have a working directory set up. I’d suggest you go through the code. If you look into the file cmd/mygit/main.go, you’ll find that the code for the init command is commented out in the initial setup.

Alright, let’s walk through the code of the init command.

In your main() function, you can see an if condition that reports an error when the length of os.Args is less than 2. When we run a Go program, the first element of os.Args is always the path of the program itself. You can verify this by adding a print statement like this at the start of your main() function:
fmt.Println("ARGS:", os.Args[0])
You should see something like this when your tests run:
ARGS: /var/folders/wm/28kg3mj97fd6_6dmw_ffytvw0000gn/T/go-build87834999/b001/exe/main

That should clarify what os.Args is. The first element (index 0) is the program name, and the second element (os.Args[1]) is the actual command that we want to run. In this case, it’s init. If we want to run the program from our terminal, we can type this:
go run main.go init

Alright, let’s dive into the switch case.

case "init":
  for _, dir := range []string{".git", ".git/objects", ".git/refs"} {
   if err := os.MkdirAll(dir, 0755); err != nil {
    fmt.Fprintf(os.Stderr, "Error creating directory: %s\n", err)
   }
  }

The first thing we do is loop through a slice of strings containing the Git directories that we talked about earlier. The range keyword iterates over the slice, returning the index, which we ignore using the blank identifier ( _ ), and the actual element, which we store in the dir variable.
For each element, we use the os package from the standard library to create the required directory with the MkdirAll function.

According to the official Go documentation, MkdirAll creates a directory named path, along with any necessary parents, and returns nil or an error. The permission bits perm (before umask) are used for all directories that MkdirAll creates. If the path is already a directory, MkdirAll does nothing and returns nil.

By the way, here’s a little trick. If you have Go installed, type this command to get the docs in your terminal:
go doc <package name> <function name>
For example,
go doc os MkdirAll
It prints the above definition in your terminal.

The second argument that MkdirAll takes is the file permissions bit. In Linux, 0755 permission gives the owner of a file or directory the ability to read, write, and execute it. Other users can read and execute the file or directory but cannot modify or delete it.

headFileContents := []byte("ref: refs/heads/main\n")
  if err := os.WriteFile(".git/HEAD", headFileContents, 0644); err != nil {
   fmt.Fprintf(os.Stderr, "Error writing file: %s\n", err)
  }

  fmt.Println("Initialized git directory")

The next thing we do is write the default branch of our repository to the HEAD file. As mentioned in the Codecrafters challenge, it can be either main or master. We use the WriteFile function from the standard os package to write the content to the file with the permission bits 0644, which means the owner can read and write the file, while others can only read it. Finally, we conclude this switch case by printing the statement “Initialized git directory”.

Submitting the code above will successfully pass all the required test cases.

blob

Before delving into the cat-file command, it is essential to gain a deeper understanding of Git and how it manages and stores data or files in our project. At its core, Git functions as a content-addressable filesystem. This means that Git operates as a straightforward key-value data store, allowing you to insert any content into a Git repository. In return, Git provides a unique key that can be used later to retrieve that specific content.
This unique key is generated by passing a specially formatted string through the SHA-1 algorithm. The string consists of the word ‘blob’, a space, the length of the content in bytes, a null byte, and then the actual content of the file. The structure of this string is as follows:
blob <length-of-content>\x00<content>

The result, a 160-bit (20-byte) hash value, is the message digest. This hash is represented as a 40-digit hexadecimal number that identifies the content within the repository.
Now, the first two characters of the 40-digit hash become the name of a directory inside .git/objects, and the remaining 38 characters become the name of the file that holds the compressed object.

For example, if we have a .txt file with the content Hello, World! (13 bytes, no trailing newline) in our project, the string to be hashed looks like this:
blob 13\x00Hello, World!
Hashing it should return:
b45ef6fec89518d314f546fd6c3025367b721684
That means the path of the object file will be:
.git/objects/b4/5ef6fec89518d314f546fd6c3025367b721684
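The hashing scheme described above can be sketched in a few lines of Go. The function names blobHash and objectPath are made up for this example. Note that the content follows the null byte directly, with nothing in between.

```go
package main

import (
	"crypto/sha1"
	"encoding/hex"
	"fmt"
)

// blobHash returns the 40-character hex ID Git would assign to a blob
// with the given content. (Hypothetical helper for illustration.)
func blobHash(content []byte) string {
	h := sha1.New()
	// Header: "blob", a space, the content length in bytes, then a null byte.
	fmt.Fprintf(h, "blob %d\x00", len(content))
	h.Write(content)
	return hex.EncodeToString(h.Sum(nil))
}

// objectPath maps a 40-character hash to its location in the object store:
// the first two characters name the directory, the remaining 38 the file.
func objectPath(hash string) string {
	return ".git/objects/" + hash[:2] + "/" + hash[2:]
}

func main() {
	hash := blobHash([]byte("Hello, World!"))
	fmt.Println(hash)
	fmt.Println(objectPath(hash))
}
```

You can compare the printed hash with the output of git hash-object on a file that has the same content and no trailing newline.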

Now that we’ve covered the essentials of blob objects, including how hashes are created and how the directories are organized, you should have a solid understanding of the foundational pieces of Git’s object storage. This knowledge will be crucial as we dive into the cat-file command, which allows us to interact with these objects and view their content. So, let’s get started!

case "cat-file":

If you have been following the Codecrafters challenge, the next stage is “Read a blob object”. In this stage, you’ll add support for reading a blob using the git cat-file command.

Read more about the git cat-file command here.

According to the description of the tests in the challenge, our code will be tested as follows:

  1. The tester will first use your program to initialize a new git repository and then insert a blob with random contents into the .git/objects directory.

  2. The tester will verify that the output of our program matches the contents of the blob.

The test will run our program like this:

$ /path/to/your_program.sh cat-file -p 3b18e512dba79e4c8300dd08aeb37f8e728b8dad

In the command, we can see the hash of the file whose content needs to be output. The -p flag indicates that file content should be pretty-printed to stdout.

Let’s dive into the code!

// Implement the cat-file command here
 // expect exactly 4 arguments: <program> cat-file -p <object-hash>
  if len(os.Args) != 4 {
   fmt.Fprintf(os.Stderr, "usage: mygit cat-file -p <object-hash>\n")
   os.Exit(1)
  }

The first thing we check is whether the length of os.Args is exactly 4. If it isn’t, we print a usage message and exit. As we read earlier, the first element in os.Args is the program name itself, followed by cat-file, -p, and the <hash>.

// check that the third argument is -p and the fourth argument looks like a valid object hash
  readFlag := os.Args[2]
  objectHash := os.Args[3]
  // either problem alone is enough to reject the input, hence || rather than &&
  if readFlag != "-p" || len(objectHash) != 40 {
   fmt.Fprintf(os.Stderr, "usage: mygit cat-file -p <object-hash>\n")
   os.Exit(1)
  }

// create the file path
  dirName := objectHash[0:2]
  fileName := objectHash[2:]
  filePath := fmt.Sprintf("./.git/objects/%s/%s", dirName, fileName)

Here, we are checking if the third and fourth arguments in the list are valid. After that, as explained, we create the file path from the provided hash.

// read the file
  fileContents, err := os.ReadFile(filePath)
  if err != nil {
   fmt.Fprintf(os.Stderr, "Error reading file: %s\n", err)
   os.Exit(1)
  }

Then, we use the ReadFile function of the standard os package to read the compressed file content.

As per the official Go documentation, ReadFile reads the named file and returns the contents. A successful call returns err == nil, not err == EOF.

// decompress the file contents
  b := bytes.NewReader(fileContents)
  r, err := zlib.NewReader(b)
  if err != nil {
   fmt.Fprintf(os.Stderr, "Error decompressing the file: %s\n", err)
   os.Exit(1)
  }

  decompressedData, err := io.ReadAll(r)
  if err != nil {
   fmt.Fprintf(os.Stderr, "Error reading decompressed data: %s\n", err)
   os.Exit(1)
  }
  r.Close()

The next step is to decompress the file contents using the compress/zlib package from Go’s standard library. We also continue to report errors at every step and exit if necessary.

// Find the index of the null terminator
nullIndex := bytes.IndexByte(decompressedData, 0)
  if nullIndex == -1 {
   fmt.Fprintf(os.Stderr, "Invalid object format: missing metadata separator\n")
   os.Exit(1)
  }

  // Extract and print the actual content (everything after the null byte)
  content := decompressedData[nullIndex+1:]
  fmt.Print(string(content))

The final step is to find the index of the null byte in our decompressed content and then print everything after it because, as we remember, every blob starts with the blob <length-of-content>\x00 prefix.

You should pass the test cases after submitting this code.
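To see the decompress-and-split logic in isolation, here is a self-contained round trip that needs no files on disk. It is only a sketch: encodeBlob and decodeBlob are names invented for this example, and encodeBlob exists purely to produce test input in the same format Git stores.

```go
package main

import (
	"bytes"
	"compress/zlib"
	"fmt"
	"io"
	"os"
)

// encodeBlob builds "blob <size>\x00<content>" and zlib-compresses it.
// (Hypothetical helper; it only produces input for decodeBlob.)
func encodeBlob(content []byte) ([]byte, error) {
	var buf bytes.Buffer
	w := zlib.NewWriter(&buf)
	if _, err := fmt.Fprintf(w, "blob %d\x00", len(content)); err != nil {
		return nil, err
	}
	if _, err := w.Write(content); err != nil {
		return nil, err
	}
	// Close flushes the remaining compressed bytes into buf.
	if err := w.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// decodeBlob decompresses an object and returns everything after the
// null byte that ends the header, as the cat-file code does.
func decodeBlob(compressed []byte) ([]byte, error) {
	r, err := zlib.NewReader(bytes.NewReader(compressed))
	if err != nil {
		return nil, err
	}
	defer r.Close()

	data, err := io.ReadAll(r)
	if err != nil {
		return nil, err
	}
	i := bytes.IndexByte(data, 0)
	if i == -1 {
		return nil, fmt.Errorf("invalid object format: missing null byte")
	}
	return data[i+1:], nil
}

func main() {
	obj, err := encodeBlob([]byte("hello world\n"))
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	content, err := decodeBlob(obj)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Print(string(content))
}
```

Keeping the parsing in a function that works on bytes, rather than on a path, makes it straightforward to unit test.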

gotest

Now, why did I choose to write tests for the next command? Well, here’s the thing: Codecrafters has an excellent premium plan that unlocks the rest of the challenges, but when writing this article, I wasn’t subscribed to it. So, instead of waiting around, I wrote my own tests to make progress and solidify my understanding.
But if you’re subscribed to the Codecrafters premium plan, you can try this challenge without writing your own tests.

I recommend this resource to begin writing tests in Go. Covering the fundamentals will be sufficient for writing tests discussed further in this article. Overall, it is an excellent resource to learn Go.

Let’s look at the description of our next challenge, “Create a blob object”. In this stage, you’ll implement support for creating a blob using the git hash-object command.
git hash-object is used to compute the SHA hash of a Git object. When used with the -w flag, it also writes the object to the .git/objects directory.

Read more about the git hash-object command here.

tests

The tests will verify that:

  • Your program prints a 40-character SHA hash to stdout

  • The file written to .git/objects matches what the official git implementation would write

For our first test, we’ll create a text.txt file containing some content. Then, we’ll run the actual Git command on the file:
git hash-object -w text.txt
along with our implementation of the same command:
go run main.go hash-object -w text.txt

In theory, both of these commands should generate the same hash, thus we’ll compare the outputs of both commands to determine whether our test case passes or fails.

After that, we’ll write our second test to check whether the blob has been created at the right location. In theory, our code should create the blob inside .git/objects but, for testing purposes, we’ll create the blob in a separate location like .mygit/objects . This is because if we try to store a blob with the same hash in our .git folder, it will skip the process as a file with the same hash must already exist. Thus, to check if our implementation is working as expected, the blob is stored at a different location.

Finally, we’ll write our third test to see if the blob content and the file content are the same.

Let’s dive into the code for our first test:

func TestHashObject(t *testing.T) {
 // Create a file with some content
 fileName := "text.txt"
 fileContents := []byte("Hello, World!")

 if err := os.WriteFile(fileName, fileContents, 0644); err != nil {
  t.Fatalf("Error writing to test file: %s\n", err)
 }

// further code snippets to be added here
}

As stated, we start by creating a text.txt file in our test function.

// more code to add in our test function
// Run the git hash-object command
 wantHash, gitErr := RunGitHashObject(fileName)
 if gitErr != nil {
  t.Fatalf("Error implementing git command: %s\n", gitErr)
 }

Here, we call the function that runs the git hash-object command on our text file and returns the resulting hash after storing the blob of our file in .git/objects . We have named the variable wantHash to compare it later with the hash generated by our code ( gotHash ).

Below is the definition of the function.

// function that runs the `git hash-object` command
func RunGitHashObject(filePath string) (string, error) {
 cmd := exec.Command("git", "hash-object", "-w", filePath)

 // Capture stdout
 var out bytes.Buffer
 cmd.Stdout = &out

 // Run the command
 err := cmd.Run()
 if err != nil {
  return "", err
 }

 // Return the raw stdout (it still includes git's trailing newline)
 return out.String(), nil
}

The function RunGitHashObject uses the os/exec package from the standard library, which runs external commands. It runs the command, captures its stdout, and returns it as a string.

Now, we’ll call a function similar to the previous one that runs our implementation of the hash-object , and we’ll name the variable for the resultant hash as gotHash .

// more code to add in our test function
// call main function with the hash-object command
 gotHash, myGitHashObjectError := RunMainFuncWithHashObject(fileName)
 if myGitHashObjectError != nil {
  t.Fatalf("Error implementing mygit command: %s\n", myGitHashObjectError)
 }

Below is the definition of that function.

// function that runs the `go run main.go hash-object` command
func RunMainFuncWithHashObject(fileName string) (string, error) {
 cmd := exec.Command("go", "run", "main.go", "hash-object", "-w", fileName)

 // Capture stdout
 var out bytes.Buffer
 cmd.Stdout = &out

 // Run the command
 err := cmd.Run()
 if err != nil {
  return "", err
 }

 // Return the raw stdout (including any trailing newline our program prints)
 return out.String(), nil
}

Visit this link to find the full implementation of my hash-object command.
The code is well-commented, making it easy to understand. However, if you’d like to implement it yourself, here are the key steps to guide you:

  1. Check if we have all the right args.

  2. Read the file content.

  3. Generate the hash using the format explained earlier (use SHA-1 to generate the hash).

  4. Write the compressed content of the blob to the file path generated from the hash, for example: .mygit/objects/<first-two-hash-chars>/<remaining-hash>.

  5. Print the generated hash to stdout.

Now that we have both wantHash and gotHash , we can write our first test case, to test whether the hash generated by our code is the same as the hash generated by git.

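t.Run("Testing Hash creation", func(t *testing.T) {
  // trim both values so a trailing newline cannot cause a false mismatch
  got, want := strings.TrimSpace(gotHash), strings.TrimSpace(wantHash)
  if got != want {
   t.Errorf("got %q want %q", got, want)
  }
 })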

Run go test in your terminal. It should pass! To confirm that the test can actually fail, try deliberately breaking it by changing the value of wantHash.

Let’s write our second test case: check whether the blob has been created at the right location.

t.Run("Testing the blob object creation", func(t *testing.T) {
  gotHash = strings.TrimSpace(gotHash)
  filePath := fmt.Sprintf(".mygit/objects/%s/%s", gotHash[0:2], gotHash[2:])
  // check that the blob file exists; fail the test instead of exiting the process
  if _, err := os.Stat(filePath); err != nil {
   t.Fatalf("Error finding blob file %s: %s", filePath, err)
  }
 })

Run go test again in your terminal. If the test fails, it means the blob wasn’t created in the expected location. You can also verify this manually using your file explorer.

Last but not least, we’ll write our third test case to see if the blob content is the same as the file content. To do this, we’ll use the already-implemented cat-file command to print the contents of our blob object.

t.Run("Testing the contents of the blob", func(t *testing.T) {
  // call main function with the cat-file command
  // (trim the hash so it is exactly 40 characters even if this subtest runs on its own)
  gotContent, myGitCatFileError := RunMainFunctionCatFile(strings.TrimSpace(gotHash))
  if myGitCatFileError != nil {
   t.Fatalf("Error implementing mygit command: %s\n", myGitCatFileError)
  }
  gotContent = strings.TrimSpace(gotContent)
  wantContent := string(fileContents)
  if gotContent != wantContent {
   t.Errorf("got %q want %q", gotContent, wantContent)
  }
 })
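The test above calls RunMainFunctionCatFile, whose definition is not shown in this article. A minimal sketch, assuming it mirrors the two helpers defined earlier, could look like the following. The shared helper runAndCapture is an invention of this example, and the main function exists only so the snippet runs on its own; in a test file you would keep just the two helper functions.

```go
package main

import (
	"bytes"
	"fmt"
	"os/exec"
)

// runAndCapture runs an external command and returns what it wrote to stdout.
// On failure, the error includes the command's stderr to ease debugging.
// (runAndCapture is a hypothetical helper, not part of the original code.)
func runAndCapture(name string, args ...string) (string, error) {
	cmd := exec.Command(name, args...)
	var stdout, stderr bytes.Buffer
	cmd.Stdout = &stdout
	cmd.Stderr = &stderr
	if err := cmd.Run(); err != nil {
		return "", fmt.Errorf("%s failed: %w: %s", name, err, stderr.String())
	}
	return stdout.String(), nil
}

// RunMainFunctionCatFile runs our own cat-file implementation on the given
// hash, assuming (as in the other helpers) that main.go is in the current
// directory.
func RunMainFunctionCatFile(hash string) (string, error) {
	return runAndCapture("go", "run", "main.go", "cat-file", "-p", hash)
}

func main() {
	// Demonstrate the shared helper with a harmless command.
	out, err := runAndCapture("go", "version")
	if err != nil {
		fmt.Println("could not run go:", err)
		return
	}
	fmt.Print(out)
}
```

Capturing stderr as well is a design choice: when go run fails, the usage or error message from our program ends up in the test output.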

Run go test one final time in your terminal. You should see that all your tests pass without errors.

Now, if all your test cases are passing, I’d like to point out an interesting observation from our last test. When we run the cat-file command on our generated hash, it actually reads the blob object from .git/objects, not from .mygit/objects. Our test case still passed because when git added our test file to its object store, it used the same hash as ours. This little observation was really helpful in verifying that our implementation is correct.

Conclusion

And that’s it! We’ve successfully implemented core Git functionalities like hash-object, cat-file, and init, diving deep into how Git manages objects under the hood. While we’ve only scratched the surface, this should give you a solid foundation to build upon.

If you’re interested in extending this further, try implementing the next stages of this challenge. And if you have any feedback, ideas, or improvements, feel free to comment or reach out!

Happy coding! 🚀

👉 Originally published on Medium: link