Git Internals
We will create a git repository named git_internals from scratch with minimal use of conventional git commands. I use git version 2.8.1.
Git init
First, we create an empty folder.
$ mkdir git_internals
$ cd git_internals/
Of course, it’s not a git repo yet, so git status complains.
$ git status
fatal: Not a git repository (or any of the parent directories): .git
All the magic behind the scenes of git happens in the .git/ folder of your repository. So we create the .git/ folder.
$ mkdir .git/
$ git status
fatal: Not a git repository (or any of the parent directories): .git
But it seems the .git/ folder alone is not enough. Let’s add some other things.
$ mkdir .git/objects
$ mkdir .git/refs
$ echo "ref: refs/heads/master" > .git/HEAD
The .git/ folder now looks like the following, and git status is happy with us.
$ tree .git/
.git/
├── HEAD
├── objects
└── refs
$ git status
On branch master
Initial commit
nothing to commit (create/copy files and use "git add" to track)
So far, we have initialized a git repo.
This is basically what git init does for us under the hood.
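The same three steps can be sketched in Python. This is only a minimal sketch of what we did by hand: the function name init_minimal_repo is made up for this post, and the real git init also writes further files (a config file and sample hooks, for example) that this sketch skips.

```python
import os
import tempfile

def init_minimal_repo(root):
    """Create the bare minimum that `git status` accepted above (hypothetical helper)."""
    git_dir = os.path.join(root, ".git")
    os.makedirs(os.path.join(git_dir, "objects"))
    os.makedirs(os.path.join(git_dir, "refs"))
    # HEAD names the current branch, even though that branch has no commit yet
    with open(os.path.join(git_dir, "HEAD"), "w") as f:
        f.write("ref: refs/heads/master\n")
    return git_dir

# Demo in a throwaway directory so no real repository is touched
demo_root = tempfile.mkdtemp()
demo_git_dir = init_minimal_repo(demo_root)
print(sorted(os.listdir(demo_git_dir)))  # ['HEAD', 'objects', 'refs']
```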
But wait! What are objects/, refs/ and HEAD in the .git/ folder?
Hold on for a minute; we will explore them along the way.
What I can tell you for now is that HEAD stores the information about the branch you are currently on.
You can do the following and see that git status tells you that you are on branch forked now.
$ echo "ref: refs/heads/forked" > .git/HEAD
$ git status
On branch forked
Initial commit
Untracked files:
(use "git add <file>..." to include in what will be committed)
hello.txt
nothing added to commit but untracked files present (use "git add" to track)
# But let's go back to branch master
$ echo "ref: refs/heads/master" > .git/HEAD
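To make the role of HEAD concrete, here is a small sketch that extracts the branch name from the contents of .git/HEAD. The function name current_branch is my own, and the None branch covers a case this post does not explore: a HEAD file that holds a raw hash instead of a "ref: ..." line.

```python
def current_branch(head_text):
    """Return the branch named in the contents of .git/HEAD, or None (hypothetical helper)."""
    head_text = head_text.strip()
    prefix = "ref: refs/heads/"
    if head_text.startswith(prefix):
        return head_text[len(prefix):]
    return None  # HEAD does not point at a branch, e.g. it holds a raw hash

print(current_branch("ref: refs/heads/forked\n"))  # forked
print(current_branch("ref: refs/heads/master\n"))  # master
```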
Git add
Let’s create a file in our workspace. When I say “workspace”, I literally mean the git_internals folder.
$ echo "hello git" > hello.txt
$ git status
On branch master
Initial commit
Untracked files:
(use "git add <file>..." to include in what will be committed)
hello.txt
nothing added to commit but untracked files present (use "git add" to track)
Unsurprisingly, git status knows that we have an untracked file named hello.txt.
We’d like to add this file to the git staging area, which means running git add hello.txt.
$ git add hello.txt
$ git status
On branch master
Initial commit
Changes to be committed:
(use "git rm --cached <file>..." to unstage)
new file: hello.txt
$ tree .git
.git
├── HEAD
├── index
├── objects
│ └── 8d
│ └── 0e41234f24b6da002d962a26c2495ea16a425f
└── refs
Now our hello.txt is staged. Two new files are created under .git/:
- The index file, as its name indicates, stores the index of what files are staged.
- The objects/8d/0e41234f24b6da002d962a26c2495ea16a425f file stores the actual data of the staged file hello.txt.
If we delete the index file, git will lose the information about what we just staged.
# temporarily delete index
$ mv .git/index .git/index_
$ git status
On branch master
Initial commit
Untracked files:
(use "git add <file>..." to include in what will be committed)
hello.txt
nothing added to commit but untracked files present (use "git add" to track)
# recover index
$ mv .git/index_ .git/index
By the way, we can use git ls-files to inspect what files are tracked in the index.
With the --stage option, the output shows the mode, SHA1 hash, stage number and name of each file.
We are not going to dig into the stage number, but briefly: it’s used to distinguish conflicted files when merging, and it’s 0 when there’s no conflict.
$ git ls-files --stage
100644 8d0e41234f24b6da002d962a26c2495ea16a425f 0 hello.txt
Here, we notice that the SHA1 value appears in the path of the object file objects/8d/0e41234f24b6da002d962a26c2495ea16a425f. The first two hex digits of the SHA1 are used as a bucket (folder) name, so that git does not create too many files in a single folder and degrade the performance of the underlying file system.
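The mapping from a hash to its file path is simple enough to write down. This is a sketch with a helper name of my own (object_path); it only covers objects stored as individual files, as in this post.

```python
import os

def object_path(sha1_hex, git_dir=".git"):
    """Map a 40-digit SHA1 to the file path used under objects/ (hypothetical helper)."""
    # first two hex digits -> bucket folder, remaining 38 -> file name
    return os.path.join(git_dir, "objects", sha1_hex[:2], sha1_hex[2:])

print(object_path("8d0e41234f24b6da002d962a26c2495ea16a425f"))
```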
Git also provides another command, git hash-object, which reads a file and generates a SHA1 hash based on its contents and some metadata.
$ git hash-object hello.txt
8d0e41234f24b6da002d962a26c2495ea16a425f
Let’s write some Python code to demonstrate how this SHA1 is generated.
# hash_blob.py
import sys, hashlib
# Read raw bytes so that the length is a byte count, even for non-ASCII content
data = sys.stdin.buffer.read()
# FORMAT: "blob <length of content><NULL><content>"
blob = b'blob %d\x00%s' % (len(data), data)
print(hashlib.sha1(blob).hexdigest())
Our Python script reads content from stdin and generates the same SHA1 hash as git hash-object.
$ python hash_blob.py < hello.txt
8d0e41234f24b6da002d962a26c2495ea16a425f
What happens behind the scenes is that git creates a blob object in the format "blob <length of content><NULL><content>" (<NULL> represents the byte \x00).
This blob object is then stored as a zlib-compressed file.
We revise our Python script to perform the compression as well.
# hash_blob.py
import sys, hashlib, zlib
# Read raw bytes so that the length is a byte count, even for non-ASCII content
data = sys.stdin.buffer.read()
# FORMAT: "blob <length of content><NULL><content>"
blob = b'blob %d\x00%s' % (len(data), data)
print(hashlib.sha1(blob).hexdigest())
# level=1 reproduces the bytes git wrote in this example
print(zlib.compress(blob, level=1))
We quickly check that our script generates the same output as git does.
$ python hash_blob.py < hello.txt
8d0e41234f24b6da002d962a26c2495ea16a425f
b'x\x01K\xca\xc9OR04`\xc8H\xcd\xc9\xc9WH\xcf,\xe1\x02\x006A\x05\xa3'
$ cat .git/objects/8d/0e41234f24b6da002d962a26c2495ea16a425f \
| python -c 'import sys; print(sys.stdin.buffer.read())'
b'x\x01K\xca\xc9OR04`\xc8H\xcd\xc9\xc9WH\xcf,\xe1\x02\x006A\x05\xa3'
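Going the other way, an object file can be decompressed and split back into its parts. The sketch below assumes the "<type> <length><NULL><content>" layout described above; parse_loose_object is a name I made up, and the demo compresses an object in memory instead of reading from .git/ so it runs anywhere.

```python
import hashlib
import zlib

def parse_loose_object(compressed):
    """Return (type, payload, sha1) for the bytes of one object file (hypothetical helper)."""
    raw = zlib.decompress(compressed)
    header, _, payload = raw.partition(b"\x00")  # split at the first NULL only
    obj_type, size = header.decode("ascii").split(" ")
    if int(size) != len(payload):
        raise ValueError("length in header does not match payload")
    # the object ID is the SHA1 of the uncompressed header plus payload
    return obj_type, payload, hashlib.sha1(raw).hexdigest()

# Round trip: build the blob for "hello git\n" in memory, then parse it back
content = b"hello git\n"
stored = zlib.compress(b"blob %d\x00" % len(content) + content, 1)
obj_type, payload, sha1 = parse_loose_object(stored)
print(obj_type, payload, sha1)
```

If the content matches hello.txt byte for byte, the printed hash should be the one git hash-object reported earlier.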
Let’s create a new file hello2.txt which has the same content as hello.txt and stage it.
$ cp hello.txt hello2.txt
$ git add hello2.txt
$ git status
On branch master
Initial commit
Changes to be committed:
(use "git rm --cached <file>..." to unstage)
new file: hello.txt
new file: hello2.txt
Now if we look at .git, we find something interesting: there’s still only one blob object.
But if we dig into the index, we see two files pointing to the same SHA1 hash.
Because an object’s ID is the hash of its contents, git deduplicates objects with the same contents.
$ tree .git
.git
├── HEAD
├── index
├── objects
│ └── 8d
│ └── 0e41234f24b6da002d962a26c2495ea16a425f
└── refs
$ git ls-files -s
100644 8d0e41234f24b6da002d962a26c2495ea16a425f 0 hello.txt
100644 8d0e41234f24b6da002d962a26c2495ea16a425f 0 hello2.txt
Lessons learnt:
- git add creates an index file and blob objects.
- SHA1 is used as the unique ID of the object for deduplication.
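Both lessons fit into a toy model. The class below is purely illustrative (TinyObjectStore is not how git is implemented, and the real index is a binary file with more fields); it only shows why two files with the same content end up sharing one object.

```python
import hashlib

class TinyObjectStore:
    """Toy model of the staging area: an index plus content-addressed blobs."""

    def __init__(self):
        self.objects = {}  # sha1 hex -> blob bytes
        self.index = {}    # file name -> sha1 hex

    def add(self, name, content):
        blob = b"blob %d\x00" % len(content) + content
        sha1 = hashlib.sha1(blob).hexdigest()
        self.objects.setdefault(sha1, blob)  # same content, same key: stored once
        self.index[name] = sha1
        return sha1

store = TinyObjectStore()
store.add("hello.txt", b"hello git\n")
store.add("hello2.txt", b"hello git\n")
print(len(store.index), "index entries,", len(store.objects), "object")
# 2 index entries, 1 object
```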
Git commit
Finally, we are able to commit our changes. A lot of things happen in .git/.
$ git commit -m "My first commit"
$ tree .git
.git
├── COMMIT_EDITMSG
├── HEAD
├── index
├── logs
│ ├── HEAD
│ └── refs
│ └── heads
│ └── master
├── objects
│ ├── 43
│ │ └── 3869c28e75b1b9648b7c9cce7d8f1622d930eb
│ ├── 58
│ │ └── 452a586535b0e636d91e4d08007f93e70a6591
│ └── 8d
│ └── 0e41234f24b6da002d962a26c2495ea16a425f
└── refs
└── heads
└── master
We find an object in objects/ whose SHA1 is the hash of our first commit, 433869c28e75b1b9648b7c9cce7d8f1622d930eb.
In fact, a git commit is also stored as an object, namely a commit object.
$ git log
commit 433869c28e75b1b9648b7c9cce7d8f1622d930eb
Author: czheo <czheo1987@gmail.com>
Date: Sun Mar 18 19:40:58 2018 -0700
My first commit
As with blob objects, we can decompress the commit object to investigate it.
$ cat .git/objects/43/3869c28e75b1b9648b7c9cce7d8f1622d930eb \
| python -c "import sys, zlib, hashlib; \
c = zlib.decompress(sys.stdin.buffer.read()); \
print('hash =', hashlib.sha1(c).hexdigest()); print(); \
print(c); print(); \
print(c.decode('utf8'))"
hash = 433869c28e75b1b9648b7c9cce7d8f1622d930eb
b'commit 170\x00tree 58452a586535b0e636d91e4d08007f93e70a6591\nauthor czheo <czheo1987@gmail.com> 1521427258 -0700\ncommitter czheo <czheo1987@gmail.com> 1521427258 -0700\n\nMy first commit\n'
commit 170tree 58452a586535b0e636d91e4d08007f93e70a6591
author czheo <czheo1987@gmail.com> 1521427258 -0700
committer czheo <czheo1987@gmail.com> 1521427258 -0700
My first commit
The format of a commit object is shown below.
The first field, commit, is the type of the object.
The second field stores the length of the payload in bytes.
After the <NULL> separator (b"\x00"), the rest is the payload of the commit object.
Most of the information is quite straightforward!
If you are wondering about the difference between author and committer: the author is the person who originally wrote the change, while the committer is the person who last applied it to the repository.
commit <length><NULL>
tree <hash of tree>\n
author <name> <email> <timestamp> <timezone>\n
committer <name> <email> <timestamp> <timezone>\n
\n
<commit message>\n
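We can rebuild the commit object from its fields to check our understanding of this format. This is a sketch under the assumption that the layout is exactly as listed above; build_commit_object is a hypothetical helper, and the field values are copied from the decompressed output shown earlier.

```python
import hashlib

def build_commit_object(tree, author, committer, message):
    """Assemble the uncompressed bytes of a parentless commit object (hypothetical helper)."""
    payload = (
        "tree %s\n" % tree
        + "author %s\n" % author
        + "committer %s\n" % committer
        + "\n"
        + message + "\n"
    ).encode("utf8")
    return b"commit %d\x00" % len(payload) + payload

# name, email, timestamp and timezone as they appear in the first commit
ident = "czheo <czheo1987@gmail.com> 1521427258 -0700"
obj = build_commit_object(
    "58452a586535b0e636d91e4d08007f93e70a6591", ident, ident, "My first commit")
print(obj[:11])                      # b'commit 170\x00'
print(hashlib.sha1(obj).hexdigest())
```

If these bytes match what git wrote, the digest should be the commit hash that git log displayed.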
Here we notice that the commit object points to a “tree” whose hash value is 58452a586535b0e636d91e4d08007f93e70a6591, which can also be found in the objects/ folder.
It is the SHA1 hash of a tree object.
Git uses tree objects to represent folders.
$ cat .git/objects/58/452a586535b0e636d91e4d08007f93e70a6591 \
| python -c "import sys, zlib, hashlib; \
c = zlib.decompress(sys.stdin.buffer.read()); \
print('hash =', hashlib.sha1(c).hexdigest()); print(); \
print(c)"
hash = 58452a586535b0e636d91e4d08007f93e70a6591
b'tree 75\x00100644 hello.txt\x00\x8d\x0eA#O$\xb6\xda\x00-\x96*&\xc2I^\xa1jB_100644 hello2.txt\x00\x8d\x0eA#O$\xb6\xda\x00-\x96*&\xc2I^\xa1jB_'
A tree object stores a list of blob/tree objects. The format of a tree object is shown below.
tree <length><NULL>
<mode> <file name><NULL><object hash>
<mode> <file name><NULL><object hash>
...
The blob hashes are stored as raw binary (20 bytes each) rather than as hex text. You can find the byte sequence 8d0e41234f24b6da002d962a26c2495ea16a425f if you look at the hex dump.
$ cat .git/objects/58/452a586535b0e636d91e4d08007f93e70a6591 | python -c "import sys, zlib; \
sys.stdout.buffer.write(zlib.decompress(sys.stdin.buffer.read()))" | xxd
00000000: 7472 6565 2037 3500 3130 3036 3434 2068  tree 75.100644 h
00000010: 656c 6c6f 2e74 7874 008d 0e41 234f 24b6  ello.txt...A#O$.
                                ^^^^^^^^^^^^^^^^^
00000020: da00 2d96 2a26 c249 5ea1 6a42 5f31 3030  ..-.*&.I^.jB_100
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
00000030: 3634 3420 6865 6c6c 6f32 2e74 7874 008d  644 hello2.txt..
                                               ^^
00000040: 0e41 234f 24b6 da00 2d96 2a26 c249 5ea1  .A#O$...-.*&.I^.
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
00000050: 6a42 5f                                  jB_
          ^^^^^^^
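Reading the entries back out of a tree payload takes a little care, because the binary hash may itself contain a \x00 byte (ours does: ...da 00 2d...). The sketch below therefore reads a fixed 20 bytes after each file name instead of splitting on <NULL>. parse_tree_payload is a name I made up, and the payload is rebuilt from the bytes shown above.

```python
def parse_tree_payload(payload):
    """Return a list of (mode, name, sha1 hex) from a tree payload (hypothetical helper)."""
    entries = []
    pos = 0
    while pos < len(payload):
        space = payload.index(b" ", pos)     # end of <mode>
        nul = payload.index(b"\x00", space)  # end of <file name>
        mode = payload[pos:space].decode("ascii")
        name = payload[space + 1:nul].decode("utf8")
        sha1 = payload[nul + 1:nul + 21].hex()  # always 20 raw bytes
        entries.append((mode, name, sha1))
        pos = nul + 21
    return entries

blob_id = bytes.fromhex("8d0e41234f24b6da002d962a26c2495ea16a425f")
payload = b"100644 hello.txt\x00" + blob_id + b"100644 hello2.txt\x00" + blob_id
print(len(payload))  # 75, the length stored in the tree header
for entry in parse_tree_payload(payload):
    print(*entry)
```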
So far we have seen three types of objects: blob, commit and tree. (Git has a fourth type, tag, used for annotated tags, which this post does not cover.)
They are compressed and stored under .git/objects/ with their SHA1 hashes in the paths.
Remember what is in the file .git/HEAD? Yes, it’s ref: refs/heads/master, which means it’s a reference to the path refs/heads/master.
If we take a look at the file .git/refs/heads/master, it’s a plain text file containing a SHA1 hash that points to the commit object we discovered earlier.
$ cat .git/refs/heads/master
433869c28e75b1b9648b7c9cce7d8f1622d930eb
Under .git/refs/heads/, git maintains the most recent commit (the head) of each branch.
We can manually create a new branch by creating a new file under .git/refs/heads/.
After executing the commands below, the head of newbr should point to the same commit as master.
$ git branch
* master
$ cp .git/refs/heads/master .git/refs/heads/newbr
$ git branch
* master
newbr
Remember how we switched branches at the beginning of this post?
We can switch to a branch even before its head has been created.
Git determines which branch you are currently on by looking at the .git/HEAD file.
$ echo "ref: refs/heads/newbr" > .git/HEAD
$ git branch
master
* newbr
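Putting HEAD and refs/heads/ together, the lookup from "where am I" to "which commit" is two file reads. The sketch below uses a helper name of my own (resolve_head) and only handles refs stored as individual files like the ones in this post; it runs against a throwaway folder, not a real repository.

```python
import os
import tempfile

def resolve_head(git_dir):
    """Return (ref name, commit hash); either may be None (hypothetical helper)."""
    with open(os.path.join(git_dir, "HEAD")) as f:
        head = f.read().strip()
    if not head.startswith("ref: "):
        return None, head  # HEAD holds a raw hash instead of a reference
    ref = head[len("ref: "):]
    ref_path = os.path.join(git_dir, *ref.split("/"))
    if not os.path.exists(ref_path):
        return ref, None   # the branch is named in HEAD but has no head file yet
    with open(ref_path) as f:
        return ref, f.read().strip()

demo_dir = tempfile.mkdtemp()
os.makedirs(os.path.join(demo_dir, "refs", "heads"))
with open(os.path.join(demo_dir, "HEAD"), "w") as f:
    f.write("ref: refs/heads/newbr\n")
before = resolve_head(demo_dir)
with open(os.path.join(demo_dir, "refs", "heads", "newbr"), "w") as f:
    f.write("433869c28e75b1b9648b7c9cce7d8f1622d930eb\n")
after = resolve_head(demo_dir)
print(before)  # ('refs/heads/newbr', None)
print(after)   # ('refs/heads/newbr', '433869c28e75b1b9648b7c9cce7d8f1622d930eb')
```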
There’s something we do not cover in this post: .git/logs/.
If you are familiar with the git reflog command, it should not be difficult to figure out what this folder does on your own.
Commit objects are linked lists
Let’s create a second commit.
$ echo "hello again" >> hello.txt
$ git commit -am "My second commit"
[master c9cb777] My second commit
1 file changed, 1 insertion(+)
$ git log
commit c9cb777b785095f1d61ba213cbe95a2191f1b530
Author: czheo <czheo1987@gmail.com>
Date: Sun Mar 18 21:52:11 2018 -0700
My second commit
commit 433869c28e75b1b9648b7c9cce7d8f1622d930eb
Author: czheo <czheo1987@gmail.com>
Date: Sun Mar 18 19:40:58 2018 -0700
My first commit
Three new objects are created under .git/objects/:
c9cb777b785095f1d61ba213cbe95a2191f1b530
78d0dae4323facf43ec1abb2974dc6aed63b65d7
8ecb1fca678b22a883ceaa0655c0a104b8812b80
$ tree .git
.git
├── COMMIT_EDITMSG
├── HEAD
├── index
├── logs
│ ├── HEAD
│ └── refs
│ └── heads
│ └── master
├── objects
│ ├── 43
│ │ └── 3869c28e75b1b9648b7c9cce7d8f1622d930eb
│ ├── 58
│ │ └── 452a586535b0e636d91e4d08007f93e70a6591
│ ├── 78
│ │ └── d0dae4323facf43ec1abb2974dc6aed63b65d7
│ ├── 8d
│ │ └── 0e41234f24b6da002d962a26c2495ea16a425f
│ ├── 8e
│ │ └── cb1fca678b22a883ceaa0655c0a104b8812b80
│ └── c9
│ └── cb777b785095f1d61ba213cbe95a2191f1b530
└── refs
└── heads
├── master
└── newbr
Git provides another convenient command: git cat-file.
Given a hash, this command parses the object in .git/objects/ and pretty-prints the payload, similar to what we did in our Python scripts.
# explore the commit object
$ git cat-file -t c9cb777b785095f1d61ba213cbe95a2191f1b530
commit
$ git cat-file -p c9cb777b785095f1d61ba213cbe95a2191f1b530
tree 78d0dae4323facf43ec1abb2974dc6aed63b65d7
parent 433869c28e75b1b9648b7c9cce7d8f1622d930eb
author czheo <czheo1987@gmail.com> 1521435131 -0700
committer czheo <czheo1987@gmail.com> 1521435131 -0700
My second commit
# explore the tree object
$ git cat-file -t 78d0dae4323facf43ec1abb2974dc6aed63b65d7
tree
$ git cat-file -p 78d0dae4323facf43ec1abb2974dc6aed63b65d7
100644 blob 8ecb1fca678b22a883ceaa0655c0a104b8812b80 hello.txt
100644 blob 8d0e41234f24b6da002d962a26c2495ea16a425f hello2.txt
# explore the blob object
$ git cat-file -t 8ecb1fca678b22a883ceaa0655c0a104b8812b80
blob
$ git cat-file -p 8ecb1fca678b22a883ceaa0655c0a104b8812b80
hello git
hello again
Notice that the commit object has a new parent record, whose value is the SHA1 of the previous commit object.
Because our first commit does not have a previous commit, it has no parent record.
We can imagine that git log simply follows the linked list of commit objects and prints out the history.
$ git cat-file -p 433869c28e75b1b9648b7c9cce7d8f1622d930eb
tree 58452a586535b0e636d91e4d08007f93e70a6591
author czheo <czheo1987@gmail.com> 1521427258 -0700
committer czheo <czheo1987@gmail.com> 1521427258 -0700
My first commit
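Here is a rough sketch of that linked-list walk. Everything in it is my own illustration and not git's code: the helper names are made up, it follows only the first parent record, and it reads only objects stored as individual files. The demo writes two commit objects modeled on ours into a throwaway folder, then walks from the newest one back to the start.

```python
import hashlib
import os
import tempfile
import zlib

def write_loose_object(git_dir, obj_type, payload):
    """Store an object as a zlib-compressed file and return its SHA1 (hypothetical helper)."""
    raw = b"%s %d\x00" % (obj_type.encode("ascii"), len(payload)) + payload
    sha1 = hashlib.sha1(raw).hexdigest()
    folder = os.path.join(git_dir, "objects", sha1[:2])
    os.makedirs(folder, exist_ok=True)
    with open(os.path.join(folder, sha1[2:]), "wb") as f:
        f.write(zlib.compress(raw, 1))
    return sha1

def read_commit(git_dir, sha1):
    """Return (first parent or None, message) of a commit object."""
    path = os.path.join(git_dir, "objects", sha1[:2], sha1[2:])
    with open(path, "rb") as f:
        raw = zlib.decompress(f.read())
    payload = raw.split(b"\x00", 1)[1].decode("utf8")
    headers, _, message = payload.partition("\n\n")
    parent = None
    for line in headers.split("\n"):
        if line.startswith("parent "):
            parent = line[len("parent "):]
            break
    return parent, message.strip()

def walk_history(git_dir, sha1):
    """Yield (sha1, message) from the given commit back to the first one."""
    while sha1 is not None:
        parent, message = read_commit(git_dir, sha1)
        yield sha1, message
        sha1 = parent

def commit_payload(tree, timestamp, message, parent=None):
    """Build a commit payload with the fields seen in this post."""
    ident = "czheo <czheo1987@gmail.com> %d -0700" % timestamp
    lines = ["tree " + tree]
    if parent:
        lines.append("parent " + parent)
    lines += ["author " + ident, "committer " + ident, "", message, ""]
    return "\n".join(lines).encode("utf8")

demo_dir = tempfile.mkdtemp()
first = write_loose_object(demo_dir, "commit", commit_payload(
    "58452a586535b0e636d91e4d08007f93e70a6591", 1521427258, "My first commit"))
second = write_loose_object(demo_dir, "commit", commit_payload(
    "78d0dae4323facf43ec1abb2974dc6aed63b65d7", 1521435131, "My second commit",
    parent=first))
for sha1, message in walk_history(demo_dir, second):
    print(sha1[:7], message)
```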
Furthermore, notice that git stores the new version of hello.txt in 8ecb1fca678b22a883ceaa0655c0a104b8812b80 as a whole file, rather than as a diff against the previous version.
Git tag is “cheating”
When we create a tag in git, it creates a new file at .git/refs/tags/mytag.
$ git tag mytag
$ tree .git
.git
...
├── objects
│ ├── 43
│ │ └── 3869c28e75b1b9648b7c9cce7d8f1622d930eb
│ ├── 58
│ │ └── 452a586535b0e636d91e4d08007f93e70a6591
│ ├── 78
│ │ └── d0dae4323facf43ec1abb2974dc6aed63b65d7
│ ├── 8d
│ │ └── 0e41234f24b6da002d962a26c2495ea16a425f
│ ├── 8e
│ │ └── cb1fca678b22a883ceaa0655c0a104b8812b80
│ └── c9
│ └── cb777b785095f1d61ba213cbe95a2191f1b530
└── refs
├── heads
│ ├── master
│ └── newbr
└── tags
└── mytag
If we look at it, it’s the same as the files in .git/refs/heads/.
$ cat .git/refs/tags/mytag
c9cb777b785095f1d61ba213cbe95a2191f1b530
Therefore, we can tell that git branches and tags differ little internally, although they seem quite different from the user’s perspective. Both of them are merely pointers to commit objects.
The git commands make tags appear immutable to users, unlike branches. However, we are now smart enough to mutate tags.
$ git show mytag
commit c9cb777b785095f1d61ba213cbe95a2191f1b530
...
$ echo "433869c28e75b1b9648b7c9cce7d8f1622d930eb" > .git/refs/tags/mytag
$ git show mytag
commit 433869c28e75b1b9648b7c9cce7d8f1622d930eb
...