git
An implementation of Git in Scala 3 with ZIO 2 with all episodes available on YouTube :tv:
Git implementation
Implementation of a subset of git features
Objectives
- Learn how git works in depth
- Try Scala 3
- Have several loosely-coupled interchangeable components thanks to hexagonal architecture
- Try to integrate practices and patterns from DDD
- (double loop) TDD approach
Chapters
Chapter 1: Making a commit
:tv: Episode 1: Primitive Blob Hashing
- motivations and presentation of the objectives
- generated project
sbt new scala/scala3.g8
- hash a blob
- What is a blob?
- SHA1 of file with a prefix
blob <content_size>\0<content>
- Hash of a blob:
echo -n 'test content' | git hash-object --stdin
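- Comparing with the SHA-1 hash of the same string with the blob header prepended (printf is used here because echo does not interpret \0 in every shell)
printf 'blob 12\0test content' | shasum -a 1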
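The blob hashing described in Episode 1 can be sketched in a few lines of Scala 3. This is a minimal illustration using only the JDK, not the code written in the episodes; the function name `blobHash` is hypothetical.

```scala
import java.nio.charset.StandardCharsets.UTF_8
import java.security.MessageDigest

// Hypothetical helper (not the project's code): Git hash of a blob's raw bytes.
def blobHash(content: Array[Byte]): String =
  // Header is "blob <content_size>" followed by a NUL byte.
  // The size is the number of bytes, not the number of characters.
  val header = s"blob ${content.length}".getBytes(UTF_8) :+ 0.toByte
  val digest = MessageDigest.getInstance("SHA-1").digest(header ++ content)
  // Render each byte as two lowercase hexadecimal digits.
  digest.map(b => "%02x".format(b)).mkString
```

Calling `blobHash("test content".getBytes(UTF_8))` should give the same value as the `git hash-object --stdin` command above, since both hash the header followed by the content.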
:tv: Episode 2: Refactoring to use hexagonal architecture and introduce concepts like Command and UseCase
- refactoring and extension of the code to support other input options (file, write in database, type, etc.)
- setup domain and infrastructure packages (hexagonal architecture)
- write a test for Main
- introducing a HashObjectCommand
:tv: Episode 3: Add ZIO with MockConsole
- add zio (resource management, streaming, retries, parallelism, etc.)
:tv: Episode 4: Hashing a stream of bytes (ZStream)
- objective of the chapter: making a commit
- hash stdin string - change the way the command is used:
  - hash-object --text "test content" instead of hash-object "test content"
- Fix the encoding issue
- Hashing a stream of bytes (ZStream and ZSink)
:tv: Episode 5: Hash files
- Write test to hash a file
- Refactor so the hash object usecase accepts several types of command
- Implement hashing a file
- Model the return type of the usecase with a richer type
- Update test to hash several files and implement
:tv: Episode 6: Refactor to introduce FileSystemPort and Adapter using ZLayer
- [Refactor/hexagonal arch.] extract reading a file and have the implementation in the infrastructure package.
- problem in the hash object usecase
- fixing the problem
:tv: Episode 7: Mock Object repository Part 1 - the repository should be called in the use case
- [Business Logic] write a blob in git objects directory
- [x] create an ObjectRepository
- [/] write a test for HashObjectUseCase verifying that the repository is called
:tv: Episode 8: Mock Object repository Part 2 - ObjectRepositoryFileSystemSpec
- [Business Logic] write a blob in git objects directory
- [x] create an ObjectRepository
- [x] write a test for HashObjectUseCase verifying that the repository is called
- [/] create the implementation for the repository and test
- what to test? we are looking to test compatibility with Git: right place, right format
:tv: Episode 9: Implementation of ObjectRepositoryFileSystem
- [Business Logic] write a blob in git objects directory
- Object Repository File System
- Refactor the ObjectRepositoryFileSystemSpec to generate a single hash to avoid a "cache" issue.
- Implement Object Repository File System
:tv: Episode 10: HashObjectUseCase is not calling the ObjectRepository with the right value
- Check that hash object use case is calling the object repository with the right value (with the blob + size prefix)
:tv: Episode 11: 1st milestone! Writing a git object and read it back with git
- Put things together: hash and save a blob from the app and try to read it with git
- Missing test: the repository should not be called when the save option is false
- refactor main to extract the parsing and the formatting part
Chapter 2: Saving the current tree
:tv: Episode 12: Exploring git index binary format with #scodec #scala3
- [Business Logic] read and write git index file
- read the git index file
:tv: Episode 13: Structure the code via case classes
- [Business Logic] read and write git index file
- create a dummy index file and read it
- refactor the code to use case classes
:tv: Episode 14: Productionization of the git index parse code #scodec #scala3 (part 1)
- [Business Logic] read and write git index file
- productionize the code
Next:
- [Business Logic] write a tree in git object directory
- refactor the MainSpec to separate the concerns
- use a more specific type than string for dealing with files
- [Business Logic] write a commit (with a tree hash provided)
Git internals
Objects
Source: https://git-scm.com/book/en/v2/Git-Internals-Plumbing-and-Porcelain
Types of objects
Git uses the concept of objects. There are three types of object:
- blobs. A blob represents the content of a file. It is stored in a file named after the SHA-1 hash of that content prefixed with the blob header.
- trees. Trees are used to represent the hierarchy between blobs. A tree contains blobs and other trees with their names. For instance:
100644 blob dc711f442241823069c499197accce1537f30928 .gitignore
100644 blob e5d351c3cd44aa1d8c1cb967c7e7fde1dee4b0ad README.md
100644 blob 7a010b786eb29b895ba5799306052b996516d63b build.sbt
040000 tree 8bac5f27882165d313f5732bb4f140003156c693 project
040000 tree 163727ec9bd17ef32ee088a52a31fe0b483fa18f src
- there are different types of files:
  - 100644 is a normal file
  - 100755 is an executable file
  - 120000 for symbolic links
  - 040000 for trees
  - 160000 for sub-modules
- commits. Commits are used to capture:
  - the tree snapshot of the code
  - the parent(s) commits. Usually a commit has only one parent, but it can have 0 to n parents. The first commit does not have any parent. A merge commit has several parents (usually 2).
  - the author
  - the committer
  - a blank line
  - the commit message
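The tree entries and commit fields listed above can be modelled with a few Scala 3 types. This is a hedged sketch: `FileMode`, `TreeEntry`, `Commit`, `fileModeFromCode` and `parseTreeLine` are hypothetical names, not the project's types. It only parses the pretty-printed lines shown by `git cat-file -p`, not the binary encoding Git stores on disk.

```scala
// Hypothetical model of the file modes listed above.
enum FileMode(val code: String):
  case Regular    extends FileMode("100644")
  case Executable extends FileMode("100755")
  case Symlink    extends FileMode("120000")
  case Tree       extends FileMode("040000")
  case Submodule  extends FileMode("160000")

def fileModeFromCode(code: String): Option[FileMode] =
  FileMode.values.find(_.code == code)

final case class TreeEntry(mode: FileMode, kind: String, hash: String, name: String)

// Parses one pretty-printed line: "<mode> <type> <hash> <name>".
// The limit of 4 keeps file names that contain spaces intact.
def parseTreeLine(line: String): Option[TreeEntry] =
  line.split("\\s+", 4) match
    case Array(mode, kind, hash, name) =>
      fileModeFromCode(mode).map(TreeEntry(_, kind, hash, name))
    case _ => None

final case class Commit(
    tree: String,
    parents: List[String],
    author: String,    // assumed already formatted, e.g. "Name <email> <timestamp> <timezone>"
    committer: String, // same format as author
    message: String
):
  // Text of the commit object, before the "commit <size>" header and NUL byte are prepended.
  def render: String =
    val headers =
      List(s"tree $tree") ++
        parents.map(p => s"parent $p") ++
        List(s"author $author", s"committer $committer")
    // Header lines, then a blank line, then the message.
    headers.mkString("", "\n", "\n") + "\n" + message
```

A first commit is built with `parents = Nil`, and a merge commit with two or more parent hashes.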
How objects are stored in the object repository
Prefixed by the first two characters of the hash
Those files are stored in .git/objects. Each file, representing either a blob, a tree or a commit, is stored within a directory named after the first two characters of the hexadecimal hash. For instance, the object with the hash dc711f442241823069c499197accce1537f30928 will be stored in the folder .git/objects/dc.
The filename is the hash without the first two characters. For the hash dc711f442241823069c499197accce1537f30928, the filename will be 711f442241823069c499197accce1537f30928 (note that the prefix dc has been removed here). The file corresponding to the hash dc711f442241823069c499197accce1537f30928 would be .git/objects/dc/711f442241823069c499197accce1537f30928.
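The path layout described above is simple to express in code. A minimal sketch, assuming a 40-character lowercase hexadecimal SHA-1 hash; `objectPath` is a hypothetical helper name.

```scala
import java.nio.file.{Path, Paths}

// Hypothetical helper: location of an object file relative to the repository root.
def objectPath(hash: String): Path =
  def isHex(c: Char) = (c >= '0' && c <= '9') || (c >= 'a' && c <= 'f')
  require(hash.length == 40 && hash.forall(isHex), s"not a SHA-1 hex hash: $hash")
  // The first two characters name the directory, the remaining 38 name the file.
  val (directory, fileName) = hash.splitAt(2)
  Paths.get(".git", "objects", directory, fileName)
```

For example, `objectPath("dc711f442241823069c499197accce1537f30928")` resolves to `.git/objects/dc/711f442241823069c499197accce1537f30928` on a Unix-like system.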
Zipped using ZLib
ZLib is a C library used for data compression. It only supports one algorithm: DEFLATE (also used in the zip archive format). This algorithm is widely used.
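On the JVM, zlib-compatible compression is available through `java.util.zip`, so no native binding is needed. The sketch below is one possible way to compress and decompress object bytes; `deflate` and `inflate` are hypothetical helper names, and `readAllBytes` assumes Java 9 or later.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream}
import java.util.zip.{DeflaterOutputStream, InflaterInputStream}

// Compresses bytes into the zlib format (DEFLATE data with a zlib header and checksum).
def deflate(data: Array[Byte]): Array[Byte] =
  val out = ByteArrayOutputStream()
  val zip = DeflaterOutputStream(out)
  zip.write(data)
  zip.close() // finishes the stream and writes the remaining compressed bytes
  out.toByteArray

// Decompresses zlib-formatted bytes, such as the content of a file under .git/objects.
def inflate(data: Array[Byte]): Array[Byte] =
  val in = InflaterInputStream(ByteArrayInputStream(data))
  try in.readAllBytes()
  finally in.close()
```

Inflating an object file read from .git/objects should give back the header (type, size, NUL byte) followed by the object's content.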
Git index
https://git-scm.com/docs/index-format
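According to the index format documentation linked above, the index file starts with a 12-byte header: the 4-byte signature `DIRC`, a 4-byte version number and a 32-bit count of index entries, all in network byte order. The episodes use scodec for parsing; the sketch below swaps in plain `java.nio.ByteBuffer` to stay self-contained, and `IndexHeader` and `parseIndexHeader` are hypothetical names.

```scala
import java.nio.ByteBuffer
import java.nio.charset.StandardCharsets.US_ASCII

// Hypothetical model of the 12-byte index header.
final case class IndexHeader(version: Int, entryCount: Int)

def parseIndexHeader(bytes: Array[Byte]): Either[String, IndexHeader] =
  if bytes.length < 12 then Left("an index header needs at least 12 bytes")
  else
    // ByteBuffer reads big-endian by default, which matches network byte order.
    val buffer = ByteBuffer.wrap(bytes)
    val signature = new Array[Byte](4)
    buffer.get(signature)
    if new String(signature, US_ASCII) != "DIRC" then Left("unexpected signature")
    else
      val version = buffer.getInt()
      val entryCount = buffer.getInt()
      Right(IndexHeader(version, entryCount))
```

The header could be read with something like `parseIndexHeader(java.nio.file.Files.readAllBytes(java.nio.file.Paths.get(".git", "index")))`; the entries and extensions that follow the header are not handled here.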
Useful git commands:
- git cat-file: show information about an object
  - -p <hash>: show the content of an object. <hash> can be master^{tree} to reference the tree object pointed to by the last commit of master.
  - -t <hash>: show the type of the object
- git hash-object: compute the hash of a file's content (with -w, also write it to the object database)
- git update-index: register file contents in the working tree to the index
- git write-tree: write the staging area to a tree object
- git ls-files
  - --stage or -s: show the mode, object hash and stage number of each file in the index
- zlib-flate -uncompress < .git/objects/18/7fbaf52b4fdebd0111740829df5b51edc8b029: another program that inflates (decompresses) object files
Useful links:
- https://git-scm.com/book/sv/v2/Git-Internals-Git-Objects
- https://stackoverflow.com/questions/4084921/what-does-the-git-index-contain-exactly
- https://git-scm.com/docs/gitglossary
- https://github.com/git/git/blob/master/Documentation/technical/index-format.txt
- https://git-scm.com/book/en/v2/Git-Internals-Packfiles
- Good explanations about the format of git tree https://stackoverflow.com/questions/14790681/what-is-the-internal-format-of-a-git-tree-object