Git is a version control system that is decentralized by design. Anyone can run git daemon in a repository to start a Git server. You can also host your repository using a regular web server and plain HTTP infrastructure. More commonly, though, repositories are distributed through centralized hub services like BitBucket, GitHub, and GitLab. It's quick, easy, and free to "throw your code up on GitHub" and call it a day. However, there is a growing number of peer-to-peer (P2P) distributed options to consider as well.
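To illustrate that first point, here is a minimal self-hosting sketch using the stock tooling (the path and host name are examples):

```shell
# Serve every repository under /srv/git read-only over the
# unauthenticated git:// protocol (default port 9418):
git daemon --export-all --reuseaddr --base-path=/srv/git

# Clients can then clone with:
#   git clone git://example.com/myproject.git
```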
What if you could distribute your Git repository using the BitTorrent P2P protocol without the need for a central server, and without having to depend on a commercial business's hosting generosity and infrastructure? That's the idea behind GitTorrent, an experimental Git helper and overlay protocol for transferring Git repositories over the popular P2P protocol.
GitTorrent does away with the idea of a central code-distribution server. Instead, it relies on the people using and participating in the project to contribute bandwidth and handle its distribution.
Similar concepts have popped up around other peer-to-peer protocols, including the Dat Protocol and IPFS. Each project has made different implementation choices and ended up with systems that appear similar at first glance but have fundamentally different trade-offs and priorities. In this article, I'll explore these differences in depth and do a comprehensive comparison.
I'll kick off with a comparison table of some key features and limitations. There's a lot to digest in it, and I'll discuss each item in turn below the table.
| Feature | GitTorrent | HyperGit | igis-remote | ipld-remote |
|---|---|---|---|---|
| Protocol | BitTorrent | Dat | IPFS | IPFS |
| git-remote | gittorrent: | hypergit: | ipns: | ipld: |
| Project activity | Inactive (2015) | Inactive (2018) | Active (2020) | Active (2019) |
| Runtime | Node.js | Node.js | Go | Go |
| Peer discovery | DHT (not bootstrapped) | Tracking server, mDNS-SD | DHT, mDNS-SD | DHT, mDNS-SD |
| Repo. updates | Git server, side-channel, DHT | Peer swarm | IPNS | Side-channel |
| Repo. mutability | Mutable | Mutable | Immutable | Immutable |
| Packing strategy | On-demand packing | Unpacked | Unpacked | Unpacked |
| Data de-duplication | None, compressed | None, uncompressed | Global, uncompressed | Global, uncompressed |
| File size limit | RAM | No inherent limit | No inherent limit | 2 MB |
| Data loss | No | No | Repo. root lost when IPFS runs GC | Repo. root lost when IPFS runs GC |
| Hash algorithm | SHA-1 | Ed25519 | SHA-256 | SHA-1/SHA-256 |
I'll start by discussing the status of each project and then move on to discuss how they do things differently.
Project activity
GitTorrent saw a burst of development in 2015, but the project appears to have been abandoned by its creator shortly thereafter. You'll notably be met with several security and deprecation warnings if you try to install and run it. It requires some tweaks to its dependencies and code to work with today's version of the Node.js runtime.
HyperGit similarly saw an initial burst of development in 2018 and also appears to have been abandoned. It, too, requires some minor fixes to install on a recent version of Node.js. HyperGit seems to be the least polished option of the ones discussed in this article.
IPFS/IPLD has seen steadier development. The IPFS/IGIS fork came along later and addressed many of the limitations of the IPFS/IPLD implementation. Both IPFS implementations are excruciatingly slow to process pushes, even though pushes take place locally on your computer.
Peer discovery
GitTorrent uses an implementation of the BitTorrent mainline Distributed Hash Table (DHT) to discover others who are sharing the repository you want to download. Instead of querying a peer database on a centralized server, you query the other participants in the DHT to discover which peers host the Git repository you're interested in.
To connect to the DHT, you need to go through what is known as a bootstrap or introduction server. GitTorrent's bootstrap server has been offline for some time. I've previously discussed how DHT bootstrapping can be made more resilient. You can configure a different mainline DHT-compatible bootstrap server. However, other people using GitTorrent must also manually configure a bootstrap server on the same DHT. This makes it harder to adopt GitTorrent in a project.
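For context, this is roughly what using GitTorrent looks like when the DHT is reachable, per the project's own documentation. The hex address below is a placeholder, and the package is unmaintained, so expect installation trouble:

```shell
# Install the Git remote helper (unmaintained; expect deprecation
# warnings and possibly broken dependencies on current Node.js):
npm install -g gittorrent

# Clone over BitTorrent. The gittorrent: remote helper resolves the
# repository through the mainline DHT. The 40-character hex string is
# a placeholder repository address.
git clone gittorrent://81e24205d4bac8496d3e13282c90ead5045f09ea/myproject
```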
HyperGit relies on the Dat Protocol project's tracking server. A tracking server is a centralized database server that fulfills the same function as a DHT: it keeps track of which clients have which repositories and answers queries from other clients. As evidenced by GitTorrent's bootstrap server being offline, the tracking server is a single point of failure in an otherwise distributed system. I've recorded over 21 outages of the Dat Protocol's tracking server in 2018 and 2019.
IPFS also uses a DHT, and its bootstrapping process could benefit from increased resilience the same way the others can. Notably, HyperGit and GitTorrent use the DHT to discover a Git repository, and then query the peers they discover for who has which parts, or "chunks", of the repository. IPFS, on the other hand, uses the DHT for every single data chunk globally. This design is part of the project's goal of global data de-duplication (more on that later).
However, the IPFS architecture creates an enormous overhead of DHT traffic compared to the other protocols. It also fails to benefit from the reasonable assumption that peers who have one chunk of the repository you're interested in are likely to have more chunks you're interested in.
Lastly, Dat and IPFS can discover peers on the immediate local network (LAN) through Multicast DNS Service Discovery (mDNS-SD). This process, also known as zero-configuration networking (Zeroconf), Apple Bonjour, or Rendezvous, can be useful in office settings where everyone interested in the Git repository is connected to the same local network. It's not as relevant or useful in these times of remote work, however.
Updates and mutability
Dat archives, as used by HyperGit, are append-only file systems. An archive's creator holds a special private cryptographic key that allows them to append new data to the end of the archive. They can add new files and new revisions of existing files, but they can't remove or change an old version in the file system log. Everything is versioned.
Peers in the network announce to each other the newest version they have of an archive. At the same time, they query the network to discover even newer versions.
On the other hand, "archives" on BitTorrent, as used by GitTorrent, and on IPFS are immutable. You normally can't make changes to a "torrent" transfer or an IPFS file once it has been created. Both protocols use a file's cryptographic hash as its network address. Change the file and you change its hash.
GitTorrent solved this by building mutable torrents on top of BEP-44: Storing arbitrary data in the DHT. This "arbitrary data" is signed with the same cryptographic key as the main torrent. To push updates, the private key-holder pushes the hash of the newest commit to the DHT.
Clients initially download the full Git repository as of the time the torrent was created. They can then query the DHT to find the latest commit, and query other peers for the commits between the latest version they already have and the latest commit found in the DHT. I'll discuss this a bit more in the next section.
The IPFS project has an experimental side-project called the InterPlanetary Name System (IPNS). It's like the Domain Name System (DNS), but stored in the DHT. Like DNS, IPNS can translate one type of address into another. In the case of IPNS, it resolves a mutable address into an immutable content hash. You can also use something called DNSLink to piggyback the same type of look-up on DNS instead of IPNS.
IPNS has been plagued by unreliability and poor performance since its inception. I'd recommend you use DNSLink instead of IPNS with IPFS. Remember, DNS is designed to be decentralized through features like secondary authoritative servers and caching recursive resolvers.
Your IPNS address or your DNSLink-enabled domain name would resolve to the IPFS hash of the repository's newest commit. The IPFS/IGIS implementation supports updating IPNS automatically on Git pushes. The IPFS/IPLD implementation requires you to update your IPNS or DNSLink record manually, or to communicate updates through another side-channel.
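Concretely, publishing an update looks something like this. The content hash and domain name are placeholders:

```shell
# Point your IPNS name at the repository's newest root hash:
ipfs name publish /ipfs/QmPlaceholderRepoRootHash

# Or skip IPNS and use DNSLink instead: add a DNS TXT record, which
# clients then resolve as /ipns/example.com:
#
#   _dnslink.example.com.  300  IN  TXT  "dnslink=/ipfs/QmPlaceholderRepoRootHash"
```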
Packing strategy and data de-duplication
Git normally packs individual object files (commits, trees, and blobs) into single pack files. These are deflate-compressed on disk to de-duplicate repeated data within the same pack and shrink its file size. This greatly reduces the disk storage requirements of your Git repository. Git may need to repack these pack files when commits are orphaned (e.g. from a dropped branch), or to improve packing efficiency.
You don't want to make changes to existing data in a distributed file system over time, though. Needlessly changing data that everyone already has a copy of requires them to redownload the same data in slightly different packaging. The data payload of a Git commit is supposed to be immutable (unchangeable). This is where an unpacked Git repository comes into play.
You can simply choose not to use object packing within your repository. An unpacked repository stores each object in a separate file rather than packing everything neatly into one compressed file. That might sound like a trivial difference. However, the Git software project's repository (as of commit 07d8ea56f2) is 118 MB packed and 2.8 GB unpacked. That's a massive 2273 % increase in the amount of data people will need to download to retrieve a copy of the Git project repository.
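You can see the packed-versus-loose difference on any repository with `git count-objects`. Here's a small self-contained demonstration; the repository and file names are arbitrary:

```shell
set -e
# Create a throwaway repository with one sizeable text file.
git init --quiet sizedemo && cd sizedemo
git config user.email you@example.com && git config user.name You
seq 1 100000 > data.txt
git add data.txt && git commit --quiet -m 'add data'

# Freshly committed objects are stored loose (unpacked), one file each:
git count-objects -vH

# Pack and deflate-compress them into a single pack file:
git gc --quiet --aggressive
git count-objects -vH
```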
IPFS objects are content-addressable and immutable. You can't modify a file without changing its IPFS address. IPFS' whole deal, however, is global de-duplication of content-addressable data. Two IPFS nodes that add the exact same file end up with the same content-address for it. This also means that, assuming the Git repository is unpacked, each individual Git object is de-duplicated globally.
Global data de-duplication, in relation to peer-to-peer distribution, is most interesting with regards to forked repositories. An upstart project that forks off from an established project will share commit history (and hosting) with its parent repository for eternity. The more people that are interested in the same content chunks, the greater their availability and longevity in the IPFS network. Every project gains increased availability in the network by having more shared data chunks.
IPFS pinning services can help increase the availability of your Git repository. However, they're likely to overcharge you for duplicated chunks.
BitTorrent and Dat, on the other hand, are entirely focused on the model of a "torrent" or "Dat archive". Data is only exchanged around one of these objects and never crosses over. BitTorrent has a vaguely defined standard for leeching chunks off another, somehow-related torrent that the downloader is assumed to maybe have downloaded previously. It isn't implemented in many clients, and it's not found in GitTorrent either.
However, GitTorrent is smarter than your average BitTorrent client. It can request that peers pack and transfer the set of Git objects it needs. For example, if the last commit it has is aaaa and the newest commit is dddd, it can request that a peer packs all the commits between those two. The sender will need to spend extra processing time assembling and compressing a pack for each receiver. However, this approach significantly reduces the disk I/O and network overhead involved in sending loads of tiny files. This is similar to how a "smart" Git web server works.
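In plain Git terms, that on-demand pack for the range between the placeholder commits aaaa and dddd can be produced like this:

```shell
# List every object reachable from dddd but not from aaaa, and stream
# them into a single compressed pack, much like a "smart" server
# (or a GitTorrent peer) does. aaaa and dddd are placeholder commits.
git rev-list --objects aaaa..dddd |
  git pack-objects --stdout > incremental.pack
```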
HyperGit is built on top of HyperDB, an append-only database, and Dat. Existing database entries are immutable, and HyperGit stores Git objects in the database unpacked. Although the Git objects are "packed" into a single database file, HyperGit can't take advantage of compression; the database itself is also immutable. Like with the IPFS implementations, you end up with the same file-size bloat affecting both transfer sizes and storage requirements.
Data loss
P2P can be great for distributing redundant copies of your repositories. Anyone interested in a repository will, at least temporarily, also participate in hosting and distributing it. Every project collaborator will, at least intermittently, participate in its distribution and hold a complete backup copy of the project in its entirety.
However, the IPFS options come with a huge caveat. An IPFS node caches and distributes all data that passes through it, which can consume a lot of local storage capacity. IPFS nodes use a garbage collector to clean ephemeral cached data out of the local IPFS repository and free up disk space as needed. The garbage collector deletes every object from the repository that hasn't been "pinned".
Both the IPLD and IGIS implementations pin Git objects when you initialize or push to an IPFS-backed Git repository. However, neither pins the root directory of the repository! The root directory is the collection of files that together make up the repository. You don't lose your data per se, as the Git objects are safe. However, you do lose the primary object that holds it all together. This is also the hash you'd share directly with other contributors for them to pull complete copies of your repository. You might be able to retrieve a copy of the repository root directory from another contributor who hasn't run the garbage collector.
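Until the implementations address this, you can work around it by pinning the repository root yourself. The content hash is a placeholder:

```shell
# Pin the repository's root directory so the garbage collector
# leaves it alone:
ipfs pin add QmPlaceholderRepoRootHash

# Later, reclaim space from everything that is *not* pinned:
ipfs repo gc
```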
File size limit
WebTorrent, which GitTorrent is based upon, stores files in memory. Individual files you download through GitTorrent can't exceed your available memory capacity.
That might sound bad, but the InterPlanetary Linked Data (IPLD) implementation, a mapping layer between Git's object hashes and the corresponding IPFS objects, is limited to files of just under 2 MB. This should be fine for smaller projects as long as you don't refactor the entire project in one go or add large art or other binary assets to the repository.
Neither HyperGit nor IGIS has any inherent file size limit. HyperGit can randomly produce error messages saying something about 8 MB being the maximum. This is a temporary problem in the underlying Dat implementation and is only tangentially related to your Git repository. Git itself can become slow when dealing with large files, however.
Hash algorithm
The BitTorrent protocol uses SHA-1 to identify and locate file transfers. SHA-1 has been deprecated for years, however, and is considered a weak hash algorithm at best. It has even been demonstrated that it's possible to produce a controlled hash collision: an identical hash from different input data.
BitTorrent protocol version 2 migrates the protocol to SHA-256, which makes hash collisions vastly less likely. Version 2 has been on the books for years, but there haven't been many implementations. GitTorrent, being based on the WebTorrent project, uses a WebRTC variant of protocol version 1.
Git also uses SHA-1 internally to reference commits. However, it's also possible to sign commits with a GPG key to shore up security.
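For example, you can sign every commit and verify signatures when inspecting history. The key ID below is a placeholder; signing requires a matching key in your GPG keyring:

```shell
# Tell Git which GPG key to use and sign every commit by default
# (the key ID is a placeholder):
git config user.signingkey 0xDEADBEEFCAFE1234
git config commit.gpgsign true

# Signatures show up when inspecting history:
git log --show-signature -1
```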
IPFS' sister-project, InterPlanetary Linked Data (IPLD), is a mapping layer between Git's object hashes and the corresponding IPFS objects (SHA-256) for the same data. To replace a Git commit, you'd need to create a collision for both the SHA-1 Git object and the corresponding IPFS object. The InterPlanetary Git Service (IGIS) implementation doesn't bother with the IPLD translation layer and relies on SHA-256 exclusively.
Conclusions
Peer-to-peer Git hosting may sound appealing to some. At least, it sounded very appealing to me! However, many of the current implementations look less appealing after digging deeper into the subject.
I don't think anyone should use any of the P2P options unless they're committed to also working on improving the tools. They're too complicated to get started with, and too hard to understand to deploy with confidence. You don't want your project's distribution method to be so complicated that it becomes an unreasonable burden to its adoption.
GitTorrent seems to have made the best implementation of a P2P overlay for Git. Unlike the other options, it doesn't have a huge storage and transfer-size overhead from relying on unpacked Git repositories. However, it can't be used out of the box, and it would require some work to resolve security issues and update its dependencies.
Bonus: Decentralized options
So, maybe the distributed options for Git aren't quite there yet. However, you can also consider using a decentralized option instead. Rather than relying on a network of peers, decentralized options rely on one or more servers.
If you just don't want to host your next project on GitHub, host it on your own web server. It's quick and easy to do, and it helps increase the diversity of the Git hosting ecosystem.
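Serving a repository over "dumb" HTTP takes only a few commands. A sketch, assuming an existing repository named myproject and a web root at /var/www/html (both placeholders):

```shell
# Export a bare copy of the repository into the web root:
git clone --bare myproject /var/www/html/myproject.git
cd /var/www/html/myproject.git

# Generate the index files that "dumb" HTTP clients rely on, and
# enable the stock hook that regenerates them after every push:
git update-server-info
mv hooks/post-update.sample hooks/post-update

# Anyone can now clone over plain HTTP(S):
#   git clone https://example.com/myproject.git
```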
There are more decentralized options available than there are P2P options. It's easier to implement a decentralized option because one or more servers take care of much of the complexity a P2P implementation would introduce. The two most notable options are Git itself and Secure Scuttlebutt (SSB).
| Feature | git | git (+ any web server) | git-ssb |
|---|---|---|---|
| Protocol | Git, HTTPS | HTTPS | Secure Scuttlebutt |
| git-remote | git:, git+https: | https: ("dumb") | git-ssb: |
| Status | Active (2020) | Active (2020) | Active (2020) |
| Packing strategy | On-demand packing | Packed or unpacked | Unpacked |
| File size limit | Unlimited | Unlimited | 5 MB (soft) |