Many home-grown deployment solutions make use of tools like Robocopy that can "sync" files during a deployment. This saves time and bandwidth, as it means only the files that have changed need to be pushed around the network. Octopus Deploy, on the other hand, uses NuGet packages, and overall we think that having a single file -- the package -- as the unit of deployment has many advantages over synchronizing a loose directory of files.
This decision comes at a cost though - we might upload a 100 MB package to a server today, then make a small change to a few files in the package, only to have to upload the entire package all over again. Multiply this over many servers and it's clear that there are potential bandwidth savings to be had.
Wouldn't it be great if we could get the best of both worlds?
I spent some time last week reading two great papers on delta compression:
- Microsoft Research's paper on Remote Differential Compression, which led me to:
- Andrew Tridgell's original paper that introduces the remote delta algorithm used in rsync (HTML version)
The concept is pretty easy when applied to Octopus and Tentacle:
The key benefit is that when deploying a 100 MB file in which only 1 MB of data has actually changed, we'd transfer ~1.3% of the file as a signature (1.3 MB), plus the delta (roughly the 1 MB that changed), plus a little extra for padding. All up, that's no more than ~3 MB instead of 100 MB.
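To make the idea concrete, here is a minimal Python sketch of the rsync-style signature / delta / patch cycle. It is not Octodiff's implementation or file format: the function names, the 2048-byte block size, the tuple-based delta, and the choice of Adler-32 plus SHA-1 are all assumptions made for illustration. For brevity the weak checksum is recomputed at every offset rather than rolled forward, which is much slower than the real algorithm.

```python
import hashlib
import zlib

BLOCK_SIZE = 2048  # assumed block size, chosen only for illustration


def create_signature(basis: bytes, block_size: int = BLOCK_SIZE) -> dict:
    """Summarise the basis (old) file as weak checksum -> [(strong hash, offset, length)]."""
    signature = {}
    for offset in range(0, len(basis), block_size):
        block = basis[offset:offset + block_size]
        weak = zlib.adler32(block)            # cheap checksum to find candidates
        strong = hashlib.sha1(block).digest()  # strong hash to confirm a match
        signature.setdefault(weak, []).append((strong, offset, len(block)))
    return signature


def create_delta(signature: dict, new: bytes, block_size: int = BLOCK_SIZE) -> list:
    """Describe the new file as copies from the basis plus literal data.

    Only the signature is needed here, not the basis file itself.
    """
    delta = []
    literal = bytearray()
    pos = 0
    while pos < len(new):
        window = new[pos:pos + block_size]
        match = None
        for strong, offset, length in signature.get(zlib.adler32(window), []):
            if length == len(window) and hashlib.sha1(window).digest() == strong:
                match = (offset, length)
                break
        if match:
            if literal:  # flush pending literal bytes before the copy
                delta.append(("data", bytes(literal)))
                literal.clear()
            delta.append(("copy", match[0], match[1]))
            pos += match[1]
        else:
            literal.append(new[pos])  # no match at this offset, slide one byte
            pos += 1
    if literal:
        delta.append(("data", bytes(literal)))
    return delta


def apply_delta(basis: bytes, delta: list) -> bytes:
    """Rebuild the new file from the basis file and the delta."""
    out = bytearray()
    for op in delta:
        if op[0] == "copy":
            _, offset, length = op
            out += basis[offset:offset + length]
        else:
            out += op[1]
    return bytes(out)
```

In a deployment, the machine that already holds the old package would run something like `create_signature` and send the result across; the machine with the new package would run `create_delta` and send the delta back; the first machine would then run `apply_delta`. Only the signature and the delta cross the network.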
Once I understood the concepts, I started to look for implementations that we could use. I really liked the look of rdiff, which as a command line tool had the kind of semantics that we needed, but support for running on Windows didn't seem to be great. Besides, if we are going to do this, it's the kind of code I think we'd want to maintain ourselves, and it's been a while since I got to sit down and write code like this.
Introducing Octodiff
Octodiff.exe is a 100% managed implementation of remote delta compression. Usage is inspired by rdiff, and like rdiff the algorithm is based on rsync. Octodiff can make deltas of files by comparing a remote file with a local file, without needing both files to exist on the same machine.
Octodiff is on GitHub and is licensed under the Apache license. It has three different commands - creating a signature, creating a delta, and applying a delta (patch).
I need your help
NuGet packages are ZIP files, and I assumed that ZIP files wouldn't work very well for delta compression, since changing a single file in the ZIP would surely make the entire ZIP different. However, this proved not to be the case - ZIPs (or at least the ones created using System.IO.Packaging) compress each file separately, so changes to a single file in the package tend to be confined to a single area of the package file. In all of my testing, Octodiff works very well.
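A rough way to see this effect for yourself is the Python sketch below. It uses Python's `zipfile` module as a stand-in for System.IO.Packaging, so real NuGet packages may behave differently, and the helper names, file names, and 512-byte block size are invented for the example. It builds two archives that differ in one file and counts how many blocks of the old archive reappear, at any offset, in the new one.

```python
import io
import random
import zipfile


def sample_files(seed: int = 0, count: int = 20, lines: int = 400) -> dict:
    """Generate some made-up, compressible text files (hypothetical content)."""
    rng = random.Random(seed)
    return {
        f"file{n}.config": "\n".join(
            f"setting{i}={rng.random()}" for i in range(lines)
        )
        for n in range(count)
    }


def build_zip(files: dict) -> bytes:
    """Build a ZIP in memory with fixed timestamps so the output is repeatable."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name, data in files.items():
            info = zipfile.ZipInfo(name, date_time=(2014, 1, 1, 0, 0, 0))
            info.compress_type = zipfile.ZIP_DEFLATED
            archive.writestr(info, data)
    return buffer.getvalue()


def reusable_fraction(old: bytes, new: bytes, block_size: int = 512) -> float:
    """Fraction of the old archive's blocks found somewhere in the new archive."""
    blocks = [old[i:i + block_size] for i in range(0, len(old), block_size)]
    found = sum(1 for block in blocks if block in new)
    return found / len(blocks)


if __name__ == "__main__":
    files = sample_files()
    old = build_zip(files)
    files["file7.config"] += "\nnewsetting=true"  # change a single file
    new = build_zip(files)
    print(f"{reusable_fraction(old, new):.0%} of the old blocks reappear in the new archive")
```

The blocks that fail to match should be the ones overlapping the changed entry and the central directory at the end of the archive, which records each entry's checksum and offset.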
Adding Octodiff to Octopus and Tentacles comes with some risks though, so before we embark on implementing it just to find out that in the real world it doesn't actually improve anything, I'd really appreciate it if you could help me:
- Download Octodiff
- Try it against some of the packages that you deploy with Octopus, especially the larger ones, but also smaller ones
- Tell me the results - how big is the delta, and how much bandwidth would you save
- Let me know if you think this feature is going to help you
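If you want a quick way to work out the numbers, something like the following Python helper would do. The function name and the file paths in the usage line are hypothetical; it assumes you have already produced a signature file and a delta file on disk, and simply compares their combined size against the size of the full package.

```python
import os


def bandwidth_report(package_path: str, signature_path: str, delta_path: str) -> dict:
    """Compare sending signature + delta against sending the whole package."""
    package = os.path.getsize(package_path)
    transferred = os.path.getsize(signature_path) + os.path.getsize(delta_path)
    return {
        "package_bytes": package,
        "transferred_bytes": transferred,
        "saved_percent": round(100 * (1 - transferred / package), 1),
    }
```

For example, `bandwidth_report("MyApp.1.0.1.nupkg", "MyApp.sig", "MyApp.delta")` would return the package size, the bytes that would actually cross the network, and the percentage saved.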
Once we have some real-world data to go by we'll be much more ready to integrate it into Octopus.
Also, currently it does take a few seconds to create deltas on larger files - it's a shame no one reading this blog post has the skills to optimize it and send a pull request ;-)