
How we built a new, fast file transfer protocol.

Tldr: Large file transfer today is still remarkably slow, especially across large geographic distances. It turns out you can build protocols that are much more efficient than FTP (and other TCP-based approaches) for moving data, but doing so is hard! What made this difficult for us was the lack of tooling: it is difficult to test and simulate new protocols. If your users regularly move a lot of data, you should consider using our transfer acceleration product.

Background:

Back in early 2021 my cofounder and I came across a blog post that had been shared on Hacker News, a plea for an alternative to Aspera. [1] Having experienced the pain of transferring large video and image files before, the post really resonated with us, and we wondered why our uploads were rarely close to our theoretical ISP speeds, and how products like S3 Transfer Acceleration worked.

After a few weeks of rumination and exploration, the idea of a fast, developer-friendly file transfer service caught our imagination and Tachyon Transfer was born. What followed was a few months of furious building, many mistakes and lessons learnt, many experiments, and finally an approach to building protocols that we think is worth sharing.

Problems with File Transfer today:

Many of you reading this article may be perplexed as to what the problem actually is. Maybe at work you have no issues sharing large files across the organization; coworkers just download files from databases whenever they need them. The problems with file transfer really have to do with moving files from an access network to the internet core, and then across the world.

Imagine you are producing a television show, transferring terabytes of video from a shoot in LA to an editor in London. First you have to upload your data from your camera to your PC, then you need to upload very large files from your PC, often on your home network (the access network), to an FTP server or specialized service that runs in the cloud (we can treat this as the internet core). This transfer is often a bottleneck, and depending on how physically far away you are from the end server, and whether or not you are using Wi-Fi or some other noisy internet link, it can take remarkably long. Why? Most file transfer today relies in some way on TCP as the data transfer protocol, and TCP throughput suffers massively under latency and packet loss [2]. If your internet connection is 1 Gbps and you are transferring a 10 Gb file, it should theoretically take 10 seconds, but if you experience 1% packet loss, i.e. 1% of packets are dropped, the transfer under standard TCP variants like Cubic can be 99% slower. Other variants like BBR have improved this, but even BBR can lose more than 50% of its throughput, and there are still fairness issues with the newer variants. [3]
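
To get an intuition for why loss hurts so much, the well-known Mathis et al. approximation models steady-state, loss-bound TCP throughput as roughly MSS / (RTT * sqrt(p)). The back-of-the-envelope sketch below is purely illustrative (the 140 ms RTT for an LA-to-London style path is an assumption, not a measurement), but it shows how a fat pipe collapses to a trickle under 1% loss:

    # Rough sketch of the Mathis et al. approximation for loss-bound TCP
    # throughput: rate ~= (MSS / RTT) * sqrt(1.5 / p).
    # Illustrative only; real stacks (Cubic, BBR) deviate from this model.
    import math

    MSS_BITS = 1460 * 8   # ~1460-byte segments, in bits

    def mathis_throughput_bps(rtt_s: float, loss_rate: float) -> float:
        """Approximate achievable TCP throughput in bits per second."""
        return (MSS_BITS / rtt_s) * math.sqrt(1.5 / loss_rate)

    # Hypothetical LA -> London path: ~140 ms RTT, 1% random loss
    rate = mathis_throughput_bps(rtt_s=0.140, loss_rate=0.01)
    print(f"~{rate / 1e6:.1f} Mbps")   # ~1 Mbps - a tiny fraction of a 1 Gbps link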

How would an ideal protocol act?

Before we can answer this question we need to understand two concepts: bandwidth utilization (BU) and fairness (TCP fairness, to be more precise). Bandwidth utilization is simply the percentage of your theoretical bandwidth that a file transfer is using. So on a 1 Gbps connection, 100% BU would mean the file transfer is occurring at 1 Gbps, 50% would mean 500 Mbps, and so on. Fairness is the idea that flows share a bottleneck link equally [4] - it is important to be fair to other TCP flows, as they back off exponentially in the face of congestion.

An ideal protocol would do two things: 1) maximise bandwidth utilization and 2) be fair to other flows. There are other things to be aware of, including goodput and max-min fairness, but for brevity we will not get into them here.
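
To make those two goals concrete, here is a small sketch of the metrics we have in mind: bandwidth utilization as achieved/available, and Jain's fairness index over per-flow throughputs (1.0 means perfectly equal shares, 1/n means one flow hogs everything). The throughput numbers are made up for illustration:

    # Sketch of the two metrics discussed above; the throughput values
    # are invented for illustration, not measurements.

    def bandwidth_utilization(achieved_bps: float, link_bps: float) -> float:
        """Fraction of the theoretical link capacity a transfer is using."""
        return achieved_bps / link_bps

    def jain_fairness(throughputs: list[float]) -> float:
        """Jain's fairness index: 1.0 = equal shares, 1/n = one flow takes all."""
        n = len(throughputs)
        return sum(throughputs) ** 2 / (n * sum(x * x for x in throughputs))

    print(bandwidth_utilization(500e6, 1e9))   # 0.5 -> 50% BU on a 1 Gbps link
    print(jain_fairness([480e6, 520e6]))       # ~0.998, two flows sharing fairly
    print(jain_fairness([950e6, 50e6]))        # ~0.55, one flow starving the other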

Our approach - Picking UDT and building a network simulator:

After a literature review, we decided to build on top of UDP to create a reliable high-speed protocol. TCP has a number of well-documented issues on high-BDP transfers, and we decided we could have more control if we built something ourselves. Instead of starting from scratch, however, we examined the open source libraries that were available. Two main contenders rose to the top out of the many options out there: Tsunami UDP and UDT. Both projects are used today in commercial products [5], and after some analysis we decided to build on top of UDT. We read the papers associated with the project [6] and were optimistic we could improve the code and have a near-ideal protocol in a month - how wrong we were!

It took us a little while to build UDT and get it to work, given the project is relatively old and hasn’t been updated since around 2014, but we were able to set it up on a few machines in the cloud and test a few file transfers.

One of our first problems was accurately determining how fast and efficient the protocol was. When we set up a file transfer between a machine in Singapore and a machine in NY, we found that it was about one and a half times as fast as scp at moving a 1 GB file, but in some other geographies it was just as slow, and occasionally slower. We needed a more thorough network simulator to test a variety of conditions quickly.

We looked at a number of projects that made network simulation possible, and spent a lot of time in particular setting up NS3 [7], but found after a while that it would not be possible to use UDT with NS3, given that UDT uses multiple threads. As NS3 is a discrete event simulator, the idea of step-by-step simulation didn’t really work with UDT’s multi-threaded nature. This was a major issue, and we looked into a number of other packages, including Protocol Labs’ Testground [8], but in the end decided to build our own ‘WAN emulator’.

The basic idea was to set up three machines in the cloud, A, B and C, where A is sending traffic and C is receiving. All traffic from A goes through B, essentially a virtual router, which forwards it to C. On B we ‘shaped’ the traffic with the Linux command line utilities tc and netem to add latency and packet loss in any way we wanted. We had guaranteed bandwidth of 5 Gbps from A through to C, and using our virtual router we were able to quickly simulate hundreds of different conditions. This emulator allowed us to measure bandwidth utilization and fairness in a concrete, good-enough way [9]. We logged the speeds and the packet sending and receiving dynamics, and used Jupyter notebooks to graph and visualize our results.
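
For a flavour of what the shaping on machine B looked like, here is a minimal sketch driving tc/netem from Python. The interface name and the particular delay/loss values are assumptions for illustration; our actual emulator scripts were more involved:

    # Illustrative sketch of the kind of shaping we ran on machine B (the
    # 'virtual router'). Interface name and values are assumptions, not our
    # exact scripts. Requires root and the iproute2 tools (tc with netem).
    import subprocess

    IFACE = "eth0"  # hypothetical forwarding interface on B

    def run(cmd: str) -> None:
        print(f"+ {cmd}")
        subprocess.run(cmd.split(), check=True)

    def shape(delay_ms: int, jitter_ms: int, loss_pct: float) -> None:
        """Apply latency, jitter and random packet loss on the forwarding interface."""
        run(f"tc qdisc replace dev {IFACE} root netem "
            f"delay {delay_ms}ms {jitter_ms}ms loss {loss_pct}%")

    def clear() -> None:
        """Remove the netem qdisc and restore the interface."""
        run(f"tc qdisc del dev {IFACE} root")

    # Example: emulate a noisy long-haul link, run a transfer from A to C,
    # record throughput, then restore the interface.
    shape(delay_ms=180, jitter_ms=10, loss_pct=1.0)
    # ... run the file transfer under test here ...
    clear()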

Building this infrastructure took a substantial amount of time but ultimately allowed us to set up a build, test, iterate loop that sped up our protocol development.

Our approach - Understanding UDT’s strengths and weaknesses:

UDT’s author Yunhong Gu deserves a lot of credit for his great work in creating an application level protocol that saturates high-BDP links. Additionally, the code is set up such that it is relatively easy to change the congestion control algorithm. When we ran simulations to test UDT across a number of network conditions, however, we did find many areas where the protocol could be improved. UDT performed very well in high-RTT situations across the board, and was fair both with itself and with TCP. But UDT’s main shortcoming was its performance under random packet loss - which is seen both in long distance file transfers and across media like Wi-Fi and LTE.

Overcoming Random Packet Loss:

We performed another extensive literature review, reading about TCP Reno, Cubic, Santa Cruz, Vegas and Google’s BBR amongst others, and decided that we needed a richer congestion signal. We prototyped many different signals and changed the congestion control algorithm to take RTT into account. This was a lot more involved than we initially thought it would be, and we ended up building another addition to our simulation, test, iteration loop to evaluate our ideas. We created a number of Python notebooks that let us simulate how our signals would perform on real world data before implementing changes in the UDT code itself to run on real world links. In a way this work was akin to creating the kind of backtesting platform that financial engineers and quant traders use to develop trading algorithms. This approach let us find suitable candidates for congestion measures, but also showed how finicky and capricious many potential signals were - for example, how you calculate means, standard deviations and moving averages could make or break the value of a signal that had a great theoretical underpinning.
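
To illustrate that last point (without disclosing our actual signal), here is a toy example on synthetic RTT samples: the same data gives a very different ‘queueing’ reading depending on how aggressively you smooth it, which is exactly the kind of sensitivity that kept tripping us up:

    # Toy illustration of how the choice of smoothing changes a congestion
    # reading. This is NOT our production signal; the data is synthetic.
    import random

    random.seed(7)
    base_rtt = 0.140  # hypothetical 140 ms path
    samples = [base_rtt + random.gauss(0, 0.004) for _ in range(200)]              # noise only
    samples += [base_rtt + 0.001 * i + random.gauss(0, 0.004) for i in range(50)]  # queue building

    def ewma(xs, alpha):
        """Exponentially weighted moving average over the sample stream."""
        s, out = xs[0], []
        for x in xs:
            s = alpha * x + (1 - alpha) * s
            out.append(s)
        return out

    smooth_slow = ewma(samples, alpha=0.05)   # heavy smoothing: stable but reacts late
    smooth_fast = ewma(samples, alpha=0.5)    # light smoothing: reacts fast but noisy

    # Naive queueing-delay estimate: smoothed RTT minus the minimum RTT seen.
    min_rtt = min(samples)
    print(f"heavily smoothed signal: {(smooth_slow[-1] - min_rtt) * 1000:.1f} ms of inferred queueing")
    print(f"lightly smoothed signal: {(smooth_fast[-1] - min_rtt) * 1000:.1f} ms of inferred queueing")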

When we added our signal to UDT and ran real world tests, we saw an improvement in bandwidth utilization across the board, but a smaller change than we had forecast. Identifying why was a long process, and we graduated from simple print statements and gdb to using tools like Intel’s VTune. The profiler was incredibly useful in tracking high level differences in how UDT was performing under different network conditions, and it turned out that, beyond changes to the congestion control algorithm, there were many changes that needed to be made to the mechanics of UDT’s threads themselves. This required a more substantive rewrite, and over a few months of work we ended up changing the acknowledgment logic and how the sender and receiver threads worked.

Final testing:

Before rolling out our new algorithm it was important to ensure our protocol played nicely with others; it is relatively easy to create an algorithm that unfairly takes bandwidth from others, but creating a performant, fair algorithm is another matter altogether! Under simulations in our emulator we performed fairly, and we did so in real world tests as well. Finally, our theory was sound - we still use an AIMD algorithm, which according to Jain et al. leads to fairness [10].
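
The convergence argument in [10] is easy to see in a toy model (this is not our protocol, just the textbook AIMD dynamic with illustrative parameters): two flows that start far apart, each adding a constant when there is spare capacity and halving on a shared congestion event, end up with roughly equal rates:

    # Toy model of AIMD convergence to fairness; parameters are illustrative.
    CAPACITY = 100.0   # bottleneck capacity, arbitrary units
    ADD = 1.0          # additive increase per round
    BETA = 0.5         # multiplicative decrease factor

    x, y = 80.0, 5.0   # two flows starting far apart
    for _ in range(200):
        if x + y > CAPACITY:      # shared congestion event: both back off
            x, y = x * BETA, y * BETA
        else:                     # both probe for more bandwidth
            x, y = x + ADD, y + ADD

    print(f"flow 1: {x:.1f}, flow 2: {y:.1f}")  # rates end up roughly equal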

Conclusion and Uses:

While there are still a large number of improvements to be made, we are proud of the new and improved performance, and if anyone is interested in trying the Tachyon Transfer algorithm, we offer a storage transfer acceleration API like AWS does. Our SDKs include Node, C++ and Objective-C, and could be used in a wide variety of applications.

We are particularly excited by our mobile offering - if your users regularly upload video content, traditional transfer is very slow, so consider reaching out. When we compared upload speeds on mobile, we outperformed AWS Transfer Acceleration by 30% on average for a variety of file sizes over Wi-Fi.

As a final note, if you offer blob storage (GCP, Wasabi, Cloudflare) and want a transfer acceleration product, consider reaching out!

Notes:

  1. https://www.ccdatalab.org/blog/a-desperate-plea-for-a-free-software-alternative-to-aspera/
  2. https://atoonk.medium.com/tcp-bbr-exploring-tcp-congestion-control-84c9c11dc3a9
  3. https://www3.cs.stonybrook.edu/~arunab/papers/imc19_bbr.pdf
  4. http://www.cs.newpaltz.edu/~easwaran/CCN/Week9/tcpFairness.pdf
  5. We believe Signiant is/was built on top of Tsunami, and Haivision uses UDT.
  6. https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.113.5963&rep=rep1&type=pdf
  7. https://www.nsnam.org/
  8. https://docs.testground.ai/
  9. Why good enough? We found that when adding delay and packet loss at speeds above 1 Gbps, the machine would start acting oddly; we believe this has to do with kernel specifics.
  10. AIMD flows converge to roughly equal bandwidth, given that they share the same RTT and flows have an equal probability of experiencing packet loss.
