Software and Computer Engineering

Practical One-Billion Row Challenge

The One-Billion Row Challenge consists in writing a program to process a plain-text file with one billion entries, optimizing for speed and trying to see how fast you can make it.

Instead of starting with a program, profiling it, and reaching at some optimized version, I'm curious about how the choice of programming language and runtime influences the total run time.

I think that in most practical cases, one can hardly justify investing too much time at optimizing an one-off data-crunching program. But there is an underlying aspect that -- I hypothesize -- has a huge impact and requires barely any time investment: the underlying programming language (and runtime) choice.

For further simplifying the challenge and to get a lower-bound of runtime, the only goal of the program will be only to count the number of lines in the text file.


You can count the number of lines in a text file with with the wc -l Unix command. I'm using the wc command as implemented by GNU coreutils 9.4. The result from this will serve as the baseline.

For the contenders, I chose some popular languages that I am proficient in and wrote some obvious “simple-as-possible” code in each. The contenders are:

For the execution environment, I'm using Arch Linux 6.7.9, and two different machines: a 1st generation Surface Book with a 6th generation i5 processor, and a desktop computer with a 12th generation i5 12600K.

For measuring, I'm using hyperfine with one round of warm-up. All programs implemented are single-threaded.


Surface Book

The dataset is around 12GB~13GB in size, which doesn't fit all at once in the 8GB RAM of the Surface Book. This means that the disk I/O will play some role. Still, I suspect this challenge is still mostly CPU-bound.

wc:          17.3s +- 0.1s
go:          31s   +- 1s
dotnet AOT:  46s   +- 3s
dotnet JIT:  52s   +- 6s
pypy:        2m20s
cpython:     3m56s
node:        4m58s

NOTE: For the last three runs in the table above (pypy, cypython and node), I only used the built-in time command in the Bash shell, as hyperfine would take too much time to complete the full 10 runs.

Desktop i5 12600K

This machine has 32GB of RAM, so the whole dataset fits in RAM and is cached by the kernel after the first warmup run by hyperfine. The bottleneck will be mostly how quickly the CPU can crunch through the program and the data.

wc:           0.96s +- 0.02s
go:           8.0s  +- 0.1s
dotnet AOT:  16.4s  +- 0.1s
dotnet JIT:  16.7s  +- 0.4s
pypy:        52s    +- 0.4s
python:      66s    +- 2s
node:        78s    +- 2s

NOTE: All measurements above were done with 1 warm-up run and 10 measured runs with hyperfine.


First of all, I'm happy that my 1-year-old desktop is about 18 times faster at this task than my 9-years-old 2-in-1 computer.

I'm surprised how poorly Node performed relative to Python. I was expecting it to outperform Python, but I guess the outcome makes sense in the light of that we're dealing with a CPU-bound task, not an IO-bound task.

I'm also surprised that using pypy instead of CPython only cut the runtime down by between 20% and 40%. I was expecting an order of magnitude improvement, but I guess my mental model of how pypy works is not quite correct.

I was also expecting the dotnet AOT compilation to more significantly outperform the JIT'ed version. In the results seen, it improved runtime between 11% and 2%, which might not be worth it at this time given how experimental it still is.

Go seems to be a great choice for data crunching, paying the smaller runtime penalty for many improvements over old-school C programming: memory-safety, concurrency primitives, modern tooling, and so on.

Dotnet is not bad as well, and comparable to Go, only between 50% and 106% times slower for this CPU-bound data-crunching task.

Python and Node are great to jot down quick one-off scripts and get tasks done quickly, but the dynamic typing and respective runtimes impose some heavy performance penalties.

Given how high-level and ergonomic both Go and .NET are, it is might worth considering using those languages if data-crunching is involved and the whole dataset can't be fit into memory.