
Reading notes for "Writing High-Performance .NET Code", by Ben Watson (2018)

This is a great book from someone who has spent countless hours looking at profiler results, troubleshooting hairy issues, and shipping high-performance .NET applications in production, and it shows.

At work I'm not assigned to improving performance on an org-wide, fundamental platform, so for the most part I can't justify the time investment that serious performance-critical work requires: careful measuring, profiling, and experimentation. Still, I hoped to learn from the author's experience and pick up more general rules applicable to my own work.

Learning #1 -- Getting to know the Garbage Collector

Something that surprised me is that the .NET Garbage Collector (GC) handles small short-lived objects at very low overhead. Before reading this book, I thought that objects were all treated the same and all incurred some similar overhead, regardless of size. This is not the case.

.NET boasts a generational GC, which leverages an empirical observation that holds for most applications: most objects are either short-lived or live for the whole application runtime.

If an object has “survived” a GC cycle, it is promoted to a later generation, which gets collected less often. This has the consequence of “bubbling up” longer-lived objects to later generations, and of concentrating most GC time on the earlier generations, which are more likely to contain garbage.
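A quick experiment (my own sketch, not from the book) makes the promotion visible, using GC.GetGeneration and forced collections:

    using System;

    var obj = new object();
    Console.WriteLine(GC.GetGeneration(obj)); // 0: freshly allocated

    GC.Collect(); // force a collection; obj survives and is promoted
    Console.WriteLine(GC.GetGeneration(obj)); // 1

    GC.Collect(); // survives again
    Console.WriteLine(GC.GetGeneration(obj)); // 2: the last generation

    GC.KeepAlive(obj); // keep obj reachable until this point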

There's also a Large Object Heap (LOH) for larger objects (by default, those over 85,000 bytes), such as memory buffers, which occupy large contiguous memory regions and are handled differently. These are assumed to be fewer in number, so a generational approach is not applied to them.
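Another small sketch of mine: an allocation over the threshold goes straight to the LOH, which is reported as part of the last generation:

    using System;

    var small = new byte[1_000];   // regular small-object heap, gen 0
    var large = new byte[100_000]; // over 85,000 bytes: allocated on the LOH

    Console.WriteLine(GC.GetGeneration(small)); // 0
    Console.WriteLine(GC.GetGeneration(large)); // 2: the LOH is collected with gen 2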

Learning #2 -- Getting to know Parallel

When writing backend or scheduled-job code, there are often times when your code is waiting on the network for some response. It is also often the case that these requests could be made with some degree of parallelism. Since doing this by hand is error-prone and hard to review, I've often avoided it unless it proved really necessary.

The book outlines the built-in Parallel family of methods, which make it easy to iterate through a numeric range or an enumerable list with N workers, each worker picking up the next item and running it to completion, in parallel.
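For example (a minimal sketch of mine), Parallel.For walks a numeric range, handing each index to the next free worker:

    using System;
    using System.Threading.Tasks;

    var results = new double[1_000];

    Parallel.For(0, results.Length, i =>
    {
        // Each index is processed exactly once; writing to distinct
        // array slots from different workers is safe.
        results[i] = Math.Sqrt(i); // stand-in for the real per-item work
    });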

Say you have a list of 10K requests to make. Firing them all off at the same time will cause a large spike in load and contention in process, thread, and I/O scheduling; the requests will step on each other's toes, and overall the work will take longer. A better approach is to pick the maximum number of requests you are willing to have in flight concurrently, then work through the list, starting a new request each time a previous one completes. This is exactly what Parallel.ForEachAsync implements, and it's an effective way to get the most out of I/O-bound tasks.
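A sketch of that pattern (the URLs are hypothetical stand-ins); MaxDegreeOfParallelism caps how many requests are in flight at once, and Parallel.ForEachAsync is available since .NET 6:

    using System;
    using System.Collections.Concurrent;
    using System.Linq;
    using System.Net.Http;
    using System.Threading.Tasks;

    var urls = Enumerable.Range(0, 10_000)
                         .Select(i => $"https://example.com/items/{i}"); // hypothetical endpoint
    var responses = new ConcurrentBag<string>();
    using var client = new HttpClient();

    await Parallel.ForEachAsync(
        urls,
        new ParallelOptions { MaxDegreeOfParallelism = 16 }, // at most 16 in-flight requests
        async (url, cancellationToken) =>
        {
            responses.Add(await client.GetStringAsync(url, cancellationToken));
        });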

Learning #3 -- Getting to know structs vs classes

I often used class everywhere to declare a new type of object, since this is what everyone else was doing, but it is good to know the exact intention of a struct: it is essentially “syntactic sugar” for passing all of its constituents by value; it just groups them together under a single value and name.

Related to this concept, there's also the semantic difference: when we pass an object to a function, what is passed is a reference to that object, but when we pass a struct, we are passing a copy of its constituent values (pass by value).
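A minimal sketch of mine that makes the difference observable:

    using System;

    struct PointStruct { public int X; }
    class PointClass { public int X; }

    static class Demo
    {
        // The struct parameter is a copy; mutating it doesn't affect the caller
        static void BumpStruct(PointStruct p) => p.X += 1;

        // The class parameter refers to the caller's object
        static void BumpClass(PointClass p) => p.X += 1;

        static void Main()
        {
            var s = new PointStruct();
            var c = new PointClass();
            BumpStruct(s);
            BumpClass(c);
            Console.WriteLine(s.X); // 0: only the copy was mutated
            Console.WriteLine(c.X); // 1: the shared object was mutated
        }
    }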

As mentioned before, .NET handles small objects just fine, so unless a struct is proven advantageous, I won't reach for one; still, it's good to know exactly what it means when one does pop up in code.

Learning #4 -- Concurrent containers

I've seen ConcurrentDictionary and ConcurrentBag in code, but until now I had found little use for them, since I was mostly writing single-threaded code. Now that I'm using Parallel where applicable, those thread-safe containers are essential for letting each thread of computation hand off its results without data races. These containers are also very convenient since they don't require manually handling lower-level thread synchronization primitives, such as mutexes and locks.
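For instance (my own sketch, with stand-in input), counting words across documents processed in parallel; AddOrUpdate does the thread-safe upsert without any explicit lock:

    using System;
    using System.Collections.Concurrent;
    using System.Threading.Tasks;

    var documents = new[] { "a b a", "b c", "a c c" }; // stand-in input
    var wordCounts = new ConcurrentDictionary<string, int>();

    Parallel.ForEach(documents, doc =>
    {
        foreach (var word in doc.Split(' '))
            wordCounts.AddOrUpdate(word, 1, (_, count) => count + 1);
    });

    foreach (var (word, count) in wordCounts)
        Console.WriteLine($"{word}: {count}");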

Finally, it's also good to know that for “read-only” values (or values that will only be read by multiple threads during the computation's lifetime), it's fine to keep the traditional non-concurrent variants. The concurrent containers are only necessary when there's mutation, which is what introduces data races.
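Sketching that last point: a plain Dictionary built before the parallel section and only read inside it needs no synchronization (the data here is a made-up stand-in):

    using System;
    using System.Collections.Generic;
    using System.Threading.Tasks;

    // Built once up front, then treated as read-only
    var prices = new Dictionary<string, decimal> { ["apple"] = 1.5m, ["pear"] = 2.0m };
    var basket = new[] { "apple", "pear", "apple" };

    Parallel.ForEach(basket, item =>
    {
        // Concurrent reads of a non-concurrent Dictionary are safe,
        // as long as nothing mutates it while the loop runs.
        Console.WriteLine($"{item}: {prices[item]}");
    });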