Optimizing string.Count all the way from LINQ to hardware accelerated vectorized instructions
I have always been very interested in code optimizations, even when achieving the best possible execution time is not really necessary — I consider trying to improve my code and squeezing out as much performance as possible to be both fun and a useful learning exercise.
I have recently rewritten a simple method to count the number of occurrences of a given char into a string, and in doing so I also had a chance to experiment with vectorized instructions, which is something that I had always considered fascinating but that I had never taken the time to read up on well enough to actually use them in my own code.
This blog post is a brief introduction to the world of vectorized instructions in C#, and it is meant to be accessible for developers that have never really used them before on their own, just like I had not until today. The code that is included in this post is by no means anything extremely advanced — I’m just sharing the steps I personally took in optimizing this specific algorithm, and what I learnt along the way.
If you happen to have never spent much time optimizing code, I hope this post will make you even a tiny bit more interested in this topic!
LINQ to the rescue
The first implementation that everyone will probably come up with when implementing a similar algorithm (myself included) uses the Count<T>(IEnumerable<T>, Func<T, bool>) API from the System.Linq namespace. This API is very helpful as it provides an easy way to solve this problem, which is extremely compact to write and still reasonably fast. We just need to invoke this method with a function that checks each character in our string against the one we are looking for:
This works just fine and has the big advantage of being virtually impossible to get wrong — once this method is implemented you likely will never find a bug coming from this piece of code, which is certainly a plus when working with larger projects. But, there are two major drawbacks:
- Memory allocation: the Func<T, TResult> instance we’re using in this method is generating a closure, since it’s capturing our parameter char c from the caller method. If you’re not familiar with closures, you can read up more on them in my other blog post, here. Suffice to say, it involves the runtime allocating an object every time we use this method, which causes unnecessary memory usage and GC pressure. Additionally, we’re also allocating memory for the string iterator that the LINQ method is working on.
- Slow execution: the LINQ method we’re using goes over all the characters through the iterator, which is much slower than doing the same with a traditional for loop. In this case we’re also invoking our Func<T, TResult> delegate for each individual character, which again has a higher overhead compared to simply comparing each character directly.
Return to the past
Let’s leave the user friendly land of functional programming and go back to a classic, imperative style implementation of our algorithm:
First, there is an interesting optimization that the C# compiler is doing for us: the string type (together with T arrays) is treated in a special way when used in a foreach loop, and results in that code being transformed into a classic for loop. This has the advantage of being easier to write for us, since we don’t have to deal with manual indexing of characters in the string, while still having exactly the same performance as a classic loop. Plus, since the compiler knows we are iterating exactly within the bounds of the input string, it will emit a nicely optimized assembly code with no unnecessary bounds checks. This implementation is much faster than the LINQ version, and with it the memory allocation problem has already been solved entirely.
Full steam ahead
The previous implementation has one bottleneck: each loop iteration has one conditional branch, specifically that if statement on line 7. Branches should always be kept at a minimum in high performance code, and for a more in-depth explanation of the rationale behind this I highly suggest reading up this StackOverflow question and answer. To sum it up, the CPU will try to anticipate what the result of a branch instruction will be, and every time it gets it wrong there will be a relatively high performance penalty. Luckily, in this case we can just completely remove that branch by leveraging how bool variables are stored in memory during execution.
From this point on, the snippets will make use of ref variables and methods from the Unsafe and MemoryMarshal class. I have added detailed comments over every single line to make it easy to follow, but if you want to go more in depth on this feel free to read up my previous blog post or the official MS docs.
This is another welcome improvement, but we are not quite there yet. This is about as fast as we can get without delving into more advanced techniques. And it is now time to do just that, by adding vectorization.
Some of you might correctly point out that this latest implementation might still be improved by adding loop unrolling. That is certainly true, but for the purposes of this post we will leave that part out.
What are vectorized instructions anyway?
Vectorized instructions, often called SIMD instructions (single instruction multiple data) are instructions that leverage special registers in modern CPUs to speedup certain types of operations. These registers can hold a series of values next to each other, so that a single operation can be executed on all of them at the same time.
One important thing to note is that the size of a SIMD register depends on the specific processor, and that it also influences the number of values it can hold at any time. To be precise, there are different SIMD instructions that work on different SIMD registers, each of a specific size, but the C# APIs we will use provide a nice abstraction over this and expose these registers to us as a vector type that can vary in size depending on the specific device we are executing our code on. For instance, it we have 128 bits registers at our disposal, we will be able to store either 2 double values (64 bits each), or 4 int or float values (32 bits each), 8 ushort or short values (16 bits each) or 16 byte or sbyte values (8 bits each). SIMD registers might also not be available at all, so we should always manually check whether that’s the case before executing our vectorized code.
Suppose we had two int arrays and we wanted to write to another int array of the same size the sum of each pair of values. Instead of going through each element one by one, we could load a chunk of consecutive values from each array into two separate SIMD registers, sum those registers together in a single instruction, and then copy that resulting register to the right location into the int array to return from the method. The type we need is Vector<T>, which represents a SIMD register with elements of a given type. Here is a sample with both the traditional implementation of this algorithm using a for loop, and one using the Vector and Vector<T> APIs:
With this proof of concept, we can now move on to an implementation of our original Count method that leverages the power of SIMD instructions.
Back to our original problem, there is a small complication: the Vector<T> type does not have an API that just counts the number of occurrences of a value in a SIMD register — we will have to implement this ourselves. Luckily, this only requires a couple of bit operations on top of the basic SIMD structure we have used in our previous SumSimd method:
Was it worth it?
Here are the results when running the 4 different Count implementations on a sample text (a 200 lines document with the code for the 4 Count versions described above) and looking for the ‘\n’ character. Each implementation has been executed in a loop with N iterations each time, to minimize variations.
We can see that the SIMD-powered Count implementation is 110x faster than the LINQ version, and 13x faster than the foreach version. Additionally, we can see that both the foreach and SIMD versions use no memory at all, whereas the LINQ version allocates 120 bytes of memory for each individual invocation.
Whether or not these improvements are worth the time and effort is entirely up to you (especially since even the LINQ version is still only taking a few microseconds to run), but the speedup is indeed noticeable at least on paper, and doing all of this was definitely a fun experiment!
As with my previous blog post, I have added links to every single API that I have used across all the code snippets, if you want to read up more about them. I hope you enjoyed reading this post, please let me know if you spot any mistakes in any of the method implementations!