Announcing the Microsoft.Toolkit HighPerformance 7.0 release
The 7.0 update of the Windows Community Toolkit is finally available, including a huge number of new features (like the new MVVM Toolkit, which we presented at .NET Conf), bug fixes and lots of changes and improvements also implemented thanks to the amazing support and feedbacks from the whole community. It also includes the second major public release of the Microsoft.Toolkit.HighPerformance package, with a large number of brand new APIs being included, performance improvements and .NET 5 support out of the box.
As a small recap on the HighPerformance package itself: this is a library part of the Windows Community Toolkit and targeting a number of different runtimes and profiles (ranging from .NET Standard 1.4 to .NET 5), that includes a collection of extensions and helper methods that can help when working on performance and memory usage critical applications. A brief introduction and the official docs are available in the MS docs website.
In this blog post I wanted to through some of the new features being included in this new version of the HighPerformance package, explain how they work and how they can be used to help in performance oriented scenarios 🚀
Memory2D<T> and Span2D<T> types
T he Memory<T> and Span<T> are two widely used types from the .NET BCL (Base Class Library) that implement a nice abstraction over arbitrary buffers, but they offer no way to properly extend that same level of abstraction to arbitrary 2D memory areas. This is both due to the fact that they lack useful helper methods and indexers to work in two dimensions, and because they inherit some restrictions from other parts of the ecosystem that make them not ideal for working with large underlying buffers, such as the fact that they can only store a length of size int. This becomes a particularly noticeable constraint when working in multiple dimensions, as it is easy to exceed the maximum supported value even with individual indices on each axis being relatively small. For instance, a 2D memory area of size 50.000x50.000 will be enough to exceed the int.MaxValue limit.
The Memory2D<T> and Span2D<T> types are meant to be a natural extension for Memory<T> and Span<T>, but working in 2 dimensions. They include several key features such as offering a flexible abstraction of discontiguous buffers due to their ability to slice 2D memory areas, meaning that each row might not be adjacent to the end of the previous one. They also do not share the same int.MaxValue limitation as standard BCL types — while the length of each axis is still restricted to int for consistency with .NET arrays, the total size is exposed as a nint value. All the internal indexing operations are performed using nint values too, so that on 64 bit architectures it is possible to index much larger memory areas than it was before.
Since one line of code is worth a thousand words, here is an example of how all these various APIs can be used to perform different operations:
These are just some of the features enabled by these new types, and there’s plenty more available APIs that can be used to easily work with these new types as well as to interoperate with the existing types from the BCL!
The StringPool type implements a fast, thread-safe and configurable pool for string instances, using a least recently used (LRU) caching policy. It can be used to minimize allocations when creating multiple string instances from buffers of char or byte values, for instance when parsing text documents with many repeated values, like a CSV database, or when parsing tokens from web requests. It provides a mechanism that is somewhat similar to string interning, with the main differences being that the pool is configurable, can be reset, is implemented with an automatic replacement policy and doesn’t require to pre-instantiate string objects, so it can save the initial allocations as well when working on temporary buffers.
We put this to the test with a benchmark parsing the taxi-fare-train.csv file, a 24MB CSV database from dotnet/machinelearning containing many duplicated text labels. The full benchmark gist can be found here, but the important bit is the difference in code to parse each string value:
Here StringPool will take care of transcoding the input UTF-8 sequence into UTF-16 (working on buffers rented from a pool), and then it will try to find an existing string instance matching that content. If it is found, its timestamp is updated and that instance is returned, otherwise a new one is created and added to the internal map. This is all completely transparent to users, that just need to change a single line in this case. This is just one of the available StringPool APIs — there are overloads taking a ReadOnlySpan<char> as input (and doing no transcoding), or a string (similarly to how string.Intern(string) works), and more. For now, let’s look at a benchmark to see how this change changed the performance and memory usage of our code (using both the two possible solutions displayed above):
We can see that the performance is just a bit worse when using Encoding.UTF8, which makes perfect sense considering both versions of the code still need to do the same transcoding from UTF-8 to UTF-16, but the StringPool one also needs to deal with the additional logic to update its internal data structures — despite this the difference is still just 8%. On the other hand, doing the decoding manually with Utf8.ToUtf16 lets us reduce the overhead and gives us better performance than the original version, even including the extra work needed to manage the pooling. And in both cases, if we look at the memory usage, we can see that using StringPool reduced the total allocations from 67 MB to just 512 B: this new implementation is using over 100,000x less memory! If we use the profiler to inspect the allocations, we can verify that this new version is only ever allocating the unique string instances one for each label we encountered, and then working with no additional allocations from that point on.
How does it work?
We can briefly go into some more details on how the StringPool type actually works. Here is a diagram that illustrates a simplified representation of the internal architecture being used:
There are 3 main components that make up the StringPool type:
- Each StringPool instance contains a number of buckets, that is dynamically calculated upon construction based on the maximum capacity of the pool. Each of these buckets is independent from the others and implements the actual LRU pool logic. The reason for these buckets is to provide a more fine-grained locking ability and reduce thread contention when multiple threads are trying to work on a given StringPool instance. Depending on the input sequence of characters being used to index the pool, a target bucket is selected and a lock is taken on just that bucket, so that other threads will be able to still work concurrently on other buckets. This is possible because each unique ReadOnlySpan<char> value will only map to a single bucket.
- Each bucket contains a hashset, which is implemented using the standard approach of using hashcode-indexed buckets and chained sequences, but then specifically tine-tuned to work within the StringPool type. For instance, each entry also stores the index of the associated min-heap item (this is needed to perform the downheap operation when an existing map instance is accessed), and some operations have also been heavily optimized by leveraging how the StringPool type work.
- Each bucket also contains a min-heap, implemented as a 0-indexed array. Each heap entry contains a timestamp, which is incremented every time a string instance is retrieved from the pool, and the index of the relative index in the chained list in the map.
As for the Memory2D<T> and Span2D<T> types, feel free to have a look at the source code if you’re interested in more details about all this!
Due to how the Memory<T> and Span<T> types work, there is no intuitive way to cast a Memory<TFrom> instance to a Memory<TTo> instance, such that the underlying data would just be reinterpret-cast to type TTo when getting a Span<TTo> from the second Memory<T> instance: there are APIs to to do that for Span<T> instances, but not for Memory<T>. The reason for that is that with a Span<T> it’s just a matter of retrieving the wrapped reference, reinterpreting it and then adjusting the size, then creating a new Span<T> instance (there is a built-in API for this in the BCL as well, it’s MemoryMarshal.Cast). But a Memory<T> instance is completely different: it wraps an object which could be either a string, a T array or a MemoryManager<T> instance and stores a starting offset and length, and then dynamically creates a new Span<T> when needed. Because of this, “casting” a Memory<T> is not as trivial, and there is no built-in way to do this with the existing APIs.
I originally had the idea for this feature after seeing this issue in the amazing library ImageSharp, and after adding (limited) support for this there I decided to also expand on this feature and make it part of the HighPerformance package, given the number of other developers that expressed interest in using this API as well. The usage is extremely straightforward and mirrors how you can use extensions to cast a Span<T> instance as well:
These extensions work by first retrieving the underlying data storage for the input Memory<T> instance (so that they can work on it directly for better efficiency), and then creating a custom MemoryManager<T> instance that is responsible for creating Span<T> instances of the target type when requested. There are three different types of custom memory managers (for string, T arrays or MemoryManager<T>), so that the extension can automatically select the right one and reduce overhead (as opposed to having a single one wrapping the entire input Memory<T> instance). Additionally, chained casts are optimized in such a way that if you attempt to cast a Memory<T> that itself was the result of a previous casting operation, the new cast will first retrieve the original data storage and wrap that one directly. This means that even if you try casting a Memory<T> instance multiple times, each resulting Memory<T> instance will only be using a single custom MemoryManager<T> instance to retrieve data, as opposed to just chaining all the previous instances as well for each cast operation.
This last feature I want to mention combines two different capabilities into one: it enables IBufferWriter<T> to interoperate with Stream-based APIs, and indirectly it also offers the ability to implement a memory pooled MemoryStream. The reason why this extension was added is that even in modern codebases going all-in with newer APIs and types such as Span<T>, it’s very common to have to deal with possibly older APIs that just accept a Stream instance as input: this extension is meant to help in these scenarios by providing an extra bridge between these two worlds.
As a practical example of where this new API could be used, here is a simple snippet of code that I found in a codebase I was recently browsing:
This code is simply compressing and decompressing data passed in a byte array, using the DeflateStream API. We can see how this code is not particularly efficient for a number of reasons, from the MemoryStream type not using array pooling to those final ToArray() calls to get the new array to return. Some of these issues could be worked around, such as replacing the ToArray() calls with GetBuffer(), and changing the return type to a Memory<byte> instead, but the end result would still not be ideal. For one, we have no way to transfer ownership of the returned buffer, so we have no way of making it a pooled array. Here is how we could rewrite the two methods above using some of the APIs in the HighPerformance package:
We can immediately see a number of differences in this second snippet. The return types have now been changed to IMemoryOwner<byte>, so we can transfer ownership and allow consumers to also dispose the buffer so that it can be returned to the pool. This means that we can amortize the memory allocations for repeated calls to this method. Similarly, we’re not using the ArrayPoolBufferWriter<T> type instead of MemoryStream, and turning it into a Stream to pass it to the DeflateStream APIs using this new AsStream() extension. This effectively lets us treat it as a pooled stream, so that as the DeflateStream type writes data to it, the internal buffer will just be expanded by renting a new larger array from the shared pool. This again lets us amortize the memory allocations over a larger number of calls. We also tweaked the input types to allow consumers to have more flexibility over the exact data source they use when invoking these two methods.
And here are some benchmarks to compare these two versions:
The pooled version is faster in both cases, especially when decompressing data, and more importantly it uses a fraction of the allocated memory every time! This is just an example of how all these APIs can be used together with the existing types and APIs from the BCL to write faster and more efficient code.
If you’re interested in learning more about the HighPerformance package, check out the docs in the .NET API browser, or the source code in the Windows Community Toolkit repo on GitHub! The new 7.0 releases many other new features, improvements and bug fixes that I couldn’t fit in a single post, and there’s more things planned for the future still! If you have any feedbacks or ideas, please feel free to open an issue as well!
Here’s a few useful links: