I have recently been spending a lot of time working on ComputeSharp, a .NET Standard 2.1 library written in C# 8.0 that lets you run code in parallel on the GPU through DX12 and dynamically generated HLSL compute shaders. That is a somewhat unintuitive description for a library that does something conceptually simple: it runs code in parallel in a manner similar to Parallel.For, but on the GPU instead of on the CPU.
The library works as follows: just like Parallel.For it takes an Action<T> instance representing the code you want to run on the GPU, it then decompiles it to generate the shader code to be compiled and executed on the GPU, and inspects it to identify all the variables that the code is accessing. Once these values have been discovered, they need to be properly loaded so that the GPU can access them and the shader can finally be executed.
This post is a case study into some of the technical challenges I faced when writing the code that handles the variables extraction and loading, and how I managed to get a >20x performance improvement when moving from reflection to dynamic code generation. It is not meant to be a tutorial to follow step by step, but more of a summary of how I approached this problem and eventually solved it. I have learnt a lot while working on this library, and here I just want to share some of that knowledge and hopefully give new ideas to other developers working on their personal projects. After all, I got the idea to use dynamic code generation while going through this Reddit post, and I would really like people reading this post to find the same inspiration I had to start learning about some new topic I had never really explored before.
I will start by introducing the issue at hand, and I will generalize it so that we will not need to care about the specifics of my library or its codebase. I will also be describing all the necessary details about how the C# compiler works and what it does behind the scenes, so that every C# developer will be able to understand what is going on at any time, even without previous experience in this particular area.
If you have never dealt with reflection or dynamic code generation before, welcome to the magical world of metaprogramming!
The goal we are trying to achieve
We want to write a method that given a Delegate instance with an associated closure retrieves the values of all the variables that are captured by that closure. Additionally, we want all those values to be loaded and stored in two different arrays: an object array for reference types and a byte array for value types. We will serialize all value types assuming that they respect the unmanaged constraint.
Let’s consider the following code snippet with a practical example of a delegate, to introduce the concept of closures in the C# language:
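The original snippet is not embedded in this copy of the post; here is a minimal reconstruction of such an example (the names `array` and `value` are the ones referenced in the rest of the post):

```csharp
using System;

public static class Example
{
    public static Action<int> CreateAction()
    {
        int[] array = new int[10]; // a captured reference type
        int value = 7;             // a captured value type

        // The lambda accesses both locals, so the compiler has to
        // create a closure to keep them alive past this method
        Action<int> action = i => array[i] = value;

        return action;
    }
}
```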
The Delegate we defined (an Action<int> instance in this case) is accessing the variables “array” and “value” from its parent method. To make this possible the C# compiler creates a closure, which, in Jon Skeet’s words, can be described as follows:
[…] a block of code which can be executed at a later time, but which maintains the environment in which it was first created — i.e. it can still use the local variables of the method which created it, even after that method has finished executing.
We can look at the code the C# compiler generates when compiling the previous code snippet to see how this is achieved:
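The decompiled output is not included here; the following hand-written equivalent shows the same shape. The real generated name is `<>c__DisplayClass0_0`, which cannot be written in C#, so a legal stand-in name is used:

```csharp
using System;

// Hand-written equivalent of the compiler-generated closure class.
// The actual generated name is <>c__DisplayClass0_0.
internal sealed class DisplayClass0_0
{
    public int[] array;
    public int value;

    // The method corresponding to our lambda expression
    internal void Lambda(int i) => array[i] = value;
}

public static class Lowered
{
    public static Action<int> CreateAction()
    {
        // The compiler replaces the lambda with an instance of the
        // closure class, assigning the captured locals to its fields
        var closure = new DisplayClass0_0
        {
            array = new int[10],
            value = 7
        };

        return new Action<int>(closure.Lambda);
    }
}
```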
The compiler has created a closure class, which in this case is <>c__DisplayClass0_0. The class name is not valid according to the C# syntax (as it starts with an angle bracket), but the compiler is actually quite happy with it anyway. This is particularly useful to avoid having to worry about potential name collisions with user-defined types. This class exposes two fields representing our two captured variables, and a method which corresponds to the one we specified when declaring the Delegate instance. We can see that our code has been replaced by an instantiation of the closure class, and that our two variables have been assigned to the two fields in that class, which is how they can still be used even after the method has returned.
Note that if our lambda expression had not actually captured any variables, the resulting closure class would have had a slightly different structure. We are not going to take that scenario into account and just assume that there will always be at least one captured variable.
The previous structure becomes slightly more complicated if we capture variables that are declared in different scopes, like in this example:
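The original snippet is missing here; a minimal reconstruction of such an example, with an extra scope created by a plain pair of brackets, could look like this:

```csharp
using System;

public static class NestedExample
{
    public static Action<int> CreateAction()
    {
        int[] array = new int[10]; // captured from the outer scope

        {
            int value = 7; // captured from an inner scope

            // The lambda captures variables from two different
            // scopes, so the compiler emits two closure classes
            return i => array[i] = value;
        }
    }
}
```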
Here we have created a new scope using a pair of brackets with no additional logic tied to them, but the same would have happened if we had used a for loop, a using statement or any other scoped constructs. Let’s take a look at the generated closure in this case:
There are two of them now! The compiler has created a separate closure class for each scope that the captured variables belong to, going from the innermost to the outermost one. The closure class that hosts the method representing our lambda expression has a field named CS$<>8__locals1, which does not correspond to any of our captured variables: that is a reference to the instance of the other closure class containing the captured variable from the outer scope. This will happen for any additional scope, which means that we will need our code to handle any number of nested closure classes.
The first approach to solve this problem simply consists in using reflection, both to inspect the closure class (or classes) to identify the fields we are interested in, and to extract the values of those fields when needed. To introduce the topic to those that have never used reflection before, here is a small example that shows how we can use reflection to retrieve the value of a given property of an instance of a given type:
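The small example the post refers to is not embedded in this copy; a sketch of it, wrapped in a hypothetical helper method, could look like this:

```csharp
using System;
using System.Reflection;

public static class ReflectionDemo
{
    // Retrieves the value of a named property from an object whose
    // type we do not know at compile time
    public static object GetPropertyValue(object obj, string name)
    {
        // Get the runtime Type, then a PropertyInfo for the property
        Type type = obj.GetType();
        PropertyInfo property = type.GetProperty(name);

        // GetValue returns the property value boxed as an object
        object value = property.GetValue(obj);

        Console.WriteLine(value);

        return value;
    }
}
```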
Let’s imagine that we did not know the type of our object variable (which in this case is a string), but we did know for sure that it had a property called “Length”. Here we are retrieving the Type of that object and then a PropertyInfo instance for the property we need. From there we can get the value of that property for the instance of our object by calling GetValue(object). This method returns the value of a property as an object, which in this example we are just displaying on the console.
In our scenario we will not know in advance the names of the fields in a given closure class, so we will need to retrieve them at runtime, like this:
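The referenced snippet is not included here; a sketch of it, with a sample type standing in for a closure class, could be:

```csharp
using System;
using System.Reflection;

// A sample type standing in for a compiler-generated closure class
public sealed class SampleType
{
    public int[] array = new int[10];
    public int value = 7;
}

public static class FieldsDemo
{
    // Iterates over the public fields of the runtime type of the
    // given object, printing each field's name and value
    public static void PrintFields(object obj)
    {
        FieldInfo[] fields = obj.GetType().GetFields();

        foreach (FieldInfo field in fields)
        {
            Console.WriteLine($"{field.Name}: {field.GetValue(obj)}");
        }
    }
}
```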
Here we are retrieving a list of all public fields of the type of our object using the GetFields() method, which returns an array of FieldInfo items. Then we are just iterating over that array to display the name and value of each field.
The key concept from these examples is that retrieving the values of all the fields (or properties) of an instance of a given type requires two separate steps: we first have to obtain the FieldInfo collection for the type of our object, and then get the value that those fields have for our object instance.
Let’s give it a go
We can now write a first version of our method to extract the captured variables for a given Delegate:
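The snippet itself is missing from this copy of the post; here is a sketch of such a first version, following the description below (the helper names are illustrative):

```csharp
using System;
using System.Collections.Generic;
using System.Reflection;

public static class ClosureLoader
{
    // Pre-order exploration of the closure class hierarchy: fields
    // whose name starts with "CS$<>" reference nested closure
    // classes, so we recurse into them; everything else is a
    // captured variable and is yielded directly
    public static IEnumerable<FieldInfo> GetCapturedFields(Type type)
    {
        foreach (FieldInfo field in type.GetFields())
        {
            if (field.Name.StartsWith("CS$<>"))
            {
                foreach (FieldInfo nested in GetCapturedFields(field.FieldType))
                {
                    yield return nested;
                }
            }
            else yield return field;
        }
    }

    public static IEnumerable<FieldInfo> GetCapturedFields(Delegate instance)
    {
        // Delegate.Target references the closure instance the
        // delegate will be invoked upon
        return GetCapturedFields(instance.Target.GetType());
    }
}
```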
Here we are doing the classic pre-order exploration of our class hierarchy in the input closure: we simply check whether a field name starts with “CS$<>”, which is how the C# compiler marks generated fields for nested closure classes, then if that is the case we recurse and explore that tree, otherwise we simply yield the current field. Note how the Delegate class exposes a handy Target property that gives us a reference to the instance of the closure type that the delegate will be invoked upon. Let’s test this to see if everything works as expected:
Everything seems fine, but there is a catch: this code will throw an ArgumentException if our closure class contains a nested closure. That is because when we call field.GetValue(action.Target) we are actually passing a reference to the instance of the outermost closure type. But if the field belongs to one of the nested instances, that call will fail because the instance we are using is not the one exposing the field we are trying to get the value for. We can fix this by keeping track of the parents of each field. We will also add a ClosureField class to make our updated code a bit cleaner:
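The updated snippet is not embedded here; a sketch of it, with a hypothetical shape for the ClosureField class (the post's actual definition may differ), could be:

```csharp
using System;
using System.Collections.Generic;
using System.Reflection;

// Pairs a captured field with the chain of parent closure fields
// that leads to it from the root closure instance
public sealed class ClosureField
{
    public FieldInfo Info { get; }
    public IReadOnlyList<FieldInfo> Parents { get; }

    public ClosureField(FieldInfo info, IReadOnlyList<FieldInfo> parents)
    {
        Info = info;
        Parents = parents;
    }

    // Walks the parents chain to resolve the instance that actually
    // exposes the target field, then reads the field value from it
    public object GetValue(object root)
    {
        object instance = root;

        foreach (FieldInfo parent in Parents)
        {
            instance = parent.GetValue(instance);
        }

        return Info.GetValue(instance);
    }
}

public static class ClosureLoader
{
    public static IEnumerable<ClosureField> GetCapturedFields(Delegate instance)
    {
        return Explore(instance.Target.GetType(), new List<FieldInfo>());
    }

    private static IEnumerable<ClosureField> Explore(Type type, List<FieldInfo> parents)
    {
        foreach (FieldInfo field in type.GetFields())
        {
            if (field.Name.StartsWith("CS$<>"))
            {
                // Recurse into the nested closure, extending the path
                var path = new List<FieldInfo>(parents) { field };

                foreach (ClosureField nested in Explore(field.FieldType, path))
                {
                    yield return nested;
                }
            }
            else yield return new ClosureField(field, parents);
        }
    }
}
```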
With this change we can now retrieve the values of all fields, both nested and not, and we can test that by running the following code:
The final step is to write a method that takes as input a Delegate and the result of our initial exploration (the list of fields and their relative parents), and produces the two arrays we are interested in. This is how such a method could be structured:
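The method is not embedded in this copy of the post; here is a simplified sketch of how it could be structured. For brevity it assumes a flat closure (fields living directly on the root instance) and uses Marshal.Copy for the final copy, instead of the Unsafe APIs discussed in the breakdown below, to keep the sketch free of unsafe blocks:

```csharp
using System;
using System.Buffers;
using System.Reflection;
using System.Runtime.InteropServices;

public static class ClosureData
{
    // Loads the values of the given captured fields into an object
    // array (reference types) and a byte array (value types)
    public static (object[], byte[]) GetData(object closure, FieldInfo[] fields)
    {
        int objectCount = 0, byteSize = 0;

        // First pass: compute the sizes of the two target arrays
        foreach (FieldInfo field in fields)
        {
            if (field.FieldType.IsValueType) byteSize += Marshal.SizeOf(field.FieldType);
            else objectCount++;
        }

        // Rent the arrays from the shared pool (they can be larger
        // than requested; we ignore that detail here for brevity)
        object[] objects = ArrayPool<object>.Shared.Rent(objectCount);
        byte[] bytes = ArrayPool<byte>.Shared.Rent(byteSize);

        int objectOffset = 0, byteOffset = 0;

        foreach (FieldInfo field in fields)
        {
            object value = field.GetValue(closure);

            if (field.FieldType.IsValueType)
            {
                int size = Marshal.SizeOf(field.FieldType);

                // Pin the boxed value so the GC cannot move it while
                // we copy its raw bytes into the byte array
                GCHandle handle = GCHandle.Alloc(value, GCHandleType.Pinned);
                try
                {
                    Marshal.Copy(handle.AddrOfPinnedObject(), bytes, byteOffset, size);
                }
                finally
                {
                    handle.Free();
                }

                byteOffset += size;
            }
            else objects[objectOffset++] = value;
        }

        return (objects, bytes);
    }
}
```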
This might seem complicated for those not familiar with unsafe code, references and pointers, so let’s go over this snippet step by step.
- First we are iterating over the captured fields to calculate the length of the arrays we will need to populate. To get the size in bytes of a captured value type, we can use the Marshal.SizeOf(Type) method.
- We then allocate the two arrays, and to keep track of our position in each one of them we initialize two running offsets. The arrays are allocated using the ArrayPool<T> class, which provides a high-performance pool of arrays of a given type. This means that instead of allocating a new array each time with new T[int], the pool will internally reuse arrays whenever possible, which saves us some time. One thing to note is that when renting arrays through this class, they are not guaranteed to be of the exact size we requested: they could also be bigger. In this snippet we are just returning those arrays to the caller ignoring this implementation detail, just to keep the code more compact.
- Then we iterate over each captured field. If it is not a value type we can just assign it to the right position in the first array. If it is, we need some additional work. FieldInfo.GetValue(object) returns the value of a field as an object, which means that if the field type is a value type (e.g. an int), it gets boxed: the runtime will copy it onto the heap and pass it around by reference. We are interested in the actual value of that field, not in a reference to it, so we need a way to access that value from here. To do that, we first allocate a GCHandle. We also tell the runtime to “pin” that object, so we avoid the risk of the garbage collector moving it to another memory location while we are reading from it. Once we have our handle, we can use AddrOfPinnedObject() to get the address where the data we need is stored. Now we need two refs pointing to the pinned data and to the target position inside our array: we use Unsafe.AsRef<T>(void*) to convert the address of the pinned object to a ref value, and Unsafe.Add<T>(ref T, int) to shift the ref pointer into our byte array to the right position. Finally, we use Unsafe.CopyBlock(ref byte, ref byte, uint) to copy the data from one location to the other.
First steps with IL
If you are not skipping this paragraph, you might have never heard of the word “IL” before. Here is a brief summary of everything you need to know to be able to follow along in the rest of this post:
- IL stands for Intermediate Language, and it is the low-level common language that all the languages running on a .NET runtime (e.g. .NET Core) are compiled to. Whenever you compile some C# code, the compiler generates IL code, which is not machine code. When you then run that code, the JIT (Just In Time) compiler is responsible for converting that IL code into machine code in the most efficient way possible.
- Each IL method is made up of a series of opcodes, representing the operations to be executed. These instructions are executed one after the other, with the possibility of jumping to other parts of the codebase when needed. The runtime works by pushing and popping values to its execution stack, which is also used to pass arguments to functions and other instructions. Each opcode can perform one or more operations, like popping values from the stack to read them and pushing new values to the stack so that they can be used when executing other instructions.
- An IL method can also declare a number of local variables, which are accessed by their position (the order in which they are declared).
- When a method returns, the return value is just the value that is at the top of the execution stack in that moment.
Here is an example with a method in C# and its literal translation in IL:
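The original example is not reproduced in this copy; as a stand-in, here is a trivial method together with a hand-written IL translation (shown as a comment):

```csharp
public static class MathOps
{
    // A simple method: adds two integers and returns the result
    public static int Add(int a, int b)
    {
        return a + b;
    }
}

/* A literal IL translation of Add looks roughly like this:

   ldarg.0   // push the first argument onto the execution stack
   ldarg.1   // push the second argument
   add       // pop both values, push their sum
   ret       // return the value on top of the stack
*/
```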
Enter code generation
The previous implementation works fine, but using reflection to get the value of all the closure fields is not ideal, as it is generally slow. It would be nice if each captured field exposed a method that took an object, cast it to the right type, loaded the field value and returned it. Just like the FieldInfo.GetValue() method, but without the slowdown caused by reflection. That is not the case, but we can actually generate these methods ourselves at runtime, through the use of the DynamicMethod class and its available APIs.
This is what such a method should do if we were trying to load and return the value of the X field of a boxed Vector2 instance:
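The IL listing itself is not embedded here; as a sketch, this is how those exact instructions could be emitted at runtime with a DynamicMethod (the helper name is hypothetical):

```csharp
using System;
using System.Numerics;
using System.Reflection.Emit;

public static class VectorGetter
{
    // Builds a dynamic method equivalent to:
    //   object Get(object obj) => ((Vector2)obj).X;
    public static Func<object, object> Build()
    {
        var method = new DynamicMethod("GetX", typeof(object), new[] { typeof(object) });
        ILGenerator il = method.GetILGenerator();

        il.Emit(OpCodes.Ldarg_0);                              // load the boxed argument
        il.Emit(OpCodes.Unbox, typeof(Vector2));               // get a pointer to the unboxed value
        il.Emit(OpCodes.Ldfld, typeof(Vector2).GetField("X")); // read the X field
        il.Emit(OpCodes.Box, typeof(float));                   // box the float before returning
        il.Emit(OpCodes.Ret);

        return (Func<object, object>)method.CreateDelegate(typeof(Func<object, object>));
    }
}
```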
This IL method first loads the boxed object with the ldarg.0 instruction, which simply loads the method argument at position 0, then uses the unbox instruction to replace that reference on the stack with a pointer to the unboxed Vector2 value. Then ldfld reads the value of the specified field (in this case, Vector2.X) and box, of course, boxes that value before returning.
To optimize our previous code, let’s write a method that given a ClosureField instance builds a dynamic method that loads that field, given an instance of the class that exposes it. In our case, the method will be wrapped by a Func<object, object> instance, as it matches the signature we need:
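The builder itself is not included in this copy; here is a simplified sketch that takes a plain FieldInfo instead of the post's ClosureField type (so it only handles fields exposed directly by the root closure class):

```csharp
using System;
using System.Reflection;
using System.Reflection.Emit;

public static class GetterBuilder
{
    // Builds a Func<object, object> that reads the given field from
    // an instance of its declaring type
    public static Func<object, object> Build(FieldInfo field)
    {
        // The owner type and skipVisibility flag let the method
        // access the private compiler-generated closure classes
        var method = new DynamicMethod(
            $"Get{field.Name}",
            typeof(object),
            new[] { typeof(object) },
            field.DeclaringType,
            true);
        ILGenerator il = method.GetILGenerator();

        il.Emit(OpCodes.Ldarg_0);

        // Unbox value types, cast reference types
        if (field.DeclaringType.IsValueType) il.Emit(OpCodes.Unbox, field.DeclaringType);
        else il.Emit(OpCodes.Castclass, field.DeclaringType);

        il.Emit(OpCodes.Ldfld, field);

        // Box the result if the field holds a value type
        if (field.FieldType.IsValueType) il.Emit(OpCodes.Box, field.FieldType);

        il.Emit(OpCodes.Ret);

        return (Func<object, object>)method.CreateDelegate(typeof(Func<object, object>));
    }
}
```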
Once we have this, we just need to store our ClosureField instances paired with the relative function to retrieve their values, and use these functions instead of FieldInfo.GetValue(object).
Some of you might correctly point out that such a method could also have been created using expression trees, and that is absolutely right. But introducing IL was a necessary step to gradually build up to our last optimization in the final paragraphs, which would not have been possible without it.
Can we do better?
Sure! Using these dynamic methods already improves the performance a bit, but we have a slowdown whenever our closure class contains nested fields, as in that case we have to perform additional iterations to access them. We can modify our method to build the dynamic field getters so that it also performs loop unrolling for all the parent fields, when needed. These new dynamic getters will always just take our first target object instance and traverse down the fields hierarchy until they reach the desired field. Here is what our method builder looks like now:
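The updated builder is not embedded here; a sketch of it, taking the chain of parent fields explicitly (ordered from the outermost closure down, names hypothetical), could be:

```csharp
using System;
using System.Reflection;
using System.Reflection.Emit;

public static class UnrolledGetterBuilder
{
    // Builds a getter that traverses the chain of parent closure
    // fields inline, so a nested field can be read with a single
    // delegate invocation regardless of its depth
    public static Func<object, object> Build(FieldInfo[] parents, FieldInfo field)
    {
        var method = new DynamicMethod(
            $"Get{field.Name}",
            typeof(object),
            new[] { typeof(object) },
            field.DeclaringType,
            true);
        ILGenerator il = method.GetILGenerator();

        il.Emit(OpCodes.Ldarg_0);

        // Cast to the root closure type, then unroll the traversal:
        // one ldfld per parent closure reference in the chain
        Type root = parents.Length > 0 ? parents[0].DeclaringType : field.DeclaringType;
        il.Emit(OpCodes.Castclass, root);

        foreach (FieldInfo parent in parents)
        {
            il.Emit(OpCodes.Ldfld, parent);
        }

        il.Emit(OpCodes.Ldfld, field);

        if (field.FieldType.IsValueType) il.Emit(OpCodes.Box, field.FieldType);

        il.Emit(OpCodes.Ret);

        return (Func<object, object>)method.CreateDelegate(typeof(Func<object, object>));
    }
}
```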
With this change, we will be able to always get the value of a captured field with a single method invocation to our custom methods, no matter the actual depth of the field in the closure type it belongs to.
That’s fine, but can we do even better?
If we do not want to accept any compromises, there is definitely room for more optimizations. Let’s look at the possible bottlenecks in our code:
- Multiple invocations, one for each variable getter.
- Boxing: if a getter is retrieving the value of a field of a value type, right now it is also boxing it, as the return type for our functions is object.
- Pinning: allocating GCHandle instances is expensive, and pinning objects should generally be avoided, if possible.
- Repeated calculations: our GetData method is doing a number of operations every time it is invoked, like iterating over the list of captured fields, checking each field to see if it is a value type, calculating the right byte offset to write the value type fields into the byte array, etc.
What we want is a single method that takes our object instance and retrieves the values of all the captured fields one after the other. We can do this just like we did the loop unrolling to traverse the hierarchy of each captured field. To solve the issue of value types being boxed, we can avoid returning the values of our captured fields entirely: we can instead generate a method that takes our object and byte arrays and writes the field values directly into them. We will actually go one step further, and have our method take ref parameters to the start of our arrays. This way we will be able to assign values directly to each memory address without the JIT compiler adding safety bounds checks, which it does whenever we use the T[int] indexer for array types.
We will first need to write a few extension methods for the ILGenerator class, so that the IL generation for this dynamic method will be easier to manage. We are going to perform a much higher number of operations in IL this time around, so we will definitely need them.
Let’s write a method to store a local value. As mentioned above, local variables in IL are accessed by their index. The instruction we need is stloc.
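The extension itself is not embedded in this copy; a sketch of it, following the description below, could be:

```csharp
using System;
using System.Reflection.Emit;

public static class ILGeneratorExtensions
{
    // Stores the value on top of the execution stack into the local
    // variable at the given index, always picking the fastest opcode
    public static void EmitStoreLocal(this ILGenerator il, int index)
    {
        switch (index)
        {
            case 0: il.Emit(OpCodes.Stloc_0); break;
            case 1: il.Emit(OpCodes.Stloc_1); break;
            case 2: il.Emit(OpCodes.Stloc_2); break;
            case 3: il.Emit(OpCodes.Stloc_3); break;
            default:
                // stloc.s takes a 1-byte index, stloc a 2-byte one
                if (index <= byte.MaxValue) il.Emit(OpCodes.Stloc_S, (byte)index);
                else il.Emit(OpCodes.Stloc, (short)index);
                break;
        }
    }
}
```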
Here we first note one peculiar feature of some IL instructions: in some cases we have at our disposal “multiple versions” of a given opcode that we can use to perform some operations faster. For instance, the stloc.0 opcode stores the value on top of the execution stack into the local variable with index 0, and it is faster than the standard stloc variant as it does not need to also load the target index as a parameter: it is embedded into the instruction itself. This method makes sure to always use the fastest opcode for the index we need. We also need an EmitLoadLocal method, which will be structured just like this one, with the only difference being the fact that it will use the ldloc opcode and its variants instead of stloc.
We then need an extension to replace Unsafe.Add<T>(ref T, int) calls. Our delegate is receiving a pair of ref parameters pointing to the first element in each array, and we want to be able to move those references ahead by a given offset to access other elements of those arrays.
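The extension is not included in this copy; here is a sketch of it, together with the constant-loading helper it relies on (names hypothetical):

```csharp
using System;
using System.Reflection.Emit;

public static class ILOffsetExtensions
{
    // Loads an int constant using the fastest available opcode
    public static void EmitLoadInt32(this ILGenerator il, int value)
    {
        switch (value)
        {
            case 0: il.Emit(OpCodes.Ldc_I4_0); break;
            case 1: il.Emit(OpCodes.Ldc_I4_1); break;
            case 2: il.Emit(OpCodes.Ldc_I4_2); break;
            case 3: il.Emit(OpCodes.Ldc_I4_3); break;
            case 4: il.Emit(OpCodes.Ldc_I4_4); break;
            case 5: il.Emit(OpCodes.Ldc_I4_5); break;
            case 6: il.Emit(OpCodes.Ldc_I4_6); break;
            case 7: il.Emit(OpCodes.Ldc_I4_7); break;
            case 8: il.Emit(OpCodes.Ldc_I4_8); break;
            default:
                if (value >= sbyte.MinValue && value <= sbyte.MaxValue) il.Emit(OpCodes.Ldc_I4_S, (sbyte)value);
                else il.Emit(OpCodes.Ldc_I4, value);
                break;
        }
    }

    // Shifts the reference on top of the stack ahead by the given
    // byte offset, replacing an Unsafe.Add call in the generated code
    public static void EmitAddOffset(this ILGenerator il, int offset)
    {
        il.EmitLoadInt32(offset); // push the offset as an int
        il.Emit(OpCodes.Conv_I);  // convert it to a native int
        il.Emit(OpCodes.Add);     // add it to the reference below it
    }
}
```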
Here we are first using the ldc.i4 instruction, which pushes an int value to the top of the stack. As before, we are also making sure to always use the fastest possible opcodes, which means using the explicit versions ldc.i4.0 through ldc.i4.8 for values in the [0,8] range, and the ldc.i4.s opcode for values that fit in a sbyte parameter. Then, we use conv.i to convert our loaded offset (which is of type int) to a native int, which is a type that simply represents a memory address, and whose size depends on the CPU architecture the code is running upon. Now we can use add, and we end up with our shifted reference on the top of the execution stack.
The last piece of the puzzle is a method to write a value at the location pointed by a given reference. This method will assume that the execution stack will have a reference and then a value of a specified type at the top of the execution stack, and will make sure to write the value to the target location using the appropriate instruction:
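The method is not embedded in this copy; a sketch of it, covering the most common primitive types and following the description below, could be:

```csharp
using System;
using System.Reflection.Emit;

public static class ILStoreExtensions
{
    // Writes the value on top of the stack to the memory location
    // pointed by the reference just below it, picking the most
    // appropriate store opcode for the given type
    public static void EmitStoreToAddress(this ILGenerator il, Type type)
    {
        if (!type.IsValueType)
        {
            // Reference types just need an object reference store
            il.Emit(OpCodes.Stind_Ref);
            return;
        }

        // Use a dedicated opcode for primitive value types, if any
        if (type == typeof(byte) || type == typeof(sbyte) || type == typeof(bool)) il.Emit(OpCodes.Stind_I1);
        else if (type == typeof(short) || type == typeof(ushort) || type == typeof(char)) il.Emit(OpCodes.Stind_I2);
        else if (type == typeof(int) || type == typeof(uint)) il.Emit(OpCodes.Stind_I4);
        else if (type == typeof(long) || type == typeof(ulong)) il.Emit(OpCodes.Stind_I8);
        else if (type == typeof(float)) il.Emit(OpCodes.Stind_R4);
        else if (type == typeof(double)) il.Emit(OpCodes.Stind_R8);
        else il.Emit(OpCodes.Stobj, type); // stobj also needs a type token
    }
}
```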
In this method, we first need to check whether the value we are trying to set is a value type. If it is not, then we just need the stind.ref opcode, which simply stores an object reference at a given memory location. If it is, we need to check whether a dedicated opcode is available for our current data type: stind.i4 for int values, stind.r4 for float values, etc. If that is not the case, we can fall back to the stobj opcode, which despite its confusing name is actually used to copy value types to a supplied memory address. Lastly, we need to check whether the selected opcode is stobj, in which case we will also need to pass a type token to indicate the type of the value being assigned, otherwise the runtime will not be able to properly execute that particular instruction.
We now have all the building blocks we need, and we just need to put them together. We also need to define a custom delegate that will wrap our dynamic methods, to be able to specify two of the input parameters as ref parameters, and then we can finally build our updated IL method:
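The full builder is not embedded in this copy of the post; here is a deliberately simplified sketch that conveys the idea. It handles a flat closure only (no nested closure classes, so the local-variable mapping described below is reduced to a single local), and emits the offset arithmetic inline rather than through the extensions discussed above:

```csharp
using System;
using System.Reflection;
using System.Reflection.Emit;
using System.Runtime.InteropServices;

// Custom delegate wrapping the dynamic method, needed to declare
// the two destination parameters as ref parameters
public delegate void DataLoader(object instance, ref object r0, ref byte r1);

public static class DataLoaderBuilder
{
    // Builds a single method that writes the values of all the
    // captured fields directly into the two target arrays
    public static DataLoader Build(Type closureType, FieldInfo[] fields)
    {
        var method = new DynamicMethod(
            "LoadData",
            typeof(void),
            new[] { typeof(object), typeof(object).MakeByRefType(), typeof(byte).MakeByRefType() },
            closureType,
            true);
        ILGenerator il = method.GetILGenerator();

        // Cast the input object once and store it in a local
        il.DeclareLocal(closureType);
        il.Emit(OpCodes.Ldarg_0);
        il.Emit(OpCodes.Castclass, closureType);
        il.Emit(OpCodes.Stloc_0);

        int objectOffset = 0, byteOffset = 0;

        foreach (FieldInfo field in fields)
        {
            if (field.FieldType.IsValueType)
            {
                // Shift the ref byte parameter by the running offset
                il.Emit(OpCodes.Ldarg_2);
                il.Emit(OpCodes.Ldc_I4, byteOffset);
                il.Emit(OpCodes.Conv_I);
                il.Emit(OpCodes.Add);
                il.Emit(OpCodes.Ldloc_0);
                il.Emit(OpCodes.Ldfld, field);
                il.Emit(OpCodes.Stobj, field.FieldType);

                byteOffset += Marshal.SizeOf(field.FieldType);
            }
            else
            {
                // Shift the ref object parameter by the running offset
                il.Emit(OpCodes.Ldarg_1);
                il.Emit(OpCodes.Ldc_I4, objectOffset);
                il.Emit(OpCodes.Conv_I);
                il.Emit(OpCodes.Add);
                il.Emit(OpCodes.Ldloc_0);
                il.Emit(OpCodes.Ldfld, field);
                il.Emit(OpCodes.Stind_Ref);

                // Object references all have the size of a pointer
                objectOffset += IntPtr.Size;
            }
        }

        il.Emit(OpCodes.Ret);

        return (DataLoader)method.CreateDelegate(typeof(DataLoader));
    }
}
```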
As before, the first step is creating a DynamicMethod, making sure to properly specify its signature, including ref parameters. After that, we need to create a mapping for the root closure type and all the nested closure types, so that we can assign a unique index to each of them, which will correspond to their location in our list of local variables in IL. Then we use a HashSet<T> to keep track of the indices of the local variables that have already been initialized. Once we have our mappings, we define a local method that we will need to load a given field. This method will check whether the parent instance of that field has already been loaded, and if it has not, it will go back to the deepest loaded instance and resume the traversal from there, until the needed instance is assigned to its corresponding local variable and loaded on the execution stack.
We start the construction of the IL method by declaring all the local variables we need, and then we load the input object instance, cast it to the right type and store it in the first local variable. After that, we can iterate over the captured fields, and prepare the target memory reference to write to. If the field is a value type, we load our ref byte parameter and shift it ahead by the current offset we are at in the byte array, and then we update it by retrieving the size of the field type. If it is a reference type, we instead load the ref object parameter. To update the offset into the object array, we are using the size of the object type, which we can retrieve with the Unsafe.SizeOf<T>() method. This works for every other reference type as well: a memory address has a fixed size that only depends on the CPU architecture. Once this is done, we can call our LoadField(ClosureField) method defined above and our extension to write the loaded field value to the target memory location.
We are almost done now, we just need to change our GetData(Delegate) method so that it can use the delegate we’ve just created:
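The updated method is not embedded here; a sketch of it, assuming a loader with the ref-parameter delegate shape described above, could look like this. For brevity it elides the empty-array handling described below by always renting at least one element:

```csharp
using System;
using System.Buffers;

// Delegate shape matching the generated loader method
public delegate void DataLoader(object instance, ref object r0, ref byte r1);

public static class ClosureDataFast
{
    // Final version of GetData: the sizes are precomputed once when
    // the loader is built, and a single invocation of the generated
    // method loads every captured value
    public static (object[], byte[]) GetData(
        Delegate instance,
        DataLoader loader,
        int referenceCount,
        int byteSize)
    {
        // Rent at least one element so that taking a ref to the
        // first item is always valid (empty-array case elided here)
        object[] objects = ArrayPool<object>.Shared.Rent(Math.Max(referenceCount, 1));
        byte[] bytes = ArrayPool<byte>.Shared.Rent(Math.Max(byteSize, 1));

        loader(instance.Target, ref objects[0], ref bytes[0]);

        return (objects, bytes);
    }
}
```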
The first thing you might notice here is that our method is also taking the number of captured reference and value type variables as parameters, instead of calculating them. This is another optimization we can add, since those values only need to be calculated once instead of every time we want to extract data from a given closure instance. We can apply this same optimization to the previous variants as well, to prevent this version from having an unfair advantage over the others. Looking at the rest of the method body, we are creating our object and byte arrays as usual, and then getting references to the first item of each array. If either of the arrays is empty, we do not access it and just use a null reference instead, obtained with Unsafe.AsRef<T>(null). Then we call our latest dynamic IL method, which will take care of all the rest.
And that’s it! We have gone all the way from a purely reflection-based approach to solve this problem, to an optimized version that not only ditches reflection entirely, but memory allocations too.
Was it worth it?
I have prepared a benchmark that compares the four different implementations we have discussed. The results are categorized into three scenarios, from “Small” being a simple closure with a couple of captured reference and value types, to “Large” having over a dozen captured fields spread across three different variable scopes. Each test case has been run one million times, so the mean execution time you see below is not for a single call to our GetData methods. Here are the results:
Compared with the baseline being the reflection-based implementation, using dynamic getters is over 1.4x faster on average, which becomes a whopping 25x faster with the single getter and the other optimizations! We can see that there is not much difference between the first implementation with dynamic getters and the one with loop unrolling, but doing that intermediate step was still useful to gradually approach the last level of optimization.
I have setup a full test project that demonstrates all these different techniques and includes the benchmark used to generate the results listed above. You can find it here on GitHub: https://github.com/Sergio0694/ReflectionToIL.
I hope that no matter your skill level with C# you have enjoyed this post and found it interesting. I definitely had fun working on this project and writing all of this down! I have added links to every single API that I’ve used across all the code snippets, so if you want to get more details on any of them you can just click and read the official MS docs page about it.
If you spot any mistake in either this post, one of the embedded code snippets or one of the files in the test project, please let me know!