Friday, January 29, 2010

Run-time Compilation - Performance

I put together an example of compiling code generated at run-time, and then loading it using Extension:

I'm preparing some experiments to compare the performance of numerical code compiled in advance to that compiled at run-time. There will be three test cases involving complex matrix operations:

  1. The matrix sizes are known at compile-time, and can be used by the compiler for optimization. The matrix computations are done in the main binary.
  2. The matrix sizes are not known at compile-time. The matrix computations are done in the main binary using algorithms that work for arbitrarily-sized matrices.
  3. The matrix sizes are not known when the binary is compiled. When the binary loads a matrix, it will generate code on the fly to process the matrix. It will then compile the code into a shared library, load the shared library, and run the algorithm.
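
As a rough illustration of the difference between cases 1 and 2, the sketch below multiplies square matrices twice: once with the dimension as a template parameter, and once with it as a run-time argument. The function names, the row-major layout, and the plain double elements are assumptions made for this example, not the code used in the experiments.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Case 1 (hypothetical kernel): N is a template parameter, so every loop
// bound is a compile-time constant that the optimizer can work with.
template <std::size_t N>
void multiply_fixed(const std::array<double, N * N>& a,
                    const std::array<double, N * N>& b,
                    std::array<double, N * N>& out) {
    for (std::size_t i = 0; i < N; ++i) {
        for (std::size_t j = 0; j < N; ++j) {
            double sum = 0.0;
            for (std::size_t k = 0; k < N; ++k) {
                sum += a[i * N + k] * b[k * N + j];
            }
            out[i * N + j] = sum;
        }
    }
}

// Case 2 (hypothetical kernel): n is an ordinary run-time value, so the same
// loops must work for any size.
std::vector<double> multiply_dynamic(const std::vector<double>& a,
                                     const std::vector<double>& b,
                                     std::size_t n) {
    assert(a.size() == n * n && b.size() == n * n);
    std::vector<double> out(n * n);
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            double sum = 0.0;
            for (std::size_t k = 0; k < n; ++k) {
                sum += a[i * n + k] * b[k * n + j];
            }
            out[i * n + j] = sum;
        }
    }
    return out;
}
```

A call such as `multiply_fixed<2>(a, b, out)` has to name N explicitly, since it cannot be deduced from `N * N`. In that instantiation the loop bounds are literal constants, whereas `multiply_dynamic` reads n at run-time.
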
My hypothesis is that it will be possible to make the runtime-generated code faster than the arbitrarily-sized matrix code, for the following reasons:
  • The compiler will be able to make certain optimizations that it can only do when it knows the size of the arrays being looped over.
  • I'll be able to use fewer variables and more constants in the compiled code.
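
To show what "fewer variables and more constants" could look like for case 3, here is a minimal sketch of a generator that writes C source with the matrix dimensions baked in as literals. The name generate_matvec_source, the emitted matvec function, and the choice of a matrix-vector product are all hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>

// Hypothetical generator: returns C source for a matrix-vector product in
// which the dimensions appear as literal constants instead of variables.
std::string generate_matvec_source(std::size_t rows, std::size_t cols) {
    assert(rows > 0 && cols > 0);
    std::ostringstream src;
    src << "/* Generated for a " << rows << "x" << cols << " matrix. */\n"
        << "void matvec(const double *m, const double *x, double *y) {\n"
        << "    for (int i = 0; i < " << rows << "; ++i) {\n"
        << "        double sum = 0.0;\n"
        << "        for (int j = 0; j < " << cols << "; ++j) {\n"
        << "            sum += m[i * " << cols << " + j] * x[j];\n"
        << "        }\n"
        << "        y[i] = sum;\n"
        << "    }\n"
        << "}\n";
    return src.str();
}
```

The returned string would then be written to a file, compiled as a shared library (for example with a compiler's `-shared -fPIC` options), and loaded by the main binary. Emitting C rather than C++ keeps the symbol name unmangled, which makes it easier to look up after loading.
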
I expect the runtime-generated code to be slower than the specialized pre-compiled code, though, for two reasons:
  • Code in a shared library usually runs more slowly than code in the binary itself. One reason for this is that a shared library does not know the address at which it will be loaded at run-time, so it can't hardcode as many pointers and addresses as an executable can. (Note that operating system libraries often have an optimization to avoid this problem, by using reserved address space.)
  • Even if it did run as fast, it has to be compiled and loaded first.
Any opinions? I hope to post preliminary results in the next couple of weeks.


Immanuel.Hayden said...

hi, just wanted to know what the status on this is. looking forward to see some results :)

Jeremy said...

The code I wrote ended up using lots of internal pointers to reference the parts of the matrices. Since I used this instead of array subscripts on standard dynamically allocated arrays (or global, static arrays or some other technique), the performance was basically identical within the shared library. I'm going to run some more tests though - using different types of matrix access.