Matrix Mull…

Posted on November 13, 2010

A slightly technical (but shorter) post tonight as it’s been a busy week of projects, various talks and meetings!

I’ve been working on optimisations for a title we’re finishing up at the moment, and matrix multiplies were one area I knew needed optimising.

There have been a couple of developers (links near the bottom of the post) talking about optimising for the VFP and NEON vector processing extensions over the last few years, so I was aware that the savings were significant. We’d simply not had to use these optimisations within our own math library code before now.

I’ve also recently heard a bit about the Accelerate framework from WWDC, so I thought I’d have a look at that too. My main worry was the function call overhead of calling into a library for such a small operation (at least without fancy linker features removing that overhead).

I thought it would be interesting to do a post looking at rough timings of an operation using the various options we have.

I decided to choose the fairly common 4×4 matrix multiply. As I mentioned, these timings are fairly rough: I simply set up loops to perform 100,000 matrix multiplies and (separately from the timed code) ensured the results came out the same.
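
Nothing fancy on the timing side. A minimal sketch of the kind of harness involved, assuming the Matrix4x4Mul function shown below (this is illustrative rather than the exact code; mach_absolute_time is the high-resolution timer on iOS):

#include <stdint.h>
#include <stdio.h>
#include <mach/mach_time.h>

void TimeMatrix4x4Mul( float * r, const float * a, const float * b )
{
  mach_timebase_info_data_t info;
  mach_timebase_info( &info );

  uint64_t start = mach_absolute_time();

  for ( int i = 0; i < 100000; ++i )
    lSIMD_Base::Matrix4x4Mul( r, a, b );

  uint64_t ticks = mach_absolute_time() - start;

  // convert mach ticks to milliseconds
  double ms = (double)ticks * info.numer / ( info.denom * 1000000.0 );
  printf( "100,000 multiplies: %.1fms\n", ms );
}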

C (direct) is a call to a function that looks a lot like this:

void lSIMD_Base::Matrix4x4Mul( float * r, const float * a, const float * b )
{
  r[0]  = (a[0]*b[0])  + (a[1]*b[4])  + (a[2]*b[8])   + (a[3]*b[12]);
  // etc. for the remaining 15 elements of r
}

C (indirect) is the same function called via an operator* in our matrix class; I wanted to see at the same time whether GCC was optimising out the temporary matrix and the function call.
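
The wrapper is the obvious one. A rough sketch of the shape (the class and member names here are illustrative rather than our actual code):

class lMatrix4x4
{
public:
  lMatrix4x4 operator*( const lMatrix4x4 &rhs ) const
  {
    // the temporary and the call below are what we hoped GCC would optimise away
    lMatrix4x4 result;
    lSIMD_Base::Matrix4x4Mul( result.m, m, rhs.m );
    return result;
  }

  float m[16]; // row major
};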

VFP is a call to the vfpmathlibrary Matrix4Mul implementation. Note this is a column-major matrix multiply whereas the others in this rough test are row-major.
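
(You can still feed row-major matrices to a column-major multiply by swapping the operands: a row-major matrix reinterpreted as column-major is its transpose, and (AB)ᵀ = BᵀAᵀ. Something along these lines, with the argument order assumed rather than taken from the library header:)

// dst = src1 * src2 in column-major terms; with row-major data passed
// straight through this computes b^T * a^T = (a*b)^T, whose bytes are
// exactly the row-major a*b - so we just swap the operands.
// (argument order assumed here - check the library header)
Matrix4Mul( b, a, r );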

NEON is code based on the NEON implementations posted in the comments on Wolfgang Engel’s blog.
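
The gist of the approach: each row of the result is a linear combination of the rows of b, weighted by the corresponding row of a. An intrinsics sketch of that idea (the versions in those comments are hand-written assembly and likely faster; this is just to show the shape of the algorithm):

#include <arm_neon.h>

void Matrix4x4Mul_NEON( float * r, const float * a, const float * b )
{
  // load the four rows of b (row major) once
  float32x4_t b0 = vld1q_f32( b );
  float32x4_t b1 = vld1q_f32( b + 4 );
  float32x4_t b2 = vld1q_f32( b + 8 );
  float32x4_t b3 = vld1q_f32( b + 12 );

  for ( int i = 0; i < 4; ++i )
  {
    // row i of r = a[i][0]*b0 + a[i][1]*b1 + a[i][2]*b2 + a[i][3]*b3
    float32x4_t row = vmulq_n_f32( b0, a[i*4 + 0] );
    row = vmlaq_n_f32( row, b1, a[i*4 + 1] );
    row = vmlaq_n_f32( row, b2, a[i*4 + 2] );
    row = vmlaq_n_f32( row, b3, a[i*4 + 3] );
    vst1q_f32( r + i*4, row );
  }
}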

CBLAS is the BLAS interface of the Accelerate framework in iOS 4.0 and above; as you’ll see, we’re only going to get a result on devices running OS 4.0 or later.
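
For a 4×4 multiply the call looks something like this (sgemm computes r = alpha*a*b + beta*r, and row versus column major is just a flag here):

#include <Accelerate/Accelerate.h>

void Matrix4x4Mul_CBLAS( float * r, const float * a, const float * b )
{
  // r = 1.0 * a * b + 0.0 * r, all matrices 4x4 row-major with stride 4
  cblas_sgemm( CblasRowMajor, CblasNoTrans, CblasNoTrans,
               4, 4, 4,
               1.0f, a, 4,
               b, 4,
               0.0f, r, 4 );
}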

The code was compiled using the current 4.2 SDK with the current GCC-based Xcode, with Thumb disabled, in the default release configuration (-Os is the default optimisation level, I believe).

Device / OS version   | C (direct) | C (indirect) | VFP   | NEON  | CBLAS
iPhone 4 (4.1)        | 72ms       | 90ms         | 170ms | 7.0ms | 338ms
iPad (3.2.1)          | 55ms       | 69ms         | 138ms | 5.3ms | n/a
iPhone 3GS (4.0.2)    | 95ms       | 123ms        | 233ms | 9.4ms | 473ms
iPod v3 (3.1.3)       | 134ms      | 166ms        | 47ms  | n/a   | n/a
iPhone 3G (3.1.2)     | 249ms      | 283ms        | 58ms  | n/a   | n/a
iPod v1 (3.1.2)       | 176ms      | 214ms        | 58ms  | n/a   | n/a

I’ll try and remember to come back to update this table as I update OS versions and try new things!

The timings are roughly as you’d expect (though I’m not sure the 3G results should be quite that slow; I think the device is on its way out, to be honest!). The Accelerate framework is a bit of a disappointment, but I believe this is mainly down to call overhead. The WWDC presentation certainly had much better results for other operations, and with larger operations such as a Fast Fourier Transform the call overhead becomes a much smaller percentage of the work you’re trying to perform. I need to try some more things with Accelerate as I’m not sure it should be this slow.

As expected, NEON is fastest on the ARMv7 chips and VFP on the ARMv6 chips. NEON is 10x faster than the C implementation, which is quite impressive.

The table also acts as quite a nice comparison of general chip speed; I incorrectly believed the iPhone 4’s CPU to be faster than the iPad’s before seeing these results.

As promised, here are some useful links relating to the above:

Noel Llopis talking about floating point performance a few years ago
http://www.slideshare.net/llopis/cranking-floating-point-performance-to-11-on-the-iphone-2111775

Wolfgang Engel’s original post on the VFP Math Library
http://diaryofagraphicsprogrammer.blogspot.com/2008/11/iphone-arm-vfp-code.html

The VFP math library itself
http://code.google.com/p/vfpmathlibrary/

I believe the same version is included here in Oolong, along with NEON implementations based on the comment posts on Wolfgang’s blog.
http://code.google.com/p/oolongengine/source/browse/trunk/Oolong%20Engine2/Math/

NEON intrinsics guide
http://gcc.gnu.org/onlinedocs/gcc/ARM-NEON-Intrinsics.html

Math neon library – extensive math library implementation for NEON (LGPL)
http://code.google.com/p/math-neon/

‘iPhone VFP for n00bs’ – also covers some basics of using inline assembly on GCC
http://aleiby.blogspot.com/2008/12/iphone-vfp-for-n00bs.html

A blog at arm.com on matrix multiplication with NEON
http://forums.arm.com/index.php?/blogs/index.php?/blog/7/entry-241-coding-for-neon-part-3-matrix-multiplication/

Accelerate framework slides from WWDC 2010

  • available via iOS developer centre

Things we’ve been enjoying this week

Kinect

  • I think on the launch titles Move is just winning for us, but Kinect is interesting and I’m looking forward to seeing what comes out of the Kinect hacking going on now that the open source drivers are out.

http://www.youtube.com/watch?v=OwWSFj3TTLM

  • 4k demoscene intro with code, should be interesting!

http://www.minecraftforum.net/viewtopic.php?f=35&t=69299

  • Working 8-bit CPU in Minecraft

http://www.eetimes.com/General/DisplayPrintViewContent?contentItemId=4210470

  • Related to this blog post, efficient C code for ARM devices

http://www.3drender.com/challenges/index.htm

  • Awesome resource of 3D models intended for artists to texture and light; they should make very nice-looking test assets for any tech tests too!


2 Responses

  2. Ronald
    November 14, 2010

    Maybe you should optimize your engine code for cache coherency rather than for super fast matrix math. Even such a “slow” system like the iPhone is memory bound in most cases.


  3. admin
    November 14, 2010

    Absolutely, and data-oriented design is vital on pretty much every platform, as you say.
    The article’s focus was a quick investigation into how much faster the SIMD features are on the platform, and into discovering any practical issues at the same time.

    Thanks for the comment, a very valid point.

