SMUGOpenCL is an open-source library (MIT License) that provides a Cocoa wrapper to the OpenCL framework in Mac OS X 10.6 Snow Leopard. Its applications include game programming, scientific computing, image processing, and much more.

The project is located on bitbucket at this URL: http://bitbucket.org/liscio/smugopencl.

The repository includes a simple example that demonstrates how to set up an OpenCL context, and compile/run a kernel. It also provides an example of mixing the SMUGMath and SMUGOpenCL frameworks together.

An example of SMUGOpenCL usage (taken from the included example program) can be seen at the wiki page here: https://bitbucket.org/liscio/smugopencl/wiki/Home

There are many features to come, so follow the project on BitBucket to keep an eye on SMUGOpenCL development.

SMUGMath is an open-source library (MIT License) that is designed for working with large vector data sets. Applications lie in signal processing and statistics, among many others.

The project is located on bitbucket at this URL: http://bitbucket.org/liscio/smugmath.

As of this writing, there is only a cursory feature set. The SMUGRealVector and SMUGComplexVector classes are defined, along with a handful of useful operations on either type. There are also some unit tests included, which should help explain the various operations, and allow further tests to be easily written.

SMUGMath’s strength lies in its flexibility. You can easily add highly-tuned numerical operations by creating categories on either class, using Apple’s Accelerate framework, OpenCL, Grand Central Dispatch, or any combination of them, to suit your application.

In the coming weeks, SMUGMath will gain more core functionality relating to signal processing, and some documentation to highlight effective use of the framework. Follow the project on BitBucket to watch these developments as they happen.

Math and Cocoa

A few years ago when I was working on FuzzMeasure, I found myself in need of a math library to work with large data sets, but I didn’t want to deal with building C++ classes to fill out my framework. Ideally, I’d build something that fit in with Cocoa as much as possible.

Why not C++?

Well, I really dislike C++, for starters. It has a whole lot of fancy OO features, but it’s an awful lot of rope to hang yourself with. Every time I’ve worked with C++, I’ve hated it. If I’m coding for fun, I’ll code with a language I like, thanks.

But Objective-C is missing some crucial things, right?

  • Syntactic sugar. You couldn’t write aVector + bVector to add two vectors together using Objective-C, as there is no operator overloading (not a bad thing, as you’ll see below).
  • Templates. Yes, I actually wrote separate SMUGRealVector, SMUGDoubleVector, and SMUGComplexVector classes.
  • Speed. Maybe, but I’m not so sure the win for C++ is cut and dried.

After using this library for over 4 years, I haven’t found these missing “features” to be a problem. In fact, I find the Objective-C language to be far better, in the long run, because I’ve gained these features (among others):

  • Extensibility. You can build specific features onto existing classes, which is especially powerful if you’re consuming the class as part of a framework. (i.e. You don’t own it.)
  • Readability. The code’s more verbose, but you can read the code again in a few years without much trouble.
  • Frameworks. Cocoa is amazing. It’s chock full of extremely useful classes that basically do what I want, anyway.

When I talk about readability above, I’m especially poking at overloading operators in C++. How do you know what v * C does, just by looking at it? Does it scale v by a constant? How can you be sure at first glance? Compare with [v multiplyBy:C] and [v scaleBy:C], which spell out vector multiplication and scaling, respectively.

Keeping it Cocoa

I started out by naïvely considering NSArrays of NSNumbers, but that fell over quickly. I wanted to use Cocoa, but not too much. I also wanted it to be fast (or at least give me an opportunity to optimize it easily later).

I had to operate primarily on long vectors, so I could carry over mountains of MATLAB code that are built with vectors in mind. Take a signal, stuff it in a vector, do FFTs on the vector, normalize it, multiply with another vector, work in the frequency domain, etc.

Well, it doesn’t get much faster than arrays of floats—big ol’ blobs of memory on your system—to deal with giant data sets like these. So, how do we get arrays of floats in a Cocoa-like way?

Storing Vectors In Memory

@interface SMUGRealVector : NSObject  {
    NSMutableData *mData;
}
// ...lots of stuff

That’s right—there’s not much to the classes. Just a blob of NSMutableData, which is a giant step up from a naked pointer to a blob of floats. NSMutableData gives us so much:

  • Operations on, and extraction of, ranges of data.
  • Simple increase/decrease operations on its size, even appending other blobs of data.
  • File serialization routines!
  • And more, of course…

Yes, I’m competent enough to write the above routines myself, but I know better than that. So how do we build these vectors?

Vector Construction

- (id)initWithLength:(unsigned int)N;
{
    if ( !( self = [super init] ) ) {
        return nil;
    }
    mData = [[NSMutableData alloc]
        initWithLength:(N*sizeof(float))];
    if ( !mData ) {
        // Allocation failed: release self (manual retain/release) before bailing out.
        [self release];
        return nil;
    }
    return self;
}

// And a whole whack of convenience functions...
+ (id)realVectorWithLength:(unsigned int)N;
+ (id)realVectorWithOnes:(unsigned int)N;
+ (id)realVectorWithIntegersRangingFrom:(int)begin to:(int)end;

Looks good, right? Some of you might recognize some MATLAB idioms in there, such as ones(), or replicating [1:5], which returns [1 2 3 4 5]. Great, what else can we do?
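
If you don't have the framework handy, here is a minimal plain-C sketch of what the two MATLAB-style constructors compute. The names fill_ones, fill_integer_range, and range_length are hypothetical and not part of SMUGMath. The real class methods hand back vector objects, whereas this sketch assumes the caller supplies the buffer.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: element count of the inclusive range begin..end,
   like MATLAB's begin:end. An "empty" range (end < begin) has length 0. */
size_t range_length(int begin, int end)
{
    if (end < begin) {
        return 0;
    }
    return (size_t)((long long)end - (long long)begin + 1);
}

/* Like ones(1, n): set all n components to 1.0f. */
void fill_ones(float *out, size_t n)
{
    assert(out != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        out[i] = 1.0f;
    }
}

/* Like [begin:end]: out must have room for range_length(begin, end) floats. */
void fill_integer_range(float *out, int begin, int end)
{
    size_t n = range_length(begin, end);
    assert(out != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        out[i] = (float)((long long)begin + (long long)i);
    }
}
```

Calling fill_integer_range(buf, 1, 5) on a 5-float buffer fills it with 1 2 3 4 5, matching the [1:5] idiom mentioned above.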

Accessing Vectors

To get at the bits & pieces of vectors, you have a few options:

- (float*)components;
- (unsigned int)length;
- (void)setComponent:(float)c atIndex:(unsigned int)i;
- (float)componentAtIndex:(unsigned int)i;

They do what you’d expect, wrapping the matching NSData routines in the case of components and length. You can also operate on ranges of vectors:

- (SMUGRealVector*)realVectorInRange:(NSRange)range;
- (void)appendVector:(SMUGRealVector*)v;
- (void)replaceComponentsInRange:(NSRange)range
    withRealVector:(SMUGRealVector*)v;

You can actually build a ‘vector queue’ of sorts, by using the routines above. In one case, you can build a standard queue by appending vectors on one end, and extracting a sub-range from the other end—discarding the original, longer vector as you go. This is certainly memory-intensive, but these are extremely handy to bootstrap some signal-processing algorithms (overlap-add and overlap-save, for instance).

(If you wanted to optimize overlap-add, and overlap-save, you would instead build a large circular buffer, of sorts, and just use a combo of replaceComponentsInRange: and realVectorInRange: to do your bidding, but I digress…)
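
As a rough illustration of the circular-buffer idea, here is a plain-C sketch. It is not code from the library: FloatRing, ring_init, ring_write, and ring_read are hypothetical names, and the caller is assumed to own the backing storage. The read routine copies out a block and then advances by a possibly smaller hop, which leaves the overlap in place for the next read, the access pattern that overlap-add and overlap-save need.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical fixed-capacity ring buffer of floats. */
typedef struct {
    float *data;      /* backing storage, owned by the caller */
    size_t capacity;  /* number of floats in data */
    size_t head;      /* index of the oldest stored sample */
    size_t count;     /* number of samples currently stored */
} FloatRing;

void ring_init(FloatRing *r, float *storage, size_t capacity)
{
    assert(r != NULL && storage != NULL && capacity > 0);
    r->data = storage;
    r->capacity = capacity;
    r->head = 0;
    r->count = 0;
}

/* Append n samples. Returns 0 on success, -1 if they do not fit. */
int ring_write(FloatRing *r, const float *src, size_t n)
{
    if (n > r->capacity - r->count) {
        return -1;
    }
    size_t tail = (r->head + r->count) % r->capacity;
    for (size_t i = 0; i < n; i++) {
        r->data[(tail + i) % r->capacity] = src[i];
    }
    r->count += n;
    return 0;
}

/* Copy the n oldest samples into dst, then discard only the first
   `advance` of them. With advance < n, the remaining n - advance
   samples stay queued and overlap with the next read. */
int ring_read(FloatRing *r, float *dst, size_t n, size_t advance)
{
    if (n > r->count || advance > n) {
        return -1;
    }
    for (size_t i = 0; i < n; i++) {
        dst[i] = r->data[(r->head + i) % r->capacity];
    }
    r->head = (r->head + advance) % r->capacity;
    r->count -= advance;
    return 0;
}
```

Unlike the append-and-extract queue, nothing is reallocated here. The cost is a fixed capacity that you have to size up front.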

Vector Math

This is where I think my math library really kicks ass. I built with simplicity in mind, but I also wanted to ensure I could take advantage of vecLib/vDSP.h as much as possible, because I think it’s an underused API from Apple.

Here are a few choice routines:

- (void)square;
{
    vDSP_vsq( [self components], 1, [self components], 1,
        [self length] );
}
- (void)multiplyBy:(SMUGRealVector*)x
{
    NSParameterAssert( [self length] == [x length] );
    vDSP_vmul( [self components], 1, [x components], 1,
        [self components], 1, [self length] );
}
- (void)scaleBy:(float)scalar;
{
    vDSP_vsmul( [self components], 1, &scalar,
        [self components], 1, [self length] );
}
// and so on...

Isn’t this great? These are one-liner routines thanks to vDSP!
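
For anyone unfamiliar with vDSP, the following plain-C loops sketch what those three calls compute when every stride is 1, as in the methods above. The ref_ names are made up for this illustration. These are reference loops, handy for checking results on a platform without vDSP, and make no claim to match its speed.

```c
#include <assert.h>
#include <stddef.h>

/* c[i] = a[i] * a[i]: the unit-stride result of vDSP_vsq. */
void ref_vsq(const float *a, float *c, size_t n)
{
    assert(n == 0 || (a != NULL && c != NULL));
    for (size_t i = 0; i < n; i++) {
        c[i] = a[i] * a[i];
    }
}

/* c[i] = a[i] * b[i]: the unit-stride result of vDSP_vmul. */
void ref_vmul(const float *a, const float *b, float *c, size_t n)
{
    assert(n == 0 || (a != NULL && b != NULL && c != NULL));
    for (size_t i = 0; i < n; i++) {
        c[i] = a[i] * b[i];
    }
}

/* c[i] = a[i] * scalar: the unit-stride result of vDSP_vsmul. */
void ref_vsmul(const float *a, float scalar, float *c, size_t n)
{
    assert(n == 0 || (a != NULL && c != NULL));
    for (size_t i = 0; i < n; i++) {
        c[i] = a[i] * scalar;
    }
}
```

As in the methods above, the output pointer may be the same as an input pointer, so each operation can run in place.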

(I actually chose not to use the FFT library from vecLib, for a few reasons. First of all, I find the way it stores complex values, and the DC/Nyquist components to be strange. Second, I encountered a (now fixed) bug long ago with FFT lengths > 128K.)

Memory vs Speed

When I designed the vector class, I had to keep in mind the tradeoff between memory and speed. For instance, to calculate z(Cv+w) element by element, where C is a constant and v, w, and z are vectors, this is valid:

float *vc = [v components];
float *wc = [w components];
float *zc = [z components];
unsigned int len = [z length];
// Gotta be careful!
NSParameterAssert( len == [w length] && len == [v length] );
for ( unsigned int i = 0; i < len; i++ ) {
    zc[i] = zc[i] * ( ( vc[i] * C ) + wc[i] );
}

However, instead I prefer to write:

[v scaleBy:C];
[v add:w];
[z multiplyBy:v];

(Note that the results of the calculations are stored first in v, and then in z. Thus, a copy of both v and z must be made in advance if you want to retain their original values for later; w is left untouched. This is something I have to think about constantly when using the library…)

In most cases, the latter way of writing the code turns out to be much faster. This is because vDSP is built to operate on large data sets very quickly, so it often outperforms hand-written loops. Furthermore, it's much easier to read and maintain this code!
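
To make the in-place behavior concrete, here is a small plain-C sketch that mirrors the three method calls with ordinary loops (scale_by, add_to, and multiply_by are hypothetical stand-ins, not library functions) next to the fused loop. Both produce the same z, but the chained version overwrites v along the way.

```c
#include <assert.h>
#include <stddef.h>

/* v[i] *= c, standing in for [v scaleBy:C]. */
void scale_by(float *v, float c, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        v[i] *= c;
    }
}

/* v[i] += w[i], standing in for [v add:w]. */
void add_to(float *v, const float *w, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        v[i] += w[i];
    }
}

/* z[i] *= v[i], standing in for [z multiplyBy:v]. */
void multiply_by(float *z, const float *v, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        z[i] *= v[i];
    }
}

/* The hand-written loop: z = z * (C*v + w), leaving v and w untouched. */
void fused(float *z, const float *v, const float *w, float c, size_t n)
{
    assert(n == 0 || (z != NULL && v != NULL && w != NULL));
    for (size_t i = 0; i < n; i++) {
        z[i] = z[i] * ((v[i] * c) + w[i]);
    }
}
```

The chained form makes three passes over memory and clobbers v, where the fused loop makes one pass and clobbers only z. That is the memory-versus-speed tradeoff in miniature.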

However, there are certainly instances where you can do better than the canned routines, and this is where I think Objective-C really shines for this library.

Categories Rule

In specific projects, I have specific math needs. For instance, FuzzMeasure has specific categories for generating swept sine waves. So let's say we wrote a highly-tuned version of the loop above, and we wanted to operate directly on z (i.e. we didn't care about the original value of z). We build a category called SMUGRealVector (MyOperation), and define this routine:

- (void)myOperation;
{
    // Replicate the above routine, but replace 'z' with 'self'.
}

Then, when we want to use it in our source, we #import "SMUGRealVector_MyOperation.h", and then call it:

[z myOperation];

This isn't news to many veteran Objective-C coders, but I found it to be a great tool for building an extensible library for doing math on large vectors. Furthermore, it lets me slowly evolve my class into one that closely resembles the MATLAB built-in functions.

That way, when I come across signal processing algorithms described in MATLAB code, I can quite easily port them to work in Objective-C. Even better, I can easily go back & forth between my code, and Octave (the free MATLAB clone), comparing results for operations as I code these algorithms.

So, Now What?

I'd really like to share this math library, but there are some problems I need to resolve before I give it away.

  • I lied above. FuzzMeasure doesn't have a swept sine generation category, because I've not split it out of the main framework yet. I do use categories in the way I described, but many of the extensions that are neatly organized are considered secrets right now.
  • It's a mess. When I started writing it, I sucked at Cocoa/Objective-C. So there are many silly coding mistakes, and things I'd rather not share.
  • A big mess. There are also many other routines in the framework that have nothing to do with math at all.
  • I'm really busy. When I do this, it's going to take time, and effort that I just don't have to spare right now.
  • The name sucks. SMUGFoundation makes sense for a general group of classes, but I need to split this out into a new SMUGMath framework.
  • I don't know how to build it. Do I branch my own copy, pushing changes to the mainline repository once in a while? Or, do I just split the code out and give it away, not ever consuming changes made by the public? Both have caveats.
  • I can't afford to support it. You are a great friend of open-source, and will help me improve the library. But we both know I'm going to be getting tons of emails from randoms asking about lame Xcode link failures, and how to stick it into their iPhone project (which won't work right now, due to a lack of vDSP, for starters).

I'll do my best, though, because I've wanted to share this framework for at least two years. I've only recently reached a point where I'm comfortable taking the steps.

If nothing else, I hope this post helps people come to the same conclusion I did, which is that Objective-C and Cocoa can be used as part of very sophisticated processing frameworks. There's no reason to force yourself into straight C, or C++, to achieve this.

Sending messages to objects might be slower than C or C++, but once you get into the method implementation, that's where you can really rock out. I've branched this framework off to do all sorts of advanced signal processing, including using OpenCL to further accelerate operations on large vectors of data.

Trust me, it's fast enough.

Please excuse the vague post, as I don’t have anything specific I’d like to share just yet. However, what I’d like to do here is call attention to my new favorite part of Mac OS X 10.6 Snow Leopard—OpenCL.

I’m working on some incredible technology for Capo lately, but it’s pretty heavyweight stuff. I’m processing audio data to produce a fancy visualization of its spectral content (not using the FFT). Unfortunately, running this operation is quite slow, so I’ve been trying to parallelize, and optimize it as best as I can.

In practice, I don’t intend to run the processing on entire audio files, but I’ve been using that as a worst-case example to test the throughput of a few approaches I’ve been working on. The input file for the tests below is a 45-second wave file, and the tool I built produces a detailed image file containing a time-vs-frequency view of the entire file.

All the test results were collected on my 8-core Mac Pro. I realize this isn’t representative of users’ machines, but it allows me to verify that all the system’s computing resources are being utilized—it’s not trivial to peg 8 cores. And, seeing how this is where computers are heading in the near future, this seems like a smart thing to focus on…

Initial Approach

I implemented the algorithm in question using a MATLAB source file for reference. I used my optimized math routines (which are sped up using vecLib/vDSP stuff), but it still took a while—492 seconds to process the file!

I fully expected this to be slow, so I wasn’t surprised with the result. It operated on only one core, and used up a modest amount of memory. At least I had a baseline in place, and output data to verify the optimized routines against.

(Note: This test was run at a lower resolution than the ones below, as it was unbearably slow otherwise. I think the true timing for this algorithm, with the same analysis parameters for the file, was more on the order of thousands of seconds. Running a test with the same resolution used in the rest of the tests, using a 0.91-second-long input file, resulted in 10 minutes of running time. Brutal!)

NSOperationQueue

I know this path very well—NSOperationQueue is employed to speed up waterfall calculations in FuzzMeasure, and it’s a very easy API to work with. After some intense optimization work that lasted over a day, I managed to get the process running in 115 seconds.

This was a huge improvement, and I even managed to integrate this algorithm into Capo with a preliminary UI wrapped around it. Unfortunately, I had to really turn down the resolution to get reasonable performance numbers.

There was also another interesting side effect, which is that the Capo audio engine would be bogged down as the file was being processed in the background. You see, NSOperationQueue seems to run at a normal priority, so it could preempt anything else that’s happening in your application. You can mitigate the problem by reducing the number of concurrent operations on the queue, but you don’t have much space to reduce that load on a 2-cpu machine.

Also, the way my math library is structured, I decided to trade memory for computation time, so I had to do a bunch of work to balance the memory load (e.g. how much of the audio file is loaded at once before spawning off lots of copies of its data so it can be read in parallel by all these threads) during runtime. By the end, my code wasn’t very pretty at all.

Finally, the way this code was all written, it wasn’t very easy to have on-demand updates of the parameters used to generate the image of the spectral data. So you couldn’t have a user-defined frequency range parameter that is manipulated in real time, as updating parameters would result in things being re-calculated again. These are design issues on my part (it was a tradeoff for overall algorithm run speed), but I made these decisions consciously.

OpenCL Attempt 1—Scalar Code

OpenCL excites a lot of people, and they seem to go ga-ga over the fact that you can schedule work for your GPU to do. However, it’s also an amazingly expressive way to write parallelized code for multi-core CPUs.

I didn’t try the OpenCL route until I had become sufficiently frustrated with my NSOperationQueue implementation. I had optimized it as much as I could handle (without severely obfuscating a ton of code, and making the whole implementation very fragile), and I really didn’t want to start thinking about making a future release of Capo 10.6-only so soon.

That said, I really wanted to know for sure that this would offer some kind of benefit over what I’ve been doing so far. Heck, maybe I _could_ use the GPU to do my bidding…

Well, I whipped up a Cocoa wrapper for OpenCL (which I hope to share once I add some more features to it), and wrote my first kernel for OpenCL in a few hours (with the OpenCL spec at hand). Once I wrapped my head around the whole process, I stepped back and realized that my code was much cleaner and more readable than before.

Still, this was a very naïve implementation, so I wasn’t expecting magic out of it. After working out the bugs, I measured 30.87 seconds! Holy cow—that’s a huge gain!

At this point, I could have stopped. I had shaved roughly 73% of the time off my NSOperationQueue implementation (115 seconds down to 30.87), but I wanted to push it a little further, because it still wasn’t running all that great on my dual-core 13″ MacBook Pro.

I have not yet integrated this into Capo, as I only just finished writing up the test code (it’s not hard to move over), but what I did notice is that OpenCL is scheduling these work items using the low-priority Grand Central Dispatch queue. This means that I will be playing very nicely with the rest of the system as this monstrous operation is happening. Score one more win for OpenCL!

OpenCL Attempt 2—Vectorized Code

The Intel CPUs ship with decent vector units these days, and OpenCL lets you write vectorized code very easily. You can cut a loop down to a quarter of its iterations by working on 4 elements at once, simply by switching to the float4 data type and adjusting the indexes into your data arrays.
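
The index juggling looks roughly like the plain-C sketch below. This is not the kernel from my project: the vec4 struct is a hypothetical stand-in for OpenCL's built-in float4, and the operation (y = a*x + y) is just a simple example. The point is the loop shape: one iteration per group of four elements, plus a scalar tail for lengths that aren't a multiple of 4.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for OpenCL's float4 type. */
typedef struct { float x, y, z, w; } vec4;

vec4 vec4_load(const float *p)
{
    vec4 v = { p[0], p[1], p[2], p[3] };
    return v;
}

void vec4_store(float *p, vec4 v)
{
    p[0] = v.x; p[1] = v.y; p[2] = v.z; p[3] = v.w;
}

/* a * x + y on all four lanes at once. */
vec4 vec4_mad(float a, vec4 x, vec4 y)
{
    vec4 r = { a * x.x + y.x, a * x.y + y.y, a * x.z + y.z, a * x.w + y.w };
    return r;
}

/* Scalar version: n iterations. */
void saxpy_scalar(float a, const float *x, float *y, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        y[i] = a * x[i] + y[i];
    }
}

/* Four-wide version: about n / 4 iterations, then a scalar tail. */
void saxpy_by4(float a, const float *x, float *y, size_t n)
{
    assert(n == 0 || (x != NULL && y != NULL));
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        vec4_store(y + i, vec4_mad(a, vec4_load(x + i), vec4_load(y + i)));
    }
    for (; i < n; i++) {
        y[i] = a * x[i] + y[i];
    }
}
```

In a real OpenCL kernel the outer loop disappears as well, since each work item handles one float4, but the same "index by groups of four" bookkeeping applies.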

This was tricky to get working—maybe 3-4 hours of toying around and debugging the code before I realized I had a mathematical error (I was combining the result of a non-linear operator—oops!) contributing to garbled output. After I got the bugs worked out, I was getting a result of 14.1 seconds.

Absolutely incredible: I more than halved my runtime by working with vectors.
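
For anyone who wants to check the arithmetic, here is a tiny C sketch (the helper names speedup and percent_saved are my own) that turns two timings into a speedup factor and a percentage saved. I've left the 492-second figure out of the comparison, since that run used a lower resolution than the others.

```c
#include <assert.h>

/* How many times faster the "after" run is than the "before" run. */
double speedup(double before_s, double after_s)
{
    assert(before_s > 0.0 && after_s > 0.0);
    return before_s / after_s;
}

/* Percentage of the original running time that was eliminated. */
double percent_saved(double before_s, double after_s)
{
    assert(before_s > 0.0 && after_s > 0.0);
    return 100.0 * (1.0 - after_s / before_s);
}
```

With the timings reported here, percent_saved(115.0, 30.87) comes out to about 73%, speedup(30.87, 14.1) to about 2.2, and speedup(115.0, 14.1) to about 8.2 overall.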

OpenCL Non-Attempt—Running on the GPU

I’m not planning to ship code that runs on the GPU for this particular algorithm. The GPU is a dodgy thing to work with, and I’m dealing with an algorithm that runs much longer than you want to tie up the GPU for. For instance, I actually manage to completely lock up my system for a full minute as the algorithm runs.

Oh, that’s right—this takes a full minute to execute on a GeForce 8800GT. The type of algorithm I’m working with is far better suited to the memory layout of a general purpose computer, its caching strategy, etc.

Furthermore, there’s an issue of overhead here…

OpenCL Overhead

When you work with the CPU, you avoid all that overhead of moving data to/from the GPU. With some extra flags specified when you create a buffer (CL_MEM_USE_HOST_PTR, for example), you can tell OpenCL that you are supplying your own host memory pointer, and you wish to avoid the copying step.

In my testing experience, it takes almost no time to start up an OpenCL context, compile your OpenCL program, and set up your memory/parameters when you use the CPU. On the GPU, I was losing somewhere between 1 and 3 seconds for the round-trip.

Conclusion

Overall I’m extremely impressed with what OpenCL brings to the table. It’s really not that hard of an API to use (especially now that I have a Cocoa wrapper), and if you work at it, you can get some huge speed gains over a more “traditional” multi-core programming approach such as what you get using NSOperationQueue.

It’s not for everyone, for sure, but it’s going to make a lot of otherwise complex things easier to do.

Many of you will be happy to know that TapeDeck 1.3 is out today. It’s an exciting release, because it adds the much-requested ability to record lossless audio.

Now you pro audio folks can record audio in the highest quality, and drag your tapes straight from TapeDeck into GarageBand with no loss in fidelity. And, if you’re really nutty about your audio quality (and have the hardware to back it up), you can unlock TapeDeck’s recording quality in the preferences so you can record beyond 44.1kHz!

Check out http://tapedeckapp.com to grab the latest release, and see the updated site design (inside the drawer). I also put some nice little touches into the UI for this release, because it needed some love. :)