Please excuse the vague post, as I don’t have anything specific I’d like to share just yet. However, what I’d like to do here is call attention to my new favorite part of Mac OS X 10.6 Snow Leopard—OpenCL.
I’ve been working on some incredible technology for Capo lately, but it’s pretty heavyweight stuff. I’m processing audio data to produce a fancy visualization of its spectral content (not using the FFT). Unfortunately, running this operation is quite slow, so I’ve been trying to parallelize and optimize it as best I can.
In practice, I don’t intend to run the processing on entire audio files, but I’ve been using that as a worst-case example to test the throughput of a few approaches I’ve been working on. The input file for the tests below is a 45-second wave file, and the tool I built produces a detailed image file containing a time-vs-frequency view of the entire file.
All the test results were collected on my 8-core Mac Pro. I realize this isn’t representative of users’ machines, but it allows me to verify that all the system’s computing resources are being utilized—it’s not trivial to peg 8 cores. And, seeing how this is where computers are heading in the near future, this seems like a smart thing to focus on…
Initial Approach
I implemented the algorithm in question using a MATLAB source file for reference. I used my optimized math routines (which are sped up using vecLib/vDSP stuff), but it still took a while—492 seconds to process the file!
I fully expected this to be slow, so I wasn’t surprised with the result. It operated on only one core, and used up a modest amount of memory. At least I had a baseline in place, and output data to verify the optimized routines against.
(Note: This test was run at a lower resolution than the ones below, as it was unbearably slow otherwise. I think the true timing for this algorithm, with the same analysis parameters for the file, was more on the order of thousands of seconds. Running a test at the same resolution used in the rest of the tests, using a 0.91-second input file, resulted in 10 minutes of running time. Brutal!)
NSOperationQueue
I know this path very well—NSOperationQueue is employed to speed up waterfall calculations in FuzzMeasure, and it’s a very easy API to work with. After some intense optimization work that lasted over a day, I managed to get the process running in 115 seconds.
This was a huge improvement, and I even managed to integrate this algorithm into Capo with a preliminary UI wrapped around it. Unfortunately, I had to really turn down the resolution to get reasonable performance numbers.
There was also another interesting side effect, which is that the Capo audio engine would be bogged down as the file was being processed in the background. You see, NSOperationQueue seems to run at a normal priority, so it could preempt anything else that’s happening in your application. You can mitigate the problem by reducing the number of concurrent operations on the queue, but you don’t have much space to reduce that load on a 2-cpu machine.
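The concurrency cap described above can be sketched in Python, purely as an analogy: the `max_workers` argument plays the role that `maxConcurrentOperationCount` plays on an NSOperationQueue. The `analyze_chunk` function, the chunk size, and the `reserved_cores` parameter are hypothetical placeholders, not Capo’s code.

```python
import os
from concurrent.futures import ThreadPoolExecutor

def analyze_chunk(chunk):
    """Hypothetical stand-in for the per-chunk spectral analysis."""
    return sum(x * x for x in chunk)

def process(samples, chunk_size, reserved_cores=1):
    # Leave some cores free so latency-sensitive work (such as an audio
    # engine) is less likely to be starved by the background analysis.
    workers = max(1, (os.cpu_count() or 1) - reserved_cores)
    chunks = [samples[i:i + chunk_size]
              for i in range(0, len(samples), chunk_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves input order, so results line up with chunks.
        return list(pool.map(analyze_chunk, chunks))

results = process(list(range(10)), chunk_size=4)
```

On a two-core machine the cap bottoms out at a single worker, which is the lack of headroom the paragraph above complains about.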
Also, the way my math library is structured, I decided to trade memory for computation time, so I had to do a bunch of work to balance the memory load (e.g. how much of the audio file is loaded at once before spawning off lots of copies of its data so it can be read in parallel by all these threads) during runtime. By the end, my code wasn’t very pretty at all.
Finally, the way this code was written, it wasn’t very easy to have on-demand updates of the parameters used to generate the image of the spectral data. So you couldn’t have a user-defined frequency range parameter that is manipulated in real time, as updating parameters would force everything to be recalculated. These are design issues on my part (it was a tradeoff for overall algorithm run speed), but I made these decisions consciously.
OpenCL Attempt 1—Scalar Code
OpenCL excites a lot of people, and they seem to go ga-ga over the fact that you can schedule work for your GPU to do. However, it’s also an amazingly expressive way to write parallelized code for multi-core CPUs.
I didn’t try the OpenCL route until I had become sufficiently frustrated with my NSOperationQueue implementation. I had optimized it as much as I could handle (without severely obfuscating a ton of code, and making the whole implementation very fragile), and I really didn’t want to start thinking about making a future release of Capo 10.6-only so soon.
That said, I really wanted to know for sure that this would offer some kind of benefit over what I’ve been doing so far. Heck, maybe I could use the GPU to do my bidding…
Well, I whipped up a Cocoa wrapper for OpenCL (which I hope to share once I add some more features to it), and wrote my first kernel for OpenCL in a few hours (with the OpenCL spec at hand). Once I wrapped my head around the whole process, I stepped back and realized that my code was much cleaner and more readable than before.
Still, this was a very naïve implementation, so I wasn’t expecting magic out of it. After working out the bugs, I measured 30.87 seconds! Holy cow—that’s a huge gain!
At this point, I could have stopped. I had shaved more than 70% of the time off my NSOperationQueue implementation (115 seconds down to 30.87), but I wanted to push it a little further, because it still wasn’t running all that great on my dual-core 13″ MacBook Pro.
I have not yet integrated this into Capo, as I only just finished writing up the test code (it’s not hard to move over), but what I did notice is that OpenCL schedules these work items on the low-priority Grand Central Dispatch queue. This means that I will be playing very nicely with the rest of the system as this monstrous operation is happening. Score one more win for OpenCL!
OpenCL Attempt 2—Vectorized Code
Intel CPUs ship with decent vector units these days, and OpenCL lets you write vectorized code very easily. You can cut a loop to a quarter of its iterations and work on 4 elements at once, simply by switching to the float4 data type and adjusting the indexes into your data arrays.
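To make the index arithmetic concrete, here is a minimal Python sketch. It does not use SIMD hardware and is not the kernel from this post; it only mimics how a float4-style loop visits a quarter as many indexes and handles four lanes per step. The `scale_scalar` and `scale_vec4` names and the gain operation are made up for illustration.

```python
def scale_scalar(data, gain):
    """One element per iteration."""
    out = [0.0] * len(data)
    for i in range(len(data)):
        out[i] = data[i] * gain
    return out

def scale_vec4(data, gain):
    """Four elements per iteration, mimicking a float4 work item."""
    if len(data) % 4 != 0:
        raise ValueError("pad the input to a multiple of 4 first")
    out = [0.0] * len(data)
    for v in range(len(data) // 4):        # a quarter of the iterations
        base = 4 * v                       # index of the first lane
        lanes = data[base:base + 4]        # one "float4" load
        out[base:base + 4] = [x * gain for x in lanes]  # one "float4" store
    return out

samples = [float(i) for i in range(8)]
```

Both functions return the same values for `samples`; the vectorized form just gets there in two loop steps instead of eight. Inputs whose length is not a multiple of 4 need padding or a scalar tail loop.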
This was tricky to get working: I spent maybe 3-4 hours toying around and debugging the code before I realized that a mathematical error (I was combining the results of a non-linear operator, oops!) was contributing to garbled output. After I got the bugs worked out, I was getting a result of 14.1 seconds.
Absolutely incredible: I basically halved my runtime by working with vectors.
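The mathematical error mentioned above is an easy one to make when merging vector lanes. The post does not say which operator was involved, so the sketch below assumes squaring as a stand-in: combining lanes before a non-linear step gives a different number than applying the step per lane and combining afterwards.

```python
def power_then_sum(lanes):
    """Apply the non-linear step (squaring) per lane, then combine."""
    return sum(x * x for x in lanes)

def sum_then_power(lanes):
    """Combine the lanes first, then square: a different quantity."""
    total = sum(lanes)
    return total * total

lanes = [1.0, -2.0, 3.0, 0.5]
per_lane = power_then_sum(lanes)   # 1 + 4 + 9 + 0.25
combined = sum_then_power(lanes)   # (1 - 2 + 3 + 0.5) squared
```

The two orderings only agree for linear operations, so the order of the reduction matters as soon as anything like a square, magnitude, or logarithm is involved.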
OpenCL Non-Attempt—Running on the GPU
I’m not planning to ship code that runs on the GPU for this particular algorithm. The GPU is a dodgy thing to work with, and I’m dealing with an algorithm that runs much longer than you want to tie up the GPU for. For instance, I actually managed to completely lock up my system for a full minute as the algorithm ran.
Oh, that’s right—this takes a full minute to execute on a GeForce 8800GT. The type of algorithm I’m working with is far better suited to the memory layout of a general purpose computer, its caching strategy, etc.
Furthermore, there’s an issue of overhead here…
OpenCL Overhead
When you work with the CPU, you avoid all that overhead of moving data to/from the GPU. With some extra flags specified (such as CL_MEM_USE_HOST_PTR when creating a buffer), you can tell OpenCL that you are supplying your own host memory pointer, and you wish to avoid the copying step.
In my testing experience, it takes almost no time to start up an OpenCL context, compile your OpenCL program, and set up your memory/parameters when you use the CPU. On the GPU, I was losing somewhere between 1 and 3 seconds for the round-trip.
Conclusion
Overall I’m extremely impressed with what OpenCL brings to the table. It’s really not that hard of an API to use (especially now that I have a Cocoa wrapper), and if you work at it, you can get some huge speed gains over a more “traditional” multi-core programming approach such as what you get using NSOperationQueue.
It’s not for everyone, for sure, but it’s going to make a lot of otherwise complex things easier to do.
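For reference, the step-to-step gains can be worked out from the timings reported above. This is just arithmetic on the numbers in this post; the 492-second reference run is left out because it ran at a lower resolution and is not directly comparable.

```python
timings = [
    ("NSOperationQueue", 115.0),
    ("OpenCL scalar", 30.87),
    ("OpenCL vectorized", 14.1),
]

def speedups(rows):
    """Return (name, speedup vs. previous step, fraction of time saved)."""
    out = []
    for (_, prev), (name, cur) in zip(rows, rows[1:]):
        out.append((name, prev / cur, 1.0 - cur / prev))
    return out

for name, factor, saved in speedups(timings):
    print(f"{name}: {factor:.2f}x faster, {saved:.0%} less time")
```

That works out to roughly 3.7x for the move to scalar OpenCL and about 2.2x more for the vectorized kernel, or a little over 8x end to end compared with the NSOperationQueue version.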




November 12th, 2009 at 6:38 pm
Great post! I really am trying to get done with my thesis work so I can upgrade to SL to take advantage of these features.
As an aside, do you have your own Obj-C/Cocoa wrappers to LAPACK/veclib/Accelerate framework etc? If so, I (along with many other computational scientists) could greatly benefit from seeing them.
Please do keep up the great work.
November 12th, 2009 at 11:44 pm
@Marc: I do wrap parts of vecLib/vDSP.h, but it’s nothing special. One day I plan on sharing my approach, but maybe not my whole framework as it is.
I suspect that any effort to clean up and distribute such a framework could result in a significant time sink. It’s a bit of a mess after being hacked on for 5+ years now! :)
That said, it’s been on my mind for a good number of those years, so eventually I’ll warm up to the idea…
November 13th, 2009 at 2:16 am
Chris, do you have a preferred email address where I can contact you? (Or, if you can see my email, would you email me so you do not have to post it?)
I have some questions/ideas you may be interested in talking about regarding a pure cocoa API to lapack. I am SO sick of the FtoC underscores.
Please do keep up the great work! Marc
November 13th, 2009 at 3:38 am
Did you try using GCD ‘dispatch_apply’ method instead of OpenCL?
I found it much easier and more efficient than OpenCL for some audio work I have done..
November 13th, 2009 at 11:15 am
@Marc you can reach me at chris@supermegaultragroovy.com.
@croc I’ve not, but I’m not sure it’s going to be much different from my other code. If I have time I’ll try and give that a shot as well.
My gut feeling is that I’d have to write a lot of similarly-annoying plumbing that I had to for the NSOperationQueue case, but we’ll see. OpenCL definitely has the advantage in letting me write highly specialized vector code that will be compatible with future generations of vector units / architectures. Writing that kind of code otherwise is irritating…
November 13th, 2009 at 5:37 pm
I work with Quartz Composer a bit. I’m a layman and a hobbyist, but I stay informed by closely monitoring and occasionally contributing to the forums at Kineme.net.
Be advised that, as of this posting, OpenCL does not run at all on quite a few relatively recent GPUs, and particularly on some manufactured by ATI. I have an ATI Radeon HD 2600 on a 2.8 GHz Core 2 Duo iMac, manufactured in November 2007. OpenCL flat out does not run on that card and I’ve been told that it never will.
Just a heads up, eh? At this particular juncture, the Kineme forums are seeing frequent postings about OpenCL non-functionality issues.
November 13th, 2009 at 6:07 pm
@Lee Thanks! I didn’t want to get into too much detail in my post, but this is effectively what I meant by “The GPU is a dodgy thing to work with.” It’s a world of hurt I really don’t want to get into.
Also, when your OpenCL programs crash, very often the user’s system crashes very hard (either with a kernel panic, or a complete GPU lockup). This is a road I’d prefer not to travel… ;)
November 14th, 2009 at 1:46 am
I’ve been planning to get started on a similar audio spectrogram-type project using OpenCL soon. It’s unfortunate that Apple/NVIDIA haven’t provided an FFT library for OpenCL – especially because they were showing impressive FFT benchmarks in presentations prior to OpenCL’s release. I’m expecting it to come soon though… which has kind of discouraged me from doing a port of the CUDA FFT library to OpenCL.
Out of curiosity, what kind of algorithm are you using to do the visualization? Some sort of bank of band-pass filters with amplitude tracking?
November 14th, 2009 at 9:29 am
I’m pretty surprised that the GPU was slower, as it should be an order of magnitude faster (at least) than your CPUs. You’ve got to be very very careful about how you load and store memory, however. If you’ve not watched the macresearch.org OpenCL podcasts, they give you an idea of what to look out for. It’s very easy to take those 240+ cores on your GPU and serialize most of them. Unfortunately there’s no performance counters on the GPU, so it’s quite difficult to figure out how well you’re doing WRT pipeline stalls, bank conflicts, etc.
November 14th, 2009 at 9:48 am
And actually rereading the end of your post, it seems like maybe your problem isn’t well suited to the constraints of the GPU? If that’s the case, I’d be really interested to know what kind of problems you ran into.
November 17th, 2009 at 1:18 pm
[...] Processing audio data to produce a fancy visualization of its spectral content. Some numbers to process an audio test file on a 8-core Mac Pro: – Initial Approach (Matlab): 492 seconds – NSOperationQueue: 115 seconds – OpenCL Attempt 1 (scalar code): 30.87 seconds – OpenCL Attempt 2 (vectorized code): 14.1 seconds [source] [...]
November 19th, 2009 at 7:05 am
@Jonah “Unfortunately there’s no performance counters on the GPU, so it’s quite difficult to figure out how well you’re doing WRT pipeline stalls, bank conflicts, etc”
You can use NVIDIA’s Visual Profiler to get access to the performance counters.
November 19th, 2009 at 9:44 am
@Tom this is intriguing. However, I can’t find a link to the OpenCL visual profiler for Mac OS X. Only Windows/Linux versions exist.
Got a link to share?
December 12th, 2009 at 8:08 am
Do release that Cocoa wrapper, I think you’re going to enable a whole new class of coders by just letting them read that, Chris. Many Cocoa coders are event-driven, and this just might be the event to push them to wrap their heads around compute kernels and dispatch queues. :)
December 12th, 2009 at 10:55 am
@godDLL I’ve got the wrapper separated into a framework now. Just gotta build a test app (or two) to distribute with it. I may just try and rewrite the Apple samples with my library, to make life easier…
February 22nd, 2010 at 6:33 pm
Hi, I’m also currently using vDSP but would like to port my code to Windows/Linux, so doing a kind of OpenCL wrapper sounds good (even if, for now, which systems would support it?). Also, from the few lines I read about your project, you are probably dealing with autocorrelation (for the pitch analysis, or not?). If so, you may also improve your problem mathematically by moving to the frequency domain to compute your autocorrelation and then going back to the time domain. (Now maybe you’re not autocorrelating at all…)
March 29th, 2010 at 4:52 pm
[...] Imagine that the journalist could encode many audio tracks at the same time, and even faster. Why would a journalist want to encode several audio tracks at once, and faster? Very simple: let’s suppose they have been to 5 product presentations, 2 official events, and 3 interviews over the course of the day, and have it all recorded on their recorder or phone. When they get to the office or home, they want to transfer it to their PC, but the format it was recorded in is not compatible (yes, I swear, incompatible formats are still in use ;) with the audio player on their PC/laptop. No problem: they reach for the audio encoder and convert all those audio tracks to whatever format in seconds thanks to GPGPU. We’re no longer talking about minutes, we’re talking about seconds (even VERY few seconds). [...]
August 9th, 2010 at 1:44 pm
One advantage of writing in the OpenCL vector types is that unlike SSE/AltiVec/… intrinsics, they are portable, meaning you can be up and running on new architectures with vectorized code in no time. You also get standard operators like +-/*&,etc.
One tip: use larger vectors than the machine width where it makes sense, like float8 or float16. You should see a small win (~10%) today due to better use of superscalar resources and what is in effect loop unrolling. However, in the future your code can better take advantage of bigger vector units, like AVX, without further modification by you. http://software.intel.com/en-us/avx/
August 9th, 2010 at 1:52 pm
Ian: Yes! This is an oft-overlooked benefit of OpenCL, in my opinion.
Thanks for the heads up on the float8 types—I’ll have to try that. :)
Can’t wait to see AVX show up in a future Mac update. Seems like a great thing for me to hold out for… ;)