
Please excuse the vague post, as I don’t have anything specific I’d like to share just yet. However, what I’d like to do here is call attention to my new favorite part of Mac OS X 10.6 Snow Leopard—OpenCL.

I’ve been working on some incredible technology for Capo lately, but it’s pretty heavyweight stuff. I’m processing audio data to produce a fancy visualization of its spectral content (not using the FFT). Unfortunately, running this operation is quite slow, so I’ve been trying to parallelize and optimize it as best I can.

In practice, I don’t intend to run the processing on entire audio files, but I’ve been using that as a worst-case example to test the throughput of a few approaches I’ve been working on. The input file for the tests below is a 45-second wave file, and the tool I built produces a detailed image file containing a time-vs-frequency view of the entire file.

All the test results were collected on my 8-core Mac Pro. I realize this isn’t representative of users’ machines, but it allows me to verify that all the system’s computing resources are being utilized—it’s not trivial to peg 8 cores. And, seeing how this is where computers are heading in the near future, this seems like a smart thing to focus on…

Initial Approach

I implemented the algorithm in question using a MATLAB source file for reference. I used my optimized math routines (which are sped up using vecLib/vDSP stuff), but it still took a while—492 seconds to process the file!

I fully expected this to be slow, so I wasn’t surprised with the result. It operated on only one core, and used up a modest amount of memory. At least I had a baseline in place, and output data to verify the optimized routines against.

(Note: This test was run at a lower resolution than the ones below, because it was unbearably slow otherwise. I think the true timing for this algorithm, with the same analysis parameters for the file, was more on the order of thousands of seconds. Running a test at the same resolution used in the rest of the tests, with a 0.91-second input file, took 10 minutes of running time. Brutal!)

NSOperationQueue

I know this path very well—NSOperationQueue is employed to speed up waterfall calculations in FuzzMeasure, and it’s a very easy API to work with. After some intense optimization work that lasted over a day, I managed to get the process running in 115 seconds.

This was a huge improvement, and I even managed to integrate this algorithm into Capo with a preliminary UI wrapped around it. Unfortunately, I had to really turn down the resolution to get reasonable performance numbers.

There was also another interesting side effect: the Capo audio engine would get bogged down while the file was being processed in the background. You see, NSOperationQueue seems to run at a normal priority, so it could preempt anything else that’s happening in your application. You can mitigate the problem by reducing the number of concurrent operations on the queue, but you don’t have much room to reduce that load on a 2-CPU machine.

Also, because of the way my math library is structured, I decided to trade memory for computation time, so I had to do a bunch of work to balance the memory load at runtime (e.g. how much of the audio file is loaded at once before spawning off lots of copies of its data so it can be read in parallel by all these threads). By the end, my code wasn’t very pretty at all.

Finally, the way this code was all written, it wasn’t very easy to have on-demand updates of the parameters used to generate the image of the spectral data. So you couldn’t have a user-defined frequency range parameter that is manipulated in real time, as updating parameters would result in things being re-calculated again. These are design issues on my part (it was a tradeoff for overall algorithm run speed), but I made these decisions consciously.

OpenCL Attempt 1—Scalar Code

OpenCL excites a lot of people, and they seem to go ga-ga over the fact that you can schedule work for your GPU to do. However, it’s also an amazingly expressive way to write parallelized code for multi-core CPUs.

I didn’t try the OpenCL route until I had become sufficiently frustrated with my NSOperationQueue implementation. I had optimized it as much as I could handle (without severely obfuscating a ton of code, and making the whole implementation very fragile), and I really didn’t want to start thinking about making a future release of Capo 10.6-only so soon.

That said, I really wanted to know for sure that this would offer some kind of benefit over what I’ve been doing so far. Heck, maybe I _could_ use the GPU to do my bidding…

Well, I whipped up a Cocoa wrapper for OpenCL (which I hope to share once I add some more features to it), and wrote my first kernel for OpenCL in a few hours (with the OpenCL spec at hand). Once I wrapped my head around the whole process, I stepped back and realized that my code was much cleaner and more readable than before.

Still, this was a very naïve implementation, so I wasn’t expecting magic out of it. After working out the bugs, I measured 30.87 seconds! Holy cow—that’s a huge gain!

At this point, I could have stopped. I had shaved over 70% of the time off my NSOperationQueue implementation (115 seconds down to 30.87), but I wanted to push it a little further, because it still wasn’t running all that great on my dual-core 13″ MacBook Pro.

I have not yet integrated this into Capo, as I only just finished writing up the test code (it’s not hard to move over), but what I did notice is that OpenCL is scheduling these work items using the low-priority Grand Central Dispatch queue. This means that I will be playing very nicely with the rest of the system as this monstrous operation is happening. Score one more win for OpenCL!

OpenCL Attempt 2—Vectorized Code

Intel CPUs ship with decent vector units these days, and OpenCL lets you write vectorized code very easily. You can cut a loop down to a quarter of its iterations and work on 4 elements at once, simply by switching to the float4 data type and adjusting the indexes into your data arrays.
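To make the float4 idea concrete, here is a minimal sketch in plain C rather than actual OpenCL kernel code. The `float4` struct and the `load4`/`store4` helpers are stand-ins I made up to mimic OpenCL C's built-in `float4` type and its `vload4`/`vstore4` functions, and the element-wise multiply is only an illustrative workload, not the algorithm from the post.

```c
#include <stddef.h>

/* Stand-in for OpenCL C's built-in float4 (hypothetical, for illustration). */
typedef struct { float x, y, z, w; } float4;

/* Read elements 4*i .. 4*i+3 of p as one float4 (mimics vload4). */
static float4 load4(const float *p, size_t i) {
    float4 v = { p[4 * i], p[4 * i + 1], p[4 * i + 2], p[4 * i + 3] };
    return v;
}

/* Write a float4 back to elements 4*i .. 4*i+3 of p (mimics vstore4). */
static void store4(float4 v, float *p, size_t i) {
    p[4 * i] = v.x; p[4 * i + 1] = v.y; p[4 * i + 2] = v.z; p[4 * i + 3] = v.w;
}

/* Element-wise multiply of two float4 values. */
static float4 mul4(float4 a, float4 b) {
    float4 r = { a.x * b.x, a.y * b.y, a.z * b.z, a.w * b.w };
    return r;
}

/* Scalar version: n loop iterations, one element each. */
static void scale_scalar(const float *in, const float *gain, float *out, size_t n) {
    for (size_t i = 0; i < n; i++)
        out[i] = in[i] * gain[i];
}

/* Vector version: n/4 loop iterations, four elements each.
   Assumes n is a multiple of 4. */
static void scale_vec4(const float *in, const float *gain, float *out, size_t n) {
    for (size_t i = 0; i < n / 4; i++)
        store4(mul4(load4(in, i), load4(gain, i)), out, i);
}
```

Both functions produce the same output; the vector version simply walks the arrays in steps of four. In a real kernel the compiler maps `float4` arithmetic onto the CPU's vector unit, which is where the speed comes from.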

This was tricky to get working—maybe 3-4 hours of toying around and debugging the code before I realized I had a mathematical error (I was combining the result of a non-linear operator—oops!) contributing to garbled output. After I got the bugs worked out, I was getting a result of 14.1 seconds.
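The post doesn't say which non-linear operator was involved, so the following is only an illustration of the general pitfall, with squaring picked as an assumed example. For a non-linear operator, "apply it to each element, then combine" and "combine the elements, then apply it once" give different answers, which is easy to get wrong when regrouping work into vectors.

```c
#include <stddef.h>

/* A non-linear operator, chosen only for illustration: f(x) = x * x. */
static float square(float x) { return x * x; }

/* Apply the operator to each element, then add up the results. */
static float square_then_sum(const float *v, size_t n) {
    float acc = 0.0f;
    for (size_t i = 0; i < n; i++)
        acc += square(v[i]);
    return acc;
}

/* Add up the elements first, then apply the operator once. */
static float sum_then_square(const float *v, size_t n) {
    float acc = 0.0f;
    for (size_t i = 0; i < n; i++)
        acc += v[i];
    return square(acc);
}
```

For the input {1, 2, 3, 4}, the first function returns 30 and the second returns 100. With a linear operator (say, multiplying by a constant) the two orders would agree, which is why this kind of bug only shows up as garbled output rather than a crash.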

Absolutely incredible: I more than halved my runtime by working with vectors.
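To put the timings quoted so far side by side, here is a small worked calculation in C. The helper names `speedup` and `percent_saved` are made up for this sketch. The 492-second baseline is left out on purpose, since that run used a lower resolution and isn't comparable.

```c
/* How many times faster the "after" run is than the "before" run. */
static double speedup(double before_s, double after_s) {
    return before_s / after_s;
}

/* Percentage of the "before" running time that was eliminated. */
static double percent_saved(double before_s, double after_s) {
    return 100.0 * (1.0 - after_s / before_s);
}

/* Timings from the post, in seconds. */
static const double T_NSOPERATIONQUEUE = 115.0;
static const double T_OPENCL_SCALAR    = 30.87;
static const double T_OPENCL_VECTOR    = 14.1;
```

Plugging in the numbers: the scalar OpenCL kernel is about 3.7x faster than the NSOperationQueue version (roughly 73% of the time saved), vectorizing gives about another 2.2x (roughly 54% saved), and the two together come to about 8.2x.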

OpenCL Non-Attempt—Running on the GPU

I’m not planning to ship code that runs on the GPU for this particular algorithm. The GPU is a dodgy thing to work with, and I’m dealing with an algorithm that runs much longer than you want to tie up the GPU for. For instance, I actually managed to completely lock up my system for a full minute while the algorithm ran.

Oh, that’s right—this takes a full minute to execute on a GeForce 8800GT. The type of algorithm I’m working with is far better suited to the memory layout of a general purpose computer, its caching strategy, etc.

Furthermore, there’s an issue of overhead here…

OpenCL Overhead

When you work with the CPU, you avoid all that overhead of moving data to/from the GPU. With the right flag on your buffers (CL_MEM_USE_HOST_PTR), you can tell OpenCL that you are supplying your own host memory pointer, and you wish to avoid the copying step.

In my testing experience, it takes almost no time to start up an OpenCL context, compile your OpenCL program, and set up your memory/parameters when you use the CPU. On the GPU, I was losing somewhere between 1 and 3 seconds for the round-trip.

Conclusion

Overall, I’m extremely impressed with what OpenCL brings to the table. It’s really not that hard an API to use (especially now that I have a Cocoa wrapper), and if you work at it, you can get some huge speed gains over a more “traditional” multi-core programming approach such as what you get using NSOperationQueue.

It’s not for everyone, for sure, but it’s going to make a lot of otherwise complex things easier to do.

There’s really not all that much to say beyond, “all my products work well in Snow Leopard.”

I highly recommend upgrading to Snow Leopard, as there are improvements across the board that will be apparent in all your applications—especially when it comes to performance. I’ve been running Snow Leopard for months as part of Apple’s Developer Connection program, and it has been solid for a long time.

For FuzzMeasure users on Snow Leopard, I encourage you to check out version 3.2 at the latest build page. You’ll be treated to some speed increases as a result of the 64-bit binary that’s available only to Snow Leopard users. (If you’re stuck on Leopard, don’t worry! FuzzMeasure 3.2 will remain compatible with Mac OS X 10.5.)

I just pushed the latest release of Capo online. Go grab it at the Capo site.

As usual, release notes for Capo are available. Instead of simply leaving it at that, I’m going to give a little bit of commentary on what I fixed, since I don’t want the release notes to get overly verbose.

Album art is now available immediately after it is retrieved from the iTunes store, and not after QuickLook has gotten around to caching it.

I found this issue while testing Snow Leopard, though the issue likely occurs on Leopard as well. I moved a bunch of music over to my test machine, and couldn’t see any album art for tracks dragged into Capo. What gives?

I discovered that there is a window of time between acquiring album art from the iTMS and Capo being able to see that album art. Further investigation revealed that the Finder and QuickLook previews were also unable to show album art during this time. In one case it took a few days to clear up, and in another case it was only a few minutes (likely shortened by my poking around using qlmanage, which always showed album art immediately after getting it in iTunes). I filed a bug (rdar://7123568) thinking that the Finder and QuickLook got screwed up in Snow Leopard.

Some further investigation into the matter (with the help of Apple’s engineers assigned to the bug report) revealed a workaround that let Capo access that album art before QuickLook got around to caching a thumbnail for the audio file. Because Capo acquires the album art on a separate thread, I’m able to take the slower route of forcing QuickLook to dig beyond its cache and generate the album art thumbnail for me on demand.

Fixed crashes reported by a small group of users.

Oddly, I never encountered this one during pre-release testing. However, if the effects window was disclosed, and Capo was left to play audio for a little while (anywhere from 10 seconds to 10 minutes, in my testing), it’d crash. It turns out I had a bad getter which didn’t call ‘[[value retain] autorelease]’ on its returned value.

In many cases, that’s not a big deal, because most code runs in a single thread, and you can be certain that the value won’t go away in between you retrieving it and retaining it shortly thereafter. In this particular case, the value was being released (and freed!) in a separate thread—before the value had a chance to be retained. Oops!

Normally I rely on Objective-C 2.0 properties for my getters and setters, but in this case it’s an indexed accessor (valueForIndex:) which I had to implement manually. Objective-C @properties are written in such a way that all member variables are retained and autoreleased (or copied and autoreleased) in the getters depending on how you set up your property (for retain and copy, obviously).

Cleaned up a handful of small memory leaks.

I decided to run Instruments again on Capo and clean up every little loose end I found. It was fun to find my way around the new version that ships with Snow Leopard.

Removed assertion failures that may have shown up in the console when mono files are played. When mono files are played, spectrum data is shown properly in the effect views.

Both of these were found by me while testing the fixes for the above issues. It was just a fluke that I even encountered a mono file in my testing (I recorded it using TapeDeck), and I couldn’t let myself ship the other fixes until I cleaned this one up.

So, there you have it—a glimpse into all the work that goes into a tiny maintenance release. It was fun to do this little postmortem, and I hope to do them more often in the future.

Some folks are turned off by Capo because they feel that learning music by ear is difficult. Unless you’re afflicted with amusia, which is unlikely, it’s really not a tough skill to learn.

I recognize that folks are going to have some difficulty figuring out the process of learning music, and I would like to try and help them out. The best way I could imagine relaying the information is to do it on camera, and specific to Capo.

Of course, there are also plenty of Capo users who already know the basics of learning music by ear, or are quite comfortable with the process. For those folks, I hope to share some Capo tips that’ll help them jump around the application much more quickly.

My plan is to have a few types of episodes: the core ear-learning material for beginners/intermediates, some tips/tricks that are specific to Capo, and some “in the trenches” videos where I demonstrate the use of Capo to learn real songs.

For the “in the trenches” series, I’ll be featuring music by Jonathan Coulton, who has graciously allowed me to use his music in videos related to Capo. I’m so stoked about this, because I enjoy a lot of his material. If you, or anyone you know, is in a band who’d like their music featured in the lessons, let me know and I’d be happy to check it out and see if it fits.

So check out the first (HD!) video, which I’ve posted on YouTube: http://www.youtube.com/watch?v=C6mlvaVJyVY

Leave it to my creative users to discover new ways of using Capo.

Dr. Drang has an excellent blog post about using Capo to transcribe his voice notes. Check it out!

Dr. Drang also posted a really nice review of Capo when it launched. Man, my users rock!!