Archive for the ‘programming’ Category

CUDA

Sunday, August 24th, 2008

I recently took a job at NVIDIA to work on the CUDA team.  Someone told me about CUDA a few months ago, and I finally decided to look into it a month ago.  I was won over by its coding model and really wanted to work on it, so I sent my resume to NVIDIA.  Within a week or two, I was an employee!

I encourage you to look into CUDA.  The future of software development will tend towards multi-core and multi-threaded programming and hardware.  I think CUDA is the best solution to this problem that I have found so far.

-Edward

Found an Optimization in my Netflix Code

Tuesday, April 29th, 2008

I have a loop that runs over the 100 million plus entries of the Netflix Prize data set. Over the weekend I had read something on reddit.com about performance, and there was a whole thread trying to answer the question: Why does performance matter? Many people were arguing that there is no reason to focus on performance. Some people even showed that rewriting some examples made them slightly faster, though their rewrites made the code less readable and still not as fast as it could have been. (The examples involved factoring numbers.)

The biggest thing that most people seem to miss about performance is that “why” question. They think performance requires large changes, and that unreadable code is simply the cost of it. I disagree. Performance can usually be gained at no cost to readability. As for why: there is no reason to make the machine do more work than is needed to generate the correct answer. Back to my example.

This loop needed to update these 100 million pieces of data. In the loop, there was a number that worked out to be a constant. I had assumed that the compiler would generate the correct code: a divide by a constant can be turned into a multiply by the inverse for a huge speedup. It didn’t generate that code for me. After I made the code change myself, it went from 62 seconds to 47 seconds, just by adding a 1/ and changing the /= to a *=. Seems like that is worth it.
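A minimal sketch of that change, written in Python purely for illustration. The original loop was presumably C or C++ and is not shown, so the function names and sample data here are hypothetical. Compilers usually leave a floating-point divide alone unless told otherwise, because multiplying by the inverse can round differently in the last bit; that may be why the rewrite had to be done by hand.

```python
def scale_by_division(values, divisor):
    """Divide every element by the same constant: one divide per element."""
    return [v / divisor for v in values]


def scale_by_reciprocal(values, divisor):
    """Compute 1/divisor once, then do one multiply per element."""
    inverse = 1.0 / divisor  # the added "1/"
    return [v * inverse for v in values]  # the "/=" turned into a "*="


# Hypothetical sample data; the real loop covered 100 million plus entries.
ratings = [1.0, 2.0, 3.0, 4.0, 5.0]
by_division = scale_by_division(ratings, 8.0)
by_reciprocal = scale_by_reciprocal(ratings, 8.0)
```

With a divisor of 8.0 both versions give identical results, since 1/8 is exact in binary floating point. For other divisors the two can differ by a rounding error, so the change is only safe when that is acceptable. Python itself will not show the speedup described above; the sketch only shows the shape of the edit.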

The second big win was from data ordering. In this loop, there is a 2-d array. I had moved this code from a different subsection where it was accessing the array like this: array[constant][variable] - iterating over the right-hand subscript. In this loop though, the access was different: array[variable][constant], which means it was not accessing memory contiguously per iteration. I made another small change: I swapped the subscripts and updated my dataset to reflect this change. After this change, the code went to 25 seconds per iteration.
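To see why the subscript order matters, here is a small sketch that computes where each access lands in a row-major (C-style) 2-d array. The array dimensions and helper name are assumptions for illustration, not the real dataset's.

```python
def flat_offset(row, col, num_cols):
    """Element offset of array[row][col] in a row-major (C layout) 2-d array."""
    return row * num_cols + col


NUM_COLS = 1000  # hypothetical row length

# array[constant][variable]: the loop index is the right-hand subscript.
contiguous = [flat_offset(2, i, NUM_COLS) for i in range(4)]

# array[variable][constant]: the loop index is the left-hand subscript.
strided = [flat_offset(i, 2, NUM_COLS) for i in range(4)]

print(contiguous)  # [2000, 2001, 2002, 2003]
print(strided)     # [2, 1002, 2002, 3002]
```

In the first pattern consecutive iterations touch neighbouring elements, so each cache line fetched from memory is fully used. In the second, every iteration jumps a whole row ahead.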

So, with no changes in readability, I was able to go from 62 to 25 seconds, which is a big win. Why do it? Well, I can now run 2,000 iterations in 14 hours rather than the 34 hours it took before, to generate the exact same result. These are the kinds of important changes people should look for within their own code, once their algorithm works and is readable.
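The hour figures follow directly from the per-iteration times. A quick check (the helper name is made up for this sketch):

```python
ITERATIONS = 2000


def total_hours(seconds_per_iteration, iterations=ITERATIONS):
    """Wall-clock hours for a run at a fixed cost per iteration."""
    return seconds_per_iteration * iterations / 3600


before = total_hours(62)  # roughly 34.4 hours
after = total_hours(25)   # roughly 13.9 hours
```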

-Edward

Order (n^2) in time

Tuesday, April 22nd, 2008

I was reading a set of posts on reddit about functional programming.  At some point in the thread, the topic went to order of algorithms.  Someone commented that he had implemented something as an order n-squared algorithm since he was only using a few elements in the list.  He said that since no one noticed, it didn’t matter.  It was very frustrating to read that.

I wrote about this in my blog when I worked at Adobe and will write a bit about it now.  In all of my travels as a specialist in code optimization, the thing I see the most is people implementing code in O(n^2) that could be written in linear or constant time.  I know Netscape had a few of them when I was working on that project.  Since it is so pervasive, let me give some concrete examples of why/when/how to look for this.

Let’s create a table, with elements being the number of elements in question, and linear and n-squared being the time it would take to process that number of elements with that algorithm:

Elements    Linear    n-squared
       1         1            1
      10        10          100
     100       100       10,000
   1,000     1,000    1,000,000

See, it grows pretty fast!  If your n is 1, 10, or even 100, no one will notice a difference in the time.  But if the number of elements grows just a bit more, suddenly the time taken will be noticeable.  There is a catch, though: sometimes it pays to use n-squared up to much higher ranges.  The actual cost is M*n + K for linear, and M*n^2 + K for n-squared, with M and K different for each.  So, let us create a new table:  1000 * n + 1000 will be the linear, and 1 * n^2 + 1 will be the n-squared:

Elements       Linear    n-squared
       1        2,000            2
      10       11,000          101
     100      101,000       10,001
   1,000    1,001,000    1,000,001

In this case, the linear algorithm needs more than a thousand elements before it beats the n-squared algorithm.  M represents something that happens with each operation, like a data copy, while K represents things like setup time, pre-calculating, etc.
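The crossover point for these two cost models can be found by brute force. A small sketch, using the constants from the second table (the function names are made up for illustration):

```python
def linear_cost(n, m=1000, k=1000):
    """Cost model M*n + K for the linear algorithm."""
    return m * n + k


def quadratic_cost(n, m=1, k=1):
    """Cost model M*n^2 + K for the n-squared algorithm."""
    return m * n * n + k


def crossover(limit=10_000):
    """Smallest n at which the linear algorithm is cheaper, or None."""
    for n in range(1, limit + 1):
        if linear_cost(n) < quadratic_cost(n):
            return n
    return None


print(crossover())  # 1001
```

At n = 1,000 the n-squared version is still ahead (1,000,001 against 1,001,000). At n = 1,001 the linear one wins for the first time (1,002,000 against 1,002,002), and the gap widens quickly after that.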

So, the important thing to remember is that algorithm selection depends on the dataset it will be used on.  In most cases, the number of elements will grow to be a lot larger than initially expected, so it almost always pays to switch to a linear algorithm if you can find one.  There are rare cases where the number of elements, the setup cost, and the per-operation cost of the linear algorithm don’t make it worth using, so you have to be aware of that.  This is only considering the speed cost of an algorithm; the same set of arguments applies to memory cost.

I’ll leave with this: at one point Netscape’s browser used linked lists with no tail pointer (and it still might), which means the operation to build a linked list would be n-squared.  This saved a 4-byte pointer, and if the lists were small enough, there was no impact on performance.  The easiest way to find n-squared is to feed in a thousand or even ten thousand elements and see where the code “freezes” the machine.  Creating a website with a thousand elements would create a thousand linked list elements, which would cause a million operations, which would then take many seconds to display while the user is waiting.  Switching it over to linear would have meant that one could create a page with a million elements before seeing this kind of slowdown, and the thousand element page would have been instantaneous from the user’s perspective.
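A sketch of the effect, not Netscape’s code: a singly linked list built by appending, once by walking from the head each time and once with a tail pointer. The class and function names are hypothetical, and the counter only tracks how many links are followed to reach the end.

```python
class Node:
    def __init__(self, value):
        self.value = value
        self.next = None


def build_without_tail(values):
    """Append by walking from the head every time. Returns (head, links_followed)."""
    head = None
    steps = 0
    for value in values:
        node = Node(value)
        if head is None:
            head = node
            continue
        current = head
        while current.next is not None:  # walk to the last node
            current = current.next
            steps += 1
        current.next = node
    return head, steps


def build_with_tail(values):
    """Append using a tail pointer. Returns (head, links_followed)."""
    head = tail = None
    steps = 0  # never incremented: no walking is needed
    for value in values:
        node = Node(value)
        if head is None:
            head = tail = node
        else:
            tail.next = node
            tail = node
    return head, steps


_, slow_steps = build_without_tail(range(1000))
_, fast_steps = build_with_tail(range(1000))
print(slow_steps, fast_steps)  # 498501 0
```

Building n elements without a tail pointer follows (n-1)(n-2)/2 links, about half a million for a thousand elements, so the work grows with n^2. The tail pointer costs one extra pointer per list and removes the walk entirely.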

-Edward

Problem 127

Thursday, April 10th, 2008

It was over a year ago now… Project Euler - I was in the top 10, then they posted Problem 127. All of the problems there should execute in under a minute, and I thought I had optimized my code, but it had been running for many minutes when I stopped it to see what was going on. I instrumented my code and saw that it was just chugging away. It would have taken 10-ish hours to complete! What was the problem? Python.

Quite a few of the problems at Project Euler require numbers to be larger than 32 bits. I originally used Project Euler to teach myself Python, and Python deals with big numbers fairly easily in code. My language of choice at the time, C/C++/Asm, didn’t. Even in Python, an interpreted language, I was able to get most of my solutions to run in under a minute - and that was the goal. Every so often, like times where the code iterated over large arrays, or basically anything that had a large number of real operations that had to be performed, I would have to switch back to C/C++. If the solution was taking minutes to run under Python, it would complete in a second or two under C/C++. But the time I saved writing the code in Python for the cases it could solve was worth it.

Now I have F# - my new favorite language. It also has big number support built in (so does Erlang). I was able to recode my old solution from memory from last year, and suddenly, I had a solution in under a minute! Well, it took a bit of work, since I had a bug in my code - I missed an overflow, so it was doing way too many calculations - it was taking minutes to generate the wrong answer. (It compared numbers to see if a calculation needed to be done, but the numbers being compared overflowed, so it was always doing the calculation, which was very slow as well as incorrect.)
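The original code is not shown, so the following is only a sketch of how a bug of this kind can arise. It simulates 32-bit two’s-complement wraparound in Python; the limit, the helper names, and the comparison itself are all hypothetical.

```python
def wrap_int32(x):
    """Reduce x to a signed 32-bit value, the way fixed-width integers wrap."""
    x &= 0xFFFFFFFF
    return x - 0x100000000 if x >= 0x80000000 else x


LIMIT = 120_000  # hypothetical bound used to decide whether to do the work


def needs_work_exact(a, b):
    """Comparison with arbitrary-precision integers: the intended test."""
    return a * b < LIMIT


def needs_work_int32(a, b):
    """The same comparison when the product silently wraps at 32 bits."""
    return wrap_int32(a * b) < LIMIT


print(needs_work_exact(65536, 65536))  # False: 2**32 is far above the limit
print(needs_work_int32(65536, 65536))  # True: the product wrapped to 0
```

Once the product wraps, the guard passes for values it should have rejected, so the expensive calculation runs anyway and feeds on wrong numbers.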

Some important things I learned from this year-ish journey:

1) People always tell me that only the order of the algorithm matters, and that linear speed-ups will be swamped by faster machines. If the linear factor is in the hundreds, thousands, or millions, that is false. Rewriting critical code in a compiled or assembly language will always be of use.

2) Numbers shouldn’t have ranges. The overflow bug that cost me a day wouldn’t have existed if the language could handle numbers cleanly, not just integers. I had written the code in Erlang last night, but it was too slow to complete (linked lists were the killer there). Erlang transparently works with all numbers, so that code couldn’t have that bug. Couldn’t. Bug-free code is the goal, so the languages need to handle these types of things transparently.

3) Getting things done is important. I used Python because it was trivial to write most of the code needed for Project Euler. The code only had to execute once (well, I reused a lot of it, so it had to execute once per problem), so optimizing my time speeds up the time from start to answer-in-hand. And that is what programming is about, getting the correct answer as quickly as possible.

-Edward

GCD

Wednesday, April 9th, 2008

I was working on a new problem from Project Euler - they post new problems every week-ish.  I decided to try this one in Erlang, to give myself more exposure to the language to see what I think of it.  I realized that something was missing, and actually, I think it is missing from most languages:  GCD.

Alex Stepanov has given many talks about the importance of GCD - I think there are a few four-hour talks out there that have been recorded and are very interesting to watch.  I realize more and more how right he is about this.  (His paper on GCD is linked to at Stepanov Papers, which I have linked to on the right - some great reading on that site.)

It seems to be a useful function, one that should be built in.  Given that it is not, I’ve included one in this blog post.  Code like this should only be written once; programmers should be able to focus on the problem at hand, not worry about missing common functions - it takes away from focus.  Sure, everyone could write their own, but the same is true for sin, cos, etc.  GCD should be standard; it is in the J programming language.

Before I wrote this, I did some searching, and found that a few people had written a GCD and posted their code.  It was after looking at their code that I decided to open a dialog about this, since some of the ones I found were wrong, or longer, or more complex than the actual GCD calculation.

-Edward

gcd(M, 0) -> M;
gcd(M, N) -> gcd(N, M rem N).
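For comparison, the same Euclidean recurrence as a Python sketch, written as a loop. It assumes non-negative inputs, since Python’s % and Erlang’s rem treat negative operands differently. Recent Python versions (3.5 and later) do ship this as math.gcd in the standard library.

```python
import math


def gcd(m, n):
    """Greatest common divisor of two non-negative integers (Euclid)."""
    while n != 0:
        m, n = n, m % n  # same step as gcd(N, M rem N)
    return m


print(gcd(48, 18))       # 6
print(math.gcd(48, 18))  # 6
```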

More on Erlang Factoring

Wednesday, April 9th, 2008

I tried something new.  I was curious as to why the times varied so much, so I installed VMWare on my MacPro and ran the previously mentioned code under a few OSes.  I have Ubuntu 64 bit installed on a separate HD for my Mac, so I installed one under VMWare to figure out how much the VM software would skew the other results.  Looks like not by much.

MacPro, Ubuntu 64, direct boot -> 3.7 seconds
MacPro, Ubuntu 64, VMWare -> 3.9 seconds

With this out of the way, I can tell that the numbers I get from under VMWare would match up pretty closely to what I would see if I installed the OSes/Erlang directly.  Here is what I found under the VMs:

MacPro, Windows XP 32bit -> 6.9 seconds
MacPro, Ubuntu 32 -> 18.9 seconds

What does this tell me?  That the test I was using for Erlang speed is heavily based on the underlying bit-ness.  Erlang on a 64 bit system is faster than the 32 bit version when doing large number arithmetic.  And the AMD 4600+ is half the speed of the Xeon chip in the MacPro.

It looks like this is a win for 64 bit.  These new results help explain something else I had seen last week - the speeds of F# and Haskell on the same problem.  It took 32 seconds under Haskell and minutes under F# on a 32 bit machine to do the same factoring.  I’m guessing that they would be faster on a 64 bit version, though I know that F# is only 32 bit right now, but I hope it becomes 64 bit soon!

-Edward

Factor in Erlang

Sunday, April 6th, 2008

The first piece of code that I wrote in Erlang was to factor numbers.  I thought it would be a good test to learn the basics of Erlang (very basics).  I have appended the code to this post.  I am not sure if it is the fastest way to factor a number (vs building up the list, though that would have to pass an accumulator around).  I will have to investigate that in the future, but it served its purpose for now.

I then used this code to test which machines and implementations of Erlang were the fastest and got some interesting results.  The code factored 380312393432894324523423103, which I discovered by typing in random numbers to factor.  It has a large factor, 9517107199440611, which makes it a reasonable benchmark for a simple test.  Here are my results:

AMD 64, 4600+, Windows XP -> 15.4 seconds
MacPro, 2.66 Xeon, Linux -> 3.7 seconds
MacPro, 2.66 Xeon, MacOSX -> 7.0 seconds
MacPro, 2.66 Xeon, Vista64 -> 7.3 seconds

I only have one OS on the AMD, but it would be interesting to see if Linux is 2x faster, as it was on the MacPro.  I was surprised that the Xeon chipset in the MacPro was so much faster than the AMD64, but I guess they are more than a year apart in age.

-Edward

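% fac(N) returns the prime factors of N in ascending order, by trial division.
fac(N) -> fac(N,2).
% No candidate F with F*F =< N remains, so N itself is the last factor.
fac(N, F) when F*F > N -> [N];
% F divides N: record it and keep dividing by the same F.
fac(N, F) when N rem F =:= 0 -> [F | fac(N div F, F)];
% After 2, only odd candidates need to be tried.
fac(N, 2) -> fac( N, 3 );
fac(N, F) -> fac(N, F+2).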
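The accumulator variant mentioned in the post can be sketched as follows. This is a Python illustration of the same trial division, building up the list as it goes; it is not a measured alternative to the Erlang version, and the function name is made up.

```python
def factor(n):
    """Prime factors of n (n >= 2) in ascending order, by trial division."""
    factors = []  # the accumulator
    f = 2
    while f * f <= n:
        if n % f == 0:
            factors.append(f)  # f divides n: record it, keep the same f
            n //= f
        elif f == 2:
            f = 3              # after 2, only odd candidates are tried
        else:
            f += 2
    factors.append(n)          # whatever remains is the last prime factor
    return factors


print(factor(360))  # [2, 2, 2, 3, 3, 5]
print(factor(97))   # [97]
```

The running time is dominated by the search up to the square root of the largest remaining factor, so a benchmark number with one large prime factor will be slow in any implementation of this method.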

Functional Programming

Saturday, April 5th, 2008

If you go to the link on ray tracing, you can see some of the images that I did. Simple stuff, but they were fun and interesting to do. The interesting part for me was that all of the ones that I did, I did to teach myself a few new functional programming languages: Erlang, F#, and Haskell.

This is kind of linked to my post about UI. The thing I liked about learning these functional languages is that they made focusing on the problem at hand the point of coding, rather than fighting the language. Don’t get me wrong, I’ve used C/C++ for many, many years and am an expert at them. But if I want to whip out a demo of a new idea, or learn ray tracing, I like to prototype in a higher level language. That used to be Smalltalk back in the early 90s, then Python in the 2000s, but now I’ve switched to these functional ones.

When an idea strikes, I just want to sit down, type away, and make progress towards my goal of seeing my idea on the screen. The problem with C/C++ is that if one doesn’t already have the framework in place for the idea, a lot of time is spent on memory management code, allocating objects, defining APIs, etc. That kind of stuff is great for a performance person, since it gives me a lot of axes along which to tweak code for performance, therefore getting me a lot of performance jobs, but it isn’t so great when I have a weekend and want to explore ideas.

Just like with a bad UI, one can get things done if one has done it before, but with a good UI and a good programming language, getting stuff done for the first time is a lot of fun!

-Edward