December 01, 2007
One of the great things about not having studied computer science formally is that you get to discover interesting problems all by yourself, and then cobble together your own solutions. You can then, if you want, do some research in order to find out, as generally happens to me, that Donald Knuth identified, analyzed, and solved the problem several decades ago. Better than you did, needless to say.
So, shuffling. I was doing a thing with playing cards in software the other day, and got sidetracked (more like stopped short) by the interesting problem of how to "shuffle" the "deck." Obviously it involved random selections and such, but the actual mechanics were not immediately obvious to me.
Here's where the non-CS mind comes into play. My first thought was to generate an unshuffled deck as an array-like structure -- all cards in order by suit. Then I'd create a second array-like structure. I'd walk through each card in the unshuffled deck, pick a random number, and insert the card at the randomly selected spot in the second array. If the randomly chosen position in the second array was already occupied, I'd choose another random number, see if it was used, and so on until the random selection happened to land on a free spot. I'll call this the Random Insert approach.
But this seemed slightly fishy to me. For the first some number of cards, the random selection would not collide (much) with spots already filled. But as the second array got more and more filled, it would take the random-selection loop longer and longer to stumble across open spots in the array. Good heavens, how many random selections would it take to fill that last slot? There is, I'm sure, a mathematical way to express this ever-fewer-options issue. Not knowing what that is, I indulged a notion that this could get into squares and cubes of time and, who knows, bring the computer to a halt.
Before I even tried this, I started thinking of alternatives. One thought I had was to find random buckets for, say, the first 42 cards in the unshuffled deck. When I got to the last 10 cards, I'd stop randomly selecting slots in the second array. Instead, I'd just walk up that array and stuff the last 10 cards into whatever 10 spots happened to not be filled yet. That seemed sort of quasi-random to me, without the problem of the randomizer spinning and spinning quite so much to find a free spot toward the end of the process. This would be the Blended Random Insert approach.
I also thought that another solution would be to create the second array as a huge array, maybe 5 times as large as the deck. In that case, the randomizer wouldn't run into nearly as many collisions and would find spots for all the cards relatively easily. However, after all the cards had new homes, this would require crunching down this sparse array and discarding the 4 x 52 unfilled slots. I'll call this the Big Array approach.
Or perhaps another approach altogether. Rather than randomly finding new spots in the shuffled array, I could randomly select cards in the unshuffled deck, and then simply add them one by one to the shuffled array. To avoid the problem of potentially selecting a card again, I would simply chop that card out of the array. So the first time, I'd choose among 52 chards. The second time, among 51 items. The third time, among 50 items, and so on, until all the cards had been randomly selected and moved over to the shuffled array. I'll call this the Array Chop approach.
What the heck, I tried three of them. (I didn't try the Blended Random Insert approach.) You can see them here. Source for the shuffle-y part is here, so you can tell me where I'm going wrong.
Update I added Julian's "Array Swap" approach as a fourth experiment -- see comments.
As an experiment, I shuffled each deck five times. To do that, I took the output of one shuffle and fed it back into another shuffle. Might this result in additional randomess? Kinda hard to tell from looking at the results.
To time these, I used a "high-performance" timer that was recommended by Scott Hanselman. I don't care about units, just the comparative values.
Speaking of which. What I find is that the difference between the
three four approaches is almost negligible. The first time or two that you run the shuffle, the process can take "a long time" (some small fraction of a second). Subsequent tries go down by a factor of three; clearly something is being initialized the first time or something. If there is any of the approaches that's slower, it's the Random Insert method; more accurately, it's the algorithm whose timing varies the most across multiple tests.
Anyway, just for fun. At some point I need to get back to whatever it was that I needed shuffling for.