16 March 2008

Special Combined Sunday Judgment/Hacker Within: March Madness

Today's subject: the NCAA men's basketball tournament.

First, a couple of confessions.

(1) It's a bit of a stretch to throw the Hacker Within label on this post, but I couldn't pass up an opportunity to integrate my two regular features in one post.

(2) Since I have picks to make, and also a fair number of goals for genuine productivity during this week (which is spring break at UW-Madison), I'm totally half-assing this post with respect to both the commitment-to-quality-science-writing and commitment-to-meaningful-integration-of-technical-and-non-technical-material ambitions of this blog. Thus, the whole integration thing in (1) is also a stretch.

So here we go. First, Sunday Judgment. If you've been watching SportsCenter or listening to any sports talk radio this week, you know that the trendiest game in NCAA basketball journalism is coming up with clever synonyms for bubble team. (See especially this week's Mike Tirico Show.) Not surprisingly, this trend makes for good radio and plenty of fun armchair etymology (or maybe reverse-etymology?).

However, this seemingly harmless game reminded me of a serious problem that mars a lot of college writing. Call it synonymic hyper-proliferation. Or restless diction. Or mythesaurus rex.

Actually, please call it only one of those things.

I can't tell you how many student writers feel pressured to substitute synonyms when repeatedly referring to an important theoretical construct or technical term. Of course, variety is an important attribute of all good writing, and it's often a bad idea to use the same word twice in one sentence, in the same position in subsequent sentences, etc. But, more often than not, if you're writing a paper about, say, disciplinary matrices, it's a mistake to give in to the urge to come up with a million different ways of saying disciplinary matrix. These precise terms come about for a reason; don't feel pressured to over-substitute.

OK, on to some quasi-Hacker Within material. For reasons of, well, basically realizing it was a huge waste of time, I've abandoned draft.gms, my probably futile attempt to turn my fantasy baseball draft into a huge assignment-problem-like GAMS model. Naturally, I've re-channeled my silly interest in applying mathematical programming to, say, sporting events and dice games in bars, and so I wanted to point you in the direction of resources for using the power of science to make better March Madness picks.

You may have heard last year about some professors at Georgia Tech who published a paper in Naval Research Logistics called "A logistic regression/Markov chain model for NCAA basketball." If memory serves, the UW-Madison libraries don't carry this one, but it looks like Kvam and Sokol have posted a manuscript of the paper here. There's also a kinda funny "powerpoint style equivalent" to the non-mathematical summary they wrote, presumably for all the media (I heard about it via some ESPN article last year that also included an "insider look" at how the oddsmakers go about their business). I haven't read the whole thing, but even checking out the first few pages gives you an appreciation for their methodology. If you need some help with Markov chains (I certainly did), this AMS primer is pretty comprehensible.
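If the phrase "Markov chain" is still fuzzy, here's a toy sketch of the general idea. To be clear, this is not the LRMC model: the three teams and every transition probability below are made up, and I'm only assuming the usual way of reading such a chain, which is to rank teams by their long-run (stationary) share. Picture a fickle fan who backs one team at a time and switches allegiance with the given probabilities; the team the fan spends the most time backing comes out on top.

```python
def stationary_distribution(P, iterations=1000):
    """Approximate the stationary distribution of a row-stochastic
    matrix P by repeatedly applying pi <- pi * P (power iteration)."""
    n = len(P)
    pi = [1.0 / n] * n  # start with the fan equally likely to back anyone
    for _ in range(iterations):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi


# Hypothetical teams and invented probabilities (illustration only).
# P[i][j] = chance a fan currently backing team i backs team j next.
TEAMS = ["A", "B", "C"]
P = [
    [0.8, 0.1, 0.1],
    [0.3, 0.6, 0.1],
    [0.4, 0.3, 0.3],
]

pi = stationary_distribution(P)

# Rank teams by long-run share, largest first.
ranking = [team for team, _ in sorted(zip(TEAMS, pi), key=lambda t: -t[1])]

for team, share in zip(TEAMS, pi):
    print(f"{team}: {share:.3f}")
```

The real work in any model like this is in choosing the transition probabilities from game results, which is the part the toy skips entirely.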

Not interested in Markov chains? No problem. Profs. Kvam and Sokol make the output of their model, applied to this year's game results, available here. You can choose between three versions of the model that take one of the types of input data, margin of victory (MOV), into account in various ways. Not surprisingly, the "pure" strategy (which doesn't cap the contribution of MOV) is best. Nevertheless, the selection committee, which at least last year had access to the LRMC, won't use tools that consider MOV (for sportsmanship reasons, presumably). Anyway, if you're interested in trying this, just choose a model and use the rankings to pick each match-up. And remember: the pure LRMC is the most successful systematic ranking system available.
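For the programmatically inclined, "use the rankings to pick each match-up" can be sketched in a few lines. The ranking here is the pure LRMC top eight quoted later in this post; the bracket order is hypothetical (it is not the real 2008 draw), and the function names are mine, not anything from Kvam and Sokol.

```python
def pick_winner(team_a, team_b, ranking):
    """Pick whichever team sits higher (earlier) in the ranking list."""
    position = {team: i for i, team in enumerate(ranking)}
    return team_a if position[team_a] < position[team_b] else team_b


def play_bracket(field, ranking):
    """Advance the higher-ranked team in each adjacent pairing until one
    team is left. `field` is in bracket order and its length must be a
    power of two. Returns the list of winners for each round."""
    rounds = []
    while len(field) > 1:
        field = [
            pick_winner(field[i], field[i + 1], ranking)
            for i in range(0, len(field), 2)
        ]
        rounds.append(field)
    return rounds


# Pure LRMC top eight as listed in this post, best first.
PURE_RANKING = [
    "Kansas", "Memphis", "UCLA", "Duke",
    "North Carolina", "Tennessee", "Wisconsin", "Clemson",
]

# Hypothetical eight-team bracket order, for illustration only.
FIELD = [
    "Kansas", "Clemson", "Duke", "North Carolina",
    "Memphis", "Wisconsin", "UCLA", "Tennessee",
]

rounds = play_bracket(FIELD, PURE_RANKING)
for number, winners in enumerate(rounds, start=1):
    print(f"Round {number}: {', '.join(winners)}")
```

Obviously this never picks an upset, which is exactly the point: you take the ranking at its word for every game.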

Couple of thoughts:

(1) Check out the top eight teams for each of the three rating schemes--
Pure: Kansas, Memphis, UCLA, Duke, North Carolina, Tennessee, Wisconsin, Clemson
Capped MOV: Kansas, Memphis, Duke, UCLA, Tennessee, North Carolina, Wisconsin, Stanford
No MOV: North Carolina, UCLA, Memphis, Duke, Tennessee, Kansas, Texas, Wisconsin

This is kinda neat. You can see the effects of all those close games North Carolina won reflected in their placement under each model. Are they actually overrated? I dunno, but I like the sound of it.

(2) See, Wisconsin shoulda been a two seed, no matter how you look at it (or, rather, no matter how these two industrial engineers looked at it).

(3) Pull a Joe Morgan if you like, either by criticizing the very idea of using stats to predict this stuff or by gloating when the inevitabilities of statistical randomness play out and the models break down from time to time. But please don't say that this kind of analysis sucks the life out of playing or watching these games. That mentality totally misses the point. Of course we shouldn't reduce sports to mechanical calculations. Of course what's really exciting is watching people overcome their mathematical destiny and do something special. Of course these methods overlook all kinds of intangibles.

But the point of doing brackets is to get the most picks right, right? There are all kinds of arguments against using the LRMC method to make your picks, but wanting to give yourself the best chance to win isn't one of them.

Kvam and Sokol humorously observe, "With so much money on the line, a model that predicts outcomes more effectively than standard ranking and rating systems can be useful." I myself am risking a total of two beers, so I'm going to let pride cloud my scientific judgment and tweak the pure LRMC rankings a bit. To my detriment, no doubt.
