Data Mining in MATLAB

Friday, January 14, 2022

Implied Probabilities

Oddsmakers provide a useful prediction mechanism for many subjects of interest. Beyond sports, they host prediction markets for events in politics, entertainment, current events and other fields. There are some subtleties, however, in converting payout odds to probabilities. Note that that payout odds (also called payoff odds or house odds) are expressed in this article as fractional odds (there are several other popular formats used in casinos and on-line bookmakers, such as decimal odds and American odds: they simply indicate the same information a different way).

Payout Odds

It is important to understand the specific nature of payout odds, which are not the same as odds taught in statistics courses. Payout odds indicate the schedule of payment for win or loss at the conclusion of a wager. Typically, payout odds overpredict the probability of a specific outcome. Think about it this way: The rarer the event the player successfully predicts, the more the house will need to pay out. Hence, payout odds tend to be upwardly biased.

If the oddsmaker believes that the probability of an event is 50%, then a bet which does not favor the house nor the player would be 1/1: The house puts up $1 and the player puts up $1. To make such a situation work to the oddsmaker's advantage, the payout odds might be set by the house to, for instance, 5/6: the house puts up only $5 while the player puts up $6 . The player in this case is expected to lose an average of $0.50 dollars: Half the time, the player leaves with $11, and the other half of the time the player leaves with nothing: an average payout of $5.50 after putting up $6.

Implied Probability

Payout odds can be converted to implied probabilities by doing some quick arithmetic: the player's wager is divided by the total wager (the house wager plus the player's wager). Continuing with the hypothetical payout odds of 5/6, the implied probability is 0.54 = 6 / (5 + 6). The extra 0.04 versus the oddsmaker's prediction of 0.5 is the statistical bias of this implied probability.

Mapping to the Probability Continuum

Interestingly, it is normally the case that the sum of the implied probabilities for real payout odds exceeds 1.0. This is because the oddsmaker generally inflates each of the payout odds (to make them all in the house's favor). One common method for dealing with this is to divide each of the implied probabilities by their total.

For an illustration using real payout odds, consider an actual wager on American politics offered on Jan-14-2022 at Bovada ( https://www.bovada.lv/ ), an on-line gambling house. The listed wager is: "Which Party Will Control The Senate After The 2022 Midterm Election?", and the listed outcomes are "Republican" with payout odds of 2/5 and "Democratic" with payout odds of 37/20. The implied probability of Republican control of the Senate is 0.71 = 5 / (2 + 5). The implied probability of Democrat control is 0.35 = 20 / (37 + 20). Note, significantly, that 0.71 + 0.35 > 1. To map these implied probabilities together into the probability space, one divides by their total, 1.06. Thus, using payout odds as our model, the estimated probability of Republican control is 0.67 = 0.71 / 1.06, and the estimated probability of Democrat control is 0.33 = 0.35 / 1.06.

Important Details

A few important details are worth mentioning:

First, gambling is a product of human behavior. Betting is partially driven by emotion, and gamblers enjoy the same failings as the rest of humanity. This sometimes spoils their aim, and statistically biases predictions of the betting markets. On the positive side, the gain or loss of real money provides a powerful inducement for clear thinking, and in the long run tends to weed out poor predictors. It's one thing for an academic or pundit to casually make a prediction in a paper or on a talk show, it's quite another thing to do so when one's own money is on the line!

Second, payout odds are arguably driven by two forces: 1. the wisdom of crowds and 2. the desire of the house to minimize risk. The first force works in favor of statistically unbiased estimates, whereas the second one may bias payout odds. It has been suggested that oddsmakers will set odds to balance wagers on competing outcomes. Despite making less money per wager, this would ensure that "the losers pay the winners", avoiding situations in which the house pays out painfully on its own poor predictions.

Third, payout odds may vary from one casino or on-line house to another, but market forces (and the potential for arbitrage) tend to keep them fairly well aligned.

Last, commercial gambling houses tend to be fairly specific in the terms of their proposed wagers, regarding times, conditions, etc. In the real example given above, in the event that neither political party wins control (there is a 50/50 split), the wager will include a stipulation that the bet is cancelled, or the party of the Vice President breaks the tie or some other such specific resolution is used.

Note, too, the exact meaning of such wagers. In the real example related here, the 0.67 represents the probability that the Republican party will control (have more than 50% of the membership of) the Senate. It does not indicate the percent of Senate seats which the Republicans will fill.

Conclusion

Gambling markets are, at their foundation, information markets. Participants are powerfully motivated to deliver competent analysis. By aggregating such predictions, oddsmakers provide a competitive alternative to more traditional predictors, such as pollsters, forecasters and similar experts. Additionally, the events whose outcomes are anticipated by gambling houses are often important yet statistically awkward: high importance, low frequency events that often involve special, one-time circumstances which can stymie other prediction mechanisms.

Friday, April 21, 2017

A New Dawn for Local Learning Methods?

The relentless improvement in speed of computers continues. While some technical barriers to this progress have begun to emerge, exploitation of parallelism has actually increased the rate of acceleration for many purposes, especially in applied mathematical fields such as data mining.

Interestingly, new, powerful hardware has been put to the task of running ever more baroque algorithms. Feedforward neural networks, once trained over several days, now train in minutes on affordable desktop hardware. Over time, ever fancier algorithms have been fed to these machines: boosting, support vector machines, random forests and, most recently, deep learning illustrate this trend.

Another class of learning algorithms may also benefit from developments in hardware: local learning methods. Typical of local methods are radial basis function (RBF) neural networks and k-nearest neighbors (k-NN). RBF neural networks were briefly popular in the heyday of neural networks (the 1990s) since they train much faster than the more popular feedforward neural networks. k-NN is often discussed in chapter 1 of machine learning books: it is conceptually simple, easy to implement and demonstrates the advantages and disadvantages of local techniques well.

Local learning techniques usually have a large number of components, each of which handles only a small fraction of the set of possible input cases. The nice thing about this approach is that these local components largely do not need to coordinate with each other: The complexity of the model comes from having a large number of such components to handle many different situations. Local learning techniques thus make training easy: In the case of k-NN, one simply stores the training data for future reference. Little, if any, "fitting" is done during learning. This gift comes with a price, though: Local learning systems train very quickly, but model execution is often rather slow. This is because local models will either fire all of those local components, or spend time figuring out which among them applies to any given situation.

Local learning methods have largely fallen out of favor since: 1. they are slow to predict outcomes for new cases and, secondarily, 2. their implementation requires retention of some or all of the training data, and 2. This author wonders whether contemporary computer hardware may not present an opportunity for a resurgence among local methods. Local methods often perform well statistically, and would help diversify model ensembles for users of more popular learning algoprithms. Analysts looking for that last measure of improvement might be well served by investigating this class of solutions. Local algorithms are among the easiest to code from scratch. Interested readers are directed to "Lazy Learning", edited by D. Aha (ISBN-13: 978-0792345848) and "Nearest Neighbor Norms: NN Pattern Classification Techniques", edited by B. Dasarathy (ISBN-13: 978-0818689307).

Wednesday, March 29, 2017

Data Analytics Summit III at Harrisburg University of Science and Technology

Harrisburg University of Science and Technology (Harrisburg, Pennsylvania) has just finished hosting Data Analytics Summit III. This is a multi-day event featuring a mix of presenters from the private sector, the government/government-related businesses and academia which spans research, practice and more visionary ("big picture") topics. The theme was “Analytics Applied: Case Studies, Measuring Impact, and Communicating Results".

Regrettably, I was unable to attend this time because I was traveling for business, but I was at Data Analytics Summit II, which was held in December of 2015. If you haven't been: Harrisburg University of Science and Technology does a nice job hosting this event. Additionally, (so far) the Data Analytics Summit has been free of charge, so there is the prospect of free food if you are a starving grad student.

The university has generously provided links to video of the presentations from the most recent Summit:

http://analyticssummit.harrisburgu.edu/

Video links for the previous Summit, whose theme was unstructured data can be found at the bottom of my article, "Unstructured Data Mining - A Primer" (Apr-11-2016) over on icrunchdata:

https://icrunchdata.com/unstructured-data-mining-primer/

I encourage readers to explore this free resource.

Friday, March 17, 2017

Geographic Distances: A Quick Trip Around the Great Circle

Recently, I wanted to calculate the distance between locations on the Earth. Finding a handy solution, I thought readers might be interested. In my situation, location data included ZIP codes (American postal codes). Also available to me is a look-up table of the latitude and longitude of the geometric centroid of each ZIP code. Since the areas identified by ZIP codes are usually geographical small, and making the "close enough" assumption that this planet is perfectly spherical, trigonometry will allow distance calculations which are, for most purposes, precise enough.

Given the latitude and longitude of cities 'A' and 'B', the following line of MATLAB code will calculate the distance between the two coordinates "as the crow flies" (technically, the "great circle distance"), in kilometers:

DistanceKilometers = round(111.12 * acosd(cosd(LongA - LongB) * cosd(LatA) * cosd(LatB) + sind(LatA) * sind(LatB)));

Note that latitude and longitude are expected as decimal degrees. If your data is in degrees/minutes/seconds, a quick conversion will be needed.

I've checked this formula against a second source and quickly verified it using a few pairs of cities:

% 'A' = New York
% 'B' = Atlanta
% Random on-line reference: 1202km
LatA = 40.664274;
LongA = -73.9385;
LatB = 33.762909;
LongB = -84.422675;
DistanceKilometers = round(111.12 * acosd(cosd(LongA - LongB) * cosd(LatA) * cosd(LatB) + sind(LatA) * sind(LatB)))

DistanceKilometers =

1202

% 'A' = New York
% 'B' = Los Angeles
% Random on-line reference: 3940km (less than 0.5% difference)<0 .5="" br="" difference="">
LatA = 40.664274;
LongA = -73.9385;
LatB = 34.019394;
LongB = -118.410825;
DistanceKilometers = round(111.12 * acosd(cosd(LongA - LongB) * cosd(LatA) * cosd(LatB) + sind(LatA) * sind(LatB)))

DistanceKilometers =

3955

References:

"How Far is Berlin?", by Alan Zeichick, published in the Sep-1991 issue of "Computer Language" magazine. Note that Zeichick credits as his source an HP-27 scientific calculator, from which he reverse-engineered the formula above.

"Trigonometry DeMYSTiFieD, 2nd edition", by Stan Gibilisco (ISBN: 978-0-07-178024-7)

Friday, March 18, 2016

Four Books Worth Owning

Below are listed four books on statistics which I feel are worth owning. They largely take a "traditional" statistics perspective, as opposed to a machine learning/data mining one. With the exception of "The Statistical Sleuth", these are less like textbooks than guide-books, with information reflecting the experience and practical advice of their respective authors. Comparatively few of their pages are devoted to predictive modeling- rather, they cover a host of topics relevant to the statistical analyst: sample size determination, hypothesis testing, assumptions, sampling technique, etc.

I have not given ISBNs since they vary by edition. Older editions of any of these will be valuable, so consider saving money by skipping the latest edition.

A final clarification: I am not giving a blanket endorsement to any of these books. I completely disagree with a few of their ideas. I see the value of such books in their use as "paper advisers", with different backgrounds and perspectives than my own.