Friday, March 18, 2016

Four Books Worth Owning

Below are listed four books on statistics which I feel are worth owning. They largely take a "traditional" statistics perspective, as opposed to a machine learning/data mining one. With the exception of "The Statistical Sleuth", these are less like textbooks than guide-books, with information reflecting the experience and practical advice of their respective authors. Comparatively few of their pages are devoted to predictive modeling; rather, they cover a host of topics relevant to the statistical analyst: sample size determination, hypothesis testing, assumptions, sampling technique, etc.

I have not given ISBNs since they vary by edition. Older editions of any of these remain valuable, so consider saving money by skipping the latest edition.

A final clarification: I am not giving a blanket endorsement to any of these books. I completely disagree with a few of their ideas. I see the value of such books in their use as "paper advisers", with different backgrounds and perspectives than my own.

Thursday, March 10, 2016

Re-Coding Categorical Variables

Categorical variables as candidate predictors pose a distinct challenge to the analyst, especially when they exhibit high cardinality (a large number of distinct values). Numerical models (for instance linear regression and most neural networks) cannot accept these variables directly as inputs, since operations between categories and numbers are not defined.

It is sometimes advantageous (even necessary) to re-code such variables as one or more numeric dummy variables, with each new variable containing a 0 or 1 value indicating the presence (1) or absence (0) of one distinct value. This often works well with very small numbers of distinct values. However, as cardinality increases, dummy variables quickly become inconvenient: they add potentially many parameters to the model, while reducing the average number of observations available for fitting each of those parameters.
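To make the mechanics concrete, here is a small sketch of dummy coding in Python with pandas (the variable names are invented for illustration):

```python
import pandas as pd

# A small categorical variable with three distinct values
colors = pd.Series(["red", "green", "blue", "green"])

# One dummy column per distinct value: presence is marked 1/True, absence 0/False
dummies = pd.get_dummies(colors, prefix="color")
print(dummies)
```

With three distinct values this yields three new columns; a categorical variable with hundreds of distinct values would yield hundreds, which is exactly the inconvenience described above.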

One solution is to convert each individual categorical variable to a new, numeric variable representing a local summary (typically the mean) of the target variable. An example would be replacing the categorical variable State with a numeric variable StateNumeric, representing the mean of the target variable within each state.

Unfortunately, there are often insufficient observations to reliably calculate summaries for all possible categories. To overcome this limitation, one might select either the n most frequent categories, or the n categories with the largest and smallest means. Though both of those techniques sometimes work well, they are, in reality, crude attempts to discover the categories with the most significant differences from the global mean.

A more robust method which I've found useful is to select only those categories whose mean target values are at least a minimum multiple of standard errors away from the global mean. The detailed procedure is as follows:

1. Calculate the global mean (the mean of the target variable, across all observations)
2. For each category…
  • a. Calculate the local mean and local standard error of the mean for the target variable (“local” meaning just for the observations in this category).
  • b. Calculate the absolute value of the difference between the local mean and global mean, and divide it by the local standard error.
  • c. If the calculation from b is greater than some chosen threshold (I usually use 3.0, but this may be adjusted to suit the context of the problem), then this category is replaced in the new variable by its local mean.
3. All categories not chosen in 2c to get their own value are collected in an “other” category, and all such observations are assigned the mean of the target variable for the “other” group.

The threshold can be judgmentally adjusted to suit the need of the problem, but I normally stay within the arbitrary range of 2.0 to 4.0. Missing values can be treated as their own category for the purposes of this algorithm.
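A sketch of this procedure in Python follows. The function name recode_categorical and its interface are my own invention for illustration, not from any library; pandas is assumed:

```python
import numpy as np
import pandas as pd

def recode_categorical(cats, target, threshold=3.0):
    """Re-code a categorical variable as the local mean of the target,
    keeping only categories whose local mean sits at least `threshold`
    standard errors from the global mean; all remaining categories are
    pooled into an 'other' group and share that group's mean."""
    df = pd.DataFrame({"cat": pd.Series(cats).fillna("__missing__"),
                       "y": np.asarray(target, dtype=float)})
    global_mean = df["y"].mean()
    mapping = {}                                # category -> its own local mean
    pooled = pd.Series(False, index=df.index)   # rows falling into 'other'
    for cat, grp in df.groupby("cat"):
        local_mean = grp["y"].mean()
        se = grp["y"].std(ddof=1) / np.sqrt(len(grp))  # local standard error
        if se > 0 and abs(local_mean - global_mean) / se > threshold:
            mapping[cat] = local_mean           # far enough out: keep its own mean
        else:
            pooled |= (df["cat"] == cat)        # too close (or too sparse): pool it
    other_mean = df.loc[pooled, "y"].mean()     # mean target of the 'other' group
    return df["cat"].map(lambda c: mapping.get(c, other_mean)).to_numpy()
```

Note that missing values are simply treated as one more category, as described above, and single-observation categories (whose standard error is undefined) fall into the pool.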

Note that this procedure seeks the accumulation of two kinds of evidence that categories are significantly different from the global mean: category frequency and difference between category local mean and the global mean. Some categories may exhibit one and not the other. A frequent category may not get its own numeric value in the new variable in this procedure, if its local mean is too close to the global mean. At the same time, less frequent categories might be assigned their own numeric value, if their local mean is far enough away from the global mean. Notice that this procedure does not establish a single, minimum distance from the global mean for values to be broken away from the herd.

Though using a multiple of the standard error is only an approximation to a rigorous significance test, it is close enough for this purpose. Despite some limitations, this technique is very quick to calculate and I’ve found that it works well in practice: The number of variables remains the same (one new variable for each raw one) and missing values are handled cleanly.

Quick Refresher on Standard Errors

For numeric variables, the standard error of the mean of a sample of values is the standard deviation of those values divided by the square root of the number of values. In MATLAB, this would be:

StandardError = std(MyData) / sqrt(length(MyData))

For binary classification (0/1) variables, the standard error of the mean of a sample of values is the square root of the following quantity: the mean, times one minus the mean, divided by the number of values. In MATLAB, this is:

p = mean(MyData); StandardError = sqrt( p * (1 - p) / length(MyData) )
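For readers without MATLAB, the same two calculations in Python with NumPy (the sample data is made up):

```python
import numpy as np

# Numeric variable: standard error of the mean is the sample standard
# deviation divided by the square root of the number of values
MyData = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
se_numeric = MyData.std(ddof=1) / np.sqrt(len(MyData))

# Binary (0/1) variable: standard error is sqrt( p * (1 - p) / n )
Flags = np.array([0, 1, 1, 0, 1, 1, 1, 0])
p = Flags.mean()
se_binary = np.sqrt(p * (1 - p) / len(Flags))
```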

Final Thoughts

Data preparation is an important part of this work, often being identified as the part of the process which takes the most time. This phase of the work offers tremendous opportunity for the analyst to be creative, and for the entire process to wring the most information out of the raw data.

This is why it is so strange that so little published material from experts is dedicated to it. Except in specialized fields, such as signal and image processing, one generally only finds bits of information on data preparation here and there in the literature. One exception is Dorian Pyle's book, "Data Preparation for Data Mining" (ISBN-13: 978-1558605299), which is entirely focused on this issue.

Monday, March 07, 2016

MATLAB as (Near-)PseudoCode

In "Teaching Data Science in English (Not in Math)", the Feb-08-2016 entry of his Web log, "The Datatist", Charles Givre criticizes the use of specialized math symbols (capital sigma for summation, etc.) and Greek letters as being confusing, especially to newcomers to the field. He offers, as an example, the following, traditional definition of sum of squared errors:
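The formula in his example was displayed as an image, which has not survived here; reconstructed from his criticisms of it (it included a redundant middle step and spelled out the linear model), it reads approximately:

$$\mathrm{RSS} = \sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2 = \sum_{i=1}^{n} \left(y_i - (\beta_0 + \beta_1 x_i)\right)^2$$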

He suggests that "English" (pseudo-code) be used instead, such as the following:

residuals_squared = (actual_values - predictions) ^ 2
RSS = sum( residuals_squared )

Although there are some flaws in this particular comparison (1. the first example includes an unnecessary middle portion, 2. it also features the linear model definition, while the pseudo-code example does not, and 3. the indexing variable in the summation is straightforward in this case, but is not always so), I tend to agree. Despite my own familiarity with the math jargon, in many cases pseudo-code is easier to understand.

Pseudo-code is certainly simpler syntactically since it tends to make heavy use of (sometimes deeply-nested) functions, as opposed to floating subscripts, superscripts and baroque symbols. Pseudo-code often employs real names for parameters, variables and constants, as opposed to letters. It is also closer to computer code that one actually writes, which is the point of this post.

Given the convention in MATLAB to store variables as column vectors, here is my MATLAB implementation of Givre's pseudo-code:

residuals_squared = (actual_values - predictions) .^ 2;
RSS = sum( residuals_squared );

Only two minor changes were needed: 1. The MATLAB .^ operator raises each individual element of an input matrix to a power, while the MATLAB ^ operator raises the entire matrix to a power (through matrix multiplication). 2. The semicolons suppress echoing the result of each calculation to the console: they are not strictly necessary, but are typically included in MATLAB programs. Note that, unlike in some object-oriented languages we could name, no extra code is needed to lay the groundwork for overloaded operators.

Consider an example I have given in the past, for the forward calculation of a feedforward neural network using a hyperbolic tangent transfer function:

HiddenLayer = tanh(Input * HiddenWeights);
OutputLayer = tanh(HiddenLayer * OutputWeights);

Note that this code recalls the neural network over all observations stored in the matrix Input. That's it: two lines of code, without looping, which execute efficiently (quickly) in MATLAB. The addition of bias nodes, jump connections and scaling of the output layer would require more code, but even such changes would leave this code easy to comprehend.
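For comparison, here is the same two-line forward pass in Python with NumPy; the shapes (5 observations, 3 inputs, 4 hidden nodes, 1 output) and the random weights are arbitrary placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
Input = rng.normal(size=(5, 3))            # 5 observations, 3 input variables
HiddenWeights = rng.normal(size=(3, 4))    # weights: 3 inputs -> 4 hidden nodes
OutputWeights = rng.normal(size=(4, 1))    # weights: 4 hidden nodes -> 1 output

# Forward pass over all observations at once, with no looping
HiddenLayer = np.tanh(Input @ HiddenWeights)
OutputLayer = np.tanh(HiddenLayer @ OutputWeights)
```

The @ operator plays the role of MATLAB's matrix multiplication, and np.tanh, like MATLAB's tanh, applies element-wise.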

Saturday, April 12, 2014

Probability: A Halloween Puzzle


Though Halloween is months away, I found the following interesting and thought readers might enjoy examining my solution.

Recently, I was given the following probability question to answer:

Halloween Probability Puzzler

The number of trick-or-treaters knocking on my door in any five minute interval between 6 and 8pm on Halloween night is distributed as a Poisson with a mean of 5 (ignoring time effects). The number of pieces of candy taken by each child, in addition to the expected one piece per child, is distributed as a Poisson with a mean of 1. What is the minimum number of pieces of candy that I should have on hand so that I have only a 5% probability of running out?

I am not aware of the author of this puzzle nor its origin. If anyone cares to identify either, I'd be glad to give credit where it's due.


I interpret the phrase "ignoring time effects", above, to mean that there is no correlation among arrivals over time; in other words, the count of trick-or-treater arrivals during any five minute interval is statistically independent of the counts in the other intervals.

So, the described model of trick-or-treater behaviors is that they arrive in random numbers during each time interval, and each take a single piece of candy plus a random amount more. The problem, then, becomes finding the 95th percentile of the distribution of candy taken during the entire evening, since that's the amount beyond which we'd run out of candy 5% of the time.

Solution Idea 1 (A Dead End)

The most direct method of solving this problem, accounting for all possible outcomes, seemed hopelessly complex to me. Trying to match each possible number of pieces of candy taken over so many arrivals over so many intervals results in an astronomical number of combinations. Suspecting that someone with more expertise in combinatorics than myself might see a way through that mess, I quickly gave up on that idea and fell back to something I know better: the computer (leading to solution idea 2... ).

Solution Idea 2

The next thing which occurred to me was Monte Carlo simulation, which is a way of solving problems by approximating probability distributions using many random samples from those distributions. Sometimes very many samples are required, but contemporary computer hardware easily accommodates this need (at least for this problem). With a bit of code, I could approximate the various distributions in this problem and their interactions, and have plenty of candy come October 31.

While Monte Carlo simulation and its variants are used to solve very complex real-world problems, the basic idea is very simple: draw samples from the indicated distributions, have them interact as they do in real life, and accumulate the results. Using the poissrnd() function provided in the MATLAB Statistics Toolbox, that is exactly what I did:

% HalloweenPuzzler
% Program to solve the Halloween Probability Puzzler by the Monte Carlo method
% by Will Dwinnell
% Note: Requires the poissrnd() and prctile() functions from the Statistics Toolbox
% Last modified: Apr-02-2014

% Parameter
B = 4e5;   % How many times should we run the nightly simulation?

% Initialize
S = zeros(B,1);  % Vector of nightly simulation totals

% Loop over simulated Halloween evenings
for i = 1:B
  % Loop over 5-minute time intervals for this night
  for j = 1:24
    % Determine the number of arrivals in this interval
    Arrivals = poissrnd(5);
    % Are there any?
    if (Arrivals > 0)
      % ...yes, so process them.
      % Loop over this interval's trick-or-treaters
      for k = 1:Arrivals
        % Add this trick-or-treater's candy to the total
        S(i) = S(i) + 1 + poissrnd(1);
      end
    end
  end
end

% Determine the 95th percentile of our simulated nightly totals for the answer
disp(['You need ' int2str(prctile(S,95)) ' pieces of candy.'])

% Graph the distribution of nightly totals
hist(S,50)
grid on
title({'Halloween Probability Puzzler: Monte Carlo Solution',[int2str(B) ' Replicates']})
xlabel('Candy (pieces/night)')
ylabel('Frequency')

Fair warning: This program takes a long time to run (hours, if run on a slow machine).

Hopefully the function of this simulation is adequately explained in the comments. To simulate a single Halloween night, the program loops over all time periods (there are 24 intervals, each five minutes long, between 6pm and 8pm), then over all arrivals (if any) within each time period. Inside the innermost loop, the number of pieces of candy taken by the given trick-or-treater is generated. The outermost loop governs the multiple executions of the nightly simulation.

The poissrnd() function generates random values from the Poisson distribution (which are always integers, in case you're fuzzy on the Poisson distribution). Its lone parameter is the mean of the Poisson distribution in question. A graph is generated at the end, displaying the simulated distribution for an entire night.
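For readers without MATLAB, a vectorized Python version of the same simulation is sketched below. It exploits the fact that the sum of N independent Poisson(1) draws is itself Poisson-distributed with mean N, so no explicit looping is needed; it should land within a piece or two of the exact answer, 281, depending on the random seed:

```python
import numpy as np

rng = np.random.default_rng(0)
B = 400_000                                  # simulated Halloween nights

# Arrivals per night: 24 five-minute intervals, each Poisson with mean 5,
# so the nightly total of trick-or-treaters is Poisson with mean 120
children = rng.poisson(5 * 24, size=B)

# Extra candy per night: the sum of N independent Poisson(1) draws is Poisson(N)
extra = rng.poisson(children)

# One guaranteed piece per child, plus the extras
totals = children + extra

answer = int(np.percentile(totals, 95))
print("You need", answer, "pieces of candy.")
```

Because it draws two large Poisson samples instead of looping over nights, intervals and children, this version runs in a fraction of a second.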

Important Caveat

Recall my mentioning, several paragraphs above, that Monte Carlo is an approximate method: the more runs through this process, the more closely it mimics the behavior of the probability distributions. My first run used 1e5 (one hundred thousand) replicates, and I warned the person who gave me this puzzle that my answer, 282, was quite possibly off from the correct one "by a piece of candy or two". Indeed, I was off by one piece. Notice that the program listed above now employs 4e5 (four hundred thousand) replicates, which yielded the exactly correct answer, 281, the first time I ran it.

Saturday, November 23, 2013

Reading lately (Nov-2013)

I read a great deal of technical literature and highly recommend the same for anyone interested in developing their skill in this field. While many recent publications have proven worthwhile (2011's update of Data Mining by Witten and Frank; and 2010's Fundamentals of Predictive Text Mining, by Weiss, Indurkhya and Zhang are good examples), I confess being less than overwhelmed by many current offerings in the literature. I will not name names, but the first ten entries returned from my search of books at Amazon for "big data" left me unimpressed. While this field has enjoyed popular attention because of a colorful succession of new labels ("big data" merely being the most recent), many current books and articles remain too far down the wrong end of the flash/substance continuum.

For some time, I have been exploring older material, some from as far back as the 1920s. A good example would be a certain group of texts I've discovered from the 1950s and 1960s, a period during which the practical value of statistical analysis had been firmly established in many fields, but long before the arrival of cheap computing machinery. At the time, the techniques which maximized the statistician's productivity were the ones most economical in their use of arithmetic calculations.

I first encountered such techniques in Introduction to Statistical Analysis, by Dixon and Massey (my edition is from 1957). Two chapters, in particular, are pertinent: "Microstatistics" and "Macrostatistics", which deal with very small and very large data sets, respectively (using the definitions of that time). One set of techniques described in this book involves calculating the mean and standard deviation of a set of numbers from a very small subset of their values. For instance, the mean of just five values, the 10th, 30th, 50th, 70th and 90th percentiles, estimates the mean with 93% statistical efficiency.

How is this relevant to today's analysis? Assuming that the data is already sorted (which it often is, for tree induction techniques), extremely large data sets can be summarized with a very small number of operations (4 additions and 1 division, in this case), without appreciably degrading the quality of the estimate.
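A quick Python check of the idea on synthetic, normally distributed data (the sample size and distribution parameters are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.sort(rng.normal(50.0, 10.0, size=1_000_000))   # data arrives pre-sorted

# Grab the observations sitting at the 10th, 30th, 50th, 70th and 90th percentiles
idx = (np.array([10, 30, 50, 70, 90]) / 100.0 * (len(x) - 1)).astype(int)

quick_mean = x[idx].sum() / 5.0                       # 4 additions and 1 division
full_mean = x.mean()                                  # one million additions
print(quick_mean, full_mean)
```

On data like this, the five-value shortcut lands very close to the full mean while touching only five elements of the sorted array.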

Data storage has grown faster than computer processing speed. Today, truly vast collections of data are easily acquired by even tiny organizations. Dealing with such data requires finding methods to accelerate information processing, which is exactly what authors in the 1950s and 1960s were writing about.

Readers may find a book on exactly this subject, Nonparametric and Shortcut Statistics, by Tate and Clelland (my copy is from 1959), to be of interest.

Quite a bit was written during the time period in question regarding statistical techniques which: 1. made economical use of calculation, 2. avoided many of the usual assumptions attached to more popular techniques (normal distributions, etc.) or 3. provided some resistance to outliers.

Genuinely new ideas, naturally, continue to emerge, but don't kid yourself: Much of what is standard practice today was established years ago. There's a book on my shelf, Probabilistic Models, by Springer, Herlihy, Mall and Beggs, published in 1968, which describes in detail what we would today call a naive Bayes regression model. (Indeed, Thomas Bayes' contributions were published in the 1760s.) Claude Shannon described information and entropy- today commonly used in rule- and tree-induction algorithms as well as regression regularization techniques- in the late 1940s. Logistic regression (today's tool of choice in credit scoring and many medical analyses) was initially developed by Pearl and Reed in the 1920s, and the logistic curve itself was used to forecast proportions (not probabilities) after being introduced by Verhulst in the late 1830s.

Ignore the history of our field at your peril.