Wednesday, March 29, 2017

Data Analytics Summit III at Harrisburg University of Science and Technology

Harrisburg University of Science and Technology (Harrisburg, Pennsylvania) has just finished hosting Data Analytics Summit III. This is a multi-day event featuring a mix of presenters from the private sector, the government/government-related businesses and academia which spans research, practice and more visionary ("big picture") topics. The theme was “Analytics Applied:  Case Studies, Measuring Impact, and Communicating Results".

Regrettably, I was unable to attend this time because I was traveling for business, but I was at Data Analytics Summit II, which was held in December of 2015. If you haven't been: Harrisburg University of Science and Technology does a nice job hosting this event. Additionally, (so far) the Data Analytics Summit has been free of charge, so there is the prospect of free food if you are a starving grad student.

The university has generously provided links to video of the presentations from the most recent Summit:

http://analyticssummit.harrisburgu.edu/


Video links for the previous Summit, whose theme was unstructured data can be found at the bottom of my article, "Unstructured Data Mining - A Primer" (Apr-11-2016) over on icrunchdata:

https://icrunchdata.com/unstructured-data-mining-primer/

I encourage readers to explore this free resource.


Friday, March 17, 2017

Geographic Distances: A Quick Trip Around the Great Circle

Recently, I wanted to calculate the distance between locations on the Earth. Finding a handy solution, I thought some reader might be interested. In my situation, location data included ZIP codes (American postal codes). Also available to me is a look-up table of the latitude and longitude of the geometric centroid of each ZIP code. Since the areas identified by ZIP codes are usually geographical small, and making the "close enough" assumption that this planet is perfectly spherical, trigonometry will allow distance calculations which are, for most purposes, precise enough.

Given the latitude and longitude of cities 'A' and 'B', the following line of MATLAB code will calculate the distance between the two coordinates "as the crow flies" (technically, the "great circle distance"), in kilometers:

DistanceKilometers = round(111.12 * acosd(cosd(LongA - LongB) * cosd(LatA) * cosd(LatB) + sind(LatA) * sind(LatB)));

Note that latitude and longitude are expected as decimal degrees. If your data is in degrees/minutes/seconds, a quick conversion will be needed.

I've checked this formula against a second source and quickly verified it using a few pairs of cities:


% 'A' = New York
% 'B' = Atlanta
% Random on-line reference: 1202km
LatA = 40.664274;
LongA =  -73.9385;
LatB = 33.762909;
LongB = -84.422675;
DistanceKilometers = round(111.12 * acosd(cosd(LongA - LongB) * cosd(LatA) * cosd(LatB) + sind(LatA) * sind(LatB)))

DistanceKilometers =

        1202


% 'A' = New York
% 'B' = Los Angeles
% Random on-line reference: 3940km (less than 0.5% difference)<0 .5="" br="" difference="">
LatA = 40.664274;
LongA =  -73.9385;
LatB = 34.019394;
LongB = -118.410825;
DistanceKilometers = round(111.12 * acosd(cosd(LongA - LongB) * cosd(LatA) * cosd(LatB) + sind(LatA) * sind(LatB)))

DistanceKilometers =

        3955



 References:

"How Far is Berlin?", by Alan Zeichick, published in the Sep-1991 issue of "Computer Language" magazine. Note that Zeichick credits as his source an HP-27 scientific calculator, from which  he reverse-engineered the formula above.

"Trigonometry DeMYSTiFieD, 2nd edition", by Stan Gibilisco (ISBN: 978-0-07-178024-7)

Friday, March 18, 2016

Four Books Worth Owning

Below are listed four books on statistics which I feel are worth owning. They largely take a "traditional" statistics perspective, as opposed to a machine learning/data mining one. With the exception of "The Statistical Sleuth", these are less like textbooks than guide-books, with information reflecting the experience and practical advice of their respective authors. Comparatively few of their pages are devoted to predictive modeling- rather, they cover a host of topics relevant to the statistical analyst: sample size determination, hypothesis testing, assumptions, sampling technique,  etc.



I have not given ISBNs since they vary by edition. Older editions of any of these will be valuable, so consider saving money by skipping the latest edition.

A final clarification: I am not giving a blanket endorsement to any of these books. I completely disagree with a few of their ideas. I see the value of such books in their use as "paper advisers", with different backgrounds and perspectives than my own.

Thursday, March 10, 2016

Re-Coding Categorical Variables

Categorical variables as candidate predictors pose a distinct challenge to the analyst, especially when they exhibit high cardinality (a large number of distinct values). Numerical models (for instance linear regression and most neural networks) cannot accept these variables directly as inputs, since operations between categories and numbers are not defined.

It is sometimes advantageous (even necessary) to re-code such variables as one or more numeric dummy variables, with each new variable containing a 0 or 1 value indicating the presence (1) or absence (0) of one distinct value. This often works well with very small numbers of distinct values. However, as cardinality increases, dummy variables quickly become inconvenient: they add potentially many parameters to the model, while reducing the average number of observations available for fitting each of those parameters.

One solution is to convert each individual categorical variable to a new, numeric variable representing a local summary (typically the mean) of the target variable. An example would be replacing the categorical variable, State, with a numeric variable, StateNumeric, representing the mean of the target variable within each state.

Unfortunately, there are often insufficient observations to reliably calculate summaries for all possible categories. To overcome this limitation, one might select either the n most frequent categories, or the n categories with the largest and smallest means. Though both of those techniques sometimes work well, they are, in reality, crude attempts to discover the categories with the most significant differences from the global mean.

A more robust method which I've found useful is to select only those categories whose mean target values are at least a minimum multiple of standard errors away from the global mean. The detailed procedure is as follows:


1. Calculate the global mean (the mean of the target variable, across all observations)
2. For each category…
  • a. Calculate the local mean and local standard error of the mean for the target variable (“local” meaning just for the observations in this category).
  • b. Calculate the absolute value of the difference between the local mean and global mean, and divide it by the local standard error.
  • c. If the calculation from B is greater than some chosen threshold (I usually use 3.0, but this may be adjusted to suit the context of the problem), then this category is replaced in the new variable by its local mean.
3. All variables not chosen in 2c to get their own value are collected in an “other” category, and all such observations are assigned the mean target for the “other” group.



The threshold can be judgmentally adjusted to suit the need of the problem, but I normally stay within the arbitrary range of 2.0 to 4.0. Missing values can be treated as their own category for the purposes of this algorithm.

Note that this procedure seeks the accumulation of two kinds of evidence that categories are significantly different from the global mean: category frequency and difference between category local mean and the global mean. Some categories may exhibit one and not the other. A frequent category may not get its own numeric value in the new variable in this procedure, if its local mean is too close to the global mean. At the same time, less frequent categories might be assigned their own numeric value, if their local mean is far enough away from the global mean. Notice that this procedure does not establish a single, minimum distance from the global mean for values to be broken away from the herd.

Though using a multiple of the standard error is only an approximation to a rigorous significance test, it is close enough for this purpose. Despite some limitations, this technique is very quick to calculate and I’ve found that it works well in practice: The number of variables remains the same (one new variable for each raw one) and missing values are handled cleanly.


Quick Refresher on Standard Errors

For numeric variables, the standard error of the mean of a sample of values is the standard deviation of those values divided by the square root of the number of values. In MATLAB, this would be:

StandardError = std(MyData) / sqrt(n)

For binary classification (0/1) variables, the standard error of a sample of values is the square root of this whole mess: the mean times 1 minus the mean, over the number of values. In MATLAB, this is:

p = mean(MyData); StandardError = sqrt( (p * (1 - p) ) / n)


Final Thoughts

Data preparation is an important part of this work, often being identified as the part of the process which takes the most time. This phase of the work offers tremendous opportunity for the analyst to be creative, and for the entire process to wring the most information out of the raw data.

This is why it is so strange that so little published material from experts is dedicated to it. Except in specialized fields, such as signal and image processing, one generally only finds bits of information on data preparation here and there in the literature. One exception is Dorian Pyle's book, "Data Preparation for Data Mining" (ISBN-13: 978-1558605299), which is entirely focused on this issue.


Monday, March 07, 2016

MATLAB as (Near-)PseudoCode

In "Teaching Data Science in English (Not in Math)", the Feb-08-2016 entry of his Web log, "The Datatist", Charles Givre criticizes the use of specialized math symbols (capital sigma for summation, etc.) and Greek letters as being confusing, especially to newcomers to the field. He offers, as an example, the following, traditional definition of sum of squared errors:



He suggests that "English" (pseudo-code) be used instead, such as the following:

residuals_squared = (actual_values - predictions) ^ 2
RSS = sum( residuals_squared )


Although there are some flaws in this particular comparison (1. the first example includes an unnecessary middle portion, 2. it also features the linear model definition, while the pseudo-code example does not, and 3. the indexing variable in the summation is straightforward in this case, but is not always so), I tend to agree. Despite my own familiarity with the math jargon, I agree with him that, in many cases, pseudo-code is easier to understand.

Pseudo-code is certainly simpler syntactically since it tends to make heavy use of (sometimes deeply-nested) functions, as opposed to floating subscripts, superscripts and baroque symbols. Pseudo-code often employs real names for parameters, variables and constants, as opposed to letters. It is also closer to computer code that one actually writes, which is the point of this post.


Given the convention in MATLAB to store variables as column vectors, here is my MATLAB implementation of Givre's pseudo-code:

residuals_squared = (actual_values - predictions) .^ 2;
RSS = sum( residuals_squared );


Only two minor changes were needed: 1. The MATLAB .^ operator raises each individual element in an input matrix to a power, while the MATLAB ^ operator raises the entire matrix to a power (through matrix multiplication). 2. The semicolons prevent echoing the result of each calculation to the console: They are not strictly necessary, but will typically be included in a MATLAB program. Note that, unlike some object-oriented languages we could name, no extra code is needed to lay the groundwork of overloaded operators.

Consider an example I have given in the past, for the forward calculation of a feedforward neural network using a hyperbolic tangent transfer function:

HiddenLayer = tanh(Input * HiddenWeights);
OutputLayer = tanh(HiddenLayer * OutputWeights);


Note that this code recalls the neural network over all observations stored in the matrix Input.That's it: Two lines of code, without looping, which execute efficiently (quickly) in MATLAB. The addition of bias nodes, jump connections and scaling of the output layer would require more code, but even such changes would leave this code easy to comprehend.