Using K Nearest Neighbors to Predict Pitch Types

Fans of baseball have, to varying degrees, a fair understanding of what defines any given pitch type. Fastballs are defined by features like high velocity, backspin, and a relatively straight flight path. Curveballs break vertically more than horizontally, while sliders generally feature more side-to-side movement. Changeups aren't as fast as fastballs, can feature some arm-side movement, and have relatively low spin rates as well. This is, of course, a brutal oversimplification, but so it goes.

If humans can pick out patterns in pitch characteristics, you can bet that computers can as well. A prime example of this in action comes via Trackman, which offers an out-of-the-box option to auto-classify the pitches that its radar system has recorded data for. I do not know what specific model(s) or technique(s) Trackman employs to classify pitches, but one has to assume, given the amount of data at Trackman’s disposal, that they have been fine-tuned. 

This post briefly chronicles my dabbling in k-NN, or K Nearest Neighbors, a machine learning technique, to see if I can generally replicate the pitch classifications Trackman made for a single game's worth of data. I make use of a Trackman CSV file complete with auto-classified pitch types, and I rely heavily on the book "Machine Learning with R" by Brett Lantz to run this k-NN test.

Put (too) simply, K Nearest Neighbors classifiers consider the proximity between data points in order to bucket those data into categories or classes. There are a couple of significant advantages to using k-NN: it is considered one of the simplest machine learning algorithms, it makes no assumptions about the distribution of the data in question, and open source languages like R have a rich set of packages that can run the algorithm or its variants.

Conceptually speaking, k-NN makes classification decisions based on each individual data point's location in a defined space and the cumulative landscape of points around it. An appropriate example might be pitch data on a scatterplot, with the x-axis representing pitch velocity and the y-axis representing pitch spin rate. One might visualize fastballs congregating in the upper righthand corner of that scatterplot, due to fastballs' higher velocities and generally healthy spin rates (especially these days). In the lower lefthand corner, where pitches with slower velocities and lower spin rates collect, there might be more changeups.

As Brett Lantz puts it, “The k-NN algorithm treats the features as coordinates in a multidimensional feature space.” For every featured data point, the algorithm calculates that point’s distance (often Euclidean) to every other point in the space. That distance quantifies how close each point is to its “neighbors.”
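To make the distance idea concrete, below is a minimal sketch in R. The two pitches and their (already normalized) feature values are invented purely for illustration.

```r
# Two hypothetical pitches described by normalized velocity, spin rate, and spin axis
pitch_a <- c(velo = 0.95, spin = 0.90, axis = 0.55)  # made-up "fastball"
pitch_b <- c(velo = 0.60, spin = 0.40, axis = 0.70)  # made-up "changeup"

# Euclidean distance: the smaller the value, the closer the "neighbors"
euclidean <- function(p, q) sqrt(sum((p - q)^2))
euclidean(pitch_a, pitch_b)
```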

The k in the k-NN nomenclature stands for the number of nearby neighbors that the algorithm takes into account in order to make a classification decision. For instance, if k is set to 5 and the algorithm is deciding whether to classify a pitch as a changeup or a fastball, and 4 of that pitch's 5 nearest training points are changeups, the pitch will be classified as a changeup. As a rule of thumb, k is usually set to roughly the square root of the number of observations in the training data.
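A toy version of that vote in R, with made-up neighbor labels:

```r
# Hypothetical labels of the 5 nearest training points to a new pitch
neighbor_labels <- c("Changeup", "Changeup", "Fastball", "Changeup", "Changeup")

# The majority label wins the vote: here, "Changeup"
names(which.max(table(neighbor_labels)))
```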

The above description can pretty safely be said to be "so simple it's wrong," but it does the job of offering a bare minimum of context. Moving on to the data set, the first job is to import the data and do some light cleaning. Three primary adjustments have been made. First, the original data set included 326 pitches, but Trackman's auto-classifier labeled 7 of those pitches simply as "Other"; those pitches have been removed. Second, to keep things as simple as possible, all pitches have been reclassified as either a Fastball, a Changeup, or a Breaking Ball. This decision was made to give myself a break: pitch types like sinkers and sliders are few and far between in this data set, which leaves them with very few "neighbors" to be bucketed with. Third, only the three fields that the classifications are based on are kept: pitch velocity, spin rate, and spin axis. Note that, because pitches are classified based on three features, the space the points inhabit is not a flat plane but a three-dimensional one.
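For reference, here is roughly what that import-and-cleaning step might look like in R. The file name, the column names (AutoPitchType, RelSpeed, SpinRate, SpinAxis), and the exact mapping of pitch types into the three buckets are assumptions made for illustration, not a copy of the actual Trackman file.

```r
# Read the single-game Trackman CSV (file and column names are assumptions)
pitches <- read.csv("trackman_game.csv", stringsAsFactors = FALSE)

# Drop the 7 pitches Trackman auto-classified simply as "Other"
pitches <- subset(pitches, AutoPitchType != "Other")

# Collapse the remaining labels into Fastball / Changeup / Breaking Ball
# (which original types fall into which bucket is an assumption)
pitches$PitchClass <- ifelse(
  pitches$AutoPitchType %in% c("Fastball", "Sinker", "Cutter"), "Fastball",
  ifelse(pitches$AutoPitchType %in% c("Changeup", "Splitter"), "Changeup",
         "BreakingBall"))

# Keep only the label and the three features used for classification
pitches <- pitches[, c("PitchClass", "RelSpeed", "SpinRate", "SpinAxis")]
```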

Two more prerequisite adjustments are made to the data before testing. 

First, the data set must be split into training and test sets. A randomly selected 100-pitch (row) subset has been extracted to conduct the actual test with, while the remaining 219 pitches are used to train the k-NN model. In other words, the 219 training pitches are the records the model bases its decisions on, while the 100 test pitches simulate new pitches to be classified without knowledge of the actual Trackman labels.

Second, those three aforementioned features (spin axis, spin rate, and velocity) must be normalized so that no single feature distorts the distances between points and disproportionately outweighs the others. For instance, with velocity ranging from 0 to 100 and spin rate ranging from 0 to 3,000, a modest difference in spin rate would swamp a meaningful difference in velocity. To get around this issue, feature scaling, or min-max normalization, has been applied to these three fields so that they all range from 0 to 1.
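Here is roughly how both adjustments might look in R, continuing from the hypothetical pitches data frame sketched above. The normalize() function is the standard min-max formula; the seed and the new column names are arbitrary choices for the sketch.

```r
# Min-max normalization: rescale a numeric vector to the 0-1 range
normalize <- function(x) (x - min(x)) / (max(x) - min(x))

pitches$VeloNormal     <- normalize(pitches$RelSpeed)
pitches$SpinRateNormal <- normalize(pitches$SpinRate)
pitches$SpinAxisNormal <- normalize(pitches$SpinAxis)

# Randomly hold out 100 pitches for testing; the other 219 train the model
set.seed(1)  # arbitrary seed, just for reproducibility
test_idx <- sample(nrow(pitches), 100)

features     <- c("VeloNormal", "SpinRateNormal", "SpinAxisNormal")
train_x      <- pitches[-test_idx, features]
test_x       <- pitches[ test_idx, features]
train_labels <- pitches$PitchClass[-test_idx]
test_labels  <- pitches$PitchClass[ test_idx]
```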

Below is R's output for this normalization step. You'll see that the fields with the suffix "Normal" have been normalized.

The fastest pitch recorded in this game's CSV was 95.65 mph and the highest spin rate recorded was 2,722 rpm. Normalizing these metrics bounded both fields between 0 and 1.

From here, R makes it very easy to run the k-NN classifier. It also enables fairly intuitive results to be displayed in the form of a cross table. Below is the cross table of final results, followed by an explanation (and a sketch of the code involved).

95 pitch classifications made by the k-NN algorithm matched that of Trackman.

The cross table above indicates every permutation of agreement and disagreement between the k-NN labels and Trackman's, in the spirit of "true positives," "false positives," and so on. The matching classifications make up the upper-left, center, and lower-right cells, where 10 breaking balls, 14 changeups, and 71 fastballs, respectively, have been classified in alignment with Trackman. Of 100 pitches, the k-NN algorithm classified 95 the same way as Trackman. Twice (per the middle-row, righthand cell) the k-NN classifier binned as a fastball a pitch that Trackman had classified as a changeup.
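For completeness, here is roughly what the classification and cross table steps look like in R, using the class and gmodels packages that Lantz's book leans on, and continuing from the hypothetical objects sketched above.

```r
library(class)    # provides knn()
library(gmodels)  # provides CrossTable()

# Classify each test pitch by a vote of its k = 13 nearest training pitches
knn_pred <- knn(train = train_x, test = test_x, cl = factor(train_labels), k = 13)

# Cross-tabulate Trackman's labels (rows) against the k-NN labels (columns)
CrossTable(x = test_labels, y = knn_pred, prop.chisq = FALSE)
```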

These results did come with a bit of trial and error. In choosing an appropriate k value, I began with 15 (the square root of the 219 training pitches is roughly 15). Choosing k is essentially choosing whether to err toward variance or toward bias: a low k, which means fewer neighbors are drawn upon, is susceptible to outliers, while a high k creates bias because the vote can simply reflect the majority of points in the space (i.e., if most of the pitches are fastballs, a high k might classify nearly every pitch as a fastball). In the end, a k value of 13 was settled on after watching the match rate fluctuate between 92% and 95% across different values of k.
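That trial and error can itself be scripted. Below is a sketch of sweeping over a range of k values (the range here is arbitrary) and recording the match rate against Trackman's labels.

```r
ks <- 5:20
match_rate <- sapply(ks, function(k) {
  pred <- knn(train = train_x, test = test_x, cl = factor(train_labels), k = k)
  mean(pred == test_labels)  # share of test pitches matching Trackman's label
})
data.frame(k = ks, match_rate)
```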

In all, these were encouraging results. Yes, pitch classifications were collapsed into just three categories before the k-NN algorithm was run. Yes, a 95% match rate leaves room for improvement. But that one could reach 95% agreement with Trackman from an exercise conducted with relative ease in an open source programming language is pretty interesting, especially considering the significant role Trackman plays in informing MLB teams. That this classifier made its decisions without any knowledge of the various pitchers' underlying stuff, their handedness, or any other context only makes it all the more exciting.
