When You Just Need a Zero

When creating digital products, like websites or apps, an important step is to conduct performance testing, which according to Wikipedia is “performed to determine how a system performs in terms of responsiveness and stability under a particular workload”. Performance testing is broad, varied, and not the subject of this post for many reasons, not the least of which is that I know very little about it.

However, one particular aspect of performance testing is relevant to touch on before transitioning to baseball, and that is percentile testing. To put things in (perhaps overly) simplistic terms, percentile testing measures something, like page loading speed, typically at a high percentile (90th, 99th, etc.) across all users. For those evaluating an online experience, high percentiles are employed because the relevant question is more often “are most users’ experiences meeting our performance criteria?” as opposed to “is the average user experience meeting our threshold for acceptability?”

One popular threshold in percentile testing is the 90th percentile. By examining the experience of those users with a site speed/performance in the 90th percentile, one may understand the experience of the considerable majority of users, versus simply gleaning how the average user is getting along.

In baseball, most statistics look at the sum of all performances and take some form of average. This is for good reason: players don’t get mulligans or do-overs in baseball. Nor do singularly inspired performances get weighted any more heavily than others. For all its shortcomings, ERA does effectively convey the amount of scoring that is being done against any given pitcher. FIP, while stripping out the more fickle outcomes on the baseball field, is still a metric that aggregates all of the events it measures.

Still, this practice of performance testing raises the question: should baseball take a page out of product development’s book and measure player performance a bit differently, at least in select cases?

That question is not necessarily answered here, but threads are pulled on. Specifically, I chose to focus on bullpens, a somewhat reasonable comp. To draw a simple comparison, when software developers work to ensure that 90% of users experience page loading speeds of, say, <1 second, relievers might alternatively aim to provide 90% of outings without allowing a run. Granted, 90% of outings being scoreless, even for 1-inning relievers, is a tall task indeed. Still, while P90 (or P??) outcomes for starting pitchers are more likely than not rough outings, relievers ideally should be delivering considerably more scoreless appearances than not.

Put another way, and while we are making outlandish hypotheticals, as a manager, which of the following relievers would you rather deploy throughout the season:

  • Reliever A: 3.00 ERA, 27 IP/outings, 18 scoreless outings, 9 outings with 1 ER apiece
  • Reliever B: 3.00 ERA, 27 IP/outings, 24 scoreless outings, 3 outings with 3 ER apiece
  • Reliever C: 3.00 ERA, 27 IP/outings, 26 scoreless outings, 1 outing (catastrophe) with 9 ER

I think a strong argument could be made that Reliever C, the only one of the three with a P90 of 0 earned runs, would be the most valuable reliever, particularly should these hypothetical relievers be appearing consistently in close games.
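To check the arithmetic behind this hypothetical, here is a minimal Python sketch. It assumes every outing lasts exactly one inning and uses the nearest-rank definition of a percentile (other definitions interpolate between values and can return fractions). The function names are illustrative only.

```python
import math


def nearest_rank_percentile(values, pct):
    """Smallest observed value with at least pct% of observations at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def era(earned_runs, innings):
    """Earned run average: earned runs allowed per nine innings."""
    return 9 * earned_runs / innings


# Earned runs per outing for the three hypothetical relievers.
# Assumption: each of the 27 outings is exactly one inning.
relievers = {
    "A": [0] * 18 + [1] * 9,
    "B": [0] * 24 + [3] * 3,
    "C": [0] * 26 + [9] * 1,
}

summary = {}
for name, outings in relievers.items():
    summary[name] = {
        "era": era(sum(outings), len(outings)),
        "scoreless_pct": 100 * outings.count(0) / len(outings),
        "p90": nearest_rank_percentile(outings, 90),
    }

for name, stats in summary.items():
    print(name, stats)
```

Under these assumptions all three relievers carry a 3.00 ERA, while their P90 values are 1, 3, and 0 earned runs for A, B, and C respectively.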

With all this context, we can proceed to the data. Data comes by way of FanGraphs’ fantastic (and updated) Splits Leaderboard, which allows for data to be split to single dates. I pulled data for all relief appearances this season, up to September 10th. With that data, I calculated the percentage of scoreless outings by all pitchers with 20+ relief appearances, of which there were 254.
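The calculation itself is simple enough to sketch in Python. This is not the code used for this post: the (pitcher, earned_runs) row layout, the function name, and the sample rows are all assumptions for illustration, and a real Splits Leaderboard export would first need to be parsed into that shape.

```python
from collections import defaultdict


def scoreless_outing_pct(appearances, min_appearances=20):
    """Map pitcher -> percentage of relief appearances with 0 earned runs.

    `appearances` is an iterable of (pitcher, earned_runs) pairs, one per
    relief appearance. Pitchers with fewer than `min_appearances` outings
    are left out of the result.
    """
    earned_by_pitcher = defaultdict(list)
    for pitcher, earned_runs in appearances:
        earned_by_pitcher[pitcher].append(earned_runs)

    return {
        pitcher: 100 * runs.count(0) / len(runs)
        for pitcher, runs in earned_by_pitcher.items()
        if len(runs) >= min_appearances
    }


# Made-up rows for illustration only; not real pitchers or real results.
rows = (
    [("Pitcher X", 0)] * 17 + [("Pitcher X", 2)] * 3    # 20 outings, 17 scoreless
    + [("Pitcher Y", 0)] * 10 + [("Pitcher Y", 1)] * 5  # 15 outings, below the cutoff
)

result = scoreless_outing_pct(rows)
print(result)
```

Here the hypothetical Pitcher X qualifies with 20 outings and Pitcher Y is dropped for falling short of the 20-appearance minimum.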

Setting those pitchers’ “Scoreless Outing Percentage” against their season-long ERA in a scatterplot results in the following graph.

Unsurprisingly, there is a negative relationship between scoreless outing percentage and ERA: the more runs you give up on average, the smaller the fraction of your outings that are scoreless, generally speaking. Another point to note is the dearth of players with scoreless outing percentages of 90% or higher (i.e., a P90 of 0 ER). There are in fact just 5 players with such a percentage: Brock Stewart, Josh Hader, Tim Mayza, Jesse Chavez, and Chris Martin. A P90 of 0 ER is likely too high a bar in practice.

Additionally, there appear to be plenty of instances where relievers have near-identical ERAs but considerably different scoreless outing percentages, which calls back to the hypothetical posed above, wherein the ERAs are the same but the scoreless outing percentages differ. As an example, Brooks Raley and Nick Anderson each have a 3.06 ERA, but Raley’s scoreless outing percentage, 85%, is much higher than Anderson’s 71%.

The turquoise line superimposed over this scatterplot represents a linear regression for the relationship between ERA (as an explanatory variable) and scoreless outing percentage. While not overwhelmingly predictive, one can still make use of a simple model to estimate a player’s scoreless outing percentage based on his ERA alone. Using such a model, one can see where scoreless outing percentage and ERA diverge most drastically from that model’s estimation.
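A rough sketch of that idea in Python follows. It fits an ordinary least squares line by hand and reports each pitcher's residual, meaning actual minus estimated scoreless outing percentage. The pitcher labels and numbers are invented for illustration and are not the data behind the chart, and whether the model in this post was fit in exactly this way is an assumption.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Invented (ERA, scoreless outing %) pairs; not real pitchers or real results.
pitchers = {
    "P1": (2.00, 85.0),
    "P2": (3.00, 80.0),
    "P3": (3.00, 70.0),
    "P4": (4.50, 68.0),
    "P5": (6.00, 66.0),
}

eras = [era for era, _ in pitchers.values()]
pcts = [pct for _, pct in pitchers.values()]
slope, intercept = fit_line(eras, pcts)

# Positive residual: more scoreless outings than the ERA alone would suggest.
residuals = {
    name: pct - (intercept + slope * era)
    for name, (era, pct) in pitchers.items()
}

# Sort from biggest "outperformer" to biggest "underperformer".
for name, resid in sorted(residuals.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {resid:+.1f} percentage points")
```

Sorting by residual in descending order gives the "outperformers" first; reversing the sort gives the pitchers whose scoreless outing percentage lags what their ERA would suggest.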

Below is a leaderboard of those scoreless outing percentages that “outperform” those players’ corresponding ERAs.

As one might expect, this list has a lot of high-ERA pitchers. Keegan Akin owns a 6.75 ERA yet produces a scoreless (by earned runs) appearance nearly 7 times out of 10. As a group, these pitchers have likely been burnt by crooked innings.

Flip this list on its head and you find players whose relatively modest ERAs nonetheless correspond to depressed scoreless outing percentages.

This list suggests a caveat: some players are responsible for going multiple innings. Sean Manaea and Nick Pivetta are both in swing roles, going multiple innings per appearance, and are therefore more likely to give up at least one run when they make it into games. Interestingly, the simple model using ERA is much worse at estimating scoreless outing percentage in this direction; each player in the top 15 has a scoreless outing percentage more than 10 percentage points lower than the linear model’s expectation.

One look at the scatterplot above suggests that requiring zero earned runs at the 90th percentile of outings is a prohibitively high bar. To make things more reasonable, I examined players in terms of their earned runs at the 75th percentile, or P75. The thought was that a reasonable number of relievers put up zeros in at least three out of four outings. That does turn out to be the case: 116 of the 254 qualifying relievers had a P75 of 0.
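Under the nearest-rank definition of a percentile, a pitcher's P75 is 0 exactly when at least 75% of his outings are scoreless, so the check reduces to counting. Here is a minimal Python sketch, reusing the three hypothetical relievers from earlier. The function name is made up for this example, and interpolating percentile definitions could disagree near the cutoff.

```python
def zero_at_percentile(earned_runs_by_outing, pct):
    """True if the nearest-rank pct-th percentile of earned runs is 0.

    This holds exactly when scoreless outings make up at least pct% of all
    outings, so integer arithmetic is enough and no sorting is needed.
    """
    total = len(earned_runs_by_outing)
    if total == 0:
        return False  # no outings, nothing to evaluate
    scoreless = sum(1 for er in earned_runs_by_outing if er == 0)
    return 100 * scoreless >= pct * total


# The three hypothetical relievers, one entry per outing (earned runs).
relievers = {
    "A": [0] * 18 + [1] * 9,
    "B": [0] * 24 + [3] * 3,
    "C": [0] * 26 + [9] * 1,
}

p75_zero = {name: zero_at_percentile(o, 75) for name, o in relievers.items()}
p90_zero = {name: zero_at_percentile(o, 90) for name, o in relievers.items()}
print("P75 of 0:", p75_zero)
print("P90 of 0:", p90_zero)
```

Relievers B and C clear the P75 bar while A does not, and only C clears the P90 bar, which illustrates how much the choice of percentile changes who qualifies.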

Should teams consider players through this percentile lens? I am not in a position to say, as this investigation only goes so far; a plethora of factors omitted here would refine a “scoreless outing percentage,” or P75, metric. As the last table suggests, appearance length is not taken into consideration. Nor is the fickle nature of earned (or unearned) runs allowed as a reliever, or the case of inherited runners.

Still, it seems to be a potentially compelling framework for evaluating and choosing relievers in bullpen management. More compelling than deferring to pitcher-batter matchups, platoon advantages, pitch mix considerations, or any of the litany of other criteria available to decision makers today? Most probably not. Regardless, in the ninth inning of the World Series leading by one, all else held equal, I am taking Reliever C every time.
