Baseball sabermetrics has always been about fairly large datasets. A single batter's hitting line can reach 700 plate appearances over the course of a season. More than 1000 players have donned a cap in major league baseball this year, with thousands more in the minor leagues.
One of the topics in Andy Andres' Sabermetrics 101x course this week is Big Data, with the capital B and D. There are a number of good points made in his introduction, but one of the most important is that big data does not always mean better data. Big data can be fraught with bias (lack of controls, systematic bias in data collection), a problem that Colin Wyers has long railed against with our favorite defensive metrics. Big data can also lead to false conclusions because standard significance tests stop being informative at that scale. When your sample size gets into the thousands, P-values below 0.05 get easy to reach, even when the difference found is too small to mean anything in practice.
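To make the P-value point concrete, here is a minimal sketch using a two-sided, two-proportion z-test with equal sample sizes. The batting averages (.265 vs. .260) and the sample sizes are made-up numbers chosen for illustration, and the function name is my own; none of this comes from the course.

```python
import math


def two_proportion_p_value(p1, p2, n):
    """Two-sided p-value for a difference between two observed proportions.

    Assumes both groups have the same sample size n and uses the
    pooled-variance normal approximation (a textbook z-test).
    """
    pooled = (p1 + p2) / 2.0
    standard_error = math.sqrt(2.0 * pooled * (1.0 - pooled) / n)
    z = (p1 - p2) / standard_error
    # For a standard normal Z, P(|Z| > |z|) equals erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2.0))


if __name__ == "__main__":
    # Hypothetical: the same 5-point gap in batting average,
    # observed at ever larger sample sizes.
    for n in (600, 10_000, 100_000, 1_000_000):
        p = two_proportion_p_value(0.265, 0.260, n)
        print(f"n = {n:>9,}  p-value = {p:.4f}")
```

The gap between the two averages never changes, but the P-value shrinks steadily as n grows. Whether a 5-point difference matters is a baseball question that the test cannot answer.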
This spring, MLBAM announced a new stream of data with its new player tracking system. This system, which is already in operation in a few parks this year, will track virtually everything on the baseball field: player position, and ball trajectories from the pitcher's hand and from the bat into the field. It is essentially a replacement for the pitchf/x, hitf/x, and fieldf/x systems we've been using (or, in the latter two cases, at least hearing about).
It's incredibly exciting to hear that this will be used. The question will be how much of these data will be available to the fan community at large. We probably don't need to be able to download all 7 TB of data. But hopefully, we'll get something. My wish list:
- Everything we currently have with the pitchf/x system for tracking pitches.
- The equivalent of pitchf/x, but for batted balls. That would give us vertical trajectory, the direction the ball was hit, velocity, spin, and landing location/hang time.
- For fielding, ball landing location alone would be a great step forward. It could be fed directly into our UZR-like algorithms, and immediately improve them by removing systematic bias in our hit location data. Initial player position could also be really useful, if for nothing else than to help distinguish shifts when they happen.
Some of the other measures they show in the videos, like acceleration, reaction time, and maximum speed, could be interesting, and it would be great if they were released along with the stuff above. But I'm also not sure, for the time being, that I'll be all that interested in them. With what is described above, we'd have a wealth of useful analytical information at our hands. From the perspective of valuing players, which is often my main interest, it really doesn't matter if a player gets from point A to point B because they get a great jump, because they have great route efficiency, or because they're fast. What matters is that they get there.
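For anyone curious what one of those measures might look like in practice, here is a rough sketch of route efficiency, taken here as the straight-line distance from where the fielder started to where he ended up, divided by the distance he actually ran. That definition is my assumption, not MLBAM's published formula, and the list of (x, y) positions in feet is a hypothetical stand-in for whatever the tracking data ends up looking like.

```python
import math


def path_length(points):
    """Total distance covered along a sequence of (x, y) positions."""
    return sum(math.dist(a, b) for a, b in zip(points, points[1:]))


def route_efficiency(points):
    """Straight-line distance divided by distance actually traveled.

    Returns 1.0 for a perfectly direct route and smaller values for
    more roundabout ones. This definition is an assumption made for
    illustration only.
    """
    if len(points) < 2:
        raise ValueError("need at least two tracked positions")
    traveled = path_length(points)
    if traveled == 0:
        return 1.0  # the fielder never moved
    return math.dist(points[0], points[-1]) / traveled


if __name__ == "__main__":
    # Hypothetical outfielder positions in feet, sampled during one play.
    direct = [(0.0, 0.0), (15.0, 20.0), (30.0, 40.0)]
    banana = [(0.0, 0.0), (30.0, 0.0), (30.0, 40.0)]
    print(f"direct route: {route_efficiency(direct):.3f}")
    print(f"banana route: {route_efficiency(banana):.3f}")
```

Both fielders in the example end up at the same spot, which is the part that matters for valuing them. The efficiency number only describes how they got there.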
Of course, if I'm a team, I care a lot more about the minutiae. It might be that maximum speed can't be taught. But reaction time might be teachable, and route efficiency almost certainly is (right?). Personally, though, I'm most interested in just evaluating player value.
The concern, of course, is how much of the data will actually be available to us. I'm frankly a bit scared about this. Potentially, MLBAM could be really stingy with these data, and we might end up with LESS information than we have now through pitchf/x. I'm hopeful that this won't happen, however, and I've no doubt that folks like David Appelman at FanGraphs will be doing their best to get access to, or perhaps even license, some of the critical info that we as a community want.
Once the data ARE available, the question will be how well the community makes use of them. I have no doubt that we'll see a lot of spurious conclusions in the early going. Fortunately, the sabermetric community is pretty good at policing itself and correcting its past mistakes. We have a lot of bright minds in this community, and hopefully, within a few years, we'll have a good grasp of what these data can and cannot do.