3 - In sports showcase
Category | Course |
---|---|
Learning order | 4 |
Status | Prepared |
Course by Julie Chaumard
Chapters | Homeworks | Exercises | Weeks | Dates |
---|---|---|---|---|
The movie | 1 | 08/05/2025 | ||
Film Analysis | 1 | 08/05/2025 | ||
Resources | 1 | 08/05/2025 | ||
3 Videos about Sports and Big data | 1 | 08/05/2025 | ||
Work on the movie | 1 | 08/05/2025 | ||
And AI? | 1 | 08/05/2025 | ||
Exercise 3-1 | 1 | 08/05/2025 |
Understanding Data Science Through Sports
The movie
Original Title | Moneyball |
Release | 2011 |
Director | Bennett Miller (Capote, Foxcatcher) |
Screenwriters | Aaron Sorkin (The Social Network) & Steven Zaillian (Schindler’s List) |
Based on | The book Moneyball: The Art of Winning an Unfair Game (2003) by Michael Lewis |
Genre | Drama, Biography, Sports |
Runtime | 2h13 |
Music | Mychael Danna |
Main Cast
Brad Pitt | Billy Beane, General Manager of the Oakland Athletics |
Jonah Hill | Peter Brand, data analyst (character inspired by Paul DePodesta) |
Philip Seymour Hoffman | Art Howe, field manager (head coach) |
Robin Wright | Sharon, Billy Beane’s ex-wife |
Kerris Dorsey | Casey Beane, Billy’s daughter |
Paul DePodesta is a key figure in the true story behind Moneyball
Full Name | Paul DePodesta |
Date of Birth | November 16, 1972 |
Education | Harvard University (Economics) |
Profession | Sports executive, data and strategy specialist |
Known for | Applying data science to player recruitment in baseball |
Position at the time of Moneyball | Assistant General Manager to Billy Beane at the Oakland Athletics |
Current Position (in 2025) | Chief Strategy Officer of the Cleveland Browns (NFL American football team – Ohio) |
- He is considered one of the pioneers of data science in professional sports.
- He demonstrated that an analytical profile with no on-field experience could deeply transform a traditional sector.
- He later applied his skills to other sports, notably American football, showing the transferability of data science.
Synopsis
The film tells the true story of Billy Beane, General Manager of the Oakland Athletics, who uses an innovative statistical approach to recruit players with a very limited budget.
Billy Beane, General Manager of the Oakland Athletics (California), faces a problem: his team has one of the smallest budgets in MLB.
To stay competitive, he turns to a young analyst (Peter Brand) who proposes an innovative method based on advanced statistical analysis.
Together, they recruit underrated players based not on their “style” but on their actual performance (e.g., on-base percentage).
Awards and Nominations
- 6 Academy Award nominations, including:
- Best Picture
- Best Actor (Brad Pitt)
- Best Supporting Actor (Jonah Hill)
- Best Adapted Screenplay
What the Film Illustrates
- The transition from intuition-based sport to data science.
- The resistance to change in professional environments.
- The hidden value of atypical profiles when you change evaluation metrics.
Paul DePodesta’s Role in Moneyball
- In the early 2000s, Paul DePodesta worked with Billy Beane at the Oakland A’s.
- He uses statistical models to identify players’ true value, especially through on-base percentage.
- He is one of the first to apply rigorous quantitative methods in a field still driven by “gut feeling” and networking.
- In the film Moneyball, his character is played by Jonah Hill under the fictional name Peter Brand, as DePodesta declined to have his real name used.
Film Analysis
Moneyball illustrates the use of Big Data in sports, particularly in baseball.
Film Objective
- Overcome a financial disadvantage compared to major teams.
- Use data instead of intuition or traditional scouts’ experience.
Use of Big Data
- Statistical data analysis:
- Use of advanced indicators (like OBP: On Base Percentage) to evaluate players’ actual effectiveness.
- On-Base Percentage (OBP) is an advanced baseball statistic measuring how often a player reaches base, whether by hitting, drawing a walk, or being hit by a pitch.
- $\text{OBP} = \frac{\text{Hits} + \text{Walks} + \text{Hit by pitch}}{\text{At-bats} + \text{Walks} + \text{Hit by pitch} + \text{Sacrifice flies}}$
- The batting average only considers hits.
- OBP is more comprehensive: it also counts times the player reaches base without hitting (walks, being hit by a pitch).
- It is thus a better indicator of a player's ability to avoid outs, which is crucial for creating scoring opportunities.
- It is the central metric used by Billy Beane and Paul DePodesta to spot undervalued but efficient players.
Player | Hits | Walks | OBP | Batting Average |
---|---|---|---|---|
A | 100 | 60 | .400 | .300 |
B | 100 | 10 | .320 | .300 |
Both have the same batting average (.300), but player A gets on base much more often, so OBP is the better indicator.
- Abandoning traditional criteria (graceful movement, fame, age) in favor of quantifiable performance.
- Optimized recruitment:
- Based on statistical models, the team recruits players undervalued by other clubs.
- This allows for maximizing performance at low cost.
- Role of the analyst:
- Peter Brand (inspired by Paul DePodesta) is a young economist who introduces the mathematical approach to sports recruitment.
- He builds predictive models using Excel or simple databases.
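The OBP formula and the two-player comparison above can be reproduced with a few lines of Python. This is a minimal sketch, not the analysts' actual tooling. The comparison table does not give at-bats, hit-by-pitch, or sacrifice flies, so the value of 333 at-bats (chosen to yield a .300 batting average) and the zeros for the other two terms are assumptions.

```python
def batting_average(hits: int, at_bats: int) -> float:
    """Batting average: hits divided by at-bats (walks are ignored)."""
    return hits / at_bats


def on_base_percentage(hits: int, walks: int, hit_by_pitch: int,
                       at_bats: int, sacrifice_flies: int) -> float:
    """OBP = (H + BB + HBP) / (AB + BB + HBP + SF)."""
    times_on_base = hits + walks + hit_by_pitch
    plate_opportunities = at_bats + walks + hit_by_pitch + sacrifice_flies
    return times_on_base / plate_opportunities


# Assumption: 333 at-bats for both players, no hit-by-pitch, no sacrifice flies.
AT_BATS = 333
players = {
    "A": {"hits": 100, "walks": 60},
    "B": {"hits": 100, "walks": 10},
}

for name, stats in players.items():
    avg = batting_average(stats["hits"], AT_BATS)
    obp = on_base_percentage(stats["hits"], stats["walks"], 0, AT_BATS, 0)
    print(f"Player {name}: AVG={avg:.3f}  OBP={obp:.3f}")
```

Under these assumptions both players hit about .300, while player A's OBP comes out near .407 against roughly .321 for player B, close to the rounded values in the table. The gap comes entirely from walks, which batting average does not see.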
Big Data Concepts in the Film
Concept | Illustration in the film |
---|---|
Massive Data | Stats on thousands of players |
Predictive Analysis | Performance prediction based on history |
Data-driven decision-making | Recruitment is no longer by “feel” but via models |
Valuing “forgotten” data | Some players have good stats but are ignored by the traditional system |
What Truly Sparked the Revolution:
- It was Billy Beane, General Manager of the Oakland Athletics, who in the early 2000s used an analytical approach to recruit undervalued players.
- He was inspired by the work of Bill James, pioneer of sabermetrics (advanced baseball statistics).
- This strategy showed that a team with a limited budget could stay competitive thanks to data.
The Role of the Film (2011)
- It popularized this story to a wide audience.
- It served as a cultural trigger, showing that intuition and experience could be replaced by statistical models.
- After the film, many teams and sports (football, basketball, soccer, tennis…) began adopting this type of reasoning.
Resources
Internet pages
- Data analytics : de Moneyball à All is money : https://keyrus.com/fr/fr/insights/data-analytics-de-moneyball-a-all-is-money
- https://www.grandviewresearch.com/industry-analysis/sports-analytics-market Sports Analytics Market Size
YouTube videos
- The Numbers Game | How Data Is Changing Football | Documentary https://youtu.be/lLcXH_4rwr4?si=zZv8m37_DKRsx2Vo
3 Videos about Sports and Big data
[3:22] Moneyball: 3 insight about Big Data & Change Management
https://youtu.be/CkY6RA4hOfo?si=39N8BiOIGm7IZBVa
Transcript in English
Extract 1
There is an epidemic failure within the game to understand what is really happening. And this leads people who run Major League Baseball teams to misjudge their players and mismanage their teams.
— Go on.
— Okay.
People who run ball clubs — they think in terms of buying players.
Your goal shouldn’t be to buy players. Your goal should be to buy wins, and in order to buy wins, you need to buy runs. You’re trying to replace Johnny Damon. The Boston Red Sox see Johnny Damon, and they see a star who’s worth seven and a half million dollars a year. When I see Johnny Damon, what I see is an imperfect understanding of where runs come from. The guy’s got a great glove, he’s a decent leadoff hitter, he can steal bases. But is he worth the seven and a half million dollars a year the Boston Red Sox pay? No. Baseball thinking is medieval. They are asking all the wrong questions.
Extract 2
— Hey Billy, I wanted you to see these player valuations that you asked me to do.
— I asked you to do three.
— Yeah, to value three players.
— Yeah.
— How many did you do?
— Forty-seven.
— Okay. Why don’t you walk me through the board?
So using this equation on the upper left right here, I’m projecting that we need to win at least 99 games to make the postseason. We need to score at least 814 runs in order to win those games, and allow no more than 645 runs.
— What’s this?
— This is the code that I’ve written for our year-to-year projections. This is building in all the intelligence that we have to project players. It’s about getting things down to one number. Using stats the way we read them, we’ll find value in players that nobody else can see. People are overlooked for a variety of biased reasons and perceived flaws: age, appearance, personality. Bill James and mathematics cut straight through them. Out of the 20,000 notable players for us to consider, I believe that there is a championship team of 25 people that we could afford, because everyone else in baseball undervalues them. Like an island of misfit toys.
This is Chad Bradford, relief pitcher. He’s one of the most undervalued players in baseball. His defect is that he throws funny. Nobody in the big leagues cares about him. He looks funny. But this guy could be not just the best pitcher in our bullpen, but one of the most effective relief pitchers in all of baseball. You can get him for two hundred thirty-seven thousand.
Explanation
Instead of thinking: “Which player can I buy?”, one should think: “How can I score more runs to win more games?”
This means: don’t focus on the “names,” but on the actual ability of players to contribute to victory.
Current methods are outdated, almost “religious” or “superstitious,” based on intuition. The film criticizes this rigid mindset and proposes a modern, scientific, and rational approach.
This passage explains that traditional baseball follows the wrong logic: it values visible stars, not those who actually produce wins. Moneyball proposes a revolution: analyze the numbers, seek actual efficiency, and recruit smartly.
Billy Beane asks for an evaluation of 3 players. But the analyst evaluates 47, because he adopts a comprehensive and scientific approach, not intuitive or based on reputation.
“getting things down to one number”: he creates a mathematical formula to predict each player’s contribution to those goals. The idea is to summarize a player’s value into an objective number, to compare all players on a fair basis.
Bias: The traditional system rejects players due to prejudice: Too old, Not charismatic, Weird playing style…
But these criteria have nothing to do with actual effectiveness.
Bill James is the father of sabermetrics, the science of stats applied to baseball. He shows that data often contradicts intuition.
Their bet: find 25 effective but undervalued players, therefore cheaper, and build a winning team without blowing the budget.
Chad has a weird motion, so traditional scouts dismiss him. But data shows he is extremely effective. And on top of that, he only costs $237,000 → a bargain.
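The link between runs and wins quoted in Extract 2 (814 runs scored, 645 runs allowed, 99 wins) can be checked with a short calculation. The film does not say which equation is on the board. The sketch below assumes Bill James's Pythagorean expectation with its classic exponent of 2 and a 162-game regular season, so treat it as one plausible reading rather than the analyst's actual method.

```python
def pythagorean_win_pct(runs_scored: float, runs_allowed: float,
                        exponent: float = 2.0) -> float:
    """Bill James's Pythagorean expectation: RS^k / (RS^k + RA^k)."""
    scored = runs_scored ** exponent
    allowed = runs_allowed ** exponent
    return scored / (scored + allowed)


SEASON_GAMES = 162  # length of an MLB regular season

# Figures quoted in the film extract.
win_pct = pythagorean_win_pct(runs_scored=814, runs_allowed=645)
expected_wins = win_pct * SEASON_GAMES

print(f"Expected win percentage: {win_pct:.3f}")
print(f"Expected wins over {SEASON_GAMES} games: {expected_wins:.1f}")
```

With these inputs the formula gives a win percentage of about .614, or roughly 99.5 wins, which is consistent with the 99-win target in the dialogue. This is the sense in which the goal shifts from buying players to buying runs.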
[3:58] Moneyball, Big Data, And The Data Scientist
https://youtu.be/10M_AP9MBg4?si=BT_LUSN1b7e8AQs4
Ari Kaplan is a leading figure in the Major Leagues for revolutionizing and modernizing player assessment. In this episode of Forrester TechnoPolitics, Mike Gualtieri asks Ari how Big Data will affect the game.
Explanation
- Ari Kaplan: a recognized expert in the field of professional baseball (Major Leagues), especially for his advanced statistical methods.
- Player assessment: means “evaluation of players,” that is, judging their performance, potential, etc.
- Revolutionizing and modernizing: he transformed traditional methods (often based on observation or intuition) by integrating data and technology.
- Big Data will affect the game: he is asked how massive data (Big Data) will influence or transform the game of baseball (in strategies, recruitment, performance, etc.).
- Forrester: a consulting and research firm in technology (very well-known in the tech sector).
- Moneyball guy: an informal expression referring to someone who uses data and statistics to make decisions in sports, particularly in baseball, in the style of the characters from the film Moneyball.
Transcript in English
Mike Gualtieri :
Hi everyone, I’m Mike Gualtieri, Forrester principal analyst and your host of Forrester TechnoPolitics. I’m here at Predictive Analytics World in Chicago, and I’m very happy to be here with Ari Kaplan, president of AriBall and a Moneyball guy, like, a real Moneyball guy.
Ari Kaplan :
Well, it’s great to be here, thanks — thanks for having me. I’ve been running AriBall for a number of years and involved in Major League Baseball, like with Moneyball — using information, using analytics to predict what’s going to happen, to put values on players.
Mike :
So you actually predict — you actually help real teams win ball games?
Ari :
Sure. So I like to say there’s “above the field” and “on the field”.
“Above the field” is forecasting: what’s the economic impact of a player, what might they do in future years, and what’s the risk involved. It is putting a dollar sign on the muscle, as the term goes in the industry.
And then there’s “on the field”, which is really exciting — you sit with the players, with the coaches in the clubhouse, and use information and analytics and predictive information of behavior to find strengths and weaknesses and habits of the team you’re about to face.
Mike :
And so — predictive analytics — do you think all of the teams understand the power of this at this point?
Ari :
Sure. So every team’s different. There are all different cultures and all different personalities.
And teams are very broad. There’s a part of the team in terms of ownership, and they’re very interested in maximizing, you know, their investment — whether it’s on the team itself (maximizing the wins), or whether it’s the business side.
Mike :
Do they provide you with all of the data you need to do the analytics?
Ari :
So that’s a really fun part of it: where does the data come from?
The teams themselves often collect proprietary information.
There are third-party vendors that collect information of every single pitch in the majors and minors — you’re talking multiple millions — about 900,000 pitches in the majors, and about five times that in the minors.
Everything from where the pitcher’s hand is when the ball’s released, what the spin of the ball is, what the late break movement of the pitch is, where it ends up in the zone.
It sounds like straightforward information, but you can glean really predictive patterns in human behavior.
So, using information can help detect what’s changed recently — and then you can alert the players and come up with a game plan.
There’s third-party data, there’s team information.
Some teams have entire squads of people recording, for example, where the catcher’s glove is set up and then where the pitch ends up, to see if the pitcher has command of certain pitches.
And in situations like runners on base, whether they have to change their delivery or not — there’s a lot going on now in terms of understanding pitching and hitting.
But there’s also great future investment in collecting large amounts of data in things such as fielding and mechanics of hitting.
For example, there are sensors — potentially — but right now, cameras set up in certain fields (and hopefully all 30 fields in the majors, and more in the minors) that collect everything that’s happening during the game:
Video images like ball movements, and just every movement, absolutely.
Mike :
That’s going to be huge.
Ari :
Yeah. And the value of that is incredible.
Right now, it’s very subjective to say: “You know, Derek Jeter is a better shortstop than Starlin Castro.”
But you can quantify it by saying: how many miles an hour does he throw to first in a key situation? Is he leaning the right way before the ball is actually hit?
Mike :
You know what I’m thinking?
I’m thinking that the most valuable players on a Major League Baseball team are going to be the data scientists and the computer scientists.
They’re going to be some of the most important people on the staff.
Ari :
Absolutely.
Mike :
Ari Kaplan — thank you.
[37:04] Big Data in the Age of Moneyball
https://youtu.be/pzKu_buHlfM?si=q5vuPInVDI1elMRP
Explanation
- Big Data has become central in sports, far beyond simple intuition.
- In practice, it all started with Moneyball, the book published in 2003.
- Since then, teams like the Texas Rangers (baseball) have been using advanced tools (cloud, artificial intelligence, Databricks) to:
- Analyze player movements (computer vision),
- Predict performance,
- Optimize game and recruitment decisions.
- They use Machine Learning tools like MLflow to organize their predictive models.
The Big Data Revolution and the Age of Moneyball
Presenters:
- Alexander Booth (Senior Analyst)
- Ryan Stoll (Data Engineer)
Organization: Texas Rangers Baseball Club
Presentation Outline
- Introduction: the legacy of Moneyball
- Moneyball (2003, then film in 2011) introduced the idea of using statistics (like on-base percentage) to recruit undervalued players.
- This marked a first step toward a data-driven strategy.
- Moneyball is a strategic approach introduced in baseball by Billy Beane of the Oakland Athletics, popularized by the book and film of the same name. This strategy relies on the use of advanced statistical data to identify undervalued players, notably favoring on-base percentage (OBP) over traditional batting average. The central idea is to use statistics to optimize recruitment and gameplay strategy.
- Statcast & technology tracking
- Introduction of Statcast: system for tracking ball and player movements.
- Technology evolution: PITCHf/x, TrackMan, Hawk-Eye.
- Precise measurements: speed, angle, spin, limb positions, etc.
Statcast is an advanced technology used by MLB to collect and analyze precise data during games:
- PITCHf/x (2006): Initial 3-camera system developed by Sportvision to track pitch trajectories and speeds.
- Statcast (2015): System combining HD cameras and Doppler radar (TrackMan) installed in all MLB stadiums. It collects data such as:
- Pitch speed
- Exit velocity
- Spin rate
- Horizontal and vertical movement of the ball
- Hawk-Eye (since 2020): A system of 12 high-speed cameras providing precise tracking of pitches, player movements, and batted balls (99% of batted balls are now tracked, up from 89%). Hawk-Eye also provides:
- 18-point skeletal data per player (shoulders, knees, wrists, etc.) in 3D at 30 FPS.
- Detailed seam tracking of the ball to analyze real trajectory.
- Other complementary technologies:
- High-speed motion capture (with or without markers)
- Bat sensors (swing speed, swing path)
- Force plates (analyzing player weight distribution)
This data is also combined with 3D Lidar scans of stadiums and weather data updated every 5 minutes.
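To get a feel for why the skeletal feed alone is a Big Data problem, the figures above (18 body points, 3D coordinates, 30 frames per second) can be turned into a rough volume estimate. This is a back-of-the-envelope sketch. The tracking duration and the number of people tracked at once are assumptions chosen for illustration, not figures from the presentation.

```python
KEYPOINTS_PER_PERSON = 18   # from the Hawk-Eye description above
COORDS_PER_KEYPOINT = 3     # x, y, z
FRAMES_PER_SECOND = 30      # from the Hawk-Eye description above


def skeletal_values(seconds: int, tracked_people: int) -> int:
    """Number of raw coordinate values produced by skeletal tracking."""
    values_per_frame = KEYPOINTS_PER_PERSON * COORDS_PER_KEYPOINT * tracked_people
    return values_per_frame * FRAMES_PER_SECOND * seconds


# Assumption: a single person tracked continuously for one hour.
one_person_one_hour = skeletal_values(seconds=3600, tracked_people=1)
print(f"{one_person_one_hour:,} coordinate values")
```

Even this conservative case yields 5,832,000 coordinate values for one person in one hour, before adding more players, ball tracking, or weather data.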
- Big Data & Technical Infrastructure
- The Rangers use Databricks to handle millions of data points per game.
- Integration via Apache Spark, Delta Lake, Autoloader, etc.
- Centralization of data pipelines (API, CSV/JSON files, streaming…).
The Texas Rangers use Databricks, a unified analytics platform, to effectively manage their massive data streams:
Data Engineering with Databricks
- Centralization of data ingestion scripts (previously scattered across servers, databases, and scripts).
- Data received from:
- APIs
- FTP
- Internal and external databases
- Cloud storage (CSV, JSON, Parquet, videos)
- Use of Databricks notebooks and Delta Lake storage for:
- Extracting, transforming, and cleaning data (via Spark, Koalas, and PySpark).
- Intermediate storage before depositing into the enterprise data warehouse.
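The idea of replacing scattered ingestion scripts with one central entry point can be illustrated in plain Python. This is only a conceptual sketch using the standard library. It is not Databricks, Spark, or Delta Lake code, and the field names (`pitch_id`, `speed_mph`), the `_source` tag, and the sample feeds are hypothetical.

```python
import csv
import io
import json


def parse_csv(text: str) -> list[dict]:
    """Parse CSV text into a list of row dictionaries (values stay strings)."""
    return list(csv.DictReader(io.StringIO(text)))


def parse_json(text: str) -> list[dict]:
    """Parse JSON text; a single object is wrapped in a list."""
    data = json.loads(text)
    return data if isinstance(data, list) else [data]


# One registry of parsers instead of one script per source.
PARSERS = {"csv": parse_csv, "json": parse_json}


def ingest(payload: str, fmt: str, source: str) -> list[dict]:
    """Normalize any supported feed into records tagged with their origin."""
    if fmt not in PARSERS:
        raise ValueError(f"Unsupported format: {fmt}")
    return [{**record, "_source": source} for record in PARSERS[fmt](payload)]


# Hypothetical feeds standing in for an FTP drop and an API response.
csv_feed = "pitch_id,speed_mph\n1,94.5\n2,88.1\n"
json_feed = '[{"pitch_id": 3, "speed_mph": 91.0}]'

staging = ingest(csv_feed, "csv", "ftp") + ingest(json_feed, "json", "api")
print(len(staging), "records staged")
```

Adding a new format then means registering one more parser rather than writing and deploying another standalone script, which is the maintenance benefit the centralization bullet describes.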
Streaming Data with Databricks
- Use of Autoloader (Databricks technology) for continuous and automatic streaming from cloud storage to Delta Lake.
- Real-time analysis of gameplay data (e.g., ball speed, trajectory, sprint speed) enabling quick tactical decisions such as umpire trend analysis or pitcher fatigue monitoring.
- Machine Learning Operations (MLOps)
- Use of MLflow to:
- Track, version, and deploy predictive models.
- Collaborate efficiently among analysts.
- Reduce time between game and actionable insights.
- Example: anticipating player fatigue, umpire tendencies, etc.
Machine Learning Operations (MLOps)
- Databricks and its MLflow technology allow:
- End-to-end tracking of Machine Learning models from creation to production.
- Continuous integration and automatic deployment.
- Model reproducibility via hyperparameter and feature tracking.
- Centralization in a model registry accessible throughout the organization, avoiding duplication and increasing transparency.
- Deployment via REST APIs, enabling instant predictions.
- Integration with various ML frameworks like:
- TensorFlow, PyTorch, Spark ML, Scikit-learn
- AutoML (H2O, FastAI, XGBoost, LightGBM)
- Case Study: The New Science of Hitting
- Objective: predict the probability of a hit based on millions of pitches.
- Data used: launch angle, exit velocity, spray angle, defensive positions, player handedness.
- Model: XGBoost with 84% accuracy.
- Result: visualization of the "sweet spot" (20°–35°, >60 mph) to maximize chances of a hit or home run.
- Strategic consequence: ban on defensive shifts to benefit left-handed batters.
The team conducted a study based on 2 million pitches since 2019, analyzed with Databricks and an XGBoost classification model to determine the probability that a batted ball results in a hit.
Variables analyzed:
- Launch angle
- Exit velocity
- Defensive positioning (shift)
- Batter and pitcher handedness
Results:
- Model accuracy: 84%
- Identification of the “sweet spot,” the optimal zone combining:
- Launch angle between 20 and 35 degrees
- Exit velocity between 60 and 100 mph (faster = doubles/triples, moderate = singles)
This analysis confirmed the effectiveness of modern strategies to maximize offensive success.
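The “sweet spot” reported above can be expressed as a simple threshold check. This is a minimal sketch of the reported zone only. It is not the team's XGBoost model, and treating the boundaries as inclusive is an assumption.

```python
# Boundaries taken from the results summarized above.
SWEET_SPOT_ANGLE_DEG = (20.0, 35.0)
SWEET_SPOT_VELOCITY_MPH = (60.0, 100.0)


def in_sweet_spot(launch_angle_deg: float, exit_velocity_mph: float) -> bool:
    """Return True when a batted ball falls inside the reported sweet spot."""
    min_angle, max_angle = SWEET_SPOT_ANGLE_DEG
    min_speed, max_speed = SWEET_SPOT_VELOCITY_MPH
    angle_ok = min_angle <= launch_angle_deg <= max_angle
    speed_ok = min_speed <= exit_velocity_mph <= max_speed
    return angle_ok and speed_ok


# Hypothetical batted balls: (launch angle in degrees, exit velocity in mph).
for angle, speed in [(28.0, 95.0), (5.0, 95.0), (28.0, 50.0)]:
    print(angle, speed, in_sweet_spot(angle, speed))
```

A rule like this only labels a zone. The classification model described above goes further by estimating a hit probability from all the variables together, including defensive positioning and handedness.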
- Conclusion:
- The Moneyball strategy is evolving with Big Data into an ultra-optimized approach.
- Models are centralized, traceable, and automatically deployed.
- Big Data + AI = continuous competitive advantage for modern teams.
- Big Data improves both offensive and defensive strategies:
- Optimize hitters' launch angles and speeds.
- Adjust pitching to avoid the “sweet spot.”
- Analyze and anticipate umpire and player trends.
- This data revolution is just beginning and already heavily influences:
- Training design (biomechanics labs, injury prevention).
- Tactical in-game approaches.
- Talent development starting from minor leagues.
The Texas Rangers demonstrate how Big Data technologies (Statcast, Hawk-Eye), the unified Databricks platform (Delta Lake, MLflow, Autoloader), and advanced machine learning (XGBoost) are radically transforming professional baseball, significantly enhancing efficiency and strategic decision-making.
These technological and analytical advances represent a new era of competitiveness for professional sports teams.
Transcript in English
Alexander Booth:
Hello everyone, welcome to our presentation.
My name is Alexander Booth and I'm a senior analyst with the Texas Rangers Baseball Club.
Ryan Stoll:
And I'm Ryan Stoll, a data engineer with the Texas Rangers.
We're very excited to be talking to you today about the Big Data revolution and the age of Moneyball.
Agenda
We’re going to start off by discussing our agenda and a little bit of background on who we are.
Then we’ll go into our presentation.
We will be discussing the age of Moneyball and how it has revolutionized the game of baseball.
We’ll talk about a technology called Statcast, which allows us to track everything from ball movements to players — even the hip trajectory of a pitcher, everything about the swing, and weather that happens inside of a Major League game.
Then we’ll talk about how the Rangers utilize Databricks to analyze this vast amount of big data coming into our pipelines.
Finally, we'll end with a case study: The New Science of Hitting, which will explain how we can use machine learning and big data to predict whether a ball will fall for a hit — and how to optimize our players to hit more strategically.
Introductions
Alexander Booth:
As mentioned before, my name is Alexander Booth. I've been with the Rangers since 2018, so this is my fifth season with the club. I'm a senior analyst within their Research and Development department.
Before the Rangers, I worked as a machine learning engineer and front-end developer for a company in Chicago called McMaster-Carr.
Without any further ado, I’ll hand it over to my colleague Ryan.
Ryan Stoll:
Thanks Alexander.
I'm Ryan Stoll, a data engineer with the Texas Rangers.
I've been with the Rangers for about a year and a half. Before that, I was a business intelligence analyst at Canon USA in Long Island, New York, and also an IT consultant with Ernst & Young in New York City.
Moneyball
In case you're not familiar, Moneyball is a book written by Michael Lewis that was also made into a movie in 2011.
It starred Brad Pitt and Jonah Hill as Oakland A's executives who used data to make decisions and keep the A's competitive in an unbalanced Major League Baseball landscape.
Although there had been years of research and thousands of words written about using statistics to make smarter baseball decisions — by people like Bill James — this is often pointed to as the first example of a major league organization really buying into this approach.
One of the main ideas described in the book and movie is placing more emphasis on on-base percentage than batting average.
In case you're unaware, these are two key statistics in baseball.
Up until this time, batting average had been one of the leading metrics that teams used to evaluate player performance.
Even today, players are often described by their batting averages — as in, “this player is a .300 hitter,” which means he’s above average.
However, a key discovery described in Moneyball was that on-base percentage (OBP) has a higher correlation with total runs scored than batting average.
After all, scoring runs is how teams win games. So this is a very important thing to measure.
The main difference? Batting average excludes walks — when a batter takes four pitches out of the strike zone and gets to first base.
That walk is beneficial, but it’s not counted in batting average, only in OBP.
It’s hard to say why walks were undervalued for so long. Possibly because they were seen as the pitcher’s failure instead of the batter’s skill.
But now we know: plate discipline and drawing walks are repeatable, measurable skills that help a team win.
As they say: Billy Beane identified a market inefficiency.
This allowed the Oakland A’s to acquire players undervalued by the market, by using OBP instead of traditional stats.
They could win more games — at a lower price point.
This approach has left a legacy far beyond baseball.
It has spread to football, basketball, tennis, golf, and more.
Statcast Revolution
For those unaware, Statcast is the name given to the current state of baseball tracking technology operated by Major League Baseball. This allows for the collection and analysis of massive amounts of baseball data. As you’ll see, Statcast has been powered by different technologies over the years, but the name refers to the latest iteration of advanced baseball tracking.
Five years after the A’s began using their Moneyball approach, MLB introduced PITCHf/x in 2006, which was a three-camera tracking system that was created and maintained by a company called Sportvision. PITCHf/x could automatically track speed and trajectories of pitched balls, which allowed for a consistent visual representation of pitches as well as categorization of pitches. This was a huge leap forward for baseball tracking technology, and it opened the door for future, more advanced systems to be developed.
In 2015, Statcast, which is a combination of camera and radar systems, was installed in all 30 major league ballparks for the first time. It provided radar and HD video measures for all action on the field on a per-pitch basis. This was further enhanced by technology developed by TrackMan, a company previously focused on golf, that uses Doppler radar to pick up ball flight metrics for the pitcher and hitter. Some of the metrics this system produces have entered the baseball fan’s lexicon, like spin rate, horizontal and vertical movement, hit exit speed, and launch angle.
Finally, MLB made the switch in 2020 to using Hawk-Eye to power Statcast, which could do everything the previous system could and more. You may have heard of Hawk-Eye as the camera system that powers instant replays in tennis. However, in baseball, Hawk-Eye consists of 12 high-speed cameras installed around the ballpark, which are dedicated to either pitch tracking or tracking players and batted balls. This system raised the percentage of batted balls that get tracked from 89% to 99%, in part because of this multi-camera approach. Beyond pitching and hitting, it can also track running and fielding in the form of sprint speed, base-to-base times, arm strength, catch probability, and much more.
More recently, MLB has made skeletal data available to its member clubs. This comes in the form of x, y, and z coordinates for 18 points on a player’s body, like the shoulder, elbow, wrist, and knee, and this is on a 30-frame-per-second basis. As you can imagine, this easily produces millions of data points for each game that our analysts need to be able to make sense of, and traditional programs running on local machines are no longer sufficient for this problem.
One such use of this skeletal data is for fan engagement and entertainment. MLB FieldVision takes the limb tracking data and transforms it into a 3D experience that enables fans to watch plays unfold from never-before-seen camera angles. Major league teams, on the other hand, might use the skeletal data to develop their own metrics, like fielding and base running metrics, that get to a much deeper and more nuanced level of what each player is doing on a particular play.
Another field of exploration enabled by Hawk-Eye is observed spin and seam tracking. Before Hawk-Eye, capturing actual pitch spin and direction was impossible. Analysts could only go off of a calculated or inferred spin, which was determined with models using pitch trajectory and speed, among other factors, and the idea that an object’s rotation has an effect on its path. However, some pitches weren’t following the path that one would expect using the available factors alone. This led to the concept of seam-shifted wake, which is the idea that the seam orientation of a pitch has a measurable and significant impact on its flight path. This effect is caused by the asymmetry in the rough versus smooth side of the ball, basically where the seams are. Hawk-Eye allowed this to be directly measured for the first time, as well as spin rate and direction, and has really opened the door for advanced pitching analysis and design. This created new positions in baseball and even whole companies within the sport.
One final example I’ll tell you about of big data that MLB makes available to all 30 clubs is LiDAR scans and weather tracking. The LiDAR scans provide us with high-resolution 3D representations of the 30 major league ballparks. This is done by measuring the time that it takes for reflected light emitted from an airborne object to return to its source. It’s not just to measure outfield wall distance either, but the shape of the ballpark in general, which can have an effect on flight path. As you can imagine, this detailed mapping, coupled with weather data refreshed every five minutes and the right tools, allows us to answer the question: how do the specific ballpark and weather characteristics affect the game?
So that’s data that MLB makes available to all 30 clubs. But this is a slide we put together to list some of the technologies that clubs can use to gain additional insights on their players. Some of these are in the realm of high-speed motion capture, both markered and markerless; bat sensors that can tell you things like swing speed and swing path; and things like force plates, which measure where a pitcher or hitter is placing their weight. All of these things continue to contribute to the big data landscape that all 30 teams are undoubtedly having to grapple with.
To give you an idea of what this technology actually looks like, this is an example of a state-of-the-art pitching lab at Wake Forest University, and this employs those high-speed cameras, motion capture technology, and force plates that we talked about. This allows for the analysis of pitcher mechanics and the development of custom training programs aimed at reducing injury risk and enhancing player performance. So this explosion in baseball data is not only taking place at the major league level, but also in the minor leagues and even amateur teams like colleges and universities.
The Problem
So now we’ll shift gears to discussing how the Texas Rangers handle big data and utilize Databricks to execute and centralize our analyses. Like any other enterprise, baseball front offices have internal departments dedicated to different areas that keep the organization moving forward. These departments have typically had their own data, their own reports, and their own way of doing things, which would be more transparent in an ideal world. After all, everyone wants to consume data, from the players to the coaches to the highest levels of the front office. You see our manager Chris Woodward there wearing an analytics T-shirt. This problem of consolidating our information is made even more difficult by all the technologies we utilize that may or may not have integrations with one another. It’s also hard, if not impossible, to predict the future in terms of where we’ll be five years from now and what the best choice of software will be in the long run. This is where Databricks and the Unified Analytics Platform come in.
For my job in particular, as the lead data engineer, I’m responsible for setting up the pipelines that ingest data from different types of sources. These data pipelines have to be as agile and resistant to failure as possible. Like a lot of you, we get data from APIs, FTPs, databases (both external and internal), and cloud buckets, and these data can come in the form of CSV, JSON, Parquet, or video, all of which we have to handle in the most efficient manner possible. Before Databricks, we had different ingestion scripts written in different languages, running on different on-prem and cloud-based servers, all saving to different databases.
But with Databricks notebooks saving to Delta Lake, we’re able to centralize our ingestion scripts that extract data from all the different sources. They can flatten, transform, and clean our data and save to stage tables before ultimately landing in our enterprise data warehouse. By using Spark, Koalas, and the new integration of Koalas into PySpark, we can perform distributed extraction requests. This has become necessary for us to transform millions of pitches with as much compute as required, and we can do this at the speed of Spark. The amount of data we receive has only increased with each passing year and will only continue to increase, so we need to be able to leverage the Databricks platform and all of its latest features to keep up. I’ll now hand it over to my colleague Alexander to talk about how we use Databricks in the analytics space.
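The idea of one centralized ingestion script that flattens, transforms, and cleans records from several formats can be sketched as follows. This is a minimal standard-library illustration, not the Rangers’ Databricks code: the field names (`pitch_id`, `speed_mph`, `pitch_type`), the `PARSERS` table, and the sample payloads are all hypothetical.

```python
import csv
import io
import json


def parse_csv(text: str) -> list:
    """Parse a CSV payload into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


def parse_json(text: str) -> list:
    """Parse a JSON payload that holds a list of records."""
    return json.loads(text)


# One parser per supported source format (hypothetical registry).
PARSERS = {"csv": parse_csv, "json": parse_json}


def to_stage_row(raw: dict) -> dict:
    """Clean one raw record into the common schema of the stage table."""
    return {
        "pitch_id": str(raw["pitch_id"]),
        "speed_mph": float(raw["speed_mph"]),
        "pitch_type": str(raw["pitch_type"]).strip().upper(),
    }


def ingest(sources: list) -> list:
    """Run every (format, payload) pair through the same cleaning step."""
    rows = []
    for fmt, payload in sources:
        rows.extend(to_stage_row(record) for record in PARSERS[fmt](payload))
    return rows


# Hypothetical payloads: one from an FTP drop (CSV), one from an API (JSON).
csv_payload = "pitch_id,speed_mph,pitch_type\n1,81.0, cu\n"
json_payload = '[{"pitch_id": 2, "speed_mph": 95.5, "pitch_type": "ff"}]'
stage_rows = ingest([("csv", csv_payload), ("json", json_payload)])
print(stage_rows)
```

The design point is that every source, whatever its format, ends up in the same cleaned shape before it reaches the warehouse. In a Spark setting the same steps would run in a distributed way.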
Thank you so much, Ryan, for your amazing background on the Statcast revolution, the technology that we are currently importing, as well as our data engineering challenges. On the analytics side, we really wanted to focus on a concept that is near and dear to my efficient and automated heart, and that is machine learning operations. With machine learning operations, we are able to track our machine learning models as they iterate and change from development to production. Further, by having our machine learning operations occur in the same unified analytics platform as our data engineering, we can connect our models with our data in exactly the same place that it’s being processed. This allows us to score and generate predictions as soon as our data is extracted and transformed. By doing this, we are able to communicate our insights super quickly to our stakeholders, including our players and coaches. Before, it would take up to 24 hours after a game finished before our predictions and metrics could be relayed to players. Now we’re able to provide those predictions in a matter of hours.
So I mentioned machine learning operations at the top of this discussion. What exactly is machine learning operations? Well, MLOps takes its name from a combination of DevOps as well as data engineering and machine learning. DevOps is characterized by a couple of key principles: shared ownership, workflow automation, and rapid feedback. Automation is a core principle in the DevOps pipeline, and it translates well to what we do with machine learning operations. We need to have continuous integration, continuous deployment, and automated promotion of models, as well as being able to track when our models fail, to be able to communicate and iterate. To maintain our competitive advantage, MLOps involves building, deploying, and maintaining these machine learning models reliably and continuously in an automated way. Further, machine learning operations allows us to have code reviews and peer reviews of all of our models, to make sure that everyone is able to understand what the purpose and the output of each model actually is.
Some benefits of machine learning operations are going to be the same types of benefits that you would see from any DevOps platform. Since everyone has access to all of the models stored in the registry, that increases transparency. We have easy peer and code reviews of our outputs, and we can recommend new features or new transformations that we found efficient in the past. Further, these models can be retrained on a schedule (every month, every week, every night), and the newly trained model can be promoted to production, all automatically on scheduled jobs. We can also monitor our models: we can monitor for shift in our targets, and we can monitor for changes in our metrics. This allows us to go back and iterate on our models more effectively, before the drift in our metrics occurs too far down the pipeline and is in front of our stakeholders. All model changes are tracked, and this is very important to us. We use GitHub a lot in terms of our coding expertise; however, tracking changes to models in GitHub is not the most optimal way. As you’ll see, Databricks provides a platform called MLflow that allows us to do all of this and more.
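Monitoring for changes in a metric can be as simple as comparing recent values against a baseline. The sketch below is a deliberately naive illustration of that idea. The function name `metric_drifted`, the tolerance of 0.05, and the sample accuracy values are assumptions, and real monitoring would normally use more robust statistics.

```python
from statistics import mean


def metric_drifted(baseline: list, recent: list, tolerance: float = 0.05) -> bool:
    """Return True when the recent average of a metric has dropped
    more than `tolerance` below its baseline average."""
    return mean(baseline) - mean(recent) > tolerance


# Hypothetical nightly accuracy values for one model.
baseline_accuracy = [0.84, 0.85, 0.83]
stable_week = [0.83, 0.82, 0.84]
degraded_week = [0.78, 0.77, 0.76]

print(metric_drifted(baseline_accuracy, stable_week))
print(metric_drifted(baseline_accuracy, degraded_week))
```

A scheduled job could run a check like this after each retraining and flag the model for review before its outputs reach players and coaches.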
ML Operations
MLflow integrates with every machine learning environment that you can think of, everything from TensorFlow, PyTorch, Spark, and scikit-learn even to AutoML platforms like H2O. You can have more exotic models like XGBoost, LightGBM, Conda, and fast.ai. It’s honestly amazing, all the different integrations that are tracked inside of that model. There are dozens of companies that utilize MLflow. We’re obviously not the first to use ML operations, but we are seeing the impact on our organization already. My one last note here is that we’re able to track models built in different languages, as we have analysts who use Python and R and other modeling languages themselves. Being able to have all of these models in one location allows for further benefits of sustainability.
So machine learning operations really comprises two key features: model tracking and the model registry. Model tracking is important: it allows us to log features, parameters, different model algorithms, and metrics for any single machine learning problem. Further, all of these models can be reproduced. In a typical machine learning development cycle, we are trying dozens of different algorithms across dozens of different feature stores, with many different hyperparameters as well. Comparing all these models to figure out the most optimized and efficient combination of hyperparameters, algorithms, and features can be draining. Doing that in Excel, you lose track of all your different columns. Writing it down in a notebook? Who writes down anything anymore? So being able to automate and track all of these experiments using MLflow provides a huge benefit in terms of model comparison. It also allows us to track our train of thought: we can see how a particular model evolved over time.
The second key component of machine learning operations is the model registry. The model registry is essentially a centralized cloud storage location for machine learning models built in both Python and R, as well as integrating with automated machine learning frameworks. All previously stored versions of a model are saved and can be promoted through development, staging, and production environments. So once we’ve used model tracking and experiments to select a final model, we can put it in the model registry, and then we can QA it in our staging environment before deploying to production. Further, as we iterate on our model and create a second version that we want to promote, we can deprecate the original and promote the second model automatically, and we can still see how they have changed. This is essential for us because we want to compare how our predictions have changed as our models have changed, so being able to reference these old deprecated versions allows us to do that.
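To make the registry idea concrete, here is a toy in-memory version written in plain Python. It is not the MLflow API: the classes `ModelVersion` and `ModelRegistry`, the stage names, and the sample parameters and metrics are all hypothetical, chosen only to show versions being kept, promoted, and archived.

```python
from dataclasses import dataclass, field

# Hypothetical lifecycle stages, loosely following the talk.
STAGES = ("development", "staging", "production", "archived")


@dataclass
class ModelVersion:
    version: int
    params: dict
    metrics: dict
    stage: str = "development"


@dataclass
class ModelRegistry:
    models: dict = field(default_factory=dict)

    def register(self, name: str, params: dict, metrics: dict) -> ModelVersion:
        """Store a new version; earlier versions are never deleted."""
        versions = self.models.setdefault(name, [])
        model_version = ModelVersion(len(versions) + 1, params, metrics)
        versions.append(model_version)
        return model_version

    def promote(self, name: str, version: int, stage: str) -> ModelVersion:
        """Move one version to a stage; a new production version
        archives the previous production version."""
        if stage not in STAGES:
            raise ValueError(f"unknown stage: {stage}")
        for model_version in self.models[name]:
            if (stage == "production" and model_version.stage == "production"
                    and model_version.version != version):
                model_version.stage = "archived"
        target = self.models[name][version - 1]
        target.stage = stage
        return target

    def production(self, name: str):
        """Return the version currently in production, if any."""
        for model_version in self.models[name]:
            if model_version.stage == "production":
                return model_version
        return None


registry = ModelRegistry()
registry.register("strike_probability", {"max_depth": 4}, {"accuracy": 0.82})
registry.promote("strike_probability", 1, "production")
registry.register("strike_probability", {"max_depth": 6}, {"accuracy": 0.84})
registry.promote("strike_probability", 2, "production")
print(registry.production("strike_probability").version)
```

Because the first version is archived rather than deleted, its parameters and metrics stay available for comparison with the second version.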
The other benefit of the registry is hosting our models behind a REST API endpoint. Anyone in our network can post a data set to this endpoint and receive predictions back. I won’t get too far ahead of myself here, but this will really help us, especially when we start integrating streaming services.
In summary, by using MLflow within the Databricks Unified Analytics Platform, the Texas Rangers R&D department has created a centralized machine learning repository to host all of our models. By centralizing our models in this repository, our team has identified duplicated models that we have been able to eliminate, and we can provide a single, constant source of truth: one model for whatever we’re trying to measure. We have one model for pitch evaluation, one model for strike probability or hit effectiveness, and these models can be used by anyone across all of our departments. We’re no longer siloed, with someone in one department building a model that is also being built by a second person in a second silo. We have one centralized location with transparency and automated access. These models can be integrated into our unified data pipeline, and this is also, as I mentioned at the beginning, really important to us. As Ryan discussed with our data engineering pipelines, being able to score predictions using models in the same location gives us, again, one location and one place to put everything together. This unification really does help tear down our silos, as well as make our communication faster and our time to insights shorter.
Streaming Data
So what’s one use case that we can handle with this amazing data engineering pipeline and machine learning operations workflow? Well, streaming. Streaming is another impact of big data: we are receiving data super quickly, with a high velocity, during games. This screenshot is actually coming from the MLB app on your phone. If you follow a game on ESPN or MLB, you will see these numbers pop up. You’ll see the player statistics, you’ll see the trajectory of the ball, and you’ll also see the outcome of the play. However, this is not the only data that we receive during games. When you go to a game and look at the scoreboard, you’re going to see something that looks like this. Again, we still have our career stats for our players, but in that toolbar at the bottom of the screen you’ll see exit velocity, exit angle, and distance. This is tracking the speed of the ball off the bat, it’s tracking the angle at which the ball went off the bat, and it’s tracking how far the ball went into the stadium. This is ball tracking. This is essentially the Statcast data that Ryan discussed, coming to us in real time and appearing on our scoreboard. We also see that the pitch that was thrown was 81 miles an hour: again, more ball tracking that’s coming our way. All of these numbers (exit velocity, movement, sprint speed) we are receiving as they happen during a game.
So how can we use this information from bullpens, batting practices, and even high school games to understand and make decisions as quickly as possible? We can use a technology called Auto Loader, which again is part of the Databricks Unified Analytics Platform. Auto Loader is an optimized cloud file source for Spark that loads data continuously and efficiently from cloud storage as new data continues to arrive. Essentially, as long as the data is being loaded into cloud storage, we can run an Auto Loader listener job to take that data and bring it into Databricks and Delta Lake. Further, because our models are also hosted on Databricks, we can score that streamed data into a silver table before finally pushing it into a gold table that can be sent downstream and have reports built off of it. So instead of having, as before, multiple jobs to load multiple different data sources, for example from our APIs and from our FTPs, we can just have one location, our cloud storage bucket, and one listener job with Auto Loader to read all of our data in using Spark. This set-and-forget model really eliminates the complicated setup of using multiple ingestion scripts and multiple ingestion listeners. As mentioned, we’re able to stream in data using APIs in the form of JSON, but other streaming data comes from FTPs in the form of CSVs. With Auto Loader, we can put together a script to load them into cloud storage, where they can then be scored using our machine learning models and pulled automatically into our data lake. As soon as the data is received, we can predict on it, generate our metrics, and send that information as quickly as possible to the players and coaches who are using it to make decisions in game.
So what kind of decisions can be made off of data like this? Well, here you will see a strike zone, but not just any strike zone: this is a strike zone for an umpire. If you’ve ever watched a game of baseball, you will have groaned and cheered as the umpire makes good or bad calls for your team as well as the opposing team. We’ve all seen that pitch really inside, almost hugging our batter’s ribs, be called a strike, or that pitch right down the middle, or one that hit the edge, that the umpire called a ball. Umpires have tendencies. Some umpires are more likely than others to call different pitches strikes or balls against right-handed or left-handed batters. Using data streamed during a game, we can approximate an umpire’s tendencies while the game is occurring. This can really help us, especially in the latter half of a game, understand exactly where we need to be throwing or locating our pitches to most likely get a strike, given an umpire’s opinions or tendencies for that specific night. As an example, imagine one umpire who is calling pitches in the lower left-hand corner of the zone balls. We can identify that within the first couple of innings and shift our strategy to make sure that our pitcher is throwing more inside the actual zone instead of at that outside corner. Conversely, if we know that the umpire is calling an inside strike more often than not, then we can also let our pitchers know that within the first couple of innings, and that way we can again shift our strategy. We can try to use this tendency to gain a competitive advantage during a game. This is only one example of how predicted model outputs using MLflow and Auto Loader can change the game. We can also use streamed data to detect and approximate fatigue in our pitchers, as well as look at exactly what our batter’s swing is producing. And going back to our batter’s swing, that’s really going to lead into the case study that we have next.
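Approximating an umpire’s tendencies during a game amounts to keeping a running called-strike rate per location. The sketch below shows that bookkeeping in plain Python. The zone labels, the `called_strike_rates` function, and the sample calls are hypothetical, and a real system would work from pitch coordinates and a trained model rather than hand-labeled zones.

```python
from collections import defaultdict


def called_strike_rates(calls) -> dict:
    """Share of taken pitches called strikes, per zone.

    `calls` is an iterable of (zone, call) pairs where call is
    either "strike" or "ball".
    """
    counts = defaultdict(lambda: [0, 0])  # zone -> [strikes, total]
    for zone, call in calls:
        counts[zone][1] += 1
        if call == "strike":
            counts[zone][0] += 1
    return {zone: strikes / total for zone, (strikes, total) in counts.items()}


# Hypothetical calls observed over the first couple of innings.
early_calls = [
    ("low_left", "ball"),
    ("low_left", "ball"),
    ("low_left", "strike"),
    ("inside", "strike"),
    ("inside", "strike"),
    ("low_left", "ball"),
]
rates = called_strike_rates(early_calls)
print(rates)
```

With so few pitches the rates are noisy, so in practice they would be read as a hint for adjusting pitch location rather than as a firm estimate.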
New Science of Hitting
Case study: the new science of hitting. So far, we’ve really talked about baseball technology. We’ve discussed how the Texas Rangers are ingesting that baseball technology using Databricks in both a data engineering and a machine-learning-driven way. But the point of all of this data is to make our players better. So how can we use these sheer vast amounts of data that we’ve already discussed and come up with a strategy to make our players the best version of themselves? Let’s quickly talk about home runs. In 2017, home run rates started to skyrocket across the league. While I’m not going to be talking about changes to the ball, or juiced or deadened balls, in this conversation (I’ll let you do your own research there), we are going to talk about barrels. Hitters were quoted as trying to optimize specific launch angle and exit velocity combinations to achieve a barrel. So what exactly is a barrel, and how can we tell this story using data?
For this project, we wanted to load in millions of pitches that have been thrown at the major league level since 2019. In fact, we managed to load in about two million pitches, again at the speed of Spark. Our goal here is to look at these pitches, look at only the hits that were made, and see if some kind of sweet spot, or combination of launch angle and exit velocity, can be taken into account to predict the likelihood of a hit. From these 2 million pitches, we were able to filter down to 300,000 hits or balls in play, and we can use this data to predict a hit probability.
So the features that we used to predict the probability of a hit were these. The launch angle is the up-down angle at which a ball leaves the bat. We also looked at exit speed, which is the velocity of a ball off of the bat. Hit spray angle is the left-to-right angle at which the ball left the bat: was it closer to third base or first base? We also had a couple of categorical variables: infield positioning and outfield positioning. One of the new revolutions in baseball has been the idea of a defensive shift. By putting more players on one side of the infield, we can increase our likelihood of getting an out, especially if the hitter has a tendency to hit the ball in that direction more often than not. So both the infield and the outfield shift to position their players optimally. How does that affect the probability of a hit? We will be investigating that here as well. Finally, we bring in the batter handedness and pitcher handedness, as these features can affect the probability of a hit. Lefties and righties can produce different batted balls. You may have heard of pulling: if a left-handed hitter pulls the ball, it’ll go more towards right field. I’m trying to imagine my ballpark now: left-handed hitters hit the ball this way, and right-handed hitters hit the ball this way, more often than not.
So what model did we create? We created an XGBoost model. We split our data into a 75/25 train/test split and created an XGBoost classification model on this data, and it actually performs pretty well. We have an 84 percent accuracy, and our F1 scores on both hits and outs are fairly high as well. Our ROC curve looks sustainable, and we’re pretty happy with this model. However, the true revolution happens when we examine our feature importances. Looking at our feature importances, we see a couple of things that should probably be intuitive by now. Launch angle and launch speed were the most impactful features in our model. This makes sense because, as we mentioned before, hitters are trying to find that optimal combination of launch angle and launch speed. However, something that stood out to us was the fact that the infield shift as well as left-handed batting are also pretty important in determining whether or not a ball in play goes for a hit. In case you don’t know, left-handed hitters are more likely to have a shift on against them, because left-handed hitters are more likely to hit the ball between first base and second base. Right-handed hitters are more likely to hit the ball across the entire ballpark, but lefties are going to be pulling the ball more often than not. So many, many balls in play by left-handed hitters have died because of the shift. This is actually causing Major League Baseball to explore rule changes to bring offense back into the game, such as banning the shift, which could happen as soon as next year. This would benefit left-handed hitters greatly, as they’d no longer have to worry about their ground balls being snagged by a shifted shortstop.
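The accuracy and F1 figures quoted for the model are computed from counts of correct and incorrect predictions. The sketch below shows those calculations on a tiny made-up sample. The function name `classification_metrics` and the labels (1 for a hit, 0 for an out) are assumptions for illustration, and the numbers are unrelated to the Rangers’ actual results.

```python
def classification_metrics(y_true: list, y_pred: list) -> dict:
    """Accuracy, precision, recall and F1 for the positive class (1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # hits predicted as hits
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # outs predicted as hits
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # hits predicted as outs
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # outs predicted as outs

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Made-up labels for eight balls in play: 1 = hit, 0 = out.
actual = [1, 1, 1, 0, 0, 0, 0, 0]
predicted = [1, 1, 0, 0, 0, 0, 0, 1]
scores = classification_metrics(actual, predicted)
print(scores)
```

Reporting F1 alongside accuracy matters here because outs are more common than hits, so a model could reach a decent accuracy while still missing many hits.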
Hit Probability Graph
By looking at the launch angle and exit velocity, we can plot our hit probability against those top two features, and the resulting graph is actually something beautiful. This is one of the most famous graphs in modern offensive baseball strategy, and I’m sharing it with you today. As you’ll notice, we have a bunch of red dots and blue dots. Blue dots are balls that have a very low hit probability; they are likely going to be outs. Red dots are more likely to be hits, and we see there are about two different patterns here. We have a huge red blob to the far right of our graph, and we also have a red swoosh in the middle. Let’s start with our red blob at the end. As you may have guessed, all of those are home runs. If you can hit the ball over 100 miles an hour off the bat at an angle between 20 and 35 degrees, you will almost always hit a home run. Of course, that’s dependent on the ballpark that you’re hitting in, but ballpark-agnostic, those balls are almost always leaving the field. So while that accounts for our far-right blob, what about the swoosh that we have in the middle of this graph? As you may notice, any ball hit between 60 miles an hour and 100 miles an hour at a launch angle of, again, 20 to about 35 degrees seems to always land for a hit. This is because, at the weaker exit velocities, these balls will land over the heads of the infielders but in front of the outfielders. And as our exit velocity goes up, these hard-hit balls suddenly go over the heads of our outfielders and bang off the wall, and our guy hits a double.
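The two red regions described in the talk can be written down as a small rule of thumb. This is only a sketch: the thresholds are the rough figures quoted by the speaker, and the function name `batted_ball_outlook` and its labels are hypothetical. The actual model outputs a probability rather than fixed categories.

```python
def batted_ball_outlook(exit_velocity_mph: float, launch_angle_deg: float) -> str:
    """Rough outlook for a batted ball, using the ranges quoted in the talk."""
    in_sweet_spot = 20 <= launch_angle_deg <= 35
    if not in_sweet_spot or exit_velocity_mph < 60:
        return "lower hit probability"
    if exit_velocity_mph > 100:
        return "likely home run"  # the red blob on the far right
    return "likely hit"  # the red swoosh in the middle


print(batted_ball_outlook(105, 28))
print(batted_ball_outlook(75, 25))
print(batted_ball_outlook(95, 5))
```

A rule like this ignores the ballpark, the defensive positioning, and the spray angle, which is why the talk relies on a trained model instead.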
⸻
So here we can look at the optimal combinations to hit singles, which are going to be between 60 and 80 miles an hour at 20 to 35 degrees, or doubles and triples, which will be between 80 and 100 miles an hour off the bat, again at around 20 to 25 degrees of launch. This has helped revolutionize a term called the sweet spot. This is what the launch angle revolution is: batters have realized that if they hit the ball between 20 and 35 degrees, and they hit it hard enough, as in over 60 miles an hour, then that ball is going to drop for a hit. It may be a single, it may be a double, but either way they are getting on base. And this takes us all the way back to the beginning of our presentation, to the great Billy Beane: “You get on base, we win. You don’t, we lose. And I hate losing.”
Modern baseball strategy has evolved from the on-base percentage market inefficiency of the Moneyball era. In this new age of big data, we are still trying to optimize the Moneyball strategy of getting on base. However, we can do so now with a more involved, intensive, and specific strategy: hit the ball at this angle, at this speed, and it will likely drop for a hit. If it drops for a hit, then you are on base, and that is going to help your team win. Not only has this impacted our hitters, but it has also impacted our pitchers. How can we pitch the ball to a hitter so that they will not hit the ball at this specific angle, at this specific speed? How can we throw the ball in such a way that the hitter will not achieve that barrel or hit that sweet spot? This is only the beginning of the revolution in baseball. As Ryan discussed earlier, there is so much more going on in the area of pitch design, as well as sprint speed and defensive alignment, as well as optimizing our player development to be able to consume this data effectively. We want our players coming up from the minor leagues to be able to hit this sweet spot, throw the best pitches possible, and know exactly how to run to get a rogue fly ball. The big data revolution in baseball has only just started. Thank you so much for your time and attention. I’ll hand it back over to my colleague Ryan Stoll for his final words.
Thanks, Alexander, great job. I hope that you learned a little bit more about how baseball teams are using advanced metrics to their advantage and how we’re using Databricks to make sure that we can stay at the top of the competitive landscape. Thank you again, and enjoy the rest of the conference, everyone. Thank you so much for joining.
Exercise 3-1
Here is an exercise based on the film.
- You are going to watch the film Moneyball. While watching, take notes on the following:
  - Everything related to data, information, and statistics
  - Try to understand the strategies and how data and information contribute to them.
  - Do data and information help with decision-making?
  - The tools they use
  - Key words
- Prepare a PowerPoint presentation that explains how data analysis has changed the way people make decisions. Your presentation must include the following elements:
  - Watch the 3 YouTube videos and present the main ideas and concepts discussed in each video.
  - Show how the film Moneyball illustrates the change from the traditional method to the new, data-driven method used to build a baseball team’s strategy.
  - Explain how people reacted to this new method within the story (supporters, skeptics, conflicts, etc.)
  - Share your personal opinion about the integration of data into strategic decision-making, based on what you learned from the film.
  - How has the strategy shown in Moneyball changed the world of sports?
  - What are the limitations of a 100% data-driven approach?
  - Can the same logic be applied in other fields (HR recruitment, marketing, etc.)?
  - Try to find information or interviews about how baseball players feel about analytics and how it affects their style of play.
And AI?
What Moneyball Made Possible for AI Afterwards:
Moneyball is not about AI, but it perfectly illustrates the transition from a sport based on intuition to a sport based on data. It is a key step in the history leading to artificial intelligence applied to sports.
- It paved the way for the massive use of sports data, a necessary condition for the development of AI.
- It showed that objective models could outperform human intuition in complex decision-making.
- Today, AI tools are used to do much more:
  - Image recognition to track players,
  - Real-time strategy optimization,
  - Generation of predictive player profiles via deep learning.
Digital agency Parisweb.art
Learn all about Julie, our digital project director:
https://www.linkedin.com/in/juliechaumard/