Whoops
Alright, so maybe I’m a liar and part 2 took…a while…to come to fruition. Sorry about that. In my defense, I’ve traveled to four different countries for work, worked on sprint step analysis for @DrBradDeWeese (hopefully we can share some of that soon), and bought a new laptop without any of my blogging packages installed so at least cut me a little slack. Now, where were we…
A Refresher
I started off last time by mentioning “Putting the ‘I’ Back in Team” [6]. If you haven’t read it, I would recommend starting there before going any further. There were three topics of conversation in the paper: MBI (Imma let everyone finish arguing before wading into that), SPC (last post), and a version of anomaly detection via mixed modeling (this post). If you’re interested in further background information on what I’ll be discussing today, here are the papers that originally sent me down the rabbit hole: [1,2,5]. Now, you’ll probably look at those papers and say, “Matt, those have nothing to do with mixed models,” and you wouldn’t be wrong. But they gave me some ideas for model building and how to define an outlier/anomaly (and next post will probably look at using non-linear methods to predict sRPE) so give them a read if you feel so inclined.
Anomaly Detection
The concept of anomaly detection is pretty straightforward: Given a dataset, can we detect data that are significantly different from our predictions? The specific field will then dictate what we do with this anomalous data. For instance, credit card fraud detection is probably the most well-known example of anomaly detection–your spending habits (typical purchase price, location, type of purchase, etc.) are used to build an individualized consumer profile. Purchases that fall significantly outside that profile (say, making a purchase in China without notifying your bank…I didn’t do that, for the record) are flagged and automatically rejected. In the case of my credit card company, the transaction is declined and I receive both a text and email asking if that was actually me making the purchase. Once I confirm my identity hasn’t been stolen I can have the merchant run my card again without issue, and the credit card company has new data to build a more robust consumer profile that slowly integrates my new (well, old but in a new country) spending habits into the model.
In reality, if you’ve been using traditional statistics, MBI, or SPC to identify outliers for follow-up in your monitoring data, you’ve been performing your own version of anomaly detection. That is, you’re looking for data outside the norm or what you would predict–an anomaly if you will. While SPC is great for taking an individualized approach to detecting changes in athletes’ physical characteristics, fatigue state, etc., one of the main issues for individual athlete monitoring is that SPC has a lead time; you need to collect a number of observations on an athlete before you can build a profile of what’s “normal.” Unfortunately, sport isn’t conducive to “sit and wait,” and non-compliance, absence, and injury make building an individual profile even harder. That’s where tools like mixed models step in. Instead of relying only on individual data (SPC) or pooled data (simple or multiple regression), mixed models attempt to strike a balance between individualization and the “wisdom of the crowd.” In theory, it’s the best of both worlds. Athletes with a ton of historical data will have more individualized random effects to complement the model’s fixed effects, whereas we can lean more heavily on the model’s fixed effects for athletes with less data (more on fixed and random effects in just a second).

RM ANOVA vs. Mixed Modeling
In the case of longitudinal analyses, mixed models (or you might see them called linear mixed models) are very similar to the garden-variety repeated-measures ANOVA you’re likely familiar with: a model is built to predict a dependent variable from an independent variable / series of independent variables (between-subjects effects, now called fixed effects) while also accounting for individual variation (within-subjects effects, now part of the random effects). There are some distinct differences between the two approaches, though. For one, mixed models don’t require complete or balanced datasets. That is, athletes with missing data won’t need to be thrown out or have missing values imputed. This fact alone makes mixed models incredibly powerful tools for model building in sports science. Another important difference between the two approaches is how random effects work. In RM ANOVA, we can specify that we’re analyzing data for the same individuals over time, but that’s about it. This can be a problem in the data we collect, where differences can exist between phases of the season (pre-season vs. non-conference vs. conference in the case of college sports) or between seasons (different playstyles, maturation, improved fitness levels, new coach, etc.). RM ANOVA isn’t able to account for this clustering or nesting of variables (note: we can model the interaction between these variables via fixed effects, but that’s a bit different), whereas mixed model random effects are. And let me tell you, random effects can get pretty complicated with especially complex designs, but I’ll leave that for someone smarter to explain. For instance, you can check out [3] or [4]. The latter contains everything you could ever want to know about mixed models and more…much more, while chapters 13, 14, and 19 of [3] will equip you with what you need to know in an easier-to-digest format.
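To make the difference in syntax concrete, here’s a rough sketch of how the two specifications compare in R. It assumes the rpe.data frame we’ll load in a moment (and the lme4 package), and it’s meant purely to illustrate the formulas rather than to be a model we’ll actually use:
library(lme4)
# classic repeated-measures ANOVA: athlete is the only error stratum, and the
# design needs to be complete and balanced
rm.anova <- aov(rpetl ~ phase + Error(athlete/phase), data = rpe.data)
# mixed model version: random intercepts for athletes nested within seasons,
# and missing or unbalanced observations are handled without dropping athletes
mixed.version <- lmer(rpetl ~ phase + (1 | season/athlete), data = rpe.data)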
Building a Model
As always, it’s much easier to show than tell with this stuff, so how about an example. You can find the data for this post here.
| athlete | date | season | session | activity | field.minutes | odometer | work.rate | plyr.ld | hi.run | duration | rpetl | phase |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Jonathan Sanchez | 2014-08-13 | Fall 2014 | 1 | Soccer 1 | 75.30000 | 7201 | 95.6 | 690 | 1440 | 90 | 450 | PRE |
| Dhaki al-Naim | 2014-08-13 | Fall 2014 | 1 | Soccer 1 | 69.46667 | 7300 | 105.1 | 592 | 1129 | 90 | 450 | PRE |
| Corey Klocker | 2014-08-13 | Fall 2014 | 1 | Soccer 1 | 76.61667 | 7586 | 99.0 | 660 | 1222 | 90 | 450 | PRE |
| Issac Martinez | 2014-08-13 | Fall 2014 | 1 | Soccer 1 | 80.66667 | 3880 | 48.1 | 379 | 32 | 90 | 450 | PRE |
| Phoenix Lee | 2014-08-13 | Fall 2014 | 1 | Soccer 1 | 59.08333 | 5774 | 97.7 | 539 | 718 | 90 | 360 | PRE |
| Carroll Bond | 2014-08-13 | Fall 2014 | 1 | Soccer 1 | 80.66667 | 7807 | 96.8 | 714 | 1080 | 90 | 540 | PRE |
| Neulyn Whyte | 2014-08-13 | Fall 2014 | 1 | Soccer 1 | 69.65000 | 7617 | 109.3 | 718 | 1300 | 90 | 720 | PRE |
| Jonathan Sanchez | 2014-08-14 | Fall 2014 | 1 | Soccer 1 | 82.33333 | 3962 | 48.1 | 435 | 314 | 90 | 450 | PRE |
| Bobby Landry | 2014-08-14 | Fall 2014 | 1 | Soccer 1 | 82.06667 | 3792 | 46.2 | 439 | 370 | 90 | 270 | PRE |
| Daniel Thompson | 2014-08-14 | Fall 2014 | 1 | Soccer 1 | 79.28333 | 3671 | 46.3 | 381 | 132 | 90 | 270 | PRE |
The full file contains four seasons’ worth of practice sRPE and GPS data. I’ve already done some data cleaning to remove data entry errors (about 40 records total across the four seasons; most were games incorrectly coded as practice), so you should be cognizant of the fact that your data won’t always be this “pretty.” Consistent data collection and entry protocols can help mitigate some of these problems, but it’s always good to double-check your data prior to modeling or making any decisions based on your models. We’ll use the data to build and test a predictive model for sRPE-derived training load (sRPE * duration; rpetl in the file) before applying it in a manner similar to [6].
If you’ve ever seen a GPS output file, you’ll know they can be pretty overwhelming. When I was with the soccer team, Catapult Sprint could spit out something like 250 variables (I think OpenField has even more). We exported 75; we actually looked at 7. Those other 70-ish variables were of the just-in-case variety–the variables you may want to look at one day, but realistically you know you’ll never give them a second glance. But hey, you have them. Anyway, of those seven variables, research has shown there’s a strong correlation between sRPE, total distance, and Player Load (PL) and a nearly perfect correlation between total distance and PL. We used both when it came to programming our practice sessions because they do provide slightly different information, but from a model-building standpoint the nearly perfect correlation (aka, multicollinearity) between total distance and PL is problematic. It will bias the model coefficients and affect our ability to interpret the model and individual variable importance within the model. So step one in our process will be to examine the correlation coefficients between the variables in our dataset. We can do this very easily with cor() from the base stats package and the corrplot package.
# load the packages used throughout this post
library(data.table)
library(corrplot)
library(ggplot2)
library(lme4)
# import the data
rpe.data <- read.csv("prediction-data.csv")
# convert the data to data.table format
rpe.data <- data.table(rpe.data)
# convert the date column to date format
rpe.data[, date := as.Date(date, format = "%m/%d/%Y")]
# convert all predictor variables to numeric; this will save us a lot of heartache later
rpe.data[, c("field.minutes", "odometer", "work.rate", "plyr.ld", "hi.run", "duration") :=
           lapply(.SD, as.numeric), .SDcols = field.minutes:duration]
# visualize the correlation matrix from cor()
corrplot(corr = cor(rpe.data[, field.minutes:rpetl]),
         method = "number", type = "upper", diag = FALSE)
Just as suspected, we have a nearly perfect correlation (0.94) between total distance and PL. A general rule of thumb I’ve seen when it comes to multicollinearity is a correlation coefficient > 0.8 or 0.9, so our 0.94 between total distance and PL definitely fits the bill. In such cases, you could choose either variable with essentially interchangeable results. For sake of consistency with other research, I’m going to use total distance, but you’re welcome to re-run the model with PL instead.
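If you prefer a more formal check than eyeballing the correlation matrix, variance inflation factors are another option. Here’s a minimal sketch, assuming you have the car package installed (vif.check is just a throwaway helper model, not part of the main workflow):
# fit a plain linear model with all candidate predictors, then check VIFs;
# values above roughly 5-10 are a common red flag for multicollinearity
vif.check <- lm(rpetl ~ field.minutes + odometer + work.rate + plyr.ld +
                  hi.run + duration, data = rpe.data)
car::vif(vif.check)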
# Drop player load from the data frame
rpe.data$plyr.ld <- NULL
# re-visualize the data
corrplot(corr = cor(rpe.data[, odometer:rpetl]),
method = "number", type = "upper", diag = FALSE)
That’s better. Let’s get to model building.
Visualizing the Data
The first step in the model building process is always to examine your data visually. Visualizations help us understand what we’re working with and can aid us in fitting models that are appropriate for the dataset. Let’s start off by plotting histograms of the pooled rpetl values and rpetl for each season.
# plot an overall histogram of the rpetl data
ggplot(rpe.data, aes(x = rpetl)) + geom_histogram() + theme_bw()
# plot individual histograms for each season
ggplot(rpe.data, aes(x = rpetl)) + geom_histogram() + facet_wrap(~season) +
theme_bw()
In both views of the data, there’s some pretty severe positive skew going on. If you’re familiar with common statistical practices in sports science, you already know how we’re going to deal with this…but humor me for now, please.
Let’s go ahead and plot the relationships between each of the potential predictor variables and rpetl. Each colored line represents the regression line for its respective season. We’ll leverage the gridExtra package to combine the individual plots into a single graphic.
rpe.minutes <- ggplot(rpe.data, aes(x = field.minutes, y = rpetl)) + geom_point() +
geom_smooth(aes(colour = season), method = "lm", se = FALSE) + theme_bw()
rpe.odometer <- ggplot(rpe.data, aes(x = odometer, y = rpetl)) + geom_point() +
geom_smooth(aes(colour = season), method = "lm", se = FALSE) + theme_bw()
rpe.rate <- ggplot(rpe.data, aes(x = work.rate, y = rpetl)) + geom_point() +
geom_smooth(aes(colour = season), method = "lm", se = FALSE) + theme_bw()
rpe.hirun <- ggplot(rpe.data, aes(x = hi.run, y = rpetl)) + geom_point() +
geom_smooth(aes(colour = season), method = "lm", se = FALSE) + theme_bw()
rpe.duration <- ggplot(rpe.data, aes(x = duration, y = rpetl)) + geom_point() +
geom_smooth(aes(colour = season), method = "lm", se = FALSE) + theme_bw()
gridExtra::grid.arrange(rpe.minutes, rpe.odometer, rpe.rate, rpe.hirun, rpe.duration)
This time, pay attention to the shape of the relationships (using colour on the plots was a jumbled mess, so use your imagination or experiment with faceting by season for each variable). In each case, as the values of the variables increase, the spread of the resultant rpetl values also increases. Again, you probably know how we’re going to deal with this, but let’s start off with some basic models first.
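If you do want to try the faceted view, here’s one way to do it for total distance (swap in whichever variable you’re interested in):
# facet the total distance vs. rpetl relationship by season instead of colouring
ggplot(rpe.data, aes(x = odometer, y = rpetl)) + geom_point(alpha = 0.3) +
  geom_smooth(method = "lm", se = FALSE) + facet_wrap(~season) + theme_bw()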
Building the Model
First, let’s get acquainted with the syntax we’ll be using. I’m using the lme4 package today, but nlme is also extremely popular (and can do a few things lme4 can’t). If you’ve used lm() or aov(), the syntax is virtually identical. You’ll write an equation with the left and right-hand sides separated by “~”. The dependent variable is on the LHS, whereas the predictors are on the RHS. The major difference from lm() will be the inclusion of the random effect(s). They should be pretty easy to spot in the following formula.
# Build the random intercept-only model
random.intercept.model <- lmer(rpetl ~ (1|season/athlete),
data = rpe.data, REML = FALSE)
summary(random.intercept.model)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: rpetl ~ (1 | season/athlete)
## Data: rpe.data
##
## AIC BIC logLik deviance df.resid
## 36342.4 36366.3 -18167.2 36334.4 2926
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.1801 -0.6797 -0.2002 0.4533 4.5445
##
## Random effects:
## Groups Name Variance Std.Dev.
## athlete:season (Intercept) 586.5 24.22
## season (Intercept) 1093.5 33.07
## Residual 13801.3 117.48
## Number of obs: 2930, groups: athlete:season, 103; season, 4
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 201.29 16.89 11.92
This is called the random intercept-only model because, well, that’s the only thing in the model. In lmer, random effects are specified as (slope terms | grouping factor), so (1 | season/athlete) gives each athlete within each season their own intercept. This is where things get really complicated, and to be honest, I don’t completely understand all the intricacies. Regardless, the general gist is that we can model differences in both the level and slope of the relationship between the predictor variables and rpetl.
In the above model, we’ve defined athlete nested within season as a random intercept effect. That means the model assumes the slope of the relationship between the predictors and sRPE is the same across athletes, but the intercept of the model for each athlete is in a different location each season. This makes sense when you think about it. We know sRPE is a “global” indicator of training intensity; that is, it’s affected by both the physiological and psychological state of the athlete (fitness level, fatigue, outside stressors, etc.). So we would expect the resultant rpetl to differ in magnitude between athletes and to differ within athletes across seasons. We can see this in the example plot below. Notice a majority of the athletes’ regression lines are in different locations.
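If you’d like to produce a plot along those lines yourself, here’s one way to do it. Note that this just fits a separate ordinary regression line per athlete for illustration; it isn’t the mixed model itself:
# one regression line of rpetl on total distance per athlete
ggplot(rpe.data, aes(x = odometer, y = rpetl, colour = athlete)) +
  geom_smooth(method = "lm", se = FALSE) +
  theme_bw() + theme(legend.position = "none")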

We could also have a random slope-only model. In that case, the slope of the relationship between the predictors and the DV is allowed to vary, whereas the intercepts are not. In reality, this type of model would make little sense most of the time as differences in slopes would typically lead to differences in intercepts as well.
And finally, we can model both random slopes and random intercepts. In this case, both the level and slope of the relationship are allowed to differ. There is some evidence to support this approach as [1] showed variable importance differs across athletes, but modeling something like that is well beyond an introduction to the method. We’ll stick to a random intercept-only model for today.
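For reference, here’s roughly what a random slope for total distance would look like in lme4. This is purely to illustrate the syntax; we won’t fit or interpret it today, and models like this can be fussy about converging:
# random intercepts and random slopes for odometer, athletes nested within seasons
random.slope.model <- lmer(rpetl ~ odometer + (odometer | season/athlete),
                           data = rpe.data, REML = FALSE)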
Adding Fixed Effects
Model building is a pretty controversial subject. Different approaches exist, including entry of all variables into the model at once, sequential entry of variables into the model, and stepwise entry and deletion of the variables to find the model of best fit. The latter is especially controversial and can be seen as fishing for the best model. It is possible for this approach to over-fit the data and to make little sense from a practical standpoint (statistical significance is prioritized over practical significance), but it tends to be the go-to approach in building predictive models. We can overcome this approach’s shortcomings, though, by 1) using variables that have empirical support (e.g. total distance, heart rate load, pre-training wellness, etc.), 2) performing stepwise analysis on blocks of fixed effects (aka, feature sets; see [2] for an example), and 3) using cross-validation (siccing our models on previously unseen data to compare their abilities to generalize to new data). We’ll get into cross-validation some other time.
We already have our block of fixed effects to test (season, training phase, time spent active in drills [field.minutes], total distance, work rate, high-speed running distance [> 14.4 kph], and total practice duration), so we can begin adding them to the model and testing their overall effect. Updating models can be done a couple of different ways in R. You can manually write out the full call each time (see the above lme4 code as an example), or you can leverage the update() function. I’ll be using update(), so here’s the skinny on how it works: update() needs two pieces of information, 1) the model you want to update and 2) how you want to update it. We defined the random intercept model earlier, so we would tell the function we want to update random.intercept.model by calling update(random.intercept.model), then spelling out how we want to update it. The syntax is the same as what we used above, with the LHS being the dependent variable and the RHS being the predictor variables. So, for instance, if we wanted to add total distance to the model, we would call
# Using update() to update statistical models
total.distance.model <- update(random.intercept.model, .~. + odometer)
## Warning: Some predictor variables are on very different scales: consider
## rescaling
We’ll come back to that warning in a second. First, in the above code we tell R we want to update random.intercept.model by adding odometer to the fixed effects. We also tell the function to retain the already-established variables with .~.. Alternatively, if you wanted to start fresh on the predictor side of things, you could call update(random.intercept.model, .~ odometer + (1|season/athlete)).
Now that we’ve created our shiny new total.distance.model, we want to determine if adding total distance as a fixed effect statistically improved the model’s fit. This can be done in R with the anova() function.
# Compare models with the anova() command
anova(random.intercept.model, total.distance.model)
## Data: rpe.data
## Models:
## random.intercept.model: rpetl ~ (1 | season/athlete)
## total.distance.model: rpetl ~ (1 | season/athlete) + odometer
## Df AIC BIC logLik deviance Chisq Chi Df Pr(>Chisq)
## random.intercept.model 4 36342 36366 -18167 36334
## total.distance.model 5 34539 34569 -17265 34529 1805.3 1 < 2.2e-16
##
## random.intercept.model
## total.distance.model ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
There are four main things you can examine in this output: 1) the AIC, 2) the BIC, 3) the log-likelihood, and 4) the p-value for the chi-squared test comparing the log-likelihoods of the two models. AIC, BIC, and logLik all describe how well the models fit the data; lower AIC and BIC values indicate a better fit after penalizing model complexity, while a higher log-likelihood indicates a better fit. The change in logLik allows us to determine whether the change in model fit is statistically significant (see [3] if you’re interested in the math), although there are two important caveats: 1) the models must be fit with maximum likelihood, not restricted maximum likelihood (REML = FALSE), and 2) subsequent models should be nested. That is, later models should contain all the variables from previous models (this will be clearer in a second). We can see from the above output that including total distance as a fixed effect statistically improved the model’s fit, with a p-value below 2.2e-16. Let’s go ahead and add work rate, high-speed running distance, time spent in drills, and total practice duration to the model and compare each step along the way.
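Quick aside before we do: if you only want the information criteria and log-likelihoods without the full chi-squared table, you can pull them straight from the fitted models with the base R extractors:
# extract fit statistics directly from the fitted models
AIC(random.intercept.model, total.distance.model)
BIC(random.intercept.model, total.distance.model)
logLik(total.distance.model)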
model.1 <- update(total.distance.model, .~. + work.rate)
## Warning: Some predictor variables are on very different scales: consider
## rescaling
model.2 <- update(model.1, .~. + hi.run)
## Warning: Some predictor variables are on very different scales: consider
## rescaling
model.3 <- update(model.2, .~. + field.minutes)
## Warning: Some predictor variables are on very different scales: consider
## rescaling
model.4 <- update(model.3, .~. + duration)
## Warning: Some predictor variables are on very different scales: consider
## rescaling
anova(total.distance.model, model.1, model.2, model.3, model.4)
## Data: rpe.data
## Models:
## total.distance.model: rpetl ~ (1 | season/athlete) + odometer
## model.1: rpetl ~ (1 | season/athlete) + odometer + work.rate
## model.2: rpetl ~ (1 | season/athlete) + odometer + work.rate + hi.run
## model.3: rpetl ~ (1 | season/athlete) + odometer + work.rate + hi.run +
## model.3: field.minutes
## model.4: rpetl ~ (1 | season/athlete) + odometer + work.rate + hi.run +
## model.4: field.minutes + duration
## Df AIC BIC logLik deviance Chisq Chi Df Pr(>Chisq)
## total.distance.model 5 34539 34569 -17265 34529
## model.1 6 34456 34492 -17222 34444 84.818 1 < 2.2e-16
## model.2 7 34397 34439 -17192 34383 61.045 1 5.578e-15
## model.3 8 34365 34413 -17175 34349 33.916 1 5.756e-09
## model.4 9 33701 33755 -16842 33683 665.894 1 < 2.2e-16
##
## total.distance.model
## model.1 ***
## model.2 ***
## model.3 ***
## model.4 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Each subsequent model is a statistically significant improvement over the previous one. Let’s further add season and phase within season as fixed effects.
model.5 <- update(model.4, .~. + season)
## Warning: Some predictor variables are on very different scales: consider
## rescaling
## boundary (singular) fit: see ?isSingular
model.6 <- update(model.5, .~. + phase)
## Warning: Some predictor variables are on very different scales: consider
## rescaling
## boundary (singular) fit: see ?isSingular
anova(model.4, model.5, model.6)
## Data: rpe.data
## Models:
## model.4: rpetl ~ (1 | season/athlete) + odometer + work.rate + hi.run +
## model.4: field.minutes + duration
## model.5: rpetl ~ (1 | season/athlete) + odometer + work.rate + hi.run +
## model.5: field.minutes + duration + season
## model.6: rpetl ~ (1 | season/athlete) + odometer + work.rate + hi.run +
## model.6: field.minutes + duration + season + phase
## Df AIC BIC logLik deviance Chisq Chi Df Pr(>Chisq)
## model.4 9 33701 33755 -16842 33683
## model.5 12 33699 33771 -16838 33675 8.1387 3 0.04323 *
## model.6 15 33358 33448 -16664 33328 346.9856 3 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Again, inclusion of season and phase statistically improved the model’s fit. We could continue to make our model more complicated by modeling interactions and different random effects structures, but I don’t want to get too far into the weeds today.
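Purely for illustration, here’s what adding an interaction would look like syntactically, using total distance and duration as an example. I haven’t checked whether this particular term actually improves the fit:
# illustration only: add an odometer x duration interaction to model 6 and compare
interaction.model <- update(model.6, .~. + odometer:duration)
anova(model.6, interaction.model)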
Centering the Data
Instead, I want to draw your attention to the warning we kept receiving each time a new model was created: “Warning: Some predictor variables are on very different scales: consider rescaling.” Let’s print a summary of model.6.
summary(model.6)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: rpetl ~ (1 | season/athlete) + odometer + work.rate + hi.run +
## field.minutes + duration + season + phase
## Data: rpe.data
##
## AIC BIC logLik deviance df.resid
## 33358.3 33448.1 -16664.2 33328.3 2915
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.9356 -0.6017 -0.0475 0.4800 5.7434
##
## Random effects:
## Groups Name Variance Std.Dev.
## athlete:season (Intercept) 956 30.92
## season (Intercept) 0 0.00
## Residual 4796 69.25
## Number of obs: 2930, groups: athlete:season, 103; season, 4
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) -1.140e+02 1.622e+01 -7.028
## odometer 2.841e-02 4.529e-03 6.274
## work.rate 2.776e-01 2.037e-01 1.362
## hi.run 4.588e-02 6.904e-03 6.645
## field.minutes -8.601e-01 2.943e-01 -2.922
## duration 3.348e+00 1.375e-01 24.341
## seasonFall 2015 -1.998e+01 9.113e+00 -2.193
## seasonFall 2016 -1.090e+01 9.876e+00 -1.104
## seasonFall 2017 -3.363e+01 1.051e+01 -3.200
## phaseNC 1.211e+01 3.397e+00 3.564
## phasePOST 3.776e+00 4.262e+00 0.886
## phasePRE 8.011e+01 4.549e+00 17.609
##
## Correlation of Fixed Effects:
## (Intr) odomtr wrk.rt hi.run fld.mn duratn sF2015 sF2016 sF2017
## odometer 0.704
## work.rate -0.815 -0.892
## hi.run 0.010 -0.417 0.127
## field.mints -0.683 -0.823 0.791 0.197
## duration -0.305 -0.164 0.133 -0.052 -0.238
## sesnFll2015 -0.311 0.023 -0.005 -0.047 -0.034 0.044
## sesnFll2016 -0.310 0.033 -0.023 -0.029 -0.032 0.098 0.510
## sesnFll2017 -0.309 0.025 0.004 -0.063 0.000 0.048 0.479 0.452
## phaseNC 0.053 0.046 -0.074 -0.055 -0.014 -0.210 -0.017 -0.084 -0.045
## phasePOST 0.001 0.064 -0.069 -0.149 0.026 -0.132 -0.056 -0.055 -0.037
## phasePRE 0.117 0.118 -0.104 -0.111 -0.073 -0.249 -0.007 -0.068 -0.039
## phasNC phPOST
## odometer
## work.rate
## hi.run
## field.mints
## duration
## sesnFll2015
## sesnFll2016
## sesnFll2017
## phaseNC
## phasePOST 0.475
## phasePRE 0.522 0.367
## fit warnings:
## Some predictor variables are on very different scales: consider rescaling
## convergence code: 0
## boundary (singular) fit: see ?isSingular
From reading this post, it seems this warning can happen for a few different reasons, including fixed effects that are orders of magnitude different from one another and model coefficients that are orders of magnitude different from one another. If you look at the data and the model output: check and check. It would be a good idea to center our variables, then. For a discussion on centering, see [3]; to see it in practice, see [2]. I’m going to use grand mean centering (centering on pooled data) of the fixed effects similar to R1 from [2]. I do want to preserve the fact that we’re examining data across multiple seasons, so I’ll be including by = season in the data centering call below.
rpe.scaled <- rpe.data
# Create scaled versions of each of the fixed effects
# as.vector(scale()) is required as scale() normally returns a matrix
rpe.scaled[, c("field.minutes", "odometer", "work.rate", "hi.run", "duration") :=
lapply(.SD, function(x) as.vector(scale(x))),
           .SDcols = field.minutes:duration, by = season]
Grand mean centering alters the model coefficients, but it doesn’t actually change the interpretation of the model. So we’ll skip the rigamarole of building each step of the model in favor of only creating the scaled version of model.6.
scaled.model.6 <- lmer(rpetl ~ (1|season/athlete) + odometer + work.rate + hi.run +
field.minutes + duration + season + phase,
                       data = rpe.scaled, REML = FALSE)
## boundary (singular) fit: see ?isSingular
summary(scaled.model.6)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: rpetl ~ (1 | season/athlete) + odometer + work.rate + hi.run +
## field.minutes + duration + season + phase
## Data: rpe.scaled
##
## AIC BIC logLik deviance df.resid
## 33418.0 33507.7 -16694.0 33388.0 2915
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.9205 -0.6155 -0.0426 0.4741 5.5493
##
## Random effects:
## Groups Name Variance Std.Dev.
## athlete:season (Intercept) 964.1 31.05
## season (Intercept) 0.0 0.00
## Residual 4896.3 69.97
## Number of obs: 2930, groups: athlete:season, 103; season, 4
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 232.516 7.079 32.845
## odometer 37.013 5.932 6.239
## work.rate 6.486 4.298 1.509
## hi.run 14.987 2.307 6.496
## field.minutes -11.296 3.933 -2.872
## duration 52.126 2.189 23.813
## seasonFall 2015 -33.928 9.143 -3.711
## seasonFall 2016 -77.060 9.831 -7.838
## seasonFall 2017 -87.543 10.488 -8.347
## phaseNC 13.570 3.432 3.954
## phasePOST 4.376 4.317 1.014
## phasePRE 81.014 4.597 17.624
##
## Correlation of Fixed Effects:
## (Intr) odomtr wrk.rt hi.run fld.mn duratn sF2015 sF2016 sF2017
## odometer -0.037
## work.rate 0.036 -0.886
## hi.run 0.034 -0.421 0.121
## field.mints 0.024 -0.816 0.782 0.194
## duration 0.075 -0.157 0.123 -0.047 -0.257
## sesnFll2015 -0.702 0.000 -0.004 0.008 -0.007 0.002
## sesnFll2016 -0.641 -0.015 0.002 0.026 0.005 0.018 0.511
## sesnFll2017 -0.609 0.001 0.002 0.006 -0.006 0.003 0.479 0.445
## phaseNC -0.278 0.062 -0.091 -0.060 -0.027 -0.210 -0.008 -0.049 -0.019
## phasePOST -0.178 0.080 -0.083 -0.162 0.015 -0.129 -0.055 -0.046 -0.037
## phasePRE -0.225 0.122 -0.106 -0.114 -0.079 -0.242 -0.001 -0.034 -0.013
## phasNC phPOST
## odometer
## work.rate
## hi.run
## field.mints
## duration
## sesnFll2015
## sesnFll2016
## sesnFll2017
## phaseNC
## phasePOST 0.475
## phasePRE 0.522 0.366
## convergence code: 0
## boundary (singular) fit: see ?isSingular
Lions and Tigers and Heteroscedasticity, Oh My!
So how well does our model fit the data? Let’s plot the residuals to find out.
plot(scaled.model.6)
plot(lm(predict(scaled.model.6, rpe.scaled) ~ rpe.scaled$rpetl))



Holy heteroscedasticity! That’s not what you want to see. If you’re unfamiliar, heteroscedasticity is more than just that one word your stats professor could never pronounce. One of the assumptions of linear models is that the residual variance is consistent across all levels of the predictors. The fan-shaped residuals suggest this isn’t the case. Instead, as the predictors increase, the error in the prediction increases (variability in the athletes’ responses increases). That’s less than ideal when we’re attempting to predict rpetl. On the bright side, though, the Q-Q plot suggests the residuals are relatively normal.
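For reference, the Q-Q plot I’m referring to can be produced with base R like so:
# Q-Q plot of the model residuals to eyeball normality
qqnorm(resid(scaled.model.6))
qqline(resid(scaled.model.6))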
Given we’re dealing with heteroscedastic–but relatively normal–residuals, one way to correct heteroscedasticity in our model is transformation of the dependent variable. Since rpetl is positively skewed, our main options are log transformation and square root transformation. Both tend to be useful for “pulling in” the tail of a skewed distribution, but log transformation is the more common of the two in sports science research. Hopkins goes into some detail as to why if you’re interested. Also, there’s some discussion on log transformation in chapter 5 of [3]. Let’s add natural logged rpetl to the rpe.scaled data frame then re-run model 6. We’ll call this model final.model.
rpe.scaled[, log.rpetl := log(rpetl)]
final.model <- update(scaled.model.6, log.rpetl ~ .)
## boundary (singular) fit: see ?isSingular
summary(final.model)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: log.rpetl ~ (1 | season/athlete) + odometer + work.rate + hi.run +
## field.minutes + duration + season + phase
## Data: rpe.scaled
##
## AIC BIC logLik deviance df.resid
## 2460.7 2550.4 -1215.3 2430.7 2915
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -5.9599 -0.6227 0.0843 0.6488 3.9842
##
## Random effects:
## Groups Name Variance Std.Dev.
## athlete:season (Intercept) 0.02931 0.1712
## season (Intercept) 0.00000 0.0000
## Residual 0.12566 0.3545
## Number of obs: 2930, groups: athlete:season, 103; season, 4
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 5.29222 0.03815 138.709
## odometer 0.04029 0.03012 1.338
## work.rate 0.20622 0.02180 9.459
## hi.run 0.03150 0.01170 2.691
## field.minutes 0.06492 0.02000 3.246
## duration 0.31320 0.01110 28.217
## seasonFall 2015 -0.17880 0.04960 -3.605
## seasonFall 2016 -0.40376 0.05327 -7.579
## seasonFall 2017 -0.47707 0.05711 -8.354
## phaseNC 0.07035 0.01740 4.043
## phasePOST 0.02321 0.02188 1.060
## phasePRE 0.32283 0.02331 13.848
##
## Correlation of Fixed Effects:
## (Intr) odomtr wrk.rt hi.run fld.mn duratn sF2015 sF2016 sF2017
## odometer -0.035
## work.rate 0.033 -0.886
## hi.run 0.032 -0.420 0.120
## field.mints 0.022 -0.816 0.782 0.193
## duration 0.071 -0.155 0.122 -0.047 -0.257
## sesnFll2015 -0.706 0.001 -0.004 0.007 -0.007 0.002
## sesnFll2016 -0.646 -0.014 0.001 0.026 0.005 0.018 0.510
## sesnFll2017 -0.610 0.001 0.002 0.005 -0.005 0.003 0.476 0.443
## phaseNC -0.261 0.062 -0.090 -0.060 -0.026 -0.210 -0.007 -0.047 -0.018
## phasePOST -0.167 0.079 -0.083 -0.162 0.016 -0.129 -0.052 -0.043 -0.035
## phasePRE -0.212 0.122 -0.107 -0.114 -0.078 -0.242 0.000 -0.032 -0.012
## phasNC phPOST
## odometer
## work.rate
## hi.run
## field.mints
## duration
## sesnFll2015
## sesnFll2016
## sesnFll2017
## phaseNC
## phasePOST 0.474
## phasePRE 0.522 0.366
## convergence code: 0
## boundary (singular) fit: see ?isSingular
plot(final.model)
plot(lm(predict(final.model, rpe.scaled) ~ rpe.scaled$log.rpetl))



While still not perfect, log transforming rpetl has improved our heteroscedasticity situation. I have a feeling there’s an interaction missing from the fixed effects, but I’d rather not overcomplicate things today.
Wrapping Up
Now that we have our predictive model, the next step will be using the model to predict rpetl values and (hopefully) identify anomalous responses. Because this post is already super long, I’ve moved the predictive portion into a standalone post. But never fear, I’ve already written part 3, so there won’t be a three-month hiatus between posts! Once you’ve recovered from part 2, head on over to part 3 for the epic conclusion…or something.
References
1 Bartlett JD, O’Connor F, Pitchford N, Torres-Ronda L, Robertson SJ. Relationships between internal and external training load in team-sport athletes: Evidence for an individualized approach. International Journal of Sports Physiology and Performance 2017; 12: 230–234 Available from: https://doi.org/10.1123/ijspp.2015-0791
2 Carey DL, Ong K, Morris ME, Crow J, Crossley KM. Predicting ratings of perceived exertion in Australian football players: Methods for live estimation. International Journal of Computer Science in Sport 2016; 15: 64–77 Available from: https://doi.org/10.1515/ijcss-2016-0005
3 Field A, Miles J, Field Z. Discovering statistics using R. Sage Publishing, 2012
4 Gelman A, Hill J. Data analysis using regression and multilevel/hierarchical models. Cambridge University Press, 2006 Available from: https://doi.org/10.1017/cbo9780511790942
5 Vandewiele G, Geurkink Y, Lievens M, Ongenae F, Turck FD, Boone J. Enabling training personalization by predicting the session rate of perceived exertion (sRPE). 2017;
6 Ward P, Coutts AJ, Pruna R, McCall A. Putting the “i” back in team. International Journal of Sports Physiology and Performance 2018; 1–14 Available from: https://doi.org/10.1123/ijspp.2018-0154