Becoming a Better Runner using Data
This summer I ran a lot – almost every day. It was a great way to start the morning: getting my blood pumping left me feeling energized for the rest of the day. However, toward the end of the summer, I started to realize that my form might be unhealthy. Sometimes, depending on the run, my right knee would be more sore than my left.
So, at the beginning of the fall semester, much to my disappointment, I stopped running until I could figure out what was causing the pain.
As recently confirmed by my doctor, my right leg is longer than my left. This is not abnormal, but combined with the fact that I have flat feet, it was most likely what was causing me pain. If I want to run again, I must be especially conscious of my form. Healthy form will minimize the potentially damaging impact on my knee and ensure I don’t feel that same pain. That raises the question: what factors lead to a deterioration in my form?
That brings me to this post. I almost always log my runs with Runkeeper. It’s a great app that provides a lot of statistics. It also lets you export the raw data. This is an absolute gold mine for data enthusiasts like myself.
With my renewed motivation to start running again and all this data, I decided to take a deeper look into how I was running before I stopped. Below, I’ll show how to determine what might be affecting my form and what changes I can make to maintain a healthier one.
Let’s get started with the imports!
import datetime as dt
import pytz
import numpy as np
import pandas as pd
import xml.etree.ElementTree
import json
import requests
import pyproj
import geopandas as gpd
from shapely.geometry import Point
from shapely.wkt import loads
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
import plotly.graph_objs as go
from plotly import tools
import cufflinks as cf
from bokeh.io import output_file, output_notebook, show
from bokeh.models import (
GMapPlot, GMapOptions, ColumnDataSource, Circle, LogColorMapper, BasicTicker, ColorBar,
DataRange1d, Range1d, PanTool, WheelZoomTool, BoxSelectTool, ResetTool
)
from bokeh.models.mappers import ColorMapper, LinearColorMapper
from bokeh.palettes import Viridis5
from bokeh.plotting import figure
from bokeh.resources import CDN
from bokeh.embed import file_html
from lib import custom_utils
init_notebook_mode(connected=True)
cf.set_config_file(world_readable=True, offline=True)
Notice I import a module called custom_utils. This post is part of a larger project to keep track of my health, and I found myself reusing the same functions often. The file can be found here.
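One helper from custom_utils that shows up below is duration_to_delta, which converts Runkeeper's duration strings into timedeltas. The real implementation lives in the linked file; a minimal sketch of the idea, assuming durations formatted like '22:04' or '1:19:11', might look like this:
import datetime as dt

def duration_to_delta(duration):
    # hypothetical sketch: parse 'MM:SS' or 'H:MM:SS' into a timedelta
    parts = [int(p) for p in str(duration).split(':')]
    while len(parts) < 3:
        parts.insert(0, 0)  # pad missing hours
    hours, minutes, seconds = parts
    return dt.timedelta(hours=hours, minutes=minutes, seconds=seconds)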
Here I’m just loading the keys I’ll need to use the APIs later.
with open('keys.json', 'r') as f:
keys = json.loads(f.read())
Runkeeper Data
Simply go to https://runkeeper.com/exportData to get a copy of your data. I downloaded the “activity data”. It includes a summary .csv along with a .gpx file for each run. For now, I’ll only be using the summary file; the .gpx files come into play later in the intra-run analysis.
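Unzipped, the export looks roughly like this (only the files used in this post are shown; the names come from my own export):
data/01-runkeeper-data-export-2019-01-09-162557/
    cardioActivities.csv        # per-activity summary, used throughout
    2018-11-18-113537.gpx       # one GPX track per run, used in the intra-run section
    2018-10-07-062038.gpx
    ...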
runkeeper_runs = pd.read_csv('data/01-runkeeper-data-export-2019-01-09-162557/cardioActivities.csv', parse_dates=[1])
runkeeper_runs.head()
 | Activity Id | Date | Type | Route Name | Distance (mi) | Duration | Average Pace | Average Speed (mph) | Calories Burned | Climb (ft) | Average Heart Rate (bpm) | Friend's Tagged | Notes | GPX File
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | d23a13b8-d1b5-42f5-8b08-d1a7bda837ed | 2018-11-18 11:35:37 | Running | NaN | 1.60 | 14:57 | 9:21 | 6.41 | 196.0 | 157 | NaN | NaN | NaN | 2018-11-18-113537.gpx |
1 | 3bc2200c-0279-49b9-bdfe-88396ad9e5a6 | 2018-10-07 06:20:38 | Running | NaN | 1.88 | 16:41 | 8:53 | 6.76 | 219.0 | 57 | NaN | NaN | NaN | 2018-10-07-062038.gpx |
2 | 6090742d-459a-44f1-9b3b-b9a2a22ce9e9 | 2018-10-06 07:26:19 | Running | NaN | 2.78 | 25:31 | 9:10 | 6.55 | 342.0 | 132 | NaN | NaN | NaN | 2018-10-06-072619.gpx |
3 | c525bb16-9942-4e8f-911f-51de038f50cd | 2018-09-29 08:25:23 | Running | NaN | 2.54 | 22:04 | 8:42 | 6.90 | 307.0 | 114 | NaN | NaN | NaN | 2018-09-29-082523.gpx |
4 | 744f6918-1009-41d6-bfe0-72562e7b7057 | 2018-09-17 22:53:31 | Running | NaN | 2.70 | 1:19:11 | 29:18 | 2.05 | 325.0 | 126 | NaN | NaN | NaN | 2018-09-17-225331.gpx |
Pace
In the Runkeeper data, my average pace per run might be the best indicator of how healthy my form was.
Below, I examine some basic factors that might contribute to a slower pace.
# ignore runs with invalid pace
runkeeper_runs = runkeeper_runs.dropna(subset=['Average Pace'])
# convert duration strings to timedeltas
runkeeper_runs['Average Pace'] = runkeeper_runs['Average Pace'].apply(custom_utils.duration_to_delta)
runkeeper_runs['Duration'] = runkeeper_runs['Duration'].apply(custom_utils.duration_to_delta)
# add column with pace in seconds
runkeeper_runs['avg_pace_secs'] = runkeeper_runs['Average Pace'].dt.total_seconds()
# ignore crazy outliers with pace >15 minutes (I sometimes forget to end the run)
runkeeper_runs = runkeeper_runs[runkeeper_runs['avg_pace_secs']/60 < 15].reset_index(drop=True)
Day of Week
dow_to_str = {0: 'Mon', 1: 'Tue', 2: 'Wed', 3: 'Thu', 4: 'Fri', 5: 'Sat', 6: 'Sun'}
runkeeper_runs['dow'] = runkeeper_runs['Date'].dt.dayofweek
pace_dow_avgs = runkeeper_runs.groupby(['dow'])['avg_pace_secs'].mean()
data = []
data.append(
go.Bar(
x=pace_dow_avgs.keys(),
y=pace_dow_avgs.values / 60,
name='Pace Averages'
)
)
layout = go.Layout(
title='Average Pace vs Day of Week',
barmode='group',
xaxis={
'title': 'Day of Week (0=Mon)'
},
yaxis={
'title': 'Minutes'
}
)
fig = go.Figure(data=data, layout=layout)
iplot(fig, filename='DOW avg')
data = []
for i, dow in enumerate(sorted(runkeeper_runs['dow'].unique())):
    # count how many runs hit each average pace on this day of the week
pace_vals = runkeeper_runs[runkeeper_runs['dow'] == dow]['avg_pace_secs'].value_counts()
data.append(
go.Bar(
x=pace_vals.keys() / 60,
y=pace_vals.values,
name=str(dow),
yaxis='y' + str(i+1)
)
)
fig = tools.make_subplots(rows=4, cols=2, subplot_titles=[dow_to_str[day] for day in sorted(runkeeper_runs['dow'].unique())])
fig.append_trace(data[0], 1, 1)
fig.append_trace(data[1], 1, 2)
fig.append_trace(data[2], 2, 1)
fig.append_trace(data[3], 2, 2)
fig.append_trace(data[4], 3, 1)
fig.append_trace(data[5], 3, 2)
fig.append_trace(data[6], 4, 1)
for i, day in enumerate(sorted(runkeeper_runs['dow'].unique())):
fig['layout']['xaxis' + str(i+1)].update(range=[7, 10])
fig['layout']['yaxis' + str(i+1)].update(range=[0, 3])
fig.layout.update(height=1000)
fig.layout.update(title='Pace Distribution by Day of Week')
iplot(fig, filename='dow pace')
This is the format of your plot grid:
[ (1,1) x1,y1 ] [ (1,2) x2,y2 ]
[ (2,1) x3,y3 ] [ (2,2) x4,y4 ]
[ (3,1) x5,y5 ] [ (3,2) x6,y6 ]
[ (4,1) x7,y7 ] [ (4,2) x8,y8 ]
Time of Day
# bucket runs by starting hour (24-hour clock): before 9 AM, 9 AM to noon, noon to 7 PM, and 7 PM onward
early_avg = runkeeper_runs[runkeeper_runs['Date'].dt.hour < 9]['avg_pace_secs'].mean()
morn_avg = runkeeper_runs[(runkeeper_runs['Date'].dt.hour >= 9) & (runkeeper_runs['Date'].dt.hour < 12)]['avg_pace_secs'].mean()
after_avg = runkeeper_runs[(runkeeper_runs['Date'].dt.hour >= 12) & (runkeeper_runs['Date'].dt.hour < 19)]['avg_pace_secs'].mean()
night_avg = runkeeper_runs[runkeeper_runs['Date'].dt.hour >= 19]['avg_pace_secs'].mean()
data = []
data.append(
go.Bar(
x=['Early Morning', 'Morning', 'Afternoon', 'Night'],
y=[early_avg / 60, morn_avg / 60, after_avg / 60, night_avg / 60],
name='Pace Avgs'
)
)
layout = go.Layout(
title='Average Pace vs Time of Day',
barmode='group',
xaxis={
'title': 'Time of Day'
},
yaxis={
'title': 'Minutes'
}
)
fig = go.Figure(data=data, layout=layout)
iplot(fig, filename='tod pace')
Distance
x = []
y = []
for i in range(5):
x.append(str(i) + ' <= Miles < ' + str(i+1))
y.append(runkeeper_runs[(runkeeper_runs['Distance (mi)'] >= i) & (runkeeper_runs['Distance (mi)'] < i+1)]['avg_pace_secs'].mean() / 60)
data = []
data.append(
go.Bar(
x=x,
y=y,
name='Mile Averages'
)
)
layout = go.Layout(
title='Average Pace vs Run Distance',
barmode='group',
xaxis={
'title': 'Length of Run'
},
yaxis={
'title': 'Minutes'
}
)
fig = go.Figure(data=data, layout=layout)
iplot(fig, filename='distance pace')
Ran 36 Hours Before
Another factor worth checking is whether I had already run within the previous 36 hours. To find out, I sort the runs by date and compare each start time to the previous run's start time.
rr_date_sorted = runkeeper_runs.sort_values('Date').reset_index()
row_mask = (rr_date_sorted['Date'] - rr_date_sorted['Date'].shift(1)).dt.total_seconds()/3600 < 36
data = []
data.append(
go.Bar(
x=['Ran 36 Hrs Before', 'Didn\'t Run 36 Hrs Before'],
y=[rr_date_sorted[row_mask]['avg_pace_secs'].mean()/60, rr_date_sorted[~row_mask]['avg_pace_secs'].mean()/60],
name='Paces'
)
)
layout = go.Layout(
    title='Average Pace vs Whether I Ran 36 Hours Before',
barmode='group',
yaxis={
'title': 'Minutes'
}
)
fig = go.Figure(data=data, layout=layout)
iplot(fig, filename='ran before pace')
Observations
Honestly, my paces are a lot less sporadic than I thought they would be. It seems my sweet spot for distance is between two and three miles. Also, I had no idea that I’ve never run in the afternoon. I thought I would’ve done it at least once in the past two years.
Other than that interesting finding, no single factor seems to have a strong effect on my average pace. We can dig deeper by examining my speed throughout each run.
Intra-run Data
The .gpx files exported by Runkeeper provide latitude, longitude, elevation, and a timestamp for each point along the run. For our analysis, we can use the coordinates and timestamps to compute speed.
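For reference, each track point in a .gpx file looks roughly like the snippet below (the values are placeholders; the real files wrap everything in the http://www.topografix.com/GPX/1/1 namespace, which is why it appears in the findall call further down):
<trkseg>
  <trkpt lat="42.4440" lon="-76.5019">
    <ele>120.1</ele>
    <time>2018-10-06T11:26:19Z</time>
  </trkpt>
  ...
</trkseg>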
Using the coordinates, we can also retrieve the location for each run. I’m going to be using a modified version of a function I got from this post.
def get_town(lat, lon):
url = "https://maps.googleapis.com/maps/api/geocode/json?"
url += "latlng=%s,%s&sensor=false&key=%s" % (lat, lon, keys['google_geocoding_api_key'])
v = requests.get(url)
j = json.loads(v.text)
components = j['results'][0]['address_components']
town = state = None
for c in components:
if "locality" in c['types']:
town = c['long_name']
elif "administrative_area_level_1" in c['types']:
state = c['short_name']
    return town + ', ' + state if town and state else "Unknown"
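As a quick sanity check (this assumes a valid Google Geocoding API key in keys.json; the coordinates are just an arbitrary example):
print(get_town(40.7580, -73.9855))  # prints something like 'New York, NY'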
_GEOD = pyproj.Geod(ellps='WGS84')
all_meas = []
locations = []
for index, gpx_filename in enumerate(runkeeper_runs['GPX File']):
# build path
gpx_filepath = 'data/01-runkeeper-data-export-2019-01-09-162557/'+gpx_filename
# load gpx
root = xml.etree.ElementTree.parse(gpx_filepath).getroot()
    # loop through all points in the GPX track
meas = []
for trkseg in root[0].findall('{http://www.topografix.com/GPX/1/1}trkseg'):
for point in trkseg:
# get data from point
lat, lon = float(point.get('lat')), float(point.get('lon'))
ele = float(point[0].text)
timestamp = dt.datetime.strptime(point[1].text, '%Y-%m-%dT%H:%M:%SZ')
if not meas:
# add first point
v = 0
locations.append(get_town(lat, lon))
time_from_start = 0
else:
# calculate distance
# Source: https://stackoverflow.com/questions/24968215/python-calculate-speed-distance-direction-from-2-gps-coordinates
try:
# inv returns azimuth, back azimuth and distance
_, _ , d = _GEOD.inv(meas[-1]['lon'], meas[-1]['lat'], lon, lat)
                except Exception:
                    raise ValueError("Could not compute distance between GPS points")
                # calculate time difference in seconds
t = ( (timestamp - meas[-1]['timestamp']).total_seconds() )
                # skip points with duplicate timestamps to avoid dividing by zero
if t == 0:
continue
# speed in meters per second
v = d / t
time_from_start = (timestamp - meas[0]['timestamp']).total_seconds()
# append point
meas.append({
'timestamp': timestamp,
'lat': lat,
'lon': lon,
'ele': ele,
'location': locations[index],
'speed': v*2.23694, # mph
'run_idx': int(index),
'time_from_start': time_from_start
})
# add this run's points to all points
all_meas.extend(meas)
all_meas = pd.DataFrame(all_meas)
all_meas.speed.iplot(kind='histogram', bins=100)