Data Visualization/Data Analysis

Becoming a Better Runner using Data

This summer I ran a lot, almost every day. It was great to start the morning getting my blood pumping and feeling energized for the day. However, toward the end of the summer, I started to realize that my form might be unhealthy. Sometimes, depending on the run, my right knee would be more sore than my left.

So, at the beginning of the fall semester, much to my dissatisfaction, I stopped running until I could figure out what was causing the pain.

As recently confirmed by my doctor, my right leg is longer than my left. This is not abnormal but, combined with the fact that I have flat feet, it was most likely what was causing me pain. If I want to run again, I must be especially conscious of my form. A healthy form will minimize the potentially damaging impact on my knee and ensure I don’t feel that same pain. However, that raises the question, what factors lead to a deterioration in my form?

That brings me to this post. I almost always log my runs with Runkeeper. It’s a great app that provides a lot of statistics. It also lets you export the raw data. This is an absolute gold mine for data enthusiasts like myself.

With my renewed motivation to start running again and all this data, I decided to take a deeper look into how I was running before I stopped. Below, I’ll show how to determine what might be affecting my form and what changes I can make to help maintain a healthier one.

Let’s get started with the imports!

import datetime as dt
import pytz

import numpy as np
import pandas as pd
import xml.etree.ElementTree
import json
import requests

import pyproj
import geopandas as gpd
from shapely.geometry import Point
from shapely.wkt import loads

from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
import plotly.graph_objs as go
from plotly import tools
import cufflinks as cf

from bokeh.io import output_file, output_notebook, show
from bokeh.models import (
    GMapPlot, GMapOptions, ColumnDataSource, Circle, LogColorMapper, BasicTicker, ColorBar,
    DataRange1d, Range1d, PanTool, WheelZoomTool, BoxSelectTool, ResetTool
)
from bokeh.models.mappers import ColorMapper, LinearColorMapper
from bokeh.palettes import Viridis5
from bokeh.plotting import figure
from bokeh.resources import CDN
from bokeh.embed import file_html

from lib import custom_utils

cf.set_config_file(world_readable=True, offline=True)

Notice I import a module called custom_utils. This post is part of a larger project to keep track of my health and I found the need to reuse functions often. The file can be found here.

Here I’m just loading the keys I’ll need to use the APIs later.

with open('keys.json', 'r') as f:
    keys = json.load(f)

Runkeeper Data

Simply go to Runkeeper’s data export page to get a copy of your data. I downloaded the “activity data”. It includes a summary .csv along with .gpx files for each run. I’ll only be using the summary file.

runkeeper_runs = pd.read_csv('data/01-runkeeper-data-export-2019-01-09-162557/cardioActivities.csv', parse_dates=[1])
runkeeper_runs.head()
Activity Id Date Type Route Name Distance (mi) Duration Average Pace Average Speed (mph) Calories Burned Climb (ft) Average Heart Rate (bpm) Friend's Tagged Notes GPX File
0 d23a13b8-d1b5-42f5-8b08-d1a7bda837ed 2018-11-18 11:35:37 Running NaN 1.60 14:57 9:21 6.41 196.0 157 NaN NaN NaN 2018-11-18-113537.gpx
1 3bc2200c-0279-49b9-bdfe-88396ad9e5a6 2018-10-07 06:20:38 Running NaN 1.88 16:41 8:53 6.76 219.0 57 NaN NaN NaN 2018-10-07-062038.gpx
2 6090742d-459a-44f1-9b3b-b9a2a22ce9e9 2018-10-06 07:26:19 Running NaN 2.78 25:31 9:10 6.55 342.0 132 NaN NaN NaN 2018-10-06-072619.gpx
3 c525bb16-9942-4e8f-911f-51de038f50cd 2018-09-29 08:25:23 Running NaN 2.54 22:04 8:42 6.90 307.0 114 NaN NaN NaN 2018-09-29-082523.gpx
4 744f6918-1009-41d6-bfe0-72562e7b7057 2018-09-17 22:53:31 Running NaN 2.70 1:19:11 29:18 2.05 325.0 126 NaN NaN NaN 2018-09-17-225331.gpx


In the Runkeeper data, my average pace per run might be the best indicator of how healthy my form was.

Below, I examine some basic factors that might contribute to a slower pace.

# ignore runs with invalid pace
runkeeper_runs = runkeeper_runs.dropna(subset=['Average Pace'])

# convert duration strings to timedeltas
runkeeper_runs['Average Pace'] = runkeeper_runs['Average Pace'].apply(custom_utils.duration_to_delta)
runkeeper_runs['Duration'] = runkeeper_runs['Duration'].apply(custom_utils.duration_to_delta)

# add column with pace in seconds
runkeeper_runs['avg_pace_secs'] = runkeeper_runs['Average Pace'].dt.total_seconds()

# ignore crazy outliers with pace >15 minutes (I sometimes forget to end the run)
runkeeper_runs = runkeeper_runs[runkeeper_runs['avg_pace_secs']/60 < 15].reset_index(drop=True)
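The actual duration_to_delta lives in my custom_utils module linked above, so I won’t reproduce it here, but a minimal sketch that parses Runkeeper’s m:ss / h:mm:ss duration strings (like the “8:53” and “1:19:11” values in the table) might look like this:

```python
import datetime as dt

def duration_to_delta(duration_str):
    # Runkeeper reports durations like "8:53" or "1:19:11";
    # pad missing fields so we always have hours, minutes, seconds.
    parts = [int(p) for p in duration_str.split(':')]
    while len(parts) < 3:
        parts.insert(0, 0)
    hours, minutes, seconds = parts
    return dt.timedelta(hours=hours, minutes=minutes, seconds=seconds)
```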

Day of Week

dow_to_str = {0: 'Mon', 1: 'Tue', 2: 'Wed', 3: 'Thu', 4: 'Fri', 5: 'Sat', 6: 'Sun'}
runkeeper_runs['dow'] = runkeeper_runs['Date'].dt.dayofweek
pace_dow_avgs = runkeeper_runs.groupby(['dow'])['avg_pace_secs'].mean()
data = [
    go.Bar(
        x=pace_dow_avgs.keys(),
        y=pace_dow_avgs.values / 60,
        name='Pace Averages'
    )
]

layout = go.Layout(
    title='Average Pace vs Day of Week',
    xaxis={'title': 'Day of Week (0=Mon)'},
    yaxis={'title': 'Minutes'}
)

fig = go.Figure(data=data, layout=layout)
iplot(fig, filename='DOW avg')
data = []
for i, dow in enumerate(sorted(runkeeper_runs['dow'].unique())):
    # get pace histogram, ignoring outliers
    pace_vals = runkeeper_runs[runkeeper_runs['dow'] == dow]['avg_pace_secs'].value_counts()
    data.append(
        go.Histogram(
            x=pace_vals.keys() / 60,
            name=dow_to_str[dow],
            yaxis='y' + str(i+1)
        )
    )

fig = tools.make_subplots(rows=4, cols=2, subplot_titles=[dow_to_str[day] for day in sorted(runkeeper_runs['dow'].unique())])

fig.append_trace(data[0], 1, 1)
fig.append_trace(data[1], 1, 2)
fig.append_trace(data[2], 2, 1)
fig.append_trace(data[3], 2, 2)
fig.append_trace(data[4], 3, 1)
fig.append_trace(data[5], 3, 2)
fig.append_trace(data[6], 4, 1)

for i, dow in enumerate(sorted(runkeeper_runs['dow'].unique())):
    fig['layout']['xaxis' + str(i+1)].update(range=[7, 10])
    fig['layout']['yaxis' + str(i+1)].update(range=[0, 3])

fig.layout.update(title='Pace Distribution by Day of Week')

iplot(fig, filename='dow pace')
This is the format of your plot grid:
[ (1,1) x1,y1 ]  [ (1,2) x2,y2 ]
[ (2,1) x3,y3 ]  [ (2,2) x4,y4 ]
[ (3,1) x5,y5 ]  [ (3,2) x6,y6 ]
[ (4,1) x7,y7 ]  [ (4,2) x8,y8 ]

Time of Day

early_avg = runkeeper_runs[runkeeper_runs['Date'].dt.hour < 9]['avg_pace_secs'].mean()
morn_avg = runkeeper_runs[(runkeeper_runs['Date'].dt.hour >= 9) & (runkeeper_runs['Date'].dt.hour < 12)]['avg_pace_secs'].mean()
after_avg = runkeeper_runs[(runkeeper_runs['Date'].dt.hour >= 12) & (runkeeper_runs['Date'].dt.hour < 19)]['avg_pace_secs'].mean()
night_avg = runkeeper_runs[runkeeper_runs['Date'].dt.hour >= 19]['avg_pace_secs'].mean()

data = [
    go.Bar(
        x=['Early Morning', 'Morning', 'Afternoon', 'Night'],
        y=[early_avg / 60, morn_avg / 60, after_avg / 60, night_avg / 60],
        name='Pace Avgs'
    )
]

layout = go.Layout(
    title='Average Pace vs Time of Day',
    xaxis={'title': 'Time of Day'},
    yaxis={'title': 'Minutes'}
)

fig = go.Figure(data=data, layout=layout)
iplot(fig, filename='tod pace')


Run Distance

x = []
y = []
for i in range(5):
    x.append(str(i) + ' <= Miles < ' + str(i+1))
    y.append(runkeeper_runs[(runkeeper_runs['Distance (mi)'] >= i) & (runkeeper_runs['Distance (mi)'] < i+1)]['avg_pace_secs'].mean() / 60) 
data = [
    go.Bar(
        x=x,
        y=y,
        name='Mile Averages'
    )
]

layout = go.Layout(
    title='Average Pace vs Run Distance',
    xaxis={'title': 'Length of Run'},
    yaxis={'title': 'Minutes'}
)

fig = go.Figure(data=data, layout=layout)
iplot(fig, filename='distance pace')

Ran 36 Hours Before

rr_date_sorted = runkeeper_runs.sort_values('Date').reset_index()

row_mask = (rr_date_sorted['Date'] - rr_date_sorted['Date'].shift(1)).dt.total_seconds()/3600 < 36
data = [
    go.Bar(
        x=['Ran 36 Hrs Before', 'Didn\'t Run 36 Hrs Before'],
        y=[rr_date_sorted[row_mask]['avg_pace_secs'].mean()/60, rr_date_sorted[~row_mask]['avg_pace_secs'].mean()/60],
    )
]

layout = go.Layout(
    title='Average Pace vs If I Ran 36 Hours Before',
    yaxis={'title': 'Minutes'}
)

fig = go.Figure(data=data, layout=layout)
iplot(fig, filename='ran before pace')
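The shift trick above compares each run’s start time to the previous run’s. A toy example (with made-up dates) shows how the mask behaves, including the first row, where shift produces a NaT and the comparison comes out False:

```python
import pandas as pd

# Three runs: the second starts 24 hours after the first,
# the third 72 hours after the second.
dates = pd.Series(pd.to_datetime([
    '2018-10-01 07:00', '2018-10-02 07:00', '2018-10-05 07:00'
]))
# True where the previous run started less than 36 hours earlier
mask = (dates - dates.shift(1)).dt.total_seconds() / 3600 < 36
print(mask.tolist())  # [False, True, False]
```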


Honestly, my paces are a lot less sporadic than I thought they would be. It seems my sweet spot for distance is between two and three miles. Also, I had no idea that I’ve never run in the afternoon. I thought I would’ve done it at least once in the past two years.

Other than that interesting finding, no single factor seems to have much effect on my average pace. We can dig deeper by examining my speed throughout each run.

Intra-run Data

The .gpx format output by RunKeeper provides latitude, longitude, elevation, and timestamp information throughout the run. For our analysis, we can use the coordinates and timestamp to determine speed.
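Before parsing the files, here’s the idea in isolation: two consecutive trackpoints give a distance and an elapsed time, and their ratio gives speed. The sketch below uses a pure-Python haversine formula as a stand-in for the geodesic distance pyproj computes later; the coordinates and the 30-second gap are made up for illustration:

```python
import math

def haversine_m(lat1, lon1, lat2, lon2):
    # great-circle distance in meters on a spherical Earth
    r = 6371000  # mean Earth radius in meters
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# two nearby trackpoints recorded 30 seconds apart
dist_m = haversine_m(42.350, -71.060, 42.351, -71.059)
speed_mph = dist_m / 30.0 * 2.23694  # m/s -> mph
```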

Using the coordinates, we can also retrieve the location for each run. I’m going to be using a modified version of a function I got from this post.

def get_town(lat, lon):
    url = "https://maps.googleapis.com/maps/api/geocode/json?"
    url += "latlng=%s,%s&sensor=false&key=%s" % (lat, lon, keys['google_geocoding_api_key'])
    v = requests.get(url)
    j = json.loads(v.text)
    components = j['results'][0]['address_components']
    town = state = None
    for c in components:
        if "locality" in c['types']:
            town = c['long_name']
        elif "administrative_area_level_1" in c['types']:
            state = c['short_name']

    return town + ', ' + state if town and state else "Unknown"
_GEOD = pyproj.Geod(ellps='WGS84')

all_meas = []
locations = []

for index, gpx_filename in enumerate(runkeeper_runs['GPX File']):
    # build path
    gpx_filepath = 'data/01-runkeeper-data-export-2019-01-09-162557/' + gpx_filename
    # load gpx
    root = xml.etree.ElementTree.parse(gpx_filepath).getroot()
    # loop through all points
    meas = []
    for trkseg in root[0].findall('{http://www.topografix.com/GPX/1/1}trkseg'):
        for point in trkseg:
            # get data from point
            lat, lon = float(point.get('lat')), float(point.get('lon'))
            ele = float(point[0].text)
            timestamp = dt.datetime.strptime(point[1].text, '%Y-%m-%dT%H:%M:%SZ')
            if not meas:
                # add first point
                v = 0
                locations.append(get_town(lat, lon))
                time_from_start = 0
            else:
                # calculate distance from the previous point
                # inv returns azimuth, back azimuth and distance
                _, _, d = _GEOD.inv(meas[-1]['lon'], meas[-1]['lat'], lon, lat)
                # calculate time difference
                t = (timestamp - meas[-1]['timestamp']).total_seconds()
                # skip duplicate timestamps to avoid dividing by zero
                if t == 0:
                    continue
                # speed in meters per second
                v = d / t
                time_from_start = (timestamp - meas[0]['timestamp']).total_seconds()
            # append point
            meas.append({
                'timestamp': timestamp,
                'lat': lat,
                'lon': lon,
                'ele': ele,
                'location': locations[index],
                'speed': v * 2.23694,  # mph
                'run_idx': int(index),
                'time_from_start': time_from_start
            })
    # add this run's points to all points
    all_meas.extend(meas)
all_meas.speed.iplot(kind='histogram', bins=100)