My Last Day at AIR Worldwide

Today is my last day with AIR Worldwide, a global natural catastrophe modeling firm, based in Boston.  I have been with AIR since June 20, 2010, working in the Research and Modeling group as the lead of the Data Management Group.

This was my first job out of grad school.  During the spring of 2010 I was finishing my Ph.D. in geography from the University of Connecticut.  My dissertation focused on building an impact model from deterministic interdiction scenarios (think terrorists taking out a supply network) in a GIScience framework.  At the time, pretty much all academic job postings, which were few and far between, were in places that I wasn't going to move to, or were looking for a different type of geographer (plenty of nontechnical jobs, but few hardcore GIScience positions).  I had a couple of work options available to me in Connecticut, but I wasn't particularly excited about any of them.  So, towards the end of the spring semester I said I was going to take the summer off to relax and plan my next move.  My wife hated that plan, but I went forward with it.

After about two weeks of sitting around my apartment, watching reruns of CSI and Law and Order SVU on an endless loop, I came to the realization that I needed to get a job.  The economy was horrible at the time and I had many friends and classmates who had lost their jobs, or couldn't find one.  I was worried that I was extremely overeducated for the types of positions I saw available, or not qualified enough for anything mid-level.  Because of this worry, I passed on applying for a number of jobs and constantly questioned why I even got a Ph.D.

During the later stages of May 2010, I think I applied to a couple government jobs and tech jobs, but nothing came of them.  Then one day I went to the old monster.com and entered the search term "spatial analyst" because, why not.  Needless to say, not much came up in the results, but there was one position in Boston that caught my attention – Analyst, Data Management at AIR Worldwide.  I don't remember the job description or requirements now, but it looked good to me.  My wife and I talked about the possibility of me applying to the job, since it was in Boston and we would have to move not only away from Connecticut, but displace her from the job she had at the time at UMass Amherst.  We worked out some plans and I applied for the job.  By June 20th, I was at my desk at AIR, starting my career in the field of natural catastrophe modeling and analysis.

I was first hired as an analyst, working with the Research teams on building out various data management workflows, contributing to a wide variety of analysis and technical projects for clients around the world, and expanding and sharpening my technical skills.  I joined a research team of over 50 earth scientists, engineers, statisticians, and programmers, a majority of whom had a Ph.D.  A few weeks prior to joining AIR I had been worried that I made the wrong choice in spending years getting a Ph.D., and now I was in a job that allowed me to use and expand my skills and education.

Day 1

I was really, really lucky.  I got a job pretty quickly in a down economy, and one that took advantage of my education and skillset.  Many of my friends and classmates couldn’t say that at the time.

At AIR, I got to build innovative tools to solve big problems, provide critical analysis and data for clients, and grow professionally in many aspects of my career.  I got promotions and moved into the manager ranks.  I was also fortunate to build a successful team that produced many skilled individuals who moved on to bigger and better roles throughout the years.

It was a perfect fit for many years.

Flash-forward to September 8, 2017, and I am typing up my goodbye emails and cleaning up my desk. Like with any job, I eventually did all I could do. I knew I was capping out in my organization and my potential for career growth was slowing. I knew that if I was going to advance my career, I needed to move into the user space (like working for an insurance company, doing natural catastrophe analysis) after spending the past seven years in the vendor space.

This past June, I was approached with an opportunity to join the Natural Catastrophe team at Homesite, a Boston-based insurance company, as a Catastrophe Risk Manager. I accepted the job in late August.

I am looking forward to moving from the vendor space to the user space in my industry and to continuing to use my background and skillset.  I have a TON to learn, but I am incredibly excited to do so. I am energized to learn new things, gain a new perspective, and be challenged.

I will be eternally grateful to my coworkers at AIR from the past seven years.  I really enjoyed this job, and for a vast majority of days I was excited to go to work.  I worked with some incredibly smart and hardworking people.  I was also very fortunate to have a great manager, and a great team to work with.

Without them I wouldn’t have had the opportunity to move into this new role.

Day 2,637

Thanks AIR!

I also need to thank my wife, Danielle.  Before we moved, she had a good job at UMass Amherst, and she gave it up to come to Boston (she tried the commute for a while, but it was too tough).  Since the entire higher ed field was down and jobs were scarce, she went back to school and changed her career in the first couple years we lived here.  She made a huge sacrifice for my career and I can never repay her for that.

See you on Monday, Homesite.  I am sure we’ll be very, very, very busy.

Things to do in Boston during FOSS4G 2017

FOSS4G Boston 2017 is less than a month away!

You should go.

Why?

Well, because I probably won't be able to make all of it (insert sad face emoji here).

I have a great reason – my second child will be born days before the start of FOSS4G and it wouldn’t be fair to my family if I am having the time of my life with all of you spatial nerds while there is a new human at home.

So, you should go. You should really go.

There is a chance that this will be the largest FOSS4G ever, but the conference will still be of a size where you will be able to meet all your twitter pals in person, share ideas with some of the smartest people in spatial, and listen to talks from some of the most influential minds that are moving spatial into realms people were only dreaming of a few years ago.

The program has something for everyone.  You can focus on analysis, viz, or databases and walk away with more geo-knowledge than you came to Boston with.  The workshops look great too, and the people teaching the workshops are the same people building or contributing to some of your favorite FOSS4G projects. The conference planning teams have done an amazing job putting this all together!

But, what should you do when you aren’t geo-ing, hacking, or conferencing? Well, here are a few suggestions from a *local that will get you into the city to see the sights.

Boston/Cambridge/Somerville is a great place to visit during the summer. There is lots to do in the city center, but I encourage all those visiting Boston to get out and check out some of the neighborhoods.  Bring comfortable shoes, because you’ll be doing a lot of walking.

Eat your way through Somerville!  Where do I start?  You'll get plenty of advice on where to eat in Boston, so let's take a trip north to where I roam. When you are at Harvard for the workshops, take the Red Line north to Davis Square to start your binge.  I am a big fan of Boston Burger Company and Five Horses Tavern.  Posto, Foundry, and Saloon are also really good.  If you are looking for a great Irish pub, check out the Burren.  After dinner, get in line for some ice cream at JP Licks, and hang out in the square to do some people watching and listen to the street performers.  If you are a comic/pop culture nerd, check out Comicazi, a few steps down Highland Street from the Square.  It's one of the best comic shops you will ever go to.

If you want a totally different Somerville experience, take the Orange line to Sullivan Square and explore East Somerville up Broadway.  Taco Loco, La Brasa, Casey’s, Vinny’s – there is something for all tastes.  All are within a 15 minute walk from Sullivan Square and all are on Broadway.

Other awesome Somerville restaurants that aren't easily T accessible include Sarma (make your reservations now), anywhere in Union Square, Highland Kitchen, Soundbites, Ball Square Cafe (brunch), Magouns Saloon, Daddy Jones, and many, many more.

I’m on a Boat – Jump on a ferry and check out the Boston Harbor Islands.  The tickets are pretty cheap and you’ll get a boat ride with some awesome views of the city.  Georges Island is my go-to island, but Spectacle Island has lots of trails and great history with the Big Dig.

Not the actual boat

Explore Cambridge with Hubway – Head up to Cambridge and take the Hubway to explore (and let me know when you do so I can isolate the data for the 2018 Hubway challenge (which I may or may not enter)). Cambridge is one of the most bike friendly cities in the United States, with an expanding network of dedicated bicycle lanes. While you are in Cambridge, pedaling around Harvard and MIT, travel north on Massachusetts Ave and stop by Ward Maps and pick up some cool Boston map-morabilia.

Baseball!  America's increasingly boring pastime!  The Red Sox will be in town that week, when they play the Cardinals and Yankees.  The Sox have been pretty good this year, but interest in town hasn't been as high as in past years, so you might be able to get bleacher seats at a reasonable price.  In general, Fenway is expensive, so if you do go to a game, be prepared for steep food and drink costs.

Alternative idea – head down to Fenway on game night, walk around pregame and soak up the atmosphere, and catch the first few innings at a bar or restaurant in the neighborhood.  Or, go on a tour during the day (without skipping sessions, of course!).

Second Alternative Idea – Check out the Boston Breakers, the professional women's soccer team. They play over in the Harvard Athletic Complex in Allston and they have two home games the week of FOSS4G.  The Revolution play in Foxborough and the stadium is not easy to get to.

Go for a Run – I know a lot of people will be doing the first annual FOSS4G 0.0k, but a few may want to get in some miles and do some tourism-running (which I do everywhere I go).  There are lots of open group runs in Boston/Cambridge/Somerville, but only one has been running every week since 1995. The Somerville Road Runners have been hosting their Thursday night 4.06 run, no matter the conditions, for the past 22 years.  The run is on a challenging but good course that will take you through a few different Somerville neighborhoods.  The race starts and ends at Casey's Bar, an East Somerville institution that is about a 10-15 minute walk from Sullivan Square (Orange Line).  It also has the best course map of any run in the country.

I love this course map

Map Nerd Stuff – Any trip to Boston for a geo/map/spatial nerd isn't complete unless you visit the Mapparium at the Mary Baker Eddy Library. I could try to describe what it's like to visit, but I wouldn't do it justice.  You just have to go and see it for yourself (also, their cartography would make a rad Mapbox style).  You should also check out the Leventhal Map Center at the Boston Public Library, a premier public collection in a premier public library.

Seriously, this would be an awesome tile set for a map app!

I hope everyone really enjoys the conference and the city! It should be a ton of fun!

Footnotes:

* I’ve been in Somerville/Medford for 7 years, and since I wasn’t born here, I can never, ever say I am technically from here…That’s just how it works

This is Probably the Last GISDoctor.com Post

Hello to the people who find this blog by googling something about GIS, clicking on my interesting tweets, or being sent here from the geo-reddit or Stack Overflow.

I have some news.  I think this will be the last post to GISDoctor.com.  99% sure.

GISDoctor.com started seven years ago.  As I was finishing my Ph.D. in geography from the University of Connecticut, my former boss at the Map and Geographic Information Center (MAGIC) gave me a hosting plan and BenjaminSpaulding.com as a graduation gift. Soon after, as I was getting my personal site running, one of my coworkers at MAGIC was joking around and calling me the GIS Doctor, and the name kinda stuck. I registered the domain and started blogging.

After 151 posts, the most popular post is still my ArcGIS Sucks piece, which has such deep roots in the internets and message boards that many, many keyword searches return it – a post that has no technical value, just a lot of people bitching in the comments.  Also, lots of coworkers over the years have found my blog through that post, which has led to some awkward conversations in the office.

I am not out of geo, spatial, or the GIS fields. I just feel that now is the time to move my writing and ideas over to my personal website, BenjaminSpaulding.com and to let GISDoctor.com ride out into the sunset.  I have been hesitant in the past to post about non-spatial topics, but moving away from GISDoctor.com will allow me to visit other areas of technical interest.

My career has moved this way as well. It's been many, many years since I have been a "GIS Specialist".  For the last few years I have managed a technical team that solves many data and spatial problems, and fewer and fewer of them are based in the traditional GIS space.  Today, I would be considered a data scientist/data manager/spatial scientist, and I want the blog moving forward to reflect that.  I feel that I can reach a larger audience by focusing on the science and analysis of "where", without having to worry about explaining what GIS stands for in my website name or twitter handle.

For the next few months I will keep the content on GISDoctor.com as I migrate it to BenjaminSpaulding.com.  I hope to have BenjaminSpaulding.com in a blog format within the next week (maybe two, don’t hold your breath).  I’ll probably keep GISDoctor.com up for the foreseeable future, but I’ll just redirect it to the new site.

This should be fun.  Thanks for all the clicks.

Also, I switched my twitter handle from @GISDoctor to @theBenSpaulding.  Follow me, if you twitter.

 

Starting the 2017 Hubway Data Challenge

Time for a New Project 

A couple weeks ago, Hubway, Boston's bike share service, announced their 2017 Data Challenge. For the challenge, Hubway is providing trip data for previous years and station data, as well as access to real-time data. Those who enter the challenge will build a wide variety of visualizations and analyses.  I think I might participate, so I downloaded the detailed month-by-month data for 2015 and 2016, as well as the station data, and started to experiment.  This post will outline some of my early work before I actually figure out what I will (might) do for the challenge.

For those interested, submissions are due on April 10, 2017.

Making the Data Usable

Hubway provided the data in the only format that matters: csv files.  Since I don't do much with text files (I am a database person), I wrote a few PostgreSQL scripts to wrangle the data from csv into PostgreSQL.

The first script I wrote was a loading script – Hubway2017_loading.sql. The script is pretty simple and does the following (a rough sketch of the pattern follows the list):

  • Build the tables – the schema is pretty straight forward
  • Load data into staging tables
  • Check for ‘bad values’ in each column – values that don’t meet the data type – they used a ‘\N’ for null. Make sure you check for that.
  • Load data into the final tables – I have each year in separate tables.
  • Build geometry values for geographic analysis and visualization
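
If you are curious what that looks like in SQL, here is a minimal sketch.  This is not the actual Hubway2017_loading.sql – the table and column names below are simplified stand-ins I'm using for illustration:

    -- Minimal loading sketch; table and column names are simplified assumptions.
    -- 1. Staging table: everything lands as text so no row is rejected on load.
    CREATE TABLE hubway_trips_staging (
        tripduration  text,
        starttime     text,
        start_station text,
        start_lat     text,
        start_lon     text
    );

    -- 2. Load the csv (run from psql).
    -- \copy hubway_trips_staging FROM 'hubway_2016.csv' WITH (FORMAT csv, HEADER true);

    -- 3. Final table with real types; the '\N' strings become proper NULLs,
    --    and a point geometry is built for mapping.
    CREATE TABLE hubway_trips_2016 AS
    SELECT
        NULLIF(tripduration, '\N')::integer AS tripduration,
        NULLIF(starttime, '\N')::timestamp  AS starttime,
        NULLIF(start_station, '\N')         AS start_station,
        ST_SetSRID(ST_MakePoint(start_lon::numeric, start_lat::numeric), 4326) AS geom
    FROM hubway_trips_staging
    WHERE NULLIF(start_lat, '\N') IS NOT NULL
      AND NULLIF(start_lon, '\N') IS NOT NULL;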

The second set of scripts I wrote are analysis scripts.  To start the data analysis, I wrote three simple scripts, all of which are on the project's github page.

Feel free to check out my github page for this project and grab whatever code you like.  I anticipate I will be adding more to this project over the next couple weeks.

Starting the Visualization

At the end of the Hubway2017_loading.sql script, I loaded the station data into its own table.  With that data, I used QGIS to create a GeoJSON file of the stations with their reported capacity values.  I am using the GeoJSON format for a couple reasons: it works more seamlessly with CARTO, and it can properly store a date value (something shapefiles don't do well).

I have uploaded the dataset to my hubway github here.

For anyone who knows Boston/Somerville/Cambridge/Brookline, this pattern of stations will make sense.  The stations with lots of capacity stand out near South Station, MIT, and Mass General.  There are 187 stations in this dataset; however, I need to double-check that the stations that appear in the map below were actually in use during 2015/2016, as stations aren't necessarily permanent.

The next visualization I wanted to make was a time series map displaying the daily starts across the entire system for 2016.  The first step was to build a table with all the relevant data.  For those interested, check out the OriginsByDay_Hubway2016.sql script. Once the script was run and the data created, I built a GeoJSON file in QGIS and uploaded it into CARTO. CARTO is a great online mapping service that is easy to use. If you are looking to make some maps for this challenge and don't want to spend a lot of time learning how to map or use mapping-specific software, I encourage you to check out CARTO.
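
The core of that script is just a group-by on station and day.  Here is a rough sketch of the idea – not the actual OriginsByDay_Hubway2016.sql, and it assumes the simplified trip table from the loading sketch above:

    -- Sketch of a daily-starts-per-station rollup; column names are assumptions.
    CREATE TABLE hubway_daily_starts_2016 AS
    SELECT
        start_station,
        starttime::date               AS trip_date,
        count(*)                      AS trip_starts,
        ST_Centroid(ST_Collect(geom)) AS geom   -- one representative point per station
    FROM hubway_trips_2016
    GROUP BY start_station, starttime::date;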

The following map steps through each day, visualizing the number of trip starts using CARTO’s Torque feature.  It is fun to watch as the trip starts ebb and flow with the seasons.  One can see stations come in and out of service across the city throughout the year, see major peaks and valleys in usage, and observe the strong relationship in trip starts between downtown Boston and the outlying stations.

Click here for the full size version (that works much better than the version limited by my wordpress CSS).

This simple visualization has given me a number of ideas on what to look into next including:

  • Quantify the relationships between usage and weather
  • The Giver and Taker stations – what is the net usage by station for each day
  • Is that station at MIT really that busy every single day?
  • Relationships between population density and usage
  • Usage in regards to major days in the city, i.e., Marathon Monday, MIT/Harvard/BU move in days, college graduation days, Boston Calling, bad T days (for those who ride the T, you know what I mean).

There are some real patterns in this dataset and it will be fun to look into them and share the results.

Busiest Days in 2016

The last script I put together finds the busiest days in terms of trip starts.  The busiest day was August 9, 2016, with 6,949 starts.  This was a Tuesday, which blows my mind. I am shocked that the busiest day wasn't a weekend day.
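
The query itself is about as simple as they come – something along these lines (a sketch against the same simplified trip table as above, not the exact script):

    -- Sketch: count trip starts per day and keep the top ten.
    SELECT
        starttime::date AS trip_date,
        count(*)        AS trip_starts
    FROM hubway_trips_2016
    GROUP BY starttime::date
    ORDER BY trip_starts DESC
    LIMIT 10;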

The rest of the busiest days all had over 6k starts and all happened between the end of June and the end of September.  And again, all were on weekdays.  This is really weird to me, as I tend to think of Hubway as being used by tourists, and presumably on weekends (especially downtown).  Seeing that the busiest days are weekdays is actually a real positive for the system, as it can be seen as a viable alternative transit option.

As you can see, there is still a ton to do.  I need to get into this data some more and start to plan the story I want to tell.  Also, I need to do some more QA on the data, so I fully understand what I am dealing with. The biggest part of any data analysis project isn't the generation of fancy interactive graphics (which no one uses) or writing groundbreaking algorithms; it is the dirty data work.  Without checking and double checking the inputs, the analysis could be wrong, and no one wants that.

 

Racing Myself – Using Torque in CARTO with runBENrun

As most runners do, I run a lot of the same routes over and over again. During a run yesterday, I had the idea that I could pull all my runs on my three-mile loop and race them against each other using CARTO's Torque feature.  It took a little bit of data prep to get my GPS data into a format to "race itself", but I will save the technical details for later in the post.

Here are 25 separate runs from 2016 on my Somerville three-mile loop.  Each point is the lead GPS point of an individual run, with time steps synced, visualized by meters-per-second speed. To see the full size, click here (it is much better in full size).

 

A couple points about the race

  • Winner – 10/11/2016 – 3.13 miles, 19:10 time, 6:07 pace
  • Loser – 12/12/2016 – 3.13 miles, 24:26, 7:48 pace
  • There are a few deviations on the route, especially at the end.  This is because of a number of factors – either I made a different turn or I had to run a little longer to get the required distance due to GPS errors earlier in the run.
  • I am able to race myself because the data I generate with runBENrun project uses elapsed time, so I am able to compare run against run.
  • I used a Nike+ watch, and scraped the data into my own environment using Smashrun, tapiriik, and my own code.
  • The very last point to leave the map is a run where I didn’t turn off my watch at the end and walked into my house!

Here is How I Created the Race

Warning – Technical Details Ahead! Ahh Yeah!

In 2016, I ran my Somerville loop 25 times.  It's a pretty flat and fast course that has a good long straightaway down the bike path, but it does have a couple tight turns and pauses waiting for traffic to cross Broadway.

I run this loop during many different phases of my training. Sometimes I try to run fast on this loop, but other times I use this course for a recovery run. As I was preparing the data, I thought my pace and times would be all over the place.

The first step was to run a query against all of my 2016 runs to find all three-mile runs that were not classified as interval runs (github here). The script returns any run that rounds to three miles, so I had to do some post-processing.
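
The query boils down to something like the sketch below – the column names (rundate, totaldistance, runtype) are assumptions about my runBENrun schema, not the exact script that's on github:

    -- Sketch: pull 2016 runs that round to three miles and aren't interval workouts.
    SELECT runid, rundate, totaldistance, runtype
    FROM rbr_AllRuns_2016_ID
    WHERE round(totaldistance) = 3       -- any run that rounds to three miles
      AND runtype <> 'Intervals'         -- drop interval workouts
    ORDER BY rundate;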

The query returned 42 three-mile runs in 2016.  The next step was to pull all of the shapefiles I generated for these datasets a while back (code here!) and check the routes using QGIS. I removed a number of races I ran, and a few three-mile runs that weren't along this route. Once the set was cleaned, I ended up with the following 25 runs.

You will notice that the routes don't all follow the same path.  In fact, I often end at different places on different streets.  This is for a couple reasons: I may have had to run a little extra at the end of a run due to pauses in my GPS, or I took a turn a little early toward the end of the run and had to make up the distance at the end. Overall, the 25 runs represent a pretty consistent route.

Querying my runBENrun database, I can get my stats for the 25 runs and check out how consistent, or inconsistent, I am on this route (github here). The spread of times isn't too bad, so it should make for a decent race.

From here, I wrote a script to create a PostgreSQL table with all the relevant runs from the master GPS point table for 2016 (github here).  I made sure to cast the finaltimecounter column as time so that I could use it in CARTO later on.
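
In spirit, the script does something like this – a sketch only, where the point table layout and the somerville_loop_runs helper are assumptions I'm using to keep the example short:

    -- Sketch: copy the GPS points for the 25 loop runs into their own table.
    CREATE TABLE rbr_SomervilleLoop_Race AS
    SELECT
        p.runid,
        p.finaltimecounter::time AS elapsed_time,   -- Torque wants a real time/date type
        p.meterspersecond,
        p.geom
    FROM rbr_AllRuns_2016_Points p
    WHERE p.runid IN (SELECT runid FROM somerville_loop_runs);  -- the 25 cleaned runs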

The output table contains over 29k points, as seen below.  This dataset is what I need to use in CARTO for the animation using Torque.  Using QGIS, I exported the dataset as GeoJSON.  Why GeoJSON? Because I had a time field, and shapefiles don't play nice with time data.

I imported the GeoJSON dataset into CARTO and then experimented with the Torque Cat wizard until I found the settings that gave the best view of the "race."  CARTO is super easy to use, and the Torque Cat tool provided a lot of options to make the map look really sharp.

In the end, I got a nice map showing me race myself.  I have a few ideas on how to improve the map and data, but that will be for another time.

Thanks for reading.

The Best Way to Map a Run

From my runBENrun project I have generated a lot of data: over 1.2 million data points in 2.75 years.  It is easy enough to write SQL scripts to analyze the data and gain insight into the runs; however, trying to build meaningful maps that help me interpret my runs isn't as easy.  I have made plenty of maps of my running data over the past year, some good, some bad. In this post, I will explore a few different methods on how to best visualize a single 5k race dataset from my runBENrun project.

The Problem

With most GPS running apps and fitness trackers, you are often generating lots and lots of data.  My old Nike+ watch collects a point every ~0.97 seconds.  That means if you run six-minute miles in a 5k (about 18.6 minutes of running), you can log over 1,000 points during the run.  The GPS data collected by my Nike+ watch is great, and I can generate lots of additional derivative attributes from it, but is all that data necessary when trying to spatially understand the ebbs and flows of the run?

Software

I will be using PostgreSQL/PostGIS, QGIS and CARTO in this project. In my maps, I am using Stamen’s Toner Light basemap.

The Data

For this post, I am using a single 5k race I ran in November 2016, in Wakefield, Massachusetts.  The race course loops around Lake Quannapowitt, and is flat and fast with several good long straightaways, and some gentle curves.  I’ve run a couple races on this course, and I recommend it to anyone looking for a good course to try to PR on.  I am also selecting this dataset because the course is a loop, not an out-and-back.  Out-and-back running datasets are a lot harder to visualize since the data often interferes with itself.  I plan on doing a post about visualizing out-and-back runs sometime in the near future.

In case anyone is interested, I have exported a point shapefile and a multiline shapefile of this data, which can be found on my github account.

Before We Start Mapping…

What’s spatially important to know about this run?  Beyond mile markers, speed is what I am most interested in – more specifically, how consistent my speed is throughout the run.  I will add mile markers and the Start/Finish to the maps to give some perspective. I will also provide histograms from QGIS of the value and classification breakdowns to help give context to the maps.

Let’s Make a Lot of Maps of One Run 

Mapping all 1,117 Points – Let’s start with a simple map. When only visualizing the points I get a map of where I was when I ran. Taking a point about every second, the GPS data isn’t very clear at this scale.
[Figure: all 1,117 GPS points]

Is this a good running map?  No.

Mapping Meters Per Second Bins using Point Data

Points on a map don't tell us much, especially when the goal here is to understand speed throughout the race. The next step in this project is to visualize the range of values in the Meters per Second (MpS) field.  This is a value I calculate in my runBENrun scripts.  The next set of maps will take a couple different approaches to mapping this point data, including visualizing the MpS data by quantile, natural, and user-defined breaks.

Quantile Breaks

The first MpS map uses quantile breaks to classify the data.  Since there is a tight distribution of values, quantile breaks will work (there are no major outliers in the dataset). In the following histogram from QGIS we see the distribution of values coded to the five classes.  In all of the maps green equals faster speeds while red values are slower.

[Figure: MpS histogram, quantile breaks]

The map displays the points classified as such.  What's important to note from the point-based map is that, with so many points in such a tight space, seeing any meaningful pattern is tough.  To the naked eye there are many "ups and downs" in the data.  There are clear sections of the race where I am faster than others, but in other parts of the race a "slow" point is adjacent to a "fast" point.  This pattern will show up in the next maps as well.  I am looking into this noise and will hopefully have a post about understanding this type of variation in the GPS data.

[Figure: points classified with quantile breaks]

Is this a good running map? Not really.  The data is busy; there are too many points to get a real perspective on how consistent the speed was.

Natural Breaks

The next map uses a natural breaks classification scheme.  When comparing the quantile breaks histogram to the natural breaks histogram, one will see that the natural breaks algorithm puts fewer values into the lowest (or slowest) bin.
[Figure: MpS histogram, natural breaks]

The difference in binning is apparent in the map.  Overall, the reader is given the impression that this is a better run, since there are more non-red colors on the map. Without an MpS legend, one map would look like a faster run than the other, even though they show the same data. Overall, the general speed patterns are better represented here, as I believe there is a smoother transition between the bins.

[Figure: points classified with natural breaks]

Is this a good running map? It’s better.  The natural breaks classification works better than quantile breaks with this dataset, but there is still too much noise in the dataset. That noise won’t be eliminated until the dataset is smoothed.

Self-Defined Classification – Ben Breaks

In this example, I wanted to set my own classification scheme to create bins friendlier to the "faster" times.  I call this classification scheme the "Confidence Booster."

[Figure: MpS histogram, self-defined breaks]

One can see that I have larger bins for the faster speeds, and really minimize the red, or slower bins.  The resulting map has a smoother feel, but again, there is too much noise between the MpS values from point to point.

[Figure: points classified with self-defined breaks]

Is this a good running map?  It’s not bad, but as with all the point maps, there is a lot of data to communicate, and at this scale it doesn’t work as well as I would have liked.

Overall, using every point in the dataset isn't a good approach for mapping the run.

Mapping Multiline Data

Using my runBENrun scripts, I generated not only point geometries, but also multiline geometries (a single line calculated between each pair of sequential points).  At the scale we are viewing these maps, there isn't much visual difference between the point and line maps, which is understandable.  The multiline datasets are much better utilized when one wants to zoom into a specific area or see the actual details of the route.
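
For anyone curious how the segment-between-points geometry can be built in PostGIS, here is a rough sketch using a window function – the table and column names are assumptions, not my actual runBENrun script:

    -- Sketch: draw a line from each GPS point back to the previous point in the run.
    CREATE TABLE rbr_run_segments AS
    SELECT
        runid,
        pointorder,
        ST_MakeLine(
            lag(geom) OVER (PARTITION BY runid ORDER BY pointorder),
            geom
        ) AS seg_geom
    FROM rbr_AllRuns_2016_Points;
    -- The first point of each run has no predecessor, so its seg_geom is NULL
    -- and can be filtered out afterwards.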

I generated the same set of maps using the multiline based data as I did with the points, so I won’t repeat the maps here. However, I will share a map of the multiline data loaded into CARTO, symbolizing the MpS value with the multiline data using a natural breaks classification.

Is this a good running map?  Yes and No. The line data symbolized with natural, quantile, or self-defined breaks works better in an interactive setting where the user can pan and zoom around the dataset. However, the static versions of these maps have the same issues the point data maps do.

Mapping Multiline Data Aggregated to Tenth and Quarter Mile Segments

For this dataset (and almost all running datasets), visualizing every point, or every line between every point, isn't a good idea.  So let's try a few methods to look at the data differently.  The first approach is to smooth and aggregate the data into quarter-mile and tenth-of-a-mile segments.

Using PostGIS, I simply aggregated the geometry based on the distance data in the table, and then found the average MpS for each span.  I wrote the output to a table and visualized it in QGIS.
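
The quarter-mile rollup looks roughly like this – a sketch that assumes a running-distance column and an MpS column on the point table, with hypothetical names:

    -- Sketch: bin points into quarter-mile spans, average the MpS, and stitch
    -- each span's points into a single line for mapping.
    CREATE TABLE rbr_race_quartermile AS
    SELECT
        floor(distance_miles / 0.25)              AS quarter_mile_bin,
        avg(meterspersecond)                      AS avg_mps,
        ST_MakeLine(geom ORDER BY distance_miles) AS geom
    FROM rbr_race_points
    GROUP BY floor(distance_miles / 0.25);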

Quarter Mile Segments – Quantile Breaks

Since there is less data to visualize, we get a much cleaner, albeit dumbed-down, version of the race.  There are clear patterns showing where I was faster and where I was slower (green = fast, red = slow, relatively speaking).  The consumer of the map isn't left wondering why there was so much variation.  I made this map with both natural breaks and my self-defined breaks, but the quantile classification gave the best view of the race.

[Figure: quarter-mile segments, quantile breaks]

Is this a good running map?  Yes, if you just want to know the general trends of how your race went, then this map will let you know that. My second mile, as always, was my worst mile. I traditionally struggle in mile two.

One Tenth Mile Segments – Quantile Breaks

How about comparing different aggregation approaches?  Let's look at the race broken into tenth-of-a-mile segments using a quantile classification scheme.  In this approach, there is more detail in the MpS differences during the race than in the quarter-mile map. The color for the middle bin does get washed out in the map, so I should probably go back and fix that.

[Figure: tenth-of-a-mile segments, quantile breaks]

Is this a good running map?  Yes.  The general message – where was I fast and where was I slow – is answered and the data isn’t distracting, like it is in the point maps.  A way to improve this visualization would be to add the actual breaks between tenth mile segments, and maybe a table with the time splits.

Using Standard Deviation and Average Bins

The last set of maps will visualize the race using some basic statistical measures – standard deviation and average.

Standard Deviation

The distribution of values is fairly compact.  The resulting maps using the standard deviation bins reflect that.
[Figure: MpS histogram, standard deviation breaks]

With the point dataset and the MpS values classified using standard deviation, you actually get a pretty decent-looking map.  Since there are so few very fast or very slow MpS values, you don't get many points in those extreme bins. This means that the colors fall more in the middle of the range. This map won't tell you how fast or slow you were really going, but it gives you an idea of how your run was going relative to the rest of the race.  For what I plan to do in a race, I would hope to see a majority of values in the +1 or -1 standard deviation bins.  This would mean that I was pretty consistent in my MpS.  Ideally, I would also see values in the higher plus standard deviation bins towards the end of the race, as I really try to pick up the pace.
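
Classifying by standard deviation is also easy to do directly in SQL before the data ever reaches QGIS or CARTO.  Here is a sketch of one way to do it (not how I actually did it for these maps), again with assumed table and column names:

    -- Sketch: bin each point by how far its MpS sits from the race mean,
    -- in whole standard deviations.
    WITH race_stats AS (
        SELECT avg(meterspersecond)        AS mean_mps,
               stddev_pop(meterspersecond) AS sd_mps
        FROM rbr_race_points
    )
    SELECT
        p.geom,
        p.meterspersecond,
        width_bucket(p.meterspersecond,
                     s.mean_mps - 2 * s.sd_mps,
                     s.mean_mps + 2 * s.sd_mps,
                     4) AS sd_bin   -- buckets 0 and 5 catch values beyond +/- 2 SD
    FROM rbr_race_points p
    CROSS JOIN race_stats s;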

[Figure: points classified with standard deviation breaks]

Is this a good running map? If you know what you are looking at, then this map can tell you a lot about your run.  However, if you aren’t familiar with what a standard deviation is, or how it is mapped, then this might not be a good approach.

Average Values

The last map for this post simply maps those points that are above, at, or below the average MpS for the race.  In this race, my average MpS was 4.52 (for reference, Mo Farah won the 2016 Olympic 5k in 13:03, or 6.39 MpS!).  I created three classes: green – points with an above-average MpS, yellow – points that were average, and red – points with a below-average MpS. The view of the run isn't that bad with this approach.  The reader gets a fairly clear indication of relative speed during the race, without all the noise from the previous attempts to classify the data.  Using the average value here only works because the range of values is fairly tight.  If there were a wider swing in values, this approach might not work.
[Figure: points classified against the average MpS]

Is this a good running map? Yeah, it’s not that bad. The colors are a little harsh. In this case it works, but depending on the range of values, mapping compared to the average may not work. Another test would be to compare values against the median.

What map was the best approach?

In the end, what map was the best approach to visualizing the data from the race with the goal of best understanding my MpS?  I had two maps that I think met the requirements:

  • Quarter Mile Segment Quantile Breaks – smooth transitions between classes, easy to view, and informs readers of the general race speed trends
  • Standard Deviation – good approach if you know what a standard deviation is, and if your data is compact (don’t have huge swings in value).  This approach gives the reader a clear indication of how they were doing relative to the rest of the run, without worrying about the individual MpS values.

There is value in all the maps, and with a little work, they could be improved as well. However, these two maps were my picks.

What’s Next?

I actually made another 10 or so maps when working on this blog, including maps using proportional symbols, incorporating more data smoothing, and some ideas about flow maps.  The next steps will include exploring those visualization methods with the goal of getting them into the blog.

Have any other suggestions? Send me a note on twitter @thebenspaulding!

GISDoctor.com 2016 Year in Review!

It’s 2017, so let’s talk about 2016.

Back in January of 2016 I wrote a blog post about my goals for the upcoming year. I had a few goals I wanted to accomplish during the past 12 months. Unfortunately, I didn't learn Mandarin Chinese (didn't even really start), but I did become a better runner (check out my runBENrun posts!). My main goal for the year was to become better at what I do, and what I do is geo.

The first step to achieving my main goal was to reactivate my github account. I started several new repositories, including the code from my dissertation, a couple projects I reference often for Spatial SQL and PostGIS queries, and runBENrun, a code base where I took my raw Nike+ data and built tools to analyze and visualize my running data.

Posting on GISDoctor.com was more active throughout 2016 – 10 new posts, to be exact.  Not a lot, but enough to keep me motivated and active.  I hope for more posts in 2017! As always, I have plenty of ideas. Finding time to write them up is a totally different challenge.

Being an active OSM contributor was another goal for 2016, and early in the year I craft-mapped a ton. I mapped almost every day in January, bringing some sweet craft mapping skills to some under-mapped areas. Perhaps I'll do another OSM-mapathon sometime in early 2017.

Python, PostgreSQL/PostGIS, Node/JavaScript/Leaflet, and QGIS were my tools of choice for 2016 (and probably for most of you as well). I expanded my skills in each, keeping current with new trends and tech. I wanted to get better, and I think by taking the time (a few hours here and there over nights and weekends throughout the entire year) I was able to learn new skills and technologies.

Now, why do I do all this extra work?  I have a job where I get to do a lot of very technical geospatial work and continually get to push my skills. However, due to the nature of the work, I don't and can't share it here. It was through these "at home" projects and posts that I pushed myself to continue to learn more, expand my skills, and share them with you.

There was one thing I wanted to do in 2016 that I totally missed out on.  I wanted to get more involved in the geo-community.  I didn’t. I will try again in 2017.  One good thing about our community is that there are always plenty of opportunities to get involved and make a difference.

The stats of 2016

The pageviews on GISDoctor.com were down this year compared to 2015. I think this is mostly due to the fact that in 2015 I had a post get on Hacker News, which led to a ton of traffic.

The top ten viewed pages for the past year are seen below.  Many of these posts are actually pretty old, but they all have long comment histories or have been posted in other locations leading readers back to the site.

[Figure: top ten viewed pages of 2016]

What’s on tap for 2017?  I have a few plans, but that is another post!

runBENrun – These Aren't Heat Maps

I’ve gone back into my running data from 2014 and 2015 to build some density maps to compare to what I have run so far in 2016. Building a 10m grid for the region, I did some simple aggregations based on the GPS points captured by my Nike+ watch and processed through my runBENrun project (see it here on github).
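
The aggregation itself is straightforward in PostGIS.  Here is a sketch of the idea – snapping points to a 10m grid and counting them – with assumed table names and an assumed meter-based projection (Massachusetts State Plane), not my exact runBENrun code:

    -- Sketch: snap each GPS point to a 10 m grid cell and count points per cell.
    CREATE TABLE rbr_run_density_10m AS
    SELECT
        ST_SnapToGrid(ST_Transform(geom, 26986), 10.0) AS cell_anchor,
        count(*)                                       AS point_count
    FROM rbr_AllRuns_Points
    GROUP BY ST_SnapToGrid(ST_Transform(geom, 26986), 10.0);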

These aren’t heat maps.  These are simple density maps.  There is a difference.

<start rant>

Please stop calling every single choropleth map a heat map.

</end rant>

From my running data, I can see some pretty clear patterns in where I ran.  In 2014, I kept my runs in Winter Hill, but ventured out into Cambridge and Boston a few times. A couple races in Boston show up, but the blue color range is only for a couple points per pixel.

2014 Run Density

In 2015, I changed the geography of my runs. I stopped with my Winter Hill routes and went out to the Minuteman Bikeway, venturing out as far as Lexington. The darker reds indicate where most of my runs were. Again, a race in Boston stands out as a single run, as do a couple runs into Medford and the southern reaches of Somerville.

2015 Run Density

My 2016 run density map to date is much different from the previous two years'.  First, I have put on a lot more miles this year than in past years, but almost all of those miles were on the Minuteman Bikeway! I did run quite a bit into Cambridge and Boston, mostly on my long Sunday runs as I prepared for my marathon. Like 2015, a vast majority of my runs were in Somerville and Medford, along the bike path.

2016 Run Density

When I combine all years, I get a view of my running history that shows I have developed quite the habit of running close to home! The runs along the Minuteman Bikeway radiate red, as I have logged hundreds of miles along the route over the past couple years.  Even my adventures into Cambridge and Boston start to stand out, as I tend to use the same routes down Mass Ave, Boylston Street, and back into Somerville and Medford along Broadway in Cambridge.

All Run Density Map

This exercise didn’t reveal anything new to me, but it was a good exercise in thinking about different ways to display the data collected from my Nike+ watch through my runBENrun project.

runBENrun Part 3 – Starting with PostgreSQL and Running Analysis

Kernel density analysis performed in ArcGIS (red – more runs, blue – fewer runs)

Run Ben, Run! Data Analysis!

Finally! I am at a point with my Nike+ app data transformation process where the raw TCX files have become data I can work with.  It is now time to build the data foundation for further analysis. For this phase of the project I decided to take all the run data created by my text parsing code and load it into a PostgreSQL database.  I work with SQL scripts all the time, but professionally I work in a Microsoft environment, so it is nice to branch out and work with PostgreSQL.

The biggest PostgreSQL challenge I ran into was remembering to add the semicolon to the end of all my queries!  Otherwise, the differences in syntax, code editors, and software between Transact-SQL and PostgreSQL/PostGIS were easy to learn.

The first step was to design and build a simple database to store the data.  The database is built around three tables (a rough DDL sketch follows the list):

  • rbr_AllRuns_2016_Points – table where I upload the points and attribute data built in the TCXtoText.py script. The  table will also store the geometry point objects and the geometry line segments between each point for a given run. To tie the individual runs to other tables I added a runid field, which was added to the input dataset in the TCXtoText.py script.
  • rbr_AllRuns_2016_ID – table where each run’s characteristics are stored, including date, runid, descriptive info about the run, total time, average pace, fastest mile in minutes,  and the fastest mile (which mile in the run).
  • rbr_AllRuns_2016_MileSplits – table that stores the runid, mile marker, and time (in minutes) I completed that specific mile.  The time data was calculated in the TCXtoText.py script and imported into the rbr_AllRuns_2016_Points table.
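
For illustration, the skeleton of those three tables looks something like this – a sketch only, since the column names and types beyond what I described above are assumptions rather than the exact schema in rBr_ProcessNikeGPSData.sql:

    -- Rough DDL sketch; column names and types are assumptions beyond the post's description.
    CREATE TABLE rbr_AllRuns_2016_ID (
        runid           integer PRIMARY KEY,
        rundate         date,
        rundescription  text,
        totaltime       interval,
        averagepace     numeric,      -- minutes per mile
        fastestmiletime numeric,      -- fastest mile, in minutes
        fastestmile     integer       -- which mile in the run was fastest
    );

    CREATE TABLE rbr_AllRuns_2016_Points (
        runid           integer REFERENCES rbr_AllRuns_2016_ID (runid),
        pointorder      integer,
        meterspersecond numeric,
        geom_point      geometry(Point, 4326),
        geom_segment    geometry(LineString, 4326)   -- line back to the previous point
    );

    CREATE TABLE rbr_AllRuns_2016_MileSplits (
        runid      integer REFERENCES rbr_AllRuns_2016_ID (runid),
        milemarker integer,
        splittime  numeric   -- minutes to complete that mile
    );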

There are also several "temp" tables that are built to support the three main tables.  These tables were built to clean values, generate the line geometries, and add the mile markers and mile splits.  I call these "temp" tables, but I wrote them all to the database.  They're only "temp" in the sense that I (probably) won't use them for analysis.  Everything I need from them is in the main tables.

The code to generate the required tables and populate the necessary data can be found on my github account – rBr_ProcessNikeGPSData.sql

If you check my code on github, my table naming isn’t very consistent for these temp tables.  I will clean it up.

Early Analysis Results

I have started thinking about the analysis I want to build and have played with a few ideas.  Some early queries have included classifying my runs by distance and speed, finding my fastest and slowest miles, and comparing mile splits across runs and distances.

  • To this point in 2016, my GPS has logged 219 runs and 442,174 GPS points, which account for 117 hours, 39 minutes and 14 seconds of running and 1126.78 miles. My marathon, for whatever reason, won’t export out of Nike+.
  • The 442,174 GPS points sometimes create interesting patterns.  For example, when zoomed into a street where I run every day, I get an interesting striping of points. Without seeing the individual runs, it is tough to tell if this is just noise or a real pattern. I know my GPS takes a reading every 0.97 seconds. Since I run the same routes so much, I believe the pattern gets amplified, creating the striping. It's neat to see.

[Figure: sample of the GPS point striping]

  • Not tracked in my data – the three pairs of running shoes I have gone through this year. Adidas Supernova Sequence.
  • I built a Run Type field in my ID table, where I pseudo-categorize my runs by distance and speed.  This categorization needs work, but so far I have more Awesome runs than Ehh runs (see the table below). I'll post the details on how I categorize these runs later.

Run Type     Total Runs
OK           83
Intervals    59
Awesome      34
Great        33
Ehh          10
  • The fastest mile I ran that wasn't in a race or during intervals was on April 13, at a 5:48 pace, cruising down the bike path in Somerville.

[Figure: fastest non-race, non-interval mile]

  • My slowest mile was on July 31 at an 8:08 pace, but I didn’t map that!

What’s Next

Now that I have my data in a format that I can quickly query, the deeper analysis will follow.  There are some data cleaning steps I need to add to the loading process (like dealing with pauses and breaks in the GPS data), and I need to refine how I measure distance.

Feel free to check out the code on github and shoot me any suggestions/comments/ideas through twitter @GISDoctor.