Digging into Ford GoBike Data with Tableau
--
Did you know that Motivate, the operator of Ford GoBike, our first regional bikeshare network, publishes ride data for free on their website?
I looked at their September ride data, which included rider birthdate, station locations and names, trip durations, and more. I had a couple questions I wanted to ask:
- Was GoBike mainly being used by tourists or residents? I’d seen some flak from locals on NextDoor about how GoBikes catered to tourists at the expense of the neighborhood landscape that residents had to look at every day. However, I know a lot of urbanists are fiercely pro-bikeshare as a last-mile solution.
- What were the predominant age groups using GoBike? The only rider data the data set provides is birthdate and gender; we can’t see into socioeconomic status, location, etc. However, I was curious to know if we could see if bikes were overwhelmingly for broke students hopping around campus, or retirees cruising along the Embarcadero.
Exploring Rider Age Data
The question of age was a lot more straightforward to figure out, so I started there. I plotted median age (calculated from self-reported birthdate) and number of trips on a map of SF.
From here we can see, unsurprisingly, that the majority of trips are along the Embarcadero, the Market Street commercial corridor, and around the Caltrain 4th and King transit hub. If GoBike really was just for tourists, I’m surprised that, say, Alamo Square is so underrepresented.
483 trips total at 3 adjacent stations for September is roughly 16 trips a day starting from that area — and that’s going downhill in most directions from there. Considering how nice San Francisco September weather is, not to mention the view from Alamo Square, this was surprising. (But I digress.)
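For anyone who wants to reproduce the numbers behind the map outside Tableau, here is a minimal sketch of the per-station aggregation. It assumes each trip carries a start station and a rider birth year (or nothing, if the rider didn’t report one), and it approximates age as ride year minus birth year. The sample rows and the `station_summary` name are made up for illustration.

```python
from collections import defaultdict
from statistics import median

# Hypothetical sample rows: (start station, rider birth year or None).
TRIPS = [
    ("Station A", 1985),
    ("Station A", 1990),
    ("Station A", None),   # rider did not report a birth year
    ("Station B", 1970),
    ("Station B", 1995),
    ("Station B", 1988),
]

def station_summary(trips, ride_year=2018):
    """Return {station: (trip_count, median_age)}.

    Age is approximated as ride_year - birth_year. Trips with no birth
    year still count as trips but are left out of the median.
    """
    counts = defaultdict(int)
    ages = defaultdict(list)
    for station, birth_year in trips:
        counts[station] += 1
        if birth_year is not None:
            ages[station].append(ride_year - birth_year)
    return {
        station: (counts[station], median(ages[station]) if ages[station] else None)
        for station in counts
    }

summary = station_summary(TRIPS)
```

Keeping the trip count next to the median matters later on, since a median built from a handful of trips is easy to over-read.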
Adding a Temporal Dimension
I wanted to dig deeper on the tourists vs. locals thing, so I decided to slice it up by the rounded hour of day for the beginning of the trip. This also gave me an excuse to play with a parameter filter, which is a fun Tableau trick.
Nice! This let me dig into the data on a number of levels.
First of all, as we can see above, there are major ‘humps’ in the number of rides at commute hours, far bigger than I anticipated. I broke this down in a simpler way, which made it look even more dramatic:
This, for me, felt like the strongest evidence that GoBike was being used more by commuters than by tourists.
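The hourly breakdown can be sketched in a few lines of Python. This assumes trip start times are strings like `2018-09-04 08:12:45` and that "rounded hour" means rounding to the nearest hour (so 23:40 wraps around to hour 0). Both are my assumptions about the data and the calculation, not something taken from the Tableau workbook.

```python
from collections import Counter
from datetime import datetime

def rounded_hour(start_time):
    """Round a trip start time to the nearest hour of day (0-23)."""
    dt = datetime.strptime(start_time, "%Y-%m-%d %H:%M:%S")
    # Minutes 30-59 round up; hour 24 wraps back to 0.
    return (dt.hour + (1 if dt.minute >= 30 else 0)) % 24

# Hypothetical start times for illustration.
STARTS = [
    "2018-09-04 08:12:45",
    "2018-09-04 08:47:10",
    "2018-09-04 17:29:59",
    "2018-09-04 23:40:00",
]

# Number of trips starting in each rounded hour.
trips_by_hour = Counter(rounded_hour(s) for s in STARTS)
```

If you would rather bucket trips by the hour they fall in (08:47 counts as 8, not 9), replace the rounding with plain `dt.hour`.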
From here, I created a set I called “Commute Hours” to try to see how many total trips were within “commute hours” vs. outside of them. A very inexact measure, but I was curious.
It seemed clear that far more trips were taking place in commute hours (7–10AM and 4–7PM). In a sort of null-hypothesis world where the same number of trips were taken every hour from 7am to 11pm, “commute hours” would cover 6 of those 16 hours, or about 38% of trips, rather than the 60+% we actually see.
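The baseline arithmetic is simple enough to check by hand, but here is a small sketch of it alongside the observed share. It assumes commute hours are identified by start hour (7, 8, 9 and 16, 17, 18) and that both shares are measured over the 7am to 11pm window. The hourly counts are invented for illustration.

```python
COMMUTE_HOURS = {7, 8, 9, 16, 17, 18}   # 7-10AM and 4-7PM, by start hour
SERVICE_HOURS = range(7, 23)            # 7am up to (not including) 11pm

def uniform_commute_share():
    """Share of trips in commute hours if every hour were equally busy."""
    return len(COMMUTE_HOURS) / len(SERVICE_HOURS)

def observed_commute_share(trips_by_hour):
    """Share of trips in commute hours, counting only the 7am-11pm window."""
    in_window = {h: n for h, n in trips_by_hour.items() if h in SERVICE_HOURS}
    total = sum(in_window.values())
    commute = sum(n for h, n in in_window.items() if h in COMMUTE_HOURS)
    return commute / total if total else 0.0

# Hypothetical hourly trip counts; the 3AM trips fall outside the window.
SAMPLE = {3: 50, 7: 100, 8: 300, 9: 200, 12: 100, 17: 300, 20: 100}

baseline = uniform_commute_share()          # 6 / 16 = 0.375
observed = observed_commute_share(SAMPLE)   # 900 / 1100
```

Comparing `observed` against `baseline` is a rough sanity check, not a significance test.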
N = ?
At this point, I took a detour. As soon as I plotted a map of median rider age by trip hour, I naturally wanted to see who was doing what in the middle of the night — mostly young people, I suspected. I glanced at the Valencia corridor, around the 1AM hour, and sure enough:
However, when I took a step back and looked around, I realized this may have just been confirmation bias. I started to notice that the median age was way older at some stations in the middle of the night. Looking at the legend, I realized something obvious: N was way too small to make any meaningful observations. Many stations had 0–5 trips total within a given hour (say, the 3–4AM hour) for the entire month. This data wasn’t particularly meaningful, except to tell me that no, 18–22 year olds aren’t really using GoBike in the middle of the night, ever.
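One way to guard against this kind of small-N trap is to suppress the median whenever a station-and-hour group has too few trips. The sketch below assumes trips are available as (station, start hour, rider age) tuples. The cutoff of 30 is an arbitrary choice of mine, not a statistical rule.

```python
from collections import defaultdict
from statistics import median

MIN_TRIPS = 30  # hypothetical cutoff; pick whatever feels defensible

def median_age_by_station_hour(trips, min_trips=MIN_TRIPS):
    """Return {(station, hour): (n, median_age)}.

    median_age is None when the group has fewer than min_trips trips,
    so thin groups cannot masquerade as findings.
    """
    groups = defaultdict(list)
    for station, hour, age in trips:
        groups[(station, hour)].append(age)
    return {
        key: (len(ages), median(ages) if len(ages) >= min_trips else None)
        for key, ages in groups.items()
    }

# Hypothetical data: 3 late-night trips vs. 40 morning-commute trips.
trips = [("Valencia St", 1, 24), ("Valencia St", 1, 22), ("Valencia St", 1, 61)]
trips += [("Caltrain", 8, age) for age in range(25, 65)]

result = median_age_by_station_hour(trips)
```

Here the 1AM group comes back with a count of 3 and no median at all, which is a more honest answer than a single confident-looking number.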
This data was only for September 2018. I should have pulled in the full 2017 data from the beginning, since a single month isn’t really enough to make many meaningful observations at this level of detail.
So I went back to the Ford GoBike website, grabbed the 2017 data, and created a union to pull it into my data set. Now I could (well, somewhat messily) see all of the 2017 ride data, as well as the September 2018 ride data. This would in some respects ruin our chances of getting ultra-clean comparisons between, say, stations (since different stations opened at different times) or months (since we were missing January through August 2018). But for my purposes (aggregating how busy popular “core” stations were, and checking out aggregate rider data) this only made things more useful.
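Outside Tableau, a union like this amounts to stacking rows from files that share the same columns. Here is a rough sketch using inline CSV text as a stand-in for the downloaded files. The column names and the `source` label are assumptions for illustration.

```python
import csv
import io

# Tiny stand-ins for the two downloads; real files would be read from disk.
CSV_2017 = (
    "start_time,start_station_name\n"
    "2017-08-01 08:05:00,Station A\n"
)
CSV_2018_09 = (
    "start_time,start_station_name\n"
    "2018-09-04 17:45:00,Station B\n"
    "2018-09-05 09:10:00,Station A\n"
)

def union_csvs(sources):
    """Stack rows from several CSV texts, tagging each row with its source."""
    rows = []
    for label, text in sources.items():
        for row in csv.DictReader(io.StringIO(text)):
            row["source"] = label  # remember which file the trip came from
            rows.append(row)
    return rows

combined = union_csvs({"2017": CSV_2017, "2018-09": CSV_2018_09})
```

Tagging each row with its source keeps the gap in coverage visible, so month-to-month comparisons can be filtered out later.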
Returning to the map armed with more data, we can see that rides that start at 5PM overwhelmingly cluster in (what I interpret to be) employment areas.
You’re probably not just getting off work if you’re hanging out at Alcatraz Landing. But on this map we can definitely see clusters around major employment areas — and if the two Caltrain bike stations were visible as one in this map, it would be by far the largest circle on the map.
I stared at the median age data again, trying to see what I could make of it, but I couldn’t feel confident inferring much of anything. Some of the most touristy spots (City Hall, Alcatraz Landing) did seem to correspond to some of the oldest median rider ages (38–39 years) when I looked at 5PM departures. When I removed the departure time filter and looked at all trips, the “oldest” stations all clustered along the Embarcadero (except for two along the end of the J line, not pictured).
The ‘youngest’ stations, on the other hand, were all in unremarkable residential areas. By far the most popular, Webster and O’Farrell, is in an area I’m not familiar with, but based on the local attractions (an AMC theater, YMCA, Safeway, Kabuki Springs, and lots of shopping) it’s not a tourist mecca.
I binned member ages into four-year bins. Tableau seems to label them by the minimum number, so bin “16” includes riders age 16–19, bin “20” is riders 20–23, etc.
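The same binning can be written as integer division: drop the remainder, then multiply back up, so each bin is labeled by its lowest age. This is a sketch with made-up ages. It only reproduces the 16, 20, 24, and so on labels because those happen to be multiples of the bin size.

```python
from collections import Counter

BIN_SIZE = 4  # four-year bins, as in the histogram

def age_bin(age, size=BIN_SIZE):
    """Label an age by the lowest age in its bin: 16-19 -> 16, 20-23 -> 20."""
    return (age // size) * size

# Hypothetical rider ages for illustration.
ages = [16, 19, 20, 23, 24, 31, 35]
histogram = Counter(age_bin(a) for a in ages)
```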
What have we learned?
My conclusions:
- Station activity appears to be disproportionately centered around employment centers (to some extent) and during commute hours (to a great extent). This would suggest that GoBike, while popular with tourists, definitely is getting plenty of use from commuters as well.
- There may be a weak correlation between rider age and resident status: perhaps younger riders are more likely to be SF residents commuting and going shopping, while older riders are more likely to be cruising the Embarcadero.
A couple caveats:
- Within the data, birthdates are self-reported and unverifiable, from what I can tell. There were also a good number of null values: over 10% of trips had no birthdate reported at all. I’d hypothesize these folks skew young and/or tech-savvy enough to realize they probably don’t have to enter that data, and perhaps are more likely to be tourists. I also got excited when I saw someone’s birthdate listed as 1888 (“Cool!”) until Patrick pointed out to me that that would mean that person was 130 years old.
When you include NULL birthdates, the histogram looks like this:
- GoBike’s own signup friction probably discourages a lot of tourists who might otherwise use the service. They push the monthly membership very visibly on their website, and it’s not always clear that other payment options are available. I didn’t even take my first trip until last month, since I didn’t realize that paying $2 for a single ride was even an option.
- A lot of ride data could be distorted by the boundaries of the GoBike service area. For example, when you see the shape of the ride frequency along Market Street, it’s good to remember that there are zero bikes for long stretches north of Market downtown. I’m not sure whether this would mean more trips along the edges (e.g. going on errands along the perimeter and returning the bike to the same station) or fewer (making fewer trips from the “edge” going “outwards”, since there’s nowhere to drop the bike off outside the service area). I’m more inclined towards the latter.
There are more caveats and limitations to this data set than I’m identifying here, but those are some things that come to mind.
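On the birthdate caveat specifically, a quick audit can separate missing values from implausible ones (like the 1888 rider) before any age analysis. This sketch assumes a birth year per trip, with `None` for missing. The 100-year age ceiling is an arbitrary threshold I picked, and the sample values are made up.

```python
from collections import Counter

def classify_birth_year(birth_year, ride_year=2018, max_age=100):
    """Classify a self-reported birth year as 'missing', 'implausible', or 'ok'."""
    if birth_year is None:
        return "missing"
    age = ride_year - birth_year
    if age < 0 or age > max_age:
        return "implausible"  # e.g. born in 1888 would be 130 in 2018
    return "ok"

# Hypothetical sample of reported birth years.
BIRTH_YEARS = [1985, None, 1990, 1888, 1975, None, 2001, 1969, 1993, 1982]

audit = Counter(classify_birth_year(y) for y in BIRTH_YEARS)
missing_share = audit["missing"] / len(BIRTH_YEARS)
```

Reporting the missing share alongside any age chart makes it clear how much of the ridership the chart leaves out.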
Potential fun future things to do:
- Find the most “male”, “female”, or “decline to respond” stations.
- Dig deeper on ways San Francisco, East Bay, and San Jose may be unique with regard to times of day, gender, age, etc. The data extends across all three regions, but for this project I filtered only to San Francisco.
- Slice by more detail in the future, hopefully. If Motivate captured and provided the riders’ billing postal code, for example, that could potentially allow us to make inferences along socioeconomic lines.
- Compare the most popular origin stations with the most popular destination stations.