Stops to Tracts: Transforming Data

As anyone who works regularly with large datasets will tell you, much of the work and time involved in analysis goes not to the analysis itself, but to evaluating, cleaning, organizing, and often transforming the data needed to conduct it. The phrase “garbage in, garbage out” is a cliché, but a true one: it’s impossible to identify what the data is telling us and make good data-driven decisions if we have bad data.

In the world of transit data, we often think spatially, and we have a lot of point-level data (e.g., bus stops or subway stations) and line data (e.g., bus routes or train lines). If we want to compare these data to area-level data (such as data from the US Census), we need a way to transform them so that the comparisons are valid. This could help us estimate, for example, how many people access stations from different areas or, conversely, where people exiting a station or bus could be expected to go. It could also let us compare data about stops or routes with areas, for example by estimating the “weighted average reliability” of bus service in an area served by multiple routes, rather than looking at each route individually.

We could simply sum up all the point data that is within the area boundary, but since transit routes and stops often follow major roads or historical rail rights-of-way (which themselves determine neighborhood boundaries), many transit routes and stations are near or directly on the boundaries of census areas. This is illustrated in the map below.

Map showing many transit stops along roads that also separate census tracts.

As you can see, many stops and stations are on or near a border of census tracts, so just selecting the points that are inside a tract could give you some misleading data. For Symphony station above, the GTFS point location assigned to the station is barely within tract 104.05, but as we know, the people entering and exiting the station would come from or go to any of the nearby tracts—including tract 104.03, which doesn’t border the station. In this case, we should really consider the “Symphony area” to be all the tracts in the vicinity of Symphony station.

A second option would be to create a buffer around the boundaries and allocate the data from any points that fall within this buffer equally between the two (or more) areas. This would probably provide a more accurate estimate for stations directly on the border of tracts like Symphony, but it may not help us evaluate stations such as Mass Ave, which, as you can see, is a bit away from the border of the tract. Also, the GTFS point is located in the center of the platform, not at the entrance to the station. This option also would not account for people who might travel from non-neighboring tracts to a station, such as those who might access either station from tract 104.03.

We thought we could do better. The rest of this post describes the methodology we developed and a way we were able to test its accuracy with real data from the MBTA Systemwide Passenger Survey.

The Weighted Allocation Method

In developing this method, we wanted to approximate the real origin and destination locations for people accessing a stop. These origins and destinations could be many places, but we thought the best nationally available datasets would be two products from the US Census: the ACS estimates of total population, and the LEHD Workplace Area Characteristics (WAC), which contains estimates of the number of payroll jobs in an area. Both contain data at the block level, though we used the block group as the base unit of analysis to account for potential errors in smaller geographies.

Our full methodology is more complicated, but the main steps are summarized below:

  1. Build the dataset in GIS software, ensuring that the population and total jobs data are joined to the census geographies so that you have a total number of residents and jobs in each census block group.
  2. Using GIS, find the weighted centroid of each census tract, weighted by the sum of jobs and population in each block group. Essentially, the centroid will be the “center of gravity” that is nearest the majority of jobs and population in the tract. We used the Mean Center tool in ArcMap to find these centroids.
  3. Using the Generate Near Table tool in ArcMap (or a similar tool in other software), generate a table with the Euclidean distances from each stop point to each of the weighted centroids (selecting a generous maximum distance of 1200 meters so as to keep the resulting table manageable).

At this point you should have a table listing each stop/tract pair within the (generous) maximum distance, the distance between the stop and the tract's weighted centroid, and the rank of each centroid for that stop, from closest to furthest.
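For readers without ArcMap, steps 2 and 3 can be approximated in a few lines of Python. This is only a minimal sketch, not the tooling we used: it assumes each block group is represented by a single point (for example its geometric centroid) in a projected coordinate system measured in meters, and the function names, field names, and sample coordinates are all hypothetical.

```python
from math import hypot


def weighted_mean_center(block_groups):
    """Mean center of block-group points, weighted by population + jobs."""
    total = sum(bg["population"] + bg["jobs"] for bg in block_groups)
    if total == 0:
        raise ValueError("no population or jobs to weight by")
    x = sum(bg["x"] * (bg["population"] + bg["jobs"]) for bg in block_groups) / total
    y = sum(bg["y"] * (bg["population"] + bg["jobs"]) for bg in block_groups) / total
    return (x, y)


def near_table(stops, centroids, max_distance=1200.0):
    """One row per stop/tract pair within max_distance, ranked nearest first."""
    rows = []
    for stop_id, (sx, sy) in stops.items():
        pairs = []
        for tract_id, (cx, cy) in centroids.items():
            distance = hypot(sx - cx, sy - cy)  # straight-line (Euclidean) distance
            if distance <= max_distance:
                pairs.append((distance, tract_id))
        pairs.sort()
        for rank, (distance, tract_id) in enumerate(pairs, start=1):
            rows.append(
                {"stop": stop_id, "tract": tract_id, "distance_m": distance, "rank": rank}
            )
    return rows


# Hypothetical inputs: block-group points (meters) grouped by tract, and one stop.
tracts = {
    "tract_a": [
        {"x": 0.0, "y": 0.0, "population": 100, "jobs": 0},
        {"x": 100.0, "y": 0.0, "population": 200, "jobs": 100},
    ],
    "tract_b": [{"x": 0.0, "y": 400.0, "population": 50, "jobs": 50}],
    "tract_c": [{"x": 3000.0, "y": 0.0, "population": 10, "jobs": 10}],
}
stops = {"stop_1": (75.0, 300.0)}

centroids = {tract: weighted_mean_center(bgs) for tract, bgs in tracts.items()}
table = near_table(stops, centroids)
for row in table:
    print(row)
```

In this toy data, tract_c is dropped because its centroid lies beyond the 1200-meter cutoff, and the remaining tracts are ranked by distance from the stop.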

To calculate the allocation of the stop-level data to the tracts, the following formula was used:

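$$\text{allocation}_i = \frac{T - d_i}{\sum_{n} (T - d_n)}$$

Where:

T = Walking distance threshold (the maximum walking distance you expect people to travel to access the transit stop)

d_i = Distance from the stop point to the weighted centroid of tract i

d_n = The same distance for each tract n whose weighted centroid is within T of the stop; the sum runs over those tracts only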

Here is an example with the data for Lechmere station:

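| Stop-Tract Pair Number | Station | Census Tract | Distance from Stop to Centroid (m) | Rank | Distance Below Threshold (m) | Sum of Distances Below Threshold for This Station (m) | Percentage of Data Allocated to Tract |
|---|---|---|---|---|---|---|---|
| 1 | Lechmere | 25017352102 | 280 | 1 | 525 | 1085 | 48% |
| 2 | Lechmere | 25017352101 | 500 | 2 | 305 | 1085 | 28% |
| 3 | Lechmere | 25017352200 | 664 | 3 | 140 | 1085 | 13% |
| 4 | Lechmere | 25017352300 | 689 | 4 | 116 | 1085 | 11% |
| 5 | Lechmere | 25025040401 | 1000 | 5 | 0 | 1085 | 0% |
| 6 | Lechmere | 25017351500 | 1045 | 6 | 0 | 1085 | 0% |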

For this station, the allocation for stop-tract pair 1 is 525/1085, or 0.48, meaning that this method expects 48% of the ridership at Lechmere to have originated in tract 25017352102, and likewise expects 48% of destinations to be in that tract. So, for every 100 passengers who walk to Lechmere, our methodology predicts that 48 of them came from this tract. For stop-tract pairs 5 and 6, the centroid is farther from the stop than our threshold of 804 m (0.5 miles), so they are ignored in the sum.
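The Lechmere calculation can be reproduced with a short Python sketch. The function name is hypothetical, and the threshold is assumed to be exactly half a mile (804.672 m) rather than the rounded 804 m, so the intermediate values can differ from the table by a meter or so while the rounded percentages stay the same.

```python
HALF_MILE_M = 804.672  # assumed threshold: 0.5 miles expressed in meters


def allocation_shares(distances, threshold=HALF_MILE_M):
    """Map each tract to its share of a stop's data.

    Each tract's weight is (threshold - distance); tracts whose centroid is
    at or beyond the threshold get a weight of zero and drop out of the sum.
    """
    below = {tract: max(threshold - d, 0.0) for tract, d in distances.items()}
    total = sum(below.values())
    if total == 0:
        # No centroid within walking distance: nothing can be allocated.
        return {tract: 0.0 for tract in distances}
    return {tract: weight / total for tract, weight in below.items()}


# Distances (m) from Lechmere to each tract's weighted centroid, from the table.
lechmere = {
    "25017352102": 280,
    "25017352101": 500,
    "25017352200": 664,
    "25017352300": 689,
    "25025040401": 1000,
    "25017351500": 1045,
}

shares = allocation_shares(lechmere)
for tract, share in shares.items():
    print(tract, f"{share:.0%}")
```

Because the shares for a stop always sum to 1 whenever at least one centroid is within the threshold, multiplying them by any stop-level total (boardings, for instance) distributes that total across tracts without gaining or losing riders.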

How Does It Fare?

We validated each of the above methodologies against real-life data from the MBTA systemwide survey. The survey asked passengers what mode they used to access their transit trip, and also asked them where they began that segment of the journey (usually their home or work location). We created datasets for each of the methods and, for the 53 stations for which we had sufficient origin data, compared the surveyed origins with the origin locations that each method predicted. We then calculated the mean squared error of each method:

| Method | MSE |
|---|---|
| Stop Contained in Tract | 5.60 |
| 50 Meter Buffer Even Allocation | 4.34 |
| Distance to Population & Employment Weighted Mean | 3.11 |
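For reference, a mean squared error comparison of this kind can be sketched as follows. The exact units and aggregation behind the figures above are not spelled out here, so this sketch makes its own assumptions: shares are expressed as fractions, every station-tract pair counts equally, and a pair missing from one side is treated as a share of zero. The station names, tract names, and values are invented for illustration and will not reproduce the table.

```python
def mean_squared_error(predicted, observed):
    """MSE over every (station, tract) pair present in either mapping.

    A pair that appears on only one side is compared against 0.0.
    """
    keys = set(predicted) | set(observed)
    if not keys:
        raise ValueError("nothing to compare")
    squared_errors = [
        (predicted.get(key, 0.0) - observed.get(key, 0.0)) ** 2 for key in keys
    ]
    return sum(squared_errors) / len(squared_errors)


# Hypothetical shares: what a method predicts vs. what a survey reports.
predicted = {
    ("station_x", "tract_1"): 0.5,
    ("station_x", "tract_2"): 0.5,
}
observed = {
    ("station_x", "tract_1"): 0.4,
    ("station_x", "tract_2"): 0.4,
    ("station_x", "tract_3"): 0.2,
}

mse = mean_squared_error(predicted, observed)
print(f"MSE: {mse:.4f}")
```

Running the same function over the predictions from each method gives directly comparable scores, with lower values indicating predictions closer to the surveyed origins.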

Using the weighted allocation method, which had the lowest error of the three, we were able to approximate fairly accurately where passengers boarding at a given location actually originated. We have used this method to transform stop-level data to area-level data, which was especially helpful for our current project examining ridership changes, as it allowed us to better compare Census data with data from our own datasets, which are usually point- or line-based.
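As an illustration of that stop-to-area transformation, here is a minimal Python sketch that rolls stop-level totals up to tracts once each stop has a set of allocation shares. The function name, stop and tract identifiers, and numbers are hypothetical.

```python
from collections import defaultdict


def stops_to_tracts(stop_values, stop_shares):
    """Distribute each stop's value across tracts and sum the results by tract."""
    tract_totals = defaultdict(float)
    for stop, value in stop_values.items():
        # A stop with no shares (no centroid in range) contributes nothing.
        for tract, share in stop_shares.get(stop, {}).items():
            tract_totals[tract] += value * share
    return dict(tract_totals)


# Hypothetical stop-level boardings and per-stop allocation shares.
stop_values = {"stop_a": 1000.0, "stop_b": 400.0}
stop_shares = {
    "stop_a": {"tract_1": 0.75, "tract_2": 0.25},
    "stop_b": {"tract_2": 0.5, "tract_3": 0.5},
}

tract_totals = stops_to_tracts(stop_values, stop_shares)
for tract, total in sorted(tract_totals.items()):
    print(tract, total)
```

Here tract_2 receives riders from both stops, which is exactly the case that the simpler "stop contained in tract" approach cannot represent.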