Potholes of New York City

Pothole on 2nd Avenue

Aurora Koch-Pongsema

Project for CMP 464: Data Science at Lehman College, Spring 2016

Motivation

Drivers, bikers, and pedestrians all hate potholes, and with fluctuating temperatures, heavy traffic, and a constant battle for public resources, the City of New York has to pick and choose which potholes to fix as well as when it can fix them. What affects how quickly a pothole is repaired? Are all boroughs treated equally? Do other factors, such as collisions related to potholes or distance from governmental offices, affect repair time? This project explores these questions using data sets from NYC OpenData, analyzed for statistical distribution and mapped using Bokeh. It was completed for Computer Science 464 (Special Topics): Data Science, taught by Professor Katherine St. John at Lehman College in Spring 2016.

Data Sets

Primary: 311 Service Requests from 2010 to Present, downloaded April 17, 2016.

This data set contains all the reports, requests, and complaints from New Yorkers who contact 311 through any source. For this project, I narrowed it to potholes by keeping only records with a created date in 2015, a complaint type of 'Street Condition', and a descriptor of 'Pothole'. Because this data set is constantly updated with new entries and with status changes to existing entries, the status of some pothole reports created in 2015 may have changed since the download date. Within this set, I focused on the created date, closed date, status, borough, and location (latitude and longitude) of each report.
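
The filter described above can be sketched in plain Python. This is only an illustration, not the project's actual code: the column names ('Created Date', 'Complaint Type', 'Descriptor', 'Borough') and the timestamp format are assumptions about the CSV export, and the sample rows are made up.

```python
from datetime import datetime

# Assumed timestamp format of the 311 CSV export, e.g. "04/17/2015 09:30:00 AM".
DATE_FORMAT = "%m/%d/%Y %I:%M:%S %p"


def is_2015_pothole(row):
    """True for a 'Street Condition' / 'Pothole' report created in 2015."""
    created = datetime.strptime(row["Created Date"], DATE_FORMAT)
    return (
        created.year == 2015
        and row["Complaint Type"] == "Street Condition"
        and row["Descriptor"] == "Pothole"
    )


def filter_potholes(rows):
    """Keep only the 2015 pothole reports from an iterable of row dicts."""
    return [row for row in rows if is_2015_pothole(row)]


# Made-up sample rows, for illustration only.
sample_rows = [
    {"Created Date": "03/02/2015 08:15:00 AM", "Complaint Type": "Street Condition",
     "Descriptor": "Pothole", "Borough": "QUEENS"},
    {"Created Date": "11/20/2014 01:00:00 PM", "Complaint Type": "Street Condition",
     "Descriptor": "Pothole", "Borough": "BRONX"},
    {"Created Date": "06/11/2015 04:45:00 PM", "Complaint Type": "Street Condition",
     "Descriptor": "Cave-in", "Borough": "BROOKLYN"},
    {"Created Date": "07/04/2015 11:30:00 PM", "Complaint Type": "Noise",
     "Descriptor": "Loud Music/Party", "Borough": "MANHATTAN"},
]

potholes = filter_potholes(sample_rows)
```

With the real download, the same function could be fed rows from csv.DictReader, which yields one dict per record keyed by the header row.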

Secondary: NYPD Motor Vehicle Collisions, downloaded April 17, 2016.

This data set contains the 'Details of Motor Vehicle Collisions in New York City provided by the Police Department(NYPD).' Of particular interest in this data set are the 'Contributing Factor[s]' given for each collision, some of which specify 'Pavement Defective' as a reason for the collision. The location, date, and estimated time of the accidents are also included.

Techniques and Analysis

I combined mapping with statistical analysis to reduce the dimensionality of the data and look for extreme outliers, first examining repair times. I extracted the time between the creation of a pothole complaint and its closing date to see how long it took for a pothole to be officially resolved (if it was) and its report closed. I then examined the mean, standard deviation, and interquartile range to see trends, and plotted the appropriate data both in 2D plots (comparing time lengths) and as latitude and longitude points for the outliers on a basemap background. In this first analysis, I reduced the data by considering only potholes with a status of 'Closed' and by discarding potholes with negative or zero repair times. Here, I plotted the created date of every successfully closed pothole report against its closed date. The first, blue graph shows all data with a valid location and a status of 'Closed'. The second, red graph omits the tuples with zero repair times.
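
As a rough sketch of the repair-time calculation described above (not the project's actual code), the following computes the days between created and closed timestamps, keeps only closed reports with positive repair times, and summarizes them. The timestamp format, the tuple layout, and the sample values are assumptions; the sample standard deviation is used here, which may differ from the estimator used in the project.

```python
import statistics
from datetime import datetime

# Assumed timestamp format of the 311 CSV export.
DATE_FORMAT = "%m/%d/%Y %I:%M:%S %p"


def repair_days(created, closed):
    """Days elapsed between the created and closed timestamps."""
    delta = datetime.strptime(closed, DATE_FORMAT) - datetime.strptime(created, DATE_FORMAT)
    return delta.total_seconds() / 86400


def closed_repair_times(reports):
    """Repair times for closed reports, discarding zero or negative durations."""
    times = []
    for created, closed, status in reports:
        if status != "Closed" or not closed:
            continue
        days = repair_days(created, closed)
        if days > 0:
            times.append(days)
    return times


def summarize(days):
    """Mean, sample standard deviation, and interquartile range of repair times."""
    q1, _, q3 = statistics.quantiles(days, n=4)
    return {
        "mean": statistics.mean(days),
        "std": statistics.stdev(days),
        "iqr": q3 - q1,
    }


# Made-up (created, closed, status) tuples, for illustration only.
sample_reports = [
    ("01/05/2015 09:00:00 AM", "01/07/2015 09:00:00 AM", "Closed"),
    ("02/01/2015 12:00:00 PM", "02/05/2015 12:00:00 PM", "Closed"),
    ("03/10/2015 08:00:00 AM", "03/16/2015 08:00:00 AM", "Closed"),
    ("04/01/2015 08:00:00 AM", "04/09/2015 08:00:00 AM", "Closed"),
    ("05/01/2015 08:00:00 AM", "05/01/2015 08:00:00 AM", "Closed"),  # zero repair time
    ("06/01/2015 08:00:00 AM", "", "Open"),                          # never closed
]

times = closed_repair_times(sample_reports)
summary = summarize(times)
```
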
The mean time for a pothole to be repaired once reported in 2015 was approximately 4.7 days, with a standard deviation of seven days. For each borough, the breakdown of time to be fixed is as follows:

Potholes By Borough

Time Open     Manhattan   The Bronx   Brooklyn   Queens   Staten Island
>11.7 Days         1468        1414       4391      222            3074
>18.8 Days          587         522       1567        5            1434
>25.9 Days          103          64        161        4             352
All               10991        8117      20992    24319           12429

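
A breakdown like the one in this table could be produced along these lines. This is a minimal sketch rather than the project's code: the thresholds are the three cutoffs from the table, while the (borough, repair days) record layout and the sample values are assumptions made for illustration.

```python
from collections import Counter

# Cutoffs in days, taken from the table: mean plus one, two, and three standard deviations.
THRESHOLDS = [11.7, 18.8, 25.9]


def counts_above(records, threshold):
    """Count reports per borough whose repair time exceeds the threshold."""
    return Counter(borough for borough, days in records if days > threshold)


# Made-up (borough, repair_days) records, for illustration only.
sample_records = [
    ("QUEENS", 3.0),
    ("QUEENS", 12.5),
    ("BROOKLYN", 20.0),
    ("BROOKLYN", 30.0),
    ("STATEN ISLAND", 26.0),
    ("MANHATTAN", 1.5),
]

breakdown = {t: counts_above(sample_records, t) for t in THRESHOLDS}
totals = Counter(borough for borough, _ in sample_records)  # the "All" row
```
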
These trends are illustrated in the boxplots below. The first boxplot shows the total time between each created and closed date. The second boxplot trims the extreme outliers and shows only data within three standard deviations of the mean, that is, potholes that were repaired less than 25.9 days after the pothole report was created.
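
The trimming step for the second boxplot can be sketched as follows. It assumes the cutoff is computed from the sample mean and sample standard deviation of the repair times themselves; the helper name and the sample values are hypothetical.

```python
import statistics


def within_k_std(days, k=3):
    """Keep values no more than k sample standard deviations from the mean."""
    mean = statistics.mean(days)
    std = statistics.stdev(days)
    return [d for d in days if abs(d - mean) <= k * std]


# Made-up repair times: nineteen quick repairs and one extreme outlier.
sample_days = [1.0] * 19 + [100.0]

trimmed = within_k_std(sample_days)
```

Because repair times cannot be negative, only the upper bound matters in practice, which is why the text can express the cutoff as a single number of days.
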
Location data also allows us to examine where within the five boroughs these potholes occur. The following maps show the four subsets discussed above, with the first being all potholes, the second being those more than one standard deviation above the mean (over 11.7 days), the bottom left being those more than two standard deviations above the mean (over 18.8 days), and the bottom right being those more than three standard deviations above the mean (over 25.9 days). Of particular interest are the potholes in Queens, where only five potholes remained unrepaired past 18.8 days, and only four past the 25.9-day mark.
For the second data set, I examined the reported reasons for collisions and extracted those that had a valid location and cited 'Pavement Defective', that is, collisions the NYPD attributed to problems with the road surface. While a number of collisions had no borough specified, most of the rest fell in four of the boroughs, with Staten Island the outlier at only nine identified pothole-related crashes.

Pothole-Related Collisions By Borough

             Manhattan   The Bronx   Brooklyn   Queens   Staten Island   Unlabelled   Total
Collisions          41          28         63       39               9           66     246
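
The selection and per-borough count described above might look like the sketch below. It is not the project's code: the column names ('CONTRIBUTING FACTOR VEHICLE 1' through 5, 'BOROUGH', 'LATITUDE', 'LONGITUDE') are assumptions about the collisions CSV, a location is treated as valid when both coordinates parse as numbers, and the sample rows are made up.

```python
from collections import Counter

# Assumed names of the contributing-factor columns in the collisions CSV.
FACTOR_COLUMNS = [f"CONTRIBUTING FACTOR VEHICLE {i}" for i in range(1, 6)]


def cites_defective_pavement(row):
    """True if any contributing factor is 'Pavement Defective'."""
    return any(
        (row.get(col) or "").strip().lower() == "pavement defective"
        for col in FACTOR_COLUMNS
    )


def has_valid_location(row):
    """True if both latitude and longitude parse as numbers (an assumed rule)."""
    try:
        float(row.get("LATITUDE") or "")
        float(row.get("LONGITUDE") or "")
    except ValueError:
        return False
    return True


def collisions_by_borough(rows):
    """Count pavement-related collisions per borough; blank boroughs are 'Unlabelled'."""
    counts = Counter()
    for row in rows:
        if cites_defective_pavement(row) and has_valid_location(row):
            counts[row.get("BOROUGH") or "Unlabelled"] += 1
    return counts


# Made-up sample rows, for illustration only.
sample_collisions = [
    {"BOROUGH": "BROOKLYN", "LATITUDE": "40.68", "LONGITUDE": "-73.95",
     "CONTRIBUTING FACTOR VEHICLE 1": "Pavement Defective"},
    {"BOROUGH": "", "LATITUDE": "40.75", "LONGITUDE": "-73.99",
     "CONTRIBUTING FACTOR VEHICLE 1": "Unspecified",
     "CONTRIBUTING FACTOR VEHICLE 2": "Pavement Defective"},
    {"BOROUGH": "QUEENS", "LATITUDE": "", "LONGITUDE": "",
     "CONTRIBUTING FACTOR VEHICLE 1": "Pavement Defective"},
    {"BOROUGH": "BROOKLYN", "LATITUDE": "40.70", "LONGITUDE": "-73.94",
     "CONTRIBUTING FACTOR VEHICLE 1": "Driver Inattention/Distraction"},
]

collision_counts = collisions_by_borough(sample_collisions)
```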

Collisions due to potholes, mapped:

Citations

Code and Inspiration

Project GitHub Repository

Grus, Joel. "Data Science from Scratch: First Principles with Python." O'Reilly Media, 30 April 2015. Original repository of code at https://github.com/joelgrus/data-science-from-scratch with additional code downloaded and used as of April/May 2016.

Data Sets

"311 Service Requests from 2010 to Present." City of New York, NYC OpenData. https://data.cityofnewyork.us/Social-Services/311-Service-Requests-from-2010-to-Present/erm2-nwe9 Web. 17 April 2016.

"NYPD Motor Vehicle Collisions." City of New York, NYC OpenData. https://nycopendata.socrata.com/Public-Safety/NYPD-Motor-Vehicle-Collisions/h9gi-nx95 Web. 17 April 2016.