Federal, state, and local governments have started making more of their data publicly available as part of the so-called open data movement. Many institutions now support Application Programming Interfaces (APIs), which let researchers and other interested users access information, often in large amounts, programmatically.
DataSF is the open data portal for the city (and county) of San Francisco.
In this notebook, I will access incident reports from the San Francisco Police Department (SFPD), focusing on traffic-related entries. I will show how to plot the data using the Seaborn library. I will also plot the traffic incidents for the past three months on a Leaflet map.
Using the requests module, we query the API endpoint for the SFPD data. The limit parameter specifies the maximum number of records to return. It is important to read the API documentation for information on limits, appropriate uses, and the available options. The documentation for this specific data source can be found here. Note that this data is updated daily, so your results will vary.
The data will be returned in a format known as JSON, which stands for JavaScript Object Notation, and will be stored in an object called response. Depending on your connection, this may take a few minutes.
import json
import requests
import pandas as pd
url = 'https://data.sfgov.org/resource/tmnf-yvry.json?$limit=50000'
response = requests.get(url)
Now that the raw data is in the response variable, we can load it into a Pandas DataFrame. Here, we use json.loads() to parse the response text into Python objects. (requests also provides a response.json() shortcut that does the same thing.)
data = json.loads(response.text)
df = pd.DataFrame(data)
df
This DataFrame includes information on the location, time, type of incident, and the police department district that responded.
In this section, we want to begin processing the data so that it's in a standardized format that we can use.
First, we transform the values in the included date variable to a Python date. While the documentation is not clear on the type of data that is returned (this might also depend on the access method), it looks like the values represent the number of seconds since the epoch, which is "the point where time starts." On Unix-based systems, such as the one I'm currently on, the epoch is January 1, 1970. For more information on times in Python, see: Time access and conversions.
To determine the epoch on your system, call the gmtime() function in the time module with an argument of 0, as shown below.
import time
time.gmtime(0)
Use the .fromtimestamp() method to create a date variable from the number of seconds since the epoch.
import datetime
df['real_date'] = df['date'].map(lambda x: datetime.date.fromtimestamp(float(x)))
Because the values in the time variable are in a string format with the hours and minutes separated by a colon (:), we split on that character and create a time object.
df['real_time'] = df['time'].map(lambda x: datetime.time(int(x.split(':')[0]), int(x.split(':')[1])))
Here, we convert the data to lowercase where possible. We use a try/except because not all columns contain strings, and non-string values cannot be lowercased.
for col in df.columns:
    try:
        df[col] = df[col].map(lambda x: x.lower())
    except AttributeError:
        pass
In this section, we create a new DataFrame containing only traffic incidents: rows whose descript column contains the word "traffic." Because the values in the location column are of type dict, and because its contents are already represented in separate variables in the DataFrame, we drop that column. By default, each row keeps its original index, which results in non-sequential index values, so we reset them.
traffic = df[df['descript'].str.contains("traffic")].copy()
del traffic['location']
traffic = traffic.drop_duplicates().reset_index(drop=True)
traffic
Now that the data is clean, we can begin plotting. We'll create three graphs: incidents by day of the week, incidents by date, and incidents by police district.
To make plots using matplotlib, you must first enable IPython's matplotlib mode. To do this, run the %matplotlib magic command, which displays plots in the current notebook.
For more information, see: Plotting with Matplotlib.
%matplotlib inline
Here, we want to count up the number of incidents for each day of the week. We do this using the .groupby() method.
# Group by
dayofweek = traffic.groupby('dayofweek')['dayofweek'].count()
# Put into a DataFrame
dow_df = pd.DataFrame(dayofweek)
# Rename column
dow_df.columns = ['count']
# Create a new column based on the day of the week
dow_df['dayofweek'] = dow_df.index
# Capitalize the first letter of the weekday name
# Capitalize the first letter of each weekday name
dow_df['dayofweek'] = dow_df['dayofweek'].str.title()
# Create a dictionary with the order of the days
weekdays = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
mapping = {day: i for i, day in enumerate(weekdays)}
key = dow_df['dayofweek'].map(mapping)
# Sort the DataFrame
dow_df = dow_df.iloc[key.argsort()]
# Drop the dayofweek index; reset the index
dow_df = dow_df.reset_index(drop=True)
dow_df['order'] = dow_df.index
dow_df
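An alternative to the mapping-and-argsort approach above is pandas' ordered Categorical type, which lets sort_values order rows by weekday directly. A minimal sketch on synthetic counts (the values below are made up for illustration):

```python
import pandas as pd

weekdays = ['Monday', 'Tuesday', 'Wednesday', 'Thursday',
            'Friday', 'Saturday', 'Sunday']

# Synthetic counts, deliberately out of weekday order
sketch = pd.DataFrame({
    'dayofweek': ['Friday', 'Monday', 'Sunday', 'Wednesday'],
    'count': [12, 9, 7, 11],
})

# An ordered Categorical makes sort_values respect weekday order
sketch['dayofweek'] = pd.Categorical(sketch['dayofweek'],
                                     categories=weekdays, ordered=True)
sketch = sketch.sort_values('dayofweek').reset_index(drop=True)
print(sketch['dayofweek'].tolist())
```

This avoids building the mapping dictionary and the intermediate key column by hand.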
import seaborn as sns
rc={"figure.figsize": (10, 8), 'axes.labelsize': 18, 'font.size': 18, 'axes.titlesize': 18, 'xtick.labelsize': 12, 'ytick.labelsize': 12}
sns.set(rc=rc)
sns.barplot(x='dayofweek', y='count', data=dow_df, palette="Paired", order=weekdays)
There don't seem to be drastic differences in the number of incidents across the days of the week. I expected Friday and Saturday to be the highest, but this sample shows Tuesday and Wednesday having just as many, if not more, incidents. My assumption about weekend incidents may still hold for nighttime hours specifically; this barplot does not control for time of day.
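One way to control for time of day would be to extract the hour from each incident's time and tally incidents per hour. A minimal sketch, using a few synthetic datetime.time values in place of the real_time column:

```python
import datetime
import pandas as pd

# Synthetic stand-ins for the real_time column
times = pd.Series([datetime.time(23, 15), datetime.time(2, 40),
                   datetime.time(23, 5), datetime.time(14, 0)])

# Extract the hour from each time and count incidents per hour
hours = times.map(lambda t: t.hour)
by_hour = hours.value_counts().sort_index()
print(by_hour)
```

Applied to the full traffic DataFrame, the same grouping could separate daytime from nighttime incidents.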
Next, we want to see if there are any patterns in the number of incidents across days in the sample.
# Group by
bydate = traffic.groupby('real_date')['real_date'].count()
# Put into a DataFrame
bydate_df = pd.DataFrame(bydate)
# Rename column
bydate_df.columns = ['count']
# Create a new column based on the date
bydate_df['date'] = bydate_df.index
# Reset the index
bydate_df = bydate_df.reset_index(drop=True)
rc={"figure.figsize": (10, 8), 'axes.labelsize': 12, 'font.size': 12, 'legend.fontsize': 12.0, 'axes.titlesize': 12, 'xtick.labelsize': 0}
sns.set(rc=rc)
bydate_plot = sns.barplot(x='date', y='count', data=bydate_df, palette="Blues")
There are a few days with exceptionally few traffic incidents, especially more recently, but there doesn't seem to be a discernible overall pattern. The plot does suggest greater variance in the daily counts closer to the current date. However, it's important to note the scale: the highest daily count during the past three months is around 19 cases, and the average appears to be close to 5 or 6 per day.
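The impression of growing variance could be checked with a rolling standard deviation over the daily counts. A minimal sketch on synthetic counts (the series and the window size are arbitrary choices for illustration):

```python
import pandas as pd

# Synthetic daily incident counts; the later values swing more widely
counts = pd.Series([5, 6, 5, 7, 6, 4, 12, 2, 15, 1])

# Rolling standard deviation makes changes in day-to-day
# variability visible; the first window-1 values are NaN
rolling_std = counts.rolling(window=5).std()
print(rolling_std.round(2))
```

Plotting rolling_std against the dates would show whether the spread of daily counts actually increases toward the present.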
Finally, we want to look at the number of incidents in each district.
# Group by
pddistrict = traffic.groupby('pddistrict')['pddistrict'].count()
# Put into a DataFrame
pd_df = pd.DataFrame(pddistrict)
# Rename column
pd_df.columns = ['count']
# Create a new column based on the district
pd_df['pddistrict'] = pd_df.index
# Reset the index
pd_df = pd_df.reset_index(drop=True)
# Capitalize the first letter of each district name
pd_df['pddistrict'] = pd_df['pddistrict'].str.title()
import seaborn as sns
rc={"figure.figsize": (10, 8), 'axes.labelsize': 18, 'font.size': 18, 'axes.titlesize': 18, 'xtick.labelsize': 12, 'ytick.labelsize': 12}
sns.set(rc=rc)
sns.barplot(x='pddistrict', y='count', data=pd_df, palette="Paired")
While it's informative to know about where traffic incidents are occurring, it would be useful to have context. For example, does the Richmond area have very few traffic incidents because there are few drivers there or because it's a safe neighborhood? Geographic scale can also skew these results. If the Mission district is much larger than the other districts, the higher number of incidents may simply be due to that fact.
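One simple way to add that context would be to normalize the counts, for example by district area. A minimal sketch; the counts and areas below are made-up placeholders, not real figures for these districts:

```python
import pandas as pd

# Hypothetical counts and areas (square miles) -- illustrative only
districts = pd.DataFrame({
    'pddistrict': ['Mission', 'Richmond'],
    'count': [120, 30],
    'area_sq_mi': [6.0, 3.0],   # placeholder values, not real data
})

# Incidents per square mile gives a density comparable across districts
districts['per_sq_mi'] = districts['count'] / districts['area_sq_mi']
print(districts)
```

Normalizing by traffic volume or population, if such data were available, would address the "few drivers vs. safe neighborhood" question more directly.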
Now, we use the data to create a map of traffic incidents in San Francisco.
import json
geo_data = {
    'type': 'FeatureCollection',
    'features': []
}
for i in traffic.index:
    if traffic['y'][i]:
        # Each incident is a GeoJSON "feature"
        feature = {
            'type': 'Feature',
            'geometry': {
                "type": "Point",
                "coordinates": [float(traffic['x'][i]), float(traffic['y'][i])]
            },
            # A feature's "properties" become attribute columns in GIS
            'properties': {
                'day': traffic['dayofweek'][i],
                'district': traffic['pddistrict'][i],
                'resolution': traffic['resolution'][i],
                'address': traffic['address'][i]
            }
        }
        # Add the feature to the GeoJSON wrapper
        geo_data['features'].append(feature)
with open('sftraffic.geojson', 'w') as f:
    json.dump(geo_data, f, indent=2)
print(len(geo_data['features']), 'geotagged entries saved to file')
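Before loading the file into Leaflet, it's worth sanity-checking that the saved GeoJSON round-trips cleanly. A minimal sketch with a single hand-built feature (the coordinates and properties are illustrative, and the file is written to a temporary directory):

```python
import json
import os
import tempfile

# A one-feature FeatureCollection, mirroring the structure built above
geo_data = {
    'type': 'FeatureCollection',
    'features': [{
        'type': 'Feature',
        'geometry': {'type': 'Point',
                     'coordinates': [-122.42, 37.77]},
        'properties': {'district': 'mission'},
    }],
}

# Write the GeoJSON out, then read it back and inspect the structure
path = os.path.join(tempfile.mkdtemp(), 'check.geojson')
with open(path, 'w') as f:
    json.dump(geo_data, f, indent=2)
with open(path) as f:
    loaded = json.load(f)
print(loaded['type'], len(loaded['features']))
```

If the reloaded object has the expected type, feature count, and coordinates, the file should display correctly on the map.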