Feature Engineering for Car Crash Prediction
A Machine Learning Tour of Colorado
To build a predictive model, we need to decide which features to consider, which to engineer, and how they fit into the model. For car accidents, we can roughly break them into a few categories:
- Human Factors - anything related to humankind's influence on car accidents, a very long list.
- Physical Factors - properties intrinsic to a point on the road.
- Environmental Factors - factors that change the crash risk in certain circumstances.
Remember, a feature in machine learning terms is just a number or a list of numbers. In the end, we want to provide as many features as possible that could help inform a predictive model. During the modeling phase, we'll work with these features and whittle them down to the set that produces the most useful model. Let's motivate some of these features in the following tour.
For many major roads, a state or local DOT records traffic as annual average daily traffic (AADT). Sometimes higher-resolution traffic counts are available as well. In addition, population density can describe population centers and add value for roads that don't have a measured AADT or other traffic counts.
There are other variables that we may want to use to enrich our model, such as this layer, which shows where a large number of people bike to work. This may help a model identify zones where bicycle-related incidents are more likely.
There are, of course, numerous possible data sources to consider for a model like this. Can you think of more?
Once we've enriched our roads with relevant information, we can begin to look more at other types of features. We'll focus on geometric and contextual features.
If we look at a road out of context, there are some things we can discover from the feature's attribute table:
- Speed Limit
- Average Traffic Counts
- Number of Lanes
- Shoulders
- Surface Type
- ...
The list goes on. Curvature is the first variable we'll mention that has to be calculated during the feature engineering process, and it is certainly an impactful feature. This road appears very curvy, but could you tell the difference between this road and...
...this road? This road has a nearly identical speed limit, lane count, and many other properties. It's hard to say which is more hazardous. But when you include its geographic context...
The story gets a whole lot more interesting. Now we can analyze attributes such as road slope and elevation. Let's return to that first road again to compare.
Clearly there is a huge difference between these two roads. This road has far fewer crashes on it, and it's no surprise.
But a machine learning model wouldn't be able to tell the difference if you only supplied the properties of the road itself. We need to engineer new features to build a model with true predictive power.
Both of these roads are away from cities, so using something like population density is unlikely to help much.
But how can we actually measure curvature? There are quite a few ways, but the one we've chosen, mostly because it is simple to calculate, is the inverse curve radius.
Imagine placing a circle at every point on the road and adjusting its radius so that the circle fits the road as closely as possible. We do this at every triplet of points along the polyline and we are left with the following result. Points with a small radius have high curvature.
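To make the idea concrete, here is a minimal sketch of the inverse curve radius for a polyline. It assumes the vertices are (x, y) pairs in a projected coordinate system with linear units such as meters, and it uses the circle that passes exactly through each triplet of consecutive points. The function names are hypothetical and this is not necessarily how the calculation was done for this story map.

```python
import math

def inverse_radius(p1, p2, p3):
    """Curvature (1 / R) of the circle through three (x, y) points.

    Uses the circumradius identity R = a*b*c / (4 * area), so
    1 / R = 4 * area / (a * b * c). Collinear points return 0.0.
    """
    a = math.dist(p1, p2)
    b = math.dist(p2, p3)
    c = math.dist(p1, p3)
    if a == 0 or b == 0 or c == 0:
        return 0.0  # repeated vertex: no circle is defined, treat as straight
    # The 2D cross product is twice the signed area of the triangle.
    cross = (p2[0] - p1[0]) * (p3[1] - p1[1]) - (p2[1] - p1[1]) * (p3[0] - p1[0])
    return 2.0 * abs(cross) / (a * b * c)

def polyline_curvature(points):
    """Inverse radius at every interior vertex of a polyline."""
    return [
        inverse_radius(points[i - 1], points[i], points[i + 1])
        for i in range(1, len(points) - 1)
    ]

# Three points on a circle of radius 10 give a curvature of about 0.1.
bend = [(10.0, 0.0), (0.0, 10.0), (-10.0, 0.0)]
print(polyline_curvature(bend))
```

A per-vertex list like this is usually summarized per road segment (for example, its maximum or mean) before it is handed to a model. Real road data can have unevenly spaced or noisy vertices, so resampling the polyline first may be worthwhile.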
Let's look at another example, one of my favorite locations in Colorado, the mountain town of Ouray. Clearly this is a very curvy road, but this is a small town with fairly low traffic. You miss the greater context without looking at its surrounding geography.
Not only is this a beautiful view, but there is also more content we can use for feature engineering. The road in the distance has steep drop-offs. This could dramatically change the way people drive on the road, and it certainly raises the fatality rate if a vehicle leaves the road. Falling rocks could provide additional road hazards.
We aren't necessarily able to account for all of the possibilities, but using a digital terrain model, we can use our imagination when adding new features.
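As one example of a terrain-derived feature, the sketch below computes the percent grade of each road segment. It assumes an elevation has already been sampled from the digital terrain model at every vertex (that sampling step is not shown), that coordinates and elevations share the same linear unit, and that the function name is hypothetical.

```python
import math

def segment_grades(points, elevations):
    """Percent grade (rise over run * 100) for each polyline segment.

    points:     list of (x, y) vertices in projected units (e.g. meters)
    elevations: elevation at each vertex, in the same linear unit
    """
    if len(points) != len(elevations):
        raise ValueError("points and elevations must have the same length")
    grades = []
    for p, q, z0, z1 in zip(points, points[1:], elevations, elevations[1:]):
        run = math.dist(p, q)  # horizontal length of the segment
        # A zero-length segment has no defined grade; report it as flat.
        grades.append(0.0 if run == 0 else 100.0 * (z1 - z0) / run)
    return grades

road = [(0.0, 0.0), (100.0, 0.0), (200.0, 0.0)]
heights = [1000.0, 1005.0, 1000.0]
print(segment_grades(road, heights))  # [5.0, -5.0]
```

The same pattern could be extended to other terrain features, such as the elevation difference between the road and the ground a short distance to either side as a rough indicator of drop-offs.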
Clearly, we're missing the greater context if we're just looking at the roads...
...or ignoring dynamic conditions, like the time of day.
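Time of day is a simple example of a dynamic feature. One common way to encode it, sketched here as an assumption rather than the approach used in this story map, is to map the hour onto a circle so that 23:00 and 00:00 end up close together instead of 23 units apart. The function name is hypothetical.

```python
import math

def encode_hour(hour):
    """Encode an hour of day (0-24, fractional allowed) as (sin, cos)."""
    angle = 2.0 * math.pi * (hour % 24) / 24.0
    return math.sin(angle), math.cos(angle)

midnight = encode_hour(0)
late_evening = encode_hour(23)
noon = encode_hour(12)

# 23:00 is much closer to midnight than noon is in the encoded space.
print(math.dist(midnight, late_evening) < math.dist(midnight, noon))  # True
```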
But let's change to a completely different area, the I-25 corridor in Denver. Around this area, we see some of the highest accident counts in the state.
Sometimes when looking at a map like this, it's difficult to appreciate what the scene looks like in real life.
By using 3D content, such as these building models, we can calculate additional features to model visual obstructions or challenging driving situations...
...such as driving into the setting sun...
...or potentially unseen hazards.
In many suburban environments, you can see far ahead. But in a city like Denver, you can't see as far. Using the buildings and terrain, we can model visibility along the road.
From this place on the road, we can check how much visibility a driver would have, taking into account terrain slope and buildings.
ArcGIS allows us to use obstructions such as buildings and terrain to estimate what points are blocked to the driver.
We can use this to calculate new features such as the mean field of view of a driver, the dominant visible direction, or the density of visual obstructions. Depending on the exact situation, these and other features can be constructed...
...to offer a new perspective to a predictive model.
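Here is a minimal sketch of how visibility results might be turned into numeric features. It assumes a line-of-sight analysis has already produced, for one observer point, a list of (azimuth, visible distance) pairs, with azimuth in degrees clockwise from north. The input format, the function name, and the exact feature definitions are illustrative assumptions, not the output of any particular ArcGIS tool.

```python
import math

def visibility_features(rays, max_range):
    """Summarize sight lines from one observer point.

    rays:      list of (azimuth_deg, visible_distance) pairs
    max_range: distance at which a sight line counts as unobstructed
    """
    if not rays:
        raise ValueError("at least one ray is required")
    distances = [min(d, max_range) for _, d in rays]
    mean_distance = sum(distances) / len(distances)
    # Distance-weighted circular mean of the azimuths: sum each ray as a
    # vector, then take the direction of the resulting vector.
    east = sum(d * math.sin(math.radians(az)) for (az, _), d in zip(rays, distances))
    north = sum(d * math.cos(math.radians(az)) for (az, _), d in zip(rays, distances))
    dominant = math.degrees(math.atan2(east, north)) % 360.0
    # Share of sight lines cut short before reaching max_range.
    blocked = sum(1 for d in distances if d < max_range) / len(distances)
    return {
        "mean_visible_distance": mean_distance,
        "dominant_direction_deg": dominant,
        "blocked_fraction": blocked,
    }

# An observer who can see far to the east but is hemmed in elsewhere.
rays = [(0, 10.0), (90, 100.0), (180, 10.0), (270, 10.0)]
print(visibility_features(rays, max_range=100.0))
```

When the sight lines are evenly balanced in all directions, the summed vector is close to zero and the dominant direction is not meaningful, so the vector length could be kept as an additional feature to flag that case.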
This is all, of course, just scratching the surface of what we could use in a predictive model. The point to really drive home is that Geo+AI doesn't just mean one thing - it's truly a different way of applying Artificial Intelligence to spatial problems. In this story map we demonstrated how features in a GIS sense can be used to create features in a Machine Learning sense.
By engineering features, we can inject geography directly into a non-spatial machine learning model. We can take this even further by creating geographically aware machine learning models, but that's a topic for a future date.