Mapping Riparian Old-Growth Forest
Using Random Forest Algorithmic Analysis
Minnesota State University Moorhead’s Regional Science Center is undertaking a restoration project that would return 143 acres to native prairie and 17 acres to riparian terrace forest. The goal is to restore the landscape to its condition before European settlers arrived in the area. One challenge is knowing exactly where the pre-settlement riparian forest stood, since few old-growth trees have survived to the present day. Fortunately for our purposes, a dataset assembled as part of the project provides point locations where trees either currently stand or are confirmed to have stood in the past, along with point locations where confidence is high that old-growth trees never became established.
Because of precipitation patterns across North America, and more specifically across the northern Great Plains, widespread forest does not naturally establish in western Minnesota. The drier conditions here instead favor deep-rooted tallgrass prairie, which is better suited to surviving and rebounding from the fires that periodically occur. The old-growth forest that would become established in western Minnesota without the influence of human activity is confined to the vicinity of rivers and lakes, and depends on several spatial variables relating to the physical geography of the area. We will examine these variables in more detail in the next section. The study area comprises around 800 acres surrounding a stretch of the Buffalo River in the Regional Science Center, about 15 miles east of Moorhead, MN.
Spatial variables
The goal of this study is to use the 'Random Forest' machine learning algorithm to predict where along the Buffalo River trees would have survived to "old-growth" (150 years or more) were it not for human intervention by European settlers. Doing so requires a dataset of rasters representing geospatial variables that could influence the establishment of forest. These variables are:
SOIL_TYPE – The type of soil present across the landscape certainly could have an impact on whether oak trees would survive long-term. Sandier soils containing less organic matter and nutrients would likely not support the life of a tree for 150 years or more.
SOLAR – Areas receiving more annual solar radiation relative to the surrounding area could be expected to be more supportive of tree growth, with amounts of sun exposure having an influence on soil temperature and annual time spent in freeze or thaw.
ELEV – Elevation could be a factor even though the landscape here is mostly flat; areas of higher elevation would be exposed to stronger winds, while areas of lower elevation would be subjected to more saturated soil in rainstorms.
SLOPE_ASP – The direction a slope is facing would be a factor; trees on slopes facing the prevailing winds would be subjected to greater weathering.
SLOPE_ANG – The angle of the slope on which a tree grows could influence the tree’s potential for long-term survival; where the slope is steeper the trees would be more susceptible to being downed in strong rainstorms.
DIST_RIV – The distance from the river may be the variable of highest impact on trees’ survival. Rivers offer some protection from the spread of wildfires, are sources of water for root systems during dry spells, and are carriers of nutrients which are absorbed through the trees’ root systems.
DIRC_RIV – Direction from the river could also play a role in the trees’ survival. Wildfires are spread by wind; trees in areas exposed most directly in the direction of the prevailing winds and not buffered by the river would be at a disadvantage.
The Random Forest Algorithm
Random Forest is a machine learning algorithm built on decision trees. A decision tree is trained on a portion of the dataset set aside as “training data”: the training samples pass through a series of connected decision nodes, and the splitting rule at each node is learned from how those samples divide. Rather than relying on a single decision tree, however, Random Forest grows many different trees from the same training data and combines their predictions. One source of variation is “bootstrapping,” in which each tree is trained on a sample drawn randomly from the training data with replacement, so the same point may be drawn more than once. The algorithm also introduces randomness through what is called the ‘random subspace method,’ in which only a small, randomly chosen subset of the features is considered as candidates at each decision node. The trained forest is then used to predict the remainder of the dataset, in this case our ‘test’ data.
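The two sources of randomness can be sketched in a few lines of Python. This is a minimal illustration rather than the R code used in the project: the function names, the ten stand-in rows, and the choice of two candidate features per split are all assumptions made for the example, and the tree outputs are assumed to be coded 1 for 'tree' and 0 for 'no tree' so that they can be averaged.

```python
import random

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows at random, with replacement."""
    return [rng.choice(rows) for _ in rows]

def random_feature_subset(feature_names, k, rng):
    """Pick k distinct candidate features to consider at one split."""
    return rng.sample(feature_names, k)

def average_prediction(tree_outputs):
    """Combine per-tree outputs (1 = tree, 0 = no tree) into one score."""
    return sum(tree_outputs) / len(tree_outputs)

rng = random.Random(42)  # fixed seed so the sketch is repeatable
features = ["SOIL_TYPE", "SOLAR", "ELEV", "SLOPE_ASP",
            "SLOPE_ANG", "DIST_RIV", "DIRC_RIV"]
rows = list(range(10))  # hypothetical stand-ins for ten training points

sample = bootstrap_sample(rows, rng)                  # may repeat some rows
candidates = random_feature_subset(features, 2, rng)  # two distinct features
score = average_prediction([1, 0, 1])                 # three hypothetical trees
```

Because the sample is drawn with replacement, some rows typically appear more than once and others not at all, which is what makes each tree in the forest different.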
Random Subspace Method
The diagram below details the steps involved in building each decision tree under the random subspace method. The first check the algorithm makes is on the purity of the data; in other words, it checks whether the data samples all belong to the same class. According to the Gini index, defined here as the probability that two randomly sampled items belong to the same class, a population in which this happens 100% of the time is ‘pure’ and has an index of 1. (The more commonly used Gini impurity is one minus this value, so a pure population has an impurity of 0.) As the diagram shows, if the data is determined to be pure then classification can proceed with no further steps; if it is not pure then a series of further steps must be followed.
The next step is to measure the level of entropy (or “disorder”) within the data. The higher the entropy, the lower the purity, so we instruct the algorithm to evaluate candidate splits and place the split where the resulting subsets have the lowest entropy.
With entropy defined, one function determines the location of the best split and another applies that split to the data; these splits serve as the nodes in our decision trees. A function for the decision tree algorithm itself is then defined, along with a function to return the predictions of the decision tree. With these preparatory functions in place, we are now ready to implement a random forest prediction of our data.
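The purity check can be illustrated with a short Python sketch. It assumes the definition used above (the probability that two items drawn at random, with replacement, share a class) and also reports the more common Gini impurity, which is one minus that probability. The function names and labels are hypothetical.

```python
from collections import Counter

def same_class_probability(labels):
    """Probability that two random draws share a class: sum of p_i squared."""
    n = len(labels)
    return sum((count / n) ** 2 for count in Counter(labels).values())

def gini_impurity(labels):
    """Gini impurity: 0 for a pure set, larger for more mixed sets."""
    return 1.0 - same_class_probability(labels)

pure = ["tree", "tree", "tree", "tree"]
mixed = ["tree", "tree", "no tree", "no tree"]

pure_probability = same_class_probability(pure)    # every pair matches
mixed_probability = same_class_probability(mixed)  # half of the pairs match
```

A pure set gives a same-class probability of 1 and an impurity of 0, while an evenly mixed two-class set gives 0.5 for both.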
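As a rough sketch of this step, the Python below computes Shannon entropy and searches a single numeric variable for the threshold whose two resulting subsets have the lowest size-weighted entropy. The distances and labels are invented for illustration, and testing midpoints between sorted values is one common convention, assumed here rather than taken from the project's scripts.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy in bits; zero when all labels agree."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_threshold(values, labels):
    """Return (threshold, weighted_entropy) for the best 'value <= threshold' split."""
    best = (None, float("inf"))
    distinct = sorted(set(values))
    for low, high in zip(distinct, distinct[1:]):
        threshold = (low + high) / 2  # midpoint between neighboring values
        left = [lab for v, lab in zip(values, labels) if v <= threshold]
        right = [lab for v, lab in zip(values, labels) if v > threshold]
        weighted = (len(left) * entropy(left)
                    + len(right) * entropy(right)) / len(labels)
        if weighted < best[1]:
            best = (threshold, weighted)
    return best

# Hypothetical distances from the river (meters) and class labels.
dist_riv = [20, 60, 150, 180, 320, 400]
labels = ["tree", "tree", "tree", "tree", "no tree", "no tree"]

threshold, weighted = best_threshold(dist_riv, labels)
```

With these made-up values the search settles on a threshold between 180 and 320 meters, because that split leaves each side containing a single class.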
Random Forest in R
When we run the Random Forest algorithm in R, we are provided a graph showing the relative importance of the aforementioned spatial variables. ‘%IncMSE’ is the percent increase in mean squared error. Mean squared error (MSE) is the average squared difference between estimated and actual values; %IncMSE measures how much the MSE increases when the values of a given variable are randomly permuted by the randomForest algorithm. A greater increase in MSE corresponds to greater importance for that variable; hence, we see that the variables most important to the establishment of old-growth forest here are slope angle, distance from river, and soil type. I expected distance from river and soil type to be highly influential on old-growth forest here; I was surprised at the level of importance placed on slope angle. ‘IncNodePurity’ represents the total increase in node purity (the concept described above) gained from splits on each variable; again, the higher the increase, the greater the importance of the variable.
In both graphs, three variables show markedly higher importance than the rest: slope angle, soil type, and distance from the river.
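One common way to measure an increase in MSE of this kind is to shuffle a single predictor's values and see how much the prediction error grows. The Python sketch below assumes that approach and is a simplification, not the randomForest package's exact procedure; `toy_model` is a hypothetical stand-in for a fitted forest, and the rows are invented.

```python
import random

def mse(model, rows, targets):
    """Mean squared error of the model's predictions."""
    return sum((model(r) - t) ** 2 for r, t in zip(rows, targets)) / len(rows)

def permutation_increase(model, rows, targets, column, rng):
    """Increase in MSE after shuffling the values of one predictor column."""
    baseline = mse(model, rows, targets)
    shuffled = [r[column] for r in rows]
    rng.shuffle(shuffled)
    permuted = [dict(r, **{column: v}) for r, v in zip(rows, shuffled)]
    return mse(model, permuted, targets) - baseline

def toy_model(row):
    """Hypothetical fitted model: uses DIST_RIV and ignores SOLAR."""
    return 1.0 if row["DIST_RIV"] <= 200 else 0.0

rows = [{"DIST_RIV": d, "SOLAR": s}
        for d, s in [(50, 3), (120, 5), (180, 4), (260, 6), (340, 2), (410, 7)]]
targets = [1.0, 1.0, 1.0, 0.0, 0.0, 0.0]

rng = random.Random(0)
dist_increase = permutation_increase(toy_model, rows, targets, "DIST_RIV", rng)
solar_increase = permutation_increase(toy_model, rows, targets, "SOLAR", rng)
```

Shuffling a variable the model never uses leaves the error unchanged, whereas shuffling one it relies on can only hold or raise the error in this toy setup. That asymmetry is what the importance graph ranks.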
Correlograms
Correlograms are graphical representations of correlations between variables. The correlation coefficients run from 1 for perfect correlation to -1 for perfect anticorrelation, with 0 in the middle representing no correlation. In a correlogram, the variables are laid out along both the x- and y-axes, with boxes colored blue for positive correlation and red for negative. Darker colors indicate stronger correlation, and white indicates none. As illustrated in the correlograms below, each variable has perfect correlation with itself; more interesting are the variables that show stronger positive or negative correlation with other variables. When we run the script in R that generates the correlogram, we see that the strongest positive correlation is between ‘distance to river’ and ‘elevation,’ and that the strongest negative correlation is between ‘slope angle’ and ‘solar radiation.’
R can also create a new raster based on the input spatial variables and the result of the Random Forest algorithm. The output raster depicts the probability of establishment of old-growth forest across the extent of our study area, and can then be loaded and symbolized in GIS software such as ArcMap.
The map shown here depicts low probability of tree establishment in purple, and high probability in green. A close look at the map reveals artifacts left by the algorithm, such as the yellowish circles surrounding the forested area and the "teeth-like" shapes on the lower part of the map. Note that the green double line is a highway running through the study area and should be disregarded.
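The numbers behind a correlogram can be sketched as a matrix of pairwise correlation coefficients. The Python below assumes the Pearson coefficient and uses made-up values for three of the variables, so it shows the calculation only and not the project's results.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def correlation_matrix(columns):
    """Nested dict of pairwise correlations, one row and column per variable."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names}
            for a in names}

# Hypothetical sample values for three of the spatial variables.
columns = {
    "DIST_RIV": [10, 20, 30, 40],
    "ELEV": [270, 272, 274, 276],
    "SLOPE_ANG": [8, 6, 4, 2],
}
matrix = correlation_matrix(columns)
```

The diagonal of the matrix is always 1, which is why every variable appears perfectly correlated with itself in the plot.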
Additional spatial analysis
As we inspect the tree points' locations, we see that they are clustered mostly within a certain distance of the river. Pulling up the measurement tool reveals that this distance appears to be mostly limited to about 200 meters. In order to create a buffer around the river, I first digitized it into a new polyline shapefile; then, using the Buffer tool, I created a buffer of 200 meters around the river to see which points fall within that distance.
To quantify the percentage of points (both tree and no-tree) within the 200m buffer, I used the Intersect tool to overlay the river buffer layer on the points layer. Opening the attribute tables of the intersected layer and the original points layer, I compared the number of features and found that 146 of the 162 total tree points (90%) fell within the 200 meter buffer, while only 73 of the 236 'no tree' points (31%) were within the buffer area.
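The same buffer test can be approximated outside of GIS software by measuring each point's distance to the digitized river line. The Python sketch below assumes planar coordinates in meters; the river vertices and point locations are invented, and only the last two lines use the counts reported above.

```python
import math

def distance_to_segment(p, a, b):
    """Shortest planar distance from point p to the segment from a to b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    length_sq = dx * dx + dy * dy
    if length_sq == 0:
        return math.hypot(px - ax, py - ay)
    # Clamp the projection so it stays on the segment.
    t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / length_sq))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))

def distance_to_polyline(p, vertices):
    """Distance from p to the nearest segment of the polyline."""
    return min(distance_to_segment(p, a, b)
               for a, b in zip(vertices, vertices[1:]))

def share_within(points, vertices, buffer_m):
    """Count and fraction of points no farther than buffer_m from the line."""
    inside = sum(1 for p in points
                 if distance_to_polyline(p, vertices) <= buffer_m)
    return inside, inside / len(points)

# Hypothetical river vertices and sample points, in meters.
river = [(0, 0), (500, 0), (500, 400)]
points = [(100, 150), (250, -199), (320, 250), (750, 100), (100, 900)]
inside, share = share_within(points, river, 200)

# Percentages recomputed from the counts reported in the text.
tree_percent = round(100 * 146 / 162)
no_tree_percent = round(100 * 73 / 236)
```

Clamping the projection to the segment keeps points beyond the ends of the digitized line from being measured against an imaginary extension of the river.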
Elevation Contours
This image represents the elevation data layer, with topographical contour lines generated with the 'Contour with Barriers' spatial analysis tool. The tree point locations are again represented with green triangles, while the 'no tree' points are in black.
During the course of the project, we found that slope angle was one of the variables most important to the establishment of riparian forest. What's interesting is that the relationship seems to be the reverse of what I had expected. With our topographical contour lines in place, it becomes clear that the 'Tree' point locations are clustered largely around the areas of steepest elevation change, while the 'No tree' point locations mostly occur in flatter areas. My assumption would have been that steeper slopes would make it harder for long-lived trees to establish because of increased erosion and the risk of mudslides during rainstorms, but evidently that is not the case.
Directional Distribution
Depicted here are ellipses representing the directional distribution of the tree point data. The distribution of all of the points as a whole (shown in purple) has significant overlap with the distribution of both the tree (in green) and no-tree (red) points. The tree points occupy a smaller area and are clustered around the riparian forest buffer at the southern edge of this particular stretch of the river; the 'no tree' points take up an area slightly larger and slightly further south.
Elevation Hillshade
The Hillshade tool provides a simplified way of viewing the topography of the study area without the distraction of the various types of land cover. The image here was generated by simply inputting the elevation raster data and specifying the output location; options for relief viewing angles were kept as the defaults. We previously established with the elevation contours that many of the tree point locations are where the elevation changes most steeply; the Hillshade model makes it easier to visually determine which of the point locations are on the relative highlands and which are within the river valley. Displaying the tree point locations on the Hillshade makes it even more obvious how the majority of them are on the steep-sloping edges of the river valley.
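For readers curious about what a hillshade value represents, the Python sketch below uses a widely cited illumination formula to shade a single cell from its slope and aspect. It assumes the tool's defaults are a light source at 315 degrees azimuth and 45 degrees altitude, and it is a simplified illustration, not the tool's own implementation.

```python
import math

def hillshade(slope_deg, aspect_deg, azimuth_deg=315.0, altitude_deg=45.0):
    """Illumination value from 0 (dark) to 255 (fully lit) for one cell.

    slope_deg is the steepness of the cell; aspect_deg and azimuth_deg are
    compass bearings of the downslope direction and of the light source.
    """
    zenith = math.radians(90.0 - altitude_deg)
    slope = math.radians(slope_deg)
    relative = math.radians(azimuth_deg - aspect_deg)
    value = 255.0 * (math.cos(zenith) * math.cos(slope)
                     + math.sin(zenith) * math.sin(slope) * math.cos(relative))
    return max(0.0, value)  # cells facing away from the light are clamped to 0

flat = hillshade(0, 0)              # level ground
facing_light = hillshade(45, 315)   # steep slope facing the light source
facing_away = hillshade(45, 135)    # steep slope facing away from it
```

Under these assumptions, steep slopes facing the light are brightest and those facing away are darkest, which is why the edges of the river valley stand out so clearly in the image.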