Information Instead of Imagery
Combining GIS, sUAS, CV, and ML to make a basic drone parking cop

Introduction
Drones, or more formally sUAS (small unmanned aircraft systems), offer many benefits to organizations through their ability to facilitate asset inspections, digital twin creation, aerial photography, and mapping. The programmability of a drone makes it well suited to performing these tasks frequently and repeatably, giving organizations fresh and consistent data.
But the camera on a drone offers more than just pictures of things. When combined with modern computer vision, image classification, object identification, and information extraction techniques being pioneered in machine learning and artificial intelligence, a drone can also be part of a system that collects useful information about those things. I would posit that the imagery itself is rarely the desired product; it is almost always information and insight about the subjects of the imagery that is wanted. What things are in the image? How many are there? What are their properties and relationships?
Inspiration
In the classes Modeling with Drones I and II (GEO252 and GEO254), we were privileged to hear two guest speakers from BNSF Railway, Nick Dryer and Vivian Young. Both talks highlighted the variety of innovative ways BNSF is deploying drones to gather information about its rail and yard assets, and to use that information to streamline business operations and improve safety. The example that struck me most was their use of a "drone in a box" to regularly fly through an intermodal facility, identify parked cargo containers, and determine if any containers were not in their assigned locations. This seemingly simple task saves BNSF untold headaches (and money) by preventing cargo from becoming misdirected in transit.
I wondered if I could do something similar with the various GIS skills I've gained from the PCC program and the wealth of machine learning resources available on the internet. It was also an excellent opportunity to finally try a programming project that uses computer vision and "AI", toolsets I have not had much motivation to use in the past.
To that end, I decided to explore the concept of information extraction through the implementation of a "drone parking cop."
Project Goal
To write a program that uses drone acquired video footage to identify vehicles in a parking lot, read their license plates, determine which parking space each vehicle is in, and decide if each vehicle has the correct parking permit for the space in which it is parked.
Setup
I decided I needed 3 main components for a working system: a drone, parking space information, and vehicle permit registration information.
The Drone
Clearly the most important component in a drone parking cop is the drone itself.
The drone used to capture each video was a DJI Mini 3 Pro. Video was shot at 4K resolution at 30 fps. In total, 5 videos were used, and each was between 8 and 90 seconds long.
During each flight, the drone records a detailed log. These logs were processed through the Airdata website and downloaded as CSV files. The drone writes a record every 200 ms, and each record contains the following fields.
time(millisecond), datetime(utc), latitude, longitude, height_above_takeoff(feet),
height_above_ground_at_drone_location(feet), ground_elevation_at_drone_location(feet),
altitude_above_seaLevel(feet), height_sonar(feet), speed(mph), distance(feet),
mileage(feet), satellites, gpslevel, voltage(v), max_altitude(feet), max_ascent(feet),
max_speed(mph), max_distance(feet), xSpeed(mph), ySpeed(mph), zSpeed(mph),
compass_heading(degrees), pitch(degrees), roll(degrees), isPhoto, isVideo,
rc_elevator, rc_aileron, rc_throttle, rc_rudder, rc_elevator(percent),
rc_aileron(percent), rc_throttle(percent), rc_rudder(percent),
gimbal_heading(degrees), gimbal_pitch(degrees), gimbal_roll(degrees),
battery_percent, voltageCell1, voltageCell2, voltageCell3, voltageCell4, voltageCell5,
voltageCell6, current(A), battery_temperature(f), altitude(feet), ascent(feet),
flycStateRaw, flycState, message
The fields important for this project were time(millisecond), latitude, longitude, altitude_above_seaLevel(feet), compass_heading(degrees), isVideo, and gimbal_pitch(degrees). This information allowed me to calculate the position of the drone's camera target.
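As a rough sketch, loading the log and keeping just those columns with pandas might look like the following (the file name is hypothetical; the column names match the Airdata export above):

import pandas as pd

# Columns from the Airdata CSV export that the program actually uses.
LOG_COLUMNS = [
    "time(millisecond)",
    "latitude",
    "longitude",
    "altitude_above_seaLevel(feet)",
    "compass_heading(degrees)",
    "gimbal_pitch(degrees)",
    "isVideo",
]

# The file name here is hypothetical.
log = pd.read_csv("flight_665_airdata.csv", usecols=LOG_COLUMNS)

# Keep only records written while the camera was recording
# (assuming isVideo is 1 during recording).
log = log[log["isVideo"] == 1].reset_index(drop=True)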
The Parking Lot and Spaces
The second vital component is information about the parking lot and the spaces within the parking lot.
The drone's log provides only the drone's altitude above mean sea level (MSL). However, to calculate the position of the camera's target, the drone's height above ground level (AGL) is required.
A digital terrain model (DTM) of the parking lot provides the ground's MSL elevation at the drone's location. Subtracting the ground's elevation from the drone's altitude gives the drone's height AGL.
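A minimal sketch of that lookup using rasterio, assuming the drone's position has already been projected into the DTM's coordinate system and both altitudes are in feet (the file path and function name are illustrative):

import rasterio

def height_agl(dtm_path, drone_x, drone_y, drone_alt_msl_ft):
    """Drone height above ground: logged MSL altitude minus DTM ground elevation."""
    with rasterio.open(dtm_path) as dtm:
        # sample() yields one array of band values per coordinate pair.
        ground_elev_ft = float(next(dtm.sample([(drone_x, drone_y)]))[0])
    return drone_alt_msl_ft - ground_elev_ft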
Each parking space within the lot must be digitized and represented by a polygon. Each space has attributes such as a space number or ID, but the most important attribute is the space's required permit kind.
The permit kinds I defined are general, staff, handicap, ev, service, and motorcycle.
This image shows the spaces in Lot 7, and the drone's flight paths for videos in this lot.
Parking Lot 5 contains mostly general spaces. This image also shows the drone's flight paths for videos in this lot.
Across both lots, I digitized 432 spaces, though only a small number of them ended up being used. Each lot contains many more spaces than I digitized.
Vehicle Information
Finally, there must be a database containing the vehicle registration information. The important aspect of this information is the association between a plate number and the permit kind held by the vehicle's owner.
In the 5 videos, I recorded the information of 69 vehicles. These were the ones parked in the block of spaces the drone was targeting. I ignored vehicles parked in more distant spaces, as their plates were unreadable.
A sample of the CSV vehicle database.
Program
To process each flight video from the drone, I wrote a program in Python. It takes as input the raw video and log from the drone, as well as the supporting DTM, parking database, and vehicle database. Its output is an annotated version of the input video and a CSV file of all vehicles identified in each frame of the video.
The program's source code is available on GitHub.
The general design of my program.
The camera's target
Since the drone's position and orientation in space are known, including its height AGL, it is fairly straightforward to use trigonometry to calculate the position of the camera's target on the ground.
The drone's latitude and longitude (as well as all the other GIS assets) were projected to the NAD 1983 (2011) State Plane Oregon North (feet) (EPSG:6559) coordinate system. It was critical to use a projected coordinate system because the trigonometric calculations would not work correctly in a geographic (non-Euclidean) system. The specific choice of 6559 was due to the location being in northern Oregon, and to the log and DTM altitudes being in feet; might as well stay in one measurement system.
The complete SOHCAHTOA. The left is a side view, and the right is a top-down view.
An astute reader will have noticed that these calculations assume the ground is flat. I chose to ignore the slope of the terrain to simplify the various calculations, and, probably because the slope of the parking lot is fairly mild, I did not find that ignoring it significantly impacted the resulting position of the camera's target.
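A minimal sketch of this flat-ground trigonometry, assuming projected coordinates in feet (EPSG:6559), a heading measured clockwise from north, and a gimbal pitch reported as a negative angle below the horizon (the function and variable names are illustrative, not the project's exact code):

import math

def camera_target(drone_x, drone_y, agl_ft, compass_heading_deg, gimbal_pitch_deg):
    """Project the camera's target point onto flat ground."""
    pitch_down = abs(gimbal_pitch_deg)  # degrees below the horizon
    if pitch_down >= 89.5:
        return drone_x, drone_y         # looking straight down
    # Side view: opposite = height AGL, adjacent = ground distance to the target,
    # so ground distance = height / tan(pitch).
    ground_dist = agl_ft / math.tan(math.radians(pitch_down))
    # Top-down view: move that distance along the compass heading.
    heading = math.radians(compass_heading_deg)
    return (drone_x + ground_dist * math.sin(heading),
            drone_y + ground_dist * math.cos(heading))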
This diagram illustrates a few flight lines and their corresponding "target lines" overlaid on the parking spaces of Lot 7.
An early test map of the drone's flight paths and target locations. "F" is for Flight, and "T" is for Target.
Extracting information from videos
Each frame of a video is put through an identical series of steps to locate and read the text of license plates. I tried a few different methods and configurations, but had the best results with the approach described below.
A Python package named YOLOv8 was used to perform vehicle detection, tracking, and segmentation. YOLO is a pre-configured series of convolutional neural networks, built on the PyTorch package, that can be trained to identify objects in images. Fortunately, the people who maintain YOLO offer a pretrained model that can identify a variety of object classes, including vehicles.
YOLO also offers object tracking and segmentation. Tracking allows a particular object to be identified as the same object across multiple frames of video. This maintains continuity through the video and in essence adds a temporal dimension to the data, which can help produce higher quality results. Segmentation allows a particular object to be cleanly extracted from a video frame so it can be further processed in isolation. Detection alone gives a bounding box that can include parts of nearby vehicles, which sometimes causes an incorrect association between a vehicle and its neighbors. Segmentation avoids this problem by providing a mask that can be used to remove neighboring vehicles, as demonstrated below.
Segmentation allows a vehicle to be isolated from its surroundings.
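A minimal sketch of the tracking and masking step, using the pretrained yolov8n-seg.pt COCO segmentation model (COCO classes 2, 5, and 7 are car, bus, and truck); the helper and its return values are illustrative rather than the project's exact code:

import cv2
from ultralytics import YOLO

vehicle_model = YOLO("yolov8n-seg.pt")  # pretrained COCO segmentation model

def tracked_vehicles(frame):
    """Yield (track_id, bounding box, isolated crop) for each vehicle in a frame."""
    results = vehicle_model.track(frame, persist=True, classes=[2, 5, 7], verbose=False)
    r = results[0]
    if r.boxes.id is None or r.masks is None:
        return
    for box, track_id, mask in zip(r.boxes.xyxy.cpu().numpy(),
                                   r.boxes.id.int().cpu().tolist(),
                                   r.masks.data.cpu().numpy()):
        x1, y1, x2, y2 = box.astype(int)
        # Resize the mask to frame size and blank out everything that is not
        # this vehicle, so a neighboring car cannot contribute a license plate.
        full_mask = cv2.resize(mask, (frame.shape[1], frame.shape[0]))
        isolated = frame * (full_mask[..., None] > 0.5)
        yield track_id, (x1, y1, x2, y2), isolated[y1:y2, x1:x2]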
Once a vehicle is detected, tracked, and segmented, I use YOLO again, with a different pretrained model, to detect the license plate on the isolated vehicle. This two-step process ensures that the license plate text being read actually belongs to that vehicle and not to one of its neighbors. An example of a final identified and cropped plate is shown below.
100% crop of a frame, illustrating an image provided to EasyOCR.
Another Python package, named EasyOCR, allowed me to extract text from each plate image. OCR stands for optical character recognition, and it is probably one of the earliest examples of computer vision and machine learning being used to extract complex data from images. Like YOLO, EasyOCR is "batteries included" and does not require the user to train their own model. One simply gives it an image and it gives back some text and a confidence estimate. Although EasyOCR was very easy to use, it did often have trouble producing correct text results. However, I believe the license plates themselves are often difficult to read (even sometimes for a human!) and have characters that look very similar, "O" and "Q" being a prominent example.
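A minimal sketch of the text extraction step with EasyOCR; restricting the characters to capital letters and digits is my own assumption about a reasonable tweak, not necessarily what the project's code does:

import easyocr

reader = easyocr.Reader(["en"])  # built once and reused for every plate crop

PLATE_CHARS = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"

def read_plate(plate_crop):
    """Return the most confident (text, confidence) pair found in a plate crop."""
    detections = reader.readtext(plate_crop, allowlist=PLATE_CHARS)
    if not detections:
        return None, 0.0
    # Each detection is (bounding box, text, confidence).
    _, text, confidence = max(detections, key=lambda d: d[2])
    return text, confidence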
Parking validation
With the vehicle's location within the video frame and its license plate text, I could proceed to check whether the vehicle is parked in a space that its permit allows, which is the main objective of the drone parking cop.
I assumed that the center of the video frame coincides with the location that the drone's camera is targeting. Each video and its log are roughly synchronized, both starting at t = 0. Each frame's temporal position in the video can be calculated as t = frame_number / video_speed_fps. Each log entry's temporal position is given in the time(millisecond) field. Thus a particular video frame can be associated with a log entry, and therefore with the drone's position in space and its camera target.
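A minimal sketch of that association, assuming the log has been loaded into a pandas DataFrame as in the earlier sketch (the function name is illustrative):

def log_entry_for_frame(log, frame_number, video_fps):
    """Return the log record closest in time to a given video frame (both start at t = 0)."""
    frame_time_ms = frame_number / video_fps * 1000.0
    idx = (log["time(millisecond)"] - frame_time_ms).abs().idxmin()
    return log.loc[idx]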
After obtaining the camera target's coordinate for a particular frame, I use the parking space database to determine which space intersects with that coordinate. This is the basic GIS operation "intersect."
I can also obtain information about the vehicle, specifically the vehicle's parking permit, by looking up that information by license plate from the vehicle registration database. This is a non-GIS query, where the vehicle's information is keyed by plate number. One thing to note is that, although the information in the database reflects all vehicles in each video being parked correctly, I added a chance for each vehicle to get a random permit kind upon loading. It would be boring if all vehicles were correctly parked!
With the space information and vehicle registration information, I simply check that the space kind matches the vehicle's permit kind. For example, that a vehicle with a "staff" permit is in a "staff" space, or that one with a "handicap" permit is in a "handicap" space.
This parking validation was performed on vehicles when they were roughly in the center of the video frame. Specifically, this is when the vehicle's bounding box contained the center point of the frame.
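Putting those lookups together, the validation step might look like the sketch below. The file name and the "space_id"/"permit_kind" attribute names are assumptions, and vehicles is assumed to be a dict mapping plate text to permit kind loaded from the vehicle CSV:

import geopandas as gpd
from shapely.geometry import Point

# Parking space polygons, reprojected to EPSG:6559 to match the camera target.
spaces = gpd.read_file("parking_spaces.gpkg").to_crs("EPSG:6559")

def validate_parking(target_x, target_y, plate_text, vehicles):
    """Check a vehicle's permit against the space under the camera's target point."""
    hits = spaces[spaces.contains(Point(target_x, target_y))]
    if hits.empty:
        return "no_space", None
    space = hits.iloc[0]
    permit = vehicles.get(plate_text)
    if permit is None:
        return "unknown_vehicle", space["space_id"]   # plate not in the database
    if permit == space["permit_kind"]:
        return "valid", space["space_id"]
    return "invalid", space["space_id"]               # issue a ticket!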
Annotating the video
With all the parts brought together, I simply had to draw the information about a frame onto the actual image of the frame. Each annotated frame is shown to the program user as it is processed, as a kind of live preview. Additionally, each annotated frame is written to a newly created video file that can be viewed using any video software, such as Windows Media Player or VLC.
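A minimal OpenCV sketch of that preview-and-write loop; the generator of (frame, box, label, color) tuples is an assumption about how the pieces are handed to it, not the project's exact interface:

import cv2

def annotate_and_write(annotated_frames, out_path, fps, frame_size):
    """Draw each frame's results, preview them live, and write a new video file."""
    writer = cv2.VideoWriter(out_path, cv2.VideoWriter_fourcc(*"mp4v"), fps, frame_size)
    for frame, (x1, y1, x2, y2), label, color in annotated_frames:
        cv2.rectangle(frame, (x1, y1), (x2, y2), color, 2)
        cv2.putText(frame, label, (x1, y1 - 8), cv2.FONT_HERSHEY_SIMPLEX, 0.6, color, 2)
        cv2.imshow("preview", frame)  # live preview while processing
        cv2.waitKey(1)
        writer.write(frame)
    writer.release()
    cv2.destroyAllWindows()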
I also wrote a CSV that contains some information about each vehicle in each frame, most importantly whether the license plate for that vehicle was found in the vehicle database or not. I used this to do some simple quantitative analysis of how the program performed on each video.
The videos and analysis results are provided below.
Results
Each video and some simple performance statistics are provided below.
Vehicle bounding box color has the following meaning:
- White: Detected. No permit validation being done at that time.
- Blue: Unable to validate. Vehicle plate could not be found in the database.
- Green: Valid. Vehicle plate found, and permit is valid for the parking space.
- Red: Invalid. Vehicle plate found, and permit is invalid for the parking space. (Issue ticket!)
Furthermore, at the top of each bounding box the vehicle's tracking ID, plate text, and text confidence value are shown; the space ID number and required permit kind are also shown when the vehicle is in the center of the video frame. At the bottom are shown the vehicle's color, manufacturer, and permit kind.
Flight 1 - Video 665
Video 665 annotated.
Performance on video 665.
Flight 2 - Video 666
Video 666 annotated.
Performance on video 666.
Flight 3 - Video 667
Video 667 annotated.
Performance on video 667.
Flight 4 - Video 668
Video 668 annotated.
Performance on video 668.
Flight 5 - Video 669
Video 669 annotated.
Performance on video 669.
Conclusion
Overall, I am fairly pleased with the results of this project.
The aspects that presented the most difficulty or the poorest performance were OCR (reading each plate's text) and precisely identifying which parking space the drone was targeting at any particular time.
I believe the EasyOCR package, especially for free software, is generally quite good at extracting text. However, it is made for general text extraction from a variety of sources such as signs or printed text, and even works in a handful of languages. License plates, though seemingly simple and regular, pose a number of challenges to OCR. First is the variety of decorative plates available. In Oregon alone there are 27 choices, and some of those have additional variations (e.g. veterans plates); then multiply that by 50 states in the United States. Second, the typeface used on plates can make similar characters difficult to distinguish, even for a human. In this project's videos, I noticed that "O", "0" (zero), "Q", and "D"; "H", "W", and "M"; and "3", "8", and "B" were some of the most commonly misidentified groups of characters. Because of the variety of plate numbering schemes among different states, it was impossible to use traditional heuristics (like a regular expression) to filter out most incorrectly read plate text. Third, the video itself created problems with exposure and perspective distortion. The retro-reflective quality of license plates caused many plates to exhibit overexposure and fringing effects, even though the video overall was exposed correctly. I believe this is why video 668 had the highest OCR accuracy: its plates were shaded by their vehicles, whereas in every other video the sun was directly on and reflecting from the plates. The Mini 3 Pro has a somewhat wide angle lens with an 82° FOV, and at the left and right edges of the video plates exhibit very noticeable skew distortion, slanting the letters and making the text difficult for OCR to read correctly. Finally, some vehicles had no visible plates, "dealer" plates, temporary paper "plates", plates in the window instead of on the bumper, or plastic covers over the plate, making them difficult, impossible, or perhaps meaningless to read.
The other issue appeared as my program began to produce annotated videos and I reviewed them: the very center of the video was visually within one parking space, but the calculated location for that frame, and consequently the reported space, was "fast" by more than half a space width in many cases. For example, the video might be centered on space number 10, but the reported space is 11, the next space to the right. I could not figure out why this was happening, despite exploring errors in video-log timing, projection and datum issues, mistakes in the trigonometry, and possible misinterpretation of the log data (specifically the compass_heading field). My current best explanation is that the drone's compass is a few degrees out of calibration, or that the effect is caused by some aspect of lens and perspective distortion that I have not considered. Admittedly, I have made many simplifying assumptions about the geometry of the drone and parking spaces, and have ignored lens optics completely, other than the assumption that a ray cast directly through the center of the lens should remain straight.
Future directions
Probably the largest improvement to be made would be to train an OCR model specifically for American license plates. As discussed above, reading the text of plates was the least successful part of the process. I would also like to include more optical calculations in the program, hopefully being able to relate an object's location within the frame to its location in actual space. If I could do this, I would not need to rely on a vehicle being at the frame's center (at the camera target) in order to determine which space it is in, and multiple vehicles could then be validated in each frame.
Although a drone seems a very attractive option for implementing such a "drone parking cop" system, I actually do not believe it would be the most practical in reality. Currently, drones are subject to limitations in flight time due to battery capacity, in weather conditions, and in federal regulation. A parking lot can be a large, exposed, and busy place, full of both people and property whose safety is paramount. In contrast, BNSF's intermodal facilities are comparatively tame and on private property. I believe that in the context of a parking lot, a land-based drone would be more suitable and robust. Its near-ground camera height would also be more ideally located relative to the license plates. In fact, commercial plate recognition systems exist that can be attached to ground vehicles, and some police forces have experimented with autonomous ground-based patrol robots.
One of the most interesting twists this project took was the use of object segmentation. At the start, I was not aware that segmentation was a capability of packages like YOLO, only that detection could provide a simpler bounding box. Now that I know an easy-to-use tool like YOLO can do segmentation, I can think of a number of other GIS applications that would benefit. Tree canopy segmentation and species identification using a custom-trained model was one of the first applications that came to mind, and something I might try experimenting with. Other applications might be drone-based inventories of cattle on ranches (possibly even individual identification), of material piles in stockyards, or of orchard trees.
Generally, sUAS-based or not, the power of machine learning computer vision systems cannot be denied. Although sUAS have enabled relatively inexpensive, high-resolution, and high-frequency acquisition of aerial imagery, that imagery has still required "old fashioned" techniques to extract meaningful data. These techniques usually require a human to visually identify and manually digitize objects, which is not only a time-consuming process but also a waste of that person's creative and problem-solving potential in an organization. High-level planning and decision making rarely depends on what something looks like in a particular image, but on specific attributes gleaned from imagery. The intersection of drones and other automated platforms, machine learning, and GIS is the future of bringing fresh, detailed, meaningful information directly to humans to digest and act upon.