Predict flight delays by creating a machine learning model in Python

Vaghela Lagdhir
12 min read · Nov 8, 2021

Import airline arrival data into a Google Colab Notebook and use Pandas to clean it. Then, build a machine learning model with Scikit-Learn and use Matplotlib to visualize the output.

Learning objectives

In this module, we will:

  • Create a Google Colab notebook and import flight data
  • Use Pandas to clean and prepare the data
  • Use scikit-learn to build a machine-learning model
  • Use Matplotlib to visualize the output

Introduction

Python is one of the world’s most popular programming languages. It’s used extensively in the data science community for machine learning and statistical analysis. One of the reasons it’s so popular is the availability of thousands of open-source libraries such as NumPy, Pandas, Matplotlib, and scikit-learn, which enable programmers and researchers alike to explore, transform, analyze, and visualize data.

Google Colab is a free, web-based environment that facilitates interactive programming and data analysis using Python and other programming languages, which makes it an ideal solution for collaborating online.

In this module, we will import a dataset containing on-time arrival information for a major U.S. airline and load it into a notebook. Then we will clean the dataset with Pandas, build a machine-learning model with scikit-learn, and use Matplotlib to visualize output from the model.

Create a Google Colab Notebook and Import Data

First, we create a Google Colab notebook, and then we import the dataset.
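
The import cell might look something like the following sketch, which assumes the CSV file has been uploaded to the Colab session as flightdata.csv (the file name is illustrative):

```python
import pandas as pd

# Load the on-time arrival data; the file name here is an assumption
df = pd.read_csv('flightdata.csv')
df.head()
```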

The DataFrame that we created contains on-time arrival information for a major U.S. airline. It has more than 11,000 rows and 26 columns. (The output says “5 rows” because DataFrame’s head function only returns the first five rows.) Each row represents one flight and contains information such as the origin, the destination, the scheduled departure time, and whether the flight arrived on time or late. We will look at the data more closely a bit later in this module.

Clean and prepare data

Before we prepare a dataset, we need to understand its content and structure. We have imported a dataset containing on-time arrival information for a major U.S. airline. That data included 26 columns and thousands of rows, with each row representing one flight and containing information such as the flight’s origin, destination, and scheduled departure time. We also loaded the data into a notebook and used a simple Python script to create a Pandas DataFrame from it.

A DataFrame is a two-dimensional labeled data structure. The columns in a DataFrame can be of different types, just like columns in a spreadsheet or database table. It is the most commonly used object in Pandas.

One of the first things we want to know about a dataset is how many rows it contains. To get a count, type the following statement into an empty cell at the end of the notebook and run it:
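
A one-line check, assuming the DataFrame is named df as above:

```python
# Returns a (rows, columns) tuple
df.shape
```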

Confirm that the DataFrame contains 11,231 rows and 26 columns.

Now take a moment to examine the 26 columns in the dataset. They contain important information such as the date that the flight took place (YEAR, MONTH, and DAY_OF_MONTH), the origin and destination (ORIGIN and DEST), the scheduled departure and arrival times (CRS_DEP_TIME and CRS_ARR_TIME), the difference between the scheduled arrival time and the actual arrival time in minutes (ARR_DELAY), and whether the flight was late by 15 minutes or more (ARR_DEL15).

To see the complete list of columns in the dataset, print df.columns. Times are expressed in 24-hour military time; for example, 1130 equals 11:30 a.m. and 1500 equals 3:00 p.m.
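
For instance:

```python
# List every column name in the DataFrame
df.columns
```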

The dataset includes a roughly even distribution of dates throughout the year, which is important because a flight out of Minneapolis is less likely to be delayed due to winter storms in July than it is in January. But this dataset is far from being “clean” and ready to use. Let’s write some Pandas code to clean it up.

One of the most important aspects of preparing a dataset for use in machine learning is selecting the “feature” columns that are relevant to the outcome you are trying to predict while filtering out columns that do not affect the outcome, could bias it in a negative way, or might produce multicollinearity. Another important task is to eliminate missing values, either by deleting the rows or columns containing them or replacing them with meaningful values.

  1. One of the first things data scientists typically look for in a dataset is missing values. There’s an easy way to check for missing values in Pandas. To demonstrate, execute the following code in a cell at the end of the notebook:
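
For example, the following expression reports whether any value in the DataFrame is missing:

```python
# True if at least one value anywhere in the DataFrame is NaN
df.isnull().values.any()
```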

Confirm that the output is “True,” which indicates that there is at least one missing value somewhere in the dataset.

2. The next step is to find out where the missing values are. To do so, execute the following code:
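
One way to do that is to count the missing values in each column:

```python
# Number of missing values per column
df.isnull().sum()
```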

3. Curiously, the 26th column (“Unnamed: 25”) contains 11,231 missing values, which equals the number of rows in the dataset. This column was mistakenly created because the CSV file that you imported contains a comma at the end of each line. To eliminate that column, add the following code to the notebook and execute it:
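
A sketch of that step:

```python
# Drop the spurious "Unnamed: 25" column and re-check the missing-value counts
df = df.drop('Unnamed: 25', axis=1)
df.isnull().sum()
```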

4. The DataFrame still contains a lot of missing values, but some of them aren’t useful because the columns containing them are not relevant to the model that we are building. The goal of that model is to predict whether a flight you are considering booking is likely to arrive on time. If we know that the flight is likely to be late, we might choose to book another flight.

The next step, therefore, is to filter the dataset to eliminate columns that aren't relevant to a predictive model. Pandas provides an easy way to filter out columns we don't want. Execute the following code in a new cell at the end of the notebook:
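
One reasonable selection keeps the date fields, the origin and destination, the scheduled departure time, and the ARR_DEL15 label; the exact subset is a modeling choice:

```python
# Keep only the columns that plausibly affect on-time arrival, plus the label
df = df[['MONTH', 'DAY_OF_MONTH', 'DAY_OF_WEEK', 'ORIGIN', 'DEST',
         'CRS_DEP_TIME', 'ARR_DEL15']]
df.isnull().sum()
```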

5. The only column that now contains missing values is the ARR_DEL15 column, which uses 0s to identify flights that arrived on time and 1s for flights that didn’t. Use the following code to show the first five rows with missing values:
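
For example:

```python
# First five rows that still contain a missing value
df[df.isnull().values.any(axis=1)].head()
```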

Pandas represents missing values with NaN, which stands for Not a Number. The output shows that these rows are indeed missing values in the ARR_DEL15 column.

6. The reason these rows are missing ARR_DEL15 values is that they all correspond to flights that were canceled or diverted. We could call dropna on the DataFrame to remove these rows. But since a flight that is canceled or diverted to another airport could be considered “late,” let’s use the fillna method to replace the missing values with 1s.

Use the following code to replace missing values in the ARR_DEL15 column with 1s and display rows 177 through 184:
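
A sketch of that step (iloc selects rows by position, and the end of the range is exclusive):

```python
# Treat cancelled or diverted flights as "late", then inspect a slice of the result
df = df.fillna({'ARR_DEL15': 1})
df.iloc[177:185]
```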

The dataset is now “clean” in the sense that missing values have been replaced and the list of columns has been narrowed to those most relevant to the model. But you’re not finished yet. There is more to do to prepare the dataset for use in machine learning.

Now, we will “bin” the departure times in the CRS_DEP_TIME column and use Pandas’ get_dummies method to create indicator columns from the ORIGIN and DEST columns.

  1. Use the following command to display the first five rows of the DataFrame:
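
That is simply:

```python
# Show the first five rows again
df.head()
```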

Observe that the CRS_DEP_TIME column contains values from 0 to 2359 representing military times.

2. Use the following statements to bin the departure times:
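
One compact way to do the binning is integer division by 100, which reduces a time such as 1130 or 1500 to the hour of the day:

```python
# Bin each scheduled departure time into the hour of the day (0-23)
df['CRS_DEP_TIME'] = df['CRS_DEP_TIME'] // 100
df.head()
```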

Confirm that the numbers in the CRS_DEP_TIME column now fall in the range 0 to 23.

3. Now use the following statements to generate indicator columns from the ORIGIN and DEST columns, while dropping the ORIGIN and DEST columns themselves:
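
Something like this, using Pandas' get_dummies:

```python
# Replace ORIGIN and DEST with one 0/1 indicator column per airport
df = pd.get_dummies(df, columns=['ORIGIN', 'DEST'])
df.head()
```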

The dataset looks very different than it did at the start, but it is now optimized for use in machine learning.

Build Machine Learning Model

To create a machine learning model, we need two datasets: one for training and one for testing. In practice, we often have only one dataset, so we split it into two. Now, we will perform an 80-20 split on the DataFrame we prepared in the previous task so we can use it to train a machine learning model. We will also separate the DataFrame into feature columns and a label column. The former contains the columns used as input to the model, while the latter contains the column that the model will attempt to predict; in this case, that is the ARR_DEL15 column, which indicates whether a flight will arrive on time.

  1. Open the notebook again.
  2. In a new cell at the end of the notebook, enter and execute the following statements:
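
A sketch of the split; the variable names train_x, test_x, train_y, and test_y are the ones assumed in the rest of the examples:

```python
from sklearn.model_selection import train_test_split

# 80-20 split: features are every column except ARR_DEL15, the label is ARR_DEL15
train_x, test_x, train_y, test_y = train_test_split(
    df.drop('ARR_DEL15', axis=1), df['ARR_DEL15'],
    test_size=0.2, random_state=42)
```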

The first statement imports scikit-learn’s train_test_split helper function. The second line uses the function to split the DataFrame into a training set containing 80% of the original data, and a test set containing the remaining 20%. The random_state parameter seeds the random number generator used to do the splitting, while the first and second parameters are DataFrames containing the feature columns and the label column.

3. train_test_split returns four DataFrames. Use the following command to display the number of rows and columns in the DataFrame containing the feature columns used for training:
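
With the names above, that is:

```python
# (rows, columns) of the training features
train_x.shape
```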

4. Now use this command to display the number of rows and columns in the DataFrame containing the feature columns used for testing:
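
And likewise:

```python
# (rows, columns) of the test features
test_x.shape
```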

One of the benefits of using scikit-learn is that you don’t have to build these models or implement the algorithms that they use by hand. Scikit-learn includes a variety of classes for implementing common machine learning models. One of them is RandomForestClassifier, which fits multiple decision trees to the data and uses averaging to boost the overall accuracy and limit overfitting.

  1. Execute the following code in a new cell to create a RandomForestClassifier object and train it by calling the fit method.
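
A minimal version, with the classifier stored in a variable named model that the later cells assume:

```python
from sklearn.ensemble import RandomForestClassifier

# Create the classifier (default hyperparameters) and train it on the training set
model = RandomForestClassifier(random_state=13)
model.fit(train_x, train_y)
```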

The output shows the parameters used in the classifier, including n_estimators, which specifies the number of trees in the forest, and max_depth, which specifies the maximum depth of the decision trees. The values shown are the defaults, but we can override any of them when creating the RandomForestClassifier object.

2. Now call the predict method to test the model using the values in test_x, followed by the score method to determine the mean accuracy of the model:
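
For example:

```python
# Predict labels for the test set, then report the mean accuracy
predicted = model.predict(test_x)
model.score(test_x, test_y)
```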

The mean accuracy is 86%, which seems good on the surface. However, mean accuracy isn't always a reliable indicator of the quality of a classification model. Let's dig a little deeper and determine how accurate the model really is; that is, how adept it is at determining whether a flight will arrive on time.

There are several ways to measure the accuracy of a classification model. One of the best overall measures for a binary classification model is the Area Under Receiver Operating Characteristic Curve (sometimes referred to as “ROC AUC”), which essentially quantifies how often the model will make a correct prediction regardless of the outcome.

  1. Before we compute the ROC AUC, we must generate prediction probabilities for the test set. These probabilities are estimates for each of the classes, or answers, the model can predict. For example, [0.88199435, 0.11800565] means that there's an 88% chance that a flight will arrive on time (ARR_DEL15 = 0) and a 12% chance that it won't (ARR_DEL15 = 1). The two probabilities sum to 100%.

Run the following code to generate a set of prediction probabilities from the test data:
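
Along these lines:

```python
# One row per test sample: [P(on time), P(late)]
probabilities = model.predict_proba(test_x)
```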

2. Now use the following statement to generate a ROC AUC score from the probabilities using scikit-learn’s roc_auc_score method:
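
roc_auc_score expects the probability of the positive class (ARR_DEL15 = 1), which is the second column:

```python
from sklearn.metrics import roc_auc_score

roc_auc_score(test_y, probabilities[:, 1])
```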

3. Use the following code to produce a confusion matrix for your model:
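
For example:

```python
from sklearn.metrics import confusion_matrix

# Rows are actual classes, columns are predicted classes
confusion_matrix(test_y, predicted)
```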

4. Scikit-learn contains a handy method named precision_score for computing precision. To quantify the precision of your model, execute the following statements:
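
Using the test-set predictions generated earlier:

```python
from sklearn.metrics import precision_score

precision_score(test_y, predicted)
```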

5. Scikit-learn also contains a method named recall_score for computing recall. To measure our model’s recall, execute the following statements:
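
Similarly:

```python
from sklearn.metrics import recall_score

recall_score(test_y, predicted)
```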

Visualize Output of Model

  1. Execute the following statements in a new cell at the end of the notebook. Ignore any warning messages that are displayed related to font caching:
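
A typical setup cell looks like this:

```python
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

# Apply Seaborn's default styling to Matplotlib plots
sns.set()
```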

The first statement is one of several magic commands supported by the Python kernel that we selected when we created the notebook. It enables Jupyter to render Matplotlib output in a notebook without making repeated calls to plt.show, and it must appear before any references to Matplotlib itself. The final statement configures Seaborn to enhance the output from Matplotlib.

2. To see Matplotlib at work, execute the following code in a new cell to plot the ROC curve for the machine-learning model we built in the previous task:
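
A sketch using the probabilities computed earlier:

```python
from sklearn.metrics import roc_curve

# False-positive and true-positive rates from the class-1 probabilities
fpr, tpr, _ = roc_curve(test_y, probabilities[:, 1])
plt.plot(fpr, tpr)
plt.plot([0, 1], [0, 1], color='grey', lw=1, linestyle='--')  # 50-50 reference line
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
```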

The dotted line in the middle of the graph represents a 50–50 chance of obtaining a correct answer. The blue curve represents the accuracy of our model. More importantly, the fact that this chart appears at all demonstrates that we can use Matplotlib in a notebook.

The reason we built a machine-learning model is to predict whether a flight will arrive on time or late. Now, we will write a Python function that calls the machine-learning model we built in the task to compute the likelihood that a flight will be on time. Then we will use the function to analyze several flights.

  1. Enter the following function definition in a new cell, and then run the cell.
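
The function below is an illustrative sketch named predict_delay. It builds a one-row DataFrame with the same columns as train_x, so the details depend on the column names produced by the earlier get_dummies step:

```python
from datetime import datetime
import pandas as pd

def predict_delay(departure_date_time, origin, destination):
    # Expects the date/time as day/month/year hour:minute:second
    try:
        parsed = datetime.strptime(departure_date_time, '%d/%m/%Y %H:%M:%S')
    except ValueError as e:
        return 'Error parsing date/time - {}'.format(e)

    origin = origin.upper()
    destination = destination.upper()

    # Start with every feature set to 0, using the training columns in order
    row = {col: 0 for col in train_x.columns}
    row['MONTH'] = parsed.month
    row['DAY_OF_MONTH'] = parsed.day
    row['DAY_OF_WEEK'] = parsed.isoweekday()
    row['CRS_DEP_TIME'] = parsed.hour  # binned hour, matching the earlier step

    # Switch on the indicator columns created by get_dummies, if they exist
    for col in ('ORIGIN_' + origin, 'DEST_' + destination):
        if col in row:
            row[col] = 1

    # predict_proba returns [P(on time), P(late)]; return the on-time probability
    return model.predict_proba(pd.DataFrame([row]))[0][0]
```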

This function takes as input a date and time, an origin airport code, and a destination airport code, and returns a value between 0.0 and 1.0 indicating the probability that the flight will arrive at its destination on time. It uses the machine-learning model we built in the previous task to compute the probability. And to call the model, it passes a DataFrame containing the input values to predict_proba. The structure of the DataFrame exactly matches the structure of the DataFrame we used earlier.

2. Use the code below to compute the probability that a flight from New York to Atlanta on the evening of October 1 will arrive on time. The year we enter is irrelevant because it isn’t used by the model.
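
With the sketch above, dates are written day/month/year, so the evening of October 1 looks like this:

```python
predict_delay('1/10/2018 21:45:00', 'JFK', 'ATL')
```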

3. Modify the code to compute the probability that the same flight a day later will arrive on time:
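
For example:

```python
predict_delay('2/10/2018 21:45:00', 'JFK', 'ATL')
```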

4. Now modify the code to compute the probability that a morning flight the same day from Atlanta to Seattle will arrive on time:
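
For instance:

```python
predict_delay('2/10/2018 10:00:00', 'ATL', 'SEA')
```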

We now have an easy way to predict, with a single line of code, whether a flight is likely to be on time or late.

  1. Execute the following code to plot the probability of on-time arrivals for an evening flight from JFK to ATL over a range of days:
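
One way to build the chart is to call predict_delay for each day and plot the results as a bar chart:

```python
import numpy as np

# Probability of an on-time arrival for a 9:45 p.m. JFK-to-ATL flight on Oct 1-7
days = range(1, 8)
values = [predict_delay('{}/10/2018 21:45:00'.format(d), 'JFK', 'ATL') for d in days]
labels = ['Oct {}'.format(d) for d in days]

plt.bar(np.arange(len(values)), values, align='center', alpha=0.5)
plt.xticks(np.arange(len(values)), labels)
plt.ylabel('Probability of on-time arrival')
plt.ylim((0.0, 1.0))
```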

Summary

In this module, we learned how to:

  • Create a notebook in Google Colab
  • Import data into a notebook using Python libraries
  • Use Pandas to clean and prepare data
  • Use scikit-learn to build a machine-learning model
  • Use Matplotlib to visualize the results

Pandas, scikit-learn, and Matplotlib are among the most popular Python libraries on the planet. With them, we can prepare data for use in machine learning, build sophisticated machine-learning models from the data, and chart the output. Colab Notebooks provide a ready-made environment for using these libraries.

Thank you for reading!
