Creating an Airline Ticket Price Predictor Using Linear Regression

Karan Singh
8 min read · Jan 2, 2023

The science behind airline ticket prices is simple and complex at the same time. With Machine Learning, we can get a clearer picture of what makes a ticket cheaper or more expensive. You might ask how? Well, it's not that tough!

In this article, we will build, step by step, a Linear Regression model to predict the prices of airline tickets. At the end, we will use the Linear Regression coefficients to determine which factors influence the price of a ticket the most.

Requirements and Pre-requisites

To attempt this problem, you should have basic programming knowledge of Python. You also need some knowledge of Data Analytics, including libraries like NumPy and Pandas, and of Data Visualization, including libraries like Matplotlib and Seaborn.

You also need an environment to run this project. You can install Jupyter Notebook or Jupyter Lab on your machine from this link: https://jupyter.org/install

Otherwise, you can use Google Colab, an online tool that doesn't require any installation. You just need a Google account. You can open Google Colab from here: https://colab.research.google.com/

I've taken the dataset from Kaggle. You can follow this link to download it: https://www.kaggle.com/datasets/shubhambathwal/flight-price-prediction

Section 1: Importing Libraries and Reading the Dataset

First, we import the required libraries: NumPy, Pandas, Matplotlib and Seaborn.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

Then we read the dataset that we downloaded from Kaggle. We use df.head() to check that the dataset has been loaded correctly.

df = pd.read_csv('Clean_Dataset.csv')
df.head()
Output from the above code

Section 2: Gathering Information About the Dataset

Now, we will use built-in DataFrame methods to view information about the dataset. First, we use df.info() to view the columns of the dataset, their data types, and their non-null counts:

df.info()
Output from the above code

Now, we use df.describe() to get summary statistics (count, mean, standard deviation, minimum, quartiles, and maximum) for the dataset. By default, this method only covers columns that store integer or float data.

df.describe()
Output from the above code
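
Since df.describe() skips the text columns by default, here is a minimal sketch of how those columns could be summarized as well. The small frame below is made-up stand-in data that only borrows a few column names from the dataset; on the real data, the same calls would be made on df.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in rows; only the column names come from the dataset.
sample = pd.DataFrame({
    "airline": ["Vistara", "Vistara", "Indigo", "Air_India"],
    "class": ["Economy", "Business", "Economy", "Economy"],
    "duration": [2.17, 2.33, 2.25, 12.5],
    "price": [5953, 25612, 5955, 6195],
})

# Default behaviour: only the numeric columns (duration, price) are summarized.
numeric_summary = sample.describe()

# Excluding numeric columns summarizes the text columns instead:
# count, number of unique values, most frequent value, and its frequency.
text_summary = sample.describe(exclude=[np.number])

# value_counts() gives the full frequency table for a single column.
airline_counts = sample["airline"].value_counts()

print(numeric_summary)
print(text_summary)
print(airline_counts)
```

Looking at the text columns this way is a quick check on how many categories each one has before they are plotted or encoded.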

Section 3: Exploratory Data Analysis

Now, let's visualize the data to get a better idea of the trends within this dataset. Each plot below looks at how the price relates to one or two of the other columns.

1. Price vs Duration Jointplot

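# jointplot creates its own figure, so its size is set with height
# instead of plt.figure (which would only leave an empty extra figure)
sns.jointplot(x = 'duration', y = 'price', data = df.sample(5000),
              kind = "kde", fill = True, height = 8)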

2. Price vs Duration Jointplot for Economy and Business Class

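# jointplot creates its own figure, so its size is set with height
# instead of plt.figure (which would only leave an empty extra figure)
sns.jointplot(x = 'duration', y = 'price', data = df.sample(5000),
              hue = "class", kind = "kde", fill = True, height = 8)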

3. Countplot for different classes

plt.figure(figsize=(10,6))
sns.countplot(x = 'class', data = df, palette='pastel')

4. Boxplot for Price vs Class

plt.figure(figsize=(10,6))
sns.boxplot(x = 'class', y = 'price', data = df, showfliers = False,
palette='pastel')
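
A boxplot draws the quartiles of each group, so the same comparison can also be read as numbers. The snippet below is a minimal sketch on made-up prices (the values are illustrative, not taken from the dataset); on the real data, toy would be replaced with df.

```python
import pandas as pd

# Made-up prices for illustration only.
toy = pd.DataFrame({
    "class": ["Economy", "Economy", "Economy", "Business", "Business", "Business"],
    "price": [4000, 5000, 9000, 45000, 50000, 61000],
})

# 25th percentile, median, and 75th percentile per class:
# the same three values that define each box in the boxplot.
summary = toy.groupby("class")["price"].quantile([0.25, 0.5, 0.75]).unstack()
print(summary)
```

The same groupby pattern works for any of the categorical columns plotted in this section, such as airline or stops.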

5. Price vs Number of Days Left Jointplot (with Regression Line)

# jointplot creates its own figure, so its size is set with height
sns.jointplot(x = 'days_left', y = 'price', data = df.sample(1000),
              kind = "reg", height = 8)

6. Number of Tickets Offered by Each Airline

plt.figure(figsize=(10,6))
sns.countplot(x = 'airline', data=df, palette='rainbow')

7. Number of tickets offered by each Airline on the Basis of Class

plt.figure(figsize=(10,6))
sns.countplot(x = 'airline', data=df, palette='pastel', hue='class')

8. Boxplot for Price vs Airline

plt.figure(figsize=(10,8))
sns.boxplot(x = 'airline', y = 'price', data = df, palette='rainbow',
showfliers = False)

9. Boxplot for Price vs Airline on the Basis of Class

plt.figure(figsize=(12,10))
sns.boxplot(x = 'airline', y = 'price', data = df, palette='pastel',
showfliers = False, hue='class')

10. Price vs Number of Stops Boxplot

plt.figure(figsize=(10,9))
sns.boxplot(x = 'stops', y = 'price', data = df, palette='rainbow',
showfliers = False)

11. Price vs Number of Stops Boxplot on the Basis of Class

plt.figure(figsize=(10,7))
sns.boxplot(x = 'stops', y = 'price', data = df, palette='pastel',
hue = 'class', showfliers = False)

12. Price vs Source City Boxplot

plt.figure(figsize=(10,6))
sns.boxplot(x = 'source_city', y = 'price', data = df, palette= 'rainbow',
showfliers = False)

13. Price vs Source City Boxplot on the Basis of Class

plt.figure(figsize=(12,6))
sns.boxplot(x = 'source_city', y = 'price', data = df, palette='pastel',
hue = 'class', showfliers = False)

14. Price vs Destination City Boxplot

plt.figure(figsize=(10,6))
sns.boxplot(x = 'destination_city', y = 'price', data = df,
palette= 'rainbow', showfliers = False)

15. Price vs Destination City Boxplot on the Basis of Class

plt.figure(figsize=(12,6))
sns.boxplot(x = 'destination_city', y = 'price', data = df, palette='pastel',
hue = 'class', showfliers = False)

16. Price vs Departure Time Boxplot

plt.figure(figsize=(10,6))
sns.boxplot(x = 'departure_time', y = 'price', data = df, palette= 'rainbow',
showfliers = False)

17. Price vs Departure Time Boxplot on the Basis of Class

plt.figure(figsize=(10,6))
sns.boxplot(x = 'departure_time', y = 'price', data = df, palette= 'pastel',
hue = 'class', showfliers = False)

18. Price vs Arrival Time Boxplot

plt.figure(figsize=(10,6))
sns.boxplot(x = 'arrival_time', y = 'price', data = df,
palette= 'rainbow', showfliers = False)

19. Price vs Arrival Time Boxplot on the Basis of Class

plt.figure(figsize=(10,6))
sns.boxplot(x = 'arrival_time', y = 'price', data = df, palette= 'pastel',
hue = 'class', showfliers = False)

Section 4: Pre-Processing

Now, we will prepare the data for the Linear Regression model.

1. Removing Flight Column

We remove the flight column, since it doesn't provide any useful information for the model and, being text, cannot be processed by the algorithm as it is.

df.drop('flight', axis = 1, inplace= True)
df.head()

2. Converting Stops to Numbers

We convert the stop labels to numbers, since Linear Regression requires numeric input.

def convert_stops(stop):
    # Map the text label for the number of stops to an integer
    if stop == 'zero':
        return 0
    elif stop == 'one':
        return 1
    else:
        # Any other label is treated as two or more stops
        return 2

df['stops'] = df['stops'].apply(convert_stops)
df
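
One thing to keep in mind with the else branch above is that any unexpected label would silently become 2. As an alternative sketch, the same conversion could be done with an explicit dictionary and Series.map, which leaves unknown labels as NaN so they can be spotted. The labels 'zero' and 'one' come from the code above; 'two_or_more' is an assumption about how the remaining category is spelled, and the small Series is made-up data.

```python
import pandas as pd

# 'zero' and 'one' appear in the article's code;
# 'two_or_more' is an assumed spelling of the remaining label.
STOP_CODES = {"zero": 0, "one": 1, "two_or_more": 2}

# Made-up values; 'three' is a deliberately unexpected label.
stops = pd.Series(["zero", "one", "two_or_more", "one", "three"])

encoded = stops.map(STOP_CODES)

# Labels missing from the dictionary become NaN instead of turning into 2.
unexpected = stops[encoded.isna()].unique()

print(encoded.tolist())
print("Unexpected labels:", list(unexpected))
```

If the list of unexpected labels is empty on the real column, both approaches give the same result.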

3. Converting the Class

Converting the traveling class. 0: Economy and 1: Business

def convert_class(travel_class):
    # Economy -> 0, anything else (Business) -> 1
    if travel_class == 'Economy':
        return 0
    else:
        return 1

df['class'] = df['class'].apply(convert_class)
df

4. Converting Other Remaining Columns Into Dummies

Now, we convert the remaining text columns (airline, source_city, departure_time, arrival_time, and destination_city) into dummy columns, so that each category becomes its own 0/1 column that the Linear Regression model can use.

dummy_variables = ['airline','source_city', 'departure_time', 'arrival_time', 
'destination_city']
dummies = pd.get_dummies(df[dummy_variables], drop_first=True)
dummies
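
To see what drop_first=True does, here is a minimal sketch on a tiny made-up column. The city names are only illustrative, and dtype=int is added here just to print 0/1 instead of True/False.

```python
import pandas as pd

# Tiny made-up column for illustration.
toy = pd.DataFrame({"source_city": ["Delhi", "Mumbai", "Kolkata", "Delhi"]})

# One column per category: every row sums to exactly 1.
all_levels = pd.get_dummies(toy, dtype=int)

# drop_first=True removes the first category (alphabetically, 'Delhi'),
# which becomes the baseline: a row of all zeros means 'Delhi'.
reduced = pd.get_dummies(toy, drop_first=True, dtype=int)

print(all_levels)
print(reduced)
```

Because the full set of dummy columns always sums to 1, it is perfectly collinear with the model's intercept. Dropping one level avoids that, and the coefficients of the remaining dummies are then read relative to the dropped category.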

5. Adding Dummies to the Main Dataset

Now, we add the dummy columns to the main dataset and remove the original airline, source_city, departure_time, arrival_time, and destination_city columns, since they are no longer required.

df.drop(dummy_variables, axis = 1, inplace=True)
df = pd.concat([df,dummies], axis = 1)
df

6. Creating Heatmap for the Final Dataset

Finally, we create a heatmap of all the columns, so that we can see the correlation between every pair of columns. The mask hides the upper triangle, which would only repeat the lower one.

plt.figure(figsize=(20,18))
mask = np.triu(np.ones_like(df.corr(), dtype=bool))
sns.heatmap(df.corr(), mask = mask, vmin = -1.0, vmax = 1.0,
annot=True, cmap = 'RdBu_r')
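
With this many columns the heatmap can be hard to read, so it may help to list only the correlations with price, sorted by strength. The sketch below uses synthetic data with an invented price formula, purely to show the pattern; on the real data, toy would be replaced with df.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data; the price formula is invented for illustration.
rng = np.random.default_rng(0)
n = 500
toy = pd.DataFrame({
    "class": rng.integers(0, 2, n),
    "stops": rng.integers(0, 3, n),
    "days_left": rng.integers(1, 50, n),
})
toy["price"] = (5000 + 40000 * toy["class"] + 2000 * toy["stops"]
                - 50 * toy["days_left"] + rng.normal(0, 1500, n))

# Correlation of every column with price, excluding price itself.
price_corr = toy.corr()["price"].drop("price")

# Order by absolute value so strong negative correlations are not buried.
ranked = price_corr.reindex(price_corr.abs().sort_values(ascending=False).index)
print(ranked)
```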

Section 5: Linear Regression Model

Now, it's time! We can start creating and training our Linear Regression model, which will predict the prices of airline tickets.

1. Train-Test Split

We now split the dataset into training and testing data. The training data is used to fit the model, while the testing data, which the model has not seen, is used to evaluate it. Here, 15% of the rows are held out for testing.

We use train_test_split from Sklearn's model_selection module for this.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df.drop('price', axis = 1),
df['price'], test_size=0.15)
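
Note that the split is random, so the metrics later in the article will differ slightly from run to run. A minimal sketch of how the split could be made repeatable is to pass random_state, which is a standard train_test_split parameter; the seed value 42 and the toy arrays below are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy arrays standing in for the features and the price column.
X = np.arange(200).reshape(100, 2)
y = np.arange(100)

# The same seed produces the same split on every run.
X_tr_a, X_te_a, y_tr_a, y_te_a = train_test_split(
    X, y, test_size=0.15, random_state=42)
X_tr_b, X_te_b, y_tr_b, y_te_b = train_test_split(
    X, y, test_size=0.15, random_state=42)

print(len(y_tr_a), len(y_te_a))
print(np.array_equal(y_te_a, y_te_b))
```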

2. Training the Model

We will now train our Machine Learning model on the training data. We use LinearRegression from Sklearn's linear_model module for this.

from sklearn.linear_model import LinearRegression
lm = LinearRegression()
lm.fit(X_train, y_train)
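
It can help to see what the fitted model actually stores. A fitted LinearRegression predicts with an intercept plus a weighted sum of the feature values, and those weights are exactly the coefficients we inspect in the conclusion. The sketch below checks this on a noise-free toy target, so the numbers 3, 2 and -1 are invented for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Noise-free toy data: y = 3 + 2*x1 - 1*x2
rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(50, 2))
y = 3.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1]

lm = LinearRegression().fit(X, y)

# A prediction is the intercept plus the dot product of features and coefficients.
manual = lm.intercept_ + X @ lm.coef_

print(lm.intercept_, lm.coef_)
print(np.allclose(manual, lm.predict(X)))
```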

Section 6: Evaluation

Now, we evaluate the model by comparing the actual prices in the test data with the prices predicted by the Linear Regression model.

1. Creating Predictions

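We generate predictions by running the model on the test data, then plot them against the actual prices. The closer the points lie to a straight diagonal line, the better the predictions.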

predictions = lm.predict(X_test)
plt.scatter(y_test, predictions)
plt.xlabel('Y Test')
plt.ylabel('Predictions')

2. Calculating Metrics of Evaluation

We will now calculate the Mean Absolute Error (MAE), Mean Squared Error (MSE), and Root Mean Squared Error (RMSE) to measure how far the predictions are from the actual prices.

from sklearn import metrics
print("MAE: ", metrics.mean_absolute_error(y_test, predictions))
print("MSE: ", metrics.mean_squared_error(y_test, predictions))
print("RMSE: ", np.sqrt(metrics.mean_squared_error(y_test, predictions)))
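
To make these metrics concrete, here is a small worked example on four made-up prices, computed by hand with NumPy and checked against Sklearn. It also computes the R² score with metrics.r2_score, which is not used in the article but is a common companion metric.

```python
import numpy as np
from sklearn import metrics

# Four made-up actual and predicted prices.
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 330.0, 400.0])

errors = y_true - y_pred            # [-10, 10, -30, 0]

mae = np.mean(np.abs(errors))       # (10 + 10 + 30 + 0) / 4 = 12.5
mse = np.mean(errors ** 2)          # (100 + 100 + 900 + 0) / 4 = 275.0
rmse = np.sqrt(mse)                 # about 16.58

# R^2 = 1 - (sum of squared errors) / (total variation around the mean)
#     = 1 - 1100 / 50000 = 0.978
r2 = metrics.r2_score(y_true, y_pred)

print(mae, mse, rmse, r2)
```

MAE and RMSE are in the same unit as the price, so they are the easier ones to interpret. RMSE is larger than MAE here because squaring gives the single 30-unit miss more weight.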

Distribution plot of the difference between the test values and the predicted values.

Section 7: Conclusion

Now, let's draw a conclusion. For that, we display the coefficient that the Linear Regression model learned for each column.

coefficients = pd.DataFrame(lm.coef_, df.drop('price', axis = 1).columns)
coefficients.columns = ['Coefficients']
coefficients

From the coefficients above, we can see that the number of stops and the travel class influence the price of a ticket the most, while the duration of the flight also influences the price.
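
One caveat when comparing coefficients is that each one is expressed per unit of its own column, so a column measured in hours and a 0/1 flag are not directly on the same scale. A common way to put them on an equal footing is to standardize the features before fitting. The sketch below shows the idea on synthetic data with an invented price formula; StandardScaler is not used in the article and is introduced here only as an illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data; the price formula is invented for illustration.
rng = np.random.default_rng(0)
n = 1000
X = pd.DataFrame({
    "class": rng.integers(0, 2, n),       # 0/1 flag
    "duration": rng.uniform(1, 40, n),    # hours
})
y = 5000 + 40000 * X["class"] + 100 * X["duration"] + rng.normal(0, 500, n)

# Raw coefficients: price change per one unit of each column.
raw = LinearRegression().fit(X, y)

# Standardized coefficients: price change per one standard deviation.
X_scaled = StandardScaler().fit_transform(X)
scaled = LinearRegression().fit(X_scaled, y)

comparison = pd.DataFrame(
    {"raw": raw.coef_, "standardized": scaled.coef_}, index=X.columns)
print(comparison)
```

Each standardized coefficient is simply the raw coefficient multiplied by the standard deviation of its column, so the ranking can change when columns have very different spreads.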

Written by Karan Singh

Microsoft Student Partner | Samsung Brand Ambassador | Bachelors in Computer Science Student | Aviation geek | Formula One Fan
