Creating an Airline Ticket Price Predictor Using Linear Regression
The science behind airline ticket prices is simple and complex at the same time. With Machine Learning, we can better understand what goes into getting a cheaper ticket. You might ask how? Well, it’s not that tough!
In this article, we will create, step by step, a Linear Regression model to predict the prices of airline tickets. In the end, we will use the Linear Regression coefficients to determine which factors influence the price of airline tickets the most.
Requirements and Pre-requisites
To follow along, basic Python programming knowledge is recommended. You also need some familiarity with data analysis libraries such as NumPy and Pandas, and with data visualization libraries such as Matplotlib and Seaborn.
You also need an environment to run this project. One option is to install Jupyter Notebook or JupyterLab on your machine from this link: https://jupyter.org/install
Alternatively, you can use Google Colab, an online tool that requires no installation, only a Google account. You can open Google Colab here: https://colab.research.google.com/
I’ve taken the dataset from Kaggle. You can follow this link to download it: https://www.kaggle.com/datasets/shubhambathwal/flight-price-prediction
Section 1: Importing Libraries and Reading the Dataset
First, we import the required libraries: NumPy, Pandas, Matplotlib, and Seaborn.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
Then we read the dataset that we downloaded from Kaggle. We use the df.head() command to check that the dataset has been loaded correctly.
df = pd.read_csv('Clean_Dataset.csv')
df.head()
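Before going further, it can help to confirm that the file loaded with the columns the rest of this article relies on. The sketch below uses a tiny in-memory CSV as a stand-in for Clean_Dataset.csv; the helper name check_columns and the two sample rows are illustrative assumptions, not part of the Kaggle dataset.

```python
import io
import pandas as pd

# Column names used by the code in the later sections of this article.
EXPECTED_COLUMNS = [
    "airline", "flight", "source_city", "departure_time", "stops",
    "arrival_time", "destination_city", "class", "duration",
    "days_left", "price",
]

def check_columns(frame, expected=EXPECTED_COLUMNS):
    """Return the expected column names that are missing from the frame."""
    return [col for col in expected if col not in frame.columns]

# Hypothetical two-row sample standing in for Clean_Dataset.csv.
sample_csv = io.StringIO(
    "airline,flight,source_city,departure_time,stops,arrival_time,"
    "destination_city,class,duration,days_left,price\n"
    "AirA,AA-101,Delhi,Morning,zero,Afternoon,Mumbai,Economy,2.25,10,5000\n"
    "AirB,BB-202,Mumbai,Evening,one,Night,Delhi,Business,5.50,3,42000\n"
)
sample_df = pd.read_csv(sample_csv)

missing = check_columns(sample_df)
print("Missing columns:", missing)
```

With the real file, the same helper could be called on df right after pd.read_csv; an empty list means every column used later is present.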
Section 2: Gathering Information About the Dataset
Now, we will use built-in DataFrame methods to view information about the given dataset. First, we use df.info() to view information about the columns of the dataset:
df.info()
Now, we use df.describe() to get summary statistics for the dataset. By default, this function only covers columns that store integer or float data types.
df.describe()
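The text columns can be summarized as well by telling describe() which data types to leave out. This is a minimal sketch on a small made-up frame (the values are hypothetical); it relies only on the exclude parameter of DataFrame.describe.

```python
import pandas as pd

# Small hypothetical frame mixing text and numeric columns.
demo = pd.DataFrame({
    "class": ["Economy", "Business", "Economy", "Economy"],
    "duration": [2.0, 5.5, 3.0, 2.5],
    "price": [5000, 42000, 6000, 5500],
})

# Default behaviour: only the numeric columns are summarized.
numeric_summary = demo.describe()

# Excluding numeric types leaves the text columns, which get
# count, unique, top (most frequent value) and freq instead.
text_summary = demo.describe(exclude=["number"])

print(numeric_summary)
print(text_summary)
```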
Section 3: Exploratory Data Analysis
Now, let’s visualize the data to get a better idea of the trends within this dataset.
1. Price vs Duration Jointplot
# jointplot creates its own figure, so its size is set with height
sns.jointplot(x = 'duration', y = 'price', data = df.sample(5000),
              kind = "kde", fill = True, height = 8)
2. Price vs Duration Jointplot for Economy and Business Class
# jointplot creates its own figure, so its size is set with height
sns.jointplot(x = 'duration', y = 'price', data = df.sample(5000),
              hue = "class", kind = "kde", fill = True, height = 8)
3. Countplot for different classes
plt.figure(figsize=(10,6))
sns.countplot(x = 'class', data = df, palette='pastel')
4. Boxplot for Price vs Class
plt.figure(figsize=(10,6))
sns.boxplot(x = 'class', y = 'price', data = df, showfliers = False,
palette='pastel')
5. Price vs Number of Days Left Jointplot with Regression Line
# jointplot creates its own figure, so its size is set with height
sns.jointplot(x = 'days_left', y = 'price', data = df.sample(1000),
              kind = "reg", height = 8)
6. Number of Tickets Offered by Each Airline
plt.figure(figsize=(10,6))
sns.countplot(x = 'airline', data=df, palette='rainbow')
7. Number of tickets offered by each Airline on the Basis of Class
plt.figure(figsize=(10,6))
sns.countplot(x = 'airline', data=df, palette='pastel', hue='class')
8. Boxplot for Price vs Airline
plt.figure(figsize=(10,8))
sns.boxplot(x = 'airline', y = 'price', data = df, palette='rainbow',
showfliers = False)
9. Boxplot for Price vs Airline on the Basis of Class
plt.figure(figsize=(12,10))
sns.boxplot(x = 'airline', y = 'price', data = df, palette='pastel',
showfliers = False, hue='class')
10. Price vs Number of Stops Boxplot
plt.figure(figsize=(10,9))
sns.boxplot(x = 'stops', y = 'price', data = df, palette='rainbow',
showfliers = False)
11. Price vs Number of Stops Boxplot on the Basis of Class
plt.figure(figsize=(10,7))
sns.boxplot(x = 'stops', y = 'price', data = df, palette='pastel',
hue = 'class', showfliers = False)
12. Price vs Source City Boxplot
plt.figure(figsize=(10,6))
sns.boxplot(x = 'source_city', y = 'price', data = df, palette= 'rainbow',
showfliers = False)
13. Price vs Source City Boxplot on the Basis of Class
plt.figure(figsize=(12,6))
sns.boxplot(x = 'source_city', y = 'price', data = df, palette='pastel',
hue = 'class', showfliers = False)
14. Price vs Destination City Boxplot
plt.figure(figsize=(10,6))
sns.boxplot(x = 'destination_city', y = 'price', data = df,
palette= 'rainbow', showfliers = False)
15. Price vs Destination City Boxplot on the Basis of Class
plt.figure(figsize=(12,6))
sns.boxplot(x = 'destination_city', y = 'price', data = df, palette='pastel',
            hue = 'class', showfliers = False)
16. Price vs Departure Time Boxplot
plt.figure(figsize=(10,6))
sns.boxplot(x = 'departure_time', y = 'price', data = df, palette= 'rainbow',
showfliers = False)
17. Price vs Departure Time Boxplot on the Basis of Class
plt.figure(figsize=(10,6))
sns.boxplot(x = 'departure_time', y = 'price', data = df, palette= 'pastel',
hue = 'class', showfliers = False)
18. Price vs Arrival Time Boxplot
plt.figure(figsize=(10,6))
sns.boxplot(x = 'arrival_time', y = 'price', data = df,
palette= 'rainbow', showfliers = False)
19. Price vs Arrival Time Boxplot on the Basis of Class
plt.figure(figsize=(10,6))
sns.boxplot(x = 'arrival_time', y = 'price', data = df, palette= 'pastel',
hue = 'class', showfliers = False)
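The plots above can be backed up with a few numbers. The sketch below shows two such checks on a made-up frame (the prices are hypothetical): the median price per class, which is the middle line of each box in a boxplot, and the correlation between days_left and price, which indicates the direction of the regression line.

```python
import pandas as pd

# Hypothetical sample; the real dataset has far more rows.
demo = pd.DataFrame({
    "class": ["Economy", "Economy", "Economy",
              "Business", "Business", "Business"],
    "days_left": [1, 10, 30, 1, 10, 30],
    "price": [9000, 6000, 4000, 60000, 50000, 45000],
})

# Median price per class: the numeric counterpart of the class boxplot.
median_by_class = demo.groupby("class")["price"].median()

# Pearson correlation: negative means prices fall as days_left grows.
corr_days_price = demo["days_left"].corr(demo["price"])

print(median_by_class)
print("Correlation between days_left and price:", corr_days_price)
```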
Section 4: Pre-Processing
Now, we prepare the data for creating the Linear Regression model.
1. Removing Flight Column
We drop the flight column, since it doesn’t provide any useful information for the model and will not be processed by the algorithm.
df.drop('flight', axis = 1, inplace= True)
df.head()
2. Converting Stops to Numbers
We convert the number of stops to numbers since Linear Regression requires numeric input for creating the model.
def convert_stops(stop):
    # Map the text label for the number of stops to an integer.
    if stop == 'zero':
        return 0
    elif stop == 'one':
        return 1
    else:
        # Any other label is treated as two stops.
        return 2
df['stops'] = df['stops'].apply(convert_stops)
df
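An alternative to the if/else function is a dictionary lookup with Series.map. One difference is that labels missing from the dictionary become NaN instead of silently falling into the last branch, which makes unexpected values easy to spot. In this sketch, the 'two_or_more' label is an assumption about how the dataset spells its third category, and 'unknown' is a made-up value used to show the behaviour.

```python
import pandas as pd

# 'zero' and 'one' come from the function above; 'two_or_more' is an
# assumed spelling of the third label.
STOP_CODES = {"zero": 0, "one": 1, "two_or_more": 2}

# Hypothetical column with one unexpected label.
stops = pd.Series(["zero", "one", "two_or_more", "unknown"])
encoded = stops.map(STOP_CODES)

# Labels that are not in the dictionary are mapped to NaN.
unmapped = stops[encoded.isna()]

print(encoded)
print("Unmapped labels:", unmapped.tolist())
```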
3. Converting the Class
We convert the travel class to numbers. 0: Economy and 1: Business
def convert_class(Class):
    # Economy becomes 0; any other value is treated as Business (1).
    if Class == 'Economy':
        return 0
    else:
        return 1
df['class'] = df['class'].apply(convert_class)
df
4. Converting Other Remaining Columns Into Dummies
Now, we convert the remaining categorical columns (airline, source_city, departure_time, arrival_time, and destination_city) into dummy columns, so that the Linear Regression model can process them.
dummy_variables = ['airline','source_city', 'departure_time', 'arrival_time',
'destination_city']
dummies = pd.get_dummies(df[dummy_variables], drop_first=True)
dummies
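To see what drop_first=True does, here is a small sketch on a made-up column (the values are hypothetical). Without it, each row has exactly one dummy set, so the dummy columns always add up to 1 and one of them is redundant for a linear model with an intercept. Dropping the first level removes that redundancy, and the dropped level becomes the baseline that the other coefficients are measured against.

```python
import pandas as pd

# Hypothetical column with three categories.
demo = pd.DataFrame(
    {"departure_time": ["Morning", "Evening", "Night", "Morning"]}
)

# One dummy column per category.
all_levels = pd.get_dummies(demo, columns=["departure_time"])

# Same encoding, but the first category (in sorted order) is dropped.
dropped = pd.get_dummies(demo, columns=["departure_time"], drop_first=True)

print(list(all_levels.columns))
print(list(dropped.columns))
```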
5. Adding Dummies to the Main Dataset
Now, we add the dummy variables to the dataset and remove the original airline, source_city, departure_time, arrival_time, and destination_city columns, since those are no longer required.
df.drop(dummy_variables, axis = 1, inplace=True)
df = pd.concat([df,dummies], axis = 1)
df
6. Creating Heatmap for the Final Dataset
Finally, we create a heatmap of all the columns to get an idea of the correlation between them.
plt.figure(figsize=(20,18))
mask = np.triu(np.ones_like(df.corr(), dtype=bool))
sns.heatmap(df.corr(), mask = mask, vmin = -1.0, vmax = 1.0,
annot=True, cmap = 'RdBu_r')
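With this many columns the heatmap can be hard to read, so it may help to list only the correlations with price, ordered by strength. A minimal sketch on a made-up numeric frame (the values are hypothetical); on the real data the same chain could be applied to df after the pre-processing steps.

```python
import pandas as pd

# Hypothetical, already-numeric sample.
demo = pd.DataFrame({
    "class": [0, 0, 0, 1, 1, 1],
    "stops": [0, 1, 1, 0, 1, 2],
    "days_left": [5, 20, 40, 5, 20, 40],
    "price": [7000, 6500, 4000, 52000, 55000, 50000],
})

# Correlation of every column with price, strongest first.
price_corr = (
    demo.corr()["price"]
    .drop("price")                           # price with itself is always 1
    .sort_values(key=abs, ascending=False)   # order by absolute value
)

print(price_corr)
```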
Section 5: Linear Regression Model
Now, it’s time! We can start creating and training our Linear Regression model, which will predict the prices of airline tickets.
1. Train-Test Split
We now split the dataset into training and testing data. The training data is used to fit the model, while the testing data is held back to evaluate it.
We will use train_test_split from Sklearn’s model_selection module for this.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df.drop('price', axis = 1),
df['price'], test_size=0.15)
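The split is random, so the metrics will change slightly from run to run. Passing random_state to train_test_split makes the split repeatable. The sketch below shows this on made-up arrays; the value 42 is an arbitrary choice.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical features and target: 20 rows, 2 feature columns.
X = np.arange(40).reshape(20, 2)
y = np.arange(20)

# The same random_state gives the same split on every call.
split_a = train_test_split(X, y, test_size=0.15, random_state=42)
split_b = train_test_split(X, y, test_size=0.15, random_state=42)

X_train, X_test, y_train, y_test = split_a
same_test_rows = np.array_equal(split_a[1], split_b[1])

print("Train rows:", len(X_train), "Test rows:", len(X_test))
print("Identical test rows across calls:", same_test_rows)
```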
2. Training the Model
We will now train our Machine Learning model on the given training dataset. We would use Linear Regression from Sklearn’s linear model for this.
from sklearn.linear_model import LinearRegression
lm = LinearRegression()
lm.fit(X_train, y_train)
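LinearRegression fits an ordinary least squares model: it picks the intercept and coefficients that minimize the sum of squared differences between predicted and actual values. To make that concrete, here is a sketch that solves the same kind of problem directly with NumPy on synthetic data. The true coefficients 3 and -2 and the intercept 5 are made up for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data generated from a known linear rule plus a little noise.
X = rng.normal(size=(200, 2))
true_coef = np.array([3.0, -2.0])
true_intercept = 5.0
y = X @ true_coef + true_intercept + rng.normal(scale=0.1, size=200)

# Add a column of ones so the intercept is estimated with the coefficients.
X_design = np.column_stack([np.ones(len(X)), X])

# Least squares solution of X_design @ solution = y.
solution, *_ = np.linalg.lstsq(X_design, y, rcond=None)
intercept, coef = solution[0], solution[1:]

print("Estimated intercept:", intercept)
print("Estimated coefficients:", coef)
```

The estimates land close to the values used to generate the data, which is the behaviour we rely on when reading lm.coef_ later.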
Section 6: Evaluation
Now, we evaluate the model by comparing the actual test values with the values predicted by the Linear Regression model.
1. Creating Predictions
We generate predictions by running the trained model on the test data, then plot them against the actual test values.
predictions = lm.predict(X_test)
plt.scatter(y_test, predictions)
plt.xlabel('Y Test')
plt.ylabel('Predictions')
2. Calculating Metrics of Evaluation
We now calculate the Mean Absolute Error (MAE), Mean Squared Error (MSE), and Root Mean Squared Error (RMSE) to measure how far the predictions are from the actual prices.
from sklearn import metrics
print("MAE: ", metrics.mean_absolute_error(y_test, predictions))
print("MSE: ", metrics.mean_squared_error(y_test, predictions))
print("RMSE: ", np.sqrt(metrics.mean_squared_error(y_test, predictions)))
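To see what these three metrics measure, they can be computed by hand with NumPy. The four values below are made up for the illustration.

```python
import numpy as np

# Hypothetical actual and predicted prices.
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 330.0, 400.0])

errors = y_true - y_pred          # -10, 10, -30, 0

mae = np.mean(np.abs(errors))     # average size of the error
mse = np.mean(errors ** 2)        # average squared error, punishes big misses
rmse = np.sqrt(mse)               # back in the same unit as the price

print("MAE: ", mae)    # 12.5
print("MSE: ", mse)    # 275.0
print("RMSE:", rmse)
```

The RMSE is larger than the MAE here because the single 30-unit miss is squared before averaging, so it weighs more heavily.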
A distribution plot of the differences between the test values and the predicted values shows how the errors are spread.
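The article does not include code for this plot, so the following is only a sketch of one way to draw it, using a plain Matplotlib histogram on synthetic numbers. The arrays y_true and y_hat are hypothetical stand-ins; in the notebook, the residuals would be the test prices minus the model's predictions.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)

# Synthetic stand-ins for the test prices and the model's predictions.
y_true = rng.uniform(2000, 60000, size=500)
y_hat = y_true + rng.normal(scale=3000, size=500)

# Residual = actual value minus predicted value.
residuals = y_true - y_hat

fig, ax = plt.subplots(figsize=(10, 6))
ax.hist(residuals, bins=40)
ax.axvline(0, color="black", linestyle="--")  # zero error reference line
ax.set_xlabel("Test value minus prediction")
ax.set_ylabel("Count")
```

A histogram that is roughly symmetric and centered near zero suggests the model is not consistently over- or under-predicting.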
Section 7: Conclusion
Now, let’s discuss the conclusion. For that, we need to display the Linear Regression coefficient values.
coefficients = pd.DataFrame(lm.coef_, df.drop('price', axis = 1).columns)
coefficients.columns = ['Coefficients']
coefficients
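From the coefficients above, we can see that the number of stops and the travel class influence the price of a ticket the most, while the duration of the flight also affects it. Keep in mind that each coefficient is expressed per unit of its own feature, so magnitudes are only directly comparable between features on similar scales.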
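One way to check a conclusion like this is to standardize the features before fitting, so that every coefficient describes the effect of a one standard deviation change. The sketch below uses synthetic data with made-up effect sizes, so it says nothing about the real dataset. It only shows that the ranking of raw coefficients can differ from the ranking of standardized ones.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
n = 500

# Synthetic features on very different scales.
duration = rng.uniform(1, 30, size=n)       # wide range, like hours
travel_class = rng.integers(0, 2, size=n)   # 0 or 1

# Made-up pricing rule for the illustration.
price = (2000 * duration + 10000 * travel_class
         + rng.normal(scale=500, size=n))

X = np.column_stack([duration, travel_class])

# Raw coefficients: price change per one unit of each feature.
raw = LinearRegression().fit(X, price)

# Standardized coefficients: price change per one standard deviation.
scaled = LinearRegression().fit(StandardScaler().fit_transform(X), price)

print("Raw coefficients:         ", raw.coef_)
print("Standardized coefficients:", scaled.coef_)
```

In this synthetic case the class has the larger raw coefficient, but duration has the larger standardized one, because duration varies over a much wider range.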