Airbnb Exploratory Analysis

Introduction

The data that I will be analyzing for this project pertains to Airbnb renters. As many investors choose to take the real estate route, Airbnb has become a competitive platform for those trying to get their share in the real estate market. The main question here is, “does flexibility affect rental popularity?” This question is heavily packed with different factors such as cancellation policies and service fees. This analysis emphasizes its focus on the details of renters as opposed to customers for the purpose of competitor research. With this data, we will be able to compare variables between highly ranked & priced rentals. This may give insight to prospective Airbnb hosts about the specific qualities that they may want to incorporate into their own rental properties.

This file of data was collected by a website called insideairbnb.com, where they track active listings of Airbnb rentals throughout the world. I am particularly interested in this set of data because it includes specific variables such as cancellation policy, service fees, and minimum nights requirements. These values are important in defining flexibility. This data contains 102599 rows of inputs within 26 different columns. The specific scope of this data is the state of New York. With this abundance of different information, I can expect to initially perform a linear regression analysis between the higher threshold of ‘flexible’ renters to the lower ones. Additionally, I could perform a multiple regression analysis with other specific variables. But as always, further interpretation may reveal other insights throughout the process.


Pre-Analysis Observations

By glancing over the dataset, it does not contain Null, NaN, etc., only empty spaces. I only want to consider the information that contains all of the column values. So it will be important to remove all unnecessary symbols or missing values. I can see from the columns that have price values, I will need to drop symbols and cast the columns as int instead of strings. This will make it possible for me to do regressions during analysis.

The data contains a mix of different data types. The ‘instant_bookable’ column is the only one that shows binary. The only two values possible are ‘True’ and ‘False’. Some of the discrete values present in the ‘price’ and ‘service fee’ columns. Lastly, the other data types we will see are categorical. The ‘review rating number’ is an example of this and the logistic regression will include a comparison between it and the price. The other columns that I will reference aren't necessarily going to be used as quantitative variables. This is because I will create two new lists, 'flexible' and 'else', where they will function as categorical data. 


# import libraries

import pandas as pd, numpy as np, matplotlib as mlt

# import and read csv file

# define na values

# cast license column to str data type

df = pd.read_csv('/Users/haley/Desktop/Capstone/Airbnb_Open_Data.csv', na_values=['-', '?', 'na', 'NaT', 'None'], dtype={'license': str})

desired_width=150

pd.set_option('display.width', desired_width)

pd.set_option('display.max_columns',26)

print(df.head())

Data Preparation 

My first task was to remove all possible repeating rows. I used the 'id' column as the primary indicator instead of the renter's 'name' because it is possible for renters to have more than one property. Meaning that it may be valid for the renter's name to appear more than once, but it shouldn't be possible to have duplicate id's for individual properties. After dropping the duplicates, I created a new list, 'new_df',  with all of the relevant columns that would be used for the analysis. At this point, we have decreased our columns from 26 to 9. 

With the new list, I was able to create a for loop to identify any missing values. If the value was missing, the column name was appended to a new list, 'problems'. I used this as a reference list to compare my cleaned data to once I have completed the necessary steps. To remove missing values, I initiated a drop function for our 'new_df' list. Now I could compare the contents of 'problems' to 'new_df'. Since 'new_df' returned an empty list after going through the previous for loop, this indicated that all missing values were removed. As a second confirmation, I was able to see that our original data decreased to 99729 rows. 

Method

#sort and drop duplicate id rows

df.sort_values("id", inplace=True)

df.drop_duplicates(subset='id', keep=False, inplace=True)


#create new df with relevant columns

new_df = df[['id', 'instant_bookable', 'cancellation_policy', 'service fee', 'minimum nights','price', 'review rate number', 'number of reviews']].copy()


#identify columns with missing values

problems = []


for col_name in new_df.keys():

   if True in list(new_df[col_name].isna()):

       problems.append(col_name)


#drop rows with missing values

new_df = new_df.dropna()


#check for missing values again

problems2 = []

for col_name in new_df.keys():

   if True in list(new_df[col_name].isna()):

       problems.append(col_name)

After cleaning the columns with missing values, I realized that I needed to also clear some of our numerical variables of their symbols. Specifically, I needed to remove the '$' sign from 'service fee' and 'price'. Also, since the two columns were known as object types, I casted them to integers so I could use them for statistical descriptions and for the analysis. 

Before creating our new variable 'flex', I had to ensure that I removed any values that could skew the statistical description incorrectly. For this set, this meant removing any negative values. But instead of dropping the value, I chose to replace them with 0 as there cannot be "negative minimum nights". But, we can have 0 for those requirements. Once the negative values were replaced, I could then create the flex variable. This is a new category created based on my qualitative discretion. I wanted this new list to be a collection of highly rated rentals that meet a specific criteria as it relates to flexibility. Some of the conditions are as follows:


Since some of the conditions needed to be determined based on percentile, it was important to make sure that all of the columns were free of unnecessary values and were recognized as integers. 

Method

#id cleaning

new_df.drop(new_df[new_df['id'] == 365].index, inplace = True)


#price cleaning

new_df['price'] = new_df['price'].str.replace('$', '', regex=True)

new_df['price'] = new_df['price'].str.replace(',', '', regex=True).astype(int)


#review rate number cleaning

new_df[new_df['review rate number'] > 5] = 5

new_df.drop(new_df[new_df['review rate number'] < 1].index, inplace = True)

new_df['review rate number'] = new_df[['review rate number']].loc[(new_df['number of reviews'] >= 9)]


#service fee cleaning

new_df['service fee'] = new_df['service fee'].str.replace('$', '', regex=True).astype(int)


#minimum nights cleaning

new_df[new_df['minimum nights'] < 0] = 0

new_df.drop(new_df[new_df['minimum nights'] > 365].index, inplace = True)

Modeling & Analysis

Preliminary Analysis

For the preliminary analysis, I decided to look at some of the columns individually. I initially planned to focus on rentals that met a ‘flexible’ requirement. But, it came to my attention that the remaining data was too little to be able to represent that group. Out of 95,000+ entries, the filtered data returned less than 200 rows. This obstacle made me reevaluate the conditions that were set. Since the project was inspired by exploring highly rated renters and the flexibility that they offered, I assumed that there would be a clear correlation between the review rate and flexibility options that they offered. 

The next approach was to observe the columns individually. I evaluated the instant_bookable, cancellation_policy, and minimum nights columns since they are the main indicators of flexibility. The only filter that I set for this part of the analysis was to look at renters who are ranked as ‘5’ in the ‘review rate number’ column. My goal with this filter was to observe any patterns or trends between these highly rated renters and the options that they offer their customers.

Method

#Instant bookable

#filter renters that are rated '5'

ib5 = new_df[['instant_bookable']].loc[(new_df['review rate number'] == 5)]

#Finding quantity of each value

ib_yes = len(ib5[['instant_bookable']].loc[(ib5['instant_bookable'] == True)])

ib_no = len(ib5[['instant_bookable']].loc[(ib5['instant_bookable'] == False)])

#Create pie chart

colors = ['#ff9999','#66b3ff','#99ff99','#ffcc99']

plt.pie([ib_yes, ib_no], labels = ['Yes', 'No'], colors=colors ,autopct = '%.1f%%')

plt.title('Option to Instant Book')


#Cancellation Policy

#filter renters that are rated '5'

cnp = new_df[['cancellation_policy']].loc[(new_df['review rate number'] == 5)]

#Finding quantity of each value

cnp_f = len(cnp[['cancellation_policy']].loc[(cnp['cancellation_policy'] == 'flexible')])

cnp_m = len(cnp[['cancellation_policy']].loc[(cnp['cancellation_policy'] == 'moderate')])

cnp_s = len(cnp[['cancellation_policy']].loc[(cnp['cancellation_policy'] == 'strict')])

#create pie chart

colors = ['#ff9999','#66b3ff','#99ff99','#ffcc99']

plt.pie([cnp_f, cnp_m, cnp_s], labels = ['Flexible', 'Moderate', 'Strict'], colors=colors ,autopct = '%.1f%%')

plt.title('Cancellation Policy Flexibility Rating')


#Minimum Nights

#filter renters that are rated '5'

minn = new_df[['minimum nights']].loc[(new_df['review rate number'] == 5)]

unique_minn = []

for x in minn['minimum nights']:

   if x not in unique_minn:

       unique_minn.append(x)


unique_count = []

for x in unique_minn:

   count = len(minn[['minimum nights']].loc[(minn['minimum nights'] == x)])

   unique_count.append(count)


# Create bar graph

plt.bar(unique_minn, unique_count)

plt.xlabel("Minimum Nights")

plt.ylabel("Count")

plt.title("Minimum Night Requirement")

plt.xticks(unique_minn)

# Create box plot

plt.boxplot(minn, labels=['minimum nights'], showfliers=False)

Instant Book

The instant_bookable column was the first to be observed. The method I used was to compare how many renters did and did not offer the instant_bookable options. I visualized this by creating a pie chart to clearly show the ratio of each. Surprisingly, the result showed that there was almost a 50-50 split between True or False. Meaning, that 50.3% of 5 star renters did offer the instant_bookable option, and 49.7% did not. 

Cancellation Policy

The next evaluated column was cancellation_policy. I used the same approach with this column as the previous. I created a pie chart to show a distinction between the three categories: ‘flexible’, ‘moderate’, and ‘strict’. These values were already categorized with the original data. After producing the chart, I again saw that the divide between the three options were about the same. ‘Flexible’ and ‘strict’ both showed 33.3% and moderate showed 33.4%. This shows that 5 star renters equally offer the options for the cancellation policy. 

Minimum Nights

Lastly, the next column that I evaluated was minimum nights. I chose to visualize this data first with a boxplot because I wanted to evaluate the percentiles. From the chart, you can see that the range is between 1 and 275.But, 275 would be considered an outlier since it exceeds the IQR standard. Additionally, I created a bar graph to better show the frequency of the variables. For renters who are rated ‘5’, the bar graph shows a concentration towards lesser amounts of minimum nights required. Based on the statistics, the majority of these renters have a requirement of two or less nights. 

Based on the individual analysis of the three main columns, we can see that ‘5’ star renters don’t show a preference when it comes to offering the instant book option or a flexible cancellation policy. As for the minimum night requirement, we can see that these highly rated renters tend to stay towards the least amount of nights. Of the three columns used to analyze ‘flexibility’, only one of them showed a slight bias. To continue the analysis, I will be comparing these findings to the rest of the data population that did not rank ‘5’. This will help to further understand if there is a true difference between highly rated renters and others when it comes to flexibility options. 

Final Analysis 

For the second half of the analysis, I created the same visuals for the instant_bookable, cancellation_policy, and minimum nights columns. But, instead of filtering the data for renters rated as ‘5’, I chose to visualize all other rows where the rate is not ‘5’. By doing so, I now have a side-by-side visual of the two groups of renters. As mentioned before, the instant_bookable and cancellation_policy columns returned the similar statistical values. As for the minimum nights column, you can see a slight difference where ‘5’ rated renters trend towards 1 or 2 nights. As for the rest of the population, you can see that they trend towards 2 or 3 nights. Since the analysis of these columns showed minor to no differences in review rating, I decided to reach back into some of the original columns to further explore. I created a heat map to broadly determine any correlations that may be present. We can see that there is a direct correlation shown between the price and service fee. To visualize their relationship, I performed a linear regression. The positive linear slope suggests that the service fee increases as the price does. 

'5' Rated Renters

Instant Bookable

Non '5' Rated Renters

Instant Bookable

'5' Rated Renters

Cancellation Policy

Non '5' Rated Renters

Cancellation Policy

'5' Rated Renters

Minimum Nights

Non '5' Rated Renters

Minimum Nights

'5' Rated Renters

Minimum Nights

Non '5' Rated Renters

Minimum Nights

Method


# Create a heatmap

sns.heatmap(new_df.corr())


# Perform visual linear regression analysis:

sns.set(color_codes=True)

sns.regplot(x='service fee', y='price', data=new_df)


# Calculate the percent that service fees represent

round(((new_df['service fee'] / new_df['price']) * 100).mean(),2)

Correlation Heat Map

Service Fee X Price

There were a few outliers present in the columns except for instant_bookable. The two possibilities for these values are either they were incorrectly recorded, or they are a ‘special case’ scenario. Since the quantities of these values were insignificant, I chose to keep them in consideration during the analysis. Since I didn’t perform any methods that required the use of the mean, outliers did not affect my findings. 

The most prominent issue that I faced was trying to quantify some of the variables to be able to use for modeling. For example, the instant_bookable column returned a boolean response. To visualize it, I took the approach of calculating the frequencies of ‘True’ or ‘False’. Another column that took additional wrangling was the minimum nights column. I had to create new lists consisting of the unique values that were present. This would act as the x variable. Then I counted the frequencies of these values to be used in the y axis. 

Aside from the technical issues, I had limitations with the insights that I found. From my preliminary analysis, I realized that the variables that I chose to analyze did not show any direct correlations that would be found useful. This was the point where I decided to reevaluate the other columns, thus the result of the heat map.

Conclusion

After comparison, the following insights can be taken from the data that was analyzed:



For future projects, this data could be used to identify specific areas in New York that price on the higher or lower end. This could potentially help renters to choose specific areas to invest into when starting an Airbnb rental. Other considerations that one may take when doing an analysis is the rental amenities, construction build, and location. These factors may provide insight into the determination of rental prices and the rental's review rate.

Libraries: Pandas, Seaborn, Matplotlib

Visualization: Pie chart, box-plot, bar graph, linear regression, heat map

Methods: Stratified sampling, regression analysis, descriptive statistics, exploratory analysis