Predicting Airbnb Prices and Review Scores

50 minute read

Business Understanding

Since 2008, guests and hosts have used Airbnb to travel in a more unique, personalized way. As part of its growth initiative, Airbnb is seeking to enhance the customer experience and improve the business. The insights from this analysis will help the Customer Experience, Business Development and Marketing departments. We will monitor local trends and gain strategic insight into the growth of Airbnb in Seattle and into ways of improving the customer experience.

Data mining objectives

Airbnb wants to know what vibes are happening in each neighbourhood, whether there is a general upward trend of new Airbnb listings in Seattle, and which factors influence price and customer experience. We want to find metrics that the business can associate with good prices and increased listings, a good way to measure customer satisfaction, and things that Airbnb hosts can do to get more bookings.

We will use the listings, calendar and reviews datasets for our analysis. We will wrangle and explore this data to understand what data we have and find insights from it to meet our objectives. A large part of this will involve applying transformations to the data, and dealing with missing data.

To summarise, translating our business goals into data mining goals, we have:

1. What features of the listings are related to price?    
2. What vibes are happening in each neighbourhood?    
3. Are there factors or metrics that help us understand customer experience?    
4. Are there things that hosts/Airbnb can do to boost the guest experience and attract more clients?
5. Has there been an increase in the number of Airbnb hosts?
6. What are the busiest times of the year in Seattle, and by how much do prices change?

To clarify the terms: the customer is the guest booking a stay, and the host is the person offering the listing.

Data mining problem type

We will mine the data to answer, or attempt to answer, the data mining objectives. The questions about the vibes in each neighbourhood and the factors that help us understand and improve customer experience will require exploratory data analysis (EDA). The EDA will be performed on features of the listings and reviews datasets to find the relevant metrics.

The question of factors related to price and guest experience will become either a regression or classification problem for which we will do the modelling. The challenges will be selecting the features that are important and preparing the data for modelling (removing outliers, imputing missing values etc).

Answering the questions on price and customer experience will also involve feature engineering: choosing which features to use and, depending on the chosen model, determining its parameters.

Criteria for model assessment

The model will be evaluated against a set of tests to ensure it is valid.

The first data mining problems will be approached using descriptive statistics and visualisations of the distributions. From these we will be able to ascertain key metrics and some feature importances.

We will also need to ensure that the chosen metric satisfies the business requirement of being an adequate representation of customer experience, and that it is not chosen solely because it produces good results from the model (e.g. a high feature importance).

For the second data mining problem, we will use cross validation to evaluate candidates for modelling the data. We will evaluate using the R2 score and the RMSE if using regression. We could also plot residuals of the model to evaluate skew.
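As an illustration of this evaluation approach, here is a minimal sketch of cross-validating a regression model with R2 and RMSE scoring. The feature list is a placeholder chosen for illustration, not the final selection.

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_validate

listings = pd.read_csv("listings.csv")

# placeholder numeric features; the real feature set will come out of the EDA below
features = ["accommodates", "bathrooms", "bedrooms", "beds", "number_of_reviews"]
target = listings["price"].str.extract(r"(\d+)", expand=False).astype(float)

X = listings[features].fillna(0)
mask = target.notnull()

# 5-fold cross-validation, scoring with R^2 and (negated) RMSE
scores = cross_validate(
    RandomForestRegressor(n_estimators=100, random_state=0),
    X[mask], target[mask],
    cv=5,
    scoring=["r2", "neg_root_mean_squared_error"],
)
print("R2:  ", scores["test_r2"].mean())
print("RMSE:", -scores["test_neg_root_mean_squared_error"].mean())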

The third data mining problem will follow from the solution of the second data mining problem. We can apply a properly evaluated model on the data in order to assess the feature importance. We will ensure that the features chosen from the feature importance are appropriate for using in a strategy that could conceivably target hosts and how they could improve the guests’ experiences.
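Continuing the sketch above (reusing the placeholder X, target and mask), feature importances could then be read off a fitted tree-based model as a rough guide:

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X[mask], target[mask])

# tree-based importances give a rough ranking of the candidate features
importances = pd.Series(model.feature_importances_, index=features)
print(importances.sort_values(ascending=False))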

Finally, it is important to note the broad nature of the project, so we will narrow it down to improving customer service. We are only looking to improve guests’ customer experience, and this could be limited to certain subsets of the user base. As long as the chosen subset is large enough for potential insights to be significant, this will contribute towards satisfying our business goals.

Action plan

Boiling down the above into a plan, we have:

1. Explore and understand the data.

2. Perform exploratory and descriptive analysis in order to find an appropriate response variable related to customer experience and price.

3. Based on whether the response variable is continuous or discrete, frame either a regression or a classification problem for predicting it.

4. Clean and tidy the dataset in order to prepare it for modelling.

5. Select candidate algorithms for modelling the data and evaluate them using cross-validation.

6. Use the prediction model to ascertain the most important aspects through which hosts can improve guests’ experience.

7. Communicate the results as key points that a non-technical business audience can understand.

Data Understanding

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import nltk
import re

!pip install wordcloud
from wordcloud import WordCloud, STOPWORDS
from nltk.corpus import stopwords
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')

Data Gathering

calender_df=pd.read_csv("calendar.csv", delimiter=',')
listings_df=pd.read_csv("listings.csv", delimiter=',')
reviews_df=pd.read_csv("reviews.csv", delimiter=',')

Understanding the calendar dataset

calender_df.shape
(1393570, 4)
calender_df.head()
listing_id date available price
0 241032 2016-01-04 t $85.00
1 241032 2016-01-05 t $85.00
2 241032 2016-01-06 f NaN
3 241032 2016-01-07 f NaN
4 241032 2016-01-08 f NaN
# Which columns have no missing values?
no_missing_vals_cols=set(calender_df.columns[calender_df.isnull().mean()==0])
no_missing_vals_cols
{'available', 'date', 'listing_id'}
#replace t/f in the available column with 1/0
calender_df=calender_df.replace(to_replace="t",value="1")
calender_df=calender_df.replace(to_replace="f",value="0")
calender_df['available'] = calender_df['available'].astype(int, errors = 'raise')
#Find the proportion of available to unavailable  values
calender_df.available.value_counts()/ calender_df.shape[0]
1    0.67061
0    0.32939
Name: available, dtype: float64
calender_df['price'] = calender_df['price'].str.extract(r'(\d+)', expand=False)
# make price float
calender_df['price'] = calender_df['price'].astype(float, errors = 'raise')
#fill missing price values with mean
calender_df['price'].fillna((calender_df['price'].mean()), inplace=True)
#plot the proportions of available to not available
price_val=calender_df.groupby(['available']).listing_id.count()
price_val.plot.pie(figsize=(6,6))
<AxesSubplot:ylabel='listing_id'>

png

About 67% of the listing-days in the calendar are available and about 33% are unavailable.

#check the calender data types
calender_df.dtypes
listing_id      int64
date           object
available       int64
price         float64
dtype: object

The date column is stored as an object; we have to convert it to a datetime dtype for analysis.

#convert the date column from object to datetime
calender_df['date'] =  pd.to_datetime(calender_df['date'])
calender_df.date.dtypes
dtype('<M8[ns]')

A more useful representation is to split the datetime into day of the week, week of the year, month and year. This will help us uncover time-related trends.

# split date time into day of the week, week of the year, month of the year and year for indepth analysis

# Create new columns
calender_df['dayofweek'] = calender_df['date'].dt.dayofweek
calender_df['weekofyear'] = calender_df['date'].dt.isocalendar().week
calender_df['month'] = calender_df['date'].dt.month
calender_df['year'] = calender_df['date'].dt.year

calender_df[:5]
listing_id date available price dayofweek weekofyear month year
0 241032 2016-01-04 1 85.000000 0 1 1 2016
1 241032 2016-01-05 1 85.000000 1 1 1 2016
2 241032 2016-01-06 0 137.090652 2 1 1 2016
3 241032 2016-01-07 0 137.090652 3 1 1 2016
4 241032 2016-01-08 0 137.090652 4 1 1 2016

What we have gathered from the calendar data is that each row records, for a given listing and date, whether the listing was available and at what price. The file covers 3,818 distinct listings, and about 67% of the listing-days are available.
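The 3,818 figure can be checked directly from the calendar file, since each listing appears once per calendar day:

# number of distinct listings covered by the calendar
calender_df['listing_id'].nunique()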

To note: We split the datetime into day of the week, week of the year, month and year. This will be useful for our EDA

Understanding the reviews data

The reviews dataset is relatively small and contains comments made by guests. Knowing which words are used frequently, or which words are associated with quality, can provide insights for Airbnb hosts.

df2=reviews_df
df2.head()
listing_id id date reviewer_id reviewer_name comments
0 7202016 38917982 2015-07-19 28943674 Bianca Cute and cozy place. Perfect location to every...
1 7202016 39087409 2015-07-20 32440555 Frank Kelly has a great room in a very central locat...
2 7202016 39820030 2015-07-26 37722850 Ian Very spacious apartment, and in a great neighb...
3 7202016 40813543 2015-08-02 33671805 George Close to Seattle Center and all it has to offe...
4 7202016 41986501 2015-08-10 34959538 Ming Kelly was a great host and very accommodating ...
df2.shape
(84849, 6)
df2.dtypes
listing_id        int64
id                int64
date             object
reviewer_id       int64
reviewer_name    object
comments         object
dtype: object
#convert date to datetime
df2['date'] =  pd.to_datetime(df2['date'])
# split the datetime into day of the week, week of the year, month and year for in-depth analysis

# Create new columns
df2['dayofweek'] = df2['date'].dt.dayofweek
df2['weekofyear'] = df2['date'].dt.isocalendar().week
df2['month'] = df2['date'].dt.month
df2['year'] = df2['date'].dt.year
df2[:5]
listing_id id date reviewer_id reviewer_name comments dayofweek weekofyear month year
0 7202016 38917982 2015-07-19 28943674 Bianca Cute and cozy place. Perfect location to every... 6 29 7 2015
1 7202016 39087409 2015-07-20 32440555 Frank Kelly has a great room in a very central locat... 0 30 7 2015
2 7202016 39820030 2015-07-26 37722850 Ian Very spacious apartment, and in a great neighb... 6 30 7 2015
3 7202016 40813543 2015-08-02 33671805 George Close to Seattle Center and all it has to offe... 6 31 8 2015
4 7202016 41986501 2015-08-10 34959538 Ming Kelly was a great host and very accommodating ... 0 33 8 2015
def plot_display(plot_words_list):
    wordcloud = WordCloud(width = 1000, height = 700).generate(plot_words_list)
    plt.figure(figsize=(18,12))
    plt.imshow(wordcloud)
    plt.axis("off")
    plt.show()

We will use the Natural Language Toolkit (NLTK), a natural language processing library for Python, to process the review comments.

from wordcloud import WordCloud
from wordcloud import STOPWORDS

import matplotlib.pyplot as plt
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')
# take a sample of comments: the 30 rows with the highest listing_id
comments = df2[['comments','listing_id']]
commentsDFTopper = comments.sort_values('listing_id',ascending=[0])
commentsDFtop=commentsDFTopper.head(30)

# concatenate the sampled comments, keeping letters only
reviews = ''
for index,row in commentsDFtop.iterrows():
    p = re.sub('[^a-zA-Z]+',' ', row['comments'])
    reviews+=p

# tokenize, remove English stopwords and lemmatize
reviews_data=nltk.word_tokenize(reviews)
filtered_data=[word for word in reviews_data if word not in stopwords.words('english')]
wnl = nltk.WordNetLemmatizer()
reviews_data=[wnl.lemmatize(data) for data in filtered_data]
reviews_words=' '.join(reviews_data)

Common words among reviews

Words like great, clean, host, flexible and accommodating are used frequently. They are associated with many listings, so these are qualities hosts should strive for.

# display the processed (stopword-free, lemmatized) review text
plot_display(reviews_words)

png

Understanding Listings data

We need to look at the data available to us in the listings CSV file, in order to gain understanding of the data.

Columns

We start with reading in the file and listing out the columns available to us.

listings_df=pd.read_csv("listings.csv", delimiter=',')
listings_df.columns
Index(['id', 'listing_url', 'scrape_id', 'last_scraped', 'name', 'summary',
       'space', 'description', 'experiences_offered', 'neighborhood_overview',
       'notes', 'transit', 'thumbnail_url', 'medium_url', 'picture_url',
       'xl_picture_url', 'host_id', 'host_url', 'host_name', 'host_since',
       'host_location', 'host_about', 'host_response_time',
       'host_response_rate', 'host_acceptance_rate', 'host_is_superhost',
       'host_thumbnail_url', 'host_picture_url', 'host_neighbourhood',
       'host_listings_count', 'host_total_listings_count',
       'host_verifications', 'host_has_profile_pic', 'host_identity_verified',
       'street', 'neighbourhood', 'neighbourhood_cleansed',
       'neighbourhood_group_cleansed', 'city', 'state', 'zipcode', 'market',
       'smart_location', 'country_code', 'country', 'latitude', 'longitude',
       'is_location_exact', 'property_type', 'room_type', 'accommodates',
       'bathrooms', 'bedrooms', 'beds', 'bed_type', 'amenities', 'square_feet',
       'price', 'weekly_price', 'monthly_price', 'security_deposit',
       'cleaning_fee', 'guests_included', 'extra_people', 'minimum_nights',
       'maximum_nights', 'calendar_updated', 'has_availability',
       'availability_30', 'availability_60', 'availability_90',
       'availability_365', 'calendar_last_scraped', 'number_of_reviews',
       'first_review', 'last_review', 'review_scores_rating',
       'review_scores_accuracy', 'review_scores_cleanliness',
       'review_scores_checkin', 'review_scores_communication',
       'review_scores_location', 'review_scores_value', 'requires_license',
       'license', 'jurisdiction_names', 'instant_bookable',
       'cancellation_policy', 'require_guest_profile_picture',
       'require_guest_phone_verification', 'calculated_host_listings_count',
       'reviews_per_month'],
      dtype='object')
# How many data types do we have
listings_df.dtypes.value_counts()
object     62
float64    17
int64      13
dtype: int64
# Integer columns
listings_df.select_dtypes(include = ["int"])[:5]
id scrape_id host_id accommodates guests_included minimum_nights maximum_nights availability_30 availability_60 availability_90 availability_365 number_of_reviews calculated_host_listings_count
0 241032 20160104002432 956883 4 2 1 365 14 41 71 346 207 2
1 953595 20160104002432 5177328 4 1 2 90 13 13 16 291 43 6
2 3308979 20160104002432 16708587 11 10 4 30 1 6 17 220 20 2
3 7421966 20160104002432 9851441 3 1 1 1125 0 0 0 143 0 1
4 278830 20160104002432 1452570 6 6 1 1125 30 60 90 365 38 1
# Float columns
listings_df.select_dtypes(include = ["float"])[:5]
host_listings_count host_total_listings_count latitude longitude bathrooms bedrooms beds square_feet review_scores_rating review_scores_accuracy review_scores_cleanliness review_scores_checkin review_scores_communication review_scores_location review_scores_value license reviews_per_month
0 3.0 3.0 47.636289 -122.371025 1.0 1.0 1.0 NaN 95.0 10.0 10.0 10.0 10.0 9.0 10.0 NaN 4.07
1 6.0 6.0 47.639123 -122.365666 1.0 1.0 1.0 NaN 96.0 10.0 10.0 10.0 10.0 10.0 10.0 NaN 1.48
2 2.0 2.0 47.629724 -122.369483 4.5 5.0 7.0 NaN 97.0 10.0 10.0 10.0 10.0 10.0 10.0 NaN 1.15
3 1.0 1.0 47.638473 -122.369279 1.0 0.0 2.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 2.0 2.0 47.632918 -122.372471 2.0 3.0 3.0 NaN 92.0 9.0 9.0 10.0 10.0 9.0 9.0 NaN 0.89
# object columns
listings_df.select_dtypes(include = ["object"])[:5]
listing_url last_scraped name summary space description experiences_offered neighborhood_overview notes transit ... has_availability calendar_last_scraped first_review last_review requires_license jurisdiction_names instant_bookable cancellation_policy require_guest_profile_picture require_guest_phone_verification
0 https://www.airbnb.com/rooms/241032 2016-01-04 Stylish Queen Anne Apartment NaN Make your self at home in this charming one-be... Make your self at home in this charming one-be... none NaN NaN NaN ... t 2016-01-04 2011-11-01 2016-01-02 f WASHINGTON f moderate f f
1 https://www.airbnb.com/rooms/953595 2016-01-04 Bright & Airy Queen Anne Apartment Chemically sensitive? We've removed the irrita... Beautiful, hypoallergenic apartment in an extr... Chemically sensitive? We've removed the irrita... none Queen Anne is a wonderful, truly functional vi... What's up with the free pillows? Our home was... Convenient bus stops are just down the block, ... ... t 2016-01-04 2013-08-19 2015-12-29 f WASHINGTON f strict t t
2 https://www.airbnb.com/rooms/3308979 2016-01-04 New Modern House-Amazing water view New modern house built in 2013. Spectacular s... Our house is modern, light and fresh with a wa... New modern house built in 2013. Spectacular s... none Upper Queen Anne is a charming neighborhood fu... Our house is located just 5 short blocks to To... A bus stop is just 2 blocks away. Easy bus a... ... t 2016-01-04 2014-07-30 2015-09-03 f WASHINGTON f strict f f
3 https://www.airbnb.com/rooms/7421966 2016-01-04 Queen Anne Chateau A charming apartment that sits atop Queen Anne... NaN A charming apartment that sits atop Queen Anne... none NaN NaN NaN ... t 2016-01-04 NaN NaN f WASHINGTON f flexible f f
4 https://www.airbnb.com/rooms/278830 2016-01-04 Charming craftsman 3 bdm house Cozy family craftman house in beautiful neighb... Cozy family craftman house in beautiful neighb... Cozy family craftman house in beautiful neighb... none We are in the beautiful neighborhood of Queen ... Belltown The nearest public transit bus (D Line) is 2 b... ... t 2016-01-04 2012-07-10 2015-10-24 f WASHINGTON f strict f f

5 rows × 62 columns

#which columns have missing values
listings_df.isnull().sum()[listings_df.isna().sum()>0]
summary                         177
space                           569
neighborhood_overview          1032
notes                          1606
transit                         934
thumbnail_url                   320
medium_url                      320
xl_picture_url                  320
host_name                         2
host_since                        2
host_location                     8
host_about                      859
host_response_time              523
host_response_rate              523
host_acceptance_rate            773
host_is_superhost                 2
host_thumbnail_url                2
host_picture_url                  2
host_neighbourhood              300
host_listings_count               2
host_total_listings_count         2
host_has_profile_pic              2
host_identity_verified            2
neighbourhood                   416
zipcode                           7
property_type                     1
bathrooms                        16
bedrooms                          6
beds                              1
square_feet                    3721
weekly_price                   1809
monthly_price                  2301
security_deposit               1952
cleaning_fee                   1030
first_review                    627
last_review                     627
review_scores_rating            647
review_scores_accuracy          658
review_scores_cleanliness       653
review_scores_checkin           658
review_scores_communication     651
review_scores_location          655
review_scores_value             656
license                        3818
reviews_per_month               627
dtype: int64

The listings dataset has 92 features/columns. That is too many to treat individually, so we will need to focus. A quick overview of the list shows that the features fall into natural groupings:

  1. Columns about reviews, for instance number_of_reviews, first_review, last_review, review_scores_rating, review_scores_accuracy, review_scores_cleanliness, review_scores_checkin, review_scores_communication, review_scores_location and review_scores_value
  2. Columns related to the property, like amenities, bedrooms, room type etc.
  3. Columns related to location, like neighbourhood, zip code and city. We will want to see which neighbourhoods are busiest and most expensive
  4. Columns related to the host’s pricing policies, such as price, weekly price and security deposit
  5. Host-related attributes like host location and host response time; these features may be important for the guest experience

Review based columns

# list of columns related to reviews
review_cols = [col for col in listings_df.columns if 'review' in col]
review_cols
['number_of_reviews',
 'first_review',
 'last_review',
 'review_scores_rating',
 'review_scores_accuracy',
 'review_scores_cleanliness',
 'review_scores_checkin',
 'review_scores_communication',
 'review_scores_location',
 'review_scores_value',
 'reviews_per_month']
listings_df[review_cols].describe()
number_of_reviews review_scores_rating review_scores_accuracy review_scores_cleanliness review_scores_checkin review_scores_communication review_scores_location review_scores_value reviews_per_month
count 3818.000000 3171.000000 3160.000000 3165.000000 3160.000000 3167.000000 3163.000000 3162.000000 3191.000000
mean 22.223415 94.539262 9.636392 9.556398 9.786709 9.809599 9.608916 9.452245 2.078919
std 37.730892 6.606083 0.698031 0.797274 0.595499 0.568211 0.629053 0.750259 1.822348
min 0.000000 20.000000 2.000000 3.000000 2.000000 2.000000 4.000000 2.000000 0.020000
25% 2.000000 93.000000 9.000000 9.000000 10.000000 10.000000 9.000000 9.000000 0.695000
50% 9.000000 96.000000 10.000000 10.000000 10.000000 10.000000 10.000000 10.000000 1.540000
75% 26.000000 99.000000 10.000000 10.000000 10.000000 10.000000 10.000000 10.000000 3.000000
max 474.000000 100.000000 10.000000 10.000000 10.000000 10.000000 10.000000 10.000000 12.150000

There are 3818 rows in total for listings. All columns except for number_of_reviews have missing values. Looking at the columns we see some interesting things:

The range of number_of_reviews is huge, from 0 to 474, and the mean of 22 is nowhere near the median of 9, so the distribution is strongly right-skewed.

The standard deviation of the review_scores_rating column, at 6.61, is much higher than that of the other review_scores columns, though note it is measured on a 0-100 scale while the sub-scores are on a 0-10 scale.

The range of reviews_per_month is a lot smaller than number_of_reviews, with the maximum reviews_per_month at 12.15 compared with the maximum number_of_reviews at 474. This is most likely because number_of_reviews accumulates over an unbounded time period, whereas reviews_per_month is measured over a consistent time period, i.e. a month.
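As a quick sanity check of this reasoning, we can derive an approximate monthly review rate from number_of_reviews and the first/last review dates and compare it with the provided reviews_per_month (a rough sketch using the listings_df loaded above; the 30.44-day month is an approximation):

first = pd.to_datetime(listings_df['first_review'])
last = pd.to_datetime(listings_df['last_review'])

# months the listing has been collecting reviews (at least one, to avoid division by zero)
months_active = ((last - first).dt.days / 30.44).clip(lower=1)
approx_rate = listings_df['number_of_reviews'] / months_active

# compare the derived rate with the dataset's own reviews_per_month
pd.DataFrame({'approx': approx_rate, 'given': listings_df['reviews_per_month']}).corr()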

## Whats the correlation between the reviews columns
_df=listings_df[review_cols]
corr=_df.corr()
sns.heatmap(corr, xticklabels=corr.columns.values, yticklabels=corr.columns.values)
<AxesSubplot:>

png

There is strong correlation between basically all the review score columns, with review_scores_location the weakest of the group. review_scores_checkin and review_scores_communication are particularly strongly correlated. Another strong correlation is between reviews_per_month and number_of_reviews, which is understandable since one is essentially a rate derived from the other.

One of our data mining objectives is to find out whether there is a metric for understanding the quality of the guest experience. The review columns can help us answer that.

review_scores_rating seems to be the best suited, since the other score columns are strongly correlated with it and appear to be the components used to calculate the overall rating.

From our earlier analysis, review_scores_rating has 3,171 non-null values out of 3,818 rows, which means we are missing about 17% of the data for this column.

# a look at the review scores rating 
listings_df['review_scores_rating'].hist()
<AxesSubplot:>

png

Review score ratings are generally high, concentrated at the top of the scale with a long tail of lower scores. Good reviews are clearly the norm, so getting and keeping good reviews is very important for hosts.

# Another useful metric to note is the number of reviews vs reviews per month
listings_df[['reviews_per_month','number_of_reviews']].hist()
array([[<AxesSubplot:title={'center':'reviews_per_month'}>,
        <AxesSubplot:title={'center':'number_of_reviews'}>]], dtype=object)

png

We already know there is a strong positive correlation between these two. number_of_reviews spans a very wide range, which may simply mean a listing has been around longer. reviews_per_month gives a more accurate picture of current review activity.

listings_df[['number_of_reviews','first_review','last_review','reviews_per_month']][:5]
number_of_reviews first_review last_review reviews_per_month
0 207 2011-11-01 2016-01-02 4.07
1 43 2013-08-19 2015-12-29 1.48
2 20 2014-07-30 2015-09-03 1.15
3 0 NaN NaN NaN
4 38 2012-07-10 2015-10-24 0.89
listings_df[['review_scores_rating','reviews_per_month']].hist()
array([[<AxesSubplot:title={'center':'review_scores_rating'}>,
        <AxesSubplot:title={'center':'reviews_per_month'}>]], dtype=object)

png

So far we see that reviews_per_month and review_scores_rating are useful features for understanding, and thus improving, customer experience. As expected, some hosts have many total reviews but few reviews per month, suggesting they have simply been listed longer, not necessarily that they are performing better than hosts with fewer total reviews but a higher monthly rate.

Note: during data preparation, we may consider combining these two features into a response variable for modelling customer experience.
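As a purely illustrative sketch of such a combined response (the functional form here is an assumption, not a decision), one option is to weight the rating by how actively a listing is being reviewed:

# hypothetical engineered response: rating scaled to [0, 1] times log of review activity
experience_score = (listings_df['review_scores_rating'] / 100.0) * np.log1p(listings_df['reviews_per_month'])
experience_score.hist()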

new_price_col=['price', 'security_deposit', 'cleaning_fee', 'guests_included',
       'extra_people', 'minimum_nights', 'maximum_nights','weekly_price', 'monthly_price']

The money columns contain the dollar sign and are of object type. We need to remove the dollar sign and convert the data type to float for quantitative analysis

# remove the $ sign from the money columns
listings_df['price'] = listings_df['price'].str.extract(r'(\d+)', expand=False)
listings_df['weekly_price'] = listings_df['weekly_price'].str.extract(r'(\d+)', expand=False)
listings_df['monthly_price'] = listings_df['monthly_price'].str.extract(r'(\d+)', expand=False)
listings_df['security_deposit'] = listings_df['security_deposit'].str.extract(r'(\d+)', expand=False)
listings_df['cleaning_fee'] = listings_df['cleaning_fee'].str.extract(r'(\d+)', expand=False)
listings_df['extra_people'] = listings_df['extra_people'].str.extract(r'(\d+)', expand=False)
# change the money values to float
listings_df['cleaning_fee'] = listings_df['cleaning_fee'].astype(float, errors = 'raise')
listings_df['security_deposit'] = listings_df['security_deposit'].astype(float, errors = 'raise')
listings_df['extra_people'] = listings_df['extra_people'].astype(float, errors = 'raise')
listings_df['price'] = listings_df['price'].astype(float, errors = 'raise')
listings_df['weekly_price'] = listings_df['weekly_price'].astype(float, errors = 'raise')
listings_df['monthly_price'] = listings_df['monthly_price'].astype(float, errors = 'raise')
listings_df[new_price_col][:5]
price security_deposit cleaning_fee guests_included extra_people minimum_nights maximum_nights weekly_price monthly_price
0 85.0 NaN NaN 2 5.0 1 365 NaN NaN
1 150.0 100.0 40.0 1 0.0 2 90 1.0 3.0
2 975.0 1.0 300.0 10 25.0 4 30 NaN NaN
3 100.0 NaN NaN 1 0.0 1 1125 650.0 2.0
4 450.0 700.0 125.0 6 15.0 1 1125 NaN NaN
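As an aside, the repeated extract/astype calls above could be written as a single loop. The sketch below also strips commas, so that a value like "$1,000.00" keeps all of its digits; the simple (\d+) pattern used above keeps only the digits before the first comma or decimal point, which is a simplification to keep in mind.

money_cols = ['price', 'weekly_price', 'monthly_price', 'security_deposit', 'cleaning_fee', 'extra_people']

for col in money_cols:
    # strip "$" and "," then convert to float; missing values stay missing
    listings_df[col] = (listings_df[col].astype(str)
                        .str.replace(r'[$,]', '', regex=True)
                        .replace('nan', np.nan)
                        .astype(float))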
# check for missing values in the columns
listings_df[new_price_col].isnull().sum()[listings_df[new_price_col].isnull().sum()>0]
security_deposit    1952
cleaning_fee        1030
weekly_price        1809
monthly_price       2301
dtype: int64
# What are the summary statistics saying
listings_df[new_price_col].describe()
price security_deposit cleaning_fee guests_included extra_people minimum_nights maximum_nights weekly_price monthly_price
count 3818.00000 1866.000000 2788.000000 3818.000000 3818.000000 3818.000000 3818.000000 2009.000000 1517.000000
mean 127.71451 247.292069 61.710904 1.672603 10.667627 2.369303 780.447617 433.651070 67.986816
std 89.16278 153.486417 48.830341 1.311040 17.585922 16.305902 1683.589007 296.675884 225.403680
min 1.00000 1.000000 5.000000 0.000000 0.000000 1.000000 1.000000 1.000000 1.000000
25% 75.00000 100.000000 25.000000 1.000000 0.000000 1.000000 60.000000 232.000000 1.000000
50% 100.00000 200.000000 50.000000 1.000000 0.000000 2.000000 1125.000000 476.000000 2.000000
75% 150.00000 300.000000 83.000000 2.000000 20.000000 2.000000 1125.000000 650.000000 3.000000
max 999.00000 995.000000 300.000000 15.000000 300.000000 1000.000000 100000.000000 999.000000 995.000000

Security deposit is recorded for only about 49% of listings, so it has many missing values. It also has a huge range between its minimum, mean and maximum, and a large standard deviation: the charges do not appear standardised and are perhaps related to other factors like neighbourhood or property type, which we will investigate further in the EDA. maximum_nights contains some very high values. Note also that extra_people is a fee, not a head count: 75% of listings charge at most $20 per extra guest.

# plot histograms of the money-related columns
listings_df[new_price_col].hist(figsize=(18, 7))
array([[<AxesSubplot:title={'center':'price'}>,
        <AxesSubplot:title={'center':'security_deposit'}>,
        <AxesSubplot:title={'center':'cleaning_fee'}>],
       [<AxesSubplot:title={'center':'guests_included'}>,
        <AxesSubplot:title={'center':'extra_people'}>,
        <AxesSubplot:title={'center':'minimum_nights'}>],
       [<AxesSubplot:title={'center':'maximum_nights'}>,
        <AxesSubplot:title={'center':'weekly_price'}>,
        <AxesSubplot:title={'center':'monthly_price'}>]], dtype=object)

png

It is important when reading these histograms to remember that some extremely high values (outliers) may be distorting the distributions. The median security deposit is around $200, with a sizeable group of listings charging between $0 and $100. For cleaning fees, the median is about $50 and three quarters of listings charge $83 or less. Most listings include only one or two guests in the base price, and the extra-person fee is typically between $0 and $20.

## Whats the correlation between the money columns
_df2=listings_df[new_price_col]
_corr2=_df2.corr()
sns.heatmap(_corr2, xticklabels=_corr2.columns.values, yticklabels=_corr2.columns.values, annot = True)
<AxesSubplot:>

png

Price and cleaning fee are correlated; the cleaning fee likely scales with the price of the listing.

Location based columns

# look for columns with the word 'location' in them
location_cols = [col for col in listings_df.columns if 'location' in col]
location_cols
['host_location',
 'smart_location',
 'is_location_exact',
 'review_scores_location']
listings_df[location_cols][:5]
host_location smart_location is_location_exact review_scores_location
0 Seattle, Washington, United States Seattle, WA t 9.0
1 Seattle, Washington, United States Seattle, WA t 10.0
2 Seattle, Washington, United States Seattle, WA t 10.0
3 Seattle, Washington, United States Seattle, WA t NaN
4 Seattle, Washington, United States Seattle, WA t 9.0
listings_df[location_cols][:5].dtypes
host_location              object
smart_location             object
is_location_exact           int64
review_scores_location    float64
dtype: object
# is_location_exact has t/f values; replace them with 1/0

listings_df['is_location_exact']=listings_df['is_location_exact'].replace(to_replace="t",value="1")
listings_df['is_location_exact']=listings_df['is_location_exact'].replace(to_replace="f",value="0")
# change is_location_exact from string to int
listings_df['is_location_exact'] = listings_df['is_location_exact'].astype(int, errors = 'raise')
listings_df[location_cols][:5]
host_location smart_location is_location_exact review_scores_location
0 Seattle, Washington, United States Seattle, WA 1 9.0
1 Seattle, Washington, United States Seattle, WA 1 10.0
2 Seattle, Washington, United States Seattle, WA 1 10.0
3 Seattle, Washington, United States Seattle, WA 1 NaN
4 Seattle, Washington, United States Seattle, WA 1 9.0
# are there columns with missing values
listings_df[location_cols].isnull().sum()[listings_df[location_cols].isnull().sum()>0]
host_location               8
review_scores_location    655
dtype: int64

review_scores_location contains the keyword 'location', but it has already been considered under the review-based columns, so we drop it here. Also, on rechecking the listings data, we see there are other columns that deal with location but do not have 'location' in their names. We include those here along with the location columns.

location_cols=['city','state','zipcode','market','smart_location','country_code','country','host_location','is_location_exact']
listings_df[location_cols][:5]
city state zipcode market smart_location country_code country host_location is_location_exact
0 Seattle WA 98119 Seattle Seattle, WA US United States Seattle, Washington, United States 1
1 Seattle WA 98119 Seattle Seattle, WA US United States Seattle, Washington, United States 1
2 Seattle WA 98119 Seattle Seattle, WA US United States Seattle, Washington, United States 1
3 Seattle WA 98119 Seattle Seattle, WA US United States Seattle, Washington, United States 1
4 Seattle WA 98119 Seattle Seattle, WA US United States Seattle, Washington, United States 1
#exclude the is_location exact and find some summary stats
listings_df[['city','state','zipcode','market','smart_location','country_code','country']].describe()
city state zipcode market smart_location country_code country
count 3818 3818 3811 3818 3818 3818 3818
unique 7 2 28 1 7 1 1
top Seattle WA 98122 Seattle Seattle, WA US United States
freq 3810 3817 420 3818 3810 3818 3818

Zip code has the most unique entries with 28 distinct values, and city has 7; the rest are essentially constant, with a single market, country code and country. So we exclude country, country_code and state and look at the remaining columns. Almost all listings (about 99.8%) are in Seattle, WA; the remaining handful are variant spellings or neighbouring areas.

listings_df[['city','state','zipcode','smart_location']].groupby(['city']).count()
state zipcode smart_location
city
Ballard, Seattle 1 1 1
Phinney Ridge Seattle 1 1 1
Seattle 3810 3803 3810
Seattle 2 2 2
West Seattle 2 2 2
seattle 1 1 1
西雅图 1 1 1
(listings_df['zipcode'].value_counts()).plot(kind="bar")
<AxesSubplot:>

png

Some zip codes have noticeably more listings than others, for example 98122, 98103 and 98102. The counts then fall away fairly smoothly down to 98104, after which zip codes such as 98199, 98126, 98106, 98108, 98133 and 98136 each have roughly 50-60 listings.

Host attributes columns

host_cols = [col for col in listings_df.columns if 'host' in col]
host_cols
['host_id',
 'host_url',
 'host_name',
 'host_since',
 'host_location',
 'host_about',
 'host_response_time',
 'host_response_rate',
 'host_acceptance_rate',
 'host_is_superhost',
 'host_thumbnail_url',
 'host_picture_url',
 'host_neighbourhood',
 'host_listings_count',
 'host_total_listings_count',
 'host_verifications',
 'host_has_profile_pic',
 'host_identity_verified',
 'calculated_host_listings_count']
listings_df[host_cols][:5]
host_id host_url host_name host_since host_location host_about host_response_time host_response_rate host_acceptance_rate host_is_superhost host_thumbnail_url host_picture_url host_neighbourhood host_listings_count host_total_listings_count host_verifications host_has_profile_pic host_identity_verified calculated_host_listings_count
0 956883 https://www.airbnb.com/users/show/956883 Maija 2011-08-11 Seattle, Washington, United States I am an artist, interior designer, and run a s... within a few hours 96% 100% f https://a0.muscache.com/ac/users/956883/profil... https://a0.muscache.com/ac/users/956883/profil... Queen Anne 3.0 3.0 ['email', 'phone', 'reviews', 'kba'] t t 2
1 5177328 https://www.airbnb.com/users/show/5177328 Andrea 2013-02-21 Seattle, Washington, United States Living east coast/left coast/overseas. Time i... within an hour 98% 100% t https://a0.muscache.com/ac/users/5177328/profi... https://a0.muscache.com/ac/users/5177328/profi... Queen Anne 6.0 6.0 ['email', 'phone', 'facebook', 'linkedin', 're... t t 6
2 16708587 https://www.airbnb.com/users/show/16708587 Jill 2014-06-12 Seattle, Washington, United States i love living in Seattle. i grew up in the mi... within a few hours 67% 100% f https://a1.muscache.com/ac/users/16708587/prof... https://a1.muscache.com/ac/users/16708587/prof... Queen Anne 2.0 2.0 ['email', 'phone', 'google', 'reviews', 'jumio'] t t 2
3 9851441 https://www.airbnb.com/users/show/9851441 Emily 2013-11-06 Seattle, Washington, United States NaN NaN NaN NaN f https://a2.muscache.com/ac/users/9851441/profi... https://a2.muscache.com/ac/users/9851441/profi... Queen Anne 1.0 1.0 ['email', 'phone', 'facebook', 'reviews', 'jum... t t 1
4 1452570 https://www.airbnb.com/users/show/1452570 Emily 2011-11-29 Seattle, Washington, United States Hi, I live in Seattle, Washington but I'm orig... within an hour 100% NaN f https://a0.muscache.com/ac/users/1452570/profi... https://a0.muscache.com/ac/users/1452570/profi... Queen Anne 2.0 2.0 ['email', 'phone', 'facebook', 'reviews', 'kba'] t t 1
#create a host listings df from the listings,a df that contains only listing data
host_listings_df=listings_df[host_cols].copy()
# find some correlations on the host listing columns
host_listings_df.corr()
host_id host_listings_count host_total_listings_count calculated_host_listings_count
host_id 1.000000 -0.069613 -0.069613 -0.176040
host_listings_count -0.069613 1.000000 1.000000 0.224222
host_total_listings_count -0.069613 1.000000 1.000000 0.224222
calculated_host_listings_count -0.176040 0.224222 0.224222 1.000000

The corr() output covers the four numeric features: host_id, host_listings_count, host_total_listings_count and calculated_host_listings_count. There is no description from Airbnb of what these columns mean. host_listings_count and host_total_listings_count are perfectly correlated (1.0) and show no other strong correlations, so the two may well be the same thing.
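The suspicion that the two columns are identical is easy to check directly:

# do host_listings_count and host_total_listings_count ever differ?
both = host_listings_df[['host_listings_count', 'host_total_listings_count']].dropna()
(both['host_listings_count'] == both['host_total_listings_count']).all()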

#host_listings_df
pct_to_float = lambda x: x.str.replace(r'%', r'.0').astype('float') / 100.0

# note that the percentages are now in float e.g 96% is 0.96
# Apply the function to the rate cols

host_listings_df[['host_response_rate','host_acceptance_rate']] = host_listings_df[['host_response_rate','host_acceptance_rate']].apply(pct_to_float, axis=1)
host_listings_df[:5]
host_id host_url host_name host_since host_location host_about host_response_time host_response_rate host_acceptance_rate host_is_superhost host_thumbnail_url host_picture_url host_neighbourhood host_listings_count host_total_listings_count host_verifications host_has_profile_pic host_identity_verified calculated_host_listings_count
0 956883 https://www.airbnb.com/users/show/956883 Maija 2011-08-11 Seattle, Washington, United States I am an artist, interior designer, and run a s... within a few hours 0.96 1.0 f https://a0.muscache.com/ac/users/956883/profil... https://a0.muscache.com/ac/users/956883/profil... Queen Anne 3.0 3.0 ['email', 'phone', 'reviews', 'kba'] t t 2
1 5177328 https://www.airbnb.com/users/show/5177328 Andrea 2013-02-21 Seattle, Washington, United States Living east coast/left coast/overseas. Time i... within an hour 0.98 1.0 t https://a0.muscache.com/ac/users/5177328/profi... https://a0.muscache.com/ac/users/5177328/profi... Queen Anne 6.0 6.0 ['email', 'phone', 'facebook', 'linkedin', 're... t t 6
2 16708587 https://www.airbnb.com/users/show/16708587 Jill 2014-06-12 Seattle, Washington, United States i love living in Seattle. i grew up in the mi... within a few hours 0.67 1.0 f https://a1.muscache.com/ac/users/16708587/prof... https://a1.muscache.com/ac/users/16708587/prof... Queen Anne 2.0 2.0 ['email', 'phone', 'google', 'reviews', 'jumio'] t t 2
3 9851441 https://www.airbnb.com/users/show/9851441 Emily 2013-11-06 Seattle, Washington, United States NaN NaN NaN NaN f https://a2.muscache.com/ac/users/9851441/profi... https://a2.muscache.com/ac/users/9851441/profi... Queen Anne 1.0 1.0 ['email', 'phone', 'facebook', 'reviews', 'jum... t t 1
4 1452570 https://www.airbnb.com/users/show/1452570 Emily 2011-11-29 Seattle, Washington, United States Hi, I live in Seattle, Washington but I'm orig... within an hour 1.00 NaN f https://a0.muscache.com/ac/users/1452570/profi... https://a0.muscache.com/ac/users/1452570/profi... Queen Anne 2.0 2.0 ['email', 'phone', 'facebook', 'reviews', 'kba'] t t 1

We begin to investigate the host columns. There are a number of features, so we look at them in smaller batches, starting with the rates.

host_listings_df[['host_response_rate','host_acceptance_rate']].hist()
array([[<AxesSubplot:title={'center':'host_response_rate'}>,
        <AxesSubplot:title={'center':'host_acceptance_rate'}>]],
      dtype=object)

png

Where recorded, the host acceptance rate is essentially always 100%, so there are practically no rejections from hosts. The host response rate is also almost always close to 100%; very few hosts respond less than 80% of the time.

One of the data mining questions is about the growth in Airbnb listings, so we look at the host_since feature.

# convert it to date time
host_listings_df['host_since']=pd.to_datetime(host_listings_df['host_since'])

# split the date time into week,month and year
host_listings_df['dayofweek'] = host_listings_df['host_since'].dt.dayofweek
host_listings_df['weekofyear'] = host_listings_df['host_since'].dt.isocalendar().week
host_listings_df['month'] = host_listings_df['host_since'].dt.month
host_listings_df['year'] = host_listings_df['host_since'].dt.year
host_listings_df[:5]
host_id host_url host_name host_since host_location host_about host_response_time host_response_rate host_acceptance_rate host_is_superhost ... host_listings_count host_total_listings_count host_verifications host_has_profile_pic host_identity_verified calculated_host_listings_count dayofweek weekofyear month year
0 956883 https://www.airbnb.com/users/show/956883 Maija 2011-08-11 Seattle, Washington, United States I am an artist, interior designer, and run a s... within a few hours 0.96 1.0 f ... 3.0 3.0 ['email', 'phone', 'reviews', 'kba'] t t 2 3.0 32 8.0 2011.0
1 5177328 https://www.airbnb.com/users/show/5177328 Andrea 2013-02-21 Seattle, Washington, United States Living east coast/left coast/overseas. Time i... within an hour 0.98 1.0 t ... 6.0 6.0 ['email', 'phone', 'facebook', 'linkedin', 're... t t 6 3.0 8 2.0 2013.0
2 16708587 https://www.airbnb.com/users/show/16708587 Jill 2014-06-12 Seattle, Washington, United States i love living in Seattle. i grew up in the mi... within a few hours 0.67 1.0 f ... 2.0 2.0 ['email', 'phone', 'google', 'reviews', 'jumio'] t t 2 3.0 24 6.0 2014.0
3 9851441 https://www.airbnb.com/users/show/9851441 Emily 2013-11-06 Seattle, Washington, United States NaN NaN NaN NaN f ... 1.0 1.0 ['email', 'phone', 'facebook', 'reviews', 'jum... t t 1 2.0 45 11.0 2013.0
4 1452570 https://www.airbnb.com/users/show/1452570 Emily 2011-11-29 Seattle, Washington, United States Hi, I live in Seattle, Washington but I'm orig... within an hour 1.00 NaN f ... 2.0 2.0 ['email', 'phone', 'facebook', 'reviews', 'kba'] t t 1 1.0 48 11.0 2011.0

5 rows × 23 columns

We want to see which days of the week and weeks of the year occur most often, and what those trends look like.

host_listings_df[['host_since','dayofweek','weekofyear','month','year']].describe()
dayofweek weekofyear month year
count 3816.000000 3816.000000 3816.000000 3816.000000
mean 2.634696 26.816038 6.551887 2013.214623
std 2.013244 13.972345 3.208349 1.560423
min 0.000000 1.000000 1.000000 2008.000000
25% 1.000000 15.000000 4.000000 2012.000000
50% 2.000000 28.000000 7.000000 2013.000000
75% 4.000000 38.000000 9.000000 2015.000000
max 6.000000 53.000000 12.000000 2016.000000

The median join year is 2013, with the median join date falling around the middle of the year, and the median day of week is 2, i.e. Wednesday (pandas counts Monday as 0). The maximum year is 2016 because this dataset was compiled in early 2016.

host_listings_df[['host_since','dayofweek','weekofyear','month']].hist()
array([[<AxesSubplot:title={'center':'dayofweek'}>,
        <AxesSubplot:title={'center':'weekofyear'}>],
       [<AxesSubplot:title={'center':'month'}>, <AxesSubplot:>]],
      dtype=object)

png

The highest numbers of hosts joined on days 0 and 1 of the week, which in pandas’ encoding are Monday and Tuesday. Joins are spread across the whole year, with the largest numbers between weeks 26 and 38. The month distribution gives an even clearer picture, with the most joins at the start, middle and end of the year.
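Since pandas encodes Monday as 0 in dt.dayofweek, the day names can also be checked directly rather than inferred, for example:

# label join days explicitly to avoid ambiguity about which day is 0
host_listings_df['host_since'].dt.day_name().value_counts()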

host_listings_df['host_since'].hist()
<AxesSubplot:>

png

This answers one of our data mining questions on whether there has been an increase in hosts joining Airbnb. Since Airbnb started in 2008 there has been huge growth, with the number of new hosts increasing substantially every year. Airbnb in Seattle is clearly experiencing significant growth.

host_listings_df[['host_since','year','host_neighbourhood']].groupby(['year']).count()
host_since host_neighbourhood
year
2008.0 4 4
2009.0 64 63
2010.0 149 146
2011.0 398 389
2012.0 539 514
2013.0 776 723
2014.0 900 809
2015.0 981 866
2016.0 5 4

This further supports our finding that there is an upward trend in Airbnb hosts from 2008 to 2016: the yearly count of new hosts grew from 4 in 2008 to nearly 1,000 in 2015. Growth was fastest between 2009 and 2011, when the count roughly doubled each year, before settling to a slower but still substantial 10-45% per year. 2016 has a surprisingly low number, probably because of when the data was collected; we will investigate this below. Overall, between 2008 and 2016 Airbnb hosting in Seattle grew significantly.
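The year-on-year growth rates quoted above can be computed directly from the yearly counts, for example:

# new hosts per year and year-on-year growth rate (%)
hosts_per_year = host_listings_df.groupby('year')['host_since'].count()
(hosts_per_year.pct_change() * 100).round(1)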

host_listings_df['host_since'].describe(datetime_is_numeric=True)
count                             3816
mean     2013-09-18 17:26:02.264153856
min                2008-11-10 00:00:00
25%                2012-08-17 00:00:00
50%                2013-12-12 12:00:00
75%                2015-01-14 00:00:00
max                2016-01-03 00:00:00
Name: host_since, dtype: object

An investigation into the host_since column reveals why 2016 has such a low count. The dataset was collected on 2016-01-03, so only the first three days of January 2016 are covered, and those few days already account for a handful of new hosts.

listings_df[['host_since','neighbourhood_cleansed']].describe()
host_since neighbourhood_cleansed
count 3816 3818
unique 1380 87
top 2013-08-30 Broadway
freq 51 397

There are 87 distinct neighbourhoods (neighbourhood_cleansed). Broadway is the most frequently listed neighbourhood, occurring 397 times.

host_listings_df['host_is_superhost'].value_counts()/host_listings_df.shape[0]
f    0.795705
t    0.203772
Name: host_is_superhost, dtype: float64

For about 80% of the listings the host is not a superhost.

host_listings_df['host_has_profile_pic'].value_counts()/host_listings_df.shape[0]
t    0.997643
f    0.001833
Name: host_has_profile_pic, dtype: float64

Almost every host has a profile picture; only a handful of listings (about 0.2%) have a host without one.

host_listings_df['host_identity_verified'].value_counts()/host_listings_df.shape[0]
t    0.784966
f    0.214510
Name: host_identity_verified, dtype: float64

Most hosts’ identities are verified; only about 21% are not.

host_listings_df['host_verifications'].describe()
count                                                 3818
unique                                                 116
top       ['email', 'phone', 'facebook', 'reviews', 'kba']
freq                                                   595
Name: host_verifications, dtype: object

host_verifications is populated for every listing. Since the column is not documented, I assume it lists the channels a host has provided for identity verification. The most common combination is email, phone, facebook, reviews and kba.
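Since host_verifications is stored as a stringified Python list, it can be parsed and the individual methods counted. A small sketch (assuming every entry is well formed enough for ast.literal_eval):

import ast
from collections import Counter

# count how often each verification method appears across hosts
methods = Counter()
for entry in host_listings_df['host_verifications'].dropna():
    try:
        methods.update(ast.literal_eval(entry))
    except (ValueError, SyntaxError):
        pass  # skip malformed entries

methods.most_common(10)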

host_listings_df['host_response_time'].value_counts()/host_listings_df.shape[0]
within an hour        0.443164
within a few hours    0.253536
within a day          0.156365
a few days or more    0.009953
Name: host_response_time, dtype: float64

44% of hosts respond within an hour and 25% within a few hours, so about 69% of hosts respond quickly to guests. A further 16% respond within a day; very few take longer than that.

host_listings_df['host_thumbnail_url'].describe()
count                                                  3816
unique                                                 2743
top       https://a2.muscache.com/ac/pictures/a4d7d053-c...
freq                                                     46
Name: host_thumbnail_url, dtype: object
host_listings_df['host_url'].describe()
count                                          3818
unique                                         2751
top       https://www.airbnb.com/users/show/8534462
freq                                             46
Name: host_url, dtype: object

Almost every host has a thumbnail picture and a profile URL; the roughly 2,750 unique values in each case correspond to distinct hosts, some of whom have several listings.

host_listings_df['host_name'].describe()
count       3816
unique      1466
top       Andrew
freq          56
Name: host_name, dtype: object
host_listings_df['host_about'].describe()

count                                                  2959
unique                                                 2011
top       It would be my pleasure to share and explore t...
freq                                                     46
Name: host_about, dtype: object

Most hosts also provide an ‘about’ description; these are not all unique, since hosts with several listings repeat the same text (the most frequent value appears 46 times).

host_listings_df['host_neighbourhood'].describe()
count             3518
unique             102
top       Capitol Hill
freq               405
Name: host_neighbourhood, dtype: object

Capitol Hill is the most common host neighbourhood, with 405 listings.

count_vals = host_listings_df.host_neighbourhood.value_counts()
(count_vals[:20]/host_listings_df.shape[0]).plot(kind="bar");
plt.title("Neighbourhood ");

png

Policies columns

Here we look at columns related to host policy: ‘requires_license’, ‘license’, ‘jurisdiction_names’, ‘instant_bookable’, ‘cancellation_policy’, ‘experiences_offered’, ‘require_guest_profile_picture’ and ‘require_guest_phone_verification’.

policies=['requires_license', 'license', 'jurisdiction_names', 'instant_bookable', 'cancellation_policy','experiences_offered', 'require_guest_profile_picture',  'require_guest_phone_verification']
policies_df=listings_df[policies]
policies_df[:5]
requires_license license jurisdiction_names instant_bookable cancellation_policy experiences_offered require_guest_profile_picture require_guest_phone_verification
0 f NaN WASHINGTON f moderate none f f
1 f NaN WASHINGTON f strict none t t
2 f NaN WASHINGTON f strict none f f
3 f NaN WASHINGTON f flexible none f f
4 f NaN WASHINGTON f strict none f f
policies_df['requires_license'].value_counts()/policies_df.shape[0]
f    1.0
Name: requires_license, dtype: float64

None of the listings require a license.

policies_df['require_guest_phone_verification'].value_counts()/policies_df.shape[0]
f    0.901781
t    0.098219
Name: require_guest_phone_verification, dtype: float64

90% of hosts do not require guest phone verification

policies_df['require_guest_profile_picture'].value_counts()/policies_df.shape[0]
f    0.915925
t    0.084075
Name: require_guest_profile_picture, dtype: float64

91% of hosts do not require guest profile picture

policies_df['cancellation_policy'].value_counts()/policies_df.shape[0]
strict      0.371137
moderate    0.327658
flexible    0.301205
Name: cancellation_policy, dtype: float64

On cancellation policy, hosts are split fairly evenly between strict, moderate and flexible, with about 37% having a strict policy.

policies_df['instant_bookable'].value_counts()/policies_df.shape[0]
f    0.845207
t    0.154793
Name: instant_bookable, dtype: float64

About 84.5% of listings are not instant bookable

policies_df['experiences_offered'].value_counts()
none    3818
Name: experiences_offered, dtype: int64

Every listing has ‘none’ for experiences_offered, so this column carries no useful information.

Availability columns

available=['has_availability',
       'availability_30', 'availability_60', 'availability_90',
       'availability_365']
available_df=listings_df[available]
available_df[:5]
has_availability availability_30 availability_60 availability_90 availability_365
0 t 14 41 71 346
1 t 13 13 16 291
2 t 1 6 17 220
3 t 0 0 0 143
4 t 30 60 90 365
available_df['has_availability'].value_counts()
t    3818
Name: has_availability, dtype: int64

All listings have some availability in general; we just have to inspect how much availability they have.

available_df[['availability_30', 'availability_60', 'availability_90', 'availability_365']].describe()
availability_30 availability_60 availability_90 availability_365
count 3818.000000 3818.000000 3818.000000 3818.000000
mean 16.786276 36.814825 58.082504 244.772656
std 12.173637 23.337541 34.063845 126.772526
min 0.000000 0.000000 0.000000 0.000000
25% 2.000000 13.000000 28.000000 124.000000
50% 20.000000 46.000000 73.000000 308.000000
75% 30.000000 59.000000 89.000000 360.000000
max 30.000000 60.000000 90.000000 365.000000

On average, a listing is available for about 17 of the next 30 days, 37 of the next 60 days, 58 of the next 90 days and about 245 of the next 365 days.
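A related figure worth computing is the share of listings that are available every single day of the coming year, for example:

# share of listings available all 365 days
(available_df['availability_365'] == 365).mean()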

Property columns

property_=['id','property_type', 'room_type', 'accommodates',
       'bathrooms', 'bedrooms', 'beds', 'bed_type','amenities']
property_df=listings_df[property_]
property_df[:5]
id property_type room_type accommodates bathrooms bedrooms beds bed_type amenities
0 241032 Apartment Entire home/apt 4 1.0 1.0 1.0 Real Bed {TV,"Cable TV",Internet,"Wireless Internet","A...
1 953595 Apartment Entire home/apt 4 1.0 1.0 1.0 Real Bed {TV,Internet,"Wireless Internet",Kitchen,"Free...
2 3308979 House Entire home/apt 11 4.5 5.0 7.0 Real Bed {TV,"Cable TV",Internet,"Wireless Internet","A...
3 7421966 Apartment Entire home/apt 3 1.0 0.0 2.0 Real Bed {Internet,"Wireless Internet",Kitchen,"Indoor ...
4 278830 House Entire home/apt 6 2.0 3.0 3.0 Real Bed {TV,"Cable TV",Internet,"Wireless Internet",Ki...
property_df.groupby(['room_type']).id.count()/property_df.shape[0]
room_type
Entire home/apt    0.665532
Private room       0.303824
Shared room        0.030644
Name: id, dtype: float64
(property_df.groupby(['room_type']).id.count()/property_df.shape[0]).plot(kind="pie")
<AxesSubplot:ylabel='id'>

png

66.5% of properties are entire homes/apartments, 30% are private rooms, and only 3% of listings are shared rooms.

property_df.groupby(['property_type']).id.count()/property_df.shape[0]
property_type
Apartment          0.447355
Bed & Breakfast    0.009691
Boat               0.002095
Bungalow           0.003405
Cabin              0.005500
Camper/RV          0.003405
Chalet             0.000524
Condominium        0.023834
Dorm               0.000524
House              0.453903
Loft               0.010477
Other              0.005762
Tent               0.001310
Townhouse          0.030906
Treehouse          0.000786
Yurt               0.000262
Name: id, dtype: float64
(property_df.groupby(['property_type']).id.count()/property_df.shape[0]).plot(kind="bar")
<AxesSubplot:xlabel='property_type'>

png

Houses and apartments are the most common property types, at 45.4% and 44.7% respectively. Bed & breakfasts make up about 1%, and the rest is spread across the other property types

property_df.groupby(['bed_type']).id.count()/property_df.shape[0]
bed_type
Airbed           0.007072
Couch            0.003405
Futon            0.019382
Pull-out Sofa    0.012310
Real Bed         0.957831
Name: id, dtype: float64
(property_df.groupby(['bed_type']).id.count()/property_df.shape[0]).plot(kind="bar")
<AxesSubplot:xlabel='bed_type'>

png

About 96% of beds listed are real beds, roughly 2% are futons, and the rest are airbeds, couches and pull-out sofas

property_df.groupby(['accommodates']).id.count()/property_df.shape[0]
accommodates
1     0.067051
2     0.426139
3     0.104243
4     0.205605
5     0.048193
6     0.086957
7     0.013620
8     0.031168
9     0.003405
10    0.006548
11    0.000786
12    0.003929
14    0.000786
15    0.000524
16    0.001048
Name: id, dtype: float64
(property_df.groupby(['accommodates']).id.count()/property_df.shape[0]).plot(kind="bar")
<AxesSubplot:xlabel='accommodates'>

png

42% of listings accommodate a maximum of 2 guests, 20% accommodate 4, 10% accommodate 3, and smaller percentages accommodate more than 4 people. This makes sense since most listings are houses and apartments

property_df.groupby(['bedrooms']).id.count()/property_df.shape[0]
bedrooms
0.0    0.097433
1.0    0.633054
2.0    0.167627
3.0    0.074123
4.0    0.018072
5.0    0.006286
6.0    0.001572
7.0    0.000262
Name: id, dtype: float64
(property_df.groupby(['bedrooms']).id.count()/property_df.shape[0]).plot(kind="bar")
<AxesSubplot:xlabel='bedrooms'>

png

63% of listings offer 1 bedroom while 16.7% have 2 bedrooms. About 10% have no bedroom at all

amenities_df=property_df['amenities']

Amenities are in textual form but can be made into categories since there are frequently recurring features. We can also use a word cloud to map some of the most frequently occurring words

# drop empty rows in amenities
amenities_df = amenities_df[amenities_df != '{}']
# create a list of amenities, removing the {} and "" and split each word with a comma
amenities_list = []

for index, row in amenities_df.items():
    amenities_list.append(row.replace('{','').replace('}','').replace('"','').split(','))

amenities_list[:3]
[['TV',
  'Cable TV',
  'Internet',
  'Wireless Internet',
  'Air Conditioning',
  'Kitchen',
  'Heating',
  'Family/Kid Friendly',
  'Washer',
  'Dryer'],
 ['TV',
  'Internet',
  'Wireless Internet',
  'Kitchen',
  'Free Parking on Premises',
  'Buzzer/Wireless Intercom',
  'Heating',
  'Family/Kid Friendly',
  'Washer',
  'Dryer',
  'Smoke Detector',
  'Carbon Monoxide Detector',
  'First Aid Kit',
  'Safety Card',
  'Fire Extinguisher',
  'Essentials'],
 ['TV',
  'Cable TV',
  'Internet',
  'Wireless Internet',
  'Air Conditioning',
  'Kitchen',
  'Free Parking on Premises',
  'Pets Allowed',
  'Pets live on this property',
  'Dog(s)',
  'Cat(s)',
  'Hot Tub',
  'Indoor Fireplace',
  'Heating',
  'Family/Kid Friendly',
  'Washer',
  'Dryer',
  'Smoke Detector',
  'Carbon Monoxide Detector',
  'Essentials',
  'Shampoo']]

Now that we have amenities as lists, we can add them as a column and one-hot encode them to represent the categorical variables. Stack Overflow and Kaggle have explanations of how this is done; we adapt a snippet from there

new_amenities_df = pd.Series(amenities_list, name = 'amenities').to_frame()
new_amenities_df[:5]
amenities
0 [TV, Cable TV, Internet, Wireless Internet, Ai...
1 [TV, Internet, Wireless Internet, Kitchen, Fre...
2 [TV, Cable TV, Internet, Wireless Internet, Ai...
3 [Internet, Wireless Internet, Kitchen, Indoor ...
4 [TV, Cable TV, Internet, Wireless Internet, Ki...
dummies_amenities_df = new_amenities_df.drop(columns='amenities').join(
    pd.get_dummies(
        pd.DataFrame(new_amenities_df.amenities.tolist()).stack()
    ).astype(int).groupby(level=0).sum()
)

dummies_amenities_df
24-Hour Check-in Air Conditioning Breakfast Buzzer/Wireless Intercom Cable TV Carbon Monoxide Detector Cat(s) Dog(s) Doorman Dryer ... Safety Card Shampoo Smoke Detector Smoking Allowed Suitable for Events TV Washer Washer / Dryer Wheelchair Accessible Wireless Internet
0 0 1 0 0 1 0 0 0 0 1 ... 0 0 0 0 0 1 1 0 0 1
1 0 0 0 1 0 1 0 0 0 1 ... 1 0 1 0 0 1 1 0 0 1
2 0 1 0 0 1 1 1 1 0 1 ... 0 1 1 0 0 1 1 0 0 1
3 0 0 0 0 0 1 0 0 0 1 ... 1 1 1 0 0 0 1 0 0 1
4 0 0 0 0 1 1 0 0 0 0 ... 0 1 1 0 0 1 0 0 0 1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
3768 0 1 0 0 1 0 0 0 0 1 ... 0 1 1 0 0 1 1 0 1 1
3769 1 0 0 0 1 0 0 0 0 0 ... 0 1 1 0 0 1 0 0 0 1
3770 0 0 1 0 1 0 1 1 0 1 ... 1 0 1 0 0 0 1 0 0 1
3771 0 0 0 0 0 0 0 0 0 0 ... 0 0 1 0 0 1 0 0 0 1
3772 0 0 0 1 1 0 0 1 0 1 ... 0 1 1 0 0 1 1 0 0 1

3773 rows × 41 columns

Now we have all the amenity keywords as columns

dummies_amenities_df.columns
Index(['24-Hour Check-in', 'Air Conditioning', 'Breakfast',
       'Buzzer/Wireless Intercom', 'Cable TV', 'Carbon Monoxide Detector',
       'Cat(s)', 'Dog(s)', 'Doorman', 'Dryer', 'Elevator in Building',
       'Essentials', 'Family/Kid Friendly', 'Fire Extinguisher',
       'First Aid Kit', 'Free Parking on Premises', 'Gym', 'Hair Dryer',
       'Hangers', 'Heating', 'Hot Tub', 'Indoor Fireplace', 'Internet', 'Iron',
       'Kitchen', 'Laptop Friendly Workspace', 'Lock on Bedroom Door',
       'Other pet(s)', 'Pets Allowed', 'Pets live on this property', 'Pool',
       'Safety Card', 'Shampoo', 'Smoke Detector', 'Smoking Allowed',
       'Suitable for Events', 'TV', 'Washer', 'Washer / Dryer',
       'Wheelchair Accessible', 'Wireless Internet'],
      dtype='object')
dummies_amenities_df.sum()
24-Hour Check-in               616
Air Conditioning               677
Breakfast                      291
Buzzer/Wireless Intercom       538
Cable TV                      1446
Carbon Monoxide Detector      2485
Cat(s)                         382
Dog(s)                         509
Doorman                         85
Dryer                         2997
Elevator in Building           785
Essentials                    3237
Family/Kid Friendly           1963
Fire Extinguisher             2196
First Aid Kit                 1680
Free Parking on Premises      2167
Gym                            442
Hair Dryer                     774
Hangers                        846
Heating                       3627
Hot Tub                        303
Indoor Fireplace               886
Internet                      2811
Iron                           742
Kitchen                       3423
Laptop Friendly Workspace      745
Lock on Bedroom Door           100
Other pet(s)                    51
Pets Allowed                   472
Pets live on this property     883
Pool                           159
Safety Card                    727
Shampoo                       2670
Smoke Detector                3281
Smoking Allowed                 82
Suitable for Events            209
TV                            2574
Washer                        2992
Washer / Dryer                   2
Wheelchair Accessible          300
Wireless Internet             3667
dtype: int64
dummies_amenities_df.sum().sort_values(ascending = False).plot(kind='bar', figsize = (15,5));

png

There is a long list of amenities on offer. Washer / Dryer has only 2 occurrences; it might be the only amenity that is insignificant

amenitiesDF = listings_df[['amenities','price','id',]]
amenitiesDFTopper = amenitiesDF.sort_values('price',ascending=[0])
amenitiesDFtop=amenitiesDFTopper.head(30)
allemenities = ''
for index,row in amenitiesDFtop.iterrows():
    p = re.sub('[^a-zA-Z]+',' ', row['amenities'])
    allemenities+=p

allemenities_data=nltk.word_tokenize(allemenities)
filtered_data=[word for word in allemenities_data if word not in stopwords.words('english')] 
wnl = nltk.WordNetLemmatizer() 
allemenities_data=[wnl.lemmatize(data) for data in filtered_data]
allemenities_words=' '.join(allemenities_data)
plot_display(allemenities_words)

png

Of all the amenities that we found and listed, the ones shown on the word cloud are the ones most strongly related to price. Some of the main ones include Wireless Internet, smoke detector, carbon monoxide detector, essentials like shampoo, free parking, and family/kid friendly. These improve the customer experience and also contribute to pricing: the more amenities the host adds, the pricier the listing

EDA

While we have already done most of the EDA during data understanding, there are a few aspects that remain. For instance, on the calendar data set we have not investigated availability and price in relation to day of the week, week of the year, month and year. We also have yet to determine which factors/features relate to or affect price, uncover neighbourhood trends, and find out which times are busiest in Seattle and how prices change at those times, in answer to some of our data mining objectives

fig = plt.figure(figsize = (15,10))
ax = fig.gca()
listings_df.plot.scatter('longitude', 'latitude', ax=ax)
<AxesSubplot:xlabel='longitude', ylabel='latitude'>

png

Revisiting the review Data

nltk.download('vader_lexicon')

!pip install langdetect
[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /home/fucking/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


Requirement already satisfied: langdetect in /home/fucking/anaconda/lib/python3.8/site-packages (1.0.9)
Requirement already satisfied: six in /home/fucking/anaconda/lib/python3.8/site-packages (from langdetect) (1.15.0)
reviewsDF=reviews_df

We want to understand the general sentiment of the comments: whether they are positive, negative or neutral. To do that, we use NLTK's built-in VADER sentiment analyser to read and score each comment on a polarity scale

# Our dataframe contains reviews in other languages as well, hence we remove the comments that are not in English
from langdetect import detect

def detect_lang(sente):
    sente=str(sente)
    try:
        return detect(sente)
    except:
        return "None"

for index,row in reviewsDF.iterrows():
    lang=detect_lang(row['comments'])
    reviewsDF.at[index,'language'] = lang
#     print(lang)
    
#taking rows whose language is English
EngReviewsDF=reviewsDF[reviewsDF.language=='en']

EngReviewsDF.head()
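
The polarity columns ('pos', 'neg' and 'neu') used below come from scoring each comment with NLTK's VADER SentimentIntensityAnalyzer. The scoring step is not shown above, so here is a minimal sketch of how it can be done (assuming the review text is in the 'comments' column):

from nltk.sentiment.vader import SentimentIntensityAnalyzer

sid = SentimentIntensityAnalyzer()
EngReviewsDF = EngReviewsDF.copy()

# score each English comment and keep the positive/negative/neutral components
scores = EngReviewsDF['comments'].astype(str).apply(sid.polarity_scores)
EngReviewsDF['pos'] = scores.apply(lambda s: s['pos'])
EngReviewsDF['neg'] = scores.apply(lambda s: s['neg'])
EngReviewsDF['neu'] = scores.apply(lambda s: s['neu'])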

We check the distribution of positive polarity scores across the reviews

polarDF=EngReviewsDF[['pos']]
polarDF=polarDF.groupby(pd.cut(polarDF["pos"], np.arange(0, 1.1, 0.1))).count()
polarDF=polarDF.rename(columns={'pos':'count_of_Comments'})
polarDF=polarDF.reset_index()
polarDF=polarDF.rename(columns={'pos':'range_i'})
for i,r in polarDF.iterrows():
    polarDF.at[i,'RANGE'] = float(str(r['range_i'])[1:4].replace(',',''))
    polarDF.at[i,'Sentiment'] = 'positive'
del polarDF['range_i']
polarDF.head()
count_of_Comments RANGE Sentiment
0 202 0.0 positive
1 1792 0.1 positive
2 3712 0.2 positive
3 2890 0.3 positive
4 1319 0.4 positive

We check for negative polarity as well

polarDFneg=EngReviewsDF[['neg']]
polarDFneg=polarDFneg.groupby(pd.cut(polarDFneg["neg"], np.arange(0, 1.1, 0.1))).count()
polarDFneg=polarDFneg.rename(columns={'neg':'count_of_Comments'})
polarDFneg=polarDFneg.reset_index()
polarDFneg=polarDFneg.rename(columns={'neg':'range_i'})
for i,r in polarDFneg.iterrows():
    polarDFneg.at[i,'RANGE'] = float(str(r['range_i'])[1:4].replace(',',''))
    polarDFneg.at[i,'Sentiment'] = 'negative'
del polarDFneg['range_i']
for i,r in polarDFneg.iterrows():
    polarDF=polarDF.append(pd.Series([r[0],r[1],r[2]],index=['count_of_Comments','RANGE','Sentiment']),ignore_index=True)
    
polarDFneg.head()
count_of_Comments RANGE Sentiment
0 3180 0.0 negative
1 130 0.1 negative
2 8 0.2 negative
3 1 0.3 negative
4 0 0.4 negative

Checking for neutral sentiments as well

polarDFneut=EngReviewsDF[['neu']]
polarDFneut=polarDFneut.groupby(pd.cut(polarDFneut["neu"], np.arange(0, 1.0, 0.1))).count()
polarDFneut=polarDFneut.rename(columns={'neu':'count_of_Comments'})
polarDFneut=polarDFneut.reset_index()
polarDFneut=polarDFneut.rename(columns={'neu':'range_i'})
for i,r in polarDFneut.iterrows():
    polarDFneut.at[i,'RANGE'] = float(str(r['range_i'])[1:4].replace(',',''))
    polarDFneut.at[i,'Sentiment'] = 'neutral' 
del polarDFneut['range_i']

for i,r in polarDFneut.iterrows():
    polarDF=polarDF.append(pd.Series([r[0],r[1],r[2]],index=['count_of_Comments','RANGE','Sentiment']),ignore_index=True)
    
polarDFneut.head()

count_of_Comments RANGE Sentiment
0 1 0.0 neutral
1 17 0.1 neutral
2 83 0.2 neutral
3 208 0.3 neutral
4 560 0.4 neutral

Then we plot the polarity of the sentiments against the number of comments to see the general sentiment

sns.catplot(data=polarDF, x="RANGE", y="count_of_Comments", col="Sentiment", kind="point")
<seaborn.axisgrid.FacetGrid at 0x7f5da14bc3d0>

png

There are very few negative comments; positive scores cluster around the 0.2-0.4 range, and neutral scores are skewed towards the upper end of the range.

inputDF = pd.read_csv("listings.csv", delimiter=',')

inputDF=inputDF[['number_of_reviews','price','review_scores_rating']]

# replacing NaN values with 0
inputDF.fillna(0, inplace=True)

#Extracting prices from the table
price = inputDF['price']
prices=[]

#clean the data to make it float
for p in price:
    p=float(p[1:].replace(',',''))
    prices.append(p)

#replace the price column with the new column
inputDF['price']=prices

price_review = inputDF[['number_of_reviews', 'price']].sort_values(by = 'price')

price_review.plot(x = 'price', 
                  y = 'number_of_reviews', 
                  style = 'o',
                  figsize =(12,8),
                  legend = False,
                  title = 'Reviews based on Price')

plt.xlabel("price")
plt.ylabel("Number of reviews")
Text(0, 0.5, 'Number of reviews')

png

Reviews are concentrated in the price range up to about $250. Customers prefer and book cheaper listings more often, hence the larger number of reviews at lower prices

Revisiting the calendar data

calender_df[:5]
listing_id date available price dayofweek weekofyear month year day_Name holiday us_holidays_name
0 241032 2016-01-04 1 85.000000 0 1 1 2016 default False working
1 241032 2016-01-05 1 85.000000 1 1 1 2016 default False working
2 241032 2016-01-06 0 137.090652 2 1 1 2016 default False working
3 241032 2016-01-07 0 137.090652 3 1 1 2016 default False working
4 241032 2016-01-08 0 137.090652 4 1 1 2016 default False working
fig, axs = plt.subplots(1, 3, figsize=(16, 5))
sns.boxplot(data=calender_df, x='month', y='price', ax=axs[0], fliersize=0)
sns.boxplot(data=calender_df, x='dayofweek', y='price', ax=axs[1], fliersize=0)
sns.boxplot(data=calender_df, x='year', y='price', ax=axs[2], fliersize=0)
for ax in axs:
    ax.set_ylim(0, 350)

png

The price of listings is highest towards the end of the week (Friday and Saturday) and in months 6 to 9 of the year.

# we modify the data to include average price and a year and month column
yearDF=calender_df.groupby(['year','month']).price.mean()
yearDF=yearDF.reset_index()
yearDF=yearDF.rename(columns={'price':'average_Price'})
yearDF['year-Month']=yearDF['year'].map(str) + "-" + yearDF['month'].map(str)
yearDF.to_csv('./year_month_data.csv')
yearDF[::5]
year month average_Price year-Month
0 2016 1 128.293029 2016-1
5 2016 6 143.543198 2016-6
10 2016 11 135.396148 2016-11
yearDF.plot(kind="bar",x='year-Month', y='average_Price',color = '#662dff', figsize =(15,8), 
           title = 'Seattle Airbnb prices trend over months')
plt.ylabel('Average Price')
Text(0, 0.5, 'Average Price')

png

The average price is more or less constant throughout 2016, with a hike between June and September, as seen earlier in the analysis

!pip install holidays
Requirement already satisfied: holidays in /home/fucking/anaconda/lib/python3.8/site-packages (0.11.1)
Requirement already satisfied: hijri-converter in /home/fucking/anaconda/lib/python3.8/site-packages (from holidays) (2.1.1)
Requirement already satisfied: korean-lunar-calendar in /home/fucking/anaconda/lib/python3.8/site-packages (from holidays) (0.2.1)
Requirement already satisfied: convertdate>=2.3.0 in /home/fucking/anaconda/lib/python3.8/site-packages (from holidays) (2.3.2)
Requirement already satisfied: six in /home/fucking/anaconda/lib/python3.8/site-packages (from holidays) (1.15.0)
Requirement already satisfied: python-dateutil in /home/fucking/anaconda/lib/python3.8/site-packages (from holidays) (2.8.1)
Requirement already satisfied: pymeeus<=1,>=0.3.13 in /home/fucking/anaconda/lib/python3.8/site-packages (from convertdate>=2.3.0->holidays) (0.5.11)
Requirement already satisfied: pytz>=2014.10 in /home/fucking/anaconda/lib/python3.8/site-packages (from convertdate>=2.3.0->holidays) (2020.1)

Another useful library for calendar analysis is the holidays package, which knows each holiday in the US, so we can check whether the days or dates with higher prices correspond to holidays

# We make a copy of the dataset so that we mess with the copy and preserve the original
calendarDF=calender_df.copy()
#replacing NaN values with 0
calendarDF.fillna(0, inplace=True)
calendarDF = calendarDF[calendarDF.price != 0]

#Extracting prices from the table
price = calendarDF['price']
prices=[]

for p in price:
    p = re.sub('[^0-9.]+','', p)
    prices.append(float(p))
    
#replace the price column with the new column
calendarDF['price']=prices

calendarDF = calendarDF[calendarDF.price >= 0]

#separating date column into day month and year
calendarDF[['Year','Month','Day']] = calendarDF['date'].str.split('-', expand=True)
calendarDF.head()
listing_id date available price Year Month Day
0 241032 2016-01-04 t 85.0 2016 01 04
1 241032 2016-01-05 t 85.0 2016 01 05
9 241032 2016-01-13 t 85.0 2016 01 13
10 241032 2016-01-14 t 85.0 2016 01 14
14 241032 2016-01-18 t 85.0 2016 01 18

Now that we have recreated the calendar data set as we did before and split the date into day, month and year, we can use these for our EDA. Having installed the holidays package, we add holiday information to the data set

from datetime import date
import datetime
import calendar
import holidays

calendarDF.fillna(0, inplace=True)
us_holidays = holidays.UnitedStates()

calendarDF['day_Name']='default'
calendarDF['holiday']='False'
calendarDF['us_holidays_name']='working'
for index,row in calendarDF.iterrows():
    sdate = datetime.date(int(row['Year']),int(row['Month']),int(row['Day']))
    vall=date(int(row['Year']),int(row['Month']),int(row['Day'])) in us_holidays
    calendarDF.at[index,'day_Name'] = calendar.day_name[sdate.weekday()]
    calendarDF.at[index,'holiday'] = vall
    calendarDF.at[index,'us_holidays_name'] = us_holidays.get(sdate)
calendarDF.to_csv('./holidays_data.csv')
calendarDF.head()
listing_id date available price Year Month Day day_Name holiday us_holidays_name
0 241032 2016-01-04 t 85.0 2016 01 04 Monday False None
1 241032 2016-01-05 t 85.0 2016 01 05 Tuesday False None
9 241032 2016-01-13 t 85.0 2016 01 13 Wednesday False None
10 241032 2016-01-14 t 85.0 2016 01 14 Thursday False None
14 241032 2016-01-18 t 85.0 2016 01 18 Monday True Martin Luther King Jr. Day

We calculate the average price of listings for each day of the week to see the weekly price trend

#calculating Average price for each day

dayDF=calendarDF.groupby('day_Name').price.mean()
dayDF=dayDF.reset_index()
day_order = {'Monday': 1, 'Tuesday': 2, 'Wednesday': 3, 'Thursday': 4,
             'Friday': 5, 'Saturday': 6, 'Sunday': 7}
dayDF['day_num'] = dayDF['day_Name'].map(day_order)
dayDF=dayDF.sort_values('day_num',ascending=[1])
dayDF=dayDF.rename(columns={'price':'Average_Price'})
dayDF
day_Name Average_Price day_num
1 Monday 135.676414 1
5 Tuesday 135.408764 2
6 Wednesday 135.447880 3
4 Thursday 136.476032 4
0 Friday 143.036294 5
2 Saturday 143.202136 6
3 Sunday 136.459941 7

What we see is that the average price is higher on weekends, peaking on Friday and Saturday

import matplotlib.pyplot as plt

plt.scatter(dayDF['day_Name'], dayDF['Average_Price'])
plt.show()

png

calendarDF.groupby('us_holidays_name').listing_id.count()
us_holidays_name
Christmas Day                 2829
Christmas Day (Observed)      2831
Columbus Day                  2650
Independence Day              2371
Labor Day                     2544
Martin Luther King Jr. Day    2231
Memorial Day                  2583
New Year's Day                2922
New Year's Day (Observed)     2776
Thanksgiving                  2746
Veterans Day                  2718
Washington's Birthday         2529
Name: listing_id, dtype: int64

In general, all holidays have listings available, with New Year's Day having the most.

calendarDF.groupby('us_holidays_name').price.mean()
us_holidays_name
Christmas Day                 137.974903
Christmas Day (Observed)      137.895797
Columbus Day                  134.985660
Independence Day              151.402362
Labor Day                     142.087264
Martin Luther King Jr. Day    121.740475
Memorial Day                  143.233837
New Year's Day                138.126968
New Year's Day (Observed)     135.572767
Thanksgiving                  136.054989
Veterans Day                  140.070272
Washington's Birthday         125.353895
Name: price, dtype: float64

Independence Day has the highest average price, while Martin Luther King Jr. Day and Washington's Birthday are the cheapest.

Independence Day, affectionately known as the 4th of July in the USA, is sometimes celebrated as a long weekend, so we check the average prices over the surrounding days, especially around July, as trends have shown a spike in prices between June and August.

# analyzing data from date 4th of July to date 13th of July which includes both long weekend 
#and normal workdays to compare prices 


july_days = ['04', '05', '06', '07', '08', '09', '10', '11', '12', '13']
marDF=calendarDF[(calendarDF['Year'] == '2016') & (calendarDF['Month'] == '07') &
                 (calendarDF['Day'].isin(july_days))]
marDF=marDF.groupby('Day').price.mean()
marDF=marDF.reset_index()
marDF=marDF.sort_values('Day',ascending=[1])
marDF=marDF.rename(columns={'price':'Average_Price'})
marDF.head(10)
Day Average_Price
0 04 151.402362
1 05 150.522832
2 06 150.200837
3 07 150.709570
4 08 156.260469
5 09 156.448161
6 10 150.302538
7 11 149.866250
8 12 150.058504
9 13 150.656785

Clearly, the weekend days (the 8th and 9th, a Friday and Saturday) cost more than the weekdays. A plot of the average price per day over this period shows this

x=marDF['Day'].tolist()
y=marDF['Average_Price'].tolist()
plt.plot(x,y, 'ro-')
plt.ylabel('Average Price')
plt.xlabel('Days')
plt.show()

png

This answers our data mining question of what the busiest times in Seattle are and how the prices change. From our analysis, summer is the busiest time in Seattle and prices are higher then. The Independence Day holiday is the most expensive time to visit, though holidays in general are pricier. Good holidays to visit Seattle on are Martin Luther King Jr. Day and Washington's Birthday. Weekends are also pricier than weekdays

Since we have NLTK, we can use it to process the text features

summary = listings_df[['summary','price']]
summary = summary[pd.notnull(summary['summary'])]
summary = summary[summary['summary']!=0]
summary = summary.sort_values('price',ascending=[0])
top100DF = summary.head(1000)
from nltk.corpus import stopwords
import string

words=''
for index,row in top100DF.iterrows():
    words += row['summary']
    
string_punctuation = string.punctuation
ignoreChar=['\r','\n','',' ',"'s"]
nums=['0','1','2','3','4','5','6','7','8','9']
summary_data=nltk.word_tokenize(words)
words_only = [l.lower() for l in summary_data if l not in string_punctuation if l not in ignoreChar if l not in nums]
filtered_data=[word for word in words_only if word not in stopwords.words('english')] 
wnl = nltk.WordNetLemmatizer() 
final_data=[wnl.lemmatize(data) for data in filtered_data]
final_words=' '.join(final_data)
final_words[:500]
'100 walk score close convention center westlake station pike place market amazon belltown capitol hill spotless reliable accommodating professional host apartment lower level classic 1915 craftsman bungalow heart wallingford walk score 93 ideally located 45th 50th block restaurant grocery store drug stores.walk downtown seattle attraction charming b b tree-lined side street capitol hill neighborhood.awesome two bedroom condo perfect convention traveler business traveler couple wanting see enjoy '
plot_display(final_words)

png

The most frequently used words in listing summaries are home, room, downtown, apartment, neighbourhood, garden, Ballard, bus line and bathroom. Hosts listing houses should use these keywords to get more customers

some_features=listings_df[['id','property_type','price','room_type','beds','bedrooms',
                          'neighbourhood_cleansed','bathrooms','accommodates','latitude','longitude']]
some_features.fillna(0,inplace=True)
some_features['price'] = some_features['price'].str.extract(r'(\d+)', expand=False)
# make price float
some_features['price'] = some_features['price'].astype(float, errors = 'raise')
some_features = some_features[some_features.bathrooms >0]
some_features = some_features[some_features.beds >0]
some_features = some_features[some_features.price >0]
some_features = some_features[some_features.accommodates >0]
some_features.head()
plt.figure(figsize=(18,22))
ax = sns.heatmap(some_features.groupby([
        'neighbourhood_cleansed', 'bedrooms']).price.mean().unstack(),annot=True, fmt=".0f")
bottom, top = ax.get_ylim()
ax.set_ylim(bottom + 0.5, top - 0.5)
(87.5, -0.5)

png

The heatmap shows that the number of bedrooms greatly affects price: the more bedrooms, the higher the price, but neighbourhood is also important. As expected, some neighbourhoods are more expensive even with fewer bedrooms, and vice versa

# average price by property type and room type; apartments are among the most expensive
plt.figure(figsize=(12,12))
sns.heatmap(some_features.groupby(['property_type', 'room_type']).price.mean().unstack(),annot=True, fmt=".0f")
<AxesSubplot:xlabel='room_type', ylabel='property_type'>

png

The property and room type certainly affect the price; apartments are the most expensive, followed by boats.

some_features.groupby(['bedrooms']).price.mean().plot(kind="bar")
<AxesSubplot:xlabel='bedrooms'>

png

Key Findings from the Analysis of the data

When we started, we sought to understand and answer these questions:

  1. What features of the listings are related to price?
  2. What vibes are happening in each neighbourhood?
  3. Are there factors/metrics that help understand customer experience?
  4. Are there things that guests/AirBnB can do to boost guest experience and get more clients?
  5. Has there been an increase in the number of AirBnb hosts?
  6. What are the busiest times of the year in Seattle, and by how much do prices change?

Our Analysis of the data has uncovered these findings.

Price

  • Price is related to the neighbourhood, the number of beds, how many nights the guest will be staying, whether there are extra guests, the day of the week, the month, whether it is a holiday, and the amenities provided
  • Of all the amenities that we found and listed, the ones shown on the word cloud are the ones most strongly related to price. Some of the main ones include Wireless Internet, smoke detector, carbon monoxide detector, essentials like shampoo, free parking, and family/kid friendly. These improve the customer experience and also contribute to pricing: the more amenities the host adds, the pricier the listing
  • Houses and apartments are the most common property types, at 45.4% and 44.7% of all listings respectively, and they are among the most expensive

Vibes in each neighbourhood

  • Broadway is the most frequently listed neighbourhood, occurring 397 times
  • Some zip codes have noticeably more listings than others, notably 98122, 98103 and 98102. The bulk of the zip codes have roughly similar listing counts until we get to 98104; 98199, 98126, 98106, 98108, 98133 and 98136 each have around 50-60 listings

Has there been an increase in AirBnb hosts?

  • From 2008 to 2016, listings grew from 4 to 3,816; the growth is essentially exponential. Between 2009 and 2011 listings grew at an average rate of about 140% per year; from 2011 the growth rate steadied at around 30% per year. About 50% of hosts had joined by July 2013

Busiest times in Seattle and how prices change

  • Summer, between June and August, is the busiest time in Seattle, and prices are highest during this period. Holidays are on average priced higher than normal days. The Independence Day holiday is the most expensive time to visit Seattle; good holidays to visit on are Martin Luther King Jr. Day and Washington's Birthday, which are cheaper. Weekends are also priced higher than weekdays, with prices going up on Friday and Saturday

Customer metrics

  • The reviews per month and review score rating features are useful for understanding, and thus improving, customer experience. As expected, some hosts have many reviews but few reviews per month, suggesting they have been listed longer but not necessarily that they are performing better than those with fewer ratings overall but more reviews per month

Data Preparation

Our business understanding is broad: we sought to understand features that we can link to the customer experience and features that affect pricing. As we prepare the data for modelling, we realise that we need to narrow down to one aspect to investigate fully first; then, as time permits, we can do the other. We will start by narrowing our focus on price and model price prediction.

Our EDA has already revealed the features that seem to be related to or affect pricing. We will focus on these features. We concluded that price is related to:

  • Amenities (and, from the word cloud, which specific amenities are essential)
  • Number of reviews
  • Property type
  • Room type
  • Neighbourhood
  • Number of beds
  • How many nights the guest will be staying
  • The day of the week and the month
  • Whether it is a holiday

We will now clean and prepare the columns with these and related features: fill in missing values and convert the categorical variables into one-hot encoded dummies

Selecting the columns that we want

We have 3 datasets:

calendar.csv - calendar data for the listings: availability dates, price for each date.
listings.csv - summary information on listings in Seattle, such as location, host information, cleaning and guest fees, amenities, etc.
reviews.csv - summary review data for the listings.

We will only use the calendar and listings data; the reviews are in textual format, so we will not be modelling them. During our EDA we already identified the columns related to price, so here we simply list them.
#Select only the columns which we are interested in

selected_listings_cols = [
    'id', 'zipcode', 'property_type', 'room_type', 'accommodates', 'bathrooms', 'bedrooms',
 'beds', 'bed_type', 'amenities', 'price', 'weekly_price', 'monthly_price', 'security_deposit',
 'cleaning_fee', 'number_of_reviews', 'review_scores_rating', 'review_scores_accuracy',
 'review_scores_cleanliness', 'review_scores_checkin', 'review_scores_communication',
 'review_scores_location', 'review_scores_value', 'cancellation_policy',
]

new_listings_df = listings_df[selected_listings_cols].copy()
new_listings_df['price'][:5]

0     85.0
1    150.0
2    975.0
3    100.0
4    450.0
Name: price, dtype: float64

Drop missing values

# Drop rows with missing values of the number of reviews
# Drop rows with missing values of the bathrooms
# Drop rows with missing values of the property type

review_scores_cols = [col for col in listings_df.columns if 'review_scores' in col]
drop_cols = ['number_of_reviews','bathrooms','property_type'] + review_scores_cols

new_listings_df.dropna(subset = drop_cols, axis = 0, inplace = True)

# Drop rows where the number of reviews is 0.
new_listings_df = new_listings_df[new_listings_df['number_of_reviews'] != 0]

# Drop rows where the amenities are empty
new_listings_df = new_listings_df[new_listings_df['amenities'] != '{}']

# Reset index
new_listings_df = new_listings_df.reset_index(drop = True)

Fill missing values

# Fill missing values of the bedrooms and beds as 1
new_listings_df[['bedrooms', 'beds']] = new_listings_df[['bedrooms', 'beds']].fillna(value = 1)
# Change the data type of the price-related columns and fill missing values with 0
new_listings_price_df = new_listings_df[['price','weekly_price','monthly_price','security_deposit','cleaning_fee']].copy()

for col in ['price','weekly_price','monthly_price','security_deposit','cleaning_fee']:
    # 'price' is already numeric at this point, so only strip '$' and ',' from the string columns
    if new_listings_price_df[col].dtype == object:
        new_listings_price_df[col] = (new_listings_price_df[col]
                                      .str.replace('$', '', regex=False)
                                      .str.replace(',', '', regex=False)
                                      .astype(float))

new_listings_price_df.fillna(value = 0, inplace = True)
# Calculate weekly and monthly price
new_listings_price_df['calc_weekly_price'] = new_listings_price_df['price'] * 7
new_listings_price_df['calc_monthly_price'] = new_listings_price_df['price'] * 30
new_listings_price_df[:5]

price weekly_price monthly_price security_deposit cleaning_fee calc_weekly_price calc_monthly_price
0 85.0 595.0 2550.0 0.0 0.0 595.0 2550.0
1 150.0 1.0 3.0 100.0 40.0 1050.0 4500.0
2 975.0 6825.0 29250.0 1.0 300.0 6825.0 29250.0
3 450.0 3150.0 13500.0 700.0 125.0 3150.0 13500.0
4 120.0 800.0 3600.0 0.0 40.0 840.0 3600.0
# Fill the weekly and monthly price with their calculated values
for idx, row in new_listings_price_df.iterrows():
    if row['weekly_price'] == 0:
        new_listings_price_df.loc[idx, ['weekly_price']] = row['calc_weekly_price']
    if row['monthly_price'] == 0:
        new_listings_price_df.loc[idx, ['monthly_price']] = row['calc_monthly_price']

new_listings_price_df[:5]
price weekly_price monthly_price security_deposit cleaning_fee calc_weekly_price calc_monthly_price
0 85.0 595.0 2550.0 0.0 0.0 595.0 2550.0
1 150.0 1.0 3.0 100.0 40.0 1050.0 4500.0
2 975.0 6825.0 29250.0 1.0 300.0 6825.0 29250.0
3 450.0 3150.0 13500.0 700.0 125.0 3150.0 13500.0
4 120.0 800.0 3600.0 0.0 40.0 840.0 3600.0
#drop the calculated prices columns
new_listings_price_df.drop(columns = ['calc_weekly_price', 'calc_monthly_price'], inplace = True)
new_listings_price_df
price weekly_price monthly_price security_deposit cleaning_fee
0 85.0 595.0 2550.0 0.0 0.0
1 150.0 1.0 3.0 100.0 40.0
2 975.0 6825.0 29250.0 1.0 300.0
3 450.0 3150.0 13500.0 700.0 125.0
4 120.0 800.0 3600.0 0.0 40.0
... ... ... ... ... ...
3118 154.0 1.0 5.0 150.0 85.0
3119 65.0 455.0 1950.0 0.0 20.0
3120 95.0 600.0 2.0 500.0 0.0
3121 359.0 2513.0 10770.0 0.0 230.0
3122 79.0 553.0 2370.0 500.0 50.0

3123 rows × 5 columns

Categorical values changed to dummy variables

# Create dummy columns of cancellation policy, room type, property type and bed type

cancellation_policy_dummy_df = pd.get_dummies(new_listings_df['cancellation_policy'], prefix = 'cancellation_policy')
room_type_dummy_df = pd.get_dummies(new_listings_df['room_type'], prefix = 'room_type')
property_type_dummy_df = pd.get_dummies(new_listings_df['property_type'], prefix = 'property_type')
bed_type_dummy_df = pd.get_dummies(new_listings_df['bed_type'], prefix = 'bed_type')
bed_type_dummy_df
bed_type_Airbed bed_type_Couch bed_type_Futon bed_type_Pull-out Sofa bed_type_Real Bed
0 0 0 0 0 1
1 0 0 0 0 1
2 0 0 0 0 1
3 0 0 0 0 1
4 0 0 0 0 1
... ... ... ... ... ...
3118 0 0 0 0 1
3119 0 0 0 0 1
3120 0 0 0 0 1
3121 0 0 0 0 1
3122 0 0 0 0 1

3123 rows × 5 columns

Create dummy variables based on amenities

# Create dummy columns based on the ammenities

# Drop rows with empty rows
amenities_series = new_listings_df['amenities']
amenities_series = amenities_series[amenities_series != '{}']

# Iterate over rows and format them as list
amenities_list = []

for index, row in amenities_series.items():
    amenities_list.append(row.replace('{','').replace('}','').replace('"','').split(','))

# Convert the list to a data frame
amenities_df = pd.Series(amenities_list, name = 'amenities').to_frame()

# Create a dummy data frame
dummies_amenities_df = amenities_df.drop(columns='amenities').join(
    pd.get_dummies(
        pd.DataFrame(amenities_df.amenities.tolist()).stack()
    ).astype(int).groupby(level=0).sum()
)

# Reset index
# dummies_amenities_df = dummies_amenities_df.reset_index(drop=True)
dummies_amenities_df[:5]
24-Hour Check-in Air Conditioning Breakfast Buzzer/Wireless Intercom Cable TV Carbon Monoxide Detector Cat(s) Dog(s) Doorman Dryer ... Safety Card Shampoo Smoke Detector Smoking Allowed Suitable for Events TV Washer Washer / Dryer Wheelchair Accessible Wireless Internet
0 0 1 0 0 1 0 0 0 0 1 ... 0 0 0 0 0 1 1 0 0 1
1 0 0 0 1 0 1 0 0 0 1 ... 1 0 1 0 0 1 1 0 0 1
2 0 1 0 0 1 1 1 1 0 1 ... 0 1 1 0 0 1 1 0 0 1
3 0 0 0 0 1 1 0 0 0 0 ... 0 1 1 0 0 1 0 0 0 1
4 0 0 0 0 0 0 0 0 0 0 ... 0 1 1 0 0 0 0 0 0 1

5 rows × 41 columns

dummy_df = pd.concat([cancellation_policy_dummy_df, room_type_dummy_df, property_type_dummy_df, bed_type_dummy_df, dummies_amenities_df], axis = 1)
dummy_df[:5]

cancellation_policy_flexible cancellation_policy_moderate cancellation_policy_strict room_type_Entire home/apt room_type_Private room room_type_Shared room property_type_Apartment property_type_Bed & Breakfast property_type_Boat property_type_Bungalow ... Safety Card Shampoo Smoke Detector Smoking Allowed Suitable for Events TV Washer Washer / Dryer Wheelchair Accessible Wireless Internet
0 0 1 0 1 0 0 1 0 0 0 ... 0 0 0 0 0 1 1 0 0 1
1 0 0 1 1 0 0 1 0 0 0 ... 1 0 1 0 0 1 1 0 0 1
2 0 0 1 1 0 0 0 0 0 0 ... 0 1 1 0 0 1 1 0 0 1
3 0 0 1 1 0 0 0 0 0 0 ... 0 1 1 0 0 1 0 0 0 1
4 0 0 1 0 1 0 0 0 0 0 ... 0 1 1 0 0 0 0 0 0 1

5 rows × 68 columns

Concatenate the dummy dataframes together

concat_listings_df =pd.concat([dummy_df,new_listings_price_df],axis=1)
concat_listings_df ['price']

#pd.concat([new_listings_df.select_dtypes(include = ['int', 'float']), new_listings_price_df, dummy_df], axis = 1)
0        85.0
1       150.0
2       975.0
3       450.0
4       120.0
        ...  
3118    154.0
3119     65.0
3120     95.0
3121    359.0
3122     79.0
Name: price, Length: 3123, dtype: float64

# Create a list of listing_ids that are never available
always_f_listing_id = list(calender_df.groupby('listing_id')['available'].sum().loc[lambda x : x == 0].index.values)
calender_df[:5]
listing_id date available price dayofweek weekofyear month year
0 241032 2016-01-04 1 85.000000 0 1 1 2016
1 241032 2016-01-05 1 85.000000 1 1 1 2016
2 241032 2016-01-06 0 137.090652 2 1 1 2016
3 241032 2016-01-07 0 137.090652 3 1 1 2016
4 241032 2016-01-08 0 137.090652 4 1 1 2016
calender_df.isna().sum()
listing_id    0
date          0
available     0
price         0
dayofweek     0
weekofyear    0
month         0
year          0
dtype: int64

Data Modelling

We will use

  • concat_listings_df and the calender_df
concat_listings_df.columns
Index(['cancellation_policy_flexible', 'cancellation_policy_moderate',
       'cancellation_policy_strict', 'room_type_Entire home/apt',
       'room_type_Private room', 'room_type_Shared room',
       'property_type_Apartment', 'property_type_Bed & Breakfast',
       'property_type_Boat', 'property_type_Bungalow', 'property_type_Cabin',
       'property_type_Camper/RV', 'property_type_Chalet',
       'property_type_Condominium', 'property_type_Dorm',
       'property_type_House', 'property_type_Loft', 'property_type_Other',
       'property_type_Tent', 'property_type_Townhouse',
       'property_type_Treehouse', 'property_type_Yurt', 'bed_type_Airbed',
       'bed_type_Couch', 'bed_type_Futon', 'bed_type_Pull-out Sofa',
       'bed_type_Real Bed', '24-Hour Check-in', 'Air Conditioning',
       'Breakfast', 'Buzzer/Wireless Intercom', 'Cable TV',
       'Carbon Monoxide Detector', 'Cat(s)', 'Dog(s)', 'Doorman', 'Dryer',
       'Elevator in Building', 'Essentials', 'Family/Kid Friendly',
       'Fire Extinguisher', 'First Aid Kit', 'Free Parking on Premises', 'Gym',
       'Hair Dryer', 'Hangers', 'Heating', 'Hot Tub', 'Indoor Fireplace',
       'Internet', 'Iron', 'Kitchen', 'Laptop Friendly Workspace',
       'Lock on Bedroom Door', 'Other pet(s)', 'Pets Allowed',
       'Pets live on this property', 'Pool', 'Safety Card', 'Shampoo',
       'Smoke Detector', 'Smoking Allowed', 'Suitable for Events', 'TV',
       'Washer', 'Washer / Dryer', 'Wheelchair Accessible',
       'Wireless Internet', 'price', 'weekly_price', 'monthly_price',
       'security_deposit', 'cleaning_fee'],
      dtype='object')

selected_cols = [
 'accommodates',
 'bathrooms',
 'bedrooms',
 'beds',
 'review_scores_accuracy',
 'review_scores_cleanliness',
 'review_scores_checkin',
 'review_scores_communication',
 'review_scores_location',
 'cancellation_policy_flexible',
 'cancellation_policy_moderate',
 'cancellation_policy_strict',
 'room_type_Entire home/apt',
 'room_type_Private room',
 'room_type_Shared room',
 'property_type_Apartment',
 'property_type_Bed & Breakfast',
 'property_type_Boat',
 'property_type_Bungalow',
 'property_type_Cabin',
 'property_type_Camper/RV',
 'property_type_Chalet',
 'property_type_Condominium',
 'property_type_Dorm',
 'property_type_House',
 'property_type_Loft',
 'property_type_Other',
 'property_type_Tent',
 'property_type_Townhouse',
 'property_type_Treehouse',
 'property_type_Yurt',
 'bed_type_Airbed',
 'bed_type_Couch',
 'bed_type_Futon',
 'bed_type_Pull-out Sofa',
 'bed_type_Real Bed',
 '24-Hour Check-in',
 'Air Conditioning',
 'Breakfast',
 'Buzzer/Wireless Intercom',
 'Cable TV',
 'Carbon Monoxide Detector',
 'Cat(s)',
 'Dog(s)',
 'Doorman',
 'Dryer',
 'Elevator in Building',
 'Essentials',
 'Family/Kid Friendly',
 'Fire Extinguisher',
 'First Aid Kit',
 'Free Parking on Premises',
 'Gym',
 'Hair Dryer',
 'Hangers',
 'Heating',
 'Hot Tub',
 'Indoor Fireplace',
 'Internet',
 'Iron',
 'Kitchen',
 'Laptop Friendly Workspace',
 'Lock on Bedroom Door',
 'Other pet(s)',
 'Pets Allowed',
 'Pets live on this property',
 'Pool',
 'Safety Card',
 'Shampoo',
 'Smoke Detector',
 'Smoking Allowed',
 'Suitable for Events',
 'TV',
 'Washer',
 'Wheelchair Accessible',
 'Wireless Internet'
                ]

Machine Learning: Model Selection

We will start with XGBoost because it is an advanced gradient boosting algorithm that can deal with many features (our data set has many variables).
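
As a baseline, here is a minimal sketch of the XGBoost regression we have in mind, trained on the prepared listing features and evaluated with the R2 score and RMSE; the hyperparameters and split size are illustrative assumptions, not tuned values:

import numpy as np
from xgboost import XGBRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error

# numeric features and the price target from the prepared listings dataframe
X = concat_listings_df.drop(['price'], axis=1).select_dtypes(exclude=['object'])
y = concat_listings_df['price']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = XGBRegressor(n_estimators=200, max_depth=4, learning_rate=0.1)
model.fit(X_train, y_train)

# evaluate on the held-out set with R2 and RMSE
preds = model.predict(X_test)
print('R2:  ', r2_score(y_test, preds))
print('RMSE:', np.sqrt(mean_squared_error(y_test, preds)))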

from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_classif
import pandas as pd
from sklearn.model_selection import train_test_split
from xgboost import plot_importance
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
X=concat_listings_df.drop(['price'],axis=1).select_dtypes(exclude=['object'])
y=concat_listings_df['price']
# feature extraction
test = SelectKBest(score_func=f_classif, k=4)
fit = test.fit(X, y)
# summarize scores
#set_printoptions(precision=3)
print(fit.scores_)
features = fit.transform(X)
# summarize selected features
print(features[0:5,:])
[ 1.53466214  1.2987518   2.90972465 13.47652381  9.74081468  7.69373935
  2.41685525  0.8211285  11.37626548  0.40519549  0.57030881  0.34520482
  0.31617885  0.9355316   1.59113654  2.53268221  0.78594757  1.47508666
  0.39350675  0.95576457  0.97043513  0.09909161  1.74685022  0.96428115
  0.74663051  0.70456078  1.47429514  1.53097578  1.80095279  1.45991194
  1.55088211  2.31376894  1.11652724  1.07495732  0.98874016  2.92842166
  1.17212899  2.27682701  1.02246066  2.30845121  0.93481025  1.24830463
  1.31057935  2.64698341  1.51811673  1.46783559  1.14992775  2.44973911
  1.88447824  1.07653317  1.58715301  1.20123227  1.25092065  1.04704867
  0.44190869  1.13826204  1.55845695  3.05327576  1.22631481  1.24680062
  0.83378803  0.86773784  1.69089912  2.00411682  1.2468099   0.11179384
  1.24166705  0.95109599 11.99159758 13.29630536  2.89720016 11.95987119]
[[1.000e+00 5.950e+02 2.550e+03 0.000e+00]
 [1.000e+00 1.000e+00 3.000e+00 4.000e+01]
 [1.000e+00 6.825e+03 2.925e+04 3.000e+02]
 [1.000e+00 3.150e+03 1.350e+04 1.250e+02]
 [0.000e+00 8.000e+02 3.600e+03 4.000e+01]]
model = LogisticRegression(solver='lbfgs')
rfe = RFE(model, n_features_to_select=3)
fit = rfe.fit(X, y)
print("Num Features: %d" % fit.n_features_)
print("Selected Features: %s" % fit.support_)
print("Feature Ranking: %s" % fit.ranking_)
/home/fucking/anaconda/lib/python3.8/site-packages/sklearn/linear_model/_logistic.py:762: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
/home/fucking/anaconda/lib/python3.8/site-packages/sklearn/linear_model/_logistic.py:762: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
/home/fucking/anaconda/lib/python3.8/site-packages/sklearn/linear_model/_logistic.py:762: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
/home/fucking/anaconda/lib/python3.8/site-packages/sklearn/linear_model/_logistic.py:762: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
/home/fucking/anaconda/lib/python3.8/site-packages/sklearn/linear_model/_logistic.py:762: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
/home/fucking/anaconda/lib/python3.8/site-packages/sklearn/linear_model/_logistic.py:762: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
/home/fucking/anaconda/lib/python3.8/site-packages/sklearn/linear_model/_logistic.py:762: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
/home/fucking/anaconda/lib/python3.8/site-packages/sklearn/linear_model/_logistic.py:762: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
/home/fucking/anaconda/lib/python3.8/site-packages/sklearn/linear_model/_logistic.py:762: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
/home/fucking/anaconda/lib/python3.8/site-packages/sklearn/linear_model/_logistic.py:762: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
/home/fucking/anaconda/lib/python3.8/site-packages/sklearn/linear_model/_logistic.py:762: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
/home/fucking/anaconda/lib/python3.8/site-packages/sklearn/linear_model/_logistic.py:762: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
/home/fucking/anaconda/lib/python3.8/site-packages/sklearn/linear_model/_logistic.py:762: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
/home/fucking/anaconda/lib/python3.8/site-packages/sklearn/linear_model/_logistic.py:762: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
/home/fucking/anaconda/lib/python3.8/site-packages/sklearn/linear_model/_logistic.py:762: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
/home/fucking/anaconda/lib/python3.8/site-packages/sklearn/linear_model/_logistic.py:762: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
/home/fucking/anaconda/lib/python3.8/site-packages/sklearn/linear_model/_logistic.py:762: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
/home/fucking/anaconda/lib/python3.8/site-packages/sklearn/linear_model/_logistic.py:762: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
/home/fucking/anaconda/lib/python3.8/site-packages/sklearn/linear_model/_logistic.py:762: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
/home/fucking/anaconda/lib/python3.8/site-packages/sklearn/linear_model/_logistic.py:762: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
/home/fucking/anaconda/lib/python3.8/site-packages/sklearn/linear_model/_logistic.py:762: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
/home/fucking/anaconda/lib/python3.8/site-packages/sklearn/linear_model/_logistic.py:762: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
/home/fucking/anaconda/lib/python3.8/site-packages/sklearn/linear_model/_logistic.py:762: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
/home/fucking/anaconda/lib/python3.8/site-packages/sklearn/linear_model/_logistic.py:762: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
/home/fucking/anaconda/lib/python3.8/site-packages/sklearn/linear_model/_logistic.py:762: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
/home/fucking/anaconda/lib/python3.8/site-packages/sklearn/linear_model/_logistic.py:762: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
/home/fucking/anaconda/lib/python3.8/site-packages/sklearn/linear_model/_logistic.py:762: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
/home/fucking/anaconda/lib/python3.8/site-packages/sklearn/linear_model/_logistic.py:762: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
/home/fucking/anaconda/lib/python3.8/site-packages/sklearn/linear_model/_logistic.py:762: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
/home/fucking/anaconda/lib/python3.8/site-packages/sklearn/linear_model/_logistic.py:762: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
/home/fucking/anaconda/lib/python3.8/site-packages/sklearn/linear_model/_logistic.py:762: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
/home/fucking/anaconda/lib/python3.8/site-packages/sklearn/linear_model/_logistic.py:762: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
/home/fucking/anaconda/lib/python3.8/site-packages/sklearn/linear_model/_logistic.py:762: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
/home/fucking/anaconda/lib/python3.8/site-packages/sklearn/linear_model/_logistic.py:762: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
/home/fucking/anaconda/lib/python3.8/site-packages/sklearn/linear_model/_logistic.py:762: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
/home/fucking/anaconda/lib/python3.8/site-packages/sklearn/linear_model/_logistic.py:762: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
/home/fucking/anaconda/lib/python3.8/site-packages/sklearn/linear_model/_logistic.py:762: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
/home/fucking/anaconda/lib/python3.8/site-packages/sklearn/linear_model/_logistic.py:762: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
/home/fucking/anaconda/lib/python3.8/site-packages/sklearn/linear_model/_logistic.py:762: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
/home/fucking/anaconda/lib/python3.8/site-packages/sklearn/linear_model/_logistic.py:762: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
/home/fucking/anaconda/lib/python3.8/site-packages/sklearn/linear_model/_logistic.py:762: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
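A minimal sketch of how the warning could be addressed, assuming the classifier involved is a scikit-learn LogisticRegression; the pipeline and the max_iter value below are illustrative, not the notebook's actual code.

# Scale the features so lbfgs converges quickly, and give the solver more headroom
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

logreg_pipeline = make_pipeline(
    StandardScaler(),                                    # centre and scale each feature
    LogisticRegression(max_iter=1000, random_state=99))
# logreg_pipeline.fit(X_train, y_train)  # would replace the unscaled fit that raised the warning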
# Train, test and validation data: 70/30 train/test split, then 20% of the
# training portion held out as a validation set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=99)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=99)
# Create a model and fit the data to it

xgb_model = XGBRegressor(
    max_depth=15,
    n_estimators=1000,
    min_child_weight=10,
    colsample_bytree=0.6,
    subsample=0.6,
    eta=0.2,            # note: eta is an alias of learning_rate, so only one of the two needs to be set
    seed=0,
    learning_rate=0.05,
    n_jobs=2)

# Early stopping on the validation RMSE stops training well before all 1000 trees are built
xgb_model.fit(
    X_train,
    y_train,
    eval_metric="rmse",
    eval_set=[(X_train, y_train), (X_val, y_val)],
    verbose=10,
    early_stopping_rounds=5)
[0]	validation_0-rmse:147.14436	validation_1-rmse:146.99689
[10]	validation_0-rmse:95.54114	validation_1-rmse:95.90320
[20]	validation_0-rmse:64.75782	validation_1-rmse:66.27467
[30]	validation_0-rmse:47.27364	validation_1-rmse:48.85769
[40]	validation_0-rmse:38.16735	validation_1-rmse:41.10921
[50]	validation_0-rmse:32.86573	validation_1-rmse:36.87308
[60]	validation_0-rmse:29.69966	validation_1-rmse:34.57921
[70]	validation_0-rmse:27.70108	validation_1-rmse:33.17731
[80]	validation_0-rmse:26.40963	validation_1-rmse:32.58391
[90]	validation_0-rmse:25.15677	validation_1-rmse:31.58672
[100]	validation_0-rmse:24.32839	validation_1-rmse:31.16549
[103]	validation_0-rmse:24.15864	validation_1-rmse:31.21159





XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=0.6, eta=0.2, gamma=0,
             gpu_id=-1, importance_type='gain', interaction_constraints='',
             learning_rate=0.05, max_delta_step=0, max_depth=15,
             min_child_weight=10, missing=nan, monotone_constraints='()',
             n_estimators=1000, n_jobs=2, num_parallel_tree=1, random_state=0,
             reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=0,
             subsample=0.6, tree_method='exact', validate_parameters=1,
             verbosity=None)
# Predict on each split and report the R² score (printed in train, validation, test order)
xgb_train_pred = xgb_model.predict(X_train)
xgb_val_pred = xgb_model.predict(X_val)
xgb_test_pred = xgb_model.predict(X_test)
for i in [[y_train, xgb_train_pred], [y_val, xgb_val_pred], [y_test, xgb_test_pred]]:
    print(r2_score(i[0], i[1]))
0.9248414694882681
0.8683763163400222
0.916730205761942

The model performs reasonably well: the R² scores are roughly 0.92 on the training set, 0.87 on the validation set and 0.92 on the test set, so the model explains between about 87% and 92% of the variance in price.
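Since RMSE is also one of our evaluation criteria, the sketch below reports it on the same splits. This is an illustrative addition (it assumes numpy and sklearn.metrics.mean_squared_error are available), not code from the original notebook.

# Report RMSE (in the same units as price) for the train, validation and test splits
import numpy as np
from sklearn.metrics import mean_squared_error

for name, (actual, pred) in {
    "train": (y_train, xgb_train_pred),
    "val": (y_val, xgb_val_pred),
    "test": (y_test, xgb_test_pred),
}.items():
    rmse = np.sqrt(mean_squared_error(actual, pred))
    print(f"{name} RMSE: {rmse:.2f}")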

# Plot which features the trained model relies on most
fig, ax = plt.subplots(figsize=(10, 15))
plot_importance(xgb_model, ax=ax);

(Figure: XGBoost feature importance plot for the price model)

The feature importances are not entirely consistent with our EDA findings, possibly because of the high dimensionality of the feature set and strong correlations between features. We could revisit our feature engineering and reduce the number of features, for example by dropping one column from each highly correlated pair, as sketched below.
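The sketch below shows one way to do that, assuming X is the numeric feature DataFrame used above; the 0.9 correlation threshold is illustrative.

# Drop one column from every pair of features whose absolute correlation exceeds 0.9
import numpy as np

corr = X.corr().abs()
# Keep only the upper triangle of the correlation matrix so each pair is considered once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
X_reduced = X.drop(columns=to_drop)
print(f"Dropped {len(to_drop)} highly correlated features")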

Customer Experience prediction

Our other business objective was to determine whether there are metrics we can use to understand customer experience. We have already done the EDA and feature engineering, so what remains is the final feature selection and modelling.

df = listings_df.copy()

# Scale the review rating to 0-1 and weight it by how frequently the listing is reviewed
df['cx_score'] = df['review_scores_rating'] / 100 * df['reviews_per_month']

We found a strong relationship between the review scores rating and reviews per month. Since we can only have one response variable, we combine the two by scaling the rating to a 0–1 range and multiplying it by reviews per month, and call the resulting column cx_score.
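A quick way to sanity-check that relationship is to look at the correlation between the two columns directly; this check is illustrative and not part of the original notebook.

# Inspect how strongly the two candidate response variables move together
print(df[['review_scores_rating', 'reviews_per_month']].corr())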

listings_df['neighbourhood_cleansed'].value_counts()[:10]
Broadway                     397
Belltown                     234
Wallingford                  167
Fremont                      158
Minor                        135
University District          122
Stevens                      119
First Hill                   108
Central Business District    103
Lower Queen Anne              94
Name: neighbourhood_cleansed, dtype: int64

We list the top 10 neighbourhoods by number of listings. We will choose one of these neighbourhoods and model the data for it, as sketched below.
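A minimal sketch of that filtering step, starting with the largest neighbourhood, Broadway; the variable names here are illustrative.

# Keep only listings from the neighbourhoods of interest before modelling
top_neighbourhoods = listings_df['neighbourhood_cleansed'].value_counts()[:10].index
df_top = df[df['neighbourhood_cleansed'].isin(top_neighbourhoods)]

# Model a single neighbourhood at a time, e.g. Broadway (the largest by listing count)
df_broadway = df[df['neighbourhood_cleansed'] == 'Broadway']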