Predicting Airbnb Prices and Review Scores
Business Understanding
Since 2008, guests and hosts have used Airbnb to travel in a more unique, personalized way. As part of its growth initiative, Airbnb is seeking to enhance the customer experience and improve its business. The insights from this analysis will serve the Customer Experience, Business Development and Marketing departments. We will monitor local trends and gain strategic insight into the growth of Airbnb in Seattle and into ways of improving the customer experience.
Data mining objectives
Airbnb wants to know what vibe each neighbourhood has, whether there is a general upward trend of new Airbnb listings in Seattle, and which factors influence price and customer experience. We want to find metrics that the business can associate with good prices and increased listings, a good way to measure customer satisfaction, and things Airbnb hosts can do to get more bookings.
We will use the listings, calendar and reviews datasets for our analysis. We will wrangle and explore this data to understand what data we have and find insights from it to meet our objectives. A large part of this will involve applying transformations to the data, and dealing with missing data.
To summarise, translating our business goals into data mining goals, we have:
1. What features of the listings are related to price?
2. What vibes are happening in each neighbourhood?
3. Are there factors or metrics that help us understand customer experience?
4. Are there things that guests/AirBnB can do to boost guest experience and get more clients?
5. Has there been an increase in the number of AirBnb hosts?
6. What are the busiest times of the year in Seattle, and by how much do prices change?
For clarity, throughout this analysis "customer" and "guest" refer to the person booking a stay, while "host" refers to the person offering the listing.
Data mining problem type
We will mine the data to answer, or attempt to answer, the data mining objectives. The questions about the vibe of each neighbourhood and the factors that help us understand and improve customer experience will require exploratory data analysis (EDA). The EDA will be performed on features of the listings and reviews datasets to find the relevant metrics.
The question of factors related to price and guest experience will become either a regression or classification problem for which we will do the modelling. The challenges will be selecting the features that are important and preparing the data for modelling (removing outliers, imputing missing values etc).
Answering the questions on price and customer experience will also involve feature engineering: choosing which features to use and, depending on the chosen model, determining its parameters.
Criteria for model assessment
The model will be evaluated against tests to ensure it is valid.
The first data mining problems will be approached using descriptive statistics and visualisations of the distributions. These will allow us to identify key metrics and some feature importances.
We will also need to ensure that the chosen metric satisfies the business requirement of being an adequate representation of customer experience, and that it is not chosen solely because it produces good results from the model (feature importance).
For the second data mining problem, we will use cross-validation to evaluate candidate models. If using regression, we will evaluate with the R² score and the RMSE. We could also plot the residuals of the model to check for skew.
The third data mining problem will follow from the solution of the second data mining problem. We can apply a properly evaluated model on the data in order to assess the feature importance. We will ensure that the features chosen from the feature importance are appropriate for using in a strategy that could conceivably target hosts and how they could improve the guests’ experiences.
Finally, it is important to note the broad nature of the project, so we narrow it down to improving customer service. We are only looking to improve guests' experience, and this could be limited to certain subsets of the user base. As long as the chosen subset is large enough for potential insights to be significant, this will contribute towards satisfying our business goals. A minimal sketch of the kind of cross-validated evaluation mentioned above is shown below.
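The following is an illustrative sketch only: the random-forest model and the synthetic data generated by make_regression are placeholders standing in for the prepared feature matrix and price/experience target, not the final approach.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Placeholder data standing in for the prepared feature matrix and target
X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=42)

model = RandomForestRegressor(random_state=42)
r2 = cross_val_score(model, X, y, cv=5, scoring="r2")
rmse = -cross_val_score(model, X, y, cv=5, scoring="neg_root_mean_squared_error")
print(f"R2:   {r2.mean():.3f} +/- {r2.std():.3f}")
print(f"RMSE: {rmse.mean():.3f} +/- {rmse.std():.3f}")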
Action plan
Boiling down the above into a plan, we have:
1. Explore and understand the data.
2. Perform exploratory and descriptive analysis to find an appropriate response variable related to customer experience and price.
3. Based on whether the response variable is continuous or discrete, frame either a regression or a classification problem for predicting it.
4. Clean and tidy the dataset to prepare it for modelling.
5. Select candidate algorithms for modelling the data and evaluate them using cross-validation.
6. Use the prediction model to ascertain the most important aspects hosts can improve to enhance guests' experience.
7. Communicate the results as key points that a non-technical business audience can understand.
Data Understanding
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import nltk
import re
!pip install wordcloud
from wordcloud import WordCloud, STOPWORDS
from nltk.corpus import stopwords
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')
Data Gathering
calender_df=pd.read_csv("calendar.csv", delimiter=',')
listings_df=pd.read_csv("listings.csv", delimiter=',')
reviews_df=pd.read_csv("reviews.csv", delimiter=',')
Understanding the calendar dataset
calender_df.shape
(1393570, 4)
calender_df.head()
listing_id | date | available | price | |
---|---|---|---|---|
0 | 241032 | 2016-01-04 | t | $85.00 |
1 | 241032 | 2016-01-05 | t | $85.00 |
2 | 241032 | 2016-01-06 | f | NaN |
3 | 241032 | 2016-01-07 | f | NaN |
4 | 241032 | 2016-01-08 | f | NaN |
# Which columns have no missing values?
no_missing_vals_cols=set(calender_df.columns[calender_df.isnull().mean()==0])
no_missing_vals_cols
{'available', 'date', 'listing_id'}
# the 'available' column uses "t"/"f"; replace these with "1"/"0" and cast to int
calender_df=calender_df.replace(to_replace="t",value="1")
calender_df=calender_df.replace(to_replace="f",value="0")
calender_df['available'] = calender_df['available'].astype(int, errors = 'raise')
#Find the proportion of available to unavailable values
calender_df.available.value_counts()/ calender_df.shape[0]
1 0.67061
0 0.32939
Name: available, dtype: float64
# extract the numeric part of the price (this keeps only the digits before the first non-digit character, so cents are dropped)
calender_df['price'] = calender_df['price'].str.extract(r'(\d+)', expand=False)
# make price float
calender_df['price'] = calender_df['price'].astype(float, errors = 'raise')
#fill missing price values with mean
calender_df['price'].fillna((calender_df['price'].mean()), inplace=True)
#plot the proportions of available to not available
price_val=calender_df.groupby(['available']).listing_id.count()
price_val.plot.pie(figsize=(6,6))
<AxesSubplot:ylabel='listing_id'>
Across all listing-days in the calendar, about 67% are available and 33% are unavailable.
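To make this more precise per listing (a quick sketch, assuming calender_df has been prepared as above), we can look at each listing's share of available days:
# share of days each listing is available over the period covered by the calendar
availability_share = calender_df.groupby('listing_id')['available'].mean()
# how many listings are available every day, and how many are never available?
print((availability_share == 1).sum(), "listings available every day")
print((availability_share == 0).sum(), "listings never available")
availability_share.hist()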
#check the calender data types
calender_df.dtypes
listing_id int64
date object
available int64
price float64
dtype: object
The date column is stored as an object; we convert it to a datetime dtype for analysis.
#convert the date column from object to datetime
calender_df['date'] = pd.to_datetime(calender_df['date'])
calender_df.date.dtypes
dtype('<M8[ns]')
A more useful representation is to split the datetime into day of the week, week of the year, month and year. This will help us uncover time-related trends.
# split date time into day of the week, week of the year, month of the year and year for indepth analysis
# Create new columns
calender_df['dayofweek'] = calender_df['date'].dt.dayofweek
calender_df['weekofyear'] = calender_df['date'].dt.isocalendar().week
calender_df['month'] = calender_df['date'].dt.month
calender_df['year'] = calender_df['date'].dt.year
calender_df[:5]
listing_id | date | available | price | dayofweek | weekofyear | month | year | |
---|---|---|---|---|---|---|---|---|
0 | 241032 | 2016-01-04 | 1 | 85.000000 | 0 | 1 | 1 | 2016 |
1 | 241032 | 2016-01-05 | 1 | 85.000000 | 1 | 1 | 1 | 2016 |
2 | 241032 | 2016-01-06 | 0 | 137.090652 | 2 | 1 | 1 | 2016 |
3 | 241032 | 2016-01-07 | 0 | 137.090652 | 3 | 1 | 1 | 2016 |
4 | 241032 | 2016-01-08 | 0 | 137.090652 | 4 | 1 | 1 | 2016 |
What we have gathered from the calendar data is that each listing-day records whether the listing was available and at what price. We learnt that there are 3,818 unique listings in total and that about 67% of listing-days are available.
To note: we split the datetime into day of the week, week of the year, month and year. This will be useful for our EDA.
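Since one of our questions is about the busiest times of the year and how prices change, a quick illustrative cut of the calendar data might look like the sketch below; it uses only days marked available, so the imputed mean prices for unavailable days do not distort the picture.
# average nightly price (available days only) and share of available days, by calendar month
available_days = calender_df[calender_df['available'] == 1]
monthly_price = available_days.groupby('month')['price'].mean()
monthly_availability = calender_df.groupby('month')['available'].mean()
print(monthly_price.round(2))
print(monthly_availability.round(3))
monthly_price.plot(kind='bar', title='Average nightly price by month (available days only)')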
Understanding the reviews data
The reviews dataset has a simple structure and contains the comments made by guests. By identifying frequently used words, or words associated with quality, we can provide insights for Airbnb hosts.
df2=reviews_df
df2.head()
listing_id | id | date | reviewer_id | reviewer_name | comments | |
---|---|---|---|---|---|---|
0 | 7202016 | 38917982 | 2015-07-19 | 28943674 | Bianca | Cute and cozy place. Perfect location to every... |
1 | 7202016 | 39087409 | 2015-07-20 | 32440555 | Frank | Kelly has a great room in a very central locat... |
2 | 7202016 | 39820030 | 2015-07-26 | 37722850 | Ian | Very spacious apartment, and in a great neighb... |
3 | 7202016 | 40813543 | 2015-08-02 | 33671805 | George | Close to Seattle Center and all it has to offe... |
4 | 7202016 | 41986501 | 2015-08-10 | 34959538 | Ming | Kelly was a great host and very accommodating ... |
df2.shape
(84849, 6)
df2.dtypes
listing_id int64
id int64
date object
reviewer_id int64
reviewer_name object
comments object
dtype: object
#convert date to datetime
df2['date'] = pd.to_datetime(df2['date'])
# split date time into day of the week, week of the year, month of the year and year for indepth analysis
# Create new columns
df2['dayofweek'] = df2['date'].dt.dayofweek
df2['weekofyear'] = df2['date'].dt.isocalendar().week
df2['month'] = df2['date'].dt.month
df2['year'] = df2['date'].dt.year
df2[:5]
listing_id | id | date | reviewer_id | reviewer_name | comments | dayofweek | weekofyear | month | year | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 7202016 | 38917982 | 2015-07-19 | 28943674 | Bianca | Cute and cozy place. Perfect location to every... | 6 | 29 | 7 | 2015 |
1 | 7202016 | 39087409 | 2015-07-20 | 32440555 | Frank | Kelly has a great room in a very central locat... | 0 | 30 | 7 | 2015 |
2 | 7202016 | 39820030 | 2015-07-26 | 37722850 | Ian | Very spacious apartment, and in a great neighb... | 6 | 30 | 7 | 2015 |
3 | 7202016 | 40813543 | 2015-08-02 | 33671805 | George | Close to Seattle Center and all it has to offe... | 6 | 31 | 8 | 2015 |
4 | 7202016 | 41986501 | 2015-08-10 | 34959538 | Ming | Kelly was a great host and very accommodating ... | 0 | 33 | 8 | 2015 |
def plot_display(plot_words_list):
    # build a word cloud from the supplied text and display it
    wordcloud = WordCloud(width = 1000, height = 700).generate(plot_words_list)
    plt.figure(figsize=(18,12))
    plt.imshow(wordcloud)
    plt.axis("off")
    plt.show()
We will use the Natural Language Toolkit (NLTK), a natural language processing library for Python, to process the review comments.
from wordcloud import WordCloud
from wordcloud import STOPWORDS
import matplotlib.pyplot as plt
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')
# take the 30 comments with the highest listing ids as a sample
comments = df2[['comments','listing_id']]
commentsDFTopper = comments.sort_values('listing_id',ascending=[0])
commentsDFtop=commentsDFTopper.head(30)
reviews = ''
for index,row in commentsDFtop.iterrows():
    # keep letters only and concatenate the comments into one string
    p = re.sub('[^a-zA-Z]+',' ', row['comments'])
    reviews+=p
# tokenise, remove English stopwords and lemmatise
reviews_data=nltk.word_tokenize(reviews)
filtered_data=[word for word in reviews_data if word not in stopwords.words('english')]
wnl = nltk.WordNetLemmatizer()
reviews_data=[wnl.lemmatize(data) for data in filtered_data]
reviews_words=' '.join(reviews_data)
Common words among reviews
Words like great, clean, host, flexible and accommodating are frequently used. They appear across many listings, and hosts should strive to earn such comments.
# display a word cloud of the processed (tokenised and lemmatised) review text
plot_display(reviews_words)
Understanding Listings data
We need to look at the data available to us in the listings CSV file, in order to gain understanding of the data.
Columns
We start with reading in the file and listing out the columns available to us.
listings_df=pd.read_csv("listings.csv", delimiter=',')
listings_df.columns
Index(['id', 'listing_url', 'scrape_id', 'last_scraped', 'name', 'summary',
'space', 'description', 'experiences_offered', 'neighborhood_overview',
'notes', 'transit', 'thumbnail_url', 'medium_url', 'picture_url',
'xl_picture_url', 'host_id', 'host_url', 'host_name', 'host_since',
'host_location', 'host_about', 'host_response_time',
'host_response_rate', 'host_acceptance_rate', 'host_is_superhost',
'host_thumbnail_url', 'host_picture_url', 'host_neighbourhood',
'host_listings_count', 'host_total_listings_count',
'host_verifications', 'host_has_profile_pic', 'host_identity_verified',
'street', 'neighbourhood', 'neighbourhood_cleansed',
'neighbourhood_group_cleansed', 'city', 'state', 'zipcode', 'market',
'smart_location', 'country_code', 'country', 'latitude', 'longitude',
'is_location_exact', 'property_type', 'room_type', 'accommodates',
'bathrooms', 'bedrooms', 'beds', 'bed_type', 'amenities', 'square_feet',
'price', 'weekly_price', 'monthly_price', 'security_deposit',
'cleaning_fee', 'guests_included', 'extra_people', 'minimum_nights',
'maximum_nights', 'calendar_updated', 'has_availability',
'availability_30', 'availability_60', 'availability_90',
'availability_365', 'calendar_last_scraped', 'number_of_reviews',
'first_review', 'last_review', 'review_scores_rating',
'review_scores_accuracy', 'review_scores_cleanliness',
'review_scores_checkin', 'review_scores_communication',
'review_scores_location', 'review_scores_value', 'requires_license',
'license', 'jurisdiction_names', 'instant_bookable',
'cancellation_policy', 'require_guest_profile_picture',
'require_guest_phone_verification', 'calculated_host_listings_count',
'reviews_per_month'],
dtype='object')
# How many data types do we have
listings_df.dtypes.value_counts()
object 62
float64 17
int64 13
dtype: int64
# Integer columns
listings_df.select_dtypes(include = ["int"])[:5]
id | scrape_id | host_id | accommodates | guests_included | minimum_nights | maximum_nights | availability_30 | availability_60 | availability_90 | availability_365 | number_of_reviews | calculated_host_listings_count | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 241032 | 20160104002432 | 956883 | 4 | 2 | 1 | 365 | 14 | 41 | 71 | 346 | 207 | 2 |
1 | 953595 | 20160104002432 | 5177328 | 4 | 1 | 2 | 90 | 13 | 13 | 16 | 291 | 43 | 6 |
2 | 3308979 | 20160104002432 | 16708587 | 11 | 10 | 4 | 30 | 1 | 6 | 17 | 220 | 20 | 2 |
3 | 7421966 | 20160104002432 | 9851441 | 3 | 1 | 1 | 1125 | 0 | 0 | 0 | 143 | 0 | 1 |
4 | 278830 | 20160104002432 | 1452570 | 6 | 6 | 1 | 1125 | 30 | 60 | 90 | 365 | 38 | 1 |
# Float columns
listings_df.select_dtypes(include = ["float"])[:5]
host_listings_count | host_total_listings_count | latitude | longitude | bathrooms | bedrooms | beds | square_feet | review_scores_rating | review_scores_accuracy | review_scores_cleanliness | review_scores_checkin | review_scores_communication | review_scores_location | review_scores_value | license | reviews_per_month | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 3.0 | 3.0 | 47.636289 | -122.371025 | 1.0 | 1.0 | 1.0 | NaN | 95.0 | 10.0 | 10.0 | 10.0 | 10.0 | 9.0 | 10.0 | NaN | 4.07 |
1 | 6.0 | 6.0 | 47.639123 | -122.365666 | 1.0 | 1.0 | 1.0 | NaN | 96.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | NaN | 1.48 |
2 | 2.0 | 2.0 | 47.629724 | -122.369483 | 4.5 | 5.0 | 7.0 | NaN | 97.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | NaN | 1.15 |
3 | 1.0 | 1.0 | 47.638473 | -122.369279 | 1.0 | 0.0 | 2.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
4 | 2.0 | 2.0 | 47.632918 | -122.372471 | 2.0 | 3.0 | 3.0 | NaN | 92.0 | 9.0 | 9.0 | 10.0 | 10.0 | 9.0 | 9.0 | NaN | 0.89 |
# object columns
listings_df.select_dtypes(include = ["object"])[:5]
listing_url | last_scraped | name | summary | space | description | experiences_offered | neighborhood_overview | notes | transit | ... | has_availability | calendar_last_scraped | first_review | last_review | requires_license | jurisdiction_names | instant_bookable | cancellation_policy | require_guest_profile_picture | require_guest_phone_verification | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | https://www.airbnb.com/rooms/241032 | 2016-01-04 | Stylish Queen Anne Apartment | NaN | Make your self at home in this charming one-be... | Make your self at home in this charming one-be... | none | NaN | NaN | NaN | ... | t | 2016-01-04 | 2011-11-01 | 2016-01-02 | f | WASHINGTON | f | moderate | f | f |
1 | https://www.airbnb.com/rooms/953595 | 2016-01-04 | Bright & Airy Queen Anne Apartment | Chemically sensitive? We've removed the irrita... | Beautiful, hypoallergenic apartment in an extr... | Chemically sensitive? We've removed the irrita... | none | Queen Anne is a wonderful, truly functional vi... | What's up with the free pillows? Our home was... | Convenient bus stops are just down the block, ... | ... | t | 2016-01-04 | 2013-08-19 | 2015-12-29 | f | WASHINGTON | f | strict | t | t |
2 | https://www.airbnb.com/rooms/3308979 | 2016-01-04 | New Modern House-Amazing water view | New modern house built in 2013. Spectacular s... | Our house is modern, light and fresh with a wa... | New modern house built in 2013. Spectacular s... | none | Upper Queen Anne is a charming neighborhood fu... | Our house is located just 5 short blocks to To... | A bus stop is just 2 blocks away. Easy bus a... | ... | t | 2016-01-04 | 2014-07-30 | 2015-09-03 | f | WASHINGTON | f | strict | f | f |
3 | https://www.airbnb.com/rooms/7421966 | 2016-01-04 | Queen Anne Chateau | A charming apartment that sits atop Queen Anne... | NaN | A charming apartment that sits atop Queen Anne... | none | NaN | NaN | NaN | ... | t | 2016-01-04 | NaN | NaN | f | WASHINGTON | f | flexible | f | f |
4 | https://www.airbnb.com/rooms/278830 | 2016-01-04 | Charming craftsman 3 bdm house | Cozy family craftman house in beautiful neighb... | Cozy family craftman house in beautiful neighb... | Cozy family craftman house in beautiful neighb... | none | We are in the beautiful neighborhood of Queen ... | Belltown | The nearest public transit bus (D Line) is 2 b... | ... | t | 2016-01-04 | 2012-07-10 | 2015-10-24 | f | WASHINGTON | f | strict | f | f |
5 rows × 62 columns
#which columns have missing values
listings_df.isnull().sum()[listings_df.isna().sum()>0]
summary 177
space 569
neighborhood_overview 1032
notes 1606
transit 934
thumbnail_url 320
medium_url 320
xl_picture_url 320
host_name 2
host_since 2
host_location 8
host_about 859
host_response_time 523
host_response_rate 523
host_acceptance_rate 773
host_is_superhost 2
host_thumbnail_url 2
host_picture_url 2
host_neighbourhood 300
host_listings_count 2
host_total_listings_count 2
host_has_profile_pic 2
host_identity_verified 2
neighbourhood 416
zipcode 7
property_type 1
bathrooms 16
bedrooms 6
beds 1
square_feet 3721
weekly_price 1809
monthly_price 2301
security_deposit 1952
cleaning_fee 1030
first_review 627
last_review 627
review_scores_rating 647
review_scores_accuracy 658
review_scores_cleanliness 653
review_scores_checkin 658
review_scores_communication 651
review_scores_location 655
review_scores_value 656
license 3818
reviews_per_month 627
dtype: int64
The listings dataset has 92 features/columns. That is too many to treat individually, so we will need to focus. A quick overview of the list shows that the features fall into groups:
- Columns about reviews, for instance number_of_reviews, first_review, last_review, review_scores_rating, review_scores_accuracy, review_scores_cleanliness, review_scores_checkin, review_scores_communication, review_scores_location, review_scores_value
- Columns related to the property, like amenities, bedrooms, room_type etc.
- Columns related to location, like neighbourhood, zipcode, city. We will want to see which neighbourhoods are busiest and priciest
- Columns related to the host's pricing and policies, such as price, weekly_price, security_deposit
- Host-related attributes like host_location and host_response_time; these features may be important for the guest experience
Review based columns
# list of columns related to reviews
review_cols = [col for col in listings_df.columns if 'review' in col]
review_cols
['number_of_reviews',
'first_review',
'last_review',
'review_scores_rating',
'review_scores_accuracy',
'review_scores_cleanliness',
'review_scores_checkin',
'review_scores_communication',
'review_scores_location',
'review_scores_value',
'reviews_per_month']
listings_df[review_cols].describe()
number_of_reviews | review_scores_rating | review_scores_accuracy | review_scores_cleanliness | review_scores_checkin | review_scores_communication | review_scores_location | review_scores_value | reviews_per_month | |
---|---|---|---|---|---|---|---|---|---|
count | 3818.000000 | 3171.000000 | 3160.000000 | 3165.000000 | 3160.000000 | 3167.000000 | 3163.000000 | 3162.000000 | 3191.000000 |
mean | 22.223415 | 94.539262 | 9.636392 | 9.556398 | 9.786709 | 9.809599 | 9.608916 | 9.452245 | 2.078919 |
std | 37.730892 | 6.606083 | 0.698031 | 0.797274 | 0.595499 | 0.568211 | 0.629053 | 0.750259 | 1.822348 |
min | 0.000000 | 20.000000 | 2.000000 | 3.000000 | 2.000000 | 2.000000 | 4.000000 | 2.000000 | 0.020000 |
25% | 2.000000 | 93.000000 | 9.000000 | 9.000000 | 10.000000 | 10.000000 | 9.000000 | 9.000000 | 0.695000 |
50% | 9.000000 | 96.000000 | 10.000000 | 10.000000 | 10.000000 | 10.000000 | 10.000000 | 10.000000 | 1.540000 |
75% | 26.000000 | 99.000000 | 10.000000 | 10.000000 | 10.000000 | 10.000000 | 10.000000 | 10.000000 | 3.000000 |
max | 474.000000 | 100.000000 | 10.000000 | 10.000000 | 10.000000 | 10.000000 | 10.000000 | 10.000000 | 12.150000 |
There are 3818 rows in total for listings. All columns except for number_of_reviews have missing values. Looking at the columns we see some interesting things:
The range of number_of_reviews is huge (0 to 474), and its mean of 22 is far from the median of 9, indicating a strongly skewed distribution.
The standard deviation of the review_scores_rating column, at 6.61, is much higher than that of the other review_scores columns.
The range of reviews_per_month is much smaller than that of number_of_reviews, with a maximum of 12.15 compared with a maximum number_of_reviews of 474. This is most likely because number_of_reviews accumulates over an unbounded time period, whereas reviews_per_month is measured over a consistent period, i.e. a month.
## What's the correlation between the review columns?
_df=listings_df[review_cols]
corr=_df.corr()
sns.heatmap(corr, xticklabels=corr.columns.values, yticklabels=corr.columns.values)
<AxesSubplot:>
There's strong correlation between basically all the review columns except review_scores_location. review_scores_checkin and review_scores_communication are also strongly correlated. Another strong correlation is between reviews_per_month and number_of_reviews, which is understandable since reviews_per_month is derived from the number of reviews.
One of our data mining objectives is to find a metric for understanding the quality of the guest's experience. The review columns can help us answer that.
review_scores_rating seems best suited, since the other scores are components of it and are correlated with it, which suggests they were used to calculate the overall rating.
From our earlier analysis, review_scores_rating has 3,171 non-empty values out of 3,818 rows, which means about 17% of the values are missing.
# a look at the review scores rating
listings_df['review_scores_rating'].hist()
<AxesSubplot:>
Review score ratings are generally high, concentrated near 100 with a tail of lower scores. Good reviews are clearly the norm, so earning them is very important for hosts.
# Another useful metric to note is the number of reviews vs reviews per month
listings_df[['reviews_per_month','number_of_reviews']].hist()
array([[<AxesSubplot:title={'center':'reviews_per_month'}>,
<AxesSubplot:title={'center':'number_of_reviews'}>]], dtype=object)
We already know there's a strong positive correlation between these two. The number of reviews spans a long range, which mostly reflects how long a listing has been listed, whereas reviews per month gives a more comparable picture.
listings_df[['number_of_reviews','first_review','last_review','reviews_per_month']][:5]
number_of_reviews | first_review | last_review | reviews_per_month | |
---|---|---|---|---|
0 | 207 | 2011-11-01 | 2016-01-02 | 4.07 |
1 | 43 | 2013-08-19 | 2015-12-29 | 1.48 |
2 | 20 | 2014-07-30 | 2015-09-03 | 1.15 |
3 | 0 | NaN | NaN | NaN |
4 | 38 | 2012-07-10 | 2015-10-24 | 0.89 |
listings_df[['review_scores_rating','reviews_per_month']].hist()
array([[<AxesSubplot:title={'center':'review_scores_rating'}>,
<AxesSubplot:title={'center':'reviews_per_month'}>]], dtype=object)
So far we see that reviews_per_month and review_scores_rating are useful features for understanding, and thus improving, customer experience. As expected, some hosts have many reviews but fewer reviews per month, suggesting they have simply been listed longer, not necessarily that they are performing better than hosts with fewer total reviews but more reviews per month.
Note: For Data preparation, we may consider a relation between these two features as a response variable for modelling customer experience
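As a quick illustrative check (not a final modelling choice), we can look at how these two candidate metrics relate to each other directly:
# correlation and scatter of the two candidate customer-experience metrics
experience = listings_df[['review_scores_rating', 'reviews_per_month']].dropna()
print(experience.corr())
experience.plot.scatter(x='reviews_per_month', y='review_scores_rating', alpha=0.3)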
Price and related features
new_price_col=['price', 'security_deposit', 'cleaning_fee', 'guests_included',
'extra_people', 'minimum_nights', 'maximum_nights','weekly_price', 'monthly_price']
The money columns contain the dollar sign and are of object type. We need to remove the dollar sign and convert the data type to float for quantitative analysis
# strip the dollar sign, keeping only the leading digits of each amount
listings_df['price'] = listings_df['price'].str.extract(r'(\d+)', expand=False)
listings_df['weekly_price'] = listings_df['weekly_price'].str.extract(r'(\d+)', expand=False)
listings_df['monthly_price'] = listings_df['monthly_price'].str.extract(r'(\d+)', expand=False)
listings_df['security_deposit'] = listings_df['security_deposit'].str.extract(r'(\d+)', expand=False)
listings_df['cleaning_fee'] = listings_df['cleaning_fee'].str.extract(r'(\d+)', expand=False)
listings_df['extra_people'] = listings_df['extra_people'].str.extract(r'(\d+)', expand=False)
# change the money values to float
listings_df['cleaning_fee'] = listings_df['cleaning_fee'].astype(float, errors = 'raise')
listings_df['security_deposit'] = listings_df['security_deposit'].astype(float, errors = 'raise')
listings_df['extra_people'] = listings_df['extra_people'].astype(float, errors = 'raise')
listings_df['price'] = listings_df['price'].astype(float, errors = 'raise')
listings_df['weekly_price'] = listings_df['weekly_price'].astype(float, errors = 'raise')
listings_df['monthly_price'] = listings_df['monthly_price'].astype(float, errors = 'raise')
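Note that the regular expression above keeps only the digits before the first non-digit character, so an amount like $1,250.00 would lose everything after the comma and all cents. A more careful alternative cleaning, shown here as a sketch and applied to a freshly read copy so it does not interfere with the columns already converted, could be:
# alternative money cleaning: strip '$' and ',' and keep the cents
raw_listings = pd.read_csv("listings.csv")
money_cols = ['price', 'weekly_price', 'monthly_price', 'security_deposit', 'cleaning_fee', 'extra_people']
for col in money_cols:
    raw_listings[col] = (raw_listings[col]
                         .str.replace('$', '', regex=False)
                         .str.replace(',', '', regex=False)
                         .astype(float))
raw_listings[money_cols].describe()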
listings_df[new_price_col][:5]
price | security_deposit | cleaning_fee | guests_included | extra_people | minimum_nights | maximum_nights | weekly_price | monthly_price | |
---|---|---|---|---|---|---|---|---|---|
0 | 85.0 | NaN | NaN | 2 | 5.0 | 1 | 365 | NaN | NaN |
1 | 150.0 | 100.0 | 40.0 | 1 | 0.0 | 2 | 90 | 1.0 | 3.0 |
2 | 975.0 | 1.0 | 300.0 | 10 | 25.0 | 4 | 30 | NaN | NaN |
3 | 100.0 | NaN | NaN | 1 | 0.0 | 1 | 1125 | 650.0 | 2.0 |
4 | 450.0 | 700.0 | 125.0 | 6 | 15.0 | 1 | 1125 | NaN | NaN |
# check for missing values in the columns
listings_df[new_price_col].isnull().sum()[listings_df[new_price_col].isnull().sum()>0]
security_deposit 1952
cleaning_fee 1030
weekly_price 1809
monthly_price 2301
dtype: int64
# What are the summary statics saying
listings_df[new_price_col].describe()
price | security_deposit | cleaning_fee | guests_included | extra_people | minimum_nights | maximum_nights | weekly_price | monthly_price | |
---|---|---|---|---|---|---|---|---|---|
count | 3818.00000 | 1866.000000 | 2788.000000 | 3818.000000 | 3818.000000 | 3818.000000 | 3818.000000 | 2009.000000 | 1517.000000 |
mean | 127.71451 | 247.292069 | 61.710904 | 1.672603 | 10.667627 | 2.369303 | 780.447617 | 433.651070 | 67.986816 |
std | 89.16278 | 153.486417 | 48.830341 | 1.311040 | 17.585922 | 16.305902 | 1683.589007 | 296.675884 | 225.403680 |
min | 1.00000 | 1.000000 | 5.000000 | 0.000000 | 0.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
25% | 75.00000 | 100.000000 | 25.000000 | 1.000000 | 0.000000 | 1.000000 | 60.000000 | 232.000000 | 1.000000 |
50% | 100.00000 | 200.000000 | 50.000000 | 1.000000 | 0.000000 | 2.000000 | 1125.000000 | 476.000000 | 2.000000 |
75% | 150.00000 | 300.000000 | 83.000000 | 2.000000 | 20.000000 | 2.000000 | 1125.000000 | 650.000000 | 3.000000 |
max | 999.00000 | 995.000000 | 300.000000 | 15.000000 | 300.000000 | 1000.000000 | 100000.000000 | 999.000000 | 995.000000 |
Only about 49% of listings have a security_deposit value, so there are many missing values. The deposit has a huge range between minimum, mean and maximum, and a large standard deviation; charges do not appear to be standardized and are perhaps related to other factors like neighbourhood or property type, which we will investigate further in the EDA. maximum_nights has some very high values, and 75% of listings charge up to $20 per extra person.
fig = plt.figure(figsize = (18,7))
ax = fig.gca()
listings_df[new_price_col].hist(ax=ax)
array([[<AxesSubplot:title={'center':'price'}>,
<AxesSubplot:title={'center':'security_deposit'}>,
<AxesSubplot:title={'center':'cleaning_fee'}>],
[<AxesSubplot:title={'center':'guests_included'}>,
<AxesSubplot:title={'center':'extra_people'}>,
<AxesSubplot:title={'center':'minimum_nights'}>],
[<AxesSubplot:title={'center':'maximum_nights'}>,
<AxesSubplot:title={'center':'weekly_price'}>,
<AxesSubplot:title={'center':'monthly_price'}>]], dtype=object)
When reading these histograms it is important to remember that some extremely high values (outliers) may be distorting the distributions. The median security deposit is around $200; a little over 400 listings have deposits between $0 and $100, and roughly 200 listings each fall in the $100-150 and $300-400 bands. Of those that charge a cleaning fee, more than half charge between $0 and $60, and about a quarter charge between $60 and $120. Most listings include only one guest in the base price, and most charge between $0 and $20 per extra person.
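Because of those outliers, a clipped view of price can be easier to read; a small sketch:
# histogram of price with the top 1% clipped to reduce the effect of outliers
fig, ax = plt.subplots(figsize=(10, 4))
price_clipped = listings_df['price'].clip(upper=listings_df['price'].quantile(0.99))
price_clipped.hist(bins=50, ax=ax)
ax.set_xlabel('price ($, clipped at the 99th percentile)')
ax.set_ylabel('number of listings')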
## What's the correlation between the money columns?
_df2=listings_df[new_price_col]
_corr2=_df2.corr()
sns.heatmap(_corr2, xticklabels=_corr2.columns.values, yticklabels=_corr2.columns.values, annot = True)
<AxesSubplot:>
Price and cleaning fee are correlated; listings with higher prices tend to charge higher cleaning fees.
Location based columns
# look for columns with the word 'location' in them
location_cols = [col for col in listings_df.columns if 'location' in col]
location_cols
['host_location',
'smart_location',
'is_location_exact',
'review_scores_location']
listings_df[location_cols][:5]
host_location | smart_location | is_location_exact | review_scores_location | |
---|---|---|---|---|
0 | Seattle, Washington, United States | Seattle, WA | t | 9.0 |
1 | Seattle, Washington, United States | Seattle, WA | t | 10.0 |
2 | Seattle, Washington, United States | Seattle, WA | t | 10.0 |
3 | Seattle, Washington, United States | Seattle, WA | t | NaN |
4 | Seattle, Washington, United States | Seattle, WA | t | 9.0 |
listings_df[location_cols][:5].dtypes
host_location object
smart_location object
is_location_exact int64
review_scores_location float64
dtype: object
# is_location_exact holds 't'/'f' for true/false; replace these with '1'/'0'
listings_df['is_location_exact']=listings_df['is_location_exact'].replace(to_replace="t",value="1")
listings_df['is_location_exact']=listings_df['is_location_exact'].replace(to_replace="f",value="0")
# change is_location_exact from string to int
listings_df['is_location_exact'] = listings_df['is_location_exact'].astype(int, errors = 'raise')
listings_df[location_cols][:5]
host_location | smart_location | is_location_exact | review_scores_location | |
---|---|---|---|---|
0 | Seattle, Washington, United States | Seattle, WA | 1 | 9.0 |
1 | Seattle, Washington, United States | Seattle, WA | 1 | 10.0 |
2 | Seattle, Washington, United States | Seattle, WA | 1 | 10.0 |
3 | Seattle, Washington, United States | Seattle, WA | 1 | NaN |
4 | Seattle, Washington, United States | Seattle, WA | 1 | 9.0 |
# are there columns with missing values
listings_df[location_cols].isnull().sum()[listings_df[location_cols].isnull().sum()>0]
host_location 8
review_scores_location 655
dtype: int64
The review_scores_location column has the keyword 'location' in it but is more relevant to the review-based columns, so we drop it here. Also, on rechecking the listings data we see there are other columns that deal with location but do not have 'location' in their names. We include these along with the location columns.
location_cols=['city','state','zipcode','market','smart_location','country_code','country','host_location','is_location_exact']
listings_df[location_cols][:5]
city | state | zipcode | market | smart_location | country_code | country | host_location | is_location_exact | |
---|---|---|---|---|---|---|---|---|---|
0 | Seattle | WA | 98119 | Seattle | Seattle, WA | US | United States | Seattle, Washington, United States | 1 |
1 | Seattle | WA | 98119 | Seattle | Seattle, WA | US | United States | Seattle, Washington, United States | 1 |
2 | Seattle | WA | 98119 | Seattle | Seattle, WA | US | United States | Seattle, Washington, United States | 1 |
3 | Seattle | WA | 98119 | Seattle | Seattle, WA | US | United States | Seattle, Washington, United States | 1 |
4 | Seattle | WA | 98119 | Seattle | Seattle, WA | US | United States | Seattle, Washington, United States | 1 |
#exclude the is_location exact and find some summary stats
listings_df[['city','state','zipcode','market','smart_location','country_code','country']].describe()
city | state | zipcode | market | smart_location | country_code | country | |
---|---|---|---|---|---|---|---|
count | 3818 | 3818 | 3811 | 3818 | 3818 | 3818 | 3818 |
unique | 7 | 2 | 28 | 1 | 7 | 1 | 1 |
top | Seattle | WA | 98122 | Seattle | Seattle, WA | US | United States |
freq | 3810 | 3817 | 420 | 3818 | 3810 | 3818 | 3818 |
Zipcode has the most unique entries with 28 different values, city has 7 unique entries, and the rest are essentially constant (one market, one country code, one country). So we exclude country, country_code and state to see what we can learn. Almost all listings (about 99.8%) are in Seattle, WA; the remaining handful are variants of the same city name.
listings_df[['city','state','zipcode','smart_location']].groupby(['city']).count()
state | zipcode | smart_location | |
---|---|---|---|
city | |||
Ballard, Seattle | 1 | 1 | 1 |
Phinney Ridge Seattle | 1 | 1 | 1 |
Seattle | 3810 | 3803 | 3810 |
Seattle | 2 | 2 | 2 |
West Seattle | 2 | 2 | 2 |
seattle | 1 | 1 | 1 |
西雅图 | 1 | 1 | 1 |
(listings_df['zipcode'].value_counts()).plot(kind="bar")
<AxesSubplot:>
Some zipcodes have more listings than others: 98122, 98103 and 98102 have the most. Counts then decline fairly steadily until around 98104, after which zipcodes such as 98199, 98126, 98106, 98108, 98133 and 98136 each have roughly 50-60 listings.
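Since price is one of our data mining questions, a quick illustrative follow-up is to see whether the busier zipcodes also command higher prices:
# median nightly price by zipcode, sorted from most to least expensive
median_price_by_zip = (listings_df.groupby('zipcode')['price']
                                   .median()
                                   .sort_values(ascending=False))
median_price_by_zip.plot(kind='bar', figsize=(14, 4), title='Median listing price by zipcode')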
Host attributes columns
host_cols = [col for col in listings_df.columns if 'host' in col]
host_cols
['host_id',
'host_url',
'host_name',
'host_since',
'host_location',
'host_about',
'host_response_time',
'host_response_rate',
'host_acceptance_rate',
'host_is_superhost',
'host_thumbnail_url',
'host_picture_url',
'host_neighbourhood',
'host_listings_count',
'host_total_listings_count',
'host_verifications',
'host_has_profile_pic',
'host_identity_verified',
'calculated_host_listings_count']
listings_df[host_cols][:5]
host_id | host_url | host_name | host_since | host_location | host_about | host_response_time | host_response_rate | host_acceptance_rate | host_is_superhost | host_thumbnail_url | host_picture_url | host_neighbourhood | host_listings_count | host_total_listings_count | host_verifications | host_has_profile_pic | host_identity_verified | calculated_host_listings_count | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 956883 | https://www.airbnb.com/users/show/956883 | Maija | 2011-08-11 | Seattle, Washington, United States | I am an artist, interior designer, and run a s... | within a few hours | 96% | 100% | f | https://a0.muscache.com/ac/users/956883/profil... | https://a0.muscache.com/ac/users/956883/profil... | Queen Anne | 3.0 | 3.0 | ['email', 'phone', 'reviews', 'kba'] | t | t | 2 |
1 | 5177328 | https://www.airbnb.com/users/show/5177328 | Andrea | 2013-02-21 | Seattle, Washington, United States | Living east coast/left coast/overseas. Time i... | within an hour | 98% | 100% | t | https://a0.muscache.com/ac/users/5177328/profi... | https://a0.muscache.com/ac/users/5177328/profi... | Queen Anne | 6.0 | 6.0 | ['email', 'phone', 'facebook', 'linkedin', 're... | t | t | 6 |
2 | 16708587 | https://www.airbnb.com/users/show/16708587 | Jill | 2014-06-12 | Seattle, Washington, United States | i love living in Seattle. i grew up in the mi... | within a few hours | 67% | 100% | f | https://a1.muscache.com/ac/users/16708587/prof... | https://a1.muscache.com/ac/users/16708587/prof... | Queen Anne | 2.0 | 2.0 | ['email', 'phone', 'google', 'reviews', 'jumio'] | t | t | 2 |
3 | 9851441 | https://www.airbnb.com/users/show/9851441 | Emily | 2013-11-06 | Seattle, Washington, United States | NaN | NaN | NaN | NaN | f | https://a2.muscache.com/ac/users/9851441/profi... | https://a2.muscache.com/ac/users/9851441/profi... | Queen Anne | 1.0 | 1.0 | ['email', 'phone', 'facebook', 'reviews', 'jum... | t | t | 1 |
4 | 1452570 | https://www.airbnb.com/users/show/1452570 | Emily | 2011-11-29 | Seattle, Washington, United States | Hi, I live in Seattle, Washington but I'm orig... | within an hour | 100% | NaN | f | https://a0.muscache.com/ac/users/1452570/profi... | https://a0.muscache.com/ac/users/1452570/profi... | Queen Anne | 2.0 | 2.0 | ['email', 'phone', 'facebook', 'reviews', 'kba'] | t | t | 1 |
#create a host listings df from the listings,a df that contains only listing data
host_listings_df=listings_df[host_cols].copy()
# find some correlations on the host listing columns
host_listings_df.corr()
host_id | host_listings_count | host_total_listings_count | calculated_host_listings_count | |
---|---|---|---|---|
host_id | 1.000000 | -0.069613 | -0.069613 | -0.176040 |
host_listings_count | -0.069613 | 1.000000 | 1.000000 | 0.224222 |
host_total_listings_count | -0.069613 | 1.000000 | 1.000000 | 0.224222 |
calculated_host_listings_count | -0.176040 | 0.224222 | 0.224222 | 1.000000 |
The corr() function returns only the four numeric features: host_id, host_listings_count, host_total_listings_count and calculated_host_listings_count. There is no description from Airbnb of what these columns mean. We see a perfect (1.0) correlation between host_listings_count and host_total_listings_count and only weak correlations elsewhere; these two columns might be the same thing.
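A quick check of the assumption suggested by the perfect correlation, namely that the two columns really are identical:
# verify that host_listings_count and host_total_listings_count carry the same values
same = (host_listings_df['host_listings_count']
        .fillna(-1)
        .eq(host_listings_df['host_total_listings_count'].fillna(-1))
        .all())
print("host_listings_count identical to host_total_listings_count:", same)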
#host_listings_df
pct_to_float = lambda x: x.str.replace(r'%', r'.0').astype('float') / 100.0
# note that the percentages are now in float e.g 96% is 0.96
# Apply the function to the rate cols
host_listings_df[['host_response_rate','host_acceptance_rate']] = host_listings_df[['host_response_rate','host_acceptance_rate']].apply(pct_to_float, axis=1)
host_listings_df[:5]
host_id | host_url | host_name | host_since | host_location | host_about | host_response_time | host_response_rate | host_acceptance_rate | host_is_superhost | host_thumbnail_url | host_picture_url | host_neighbourhood | host_listings_count | host_total_listings_count | host_verifications | host_has_profile_pic | host_identity_verified | calculated_host_listings_count | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 956883 | https://www.airbnb.com/users/show/956883 | Maija | 2011-08-11 | Seattle, Washington, United States | I am an artist, interior designer, and run a s... | within a few hours | 0.96 | 1.0 | f | https://a0.muscache.com/ac/users/956883/profil... | https://a0.muscache.com/ac/users/956883/profil... | Queen Anne | 3.0 | 3.0 | ['email', 'phone', 'reviews', 'kba'] | t | t | 2 |
1 | 5177328 | https://www.airbnb.com/users/show/5177328 | Andrea | 2013-02-21 | Seattle, Washington, United States | Living east coast/left coast/overseas. Time i... | within an hour | 0.98 | 1.0 | t | https://a0.muscache.com/ac/users/5177328/profi... | https://a0.muscache.com/ac/users/5177328/profi... | Queen Anne | 6.0 | 6.0 | ['email', 'phone', 'facebook', 'linkedin', 're... | t | t | 6 |
2 | 16708587 | https://www.airbnb.com/users/show/16708587 | Jill | 2014-06-12 | Seattle, Washington, United States | i love living in Seattle. i grew up in the mi... | within a few hours | 0.67 | 1.0 | f | https://a1.muscache.com/ac/users/16708587/prof... | https://a1.muscache.com/ac/users/16708587/prof... | Queen Anne | 2.0 | 2.0 | ['email', 'phone', 'google', 'reviews', 'jumio'] | t | t | 2 |
3 | 9851441 | https://www.airbnb.com/users/show/9851441 | Emily | 2013-11-06 | Seattle, Washington, United States | NaN | NaN | NaN | NaN | f | https://a2.muscache.com/ac/users/9851441/profi... | https://a2.muscache.com/ac/users/9851441/profi... | Queen Anne | 1.0 | 1.0 | ['email', 'phone', 'facebook', 'reviews', 'jum... | t | t | 1 |
4 | 1452570 | https://www.airbnb.com/users/show/1452570 | Emily | 2011-11-29 | Seattle, Washington, United States | Hi, I live in Seattle, Washington but I'm orig... | within an hour | 1.00 | NaN | f | https://a0.muscache.com/ac/users/1452570/profi... | https://a0.muscache.com/ac/users/1452570/profi... | Queen Anne | 2.0 | 2.0 | ['email', 'phone', 'facebook', 'reviews', 'kba'] | t | t | 1 |
We begin to investigate the host columns; there are a number of features, so we look at them in smaller batches, starting with the rates.
host_listings_df[['host_response_rate','host_acceptance_rate']].hist()
array([[<AxesSubplot:title={'center':'host_response_rate'}>,
<AxesSubplot:title={'center':'host_acceptance_rate'}>]],
dtype=object)
Where it is recorded, the host acceptance rate is essentially always 100%, so there are effectively no rejections from hosts. The host response rate is also almost always 100%; very few hosts have a response rate below 80%.
One of the data mining questions concerns the growth in Airbnb listings, so we look at the host_since feature.
# convert it to date time
host_listings_df['host_since']=pd.to_datetime(host_listings_df['host_since'])
# split the date time into week,month and year
host_listings_df['dayofweek'] = host_listings_df['host_since'].dt.dayofweek
host_listings_df['weekofyear'] = host_listings_df['host_since'].dt.isocalendar().week
host_listings_df['month'] = host_listings_df['host_since'].dt.month
host_listings_df['year'] = host_listings_df['host_since'].dt.year
host_listings_df[:5]
host_id | host_url | host_name | host_since | host_location | host_about | host_response_time | host_response_rate | host_acceptance_rate | host_is_superhost | ... | host_listings_count | host_total_listings_count | host_verifications | host_has_profile_pic | host_identity_verified | calculated_host_listings_count | dayofweek | weekofyear | month | year | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 956883 | https://www.airbnb.com/users/show/956883 | Maija | 2011-08-11 | Seattle, Washington, United States | I am an artist, interior designer, and run a s... | within a few hours | 0.96 | 1.0 | f | ... | 3.0 | 3.0 | ['email', 'phone', 'reviews', 'kba'] | t | t | 2 | 3.0 | 32 | 8.0 | 2011.0 |
1 | 5177328 | https://www.airbnb.com/users/show/5177328 | Andrea | 2013-02-21 | Seattle, Washington, United States | Living east coast/left coast/overseas. Time i... | within an hour | 0.98 | 1.0 | t | ... | 6.0 | 6.0 | ['email', 'phone', 'facebook', 'linkedin', 're... | t | t | 6 | 3.0 | 8 | 2.0 | 2013.0 |
2 | 16708587 | https://www.airbnb.com/users/show/16708587 | Jill | 2014-06-12 | Seattle, Washington, United States | i love living in Seattle. i grew up in the mi... | within a few hours | 0.67 | 1.0 | f | ... | 2.0 | 2.0 | ['email', 'phone', 'google', 'reviews', 'jumio'] | t | t | 2 | 3.0 | 24 | 6.0 | 2014.0 |
3 | 9851441 | https://www.airbnb.com/users/show/9851441 | Emily | 2013-11-06 | Seattle, Washington, United States | NaN | NaN | NaN | NaN | f | ... | 1.0 | 1.0 | ['email', 'phone', 'facebook', 'reviews', 'jum... | t | t | 1 | 2.0 | 45 | 11.0 | 2013.0 |
4 | 1452570 | https://www.airbnb.com/users/show/1452570 | Emily | 2011-11-29 | Seattle, Washington, United States | Hi, I live in Seattle, Washington but I'm orig... | within an hour | 1.00 | NaN | f | ... | 2.0 | 2.0 | ['email', 'phone', 'facebook', 'reviews', 'kba'] | t | t | 1 | 1.0 | 48 | 11.0 | 2011.0 |
5 rows × 23 columns
We want to see on which days of the week and weeks of the year hosts most often joined, and what the trends are.
host_listings_df[['host_since','dayofweek','weekofyear','month','year']].describe()
dayofweek | weekofyear | month | year | |
---|---|---|---|---|
count | 3816.000000 | 3816.000000 | 3816.000000 | 3816.000000 |
mean | 2.634696 | 26.816038 | 6.551887 | 2013.214623 |
std | 2.013244 | 13.972345 | 3.208349 | 1.560423 |
min | 0.000000 | 1.000000 | 1.000000 | 2008.000000 |
25% | 1.000000 | 15.000000 | 4.000000 | 2012.000000 |
50% | 2.000000 | 28.000000 | 7.000000 | 2013.000000 |
75% | 4.000000 | 38.000000 | 9.000000 | 2015.000000 |
max | 6.000000 | 53.000000 | 12.000000 | 2016.000000 |
The median host joined around mid-2013, and the median join day of the week is 2, i.e. Wednesday (pandas numbers the days from Monday = 0). The maximum year is 2016 because this dataset was compiled in early 2016.
host_listings_df[['host_since','dayofweek','weekofyear','month']].hist()
array([[<AxesSubplot:title={'center':'dayofweek'}>,
<AxesSubplot:title={'center':'weekofyear'}>],
[<AxesSubplot:title={'center':'month'}>, <AxesSubplot:>]],
dtype=object)
The largest numbers of hosts joined on days 0 and 1 of the week, i.e. Monday and Tuesday (pandas numbers the days from Monday = 0). Joins are spread across the whole year, with a concentration between weeks 26 and 38; the month distribution gives an even clearer picture, showing peaks at the beginning, middle and end of the year.
host_listings_df['host_since'].hist()
<AxesSubplot:>
This answers one of our data mining questions on whether there has been an increase in hosts joining Airbnb. Since Airbnb started in 2008 there has been huge growth, with the number of new hosts rising year after year. Airbnb is experiencing significant growth.
host_listings_df[['host_since','year','host_neighbourhood']].groupby(['year']).count()
host_since | host_neighbourhood | |
---|---|---|
year | ||
2008.0 | 4 | 4 |
2009.0 | 64 | 63 |
2010.0 | 149 | 146 |
2011.0 | 398 | 389 |
2012.0 | 539 | 514 |
2013.0 | 776 | 723 |
2014.0 | 900 | 809 |
2015.0 | 981 | 866 |
2016.0 | 5 | 4 |
This further confirms the upward trend in Airbnb hosts between 2008 and 2016. The number of new hosts per year grew from 4 in 2008 to 981 in 2015, for a cumulative total of about 3,816 hosts with a recorded join date. Between 2009 and 2011 the number of new hosts roughly doubled each year, and from 2011 the growth rate steadied to roughly 10-40% per year. 2016 has a surprisingly low number, probably because of when this data was collected; we will investigate this. Overall, between 2008 and 2016 Airbnb listings in Seattle grew substantially. A small sketch quantifying the year-over-year growth follows.
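To put rough numbers on the growth claim, the sketch below computes the year-over-year change in new hosts, excluding 2016 since the data only covers its first few days:
# year-over-year percentage change in the number of new hosts
new_hosts_per_year = host_listings_df.groupby('year')['host_since'].count()
yoy_growth = new_hosts_per_year.loc[:2015].pct_change() * 100
print(yoy_growth.round(1))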
host_listings_df['host_since'].describe(datetime_is_numeric=True)
count 3816
mean 2013-09-18 17:26:02.264153856
min 2008-11-10 00:00:00
25% 2012-08-17 00:00:00
50% 2013-12-12 12:00:00
75% 2015-01-14 00:00:00
max 2016-01-03 00:00:00
Name: host_since, dtype: object
An investigation into the host_since column reveals why 2016 has such a low count. The dataset was collected on 2016-01-03, and in just the first three days of January there were already 5 new hosts, about 1.7 per day.
listings_df[['host_since','neighbourhood_cleansed']].describe()
host_since | neighbourhood_cleansed | |
---|---|---|
count | 3816 | 3818 |
unique | 1380 | 87 |
top | 2013-08-30 | Broadway |
freq | 51 | 397 |
There are 87 neighbourhoods listed. Broadway is the most frequently listed neighbourhood which occurs 397 times
host_listings_df['host_is_superhost'].value_counts()/host_listings_df.shape[0]
f 0.795705
t 0.203772
Name: host_is_superhost, dtype: float64
For about 80% of listings the host is not a superhost.
host_listings_df['host_has_profile_pic'].value_counts()/host_listings_df.shape[0]
t 0.997643
f 0.001833
Name: host_has_profile_pic, dtype: float64
Nearly every host has a profile picture; only about 0.2% of listings (roughly 7) have a host without one.
host_listings_df['host_identity_verified'].value_counts()/host_listings_df.shape[0]
t 0.784966
f 0.214510
Name: host_identity_verified, dtype: float64
Most hosts' identities are verified; only about 21% are not.
host_listings_df['host_verifications'].describe()
count 3818
unique 116
top ['email', 'phone', 'facebook', 'reviews', 'kba']
freq 595
Name: host_verifications, dtype: object
Every listing has a host_verifications value, with 116 unique combinations. The field is not documented, so we assume it lists the channels a host has provided to verify their identity. The most common combination is email, phone, facebook, reviews and kba (likely knowledge-based authentication).
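Because host_verifications is stored as a string that looks like a Python list, we can parse it (a small sketch; unparseable or missing values are treated as empty) and count how often each individual method appears:
import ast

def parse_verifications(value):
    # host_verifications looks like "['email', 'phone', 'kba']"; parse it safely
    try:
        return ast.literal_eval(value)
    except (ValueError, SyntaxError):
        return []

methods = host_listings_df['host_verifications'].apply(parse_verifications)
method_counts = pd.Series([m for lst in methods for m in lst]).value_counts()
print(method_counts.head(10))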
host_listings_df['host_response_time'].value_counts()/host_listings_df.shape[0]
within an hour 0.443164
within a few hours 0.253536
within a day 0.156365
a few days or more 0.009953
Name: host_response_time, dtype: float64
44% of hosts respond within an hour and 25% within a few hours, so roughly 70% of hosts respond to guests promptly. Another 16% respond within a day, and only about 1% take a few days or more; essentially all hosts who respond do so within a day.
host_listings_df['host_thumbnail_url'].describe()
count 3816
unique 2743
top https://a2.muscache.com/ac/pictures/a4d7d053-c...
freq 46
Name: host_thumbnail_url, dtype: object
host_listings_df['host_url'].describe()
count 3818
unique 2751
top https://www.airbnb.com/users/show/8534462
freq 46
Name: host_url, dtype: object
Every listing has a host URL. There are 2,751 unique URLs across 3,818 listings, so some hosts have several listings.
host_listings_df['host_name'].describe()
count 3816
unique 1466
top Andrew
freq 56
Name: host_name, dtype: object
host_listings_df['host_about'].describe()
count 2959
unique 2011
top It would be my pleasure to share and explore t...
freq 46
Name: host_about, dtype: object
Every host also has a thumbnail picture, but they are not all unique (2,743 unique thumbnails); the most frequent one appears 46 times, again reflecting a host with many listings.
host_listings_df['host_neighbourhood'].describe()
count 3518
unique 102
top Capitol Hill
freq 405
Name: host_neighbourhood, dtype: object
Capitol Hill is the most common host neighbourhood, appearing 405 times.
count_vals = host_listings_df.host_neighbourhood.value_counts()
(count_vals[:20]/host_listings_df.shape[0]).plot(kind="bar");
plt.title("Neighbourhood ");
Policies columns
Here we look at columns related to host policies, such as requires_license, license, jurisdiction_names, instant_bookable, cancellation_policy, require_guest_profile_picture and require_guest_phone_verification.
policies=['requires_license', 'license', 'jurisdiction_names', 'instant_bookable', 'cancellation_policy','experiences_offered', 'require_guest_profile_picture', 'require_guest_phone_verification']
policies_df=listings_df[policies]
policies_df[:5]
requires_license | license | jurisdiction_names | instant_bookable | cancellation_policy | experiences_offered | require_guest_profile_picture | require_guest_phone_verification | |
---|---|---|---|---|---|---|---|---|
0 | f | NaN | WASHINGTON | f | moderate | none | f | f |
1 | f | NaN | WASHINGTON | f | strict | none | t | t |
2 | f | NaN | WASHINGTON | f | strict | none | f | f |
3 | f | NaN | WASHINGTON | f | flexible | none | f | f |
4 | f | NaN | WASHINGTON | f | strict | none | f | f |
policies_df['requires_license'].value_counts()/policies_df.shape[0]
f 1.0
Name: requires_license, dtype: float64
None of the listings require a license.
policies_df['require_guest_phone_verification'].value_counts()/policies_df.shape[0]
f 0.901781
t 0.098219
Name: require_guest_phone_verification, dtype: float64
90% of hosts do not require guest phone verification
policies_df['require_guest_profile_picture'].value_counts()/policies_df.shape[0]
f 0.915925
t 0.084075
Name: require_guest_profile_picture, dtype: float64
91% of hosts do not require guest profile picture
policies_df['cancellation_policy'].value_counts()/policies_df.shape[0]
strict 0.371137
moderate 0.327658
flexible 0.301205
Name: cancellation_policy, dtype: float64
On cancellation policy, hosts are split almost evenly between strict, moderate and flexible, with 37% being strict.
policies_df['instant_bookable'].value_counts()/policies_df.shape[0]
f 0.845207
t 0.154793
Name: instant_bookable, dtype: float64
About 84.5% of listings are not instant bookable
policies_df['experiences_offered'].value_counts()
none 3818
Name: experiences_offered, dtype: int64
available=['has_availability',
'availability_30', 'availability_60', 'availability_90',
'availability_365']
available_df=listings_df[available]
available_df[:5]
has_availability | availability_30 | availability_60 | availability_90 | availability_365 | |
---|---|---|---|---|---|
0 | t | 14 | 41 | 71 | 346 |
1 | t | 13 | 13 | 16 | 291 |
2 | t | 1 | 6 | 17 | 220 |
3 | t | 0 | 0 | 0 | 143 |
4 | t | 30 | 60 | 90 | 365 |
available_df['has_availability'].value_counts()
t 3818
Name: has_availability, dtype: int64
All listings have has_availability set to true. We just have to inspect how much availability they actually have.
available_df[['availability_30', 'availability_60', 'availability_90', 'availability_365']].describe()
availability_30 | availability_60 | availability_90 | availability_365 | |
---|---|---|---|---|
count | 3818.000000 | 3818.000000 | 3818.000000 | 3818.000000 |
mean | 16.786276 | 36.814825 | 58.082504 | 244.772656 |
std | 12.173637 | 23.337541 | 34.063845 | 126.772526 |
min | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
25% | 2.000000 | 13.000000 | 28.000000 | 124.000000 |
50% | 20.000000 | 46.000000 | 73.000000 | 308.000000 |
75% | 30.000000 | 59.000000 | 89.000000 | 360.000000 |
max | 30.000000 | 60.000000 | 90.000000 | 365.000000 |
On average, a listing is available for about 17 of the next 30 days, 37 of the next 60 days, 58 of the next 90 days, and about 245 of the next 365 days.
Property related columns
property_=['id','property_type', 'room_type', 'accommodates',
'bathrooms', 'bedrooms', 'beds', 'bed_type','amenities']
property_df=listings_df[property_]
property_df[:5]
id | property_type | room_type | accommodates | bathrooms | bedrooms | beds | bed_type | amenities | |
---|---|---|---|---|---|---|---|---|---|
0 | 241032 | Apartment | Entire home/apt | 4 | 1.0 | 1.0 | 1.0 | Real Bed | {TV,"Cable TV",Internet,"Wireless Internet","A... |
1 | 953595 | Apartment | Entire home/apt | 4 | 1.0 | 1.0 | 1.0 | Real Bed | {TV,Internet,"Wireless Internet",Kitchen,"Free... |
2 | 3308979 | House | Entire home/apt | 11 | 4.5 | 5.0 | 7.0 | Real Bed | {TV,"Cable TV",Internet,"Wireless Internet","A... |
3 | 7421966 | Apartment | Entire home/apt | 3 | 1.0 | 0.0 | 2.0 | Real Bed | {Internet,"Wireless Internet",Kitchen,"Indoor ... |
4 | 278830 | House | Entire home/apt | 6 | 2.0 | 3.0 | 3.0 | Real Bed | {TV,"Cable TV",Internet,"Wireless Internet",Ki... |
property_df.groupby(['room_type']).id.count()/property_df.shape[0]
room_type
Entire home/apt 0.665532
Private room 0.303824
Shared room 0.030644
Name: id, dtype: float64
(property_df.groupby(['room_type']).id.count()/property_df.shape[0]).plot(kind="pie")
<AxesSubplot:ylabel='id'>
66.5% of listings are entire homes/apartments, 30.4% are private rooms, and only 3% are shared rooms.
property_df.groupby(['property_type']).id.count()/property_df.shape[0]
property_type
Apartment 0.447355
Bed & Breakfast 0.009691
Boat 0.002095
Bungalow 0.003405
Cabin 0.005500
Camper/RV 0.003405
Chalet 0.000524
Condominium 0.023834
Dorm 0.000524
House 0.453903
Loft 0.010477
Other 0.005762
Tent 0.001310
Townhouse 0.030906
Treehouse 0.000786
Yurt 0.000262
Name: id, dtype: float64
(property_df.groupby(['property_type']).id.count()/property_df.shape[0]).plot(kind="bar")
<AxesSubplot:xlabel='property_type'>
Houses and apartments are the most common property types, at 45.4% and 44.7% respectively; bed & breakfasts make up about 1%, and the rest is spread across the other property types.
property_df.groupby(['bed_type']).id.count()/property_df.shape[0]
bed_type
Airbed 0.007072
Couch 0.003405
Futon 0.019382
Pull-out Sofa 0.012310
Real Bed 0.957831
Name: id, dtype: float64
(property_df.groupby(['bed_type']).id.count()/property_df.shape[0]).plot(kind="bar")
<AxesSubplot:xlabel='bed_type'>
About 96% of listings offer a real bed, roughly 2% are futons, and the remainder are airbeds, couches or pull-out sofas.
property_df.groupby(['accommodates']).id.count()/property_df.shape[0]
accommodates
1 0.067051
2 0.426139
3 0.104243
4 0.205605
5 0.048193
6 0.086957
7 0.013620
8 0.031168
9 0.003405
10 0.006548
11 0.000786
12 0.003929
14 0.000786
15 0.000524
16 0.001048
Name: id, dtype: float64
(property_df.groupby(['accommodates']).id.count()/property_df.shape[0]).plot(kind="bar")
<AxesSubplot:xlabel='accommodates'>
42% of listings accommodate a maximum of 2 guests, 20% accommodate 4, 10% accommodate 3, and smaller percentages accommodate more than 4 people. This makes sense since most listings are houses and apartments.
property_df.groupby(['bedrooms']).id.count()/property_df.shape[0]
bedrooms
0.0 0.097433
1.0 0.633054
2.0 0.167627
3.0 0.074123
4.0 0.018072
5.0 0.006286
6.0 0.001572
7.0 0.000262
Name: id, dtype: float64
(property_df.groupby(['bedrooms']).id.count()/property_df.shape[0]).plot(kind="bar")
<AxesSubplot:xlabel='bedrooms'>
63% of listings offer one bedroom, 16.8% have two bedrooms, and about 10% have no separate bedroom at all.
amenities_df=property_df['amenities']
Amenities are in textual form but can be made into categories since there are frequently recurring features. We can also use a word cloud to map some of the most frequently occurring words.
# drop empty rows in amenities
amenities_df = amenities_df[amenities_df != '{}']
# create a list of amenities, removing the {} and "" and split each word with a comma
amenities_list = []
for index, row in amenities_df.items():
amenities_list.append(row.replace('{','').replace('}','').replace('"','').split(','))
amenities_list[:3]
[['TV',
'Cable TV',
'Internet',
'Wireless Internet',
'Air Conditioning',
'Kitchen',
'Heating',
'Family/Kid Friendly',
'Washer',
'Dryer'],
['TV',
'Internet',
'Wireless Internet',
'Kitchen',
'Free Parking on Premises',
'Buzzer/Wireless Intercom',
'Heating',
'Family/Kid Friendly',
'Washer',
'Dryer',
'Smoke Detector',
'Carbon Monoxide Detector',
'First Aid Kit',
'Safety Card',
'Fire Extinguisher',
'Essentials'],
['TV',
'Cable TV',
'Internet',
'Wireless Internet',
'Air Conditioning',
'Kitchen',
'Free Parking on Premises',
'Pets Allowed',
'Pets live on this property',
'Dog(s)',
'Cat(s)',
'Hot Tub',
'Indoor Fireplace',
'Heating',
'Family/Kid Friendly',
'Washer',
'Dryer',
'Smoke Detector',
'Carbon Monoxide Detector',
'Essentials',
'Shampoo']]
Now that we have amenities as a list, we can add this as a column and one-hot encode it to represent the categorical variables. Stack Overflow and Kaggle have explanations on how this is done; we will use a snippet from there.
new_amenities_df = pd.Series(amenities_list, name = 'amenities').to_frame()
new_amenities_df[:5]
amenities | |
---|---|
0 | [TV, Cable TV, Internet, Wireless Internet, Ai... |
1 | [TV, Internet, Wireless Internet, Kitchen, Fre... |
2 | [TV, Cable TV, Internet, Wireless Internet, Ai... |
3 | [Internet, Wireless Internet, Kitchen, Indoor ... |
4 | [TV, Cable TV, Internet, Wireless Internet, Ki... |
dummies_amenities_df = new_amenities_df.drop('amenities', 1).join(
pd.get_dummies(
pd.DataFrame(new_amenities_df.amenities.tolist()).stack()
).astype(int).sum(level=0)
)
dummies_amenities_df
24-Hour Check-in | Air Conditioning | Breakfast | Buzzer/Wireless Intercom | Cable TV | Carbon Monoxide Detector | Cat(s) | Dog(s) | Doorman | Dryer | ... | Safety Card | Shampoo | Smoke Detector | Smoking Allowed | Suitable for Events | TV | Washer | Washer / Dryer | Wheelchair Accessible | Wireless Internet | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 |
1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | ... | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 1 |
2 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 1 | ... | 0 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 1 |
3 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | ... | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
4 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
3768 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 1 | 1 |
3769 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
3770 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 1 | 0 | 1 | ... | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
3771 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
3772 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | ... | 0 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 1 |
3773 rows × 41 columns
Now we have all the amenities as one-hot encoded columns.
dummies_amenities_df.columns
Index(['24-Hour Check-in', 'Air Conditioning', 'Breakfast',
'Buzzer/Wireless Intercom', 'Cable TV', 'Carbon Monoxide Detector',
'Cat(s)', 'Dog(s)', 'Doorman', 'Dryer', 'Elevator in Building',
'Essentials', 'Family/Kid Friendly', 'Fire Extinguisher',
'First Aid Kit', 'Free Parking on Premises', 'Gym', 'Hair Dryer',
'Hangers', 'Heating', 'Hot Tub', 'Indoor Fireplace', 'Internet', 'Iron',
'Kitchen', 'Laptop Friendly Workspace', 'Lock on Bedroom Door',
'Other pet(s)', 'Pets Allowed', 'Pets live on this property', 'Pool',
'Safety Card', 'Shampoo', 'Smoke Detector', 'Smoking Allowed',
'Suitable for Events', 'TV', 'Washer', 'Washer / Dryer',
'Wheelchair Accessible', 'Wireless Internet'],
dtype='object')
dummies_amenities_df.sum()
24-Hour Check-in 616
Air Conditioning 677
Breakfast 291
Buzzer/Wireless Intercom 538
Cable TV 1446
Carbon Monoxide Detector 2485
Cat(s) 382
Dog(s) 509
Doorman 85
Dryer 2997
Elevator in Building 785
Essentials 3237
Family/Kid Friendly 1963
Fire Extinguisher 2196
First Aid Kit 1680
Free Parking on Premises 2167
Gym 442
Hair Dryer 774
Hangers 846
Heating 3627
Hot Tub 303
Indoor Fireplace 886
Internet 2811
Iron 742
Kitchen 3423
Laptop Friendly Workspace 745
Lock on Bedroom Door 100
Other pet(s) 51
Pets Allowed 472
Pets live on this property 883
Pool 159
Safety Card 727
Shampoo 2670
Smoke Detector 3281
Smoking Allowed 82
Suitable for Events 209
TV 2574
Washer 2992
Washer / Dryer 2
Wheelchair Accessible 300
Wireless Internet 3667
dtype: int64
dummies_amenities_df.sum().sort_values(ascending = False).plot(kind='bar', figsize = (15,5));
There is a long list of amenities on offer. 'Washer / Dryer' appears only twice, so it is probably the only amenity that is insignificant.
amenitiesDF = listings_df[['amenities','price','id',]]
amenitiesDFTopper = amenitiesDF.sort_values('price',ascending=[0])
amenitiesDFtop=amenitiesDFTopper.head(30)
allemenities = ''
for index,row in amenitiesDFtop.iterrows():
p = re.sub('[^a-zA-Z]+',' ', row['amenities'])
allemenities+=p
allemenities_data=nltk.word_tokenize(allemenities)
filtered_data=[word for word in allemenities_data if word not in stopwords.words('english')]
wnl = nltk.WordNetLemmatizer()
allemenities_data=[wnl.lemmatize(data) for data in filtered_data]
allemenities_words=' '.join(allemenities_data)
plot_display(allemenities_words)
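The plot_display function called above is assumed to be a word-cloud helper defined earlier in the notebook; a minimal sketch of such a helper, assuming the wordcloud package is installed, could look like this:
from wordcloud import WordCloud
import matplotlib.pyplot as plt

def plot_display(text):
    # Sketch only: build a word cloud from a space-separated string of words and display it
    wordcloud = WordCloud(width=800, height=400, background_color='white').generate(text)
    plt.figure(figsize=(12, 6))
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis('off')
    plt.show()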
Of all the amenities that we found and listed, the ones shown on the word cloud are the ones most strongly related to price. Some of the main ones include Wireless Internet, smoke detector, carbon monoxide detector, essentials like shampoo, free parking, and family/kid friendly. These improve customer experience and also contribute to pricing: the more amenities the host adds, the pricier the listing.
EDA
While we have already done most of the EDA during data understanding, there are a few aspects that we haven't covered. For instance, on the calendar data set we have not investigated availability and price in relation to day of the week, week of the year, month of the year and year. We have also yet to determine whether there are factors/features that relate to or affect price. We also want to uncover neighbourhood trends and find out which times are busiest in Seattle and how prices are affected by those changes, in answer to some of our data mining objectives.
fig = plt.figure(figsize = (15,10))
ax = fig.gca()
listings_df.plot.scatter('longitude', 'latitude', ax=ax)
<AxesSubplot:xlabel='longitude', ylabel='latitude'>
Revisiting the review data
nltk.download('vader_lexicon')
!pip install langdetect
reviewsDF=reviews_df
We want to understand the general sentiment in the comments: whether they are positive, negative or neutral. To do that, we use NLTK's built-in VADER sentiment analyser to read and score the polarity of each comment.
# The dataframe contains reviews in several languages, so we remove the comments that are not in English
from langdetect import detect
def detect_lang(sente):
sente=str(sente)
try:
return detect(sente)
except:
return "None"
for index,row in reviewsDF.iterrows():
lang=detect_lang(row['comments'])
reviewsDF.at[index,'language'] = lang
# print(lang)
#taking rows whose language is English
EngReviewsDF=reviewsDF[reviewsDF.language=='en']
EngReviewsDF[:5]
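The 'pos', 'neg' and 'neu' columns used in the next cells come from scoring each English comment with VADER; that scoring cell is not shown here, but a minimal sketch of it, assuming the vader_lexicon downloaded above, could look like this:
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# Sketch only: score every English comment and keep the polarity components as columns
EngReviewsDF = EngReviewsDF.copy()
sid = SentimentIntensityAnalyzer()
for index, row in EngReviewsDF.iterrows():
    scores = sid.polarity_scores(str(row['comments']))
    EngReviewsDF.at[index, 'pos'] = scores['pos']
    EngReviewsDF.at[index, 'neg'] = scores['neg']
    EngReviewsDF.at[index, 'neu'] = scores['neu']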
We check for positive polarity in the sentiments on the reviews section
polarDF=EngReviewsDF[['pos']]
polarDF=polarDF.groupby(pd.cut(polarDF["pos"], np.arange(0, 1.1, 0.1))).count()
polarDF=polarDF.rename(columns={'pos':'count_of_Comments'})
polarDF=polarDF.reset_index()
polarDF=polarDF.rename(columns={'pos':'range_i'})
for i,r in polarDF.iterrows():
polarDF.at[i,'RANGE'] = float(str(r['range_i'])[1:4].replace(',',''))
polarDF.at[i,'Sentiment'] = 'positive'
del polarDF['range_i']
polarDF.head()
count_of_Comments | RANGE | Sentiment | |
---|---|---|---|
0 | 202 | 0.0 | positive |
1 | 1792 | 0.1 | positive |
2 | 3712 | 0.2 | positive |
3 | 2890 | 0.3 | positive |
4 | 1319 | 0.4 | positive |
We check for negative polarity as well
polarDFneg=EngReviewsDF[['neg']]
polarDFneg=polarDFneg.groupby(pd.cut(polarDFneg["neg"], np.arange(0, 1.1, 0.1))).count()
polarDFneg=polarDFneg.rename(columns={'neg':'count_of_Comments'})
polarDFneg=polarDFneg.reset_index()
polarDFneg=polarDFneg.rename(columns={'neg':'range_i'})
for i,r in polarDFneg.iterrows():
polarDFneg.at[i,'RANGE'] = float(str(r['range_i'])[1:4].replace(',',''))
polarDFneg.at[i,'Sentiment'] = 'negative'
del polarDFneg['range_i']
for i,r in polarDFneg.iterrows():
polarDF=polarDF.append(pd.Series([r[0],r[1],r[2]],index=['count_of_Comments','RANGE','Sentiment']),ignore_index=True)
polarDFneg.head()
count_of_Comments | RANGE | Sentiment | |
---|---|---|---|
0 | 3180 | 0.0 | negative |
1 | 130 | 0.1 | negative |
2 | 8 | 0.2 | negative |
3 | 1 | 0.3 | negative |
4 | 0 | 0.4 | negative |
Checking for neutral sentiments as well
polarDFneut=EngReviewsDF[['neu']]
polarDFneut=polarDFneut.groupby(pd.cut(polarDFneut["neu"], np.arange(0, 1.0, 0.1))).count()
polarDFneut=polarDFneut.rename(columns={'neu':'count_of_Comments'})
polarDFneut=polarDFneut.reset_index()
polarDFneut=polarDFneut.rename(columns={'neu':'range_i'})
for i,r in polarDFneut.iterrows():
polarDFneut.at[i,'RANGE'] = float(str(r['range_i'])[1:4].replace(',',''))
polarDFneut.at[i,'Sentiment'] = 'neutral'
del polarDFneut['range_i']
for i,r in polarDFneut.iterrows():
polarDF=polarDF.append(pd.Series([r[0],r[1],r[2]],index=['count_of_Comments','RANGE','Sentiment']),ignore_index=True)
polarDFneut.head()
count_of_Comments | RANGE | Sentiment | |
---|---|---|---|
0 | 1 | 0.0 | neutral |
1 | 17 | 0.1 | neutral |
2 | 83 | 0.2 | neutral |
3 | 208 | 0.3 | neutral |
4 | 560 | 0.4 | neutral |
Then we plot the polarity of the sentiments against the number of comments to see the general sentiment distribution.
plt.figure(figsize=(10,10))
sns.catplot(data=polarDF, x="RANGE", y="count_of_Comments", col="Sentiment", kind="point")
<seaborn.axisgrid.FacetGrid at 0x7f5da14bc3d0>
<Figure size 720x720 with 0 Axes>
There are very few negative comments; neutral sentiment scores are skewed towards the upper end of the range, and positive scores mostly fall between 0.1 and 0.4.
inputDF = pd.read_csv("listings.csv", delimiter=',')
inputDF=inputDF[['number_of_reviews','price','review_scores_rating']]
# replacing NaN values with 0
inputDF.fillna(0, inplace=True)
#Extracting prices from the table
price = inputDF['price']
prices=[]
#clean the data to make it float
for p in price:
p=float(p[1:].replace(',',''))
prices.append(p)
#replace the price column with the new column
inputDF['price']=prices
price_review = inputDF[['number_of_reviews', 'price']].sort_values(by = 'price')
price_review.plot(x = 'price',
y = 'number_of_reviews',
style = 'o',
figsize =(12,8),
legend = False,
title = 'Reviews based on Price')
plt.xlabel("price")
plt.ylabel("Number of reviews")
Text(0, 0.5, 'Number of reviews')
Reviews are concentrated on listings priced up to about $250. This suggests that customers prefer and book cheaper listings more often, hence the larger number of reviews.
Revisiting the calendar data
calender_df[:5]
listing_id | date | available | price | dayofweek | weekofyear | month | year | day_Name | holiday | us_holidays_name | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 241032 | 2016-01-04 | 1 | 85.000000 | 0 | 1 | 1 | 2016 | default | False | working |
1 | 241032 | 2016-01-05 | 1 | 85.000000 | 1 | 1 | 1 | 2016 | default | False | working |
2 | 241032 | 2016-01-06 | 0 | 137.090652 | 2 | 1 | 1 | 2016 | default | False | working |
3 | 241032 | 2016-01-07 | 0 | 137.090652 | 3 | 1 | 1 | 2016 | default | False | working |
4 | 241032 | 2016-01-08 | 0 | 137.090652 | 4 | 1 | 1 | 2016 | default | False | working |
fig, axs = plt.subplots(1, 3, figsize=(16, 5))
sns.boxplot(data=calender_df, x='month', y='price', ax=axs[0], fliersize=0)
sns.boxplot(data=calender_df, x='dayofweek', y='price', ax=axs[1], fliersize=0)
sns.boxplot(data=calender_df, x='year', y='price', ax=axs[2], fliersize=0)
for ax in axs:
ax.set_ylim(0, 350)
The price of listings is highest towards the end of the week and in months 6 to 9 of the year.
# we modify the data to include average price and a year and month column
yearDF=calender_df.groupby(['year','month']).price.mean()
yearDF=yearDF.reset_index()
yearDF=yearDF.rename(columns={'price':'average_Price'})
yearDF['year-Month']=yearDF['year'].map(str) + "-" + yearDF['month'].map(str)
yearDF.to_csv('./year_month_data.csv')
yearDF[::5]
year | month | average_Price | year-Month | |
---|---|---|---|---|
0 | 2016 | 1 | 128.293029 | 2016-1 |
5 | 2016 | 6 | 143.543198 | 2016-6 |
10 | 2016 | 11 | 135.396148 | 2016-11 |
yearDF.plot(kind="bar",x='year-Month', y='average_Price',color = '#662dff', figsize =(15,8),
title = 'Seattle Airbnb prices trend over months')
plt.ylabel('Average Price')
Text(0, 0.5, 'Average Price')
The average price is more or less the same throughout 2016, with a spike between June and September, as seen earlier in the analysis.
!pip install holidays
Another useful library for calendar analysis is the holidays library, which lets us look up each US holiday and check whether the days or dates with higher prices correspond to holidays.
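As a quick illustration of the holidays API used in the cell below (the 4th of July 2016 is used here purely as an example date):
import holidays
from datetime import date

# Membership test and name lookup for a known US holiday
us_holidays = holidays.UnitedStates()
print(date(2016, 7, 4) in us_holidays)      # True
print(us_holidays.get(date(2016, 7, 4)))    # Independence Day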
# We make a copy of the dataset so that we mess with the copy and preserve the original
calendarDF=calender_df.copy()
#replacing NaN values with 0
calendarDF.fillna(0, inplace=True)
calendarDF = calendarDF[calendarDF.price != 0]
#Extracting prices from the table
price = calendarDF['price']
prices=[]
for p in price:
p = re.sub('[^0-9.]+','', p)
prices.append(float(p))
#replace the price column with the new column
calendarDF['price']=prices
calendarDF = calendarDF[calendarDF.price >= 0]
#separating date column into day month and year
calendarDF[['Year','Month','Day']] = calendarDF['date'].str.split('-', n=2, expand=True)
calendarDF.head()
listing_id | date | available | price | Year | Month | Day | |
---|---|---|---|---|---|---|---|
0 | 241032 | 2016-01-04 | t | 85.0 | 2016 | 01 | 04 |
1 | 241032 | 2016-01-05 | t | 85.0 | 2016 | 01 | 05 |
9 | 241032 | 2016-01-13 | t | 85.0 | 2016 | 01 | 13 |
10 | 241032 | 2016-01-14 | t | 85.0 | 2016 | 01 | 14 |
14 | 241032 | 2016-01-18 | t | 85.0 | 2016 | 01 | 18 |
Now that we have recreated the calendar data set like we did before and separated the date into day, month and year, we can use these for our EDA. Having installed the holidays library, we add the holidays to the data set.
from datetime import date
import datetime
import calendar
import holidays
calendarDF.fillna(0, inplace=True)
us_holidays = holidays.UnitedStates()
calendarDF['day_Name']='default'
calendarDF['holiday']='False'
calendarDF['us_holidays_name']='working'
for index,row in calendarDF.iterrows():
sdate = datetime.date(int(row['Year']),int(row['Month']),int(row['Day']))
vall=date(int(row['Year']),int(row['Month']),int(row['Day'])) in us_holidays
calendarDF.at[index,'day_Name'] = calendar.day_name[sdate.weekday()]
calendarDF.at[index,'holiday'] = vall
calendarDF.at[index,'us_holidays_name'] = us_holidays.get(sdate)
calendarDF.to_csv('./holidays_data.csv')
calendarDF.head()
listing_id | date | available | price | Year | Month | Day | day_Name | holiday | us_holidays_name | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 241032 | 2016-01-04 | t | 85.0 | 2016 | 01 | 04 | Monday | False | None |
1 | 241032 | 2016-01-05 | t | 85.0 | 2016 | 01 | 05 | Tuesday | False | None |
9 | 241032 | 2016-01-13 | t | 85.0 | 2016 | 01 | 13 | Wednesday | False | None |
10 | 241032 | 2016-01-14 | t | 85.0 | 2016 | 01 | 14 | Thursday | False | None |
14 | 241032 | 2016-01-18 | t | 85.0 | 2016 | 01 | 18 | Monday | True | Martin Luther King Jr. Day |
We calculate the average listing price for each day of the week to see how prices vary across the week.
#calculating Average price for each day
dayDF=calendarDF.groupby('day_Name').price.mean()
dayDF=dayDF.reset_index()
dayDF['day_num']=0
for index,row in dayDF.iterrows():
if row['day_Name']=='Monday':
dayDF.at[index,'day_num']=1
if row['day_Name']=='Tuesday':
dayDF.at[index,'day_num']=2
if row['day_Name']=='Wednesday':
dayDF.at[index,'day_num']=3
if row['day_Name']=='Thursday':
dayDF.at[index,'day_num']=4
if row['day_Name']=='Friday':
dayDF.at[index,'day_num']=5
if row['day_Name']=='Saturday':
dayDF.at[index,'day_num']=6
if row['day_Name']=='Sunday':
dayDF.at[index,'day_num']=7
dayDF=dayDF.sort_values('day_num',ascending=[1])
dayDF=dayDF.rename(columns={'price':'Average_Price'})
dayDF
day_Name | Average_Price | day_num | |
---|---|---|---|
1 | Monday | 135.676414 | 1 |
5 | Tuesday | 135.408764 | 2 |
6 | Wednesday | 135.447880 | 3 |
4 | Thursday | 136.476032 | 4 |
0 | Friday | 143.036294 | 5 |
2 | Saturday | 143.202136 | 6 |
3 | Sunday | 136.459941 | 7 |
What we see is that the average price is higher at the weekend, on Friday and Saturday.
import matplotlib.pyplot as plt
dayname = list()
for i in dayDF['day_Name']:
dayname.append(i)
avgprice = list()
for i in dayDF['Average_Price']:
avgprice.append(i)
graph_input = dict(zip(dayname,avgprice))
plt.scatter(dayname,avgprice)
plt.show()
calendarDF.groupby('us_holidays_name').listing_id.count()
us_holidays_name
Christmas Day 2829
Christmas Day (Observed) 2831
Columbus Day 2650
Independence Day 2371
Labor Day 2544
Martin Luther King Jr. Day 2231
Memorial Day 2583
New Year's Day 2922
New Year's Day (Observed) 2776
Thanksgiving 2746
Veterans Day 2718
Washington's Birthday 2529
Name: listing_id, dtype: int64
In general, all holidays have bookings, with New Year's Day having the most.
calendarDF.groupby('us_holidays_name').price.mean()
us_holidays_name
Christmas Day 137.974903
Christmas Day (Observed) 137.895797
Columbus Day 134.985660
Independence Day 151.402362
Labor Day 142.087264
Martin Luther King Jr. Day 121.740475
Memorial Day 143.233837
New Year's Day 138.126968
New Year's Day (Observed) 135.572767
Thanksgiving 136.054989
Veterans Day 140.070272
Washington's Birthday 125.353895
Name: price, dtype: float64
Independence Day has the highest average price; Martin Luther King Jr. Day and Washington's Birthday are the cheapest.
Independence Day, affectionately known as the 4th of July in the USA, is sometimes celebrated as a long weekend, so we check the average prices over that stretch of days, especially since trends have shown a spike in prices between June and August.
# analyzing data from date 4th of July to date 13th of July which includes both long weekend
#and normal workdays to compare prices
marDF=calendarDF[(calendarDF['Year'] == '2016') & (calendarDF['Month'] == '07' )&
((calendarDF['Day'] == '04' )|(calendarDF['Day'] == '05' )|(calendarDF['Day'] == '06' )
| (calendarDF['Day'] == '07' )| (calendarDF['Day'] == '08' )| (calendarDF['Day'] == '09' )
| (calendarDF['Day'] == '10' )| (calendarDF['Day'] == '11' )| (calendarDF['Day'] == '12' )
| (calendarDF['Day'] == '13' ))]
marDF=marDF.groupby('Day').price.mean()
marDF=marDF.reset_index()
marDF=marDF.sort_values('Day',ascending=[1])
marDF=marDF.rename(columns={'price':'Average_Price'})
marDF.head(10)
Day | Average_Price | |
---|---|---|
0 | 04 | 151.402362 |
1 | 05 | 150.522832 |
2 | 06 | 150.200837 |
3 | 07 | 150.709570 |
4 | 08 | 156.260469 |
5 | 09 | 156.448161 |
6 | 10 | 150.302538 |
7 | 11 | 149.866250 |
8 | 12 | 150.058504 |
9 | 13 | 150.656785 |
The 8th and 9th of July, a Friday and a Saturday, have the highest average prices, so weekends cost more than weekdays. A plot of the average price for each of these days shows this.
x=marDF['Day'].tolist()
y=marDF['Average_Price'].tolist()
plt.plot(x,y, 'ro-')
plt.ylabel('Average Price')
plt.xlabel('Days')
plt.show()
This answers our data mining question of what the busiest times in Seattle are and how the prices change. From our analysis, summer is the busiest time in Seattle and prices are higher then. The Independence Day holiday is the most expensive time to visit, though holidays in general are pricier. Good holidays to visit Seattle on are Martin Luther King Jr. Day and Washington's Birthday. Weekends are also pricier than weekdays.
Since we have NLTK, we can use it to process the text features.
summary = listings_df[['summary','price']]
summary = summary[pd.notnull(summary['summary'])]
summary = summary[summary['summary']!=0]
summary = summary.sort_values('price',ascending=[0])
top100DF = summary.head(1000)
from nltk.corpus import stopwords
import string
words=''
for index,row in top100DF.iterrows():
words += row['summary']
string_punctuation = string.punctuation
ignoreChar=['\r','\n','',' ',"'s"]
nums=['0','1','2','3','4','5','6','7','8','9']
summary_data=nltk.word_tokenize(words)
words_only = [l.lower() for l in summary_data if l not in string_punctuation if l not in ignoreChar if l not in nums]
filtered_data=[word for word in words_only if word not in stopwords.words('english')]
wnl = nltk.WordNetLemmatizer()
final_data=[wnl.lemmatize(data) for data in filtered_data]
final_words=' '.join(final_data)
final_words[:500]
'100 walk score close convention center westlake station pike place market amazon belltown capitol hill spotless reliable accommodating professional host apartment lower level classic 1915 craftsman bungalow heart wallingford walk score 93 ideally located 45th 50th block restaurant grocery store drug stores.walk downtown seattle attraction charming b b tree-lined side street capitol hill neighborhood.awesome two bedroom condo perfect convention traveler business traveler couple wanting see enjoy '
plot_display(final_words)
The most frequently used words in listing summaries are home, room, downtown, apartment, neighbourhood, garden, Ballard, bus line and bathroom. Hosts listing houses should use these keywords to attract more customers.
Revisiting Price related metrics
some_features=listings_df[['id','property_type','price','room_type','beds','bedrooms',
'neighbourhood_cleansed','bathrooms','accommodates','latitude','longitude']]
some_features.fillna(0,inplace=True)
some_features['price'] = some_features['price'].str.extract(r'(\d+)', expand=False)
# make price float
some_features['price'] = some_features['price'].astype(float, errors = 'raise')
some_features = some_features[some_features.bathrooms >0]
some_features = some_features[some_features.beds >0]
some_features = some_features[some_features.price >0]
some_features = some_features[some_features.accommodates >0]
some_features.head()
plt.figure(figsize=(18,22))
ax = sns.heatmap(some_features.groupby([
'neighbourhood_cleansed', 'bedrooms']).price.mean().unstack(),annot=True, fmt=".0f")
bottom, top = ax.get_ylim()
ax.set_ylim(bottom + 0.5, top - 0.5)
(87.5, -0.5)
The heatmap shows that the number of bedrooms greatly affects price: the more bedrooms, the higher the price, but neighbourhood also matters. As expected, some neighbourhoods are more expensive even with fewer bedrooms, and vice versa.
# average price by property type and room type
plt.figure(figsize=(12,12))
sns.heatmap(some_features.groupby(['property_type', 'room_type']).price.mean().unstack(),annot=True, fmt=".0f")
<AxesSubplot:xlabel='room_type', ylabel='property_type'>
Property type and room type certainly affect the price; apartments are among the most expensive, followed by boats.
some_features.groupby(['bedrooms']).price.mean().plot(kind="bar")
<AxesSubplot:xlabel='bedrooms'>
Key Findings from the Analysis of the data
When we started, we sought to understand and answer these questions:
- What features of the listings are related to price?
- What vibes are happening in each neighbourhood?
- Are there factors/metrics that help understand customer experience?
- Are there things that guests/AirBnB can do to boost guest experience and get more clients?
- Has there been an increase in the number of AirBnb hosts?
- What are the busiest times of the year in Seattle and by how much do the prices change?
Our Analysis of the data has uncovered these findings.
Price
- Price is related to the neighbourhood, the number of beds, how many nights the guest will be staying, whether there are extra guests, the day of the week, the month, whether it is a holiday, and the amenities provided
- Of all the amenities that we found and listed, the ones shown on the word cloud are the ones most strongly related to price. Some of the main ones include Wireless Internet, smoke detector, carbon monoxide detector, essentials like shampoo, free parking, and family/kid friendly. These improve customer experience and also contribute to pricing: the more amenities the host adds, the pricier the listing
- Houses and apartments are the most common property types, at 45.4% and 44.7% of all listings respectively, and they are also among the most expensive
Vibes in each neighbourhood
- Broadway is the most frequently listed neighbourhood which occurs 397 times
- We see that some zipcodes have more listings, such as 98122, 98103 and 98102, which are generally higher; the bulk of the zipcodes have roughly the same number of listings until we get to 98104, and then 98199, 98126, 98106, 98108, 98133 and 98136 have about the same number of listings, around 50-60 each
Has there been an increase in AirBnB hosts?
- From 2008 to 2016 the number of listings grew from 4 to 3,816; the growth is exponential. Between 2009 and 2011 the number of hosts roughly doubled each year, an average growth rate of about 140% per year. From 2011 the growth rate steadied to an average of about 30% per year. About 50% of hosts had joined by July 2013
Busiest times in Seattle and how prices change
- Summer, between June and August, is the busiest time in Seattle, and prices are highest during this period. Holidays are generally higher priced on average than normal days; the Independence Day holiday is the most expensive time to visit Seattle. Good holidays to visit Seattle on are Martin Luther King Jr. Day and Washington's Birthday, which are cheaper. Weekends are also higher priced than weekdays; prices go up on Friday and Saturday
Customer metrics
- The reviews per month and review score rating features are useful for understanding, and thus improving, customer experience. As expected, some hosts have many reviews but few reviews per month, suggesting they have been listed longer but are not necessarily performing better than hosts with fewer ratings overall but more reviews per month
Data Preparation
Our business understanding is broad: we sought to understand features that we can link to the customer experience and features that affect pricing. As we come to preparing the data for modelling, we realise that we need to narrow down to one aspect to investigate fully first; then, as time permits, we can do the other. We will start with price and model price prediction.
What we have already learned about price from our EDA has revealed the features that seem to be related to or affect pricing, and we will focus on these. We concluded that price is related to:
- Amenities provided (we identified a list of amenities that are essential)
- Number of reviews
- Property type
- Room type
- Neighbourhood
- Number of beds
- How many nights the guest will be staying
- The day of the week and the month
- Whether it is a holiday
We will now clean and prepare the columns with these and related features. Fill in missing values. Convert the categorical variables into one hot encoded dummies
Selecting the columns that we want
We have 3 datasets:
calendar.csv - calendar data for the listings: availability dates, price for each date.
listings.csv - summary information on listing in Seattle such as: location, host information, cleaning and guest fees, amenities etc.
reviews.csv - summary review data for the listings.
We will only use the calendar and listings data; the reviews are in textual format, so we won't be modelling them. During our EDA we already determined the columns that affect price, so here we just list them.
#Select only the columns which we are interested in
selected_listings_cols = [
'id', 'zipcode', 'property_type', 'room_type', 'accommodates', 'bathrooms', 'bedrooms',
'beds', 'bed_type', 'amenities', 'price', 'weekly_price', 'monthly_price', 'security_deposit',
'cleaning_fee', 'number_of_reviews', 'review_scores_rating', 'review_scores_accuracy',
'review_scores_cleanliness', 'review_scores_checkin', 'review_scores_communication',
'review_scores_location', 'review_scores_value', 'cancellation_policy',
]
new_listings_df = listings_df[selected_listings_cols].copy()  # copy to avoid SettingWithCopyWarning later
new_listings_df['price'][:5]
0 85.0
1 150.0
2 975.0
3 100.0
4 450.0
Name: price, dtype: float64
Drop missing values
# Drop rows with missing values of the number of reviews
# Drop rows with missing values of the bathrooms
# Drop rows with missing values of the property type
review_scores_cols = [col for col in listings_df.columns if 'review_scores' in col]
drop_cols = ['number_of_reviews','bathrooms','property_type'] + review_scores_cols
new_listings_df.dropna(subset = drop_cols, axis = 0, inplace = True)
# Drop rows where the number of reviews is 0.
new_listings_df = new_listings_df[new_listings_df['number_of_reviews'] != 0]
# Drop rows where the amenities are empty
new_listings_df = new_listings_df[new_listings_df['amenities'] != '{}']
# Reset index
new_listings_df = new_listings_df.reset_index(drop = True)
Fill missing values
# Fill missing values of the bedrooms and beds as 1
new_listings_df[['bedrooms', 'beds']] = new_listings_df[['bedrooms', 'beds']].fillna(value = 1)
# Change the data type of the price and related columns and fill missing values as 0
new_listings_price_df = new_listings_df[['price','weekly_price','monthly_price','security_deposit','cleaning_fee']].copy()
for col in ['price','weekly_price','monthly_price','security_deposit','cleaning_fee']:
    # Columns still stored as strings (e.g. "$1,000.00") are stripped of $ and , and cast to float;
    # price is already numeric here, so it is left as-is
    if new_listings_price_df[col].dtype == object:
        new_listings_price_df[col] = new_listings_price_df[col].str.replace('$', '', regex=False).str.replace(',', '', regex=False).astype(float)
new_listings_price_df.fillna(value = 0, inplace =True)
# Calculate weekly and monthly price
new_listings_price_df['calc_weekly_price'] = new_listings_price_df['price'] * 7
new_listings_price_df['calc_monthly_price'] = new_listings_price_df['price'] * 30
new_listings_price_df[:5]
price | weekly_price | monthly_price | security_deposit | cleaning_fee | calc_weekly_price | calc_monthly_price | |
---|---|---|---|---|---|---|---|
0 | 85.0 | 595.0 | 2550.0 | 0.0 | 0.0 | 595.0 | 2550.0 |
1 | 150.0 | 1.0 | 3.0 | 100.0 | 40.0 | 1050.0 | 4500.0 |
2 | 975.0 | 6825.0 | 29250.0 | 1.0 | 300.0 | 6825.0 | 29250.0 |
3 | 450.0 | 3150.0 | 13500.0 | 700.0 | 125.0 | 3150.0 | 13500.0 |
4 | 120.0 | 800.0 | 3600.0 | 0.0 | 40.0 | 840.0 | 3600.0 |
# Fill the weekly and monthly price with their calculated values
for idx, row in new_listings_price_df.iterrows():
if row['weekly_price'] == 0:
new_listings_price_df.loc[idx, ['weekly_price']] = row['calc_weekly_price']
if row['monthly_price'] == 0:
new_listings_price_df.loc[idx, ['monthly_price']] = row['calc_monthly_price']
new_listings_price_df[:5]
price | weekly_price | monthly_price | security_deposit | cleaning_fee | calc_weekly_price | calc_monthly_price | |
---|---|---|---|---|---|---|---|
0 | 85.0 | 595.0 | 2550.0 | 0.0 | 0.0 | 595.0 | 2550.0 |
1 | 150.0 | 1.0 | 3.0 | 100.0 | 40.0 | 1050.0 | 4500.0 |
2 | 975.0 | 6825.0 | 29250.0 | 1.0 | 300.0 | 6825.0 | 29250.0 |
3 | 450.0 | 3150.0 | 13500.0 | 700.0 | 125.0 | 3150.0 | 13500.0 |
4 | 120.0 | 800.0 | 3600.0 | 0.0 | 40.0 | 840.0 | 3600.0 |
#drop the calculated prices columns
new_listings_price_df.drop(columns = ['calc_weekly_price', 'calc_monthly_price'], inplace = True)
new_listings_price_df
price | weekly_price | monthly_price | security_deposit | cleaning_fee | |
---|---|---|---|---|---|
0 | 85.0 | 595.0 | 2550.0 | 0.0 | 0.0 |
1 | 150.0 | 1.0 | 3.0 | 100.0 | 40.0 |
2 | 975.0 | 6825.0 | 29250.0 | 1.0 | 300.0 |
3 | 450.0 | 3150.0 | 13500.0 | 700.0 | 125.0 |
4 | 120.0 | 800.0 | 3600.0 | 0.0 | 40.0 |
... | ... | ... | ... | ... | ... |
3118 | 154.0 | 1.0 | 5.0 | 150.0 | 85.0 |
3119 | 65.0 | 455.0 | 1950.0 | 0.0 | 20.0 |
3120 | 95.0 | 600.0 | 2.0 | 500.0 | 0.0 |
3121 | 359.0 | 2513.0 | 10770.0 | 0.0 | 230.0 |
3122 | 79.0 | 553.0 | 2370.0 | 500.0 | 50.0 |
3123 rows × 5 columns
Categorical values changed to dummy variables
# Create dummy columns of cancellation policy, room type, property type and bed type
cancellation_policy_dummy_df = pd.get_dummies(new_listings_df['cancellation_policy'], prefix = 'cancellation_policy')
room_type_dummy_df = pd.get_dummies(new_listings_df['room_type'], prefix = 'room_type')
property_type_dummy_df = pd.get_dummies(new_listings_df['property_type'], prefix = 'property_type')
bed_type_dummy_df = pd.get_dummies(new_listings_df['bed_type'], prefix = 'bed_type')
bed_type_dummy_df
bed_type_Airbed | bed_type_Couch | bed_type_Futon | bed_type_Pull-out Sofa | bed_type_Real Bed | |
---|---|---|---|---|---|
0 | 0 | 0 | 0 | 0 | 1 |
1 | 0 | 0 | 0 | 0 | 1 |
2 | 0 | 0 | 0 | 0 | 1 |
3 | 0 | 0 | 0 | 0 | 1 |
4 | 0 | 0 | 0 | 0 | 1 |
... | ... | ... | ... | ... | ... |
3118 | 0 | 0 | 0 | 0 | 1 |
3119 | 0 | 0 | 0 | 0 | 1 |
3120 | 0 | 0 | 0 | 0 | 1 |
3121 | 0 | 0 | 0 | 0 | 1 |
3122 | 0 | 0 | 0 | 0 | 1 |
3123 rows × 5 columns
Create dummy variables based on amenities
# Create dummy columns based on the amenities
# Drop rows with empty amenities
amenities_series = new_listings_df['amenities']
amenities_series = amenities_series[amenities_series != '{}']
# Iterate over rows and format them as list
amenities_list = []
for index, row in amenities_series.items():
amenities_list.append(row.replace('{','').replace('}','').replace('"','').split(','))
# Convert the list to a data frame
amenities_df = pd.Series(amenities_list, name = 'amenities').to_frame()
# Create a dummy data frame
dummies_amenities_df = amenities_df.drop('amenities', 1).join(
pd.get_dummies(
pd.DataFrame(amenities_df.amenities.tolist()).stack()
).astype(int).sum(level=0)
)
# Reset index
# dummies_amenities_df = dummies_amenities_df.reset_index(drop=True)
dummies_amenities_df[:5]
24-Hour Check-in | Air Conditioning | Breakfast | Buzzer/Wireless Intercom | Cable TV | Carbon Monoxide Detector | Cat(s) | Dog(s) | Doorman | Dryer | ... | Safety Card | Shampoo | Smoke Detector | Smoking Allowed | Suitable for Events | TV | Washer | Washer / Dryer | Wheelchair Accessible | Wireless Internet | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 |
1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | ... | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 1 |
2 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 1 | ... | 0 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 1 |
3 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
5 rows × 41 columns
dummy_df = pd.concat([cancellation_policy_dummy_df, room_type_dummy_df, property_type_dummy_df, bed_type_dummy_df, dummies_amenities_df], axis = 1)
dummy_df[:5]
cancellation_policy_flexible | cancellation_policy_moderate | cancellation_policy_strict | room_type_Entire home/apt | room_type_Private room | room_type_Shared room | property_type_Apartment | property_type_Bed & Breakfast | property_type_Boat | property_type_Bungalow | ... | Safety Card | Shampoo | Smoke Detector | Smoking Allowed | Suitable for Events | TV | Washer | Washer / Dryer | Wheelchair Accessible | Wireless Internet | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 |
1 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 1 |
2 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 1 |
3 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
4 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
5 rows × 68 columns
Concatenate the dummy dataframes together
concat_listings_df =pd.concat([dummy_df,new_listings_price_df],axis=1)
concat_listings_df ['price']
#pd.concat([new_listings_df.select_dtypes(include = ['int', 'float']), new_listings_price_df, dummy_df], axis = 1)
0 85.0
1 150.0
2 975.0
3 450.0
4 120.0
...
3118 154.0
3119 65.0
3120 95.0
3121 359.0
3122 79.0
Name: price, Length: 3123, dtype: float64
# Create a list of listing_ids that are never available
always_f_listing_id = list(calender_df.groupby('listing_id')['available'].sum().loc[lambda x : x == 0].index.values)
calender_df[:5]
listing_id | date | available | price | dayofweek | weekofyear | month | year | |
---|---|---|---|---|---|---|---|---|
0 | 241032 | 2016-01-04 | 1 | 85.000000 | 0 | 1 | 1 | 2016 |
1 | 241032 | 2016-01-05 | 1 | 85.000000 | 1 | 1 | 1 | 2016 |
2 | 241032 | 2016-01-06 | 0 | 137.090652 | 2 | 1 | 1 | 2016 |
3 | 241032 | 2016-01-07 | 0 | 137.090652 | 3 | 1 | 1 | 2016 |
4 | 241032 | 2016-01-08 | 0 | 137.090652 | 4 | 1 | 1 | 2016 |
calender_df.isna().sum()
listing_id 0
date 0
available 0
price 0
dayofweek 0
weekofyear 0
month 0
year 0
dtype: int64
Data Modelling
We will use
- concat_listings_df and the calender_df
concat_listings_df.columns
Index(['cancellation_policy_flexible', 'cancellation_policy_moderate',
'cancellation_policy_strict', 'room_type_Entire home/apt',
'room_type_Private room', 'room_type_Shared room',
'property_type_Apartment', 'property_type_Bed & Breakfast',
'property_type_Boat', 'property_type_Bungalow', 'property_type_Cabin',
'property_type_Camper/RV', 'property_type_Chalet',
'property_type_Condominium', 'property_type_Dorm',
'property_type_House', 'property_type_Loft', 'property_type_Other',
'property_type_Tent', 'property_type_Townhouse',
'property_type_Treehouse', 'property_type_Yurt', 'bed_type_Airbed',
'bed_type_Couch', 'bed_type_Futon', 'bed_type_Pull-out Sofa',
'bed_type_Real Bed', '24-Hour Check-in', 'Air Conditioning',
'Breakfast', 'Buzzer/Wireless Intercom', 'Cable TV',
'Carbon Monoxide Detector', 'Cat(s)', 'Dog(s)', 'Doorman', 'Dryer',
'Elevator in Building', 'Essentials', 'Family/Kid Friendly',
'Fire Extinguisher', 'First Aid Kit', 'Free Parking on Premises', 'Gym',
'Hair Dryer', 'Hangers', 'Heating', 'Hot Tub', 'Indoor Fireplace',
'Internet', 'Iron', 'Kitchen', 'Laptop Friendly Workspace',
'Lock on Bedroom Door', 'Other pet(s)', 'Pets Allowed',
'Pets live on this property', 'Pool', 'Safety Card', 'Shampoo',
'Smoke Detector', 'Smoking Allowed', 'Suitable for Events', 'TV',
'Washer', 'Washer / Dryer', 'Wheelchair Accessible',
'Wireless Internet', 'price', 'weekly_price', 'monthly_price',
'security_deposit', 'cleaning_fee'],
dtype='object')
selected_cols = [
'accommodates',
'bathrooms',
'bedrooms',
'beds',
'review_scores_accuracy',
'review_scores_cleanliness',
'review_scores_checkin',
'review_scores_communication',
'review_scores_location',
'cancellation_policy_flexible',
'cancellation_policy_moderate',
'cancellation_policy_strict',
'room_type_Entire home/apt',
'room_type_Private room',
'room_type_Shared room',
'property_type_Apartment',
'property_type_Bed & Breakfast',
'property_type_Boat',
'property_type_Bungalow',
'property_type_Cabin',
'property_type_Camper/RV',
'property_type_Chalet',
'property_type_Condominium',
'property_type_Dorm',
'property_type_House',
'property_type_Loft',
'property_type_Other',
'property_type_Tent',
'property_type_Townhouse',
'property_type_Treehouse',
'property_type_Yurt',
'bed_type_Airbed',
'bed_type_Couch',
'bed_type_Futon',
'bed_type_Pull-out Sofa',
'bed_type_Real Bed',
'24-Hour Check-in',
'Air Conditioning',
'Breakfast',
'Buzzer/Wireless Intercom',
'Cable TV',
'Carbon Monoxide Detector',
'Cat(s)',
'Dog(s)',
'Doorman',
'Dryer',
'Elevator in Building',
'Essentials',
'Family/Kid Friendly',
'Fire Extinguisher',
'First Aid Kit',
'Free Parking on Premises',
'Gym',
'Hair Dryer',
'Hangers',
'Heating',
'Hot Tub',
'Indoor Fireplace',
'Internet',
'Iron',
'Kitchen',
'Laptop Friendly Workspace',
'Lock on Bedroom Door',
'Other pet(s)',
'Pets Allowed',
'Pets live on this property',
'Pool',
'Safety Card',
'Shampoo',
'Smoke Detector',
'Smoking Allowed',
'Suitable for Events',
'TV',
'Washer',
'Wheelchair Accessible',
'Wireless Internet'
]
Machine Learning: Model Selection
We will start with XGBoost because it is an advanced gradient boosting algorithm that can handle a large number of features (our data set has many variables).
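The XGBoost fit itself is not shown in the cells below, which focus on feature selection; as a minimal sketch, not the notebook's actual model, of how an XGBoost regressor could be trained and evaluated with R2 and RMSE on the prepared data, assuming the xgboost package is installed:
from xgboost import XGBRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error
import numpy as np

# Sketch only: hold out a test set, fit a gradient boosted regressor and evaluate it
X = concat_listings_df.drop(['price'], axis=1).select_dtypes(exclude=['object'])
y = concat_listings_df['price']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
xgb_model = XGBRegressor(n_estimators=300, max_depth=4, learning_rate=0.1, objective='reg:squarederror')
xgb_model.fit(X_train, y_train)
y_pred = xgb_model.predict(X_test)
print('R2:  ', r2_score(y_test, y_pred))
print('RMSE:', np.sqrt(mean_squared_error(y_test, y_pred)))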
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_classif
import pandas as pd
from sklearn.model_selection import train_test_split
from xgboost import plot_importance
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
X=concat_listings_df.drop(['price'],axis=1).select_dtypes(exclude=['object'])
y=concat_listings_df['price']
# feature extraction
test = SelectKBest(score_func=f_classif, k=4)
fit = test.fit(X, y)
# summarize scores
#set_printoptions(precision=3)
print(fit.scores_)
features = fit.transform(X)
# summarize selected features
print(features[0:5,:])
[ 1.53466214 1.2987518 2.90972465 13.47652381 9.74081468 7.69373935
2.41685525 0.8211285 11.37626548 0.40519549 0.57030881 0.34520482
0.31617885 0.9355316 1.59113654 2.53268221 0.78594757 1.47508666
0.39350675 0.95576457 0.97043513 0.09909161 1.74685022 0.96428115
0.74663051 0.70456078 1.47429514 1.53097578 1.80095279 1.45991194
1.55088211 2.31376894 1.11652724 1.07495732 0.98874016 2.92842166
1.17212899 2.27682701 1.02246066 2.30845121 0.93481025 1.24830463
1.31057935 2.64698341 1.51811673 1.46783559 1.14992775 2.44973911
1.88447824 1.07653317 1.58715301 1.20123227 1.25092065 1.04704867
0.44190869 1.13826204 1.55845695 3.05327576 1.22631481 1.24680062
0.83378803 0.86773784 1.69089912 2.00411682 1.2468099 0.11179384
1.24166705 0.95109599 11.99159758 13.29630536 2.89720016 11.95987119]
[[1.000e+00 5.950e+02 2.550e+03 0.000e+00]
[1.000e+00 1.000e+00 3.000e+00 4.000e+01]
[1.000e+00 6.825e+03 2.925e+04 3.000e+02]
[1.000e+00 3.150e+03 1.350e+04 1.250e+02]
[0.000e+00 8.000e+02 3.600e+03 4.000e+01]]
model = LogisticRegression(solver='lbfgs')
rfe = RFE(model, n_features_to_select=3)
fit = rfe.fit(X, y)
print("Num Features: %d" % fit.n_features_)
print("Selected Features: %s" % fit.support_)
print("Feature Ranking: %s" % fit.ranking_)
/home/fucking/anaconda/lib/python3.8/site-packages/sklearn/linear_model/_logistic.py:762: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
n_iter_i = _check_optimize_result(
# Train, test and validation data
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=99)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=99)

# Create a model and fit the data to it
xgb_model = XGBRegressor(
    max_depth=15,
    n_estimators=1000,
    min_child_weight=10,
    colsample_bytree=0.6,
    subsample=0.6,
    eta=0.2,              # eta is the native alias of learning_rate, so setting both is redundant
    seed=0,               # legacy alias of random_state
    learning_rate=0.05,
    n_jobs=2)

# Early stopping on the validation set keeps the 1000-tree budget from overfitting
xgb_model.fit(
    X_train,
    y_train,
    eval_metric="rmse",
    eval_set=[(X_train, y_train), (X_val, y_val)],
    verbose=10,
    early_stopping_rounds=5)
[0] validation_0-rmse:147.14436 validation_1-rmse:146.99689
[10] validation_0-rmse:95.54114 validation_1-rmse:95.90320
[20] validation_0-rmse:64.75782 validation_1-rmse:66.27467
[30] validation_0-rmse:47.27364 validation_1-rmse:48.85769
[40] validation_0-rmse:38.16735 validation_1-rmse:41.10921
[50] validation_0-rmse:32.86573 validation_1-rmse:36.87308
[60] validation_0-rmse:29.69966 validation_1-rmse:34.57921
[70] validation_0-rmse:27.70108 validation_1-rmse:33.17731
[80] validation_0-rmse:26.40963 validation_1-rmse:32.58391
[90] validation_0-rmse:25.15677 validation_1-rmse:31.58672
[100] validation_0-rmse:24.32839 validation_1-rmse:31.16549
[103] validation_0-rmse:24.15864 validation_1-rmse:31.21159
XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
colsample_bynode=1, colsample_bytree=0.6, eta=0.2, gamma=0,
gpu_id=-1, importance_type='gain', interaction_constraints='',
learning_rate=0.05, max_delta_step=0, max_depth=15,
min_child_weight=10, missing=nan, monotone_constraints='()',
n_estimators=1000, n_jobs=2, num_parallel_tree=1, random_state=0,
reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=0,
subsample=0.6, tree_method='exact', validate_parameters=1,
verbosity=None)
from sklearn.metrics import r2_score

# Predict on each split and report R² (train, validation, test — printed in that order)
xgb_train_pred = xgb_model.predict(X_train)
xgb_val_pred = xgb_model.predict(X_val)
xgb_test_pred = xgb_model.predict(X_test)
for y_true, y_pred in [(y_train, xgb_train_pred), (y_val, xgb_val_pred), (y_test, xgb_test_pred)]:
    print(r2_score(y_true, y_pred))
0.9248414694882681
0.8683763163400222
0.916730205761942
The model performs reasonably well: the R2 score is about 0.92 on the training set, 0.87 on the validation set and 0.92 on the test set.
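Since our assessment criteria also call for RMSE and a look at the residuals, a quick complementary check could look like the sketch below; it reuses the split and prediction names from the cells above and assumes nothing new beyond numpy, matplotlib and scikit-learn.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error

# RMSE on the held-out test set, in the same units as price
rmse_test = np.sqrt(mean_squared_error(y_test, xgb_test_pred))
print(f"Test RMSE: {rmse_test:.2f}")

# Residuals vs predictions: a funnel shape or strong curvature would indicate skew
residuals = y_test - xgb_test_pred
plt.scatter(xgb_test_pred, residuals, alpha=0.3)
plt.axhline(0, color="red")
plt.xlabel("Predicted price")
plt.ylabel("Residual");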
from xgboost import plot_importance

fig, ax = plt.subplots(figsize=(10, 15))
plot_importance(xgb_model, ax=ax);
The feature importances are not entirely consistent with our EDA findings, possibly because of the high dimensionality and the strong correlation between several features. It may be worth revisiting the feature engineering and reducing the number of features.
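One way of reducing the feature set, sketched here only as an illustration (X is assumed to be the feature DataFrame used above and the 0.9 cut-off is arbitrary), is to drop one member of every highly correlated pair before refitting:
import numpy as np

corr = X.corr().abs()
# Keep only the upper triangle so each pair is inspected once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
X_reduced = X.drop(columns=to_drop)
print(f"Dropped {len(to_drop)} highly correlated features")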
Customer Experience prediction
Our other business objective was to determine whether there are metrics we can use to understand customer experience. We have already done the EDA and feature engineering, so all that remains is the final feature selection and modelling.
df = listings_df.copy()
df['cx_score'] = df['review_scores_rating'] / 100 * df['reviews_per_month']
We found a strong relationship between review_scores_rating and reviews_per_month. Since we can only have one response variable, we combine the two by multiplying them and call the resulting column cx_score.
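As a quick worked example of the formula (the numbers are illustrative, not taken from the dataset):
rating = 95          # review_scores_rating on a 0-100 scale
reviews_pm = 2.0     # reviews_per_month
print(rating / 100 * reviews_pm)   # 1.9 — high when a listing is both well rated and frequently reviewed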
listings_df['neighbourhood_cleansed'].value_counts()[:10]
Broadway 397
Belltown 234
Wallingford 167
Fremont 158
Minor 135
University District 122
Stevens 119
First Hill 108
Central Business District 103
Lower Queen Anne 94
Name: neighbourhood_cleansed, dtype: int64
These are the ten neighbourhoods with the most listings. We will pick one of them and model the customer-experience score on its listings.
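A sketch of that subsetting step is below; picking Broadway (the largest group) is purely illustrative, and df is the copy with the cx_score column created above.
neighbourhood = 'Broadway'   # illustrative choice: the neighbourhood with the most listings
df_nbhd = df[df['neighbourhood_cleansed'] == neighbourhood].copy()
print(df_nbhd.shape)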