Imagine you own a restaurant and want to analyze not only the trend of your revenue, but also the reasons behind periods of particularly high earnings, the times of day when a particular kind of customer visits, why tips are higher on some days than others, and so on.

Knowing all those features would allow you to exploit them and make your restaurant more attractive.

In this article, I’m going to analyze and build a Machine Learning model on a dataset containing very interesting information about restaurant tips. On Kaggle, you can read the following description of this dataset, called ‘Tips’:

"Food servers’ tips in restaurants may be influenced by many factors, including the nature of the restaurant, size of the party, and table locations in the restaurant. Restaurant managers need to know which factors matter when they assign tables to food servers. For the sake of staff morale, they usually want to avoid either the substance or the appearance of unfair treatment of the servers, for whom tips (at least in restaurants in the United States) are a major component of pay. In one restaurant, a food server recorded the following data on all customers they served during an interval of two and a half months in early 1990. The restaurant, located in a suburban shopping mall, was part of a national chain and served a varied menu. In observance of local law, the restaurant offered to seat in a non-smoking section to patrons who requested it. Each record includes a day and time, and taken together, they show the server’s work schedule."

Plus, I’m going to do so in an interactive way, so that the user can ask for specific analytics/predictions. For this purpose, I will use the Python package Streamlit, a very powerful tool for easily building web apps.

So let’s start computing some analytics.

Exploratory analysis

Once I've loaded my dataset (and asked my app to show it to me via a checkbox), I want to visualize some meaningful summaries of it. I will write my code in a .py file and then run it from my Terminal via streamlit run tips.py.

So let’s start with some basic plotting, using a bar chart:

import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
import streamlit as st

st.title('Restaurant tips analysis')

# Load the built-in tips dataset shipped with Plotly
tips = px.data.tips()

if st.checkbox('Show dataframe'):
    st.write(tips)

st.subheader('Visualizing bar charts')

# Note: the column is named 'tip', not 'tips'
label = st.selectbox('Which label do you want to visualize?', ['total_bill', 'tip'])
x = st.selectbox('Pick the x variable: ', ['sex', 'smoker', 'day', 'time', 'size'])

tips = tips.sort_values('size')
fig = px.bar(tips, x=x, y=label, height=400)
st.plotly_chart(fig)


if st.checkbox('Want to group with respect to another variable?'):
    group1 = st.selectbox('Pick the group variable: ', ['smoker', 'day', 'time', 'size', 'sex'])
    fig1 = px.bar(tips, x=x, y=label, color=group1, barmode='group',
                  height=400)
    st.plotly_chart(fig1)

    # Nested inside the first block, since it reuses group1 (and the
    # different label avoids a duplicate-widget error in Streamlit)
    if st.checkbox('Want to group with respect to a further variable?'):
        group2 = st.selectbox('Pick the facet variable: ', ['day', 'time', 'size', 'sex', 'smoker'])
        fig2 = px.bar(tips, x=x, y=label, color=group1, barmode='group', facet_col=group2,
                      height=400)
        st.plotly_chart(fig2)

Namely, if we want to collect information with respect to sex, smoker and day of the week, we obtain a grouped, faceted bar chart.

And we can do it interactively, meaning we can change the variables of interest any time we want.

Another interesting feature to examine is the distribution of our potential target variables, tip and total_bill:

st.subheader('Visualizing distributions')

# A key avoids a DuplicateWidgetID error, since a selectbox with the
# same label and options already exists above
label1 = st.selectbox('Which label do you want to visualize?', ['total_bill', 'tip'], key='dist_label')

fig3 = px.histogram(tips, x=label1, hover_data=tips.columns, marginal="box")
st.plotly_chart(fig3)

if st.checkbox('Want to condition the probability?'):
    group1_d = st.selectbox('Pick the conditioning variable: ', ['smoker', 'day', 'time', 'size', 'sex'])
    fig4 = px.histogram(tips, x=label1, color=group1_d,
                        marginal="box", hover_data=tips.columns)
    st.plotly_chart(fig4)

    # Nested, since it reuses group1_d from the block above
    if st.checkbox('Want to condition the probability on a further variable?'):
        group2_d = st.selectbox('Pick the group variable: ', ['day', 'time', 'size', 'sex', 'smoker'])
        fig5 = px.histogram(tips, x=label1, color=group1_d, facet_col=group2_d,
                            marginal="box",
                            hover_data=tips.columns)
        st.plotly_chart(fig5)

Namely, we can inspect the distribution of total_bill, conditioned on smoker and day.

Nice, now let’s build our ML model which, in this case, will solve a regression task, since both total_bill and tip are continuous.

ML and predictions

Before running our linear regression, we have to manipulate our dataset, since it contains categorical data, like sex or smoker, which cannot be fed to our model as they are. Hence, we need to encode them as dummy variables, which are indeed used to capture qualitative information:

# Binary variables can be mapped directly to 0/1
tips.replace({'sex': {'Male': 0, 'Female': 1}, 'smoker': {'No': 0, 'Yes': 1}}, inplace=True)

# One-hot encode day and time, dropping one column as the baseline
days = pd.get_dummies(tips['day'], drop_first=True)
tips = pd.concat([tips, days], axis=1)
times = pd.get_dummies(tips['time'], drop_first=True)
tips = pd.concat([tips, times], axis=1)
tips.drop(['day', 'time'], inplace=True, axis=1)
tips.head()

As you can see, now instead of Male and Female we have 0 and 1, and the same holds for smoker.

Note: for day and time, I used the one-hot encoding procedure (you can read more about it here), and I dropped one column (Fri for day, Dinner for time) since it will represent my reference group (also called ‘baseline’).
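To see why the dropped column acts as a baseline, here is a minimal sketch on a toy Series (not the actual dataset): pd.get_dummies sorts the categories alphabetically, so with drop_first=True the Fri column is the one removed, and a row of all zeros implicitly means Fri.

```python
import pandas as pd

# Toy column with the same categories as the 'day' variable
days = pd.Series(['Fri', 'Sat', 'Sun', 'Thur', 'Fri'])
dummies = pd.get_dummies(days, drop_first=True).astype(int)

print(dummies.columns.tolist())   # ['Sat', 'Sun', 'Thur'] -- 'Fri' was dropped
print(dummies.iloc[0].tolist())   # [0, 0, 0] -- all zeros means 'Fri'
```

The same logic makes Dinner the baseline for time, since 'Dinner' sorts before 'Lunch'.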

Now let’s build our model, allowing the user to select both target and features:

st.subheader('Running Regression')

features = st.multiselect('Pick the covariates: ', ['sex', 'smoker', 'size', 'Thur', 'Sat', 'Sun', 'Lunch'])
target = st.selectbox('Select the target: ', ['tip', 'total_bill'])

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Fit only once the user has picked at least one covariate,
# otherwise tips[features] would be an empty dataframe
if features:
    X = tips[features]
    y = tips[target]

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=123)

    model = LinearRegression()
    model.fit(X_train, y_train)

    predictions = model.predict(X_test)

    if st.checkbox('Show score'):
        score = model.score(X_test, y_test)
        st.write(f'Score is: {score}')
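Once the model is fitted, the coefficients themselves are worth a look: for a dummy variable, the coefficient is the expected shift of the target relative to the dropped baseline, holding the other covariates fixed. Here is a minimal sketch on synthetic data (the names size and Lunch are stand-ins for the encoded columns above, and the true effects 0.5 and -0.4 are made up for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic stand-in: party size plus a 'Lunch' dummy
# (Dinner is the implied baseline)
size = rng.integers(1, 7, size=200).astype(float)
lunch = rng.integers(0, 2, size=200).astype(float)
tip = 1.0 + 0.5 * size - 0.4 * lunch + rng.normal(0, 0.1, size=200)

X = np.column_stack([size, lunch])
model = LinearRegression().fit(X, tip)

# coef_ pairs with the columns of X; the 'Lunch' coefficient is the
# expected tip shift relative to Dinner, holding size fixed
print(dict(zip(['size', 'Lunch'], model.coef_.round(2))))
```

In the app, the same inspection could be shown with st.write(model.coef_) next to the score.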

And that’s it! Now you can play with the features and see which combination of them leads to the highest R2 score. Of course, this is a rather ‘rustic’ way to decide which variables matter most, but that doesn’t mean it is ineffective: if you find a combination that yields a high R2 score, this naive criterion has done its job. Besides, considering how intuitive and easy to implement it is, you can well appreciate its potential.
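Playing with the feature combinations by hand can also be automated with a brute-force search over all subsets. The sketch below uses synthetic stand-in data so it is self-contained (the feature names and effect sizes are made up); on the real encoded dataset you would build the features dict from the dataframe columns instead:

```python
import itertools
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Synthetic stand-in: three candidate features, only two of which
# actually drive the target
n = 300
features = {
    'size': rng.integers(1, 7, n).astype(float),
    'smoker': rng.integers(0, 2, n).astype(float),
    'noise': rng.normal(0, 1, n),
}
y = 0.8 * features['size'] + 0.5 * features['smoker'] + rng.normal(0, 0.2, n)

# Try every non-empty subset of features and keep the best test R2
best_score, best_combo = -np.inf, None
names = list(features)
for r in range(1, len(names) + 1):
    for combo in itertools.combinations(names, r):
        X = np.column_stack([features[c] for c in combo])
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=0.3, random_state=123)
        score = LinearRegression().fit(X_train, y_train).score(X_test, y_test)
        if score > best_score:
            best_score, best_combo = score, combo

print(best_combo, round(best_score, 3))
```

With a handful of candidate features, as here, the exhaustive search is cheap; for many features you would switch to a greedy or regularized selection method instead.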

I hope you enjoyed the reading. If you are interested in Streamlit, you can read the official documentation here and some of my previous articles on the topic:


Published by valentinaalto

I'm a 22-year-old student based in Milan, passionate about everything related to Statistics, Data Science and Machine Learning. I'm eager to learn new concepts and techniques as well as share them with whoever is interested in the topic.
