Data Science and Analytics apply to a huge variety of fields: basically anywhere information is delivered in the form of data.

The sports industry is no exception. There is big business around it, and being able to study the sports market with powerful analytics tools adds great value.

In this article, I’m going to provide some tools for analyzing football matches. The idea is to develop a web app with Python's Streamlit library (if you want an introduction to this tool, you can read my former article here) which, in an intuitive and interactive way, lets the user summarize relevant information from a large dataset. For this purpose, I’m going to use the International football results from 1872 to 2019 dataset, available on Kaggle.

First things first, let’s import our dataset and have a look at it:

import pandas as pd
df = pd.read_csv('results.csv')
df.head()

Where:

date - date of the match
home_team - the name of the home team
away_team - the name of the away team
home_score - full-time home team score including extra time, not including penalty-shootouts
away_score - full-time away team score including extra time, not including penalty-shootouts
tournament - the name of the tournament
city - the name of the city/town/administrative unit where the match was played
country - the name of the country where the match was played
neutral - TRUE/FALSE column indicating whether the match was played at a neutral venue
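Before building anything on top of these columns, it can help to check how pandas actually reads them. The snippet below is only a sketch: it uses a tiny in-memory CSV with made-up rows (the team, city and country names are placeholders, not real matches) that has the same header as results.csv, and it assumes you want the date column parsed as a datetime rather than kept as a string.

```python
import io
import pandas as pd

# Tiny in-memory stand-in for results.csv (made-up rows, for illustration only)
sample_csv = io.StringIO(
    "date,home_team,away_team,home_score,away_score,tournament,city,country,neutral\n"
    "2000-01-01,Team A,Team B,2,1,Friendly,City X,Country A,False\n"
    "2000-06-15,Team B,Team C,0,0,Friendly,City Y,Country D,True\n"
)

# parse_dates turns the date column into datetime64 instead of plain strings
sample = pd.read_csv(sample_csv, parse_dates=["date"])

print(sample.dtypes)        # column types as pandas inferred them
print(sample.isna().sum())  # number of missing values per column
```

With the real file you would pass 'results.csv' instead of the in-memory buffer. Having real datetimes makes it easier to sort or slice matches by period later on.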

Now, the idea is to create a series of interactions with this dataset that highlight the features of the teams or tournaments we are interested in. There is plenty of information we can obtain, and this article is not meant to be an exhaustive list of it. Nevertheless, one of the advantages of Streamlit is that an app can be modified continuously, so you can make your basic framework much more complex over time without throwing away what you have built so far.

As in my previous article, I’m attaching the entire code here:

import streamlit as st
import pandas as pd
import numpy as np
import plotly.express as px
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.graph_objects as go
from plotly.subplots import make_subplots


st.title('International Football matches')
df = pd.read_csv("results.csv")

if st.checkbox('Show dataframe'):
    st.write(df)

st.subheader('Filtering dataset per team')

teams = st.multiselect('Pick your teams', df['home_team'].unique())

new_df = df[(df['home_team'].isin(teams)) | (df['away_team'].isin(teams)) ]  

if st.checkbox('Show only home matches'):
    st.write(df[(df['home_team'].isin(teams))])

if st.checkbox('Show only away matches'):
    st.write(df[(df['away_team'].isin(teams))])

if st.checkbox('Show entire dataset'):    
    st.write(new_df)
    
st.subheader('Filtering dataset per event')
events = st.multiselect('Pick your events', df['tournament'].unique())
new_df_event = new_df[(new_df['tournament'].isin(events))]
st.write(new_df_event) 
            
st.subheader('Showing wins, losses and draws per team')

team_wins = st.selectbox('Pick your teams', df['home_team'].unique()) 
new_df_wins = df[(df['home_team']==team_wins)|(df['away_team']==team_wins)]
new_df_wins=new_df_wins.reset_index(drop=True)
    
    
    
wins = 0
losses = 0
draw = 0
x = []    
    
for i in range(len(new_df_wins)):
    if new_df_wins['home_team'][i]==team_wins:
        if new_df_wins['home_score'][i]>new_df_wins['away_score'][i]:
            wins+=1
            x.append(1)
        elif new_df_wins['home_score'][i]<new_df_wins['away_score'][i]:
            losses+=1
            x.append(-1)
        else:
            draw +=1
            x.append(0)
    else:
        if new_df_wins['home_score'][i]<new_df_wins['away_score'][i]:
            wins+=1
            x.append(1)
        elif new_df_wins['home_score'][i]>new_df_wins['away_score'][i]:
            losses+=1
            x.append(-1)
        else:
            draw +=1
            x.append(0)
    
    
labels = ['Wins','Losses','Draws']
values = [wins, losses, draw]

fig = go.Figure(data=[go.Pie(labels=labels, values=values)])
st.plotly_chart(fig)






fig2 = go.Figure()

fig2.add_trace(go.Scatter(x=list(new_df_wins['date']), y=x))



# Add range slider
    
fig2.update_layout(
    xaxis=go.layout.XAxis(
        rangeselector=dict(
            buttons=list([
                dict(count=1,
                     label="1m",
                     step="month",
                     stepmode="backward"),
                dict(count=6,
                     label="6m",
                     step="month",
                     stepmode="backward"),
                dict(count=1,
                     label="YTD",
                     step="year",
                     stepmode="todate"),
                dict(count=1,
                     label="1y",
                     step="year",
                     stepmode="backward"),
                dict(step="all")
            ])
        ),
        rangeslider=dict(
            visible=True
        ),
        type="date"
    )
)

st.plotly_chart(fig2)





wins_h = 0
losses_h = 0
draw_h = 0
wins_a = 0
losses_a = 0
draw_a = 0
for i in range(len(new_df_wins)):
    if new_df_wins['home_team'][i]==team_wins:
        if new_df_wins['home_score'][i]>new_df_wins['away_score'][i]:
            wins_h+=1
        elif new_df_wins['home_score'][i]<new_df_wins['away_score'][i]:
            losses_h+=1
        else:
            draw_h+=1
for i in range(len(new_df_wins)):
    if not new_df_wins['home_team'][i]==team_wins:
        if new_df_wins['home_score'][i]<new_df_wins['away_score'][i]:
            wins_a+=1
        elif new_df_wins['home_score'][i]>new_df_wins['away_score'][i]:
            losses_a+=1
        else:
            draw_a +=1


values_home = [wins_h, losses_h, draw_h]
values_away = [wins_a, losses_a, draw_a]
fig3 = make_subplots(rows=1, cols=2, specs=[[{'type':'domain'}, {'type':'domain'}]])
fig3.add_trace(go.Pie(labels=labels, values=values_home, name="Home"),
              1, 1)
fig3.add_trace(go.Pie(labels=labels, values=values_away, name="Away"),
              1, 2)

fig3.update_layout(
    title_text="Wins, losses and draws home vs away",
    annotations=[dict(text='Home', x=0.18, y=0.5, font_size=20, showarrow=False),
                 dict(text='Away', x=0.82, y=0.5, font_size=20, showarrow=False)])

fig3.update_traces(hole=.4, hoverinfo="label+percent+name")
st.plotly_chart(fig3)

# 4 subplots to see whether playing at a neutral venue affects the results

wins_h_neutral = 0
losses_h_neutral = 0
draw_h_neutral = 0
wins_h_notneutral = 0
losses_h_notneutral = 0
draw_h_notneutral = 0

wins_a_neutral = 0
losses_a_neutral = 0
draw_a_neutral = 0
wins_a_notneutral = 0
losses_a_notneutral = 0
draw_a_notneutral = 0

for i in range(len(new_df_wins)):
    if new_df_wins['home_team'][i]==team_wins and new_df_wins['neutral'][i]:
        if new_df_wins['home_score'][i]>new_df_wins['away_score'][i]:
            wins_h_neutral+=1
        elif new_df_wins['home_score'][i]<new_df_wins['away_score'][i]:
            losses_h_neutral+=1
        else:
            draw_h_neutral+=1
            
            
for i in range(len(new_df_wins)):
    if new_df_wins['home_team'][i]==team_wins and not new_df_wins['neutral'][i]:
        if new_df_wins['home_score'][i]>new_df_wins['away_score'][i]:
            wins_h_notneutral+=1
        elif new_df_wins['home_score'][i]<new_df_wins['away_score'][i]:
            losses_h_notneutral+=1
        else:
            draw_h_notneutral+=1            
            
for i in range(len(new_df_wins)):
    if new_df_wins['home_team'][i]!=team_wins and new_df_wins['neutral'][i]:
        if new_df_wins['home_score'][i]<new_df_wins['away_score'][i]:
            wins_a_neutral+=1
        elif new_df_wins['home_score'][i]>new_df_wins['away_score'][i]:
            losses_a_neutral+=1
        else:
            draw_a_neutral +=1
            
for i in range(len(new_df_wins)):
    if new_df_wins['home_team'][i]!=team_wins and not new_df_wins['neutral'][i]:
        if new_df_wins['home_score'][i]<new_df_wins['away_score'][i]:
            wins_a_notneutral+=1
        elif new_df_wins['home_score'][i]>new_df_wins['away_score'][i]:
            losses_a_notneutral+=1
        else:
            draw_a_notneutral +=1            
            
            
            
values_home_neutral = [wins_h_neutral, losses_h_neutral, draw_h_neutral]
values_away_neutral = [wins_a_neutral, losses_a_neutral, draw_a_neutral]
values_home_notneutral = [wins_h_notneutral, losses_h_notneutral, draw_h_notneutral]
values_away_notneutral = [wins_a_notneutral, losses_a_notneutral, draw_a_notneutral]


fig4 = make_subplots(rows=2, cols=2, specs=[[{'type':'domain'}, {'type':'domain'}], [{'type':'domain'}, {'type':'domain'}]],
                    subplot_titles=['Home neutral', 'Away neutral', 'Home not neutral', 'Away not neutral'])
fig4.add_trace(go.Pie(labels=labels, values=values_home_neutral, name="Home neutral"),
              1, 1)
fig4.add_trace(go.Pie(labels=labels, values=values_away_neutral, name="Away neutral"),
              1, 2)
fig4.add_trace(go.Pie(labels=labels, values=values_home_notneutral, name="Home not neutral"),
              2, 1)
fig4.add_trace(go.Pie(labels=labels, values=values_away_notneutral, name="Away not neutral"),
              2, 2)

fig4.update_layout(title_text='Wins, losses and draws home vs away, neutral vs not neutral')

fig4.update_traces(hole=.4, hoverinfo="label+percent+name")
st.plotly_chart(fig4)   


#best performance
st.subheader('Best Performance')

t = []
for i in range(len(new_df_wins)):
    if new_df_wins['home_team'][i]==team_wins:
        t.append(new_df_wins['home_score'][i])
    else:
        t.append(new_df_wins['away_score'][i])
        
        
m = np.argmax(np.array(t), axis=0)
out = new_df_wins.iloc[m]
st.write(out)

I saved the content of my script in a file called soccer.py and then ran it from my terminal with streamlit run soccer.py.

Now let’s examine it piece by piece. First, after giving the user the option to view the entire dataset, I added some filters so that you can choose the team(s) you want to see.

st.title('International Football matches')
df = pd.read_csv("results.csv")

if st.checkbox('Show dataframe'):
    st.write(df)

st.subheader('Filtering dataset per team')

teams = st.multiselect('Pick your teams', df['home_team'].unique())

new_df = df[(df['home_team'].isin(teams)) | (df['away_team'].isin(teams)) ] 

Plus, you can decide whether to show only the matches your selected team(s) played at home or away:

if st.checkbox('Show only home matches'):
    st.write(df[(df['home_team'].isin(teams))])

if st.checkbox('Show only away matches'):
    st.write(df[(df['away_team'].isin(teams))])

if st.checkbox('Show entire dataset'):    
    st.write(new_df)

I also added a filter for the type of tournament:

st.subheader('Filtering dataset per event')
events = st.multiselect('Pick your events', df['tournament'].unique())
new_df_event = new_df[(new_df['tournament'].isin(events))]
st.write(new_df_event) 
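The two filters above can also be wrapped in a plain pandas function, which makes the logic easy to try outside Streamlit. This is only a sketch: filter_matches is a hypothetical helper name, the toy data is made up, and, unlike the app code, it assumes that an empty tournament selection should mean "no tournament filter" rather than an empty table.

```python
import pandas as pd

def filter_matches(df, teams, events=None):
    """Return matches involving any of `teams`, optionally restricted to `events`."""
    mask = df["home_team"].isin(teams) | df["away_team"].isin(teams)
    if events:  # assumption: an empty selection means "do not filter by tournament"
        mask &= df["tournament"].isin(events)
    return df[mask]

# Made-up toy data with the same column names as the dataset
toy = pd.DataFrame({
    "home_team": ["A", "B", "C"],
    "away_team": ["B", "C", "A"],
    "tournament": ["Friendly", "Cup", "Friendly"],
})

print(filter_matches(toy, ["A"]))           # every match involving team A
print(filter_matches(toy, ["B"], ["Cup"]))  # team B's matches in the Cup only
```

In the app, the lists returned by the two st.multiselect widgets could be passed straight in as teams and events.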
            

Nice, now let’s examine some features about wins and losses. The idea is that, once you have selected the team you are interested in, you are shown a series of information (mainly in graphic form) about its win/loss trends across time and tournaments.

A very basic computation we can do for a team is counting the total number of wins, losses and draws, and then showing our results with a pie chart:

team_wins = st.selectbox('Pick your teams', df['home_team'].unique()) 
new_df_wins = df[(df['home_team']==team_wins)|(df['away_team']==team_wins)]
new_df_wins=new_df_wins.reset_index(drop=True)
    
    
    
wins = 0
losses = 0
draw = 0
x = []    
    
for i in range(len(new_df_wins)):
    if new_df_wins['home_team'][i]==team_wins:
        if new_df_wins['home_score'][i]>new_df_wins['away_score'][i]:
            wins+=1
            x.append(1)
        elif new_df_wins['home_score'][i]<new_df_wins['away_score'][i]:
            losses+=1
            x.append(-1)
        else:
            draw +=1
            x.append(0)
    else:
        if new_df_wins['home_score'][i]<new_df_wins['away_score'][i]:
            wins+=1
            x.append(1)
        elif new_df_wins['home_score'][i]>new_df_wins['away_score'][i]:
            losses+=1
            x.append(-1)
        else:
            draw +=1
            x.append(0)
    
    
labels = ['Wins','Losses','Draws']
values = [wins, losses, draw]

fig = go.Figure(data=[go.Pie(labels=labels, values=values)])
st.plotly_chart(fig)
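The same counts can be obtained without an explicit loop. The sketch below is an alternative under stated assumptions, not the code used in the app: match_outcomes is a hypothetical helper, the toy scores are made up, and the function expects a table that already contains only the chosen team's matches (like new_df_wins).

```python
import numpy as np
import pandas as pd

def match_outcomes(matches, team):
    """Return a Series of 1 (win), -1 (loss), 0 (draw) from `team`'s point of view."""
    diff = matches["home_score"] - matches["away_score"]
    # Keep the goal difference when the team played at home, flip its sign otherwise
    signed = diff.where(matches["home_team"] == team, -diff)
    return np.sign(signed).astype(int)

# Made-up toy matches, all involving team A
toy = pd.DataFrame({
    "home_team": ["A", "B", "A", "C"],
    "away_team": ["B", "A", "C", "A"],
    "home_score": [2, 3, 1, 0],
    "away_score": [1, 1, 1, 2],
})

outcomes = match_outcomes(toy, "A")
# Order the counts as wins, losses, draws to match the pie chart labels
counts = outcomes.value_counts().reindex([1, -1, 0], fill_value=0)
print(outcomes.tolist())  # [1, -1, 0, 1]
print(counts.tolist())    # [2, 1, 1]
```

The outcomes Series plays the role of the list x, and counts.tolist() could be passed as the values of the pie chart.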

We can also have a look at the historical path of this data. One way to visualize some meaningful information is to display a time series whose value is 1 if the team won that match, -1 if it lost and 0 if it drew. By doing so, if there are periods where the trend is flat at 1 (that is, the team we are examining won many matches in a row), we might want to investigate that period further (namely, the name of the coach, whether the team played in a specific tournament or not, whether it played at home or not…).

So let’s plot our time series (I already stored the values in a list x in my previous code):

fig2 = go.Figure()

fig2.add_trace(go.Scatter(x=list(new_df_wins['date']), y=x))



# Add range slider
    
fig2.update_layout(
    xaxis=go.layout.XAxis(
        rangeselector=dict(
            buttons=list([
                dict(count=1,
                     label="1m",
                     step="month",
                     stepmode="backward"),
                dict(count=6,
                     label="6m",
                     step="month",
                     stepmode="backward"),
                dict(count=1,
                     label="YTD",
                     step="year",
                     stepmode="todate"),
                dict(count=1,
                     label="1y",
                     step="year",
                     stepmode="backward"),
                dict(step="all")
            ])
        ),
        rangeslider=dict(
            visible=True
        ),
        type="date"
    )
)

st.plotly_chart(fig2)

As you can see, I added interactive widgets (sliders and buttons) so that you can focus on relevant periods.
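Flat stretches at 1 can also be found numerically instead of by eye. As a rough sketch (longest_win_streak is a hypothetical helper, not part of the app), the function below takes a sequence of 1/-1/0 values such as x and returns the length of the longest run of consecutive wins.

```python
import pandas as pd

def longest_win_streak(outcomes):
    """Length of the longest run of consecutive 1s in a sequence of 1/-1/0 values."""
    s = pd.Series(list(outcomes), dtype="int64")
    if s.empty:
        return 0
    is_win = s.eq(1)
    # Every non-win starts a new group, so each group holds at most one run of wins
    run_id = (~is_win).cumsum()
    # Summing the booleans inside a group gives the length of its run of wins
    return int(is_win.groupby(run_id).sum().max())

print(longest_win_streak([1, 1, -1, 1, 1, 1, 0]))  # 3
```

A natural next step would be to return the start and end dates of that run, so the range slider can be pointed at it.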

Now, something that might be relevant when investigating win/loss trends is whether the location of the match (in terms of home/away) affects its result. For this purpose, let’s first split our wins/losses/draws between those that occurred at home and those that occurred away:

wins_h = 0
losses_h = 0
draw_h = 0
wins_a = 0
losses_a = 0
draw_a = 0
for i in range(len(new_df_wins)):
    if new_df_wins['home_team'][i]==team_wins:
        if new_df_wins['home_score'][i]>new_df_wins['away_score'][i]:
            wins_h+=1
        elif new_df_wins['home_score'][i]<new_df_wins['away_score'][i]:
            losses_h+=1
        else:
            draw_h+=1
for i in range(len(new_df_wins)):
    if not new_df_wins['home_team'][i]==team_wins:
        if new_df_wins['home_score'][i]<new_df_wins['away_score'][i]:
            wins_a+=1
        elif new_df_wins['home_score'][i]>new_df_wins['away_score'][i]:
            losses_a+=1
        else:
            draw_a +=1


values_home = [wins_h, losses_h, draw_h]
values_away = [wins_a, losses_a, draw_a]
fig3 = make_subplots(rows=1, cols=2, specs=[[{'type':'domain'}, {'type':'domain'}]])
fig3.add_trace(go.Pie(labels=labels, values=values_home, name="Home"),
              1, 1)
fig3.add_trace(go.Pie(labels=labels, values=values_away, name="Away"),
              1, 2)

fig3.update_layout(
    title_text="Wins, losses and draws home vs away",
    annotations=[dict(text='Home', x=0.18, y=0.5, font_size=20, showarrow=False),
                 dict(text='Away', x=0.82, y=0.5, font_size=20, showarrow=False)])

fig3.update_traces(hole=.4, hoverinfo="label+percent+name")
st.plotly_chart(fig3)

As you can see, for our team of interest (Scotland), there is clear evidence that most of the wins occurred while playing at home. We can investigate this relation further by also considering the neutrality of the venue:

wins_h_neutral = 0
losses_h_neutral = 0
draw_h_neutral = 0
wins_h_notneutral = 0
losses_h_notneutral = 0
draw_h_notneutral = 0

wins_a_neutral = 0
losses_a_neutral = 0
draw_a_neutral = 0
wins_a_notneutral = 0
losses_a_notneutral = 0
draw_a_notneutral = 0

for i in range(len(new_df_wins)):
    if new_df_wins['home_team'][i]==team_wins and new_df_wins['neutral'][i]:
        if new_df_wins['home_score'][i]>new_df_wins['away_score'][i]:
            wins_h_neutral+=1
        elif new_df_wins['home_score'][i]<new_df_wins['away_score'][i]:
            losses_h_neutral+=1
        else:
            draw_h_neutral+=1
            
            
for i in range(len(new_df_wins)):
    if new_df_wins['home_team'][i]==team_wins and not new_df_wins['neutral'][i]:
        if new_df_wins['home_score'][i]>new_df_wins['away_score'][i]:
            wins_h_notneutral+=1
        elif new_df_wins['home_score'][i]<new_df_wins['away_score'][i]:
            losses_h_notneutral+=1
        else:
            draw_h_notneutral+=1            
            
for i in range(len(new_df_wins)):
    if new_df_wins['home_team'][i]!=team_wins and new_df_wins['neutral'][i]:
        if new_df_wins['home_score'][i]<new_df_wins['away_score'][i]:
            wins_a_neutral+=1
        elif new_df_wins['home_score'][i]>new_df_wins['away_score'][i]:
            losses_a_neutral+=1
        else:
            draw_a_neutral +=1
            
for i in range(len(new_df_wins)):
    if new_df_wins['home_team'][i]!=team_wins and not new_df_wins['neutral'][i]:
        if new_df_wins['home_score'][i]<new_df_wins['away_score'][i]:
            wins_a_notneutral+=1
        elif new_df_wins['home_score'][i]>new_df_wins['away_score'][i]:
            losses_a_notneutral+=1
        else:
            draw_a_notneutral +=1            
            
            
            
values_home_neutral = [wins_h_neutral, losses_h_neutral, draw_h_neutral]
values_away_neutral = [wins_a_neutral, losses_a_neutral, draw_a_neutral]
values_home_notneutral = [wins_h_notneutral, losses_h_notneutral, draw_h_notneutral]
values_away_notneutral = [wins_a_notneutral, losses_a_notneutral, draw_a_notneutral]


fig4 = make_subplots(rows=2, cols=2, specs=[[{'type':'domain'}, {'type':'domain'}], [{'type':'domain'}, {'type':'domain'}]],
                    subplot_titles=['Home neutral', 'Away neutral', 'Home not neutral', 'Away not neutral'])
fig4.add_trace(go.Pie(labels=labels, values=values_home_neutral, name="Home neutral"),
              1, 1)
fig4.add_trace(go.Pie(labels=labels, values=values_away_neutral, name="Away neutral"),
              1, 2)
fig4.add_trace(go.Pie(labels=labels, values=values_home_notneutral, name="Home not neutral"),
              2, 1)
fig4.add_trace(go.Pie(labels=labels, values=values_away_notneutral, name="Away not neutral"),
              2, 2)

fig4.update_layout(title_text='Wins, losses and draws home vs away, neutral vs not neutral')

fig4.update_traces(hole=.4, hoverinfo="label+percent+name")
st.plotly_chart(fig4)   
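The twelve counters above can be condensed into a single cross-tabulation. The following is a sketch rather than a drop-in replacement: outcome_table is a hypothetical helper, the toy matches are made up, and it assumes the table contains only the chosen team's matches and that neutral holds real booleans.

```python
import numpy as np
import pandas as pd

def outcome_table(matches, team):
    """Count wins/losses/draws for `team`, split by home/away and neutral venue."""
    is_home = matches["home_team"] == team
    diff = matches["home_score"] - matches["away_score"]
    # Goal difference from the team's point of view
    signed = diff.where(is_home, -diff)
    result = np.sign(signed).map({1: "Wins", -1: "Losses", 0: "Draws"})
    side = is_home.map({True: "Home", False: "Away"})
    venue = matches["neutral"].map({True: "neutral", False: "not neutral"})
    table = pd.crosstab([side, venue], result,
                        rownames=["side", "venue"], colnames=["result"])
    # Make sure all three outcome columns exist, in the same order as the pie labels
    return table.reindex(columns=["Wins", "Losses", "Draws"], fill_value=0)

# Made-up toy matches, all involving team A
toy = pd.DataFrame({
    "home_team": ["A", "A", "B", "B"],
    "away_team": ["B", "C", "A", "A"],
    "home_score": [2, 0, 1, 1],
    "away_score": [0, 0, 3, 1],
    "neutral": [False, True, False, True],
})

table = outcome_table(toy, "A")
print(table)
```

Each row of the resulting table corresponds to one of the four pies, so a row's values could be passed to go.Pie directly. A combination that never occurs in the data simply has no row.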

Finally, we can retrieve the best performance of our team (here, I considered the best performance to be the match in which my team scored the highest number of goals).

st.subheader('Best Performance')

t = []
for i in range(len(new_df_wins)):
    if new_df_wins['home_team'][i]==team_wins:
        t.append(new_df_wins['home_score'][i])
    else:
        t.append(new_df_wins['away_score'][i])
        
        
m = np.argmax(np.array(t), axis=0)
out = new_df_wins.iloc[m]
st.write(out)
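One thing to keep in mind is that np.argmax returns only the first position of the maximum, so if the team reached its top score in several matches, only the earliest one is shown. A possible variant that keeps all of them is sketched below; best_performance is a hypothetical helper and the toy scores are made up.

```python
import pandas as pd

def best_performance(matches, team):
    """Return every match in which `team` scored its highest number of goals."""
    # Goals scored by the team: home_score when it played at home, away_score otherwise
    goals = matches["home_score"].where(
        matches["home_team"] == team, matches["away_score"]
    )
    return matches[goals == goals.max()]

# Made-up toy matches, all involving team A
toy = pd.DataFrame({
    "home_team": ["A", "B", "A"],
    "away_team": ["B", "A", "C"],
    "home_score": [3, 0, 1],
    "away_score": [0, 3, 1],
})

best = best_performance(toy, "A")
print(best)  # team A scored 3 goals in both of the first two matches
```

In the app, the returned rows could be displayed with st.write exactly like out.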


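st.subheader('Comparing 2 teams')
# Distinct labels keep Streamlit from treating the two selectboxes as duplicate widgets
team1 = st.selectbox('Pick the first team', df['home_team'].unique())
team2 = st.selectbox('Pick the second team', df['home_team'].unique())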

So we collected relevant information about our team of interest with intuitive widgets, mainly relying on a graphical representation of the data. Here, I focused on the analysis of one team. In my next articles, I will propose further extensions of this code, starting with comparing and inspecting the matches of two teams and then proceeding with some predictions.

So stay tuned for next steps!

Published by valentinaalto

I'm a 22-year-old student based in Milan, passionate about everything related to Statistics, Data Science and Machine Learning. I'm eager to learn new concepts and techniques as well as share them with whoever is interested in the topic.
