In my previous article (Part 1 of this series), I implemented some visualization tools for a meaningful exploratory analysis. Then, with the Python package Streamlit, I made them interactive in the form of a web app.

In this article, I’m going to continue working on the same dataset as before, this time focusing on the interaction between two teams. I will keep using Plotly as the visualization tool, since its graphs are interactive and make it easy to read off relevant information. Since I won’t repeat the code from my previous article, if you are new to Streamlit I strongly recommend reading it before starting.

Now, as anticipated, I want to dwell on the matches between two teams of interest. So, let’s start by filtering our initial dataset (available here) according to the user’s multiselect choice:

import streamlit as st
import pandas as pd
import numpy as np
import plotly.express as px
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.graph_objects as go
from plotly.subplots import make_subplots


st.title('International Football matches')
df = pd.read_csv("results.csv")

st.subheader('Comparing 2 teams')
teams_to_compare = st.multiselect('Pick your teams', df['home_team'].unique())

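# Keep only the matches where both the home and the away team are among the selected ones
comparison = df[(df['home_team'].isin(teams_to_compare)) & (df['away_team'].isin(teams_to_compare))]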
comparison = comparison.reset_index(drop=True)
st.write(comparison)
st.write('Number of matches: ', len(comparison))

The object ‘teams_to_compare’ is expected to be a list of two teams, and I’m interested in analyzing those matches where the two teams played against each other (regardless of which one was playing at home). Then, I asked my app to show me the new filtered dataset together with the number of matches:

Here, I’m interested in all the England vs Scotland matches, and this is what my final dataset looks like.
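Note that the filter only makes sense when exactly two teams are selected. Here is a minimal sketch of that guard, written as a plain pandas helper so it can be tried outside Streamlit. The function name `head_to_head` and the toy rows are assumptions made for illustration; they are not part of the original app.

```python
import pandas as pd


def head_to_head(df, teams):
    """Return the matches in which the two given teams faced each other."""
    if len(teams) != 2:
        # Hypothetical guard: the original app does not validate the selection.
        raise ValueError("Pick exactly two teams")
    mask = df["home_team"].isin(teams) & df["away_team"].isin(teams)
    return df[mask].reset_index(drop=True)


# Toy data with the same column names as results.csv (values are made up).
toy = pd.DataFrame({
    "home_team": ["Scotland", "England", "England", "Wales"],
    "away_team": ["England", "Scotland", "Wales", "Scotland"],
    "home_score": [0, 2, 1, 3],
    "away_score": [0, 1, 1, 0],
})

pair = head_to_head(toy, ["England", "Scotland"])
print(len(pair))  # 2
```

In the app itself, the same check could run right after the multiselect, for instance by showing a message with `st.warning` and halting the script with `st.stop()` until two teams are picked.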

Now let’s perform some analytics on these two teams.

First, I want to know which is the match with the highest intensity of play, which I decided to quantify as the total number of goals. So, I created a new Pandas series as the sum of the two ‘scores’ columns and then computed the index of the maximum value of that series.

st.subheader('Highest intensity of play')

out_c = comparison.iloc[np.argmax(np.array(comparison['home_score']+comparison['away_score']))]
st.write(out_c)

So, the match with the most goals was the British Championship match of 4/15/1961. With the same reasoning, you can investigate any kind of performance. For example, you can ask the app to display the match with the largest gap in score between the two teams.
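As a rough sketch of that last idea, the largest score gap can be found with the absolute goal difference and `idxmax`. The small dataframe below is a made-up stand-in for `comparison`; only the column names match the dataset.

```python
import pandas as pd

# Toy stand-in for the filtered `comparison` dataframe (values are made up).
comparison = pd.DataFrame({
    "date": ["1900-01-01", "1950-01-01", "2000-01-01"],
    "home_team": ["Scotland", "England", "Scotland"],
    "away_team": ["England", "Scotland", "England"],
    "home_score": [4, 7, 1],
    "away_score": [1, 2, 1],
})

# Absolute goal difference per match, then the row label of the largest one.
gap = (comparison["home_score"] - comparison["away_score"]).abs()
widest = comparison.loc[gap.idxmax()]
print(widest["date"], gap.max())  # 1950-01-01 5
```

In the app, `st.write(widest)` would display the row just like the highest-intensity match above. If several matches tie for the largest gap, `idxmax` returns the first one.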

Now, I want to visualize the proportion of wins, losses and draws between my teams. For this purpose, I will use a Plotly pie chart:

team1_w = 0
team2_w = 0
teams_draw=0
team1_cum=[]
team2_cum=[]


for i in range(len(comparison)):
    if comparison['home_team'][i]==teams_to_compare[0]:
        if comparison['home_score'][i]>comparison['away_score'][i]:
            team1_w+=1
            team1_cum.append(1)
            team2_cum.append(0)
        elif comparison['home_score'][i]<comparison['away_score'][i]:
            team2_w+=1
            team1_cum.append(0)
            team2_cum.append(1)
        else:
            teams_draw+=1
            team1_cum.append(0)
            team2_cum.append(0)
    else:
        if comparison['home_score'][i]<comparison['away_score'][i]:
            team1_w+=1
            team1_cum.append(1)
            team2_cum.append(0)
        elif comparison['home_score'][i]>comparison['away_score'][i]:
            team2_w+=1
            team1_cum.append(0)
            team2_cum.append(1)
        else:
            teams_draw+=1
            team1_cum.append(0)
            team2_cum.append(0)
            
            
            
comparison_labels = ['Team 1 wins','Team 2 wins','Draws']
comparison_values = [team1_w, team2_w, teams_draw]

fig5 = go.Figure(data=[go.Pie(labels=comparison_labels, values=comparison_values)])
st.plotly_chart(fig5) 
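The same counters can also be computed without an explicit loop. The following is a minimal sketch under the assumption that `teams_to_compare[0]` is team 1, as in the loop; the toy rows are invented and the intermediate names `diff`, `team1_home` and `team1_diff` are introduced here only for illustration.

```python
import numpy as np
import pandas as pd

teams_to_compare = ["Scotland", "England"]  # team 1, team 2

# Toy stand-in for the filtered `comparison` dataframe (values are made up).
comparison = pd.DataFrame({
    "home_team": ["Scotland", "England", "England", "Scotland"],
    "away_team": ["England", "Scotland", "Scotland", "England"],
    "home_score": [2, 3, 1, 0],
    "away_score": [1, 0, 1, 4],
})

# Goal difference seen from team 1's side: flip the sign when team 1 is away.
diff = comparison["home_score"] - comparison["away_score"]
team1_home = comparison["home_team"] == teams_to_compare[0]
team1_diff = np.where(team1_home, diff, -diff)

team1_cum = (team1_diff > 0).astype(int)  # 1 where team 1 won, else 0
team2_cum = (team1_diff < 0).astype(int)  # 1 where team 2 won, else 0

team1_w = int(team1_cum.sum())
team2_w = int(team2_cum.sum())
teams_draw = int((team1_diff == 0).sum())
print(team1_w, team2_w, teams_draw)  # 1 2 1
```

The resulting `team1_cum` and `team2_cum` arrays hold the same 0/1 flags as the lists built in the loop, so they can be passed to `np.cumsum` in the same way.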

In the loop above, I also defined two lists, team1_cum and team2_cum, which record match by match whether each team won, so that I can inspect how the wins of my two teams accumulate over time. So let’s build a line chart with buttons and sliders:

st.subheader('Cumulative wins of two teams')

fig6 = go.Figure()

# Use the match dates of the filtered dataframe as the x axis
fig6.add_trace(go.Scatter(x=list(comparison['date']), y=np.cumsum(np.array(team1_cum)), name='team 1'))
fig6.add_trace(go.Scatter(x=list(comparison['date']), y=np.cumsum(np.array(team2_cum)), name='team 2'))


# Add range slider
    
fig6.update_layout(
    xaxis=go.layout.XAxis(
        rangeselector=dict(
            buttons=list([
                dict(count=1,
                     label="1m",
                     step="month",
                     stepmode="backward"),
                dict(count=6,
                     label="6m",
                     step="month",
                     stepmode="backward"),
                dict(count=1,
                     label="YTD",
                     step="year",
                     stepmode="todate"),
                dict(count=1,
                     label="1y",
                     step="year",
                     stepmode="backward"),
                dict(step="all")
            ])
        ),
        rangeslider=dict(
            visible=True
        ),
        type="date"
    )
)

st.plotly_chart(fig6)

Note: the pie chart suggested that team 2 (England) won the majority of matches against team 1 (Scotland). So why does the line chart above suggest that, for most of the time span, Scotland was ahead of England? The reason lies in the dataset: England and Scotland played the majority of their matches after 1910, so the later results weigh more in the overall count, and the two charts are consistent with each other.

Furthermore, this graph is informative in its own right. Indeed, we see that up to 1910 (more or less), Scotland was consistently ahead of England. What was the reason for this trend reversal? One might be interested in focusing on this specific period:
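One way to look at this numerically is to split the matches at a cutoff year and count the outcomes on each side. The sketch below uses invented rows and assumes 1910 as the cutoff, read off the chart by eye. The `winner` and `period` columns are hypothetical helpers, not columns of the dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the filtered `comparison` dataframe (values are made up).
comparison = pd.DataFrame({
    "date": ["1880-01-01", "1890-01-01", "1900-01-01",
             "1920-01-01", "1950-01-01", "1980-01-01"],
    "home_team": ["Scotland", "England", "Scotland",
                  "England", "Scotland", "England"],
    "away_team": ["England", "Scotland", "England",
                  "Scotland", "England", "Scotland"],
    "home_score": [3, 0, 2, 2, 1, 1],
    "away_score": [1, 2, 2, 0, 3, 1],
})
comparison["date"] = pd.to_datetime(comparison["date"])

# Winner of each match: home team if it scored more, away team if it scored
# fewer, and the label "Draw" when the scores are equal.
diff = comparison["home_score"] - comparison["away_score"]
winner = comparison["home_team"].where(diff > 0, comparison["away_team"])
comparison["winner"] = winner.where(diff != 0, "Draw")

# Label each match by the side of the (assumed) 1910 cutoff it falls on.
comparison["period"] = np.where(
    comparison["date"].dt.year < 1910, "before 1910", "from 1910"
)

# Rows are periods, columns are outcomes, cells are match counts.
summary = pd.crosstab(comparison["period"], comparison["winner"])
print(summary)
```

On the real data, a table like this would show whether the reversal seen in the chart is also visible in the raw win counts for each period.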

There are two further elements I want to retrieve. First, I want to see how many times those matches have been played in each city. To do so, I will build a bar chart which displays, for each city, how many times that city occurs in my filtered dataset:

st.subheader('Frequency of city of matches')

# Count matches per city and sort, keeping each city aligned with its count
city_counts = comparison['city'].value_counts().sort_values()
cities = city_counts.index.values
occurrences = city_counts.values


fig7 = go.Figure(go.Bar(
            x=occurrences,
            y=cities,
            orientation='h'))


st.plotly_chart(fig7)

Second, I want to collect some information about the types of tournament. The idea is to plot a bubble chart whose x and y coordinates are the home and away scores, where the size represents the intensity of play of that match (sum of goals) and the color represents the type of tournament. Plus, in order to know which of my teams was playing at home, I will set the home team as hover_name, so that it is displayed at the top of the label that appears when hovering over each bubble.

st.subheader('Tournament information')

comparison['challenge'] = np.array(comparison['home_score'] + comparison['away_score'])
fig8 = px.scatter(comparison, x="home_score", y="away_score",
                  size="challenge", color="tournament",
                  hover_name="home_team")

st.plotly_chart(fig8) 

A first glance at this graph shows that the matches with the highest number of goals seem to be those of the British Championship. Finally, let’s combine this information with the frequency of each type of tournament for the selected pair of teams:

tour = st.selectbox('Select a tournament', comparison['tournament'].unique())

comparison_t = comparison[comparison['tournament']==tour] 
per = len(comparison_t)/len(comparison)

st.write(f"{round(per*100,2)}% of matches between the 2 teams have been played as {tour} matches")

So the British Championship hosted not only the highest-intensity matches, but also the highest number of matches between England and Scotland.
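Both observations can be read from a single table instead of selecting tournaments one by one. Here is a minimal sketch with invented rows. The `challenge` column is built as in the bubble chart step, while the names `summary`, `matches`, `mean_goals` and `share_pct` are assumptions introduced for illustration.

```python
import pandas as pd

# Toy stand-in for the filtered `comparison` dataframe (values are made up).
comparison = pd.DataFrame({
    "tournament": ["British Championship", "British Championship",
                   "Friendly", "British Championship"],
    "home_score": [9, 2, 1, 0],
    "away_score": [3, 1, 0, 0],
})
comparison["challenge"] = comparison["home_score"] + comparison["away_score"]

# One row per tournament: number of matches and average goals per match.
summary = comparison.groupby("tournament").agg(
    matches=("challenge", "size"),
    mean_goals=("challenge", "mean"),
)
# Share of all head-to-head matches played in each tournament, in percent.
summary["share_pct"] = (100 * summary["matches"] / len(comparison)).round(2)
print(summary.sort_values("matches", ascending=False))
```

In the app, `st.write(summary)` would render this table next to the selectbox output.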

Again, as anticipated in my previous article, these are just a few of the analytics you can build on your dataset. It really depends on the information you need. Nevertheless, a first exploratory insight is always a good starting point, since it might provide new intuitions and perspectives for the analysis.

I hope you enjoyed the reading!

Published by valentinaalto

I'm a 22-year-old student based in Milan, passionate about everything related to Statistics, Data Science and Machine Learning. I'm eager to learn new concepts and techniques as well as share them with whoever is interested in the topic.
