In this section, we examine several statistics of our large graph. This preliminary analysis helps us understand the properties of the graph and therefore decide how to explore it.
Let’s start with the number of nodes and edges:
The first thing to notice is that the graph is very sparse:
Sparsity = $\frac{|E|}{|E_{max}|}$ = 0.01%
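The ratio above can be computed directly from the node and edge counts. The sketch below assumes an undirected simple graph (no self-loops, no parallel edges), so that $|E_{max}| = n(n-1)/2$. The helper name `sparsity` and the toy numbers are illustrative only and are not taken from our graph.

```python
def sparsity(n_nodes: int, n_edges: int) -> float:
    """Return |E| / |E_max| for an undirected simple graph."""
    if n_nodes < 2:
        return 0.0  # no edge is possible with fewer than two nodes
    max_edges = n_nodes * (n_nodes - 1) // 2
    return n_edges / max_edges

# Toy example (not the real graph): 5 nodes, 4 edges out of 10 possible.
print(f"{sparsity(5, 4):.2%}")
```

With NetworkX, `G.number_of_nodes()` and `G.number_of_edges()` provide the two counts, and `nx.density(G)` returns the same ratio for an undirected graph.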
Since we want to cluster the graph, we take a look at its connected components.
It looks like we cannot rely on connected components alone.
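Whether connected components are usable as clusters depends on their sizes. The sketch below lists component sizes with a breadth-first search. It works on plain Python lists so that it runs without our graph. The function name `component_sizes` and the toy graph are assumptions made for illustration.

```python
from collections import deque

def component_sizes(nodes, edges):
    """Return the sizes of the connected components, largest first."""
    adjacency = {node: set() for node in nodes}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)

    seen = set()
    sizes = []
    for start in adjacency:
        if start in seen:
            continue
        # Breadth-first search collects every node reachable from `start`.
        seen.add(start)
        queue = deque([start])
        size = 0
        while queue:
            node = queue.popleft()
            size += 1
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        sizes.append(size)
    return sorted(sizes, reverse=True)

# Toy graph (illustrative only): a path of three nodes, a pair, and an isolated node.
sizes = component_sizes(nodes=[1, 2, 3, 4, 5, 6], edges=[(1, 2), (2, 3), (4, 5)])
print(sizes)
```

On a NetworkX graph, the same list can be obtained with `sorted((len(c) for c in nx.connected_components(G)), reverse=True)`.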
Who are the main speakers in our graph? Are people highly connected? Let’s figure this out!
from collections import Counter

import matplotlib.pyplot as plt

def plot_degree_distribution(G):
    """Plot the degree histogram of G twice: linear x-axis, then log x-axis."""
    # Count how many nodes have each degree.
    degrees = Counter(degree for _, degree in G.degree())
    sorted_degree = sorted(degrees.items())
    deg = [k for k, _ in sorted_degree]
    cnt = [v for _, v in sorted_degree]

    # log=True puts the frequency (y) axis on a log scale in both plots;
    # the second plot also puts the degree (x) axis on a log scale.
    for log_x in (False, True):
        fig, ax = plt.subplots()
        ax.bar(deg, cnt, color='b', log=True, ec='b', align='edge')
        ax.set_title("Degree Distribution (log-scale)" if log_x else "Degree Distribution")
        ax.set_ylabel("Frequency")
        ax.set_xlabel("Degree")
        if log_x:
            ax.set_xscale('log')
        plt.show()

plot_degree_distribution(G)
Most speakers have a very low degree, but a few have a very high degree. The degree distribution appears to follow a power law, which is typical of real-world networks.
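A straight line on a log-log plot is only suggestive of a power law. As a rough complement, the sketch below estimates the exponent $\alpha$ of $p(k) \propto k^{-\alpha}$ with the continuous maximum-likelihood estimator $\hat{\alpha} = 1 + n / \sum_i \ln(k_i / k_{min})$. The function name, the choice of `k_min`, and the use of a continuous approximation for integer degrees are all assumptions, not part of the analysis above.

```python
import math

def power_law_exponent(degrees, k_min=1):
    """Continuous MLE of alpha over degrees >= k_min (an approximation for integers)."""
    if k_min <= 0:
        raise ValueError("k_min must be positive")
    tail = [k for k in degrees if k >= k_min]
    log_sum = sum(math.log(k / k_min) for k in tail)
    if log_sum == 0:
        raise ValueError("need at least one degree strictly above k_min")
    return 1 + len(tail) / log_sum

# Toy degree sequence (illustrative only): many low degrees and one hub.
toy_degrees = [1, 1, 1, 1, 2, 2, 3, 5, 8, 40]
print(round(power_law_exponent(toy_degrees), 2))
```

On the real graph this could be called as `power_law_exponent([d for _, d in G.degree()])`. Nodes of degree 0 are dropped by the `k_min` filter.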
Who are these highly connected speakers?
from operator import itemgetter

# Rank speakers by degree (number of neighbours), highest first.
degrees = dict(G.degree())
sorted_degree = sorted(degrees.items(), key=itemgetter(1), reverse=True)
for speaker, degree in sorted_degree[:10]:
    print(G.nodes[speaker]['name'], 'is linked to', degree, 'people')
Donald Trump is linked to 2570 people
Narendra Modi is linked to 812 people
Emmanuel Macron is linked to 752 people
Nancy Pelosi is linked to 733 people
Mike Pompeo is linked to 718 people
Boris Johnson is linked to 692 people
Andrew Cuomo is linked to 690 people
Benjamin Netanyahu is linked to 669 people
António Guterres is linked to 646 people
Justin Trudeau is linked to 620 people
The 10 highest-degree speakers in our graph are all well-known political figures, most of them heads of state or government. This is what we would expect, in particular that Donald Trump is the most central.
Are there obvious and interpretable clusters?
We use the Louvain clustering method to check whether we can identify interpretable clusters:
The number of clusters is too large for us to interpret each group by hand, so we focus on speaker attributes instead.
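For reference, Louvain works by greedily maximizing modularity. Modularity compares the share of edges inside each cluster with the share expected from the cluster's total degree under random wiring. The sketch below computes this score for a given partition of an unweighted, undirected graph. It is not the Louvain algorithm itself, and the function name and toy graph are hypothetical.

```python
def modularity(edges, communities):
    """Newman modularity Q of a partition of an unweighted, undirected graph."""
    membership = {node: idx for idx, group in enumerate(communities) for node in group}
    m = len(edges)
    internal = [0] * len(communities)    # edges with both endpoints in the community
    degree_sum = [0] * len(communities)  # total degree of the community's nodes
    for u, v in edges:
        cu, cv = membership[u], membership[v]
        degree_sum[cu] += 1
        degree_sum[cv] += 1
        if cu == cv:
            internal[cu] += 1
    return sum(inside / m - (deg / (2 * m)) ** 2
               for inside, deg in zip(internal, degree_sum))

# Toy graph (illustrative only): two triangles joined by a single edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
print(modularity(edges, [{0, 1, 2}, {3, 4, 5}]))  # one cluster per triangle
print(modularity(edges, [{0, 1, 2, 3, 4, 5}]))    # everything in a single cluster
```

In NetworkX 2.8 or later, `networkx.community.louvain_communities(G)` is one available implementation of the full method. Whether it is the one used here is not stated.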
Which speaker attributes could be useful to filter on? We compute the homophily with respect to gender, nationality and political party.
Homophily estimates how similar connected speakers are with respect to a given attribute.
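The exact estimator is not spelled out here, so the following is a sketch of one common choice, given as an assumption. It computes the fraction of edges linking two speakers with the same attribute value, together with the fraction expected if two speakers were drawn at random. The function name and the toy data are hypothetical.

```python
from collections import Counter

def homophily(edges, attribute):
    """Return (observed, baseline) share of same-attribute links.

    edges: iterable of (u, v) pairs; attribute: dict mapping node -> value.
    Edges with a missing attribute on either end are ignored.
    """
    known = [(u, v) for u, v in edges if u in attribute and v in attribute]
    if not known:
        raise ValueError("no edge has the attribute on both endpoints")
    same = sum(attribute[u] == attribute[v] for u, v in known)
    observed = same / len(known)

    # Chance level: probability that two nodes drawn at random share a value.
    counts = Counter(attribute.values())
    total = sum(counts.values())
    baseline = sum((c / total) ** 2 for c in counts.values())
    return observed, baseline

# Toy data (illustrative only, not taken from the real graph).
nationality = {"a": "US", "b": "US", "c": "FR", "d": "FR"}
toy_edges = [("a", "b"), ("b", "c"), ("c", "d")]
observed, baseline = homophily(toy_edges, nationality)
print(observed, baseline)
```

An attribute is informative for clustering when the observed share is well above the baseline. NetworkX offers a related measure, `nx.attribute_assortativity_coefficient(G, attribute)`.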
Homophily results:
These results show that nationality is a good attribute for observing clusters. Indeed, on the 3D graph we can clearly distinguish clusters of speakers with the same nationality.