We present our project on modeling a speaker network using Quotebank. The dataset is a corpus of 178 million quotations attributed to the speakers who uttered them, extracted from 162 million English news articles published between 2008 and 2020. We use a speaker attribute table from Wikidata to extract information on the individuals. This project aims to explore the relationships between people quoted in the Quotebank dataset. Specifically, we construct a graph based on co-quotation of speakers in the same articles, using years from 2015 to 2020. Visualizing these relationships can give us an understanding of the networks and communities that are behind quotes, such as professional domains and fields of expertise, political orientation, and even like-mindedness. Secondly we focus on specific case studies using our graph properties to understand links between speakers in those scenarios.
Click Here for Full Screen And Interactive Data Viz
In this section, we observe different statistics of our big graph. This preliminary analysis helps us to understand the properties of the graph and therefore get to know how to explore it.
Let’s start by the number of nodes and edges:
First thing to notice, the graph is very sparse: Sparsity = \(\frac{|E|}{|E_{max}|}\) = 0.01%
As we want to do clustering, we take a look at the connected components.
It looks like we can not rely on connected components only.
Who are the main speakers of our graph ? Are people very connected ? Let’s figure this out !
Most of the speakers have very low degrees, but some have very high degree. Indeed the degree distribution is following a power-law, which is typical for real world networks.
Who are those very famous people ?
The Top 10 central speakers in our graph are very famous country leaders. We could have expected this, especially that Donald Trump is the most central.
Are there obvious and interpretable clusters ?
We use Louvain clustering method to check either we can identify interpretable cluster:
It’s a too large number for us to interpret each group by hand, we then focuse on attributes.
Which speaker attributes could be useful to filter on ? We compute the homophily with respect to gender, nationality and political party.
The homophily estimate the similarity of connections in the graph with respect to the given attribute.
Results of homophily:
Those results show that nationality is a good attribute to observe clusters. Indeed on the 3D graph we clearly distinguish clusters of speakers with the same nationality.