By the end of this homework we expect you to be able to carry out initial data analyses, regression, and supervised learning on a real dataset. Use the `data` folder provided in the repository in read-only mode (or, alternatively, be sure you don't change any of the files). Interactive plots (e.g., with `plotly`) should be strictly avoided!

Congratulations! You have just been hired as a data scientist at Piccardi Music, a promising new music label created by a mysterious Italian disc jockey "Signor Piccardi". The company hired you to carry out a variety of data-related tasks, which will be explained in further detail below.
For this homework you will use a dataset of 18,403 music reviews scraped from Pitchfork¹, including relevant metadata such as review author, review date, record release year, review score, and genre, along with the respective album's audio features pulled from Spotify's API. The data consists of the following columns:
Column | Description |
---|---|
artist | The name of the artist who created the album being reviewed. |
album | The name of the album being reviewed. |
recordlabel | The name of the record label(s) who published the album. |
releaseyear | The year that the album was released. |
score | The score given to the album by the reviewer on a scale of 0.0 to 10.0. |
reviewauthor | The name of the author who reviewed the album. |
genre | The genre assigned to the album by Pitchfork. |
reviewdate | The date that the review was published. |
key | The estimated overall musical key of the track. Integers map to pitches using standard Pitch Class notation (e.g., 0 = C, 2 = D, and so on). |
acousticness | A confidence measure from 0.0 to 1.0 of whether an album is acoustic. 1.0 represents high confidence that the album is acoustic. |
danceability | How suitable an album is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 1.0 is most danceable. |
energy | A perceptual measure of intensity and activity, from 0.0 to 1.0, where 1.0 represents high energy. Metal is often high energy. |
instrumentalness | Predicts whether an album contains no vocals, from 0.0 to 1.0. The closer to 1.0, the more likely the album contains no vocals. |
liveness | Detects the presence of an audience, from 0.0 to 1.0. Scores greater than 0.8 indicate a strong likelihood that the album is live. |
loudness | The overall loudness of the album in decibels (dB). |
speechiness | Measures the presence of spoken words in an album on a scale from 0.0 to 1.0. Scores higher than 0.66 indicate an album made entirely of spoken words, while scores below 0.33 indicate music and other non-speech-like elements. |
valence | A measure from 0.0 to 1.0 describing the musical positiveness conveyed by an album, where values closer to 1.0 indicate more positive sounds. |
tempo | The overall estimated tempo of an album in beats per minute (BPM). |
¹Pinter, Anthony T., et al. "P4KxSpotify: A Dataset of Pitchfork Music Reviews and Spotify Musical Features." Proceedings of the International AAAI Conference on Web and Social Media. Vol. 14. 2020.
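For later reference when inspecting the key column, the standard Pitch Class mapping can be written as a small lookup (a convenience list of ours, not part of the dataset):

# index = pitch class, value = pitch name (0 = C, 1 = C#/Db, ..., 11 = B)
PITCH_CLASSES = ['C', 'C#/Db', 'D', 'D#/Eb', 'E', 'F',
                 'F#/Gb', 'G', 'G#/Ab', 'A', 'A#/Bb', 'B']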
# CHANGE THIS IF YOU NEED/WANT TO
# pandas / numpy
import pandas as pd
import numpy as np
rng = np.random.default_rng() # for random number generation
# plotting
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib import rcParams
from matplotlib.lines import Line2D
# datetime operations
from datetime import datetime
# ttest and euclidean distance
from scipy.stats import ttest_ind
from scipy.spatial.distance import seuclidean
# linear fit using statsmodels
import statsmodels.api as sm
import statsmodels.formula.api as smf
# good ole sklearn
from sklearn.metrics import euclidean_distances, r2_score
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
# displaying markdown strings
from IPython.display import display, Markdown, Latex
# settings
rcParams['figure.dpi']= 100
pd.set_option('display.max_rows', 3000)
sns.set(style="whitegrid")
# data folder
DATA_FOLDER = 'data/'
The first project you embark on in your new job is to build a regressor to predict whether an album will be well received or not. According to Signor Piccardi (your boss), this algorithm may eventually be helpful in forecasting the success of albums produced by Piccardi Music.
Task 1 (Initial analyses — 10 pts)
As a good data scientist, the first thing you do is to have a good look at the data that was handed to you.
Load the data using `pandas`. Identify and remove duplicate reviews, i.e., two reviews of albums by the same band with the same name (keep the first occurrence). Print the number of rows in your dataframe.

df = pd.read_csv(DATA_FOLDER + 'pitchfork.csv.gz', compression='gzip',
                 parse_dates=['reviewdate'], dtype={'releaseyear': int})
shape_before = df.shape[0]
df.drop_duplicates(subset=['artist', 'album'], keep='first', inplace=True)
print('After removing the {} duplicate reviews, the dataframe now has {} rows.'
      .format(shape_before - df.shape[0], df.shape[0]))

After removing the 47 duplicate reviews, the dataframe now has 16738 rows.
fig, ax1 = plt.subplots()
yearmin, yearmax = df.releaseyear.min(), df.releaseyear.max()
sns.histplot(x = 'releaseyear', stat = 'count', data = df, ax = ax1,
bins = [k for k in range(yearmin, yearmax + 1)])
ax1.minorticks_on()
ax1.set_yscale('log')
ax1.tick_params(axis='x', labelrotation = 30)
ax2 = ax1.twinx()
ax2.plot(df.groupby('releaseyear').score.agg('mean'), color = 'red', marker = '+', lw = 0.7)
ax2.set_ylabel('Average score of albums');
ax2.grid(False)
plt.title('Distribution of album release years and the average score of albums per year');
df.describe().loc[['mean', '50%', 'min', 'max']]
 | score | releaseyear | danceability | energy | key | loudness | speechiness | acousticness | instrumentalness | liveness | valence | tempo |
---|---|---|---|---|---|---|---|---|---|---|---|---|
mean | 7.048596 | 2009.346338 | 0.512334 | 0.601276 | 5.216501 | -9.283268 | 0.090742 | 0.301914 | 0.274748 | 0.196402 | 0.405268 | 120.326487 |
50% | 7.300000 | 2010.000000 | 0.511348 | 0.624722 | 5.230769 | -8.444263 | 0.056665 | 0.228844 | 0.149363 | 0.174261 | 0.406288 | 120.397346 |
min | 0.000000 | 1957.000000 | -1.000000 | -1.000000 | -1.000000 | -51.728750 | -1.000000 | -1.000000 | -1.000000 | -1.000000 | -1.000000 | -1.000000 |
max | 10.000000 | 2019.000000 | 0.974000 | 0.999000 | 11.000000 | 4.078000 | 0.958000 | 0.996000 | 0.982000 | 0.978000 | 0.971000 | 215.972000 |
aberrant_num_columns = ['danceability', 'energy', 'speechiness', 'acousticness', 'instrumentalness',
'liveness', 'valence', 'tempo']
to_drop = df[df[aberrant_num_columns].lt(0.0).any(axis = 1)].shape[0]
df = df[~df[aberrant_num_columns].lt(0.0).any(axis = 1)]
print('We removed {} rows with irregular values.'.format(to_drop))
We removed 0 rows with irregular values.
df.describe().loc[['mean', '50%', 'min', 'max']]
 | score | releaseyear | danceability | energy | key | loudness | speechiness | acousticness | instrumentalness | liveness | valence | tempo |
---|---|---|---|---|---|---|---|---|---|---|---|---|
mean | 7.048536 | 2009.345965 | 0.512967 | 0.601951 | 5.219034 | -9.286912 | 0.091208 | 0.302476 | 0.275303 | 0.196910 | 0.405853 | 120.377216 |
50% | 7.300000 | 2010.000000 | 0.511364 | 0.624823 | 5.230769 | -8.446428 | 0.056681 | 0.229000 | 0.149641 | 0.174304 | 0.406300 | 120.406490 |
min | 0.000000 | 1957.000000 | 0.038667 | 0.000126 | 0.000000 | -51.728750 | 0.008644 | 0.000001 | 0.000000 | 0.015300 | 0.000010 | 23.983333 |
max | 10.000000 | 2019.000000 | 0.974000 | 0.999000 | 11.000000 | 4.078000 | 0.958000 | 0.996000 | 0.982000 | 0.978000 | 0.971000 | 215.972000 |
# Open question: key values should be integers in Pitch Class notation,
# but that is the case for only ~17% of the rows:
# df.key[df.key.isin(range(-1, 12))].shape[0] / df.shape[0]
df.key.hist(bins=100)
fig, ax = plt.subplots(4, 3, figsize=(14, 14))
# list of names of numerical columns
num_cols = df.describe().loc[['mean', '50%', 'min', 'max']].columns.values
for i, elt in enumerate(num_cols):
    sbplt = ax[i // 3, i % 3]
    sbplt.hist(df[elt])
    sbplt.set_title(elt)
    sbplt.set_yscale('log')
fig.tight_layout(pad=0.5)
fig.subplots_adjust(top=0.93)
plt.suptitle('Distribution of all numerical features');
In the `genre` column, assign the value 'Other' for albums where the value is either 'none' or NaN.

to_modif = (df.genre == 'none') | (df.genre.isna())
df.loc[to_modif, 'genre'] = 'Other'
print(f"There were {to_modif.sum()} genre values replaced by 'Other'.")

There were 11 genre values replaced by 'Other'.
# list of categorical columns
cat_cols = df.columns.difference(num_cols)
for cat in cat_cols:
    n = df[cat].unique().size
    if n <= 10:
        print(f'"{cat}" column has {n} distinct values: \n{", ".join(df[cat].unique())}.')
    else:
        print(f'"{cat}" column has {n} distinct values.')
    print()

"album" column has 16176 distinct values.
"artist" column has 7890 distinct values.
"genre" column has 10 distinct values:
Electronic, Folk/Country, Rock, Rap, Global, Experimental, Metal, Pop/R&B, Jazz, Other.
"recordlabel" column has 3030 distinct values.
"reviewauthor" column has 554 distinct values.
"reviewdate" column has 4876 distinct values.
sns.histplot(x='genre', stat='count', data=df[cat_cols])
plt.xticks(rotation=70)
plt.grid(False, axis='x')
plt.title('Distribution of the different genres');
df_artist_counts = df.copy()
count_art = df.artist.value_counts()
count_art.name = 'count_art'
df_artist_counts = df_artist_counts.merge(count_art, right_index=True, left_on='artist', how='inner')
sns.regplot(data=df_artist_counts, y='score', x='count_art', line_kws={'color': 'red'})
plt.title("Relationship between an artist's number of albums and the album's score.");
Task 2 (Preparing the ML pipeline and a baseline)

Next, you decide to prepare the code that will help you in training your machine learning models. Also, you implement a simple baseline. For this task, unless otherwise stated, you must implement functions yourself instead of relying on `scikit-learn` (you can use `numpy` or `pandas`, though!).
For each genre in the `genre` column, create a new column called `{genre}_onehot` (e.g., for `genre=jazz`, create `jazz_onehot`). Collectively, these new columns should "one-hot encode" the genre column: for instance, if for a given album the `genre` is filled with the value `jazz`, the `jazz_onehot` column should equal 1 and all other `{genre}_onehot` columns should equal 0.

onehot = pd.get_dummies(df.genre)
onehot.columns = [c + '_onehot' for c in onehot.columns]
onehot.head()
 | Electronic_onehot | Experimental_onehot | Folk/Country_onehot | Global_onehot | Jazz_onehot | Metal_onehot | Other_onehot | Pop/R&B_onehot | Rap_onehot | Rock_onehot |
---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
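Equivalently, the encoding can be built by hand without `pd.get_dummies`. A minimal sketch producing the same columns (up to column order), assuming the genre cleanup above has already been applied:

# one indicator column per genre: 1 where the album has that genre, else 0
onehot_manual = pd.DataFrame(
    {f'{g}_onehot': (df.genre == g).astype(int) for g in df.genre.unique()})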
Write a function `numpy_helper(df, cols)` to obtain a `numpy.array` out of your dataframe. The function should receive a dataframe `df` with N rows and a list of M columns `cols`, and should return a `np.array` of dimension (N x M).

def numpy_helper(df, cols):
    '''Return the numpy array of the selected cols in df.'''
    return df[cols].values
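A quick sanity check of the helper's contract (the assertion and the choice of columns are ours):

# selecting M columns from a dataframe with N rows should yield an (N, M) array
assert numpy_helper(df, ['score', 'tempo']).shape == (df.shape[0], 2)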
Create an array of features `X` containing all genre-related one-hot features, and an array of outcomes `y` containing scores. Using the function `sklearn.model_selection.train_test_split` with `random_state=123`, split the data into a train set containing 70% of all data, and a test set containing the remaining 30%.

X = numpy_helper(onehot, onehot.columns)
y = numpy_helper(df, 'score')
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=123)
class Baseline():
    '''Baseline regressor that always predicts the mean score of the training set.'''
    def __init__(self):
        self.avg = None

    def fit(self, X, y):
        # memorize the mean of the training outcomes
        self.avg = np.mean(y)

    def predict(self, X):
        # predict that mean for every row of X
        return self.avg * np.ones((X.shape[0],))
baseline = Baseline()
baseline.fit(X_train, y_train)
Evaluate the baseline on the test set using the $R^2$ score; you may use the `sklearn` implementation here.

r2_score(y_test, baseline.predict(X_test))  # predict takes the feature matrix, not the outcomes
-1.0839329994682956e-05
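Since the spirit of this task is to implement functions yourself, it is worth checking the number by hand. A minimal sketch of $R^2 = 1 - SS_{res} / SS_{tot}$ (the name `r2_manual` is ours, not part of any library):

def r2_manual(y_true, y_pred):
    '''Coefficient of determination: 1 - SS_res / SS_tot.'''
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1 - ss_res / ss_tot

r2_manual(y_test, baseline.predict(X_test))  # matches sklearn's r2_score up to float error

A value of essentially zero is expected here: a model that always predicts the training mean explains none of the variance in the test scores, and the slightly negative value only reflects that the train and test means differ a little.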
Task 3 (Regression — 14 pts)
Finally, you get down to business and train your regression models.
Train a linear regression model (using `sklearn`) that predicts the outcome `score` using the features "releaseyear", "key", "acousticness", "danceability", "energy", "instrumentalness", "liveness", "loudness", "speechiness", "valence", "tempo" and the one-hot encoded genre-related columns. Using a 70/30 train-test split similar to what you did in task two (hereinafter referred to as "the random split"; use the same random seed, `random_state=123`), report the $R^2$ for the testing set.

model = LinearRegression()
# data conversion to numpy array and random splitting
df_onehot = pd.merge(df, onehot, left_index=True, right_index=True)
cols = ["releaseyear", "key", "acousticness", "danceability",
        "energy", "instrumentalness", "liveness", "loudness",
        "speechiness", "valence", "tempo"] + list(onehot.columns)
X, y = numpy_helper(df_onehot, cols), numpy_helper(df_onehot, 'score')
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=123)
# model training
model.fit(X_train, y_train)
print(f"The R^2 of the linear regression on the random split is {r2_score(y_test, model.predict(X_test)):.4f}.")
The R^2 of the linear regression on the random split is 0.0387.
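For reference, the same regression can be reproduced with `statsmodels` (imported above but otherwise unused), which also reports coefficient estimates and p-values. A minimal sketch on the same training arrays; note that the full set of one-hot columns plus an intercept is collinear, so individual coefficients are not uniquely identified (statsmodels falls back to a pseudo-inverse), while $R^2$ is unaffected:

X_train_sm = sm.add_constant(X_train)  # statsmodels' OLS needs an explicit intercept column
ols_fit = sm.OLS(y_train, X_train_sm).fit()
print(ols_fit.rsquared)  # in-sample R^2 on the training set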
model = LinearRegression()
# longitudinal split: train on albums released before 2000, test on albums from 2003 onwards
# (copies silence SettingWithCopy warnings arising later)
df_train = df_onehot[df_onehot.releaseyear < 2000].copy()
df_test = df_onehot[df_onehot.releaseyear >= 2003].copy()
X_train,y_train = numpy_helper(df_train, cols), numpy_helper(df_train, 'score')
X_test ,y_test = numpy_helper(df_test, cols) , numpy_helper(df_test, 'score')
#model training
model.fit(X_train, y_train)
print(f"The R^2 of the linear regression on the longitudinal split is {r2_score(y_test, model.predict(X_test)):.4f}.")
The R^2 of the linear regression on the longitudinal split is -0.2823.
df_train.loc[:,'Period'] = 'Before 2000'
df_test.loc[:,'Period'] = '2003 or after'
df_plot = pd.concat((df_train, df_test), axis=0)
sns.histplot(df_plot, x='score', hue='Period', common_norm=False, stat='density')
plt.title('Distribution of scores before 2000 and from 2003 onwards.');
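To check whether the apparent shift between the two periods is statistically meaningful, one could compare mean scores with the two-sample t-test imported at the top (`ttest_ind`); a sketch, assuming unequal variances (Welch's t-test):

t_stat, p_val = ttest_ind(df_train.score, df_test.score, equal_var=False)
print(f'Welch t-test: t = {t_stat:.2f}, p = {p_val:.3g}')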
y_pred = model.predict(X_test)
sns.histplot(x=y_test - y_pred)
plt.title('Residuals of the 3.2 model on the test set')
plt.xlabel("$Y- Y'$");
# bootstrap CI
N = 10_000
proba = np.zeros(N)
n = y_test.size
for i in range(N):
    idx = rng.integers(0, n, n)  # n indices in [0, n), drawn with replacement
    y_sample = y_test[idx]
    X_sample = X_test[idx]
    y_pred = model.predict(X_sample)
    err = np.where(np.abs(y_pred - y_sample) > 2, 1, 0)  # 1 if |residual| > 2, else 0
    proba[i] = err.mean()
sns.histplot(x=proba, binwidth=5e-4)
plt.xlabel('Probability')
plt.title(f'Probability of a prediction being off by more than 2 score points,\n $N = {N}$')
plt.xticks(rotation=45);
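From the bootstrap distribution one can then read off, e.g., a 95% confidence interval via the percentile method (variable names are ours):

ci_low, ci_high = np.percentile(proba, [2.5, 97.5])
print(f'95% bootstrap CI for P(|error| > 2): [{ci_low:.4f}, {ci_high:.4f}]')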