Homework 2 (HW2)


By the end of this homework we expect you to be able to:

  1. Preprocess data and make it amenable to statistical analysis and machine learning models;
  2. Train and test out-of-the-box machine learning models in Python;
  3. Carry out statistical hypothesis testing;
  4. Carry out simple multivariate regression analyses;
  5. Use techniques to control for covariates;

Important Dates


Some rules

  1. You are allowed to use any built-in Python library that comes with Anaconda. If you want to use an external library, you may do so, but must justify your choice.
  2. Make sure you use the data folder provided in the repository in read-only mode. (Or alternatively, be sure you don’t change any of the files.)
  3. Be sure to provide a textual description of your thought process, the assumptions you made, the solution you implemented, and explanations for your answers. A notebook that only has code cells will not suffice.
  4. For questions containing the *Discuss:* prefix, answer not with code, but with a textual explanation (in markdown).
  5. Back up any hypotheses and claims with data, since this is an important aspect of the course.
  6. Please write all your comments in English, and use meaningful variable names in your code. Your repo should have a single notebook (plus the required data files) in the master/main branch. If there are multiple notebooks present, we will not grade anything.
  7. We will not run your notebook for you! Rather, we will grade it as is, which means that only the results contained in your evaluated code cells will be considered, and we will not see the results in unevaluated code cells. Thus, be sure to hand in a fully-run and evaluated notebook. In order to check whether everything looks as intended, you can check the rendered notebook on the GitHub website once you have pushed your solution there.
  8. Following up on the previous point, interactive plots, such as those generated using plotly, should be strictly avoided!
  9. Make sure to print results or dataframes that confirm you have properly addressed the task.

Context

Congratulations! You have just been hired as a data scientist at Piccardi Music, a promising new music label created by a mysterious Italian disc jockey "Signor Piccardi". The company hired you to carry out a variety of data-related tasks, which will be explained in further detail below.


The data

For this homework you will use a dataset of 18,403 music reviews scraped from Pitchfork¹, including relevant metadata such as review author, review date, record release year, review score, and genre, along with the respective album's audio features pulled from Spotify's API. The data consists of the following columns:

| Column | Description |
| --- | --- |
| `artist` | The name of the artist who created the album being reviewed. |
| `album` | The name of the album being reviewed. |
| `recordlabel` | The name of the record label(s) that published the album. |
| `releaseyear` | The year that the album was released. |
| `score` | The score given to the album by the reviewer on a scale of 0.0 to 10.0. |
| `reviewauthor` | The name of the author who reviewed the album. |
| `genre` | The genre assigned to the album by Pitchfork. |
| `reviewdate` | The date that the review was published. |
| `key` | The estimated overall musical key of the track. Integers map to pitches using standard Pitch Class notation (e.g., 0 = C, 2 = D, and so on). |
| `acousticness` | A confidence measure from 0.0 to 1.0 of whether an album is acoustic. 1.0 represents high confidence that the album is acoustic. |
| `danceability` | How suitable an album is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 1.0 is most danceable. |
| `energy` | A perceptual measure of intensity and activity, from 0.0 to 1.0, where 1.0 represents high energy. Metal is often high energy. |
| `instrumentalness` | Predicts whether an album contains no vocals, from 0.0 to 1.0. The closer to 1.0, the more likely the album contains no vocals. |
| `liveness` | Detects the presence of an audience, from 0.0 to 1.0. Scores greater than 0.8 indicate a strong likelihood that the album is live. |
| `loudness` | The overall loudness of the album in decibels (dB). |
| `speechiness` | Measures the presence of spoken words in an album on a scale from 0.0 to 1.0. Scores higher than 0.66 indicate an album made entirely of spoken words, while scores below 0.33 indicate music and other non-speech-like elements. |
| `valence` | A measure from 0.0 to 1.0 describing the musical positiveness conveyed by an album, where values closer to 1.0 indicate more positive sounds. |
| `tempo` | The overall estimated tempo of an album in beats per minute (BPM). |

¹Pinter, Anthony T., et al. "P4KxSpotify: A Dataset of Pitchfork Music Reviews and Spotify Musical Features." Proceedings of the International AAAI Conference on Web and Social Media. Vol. 14. 2020.

Part 1: Will this album be a hit?

The first project you embark on in your new job is to build a regressor to predict whether an album will be well received or not. According to Signor Piccardi (your boss), this algorithm may eventually be helpful in forecasting the success of albums produced by Piccardi Music.

Task 1 (Initial analyses — 10 pts)

As a good data scientist, the first thing you do is to have a good look at the data that was handed to you.

  1. Load the data using pandas. Identify and remove duplicate reviews, i.e., two reviews of an album with the same name by the same band (keep the first occurrence). Print the number of rows in your dataframe.
  1. Plot the distribution of album release years and the average score of albums per year.
  1. For numerical columns, calculate the mean, median, minimum value and maximum value. Additionally, plot the distribution for all the numerical columns in a single image (possibly with multiple subplots). Your image should be at most 14 inches wide by 14 inches long.
We observe a range problem for some columns: `danceability`, `energy`, `speechiness`, `acousticness`, `instrumentalness`, `liveness`, and `valence` should range from **0.0 to 1.0**, yet some values are smaller than 0.0, and `tempo` should be **greater than 0**. We drop the reviews with these irregularities and recompute the mean, median, min, and max. As for the `key` feature, we assume it was estimated by a program, which explains why most of its values are not integers; we decided to keep it as is.
  1. For categorical columns, list how many different values there are in each column. If there are fewer than 10 distinct values for a category, print them all. For the genre column, assign the value 'Other' for albums where the value is either 'none' or NaN.
  1. Discuss: This dataset was built with found data—i.e., the Pitchfork reviews were not made with the goal of training a machine learning model. Grounded in the previous analyses and in Pitchfork's Wikipedia page, point out three (exactly!) ways in which this data may not be representative of music albums in general due to the way the data was collected.
1. The distribution of genres is not representative of the market. What is `Global` music? Is it *classical* music? If so, classical seems underestimated; more likely, `Global` means world music, in which case classical music is simply absent from the dataset, which is itself an indicator. The most represented genre in the database is `Rock`. On the Wikipedia page we can see that Pitchfork was originally dedicated to *indie rock*, i.e., a type of rock of the 1970s played in the USA and the UK, so we can infer that a large part of the `Rock` music reviewed is not representative of all the rock music composed.
2. Pitchfork's music reviews use two different rating systems. Also, the drop in reviews around 2000 seems fishy: did they change their rating system?
3. Reviewers probably focus either on albums by well-known bands or on good albums: there is no point in reviewing an album that nobody would listen to anyway. That would explain the high average of the scores. (That theory is disproven by the following graph: artists with more albums tend to have higher scores.)

The cleaning steps from 1.1, 1.3, and 1.4 are sketched below.
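A minimal sketch of those cleaning steps, assuming the CSV lives at `data/pitchfork.csv` (a hypothetical path) and the columns are named as in the table above:

```python
import pandas as pd

# Hypothetical path; adapt to the data folder provided in the repository.
df = pd.read_csv("data/pitchfork.csv")

# 1.1 Drop duplicate reviews (same artist and album name), keeping the first.
df = df.drop_duplicates(subset=["artist", "album"], keep="first")
print(len(df))

# 1.3 Keep only rows whose audio features fall inside their documented ranges
# (this also drops rows where these features are missing).
bounded = ["danceability", "energy", "speechiness", "acousticness",
           "instrumentalness", "liveness", "valence"]
in_range = (df[bounded].ge(0.0) & df[bounded].le(1.0)).all(axis=1)
df = df[in_range & (df["tempo"] > 0)]
print(df[bounded + ["tempo"]].agg(["mean", "median", "min", "max"]))

# 1.4 Map missing or 'none' genres to 'Other'.
df["genre"] = df["genre"].fillna("Other").replace("none", "Other")
```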

Task 2 (Pre-processing and pipeline code — 12 pts)

Next, you decide to prepare the code that will help you in training your machine learning models. Also, you implement a simple baseline. For this task, unless otherwise stated you must implement functions yourself, instead of relying on scikit-learn (you can use numpy or pandas, though!).

  1. For each possible value in the genre column, create a new column called {genre}_onehot (e.g., for genre=jazz, create jazz_onehot). Collectively, these new columns should "one-hot encode" the genre column—for instance, if for a given album the genre is filled with the value jazz, the jazz_onehot column should equal 1 and all other {genre}_onehot columns should equal 0.
  1. Create a function numpy_helper(df, cols) to obtain a numpy.array out of your dataframe. The function should receive a dataframe df with N rows and a list of M columns cols, and should return a np.array of dimension (NxM).
  1. For each album, build an array of features X containing all genre-related one-hot features, and an array of outcomes y containing scores. Using the function sklearn.model_selection.train_test_split with random_state=123, split the data into a train set containing 70% of all data, and a test set containing the remaining 30%.
  1. Create your own baseline regressor. Using the training data (in the training stage), your regressor should estimate the average score for all albums. Then, for the test data (in the prediction stage), your regressor should always output that average score (computed on the training data).
  1. Calculate the coefficient of determination ($R^2$) of your baseline on the test data. You are allowed to use the sklearn implementation here.
As expected, the $R^2$ of the baseline is very close to 0. (The full pipeline for this task is sketched after this list.)
  1. Discuss: Your train-test split randomly selected 70% of all data for the training set. Why is this a problem for the broader task of predicting whether a future album will be successful or not?
A random split ignores time: the model may be trained on albums released after some of the albums it is tested on, i.e., it leaks information from the future. For the broader task of predicting whether a *future* album will be successful, we should train on the past and test on the future. A single random split also doesn't necessarily reflect the underlying distribution of the data; to ensure the model is evaluated on every example, we could use $k$-fold cross-validation instead of a single train-test split.
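A minimal sketch of steps 1 to 5, assuming `df` is the cleaned dataframe from Task 1 (`numpy_helper` is the required name; `MeanRegressor` is our own):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# 2.1 One-hot encode the genre column by hand.
genres = df["genre"].unique()
for genre in genres:
    df[f"{genre}_onehot"] = (df["genre"] == genre).astype(int)
onehot_cols = [f"{g}_onehot" for g in genres]

# 2.2 Dataframe -> (N x M) numpy array.
def numpy_helper(df, cols):
    return df[cols].to_numpy()

# 2.3 70/30 split with the required seed.
X = numpy_helper(df, onehot_cols)
y = numpy_helper(df, ["score"]).ravel()
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=123)

# 2.4 Baseline: always predict the training-set mean.
class MeanRegressor:
    def fit(self, X, y):
        self.mean_ = y.mean()
        return self

    def predict(self, X):
        return np.full(len(X), self.mean_)

# 2.5 R^2 of the baseline on the test set (close to 0 by construction).
baseline = MeanRegressor().fit(X_train, y_train)
print(r2_score(y_test, baseline.predict(X_test)))
```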

Task 3 (Regression — 14 pts)

Finally, you get down to business and train your regression models.

  1. Build a Linear Regression model (use sklearn) that predicts the outcome score using the features "releaseyear", "key", "acousticness", "danceability", "energy", "instrumentalness", "liveness", "loudness", "speechiness", "valence", "tempo" and the one-hot encoded genre-related columns. Using a 70/30 train-test split similar to what you did in task two (hereinafter referred to as "the random split"; use the same random seed, random_state=123), report the $R^2$ for the testing set.
  1. Create an alternate train-test split (hereinafter referred to as "the longitudinal split") where you train with data from albums released before 2000 and test with data from 2003 and after. Report the $R^2$ for the testing set using the same model you developed for the previous question. Discuss: give the correct interpretation of $R^2$ value for the longitudinal split.
The $R^2$ is actually negative for a linear regression with the longitudinal split. This means that the prediction is worse than simply predicting the mean of the test set. This is not surprising, as the distribution of scores changes drastically after 1999 (cf. Task 1); because of this, the distribution of scores is very different between the train and test sets under the longitudinal split.
  1. For a given entry $X$ your model outputs a predicted score $Y'$. The difference between the real score $Y$ and the predicted score $Y'$ is called the "residual". Considering the model trained in 3.2, plot the distribution of your residuals for the test set. Additionally, estimate the probability that your score prediction (from 3.2) is off by more than 2 points. Provide bootstrapped confidence intervals for your answer.
This distribution is not Gaussian, as the linear model assumes. Indeed, it is left-skewed, which is expected since the test set is shifted left compared to the train set. (The split and the bootstrap estimate are sketched after this list.)
  1. Experiment with training a different regressor, a Gradient Boosting Regressor. This regressor is related to the Boosted decision trees that you have seen in class. This model performs extremely well for a variety of tasks and is often used in machine learning competitions for tabular data (e.g., on Kaggle). Train the regressor using the longitudinal split and the same features as in 3.2, use the default hyperparameters. Report the $R^2$ for the testing set.
  1. Discuss: Hypothesize a reason for the difference in performance between the Linear regression and the Gradient Boosting Regressor.
The train/test split is poorly suited to scoring a model for this task, given the difference in distribution between the two sets. It is likely that the Gradient Boosting Regressor fits the training data better than the Linear Regression, but that this better fit translates into a worse score on the test set. One could call that "overfitting", but the problem really lies a priori with the data split, not the model.
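A minimal sketch of the longitudinal split (3.2) and the bootstrap estimate (3.3), assuming `df` and `onehot_cols` carry over from the previous tasks:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

features = ["releaseyear", "key", "acousticness", "danceability", "energy",
            "instrumentalness", "liveness", "loudness", "speechiness",
            "valence", "tempo"] + onehot_cols

# Longitudinal split: train on the past, test on the future.
train = df[df["releaseyear"] < 2000]
test = df[df["releaseyear"] >= 2003]

model = LinearRegression().fit(train[features], train["score"])
print("test R^2:", model.score(test[features], test["score"]))

# P(|residual| > 2) on the test set, with a bootstrapped 95% CI.
residuals = test["score"].to_numpy() - model.predict(test[features])
rng = np.random.default_rng(123)
boot = [np.mean(np.abs(rng.choice(residuals, size=len(residuals))) > 2)
        for _ in range(10_000)]
print("P(|error| > 2):", np.mean(np.abs(residuals) > 2),
      "95% CI:", np.percentile(boot, [2.5, 97.5]))
```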

Task 4 (Are we solving the correct problem? — 16 pts)

All your efforts so far have assumed that decisions are taken at the "album" level, which is often not the case for bands with multiple albums. In those cases, it could be interesting to predict the success of a given band's album from the features of that album and of the band's previous albums.

  1. Create a new dataframe that contains one row per band with more than 1 album. This dataframe should have the same columns as the data provided to you, considering the latest album of the respective band (note that this is determined by the release year of the album, not the date when it was reviewed). Additionally, for each feature considered in Task 3.1 (including the one-hot encoded features), create an additional column post-fixed with _previous (e.g., danceability_previous). These columns should contain the average values for all of the band's previous albums. Also, create a column score_previous with the average score of previous albums. Print the number of rows in the dataframe as well as the names of the columns.
  1. Train a Gradient Boosting Regressor considering all features created in Task 4.1 (note that score is the outcome and everything else is a feature, including score_previous). Use the 70/30 random train-test split, the default hyperparameters, and report the $R^2$ for the testing set.
This model and these features improve on the previous linear regression with the random split.
  1. Can hyperparameter tuning improve your model? Write modular code (i.e., a function) to divide your training data into $N$ folds and perform cross-validation. Experiment tuning two hyperparameters of the Gradient Boosting Regressor: n_estimators and learning_rate. For each possible combination of the two hyperparameters (see below for the range of values that you should try for each hyperparameter), train your model in a cross-validation setup with $N=20$ folds. Report the mean $R^2$ along with the 90% CI for each scenario.

    • n_estimators $ \in \{ 100, 200, 300, 400\}$
    • learning_rate $ \in \{ 0.1, 0.05, 0.01\}$.

    With the best hyperparameters obtained, train your model with the entire training set and report the $R^2$ on the testing set.

  1. Discuss: How do these results compare with the previous setup (the scenario considered in Task 3.4)? Point out two reasons why it is hard to compare the results obtained in 4.3 and 3.4 at face value. How would you fairly compare the two different setups?
These results are obviously far better than those in Task 3.4, given that we now have a positive $R^2$. However, the two setups are not easy to compare, for several reasons:

- We don't use the same train/test split in the two setups, and the longitudinal split used in Task 3.4 probably makes it harder to score well, since the distribution of $y$ differs between the train and test sets.
- We use neither the same features nor the same model in the two settings. For that reason, we can't easily attribute any improvement to the change in model or in features.

To fairly compare the two setups, we would use the same train/test split, the random 70/30 split. Then, we would evaluate the linear and Gradient Boosting regressors once on each set of features. This way, we can compare different models using the same features, and different feature sets using the same models. A sketch of the cross-validation grid from 4.3 appears below.
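A minimal sketch of the hand-rolled cross-validation in 4.3, assuming `X_train` and `y_train` are numpy arrays built from the Task 4.1 features with the 70/30 random split; the 90% interval is taken over the fold scores:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import r2_score

def kfold_scores(X, y, n_folds=20, **params):
    """Train on n_folds-1 folds, score on the held-out fold; return all R^2."""
    rng = np.random.default_rng(123)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    scores = []
    for k in range(n_folds):
        train_idx = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        model = GradientBoostingRegressor(random_state=123, **params)
        model.fit(X[train_idx], y[train_idx])
        scores.append(r2_score(y[folds[k]], model.predict(X[folds[k]])))
    return np.array(scores)

for n_estimators in [100, 200, 300, 400]:
    for learning_rate in [0.1, 0.05, 0.01]:
        scores = kfold_scores(X_train, y_train,
                              n_estimators=n_estimators,
                              learning_rate=learning_rate)
        lo, hi = np.percentile(scores, [5, 95])
        print(f"n_estimators={n_estimators}, learning_rate={learning_rate}: "
              f"mean R^2={scores.mean():.3f}, 90% CI=({lo:.3f}, {hi:.3f})")
```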

Part 2: Second Album Syndrome

Your second project at Piccardi Music is to shed light on one of the business's oldest enigmas: the "second album syndrome." In a nutshell, the "second album syndrome" is a theory that states that the second album of a band always sucks. (Related read)

Assume—for the purpose of this task—that the Pitchfork data contains all albums for all artists it covers (even though this might not be true in reality).

Task 5 (Preliminary analyses — 8 pts)

You begin by carrying out some preliminary data processing and analyses.

  1. Create a new dataframe (from the original data with duplicates dropped, the same as you obtained after finishing Task 1.1) containing only albums from artists with two or more albums and where the release year is not empty.
  1. Create a new column album_number which indicates how many albums the artist has produced before this one (before the second album, the artist has already produced one album).
Since temporality is very important and the `releaseyear` feature gives only the year, what should we do about artists that released their first two albums during the same year? After reading the posts on Zulip and discussing with Manoel during the lab session, we *admit* that if two albums were released in the same year, we can sort them according to `reviewdate`. In the end, we also dropped these albums and redid the computation; spoiler: the results were comparable.
  1. Calculate the mean and the standard error of the mean of the scores of the first and second albums in the dataset. Additionally, plot the two distributions.
  1. Use an appropriate method to determine whether the difference in means of the 1st and 2nd albums is statistically significant.
  1. Discuss: Do these analyses suggest that the "second album syndrome" exists?
According to the t-test on the scores observed for the first and second albums in the cleaned data (sketched below), the difference in score is significant ($p < 0.01$). That would suggest that the "second album syndrome" theory is **true**: the theory claims that the quality of a second album is often worse than that of the first, i.e., it is less appreciated.
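A minimal sketch of 5.3 and 5.4, assuming `df5` (a hypothetical name) is the dataframe from 5.1 with the `album_number` column from 5.2; Welch's t-test is one appropriate choice since it does not assume equal variances:

```python
from scipy import stats

first = df5.loc[df5["album_number"] == 0, "score"]
second = df5.loc[df5["album_number"] == 1, "score"]

# Mean and standard error of the mean for each group.
print("1st albums:", first.mean(), first.sem())
print("2nd albums:", second.mean(), second.sem())

# Welch's t-test for the difference in means.
print(stats.ttest_ind(first, second, equal_var=False))
```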

Task 6 (Regression analysis — 20 pts)

Next, you proceed to examine some hypotheses about the "second album syndrome" using a regression framework. Namely:

  1. Create a new dataframe containing one row per 1st-2nd album pair. The dataframe should contain the columns:
    • score_diff: the difference in scores between the second and the first album (second - first).
    • time_diff: the number of days elapsed between the first and the second album.
    • did_style_change: a dummy variable that indicates whether the style of the music has changed. To obtain it, first, calculate the standardized euclidean distance of music-related numerical features¹ between the second and the first album. Second, assign 1 to the 20% most distant 1st-2nd album pairs and 0 to all others.

¹ Music related numerical features are: "key", "acousticness", "danceability", "energy", "instrumentalness", "liveness", "loudness", "speechiness", "valence", and "tempo".

In order to create the new dataframe, we:

1. Create a new dataframe `df2` by merging the sub-dataframes (`album_number` == 0 or 1) on `artist` (the parameter `validate="1:1"` checks that the merge keys are unique in both the left and right datasets). To differentiate the features of the second album (i.e., `album_number` == 1), we add the suffix `_1` to the corresponding column names, and we drop the columns that are no longer useful to keep the dataframe light. In `df2`, one row corresponds to the first two albums of one artist.
2. Compute and add the `score_diff` column.
3. Compute and add the `time_diff` column (assuming that one year = 365 days).
4. Define a function that returns the standardized euclidean distance of the music-related numerical features between the first two albums.
5. Apply it.
6. Finally, create the `did_style_change` column, which equals 1 for the 20% largest distances.
  1. Fit a linear regression using statsmodels with this dataframe. Your regression should consider only an intercept, i.e., "score_diff ~ 1".
  1. Discuss: Interpret the $R^2$ in your regression here. Does this analysis confirm what you observed in Task 5? Why?
The model `score_diff ~ 1` assumes that the difference in scores is independent of any predictor; in other words, we are testing **whether the evolution of the scores is random or not**. $R^2$ corresponds to the fraction of explained variance, which indicates how well our model fits the data: the closer $R^2$ is to one, the better the model. In our case $R^2$ is zero, so **the intercept model is terrible**! However, an intercept-only model makes a constant prediction, which by definition has no variance, so it explains none of the variance of the observed data; $R^2 = 0$ is therefore expected for this model. In Task 5, with a t-test, we rejected the null hypothesis (that the mean scores of the first and second albums are equal), i.e., there is a significant difference in score. **This is confirmed here**, as the $p$-value of the intercept is lower than 0.001.
  1. Include the time_diff and did_style_change as covariates in your model. Fit the regression again and report the summary of your model.
  1. Discuss: Interpret the coefficients time_diff and did_style_change. Carefully explain whether they provide evidence towards each of the aforementioned hypotheses. Do they rule out other reasons that may cause the "second album syndrome" effect?
- `time_diff`: The coefficient is negative (but very small in absolute value, because `time_diff` takes large values), which means that the more time passed between the first and the second album, the lower the second album was rated. **This contradicts the "time spent" hypothesis of the "second album syndrome" effect.**
- `did_style_change`: The coefficient is positive, which could imply that the second album is on average more appreciated when the band changed style. But looking at the standard error and the confidence interval, we are not sure of the sign of the coefficient, and $p > 0.05$: we do not observe significant evidence towards the **"style change" hypothesis**, so **we cannot conclude**.

These coefficients only inform us about the effects of the time difference and the style change; they do not rule out other possible causes of the "second album syndrome" effect.
  1. Create a new column called time_diff_standardized. It should be a standardized version of the time_diff column. Repeat the regression done in 6.4 using the time_diff_standardized column instead of the time_diff column.
  1. Discuss: Explain how the interpretation of the coefficients associated with this new column time_diff_standardized differ from its non-standardized version time_diff?
We now have to take the standardization into account: the coefficient of `time_diff_standardized` gives the expected change in `score_diff` for a one-standard-deviation increase in `time_diff`, rather than for one additional day. Since the standardized values are of magnitude 1 (and not 300-1000), the coefficient of this predictor is correspondingly larger.
As a robustness check, we create the dataframe `df_bis` by dropping the artists that released their first two albums in the same year, and we reapply all the operations of the task. The p-values are similar with and without the drop, and the results of the regression analysis are the same; there are no significant changes. Hence, we conclude that the dropping has no effect, and we keep the undropped data for the next task. A sketch of the distance computation and the regressions appears below.
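A minimal sketch of the `did_style_change` construction and the three regressions, assuming `pairs` (a hypothetical name) holds one row per 1st-2nd album pair with the second album's features suffixed `_1`, as described above:

```python
import numpy as np
import statsmodels.formula.api as smf

music = ["key", "acousticness", "danceability", "energy", "instrumentalness",
         "liveness", "loudness", "speechiness", "valence", "tempo"]

# Standardized euclidean distance between the two albums' audio features,
# using each feature's variance over both albums.
first_vals = pairs[music].to_numpy()
second_vals = pairs[[c + "_1" for c in music]].to_numpy()
var = np.concatenate([first_vals, second_vals]).var(axis=0)
dist = np.sqrt((((second_vals - first_vals) ** 2) / var).sum(axis=1))

# 1 for the 20% most distant pairs, 0 otherwise.
pairs["did_style_change"] = (dist >= np.quantile(dist, 0.8)).astype(int)

# 6.2 Intercept-only regression.
print(smf.ols("score_diff ~ 1", data=pairs).fit().summary())

# 6.4 Add the covariates.
print(smf.ols("score_diff ~ time_diff + did_style_change",
              data=pairs).fit().summary())

# 6.6 Same regression with a standardized time_diff.
pairs["time_diff_standardized"] = (
    pairs["time_diff"] - pairs["time_diff"].mean()) / pairs["time_diff"].std()
print(smf.ols("score_diff ~ time_diff_standardized + did_style_change",
              data=pairs).fit().summary())
```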

Task 7 (Sanity checks — 6 pts)

You decide to perform a few last sanity checks for your analysis.

  1. Discuss: If the Second Album Syndrome existed, i.e., something was special about the second album (as in if it was bad for a very particular reason that afflicted album number 2 more than all others), what would you expect to happen to the mean score of the third album?
Supposing the "Second Album Syndrome" exists and the second album is particularly worse than the others, we would expect the third album **to have a better score** than the second.
  1. Using the dataset you created in Task 5, calculate the mean and the standard error of the mean for the 1st, 2nd, 3rd, and 4th albums. Test whether the difference between the average score of the second and the third album is statistically significant.
  1. Discuss: Does this suggest that the Second Album Syndrome exists?
Since the change in mean score between the second and the third album is smaller than the standard error of the mean scores of the first four albums, **this does not suggest that the "Second Album Syndrome" exists**. (The check is sketched below.)
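A minimal sketch of 7.2, reusing the hypothetical `df5` dataframe from Task 5:

```python
from scipy import stats

# Mean and standard error of the mean for the first four albums.
for n in range(4):
    scores = df5.loc[df5["album_number"] == n, "score"]
    print(f"album {n + 1}: mean={scores.mean():.3f}, sem={scores.sem():.3f}")

# Is the 2nd-vs-3rd difference statistically significant?
second = df5.loc[df5["album_number"] == 1, "score"]
third = df5.loc[df5["album_number"] == 2, "score"]
print(stats.ttest_ind(second, third, equal_var=False))
```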

Task 8 (Eureka — 14 pts)

Your boss, Signor Piccardi, proposes that you carry out a simulation to make things clearer. Assuming that:

Carry out the following simulation:

Analyzing the scores obtained in this simulation, provide a coherent explanation for the scores obtained in Task 7.2.


Hint: You can use numpy to sample random variables (e.g., numpy.random.normal).