Homework 2 (HW2)


By the end of this homework we expect you to be able to:

  1. Preprocess data and make it amenable to statistical analysis and machine learning models;
  2. Train and test out-of-the-box machine learning models in Python;
  3. Carry out statistical hypothesis testing;
  4. Carry out simple multivariate regression analyses;
  5. Use techniques to control for covariates;

Important Dates


Some rules

  1. You are allowed to use any built-in Python library that comes with Anaconda. If you want to use an external library, you may do so, but must justify your choice.
  2. Make sure you use the data folder provided in the repository in read-only mode. (Or alternatively, be sure you don’t change any of the files.)
  3. Be sure to provide a textual description of your thought process, the assumptions you made, the solution you implemented, and explanations for your answers. A notebook that only has code cells will not suffice.
  4. For questions containing the *Discuss:* prefix, answer not with code, but with a textual explanation (in markdown).
  5. Back up any hypotheses and claims with data, since this is an important aspect of the course.
  6. Please write all your comments in English, and use meaningful variable names in your code. Your repo should have a single notebook (plus the required data files) in the master/main branch. If there are multiple notebooks present, we will not grade anything.
  7. We will not run your notebook for you! Rather, we will grade it as is, which means that only the results contained in your evaluated code cells will be considered, and we will not see the results in unevaluated code cells. Thus, be sure to hand in a fully-run and evaluated notebook. In order to check whether everything looks as intended, you can check the rendered notebook on the GitHub website once you have pushed your solution there.
  8. In line with the previous point, interactive plots, such as those generated using plotly, should be strictly avoided!
  9. Make sure to print results or dataframes that confirm you have properly addressed the task.

Context

Congratulations! You have just been hired as a data scientist at Piccardi Music, a promising new music label created by a mysterious Italian disc jockey "Signor Piccardi". The company hired you to carry out a variety of data-related tasks, which will be explained in further detail below.


The data

For this homework you will use a dataset of 18,403 music reviews scraped from Pitchfork¹, including relevant metadata such as review author, review date, record release year, review score, and genre, along with the respective album's audio features pulled from Spotify's API. The data consists of the following columns:

| Column | Description |
| --- | --- |
| artist | The name of the artist who created the album being reviewed. |
| album | The name of the album being reviewed. |
| recordlabel | The name of the record label(s) that published the album. |
| releaseyear | The year that the album was released. |
| score | The score given to the album by the reviewer on a scale of 0.0 to 10.0. |
| reviewauthor | The name of the author who reviewed the album. |
| genre | The genre assigned to the album by Pitchfork. |
| reviewdate | The date that the review was published. |
| key | The estimated overall musical key of the track. Integers map to pitches using standard Pitch Class notation (e.g., 0 = C, 2 = D, and so on). |
| acousticness | A confidence measure from 0.0 to 1.0 of whether an album is acoustic. 1.0 represents high confidence that the album is acoustic. |
| danceability | How suitable an album is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 1.0 is most danceable. |
| energy | A perceptual measure of intensity and activity, from 0.0 to 1.0, where 1.0 represents high energy. Metal is often high energy. |
| instrumentalness | Predicts whether an album contains no vocals, from 0.0 to 1.0. The closer to 1.0, the more likely the album contains no vocals. |
| liveness | Detects the presence of an audience, from 0.0 to 1.0. Scores greater than 0.8 indicate a strong likelihood that the album is live. |
| loudness | The overall loudness of the album in decibels (dB). |
| speechiness | Measures the presence of spoken words in an album on a scale from 0.0 to 1.0. Scores above 0.66 indicate an album made entirely of spoken words, while scores below 0.33 indicate music and other non-speech-like elements. |
| valence | A measure from 0.0 to 1.0 describing the musical positiveness conveyed by an album, where values closer to 1.0 indicate more positive sounds. |
| tempo | The overall estimated tempo of an album in beats per minute (BPM). |

¹Pinter, Anthony T., et al. "P4KxSpotify: A Dataset of Pitchfork Music Reviews and Spotify Musical Features." Proceedings of the International AAAI Conference on Web and Social Media. Vol. 14. 2020.

Part 1: Will this album be a hit?

The first project you embark on in your new job is to build a regressor to predict whether an album will be well received or not. According to Signor Piccardi (your boss), this algorithm may eventually be helpful in forecasting the success of albums produced by Piccardi Music.

Task 1 (Initial analyses — 10 pts)

As a good data scientist, the first thing you do is to have a good look at the data that was handed to you.

  1. Load the data using pandas. Identify and remove duplicate reviews, i.e., two reviews of an album with the same artist and the same album name (keep the first occurrence). Print the number of rows in your dataframe. (A code sketch covering points 1, 3, and 4 follows this task's answers.)
  1. Plot the distribution of album release years and the average score of albums per year.
  1. For numerical columns, calculate the mean, median, minimum value, and maximum value. Additionally, plot the distributions of all the numerical columns in a single image (possibly with multiple subplots). Your image should be at most 14 inches wide by 14 inches tall.
We observe a range problem for some features:

- `danceability`, `energy`, `speechiness`, `acousticness`, `instrumentalness`, `liveness`, and `valence` should range from **0.0 to 1.0**, yet some values are smaller than 0.0;
- `tempo` should be **greater than 0**.

We drop the reviews with such irregularities and recompute the mean, median, min, and max. As for the `key` feature, we assume it was estimated by a program, which explains why the majority of its values are not integers. We decided to keep it as is.
  1. For categorical columns, list how many distinct values there are in each column. If a column has fewer than 10 distinct values, print them all. For the genre column, assign the value 'Other' to albums where the value is either 'none' or NaN.
  1. Discuss: This dataset was built with found data, i.e., the Pitchfork reviews were not made with the goal of training a machine learning model. Grounded in the previous analyses and in Pitchfork's Wikipedia page, point out three (exactly!) ways in which this data may not be representative of music albums in general due to the way the data was collected.
1. The distribution of genres is not representative of the music market. What is `Global` music? Is it *classical* music? If so, classical seems underrepresented; more likely, `Global` means world music, in which case classical music is simply absent from the dataset, which is itself telling. Moreover, the most represented genre is `Rock`. On the Wikipedia page we can see that Pitchfork was originally dedicated to *indie rock*, i.e., a type of rock from the '70s played in the USA and the UK. We can infer that a large part of the `Rock` music reviewed is not representative of all the rock music composed.
2. Pitchfork's music reviews use two different rating systems. Also, the drop in review scores around 2000 seems fishy: did they change their rating system?
3. Reviewers probably focus either on albums by well-known bands or on good albums: there is no point in reviewing an album that nobody would listen to anyway. That would explain the high average score. (This theory is disproven by the following graph: artists with more albums tend to have higher scores.)
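As a minimal sketch of points 1, 3, and 4, assuming the reviews live in a CSV at `data/pitchfork.csv` (a hypothetical path; point it at the file actually shipped in the data folder):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical path: adjust to the file provided in the repository.
df = pd.read_csv("data/pitchfork.csv")

# Point 1: drop duplicate reviews (same artist and album name), keep the first.
df = df.drop_duplicates(subset=["artist", "album"], keep="first")
print(len(df))

# Point 3: drop reviews whose audio features fall outside their valid ranges.
bounded = ["danceability", "energy", "speechiness", "acousticness",
           "instrumentalness", "liveness", "valence"]
in_range = (df[bounded].ge(0.0).all(axis=1)
            & df[bounded].le(1.0).all(axis=1)
            & (df["tempo"] > 0))
df = df[in_range]

# Summary statistics and distributions for the numerical columns.
num_cols = df.select_dtypes("number").columns
print(df[num_cols].agg(["mean", "median", "min", "max"]))
df[num_cols].hist(figsize=(14, 14), bins=40)
plt.tight_layout()
plt.show()

# Point 4: fold 'none' and NaN genres into 'Other'.
df["genre"] = df["genre"].fillna("Other").replace("none", "Other")
```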

Task 2 (Pre-processing and pipeline code — 12 pts)

Next, you decide to prepare the code that will help you train your machine learning models, and you implement a simple baseline. For this task, unless otherwise stated, you must implement functions yourself rather than relying on scikit-learn (you can use numpy or pandas, though!).

  1. For each possible value in the genre column, create a new column called {genre}_onehot (e.g., for genre=jazz, create jazz_onehot). Collectively, these new columns should "one-hot encode" the genre column: for instance, if for a given album the genre is filled with the value jazz, the jazz_onehot column should equal 1 and all other {genre}_onehot columns should equal 0. (See the sketch after this task's answers.)
  1. Create a function numpy_helper(df, cols) to obtain a numpy.array out of your dataframe. The function should receive a dataframe df with N rows and a list of M columns cols, and should return a np.array of dimension (NxM).
  1. For each album, build an array of features X containing all genre-related one-hot features, and an array of outcomes y containing scores. Using the function sklearn.model_selection.train_test_split with random_state=123, split the data into a train set containing 70% of all data, and a test set containing the remaining 30%.
  1. Create your own baseline regressor. Using the training data (in the training stage), your regressor should estimate the average score over all albums. Then, on the test data (in the prediction stage), your regressor should always output that average score (computed on the training data).
  1. Calculate the coefficient of determination ($R^2$) of your baseline on the test data. You are allowed to use the sklearn implementation here.
As expected, the $R^2$ of the baseline is very close to 0.
  1. Discuss: Your train-test split randomly selected 70% of all data for the training set. Why is this a problem for the broader task of predicting whether a future album will be successful or not?
A random split ignores the temporal ordering of the data: the model is trained in part on reviews of albums released after some of the albums in the test set, i.e., on information that would not be available when forecasting the success of a future album. In addition, a single random split does not necessarily reflect the underlying distribution of the data; we might want the model to be evaluated on every example, for instance by using $k$-fold cross-validation instead of a single train-test split.
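Continuing from the cleaned dataframe `df` above, here is one way the Task 2 pipeline could look (`numpy_helper` is the required name; `MeanRegressor` is our own label for the baseline, not a sklearn class):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# 1. One-hot encode the genre column by hand (no sklearn).
for g in df["genre"].unique():
    df[f"{g}_onehot"] = (df["genre"] == g).astype(int)

# 2. Turn a list of M columns of an N-row dataframe into an (N x M) array.
def numpy_helper(df, cols):
    return df[cols].to_numpy()

# 3. Features = genre one-hots, outcome = score; 70/30 split.
onehot_cols = [c for c in df.columns if c.endswith("_onehot")]
X = numpy_helper(df, onehot_cols)
y = numpy_helper(df, ["score"]).ravel()
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=123)

# 4. Baseline: fit stores the training-set mean, predict always returns it.
class MeanRegressor:
    def fit(self, X, y):
        self.mean_ = y.mean()
        return self

    def predict(self, X):
        return np.full(len(X), self.mean_)

# 5. R^2 of the baseline on the test set (sklearn is allowed here).
baseline = MeanRegressor().fit(X_train, y_train)
print(r2_score(y_test, baseline.predict(X_test)))  # close to 0 by construction
```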

Task 3 (Regression — 14 pts)

Finally, you get down to business and train your regression models.

  1. Build a Linear Regression model (use sklearn) that predicts the outcome score using the features "releaseyear", "key", "acousticness", "danceability", "energy", "instrumentalness", "liveness", "loudness", "speechiness", "valence", "tempo" and the one-hot encoded genre-related columns. Using a 70/30 train-test split similar to what you did in Task 2 (hereinafter referred to as "the random split"; use the same random seed, random_state=123), report the $R^2$ for the testing set. (A sketch of tasks 3.2 and 3.3 follows the answers below.)
  1. Create an alternate train-test split (hereinafter referred to as "the longitudinal split") where you train with data from albums released before 2000 and test with data from 2003 and after. Report the $R^2$ for the testing set using the same model you developed for the previous question. Discuss: give the correct interpretation of the $R^2$ value for the longitudinal split.
The $R^2$ is actually negative for the linear regression under the longitudinal split. This means the model's predictions are worse than simply predicting the mean score of the test set. This is not surprising: the distribution of scores changes drastically after 1999 (cf. Task 1), so the distributions of scores in the train and test sets are very different under the longitudinal split.
  1. For a given entry $X$, your model outputs a predicted score $Y'$. The difference between the real score $Y$ and the predicted score $Y'$ is called the "residual". Considering the model trained in 3.2, plot the distribution of the residuals for the test set. Additionally, estimate the probability that your score prediction (from 3.2) is off by more than 2 points. Provide bootstrapped confidence intervals for your answer.
The distribution of residuals is not Gaussian, contrary to what the linear model assumes. It is left-skewed, which is expected since the test-set scores are shifted left relative to the train set.
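A sketch of the longitudinal split, the residual plot, and the bootstrap, reusing `df` and `onehot_cols` from the earlier sketches (the 1,000 resamples and the RNG seed are arbitrary choices):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

audio_cols = ["releaseyear", "key", "acousticness", "danceability", "energy",
              "instrumentalness", "liveness", "loudness", "speechiness",
              "valence", "tempo"]
feat_cols = audio_cols + onehot_cols

# 3.2: longitudinal split -- train on pre-2000 albums, test on 2003 and after.
train = df[df["releaseyear"] < 2000]
test = df[df["releaseyear"] >= 2003]
model = LinearRegression().fit(train[feat_cols], train["score"])
print(model.score(test[feat_cols], test["score"]))  # R^2 on the test set

# 3.3: residuals on the test set and their distribution.
residuals = test["score"].to_numpy() - model.predict(test[feat_cols])
plt.hist(residuals, bins=40)
plt.xlabel("residual (true score - predicted score)")
plt.show()

# Bootstrapped 95% CI for P(|residual| > 2): resample residuals with
# replacement and recompute the proportion of large errors each time.
rng = np.random.default_rng(123)  # arbitrary seed
probs = [(np.abs(rng.choice(residuals, size=len(residuals))) > 2).mean()
         for _ in range(1000)]
print(np.percentile(probs, [2.5, 97.5]))
```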