Capstone Project: Spotify Team

Problem and Dataset

We focus on the task of playlist continuation: given a playlist with a few songs, we wish to suggest songs that are natural extensions of it. One way to think of this is building a "radio" from seed tracks. For this task we used the Million Playlist Dataset, which was recently released by Spotify and contains actual playlists created by users.

Approach

We adopted a two-step hierarchical approach. The first step proposes a pool of candidate songs using an adaptation of Word2vec. The second step compares the songs in the candidate pool using eight different features and aggregates the results in a Bayesian framework, which produces the final list of songs to continue the playlist.

Evaluation

We evaluate our system by splitting each playlist into two parts and measuring how well the continuation recommended from the first part recovers the songs in the second part. The metrics are r-precision, an adaptation of the standard precision and recall metrics, and NDCG (normalized discounted cumulative gain), which rewards placing relevant songs near the top of the recommended list.
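As a rough illustration of this evaluation, the sketch below computes both metrics for one playlist. It assumes binary relevance (a recommended track either is or is not in the held-out part), a recommendation list without duplicates, and the common log2(rank + 1) discount for NDCG; the function names and the toy track IDs are hypothetical and not taken from our codebase.

```python
import math


def r_precision(recommended, held_out):
    """Share of held-out tracks found in the top-|held_out| recommendations."""
    truth = set(held_out)
    if not truth:
        return 0.0
    top = recommended[:len(truth)]
    return len(truth.intersection(top)) / len(truth)


def ndcg(recommended, held_out):
    """Normalized discounted cumulative gain with binary relevance."""
    truth = set(held_out)
    # Each hit contributes more the closer it is to the top of the list.
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, track in enumerate(recommended, start=1)
              if track in truth)
    # The ideal list places every recoverable held-out track first.
    ideal_hits = min(len(truth), len(recommended))
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


# Toy playlist: the first half is the seed, the second half is held out.
playlist = ["a", "b", "c", "d", "e", "f"]
seed, held_out = playlist[:3], playlist[3:]
recommended = ["d", "x", "e", "y", "f"]  # hypothetical model output

print(r_precision(recommended, held_out))
print(ndcg(recommended, held_out))
```

In this toy case two of the three held-out tracks appear in the top three recommendations, so r-precision is 2/3, while NDCG additionally gives partial credit for the third track found at rank five.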

What makes playlist continuation hard?

Should a "good" continuation preserve the mood of the playlist (e.g., happy, sad) or its utility (e.g., travel, date night, workout)? Or should it maximize the similarity of the lyrics' meaning, or of the audio? Given the inherent ambiguity of this task, streaming companies like Spotify currently rely on experts to design playlists for a given mood or use case.

Model Details

There are over 2 million unique tracks in the Million Playlist Dataset. For any model, identifying the top 10 songs from a set that large is an extremely daunting task. Most past work has approached this task either with collaborative-filtering-like methods, which ignore the content of the song and use only people's usage patterns, or with content-driven methods, which try to identify features that best reflect song similarity. We adopt a hybrid technique here. We start with a collaborative-filtering-like approach called Track2Vec to identify a pool of songs that are good candidates for suggestion. Track2Vec is analogous to Word2vec and is trained similarly, treating playlists as documents and songs as words. In our analysis, we found that the top 10,000 candidates suggested by this "pooling" model recover about 80% of the ground-truth tracks for a given playlist.
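To make the playlists-as-documents idea concrete, here is a minimal sketch in plain Python. The first function builds the (target, context) pairs that a skip-gram Word2vec model would be trained on when each playlist is treated as a sentence of track IDs. The second shows one possible way to draw a candidate pool from already-trained embeddings, by ranking tracks on cosine similarity to the average of the seed vectors. The averaging step, the window size, the function names, and the toy vectors are all assumptions made for illustration, not a description of our actual Track2Vec pipeline.

```python
import math


def skipgram_pairs(playlists, window=2):
    """Build (target, context) pairs, treating each playlist as a sentence."""
    pairs = []
    for tracks in playlists:
        for i, target in enumerate(tracks):
            lo = max(0, i - window)
            hi = min(len(tracks), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    pairs.append((target, tracks[j]))
    return pairs


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def candidate_pool(seed_tracks, embeddings, k):
    """Return the k tracks closest to the mean seed vector (assumed strategy)."""
    seed_vecs = [embeddings[t] for t in seed_tracks if t in embeddings]
    if not seed_vecs:
        return []
    dim = len(seed_vecs[0])
    centroid = [sum(v[d] for v in seed_vecs) / len(seed_vecs) for d in range(dim)]
    seeds = set(seed_tracks)
    scored = [(cosine(centroid, vec), track)
              for track, vec in embeddings.items() if track not in seeds]
    scored.sort(reverse=True)
    return [track for _, track in scored[:k]]


# Toy playlists and hand-written 2-d embeddings (hypothetical values).
playlists = [["t1", "t2", "t3"], ["t2", "t4"]]
pairs = skipgram_pairs(playlists, window=1)

embeddings = {"t1": [1.0, 0.0], "t2": [0.9, 0.1],
              "t3": [0.0, 1.0], "t4": [0.8, 0.2]}
pool = candidate_pool(["t1"], embeddings, k=2)
print(pairs)
print(pool)
```

In practice the embeddings would come from a Word2vec implementation trained on pairs like these, and k would be on the order of the 10,000 candidates mentioned above.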

Information about each track in the candidate pool is embedded into vector spaces, with one embedding for each kind of information: artist, track, lyrics, sound features, and so on. Each embedding uses its own similarity metric; we tried several (including cosine, L2, and L1) and chose the best one for each. A fair assumption is that some non-linear function of these pieces of information is a good global feature for a song. We approach this by learning transformations such that a weighted linear combination of the transformed features approximates the unknown non-linear function. The transformations we tried ranged from Word2vec-like methods to LSTMs for lyrics. Finally, we learn the weights of the linear combination using a Bayesian optimization approach that tries to maximize r-precision across the whole dataset. More details can be found here.
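The weighted combination and the weight search can be sketched as follows. Note that this sketch swaps in plain random search for the Bayesian optimization step, purely to stay short and dependency-free; the objective (mean r-precision over playlists) is the same idea. The feature names, similarity values, and function names are hypothetical.

```python
import random


def combined_score(feature_sims, weights):
    """Weighted linear combination of per-feature similarity scores."""
    return sum(weights[name] * sim for name, sim in feature_sims.items())


def rank_candidates(candidates, weights):
    """Order candidate tracks by combined score, best first."""
    return sorted(candidates,
                  key=lambda track: combined_score(candidates[track], weights),
                  reverse=True)


def r_precision(ranked, held_out):
    """Share of held-out tracks found in the top-|held_out| ranked tracks."""
    truth = set(held_out)
    return len(truth.intersection(ranked[:len(truth)])) / len(truth)


def random_search(examples, feature_names, n_trials=200, seed=0):
    """Stand-in for Bayesian optimization: sample weights, keep the best."""
    rng = random.Random(seed)

    def objective(weights):
        scores = [r_precision(rank_candidates(cands, weights), held_out)
                  for cands, held_out in examples]
        return sum(scores) / len(scores)

    # Start from uniform weights so the result is never worse than that baseline.
    best_w = {name: 1.0 / len(feature_names) for name in feature_names}
    best_score = objective(best_w)
    for _ in range(n_trials):
        raw = {name: rng.random() for name in feature_names}
        total = sum(raw.values()) or 1.0
        weights = {name: value / total for name, value in raw.items()}
        score = objective(weights)
        if score > best_score:
            best_w, best_score = weights, score
    return best_w, best_score


# One toy playlist: per-feature similarity of each candidate to the seed tracks.
candidates = {
    "a": {"lyrics": 0.9, "audio": 0.1},
    "b": {"lyrics": 0.3, "audio": 0.9},
}
examples = [(candidates, ["a"])]  # track "a" is the held-out ground truth

best_w, best_score = random_search(examples, ["lyrics", "audio"])
print(best_w, best_score)
```

With uniform weights the wrong track ranks first in this toy example; the search recovers the held-out track by shifting weight toward the lyrics feature. A Bayesian optimizer would pursue the same objective while choosing each new trial based on the previous ones instead of sampling blindly.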

UI for collecting feedback

To obtain user feedback on our results, and to evaluate in a way that respects the fact that several different continuations can be equally good to a user, we built a UI to collect information from users. This system also lets us see how utility-driven playlists can be generated by tuning the weights of the different features in our prediction model.

For instance, it is reasonable to assume that for a playlist about "break-ups" it makes more sense to suggest songs with similar lyrics, whereas for a "working-out" playlist sound features would be better. This assumption, however, misses the nuanced correlations between these features. By running experiments that weight one feature more than another and observing people's responses, we can identify the optimal importance weights for the features given a particular target utility or mood. More details here.

We show here a comparison of our results to the current leaderboards on the Spotify RecSys challenge. Our model beats the current top submissions by a wide margin. However, there is a caveat to this analysis: our model uses 8 features as opposed to the 3 in the guidelines for the RecSys challenge. This suggests that using more features matters for playlist continuation. We also plan to build a variation of our model with 3 features for a more direct comparison.