# An analysis of Uncertainty Quantification algorithms and Bias in Recommender Systems

Authors

Alyssa Ho, Robert Beliveau, Shynn Lawrence, Vivek Alumootil, Dr. Ali

Abstract

Companies across the world are currently using recommendation systems in order to uniquely serve products, media, and ads to users such that users are more inclined to use their service. Household names – Spotify, Amazon, Netflix, Facebook, Google – all use such recommendation systems within their products; Because such systems are used so widely, we decided to investigate different methods of quantifying uncertainty about recommendations. Furthermore, we investigated correlations between data in the dataset we worked with, finding instances of unequal representation that could cause unethically bias towards different groups in recommendation system implementations.

## Background

Thousands of petabytes of data are processed daily, with thousands of multibillionaire companies using data to improve the lives of their customers. One of the most useful techniques software companies use is to recommend items/media based on their previous interests. Such a system is generally classified as a Recommender System and revolves around the idea that better recommendations usually lead to more satisfied customers (and higher profits).

There are many types of Recommender Systems, though many popular ones are used daily by many people. One of the most notable examples of Recommender Systems is the System used in Netflix. Netflix recommends movies based on a user’s previous rating of different movies, demographic data collected on users, and different features of movies (the movie genre, movie length, etc). Another popular example of a company using a Recommender System is Spotify. Spotify uses similar data about users, movies, and ratings (likes/dislikes) to recommend more music to its users. These technologies have been popularized in the area of entertainment and socialization (particularly by Instagram, Facebook, and Youtube), however, they can also be extremely useful in aiding humans – especially in the areas of medicine recommendations, criminal sentence recommendations, and parole release systems.

When considering all of the impacts that Recommender Systems have on Society, potential discrimination and bias must be carefully investigated. As many data scientists and Statisticians observe, “Garbage in – Garbage out.” In other words, the algorithms used in data science are rarely inherently discriminatory, but pre-existing bias in data usually leads to biased models. For this reason, one of the topics our project focused on was finding prominent bias in popular datasets.

Another relatively unforged path in the Recommender System world is Uncertainty Quantification – Giving bounds for recommendations and quantifying the model’s uncertainty for different recommendations. A particularly interesting aspect of Uncertainty Quantification (UQ) techniques is the resulting model’s inference bound size, coverage (similar to accuracy), and validation time – all of which are dependent on the type of UQ system and the method of making recommendations.

Overall, our goal in this project was to observe bias in datasets and compare different uncertainty quantification techniques paired with different recommendation algorithms. In this new age of Big Data and AI, our project hopes to help elucidate different trade-offs software engineers and data scientists will encounter and the high possibility of discriminatory ML systems stemming from the use of biased data.

## Data

We used the MovieLens 1M ratings data. It included three separate tables: ratings, users, and movies. All unknown factors were changed to a -1 and ignored during training.

The next step was to change string values to integers. We did that as follows, categorizing gender, genre, and zip code:

We hypothesized our model would do better with continuous data as we were using linear regression models, so we added random noise to our ratings. To see if our hypothesis was correct, we also tried the same models on discrete data to compare their effectiveness. This did not seem to make any difference.

Lastly, we split the data in multiple ways. For Bootstrap models, ratings were split into a training and test set. This simplified graphic depicts the two ways we split data where red is train and green is test. Splitting the data by columns or rows did not make much of a difference as can be seen in our Bootstrap Averages Results Section.

Splitting by rows to find trends in movie ratings:

[wpdatatable id=2]

Splitting by columns to find trends in user ratings:

[wpdatatable id=3]

For Conformal Inference models, ratings were split into a training, validation and test set. This simplified graphic depicts the one way we split data where red is train, yellow is validation, and green is test.

[wpdatatable id=4]

## Fairness/Bias

Before we explain our models, there is an important data bias we discovered that must be taken into account. Firstly, this database consists of 1709 females compared to 4331 males. This gender disparity can negatively affect the movies predicted as the model has been trained on mostly male data. We plan to continue the project to further observe if this is the case. Despite this unevenness, the gender of a user does not seem to significantly impact what rating they would give a movie of a certain genre. We came to this conclusion after graphing a simplified dataset of 1475 men and 539 women against all 18 genres: Action, Adventure, Animation, Children’s, Comedy, Crime, Documentary, Drama, Fantasy, Film-Noir, Horror, Musical, Mystery, Romance, Sci-Fi, Thriller, War, Western. It’s important to note that the following graphs only show a normalized distribution that doesn’t show the total lack of female ratings against male ratings.

As can be seen above, the percentages for each rating is pretty close for Film-Noir and Romance. Some have a slightly greater difference like for Fantasy and Musicals.

However, the differences in height never exceed 0.05 for any genre. More testing will have to be done to see if these slight differences actually add a bias and change the results of prediction models.

Aside from that, there are some interesting trends we discovered from just plotting these points. For example, every graph except Documentary follows a similar curve with a peak of 4. In addition, males are more likely to give out the highest rating of 5 than females as 100% of orange bars at x=5 are lower than blue.

## Methods

Bootstrap Averages:

This resampling method creates a rating prediction for the test set by taking the average of known ratings in a training set and then resampling the training data with replacement over a series of trials, collecting every average to then calculate the truest average. This average is then used to fill unknown ratings in the test dataset. To test the accuracy of this method, a 90% and 10% quantile were calculated from each matrix column of collected averages to generate reasonable intervals. Coverages were then measured by the average number of test ratings that fell within that interval.

Bootstrap Linear Regression:

The Bootstrap with Linear Regression method operates by going into each column of the dataset and splitting the column into two different sections, representing the testing and training sets. The testing set is ⅔ of the column, while the training is the remaining third, this is to maintain the same datasets as the other methods that rely on testing, validation, and training sets. After the data is divided into the sets, the -1 values are removed from each of the sets to remove any unnecessary values. At that point, the movies with less than a certain amount of ratings are removed from the set, the smallest value that this can be is 10, as any lower results in errors from sizes being too small. Each row that is removed for being a -1, such that a specific user hasn’t rated it, is removed on the user dataset as well, to prevent the bootstrap from grabbing a user who hasn’t rated that movie.

A bootstrap training set is then created with the same size as the original training set, and random rows are selected and placed into the bootstrap set. The random indices are also applied in the user dataset, grabbing the same random rows as the ratings dataset and saving them in an external array. From there, a linear regression model is created, fitted with the bootstrapped training set, and the model creates a prediction. That prediction is then compared with the testing set and given a score based on if each of the points are between the 90th quantile of the prediction. If all of the points are between the prediction’s 90th quantile, the model would receive a score of 1, and 0 if none of the points are between the quantile. After this scoring, the program repeats the process of randomly selecting indices, placing them into external arrays, predicting values using linear regression, and scoring 25 times to ensure that the first time wasn’t a randomly biased sample where every row was the same value.

Conformal Inference Averages:

For Conformal Inference with Averages, the dataset is initially split into thirds. The first third is used for training. In the case of recommendations with averages, every prediction is just the simple average of all the ratings for the movie in the training dataset. The second third is used for validation – predictions for the known data are made, and the accuracy of the prediction (the prediction minus the actual value) is collected. Then, the 90th quantile of this collection is taken and will be the bounds for all future recommendations – for example, if the 90th quantile was 1.5, then a recommendation of 3 for a movie would result in a 90% chance of the actual rating falling between [1.5, 4.5]. The final third of the dataset is used for testing – a similar process to validation – however instead of just predictions, the quantile taken in the previous step is used for creating bounds. If the actual value is within the prediction 土 quantile, then the model is awarded a 1. If it is not within the bounds, it is given a 0. Taking the average of all 1s and 0s – dependent on the accuracy of the model bounds – should give nearly 90%. In other words, creating bounds through this method allows the model to recommend a range of values that, on average, contain the actual rating 90% of the time.

Conformal Inference Linear Regression:

We split the data into three sets: training, validation and testing, split by rows (users). User features were read in from the data set and each user had a user vector containing their age and gender (one-hot encoding). In the training set, for each movie, linear regression was done on the users who have rated that movie and a vector was produced which maps user feature vectors to ratings for that movie. Thus, we ended up with a set of vectors, one for each movie. In the validation set, for each movie, the 90% quantile of the array of errors (defined differently for each submethod, but generally a difference between the predicted value and true rating) for that movie was calculated and stored. In the testing set, we used the stored quantiles to form confidence intervals (if our predicted value was p and the 90% quantile for the movie was q, then our interval was (p-q, p+q). We then calculated the coverage (the fraction of intervals we created that contained the true rating). This was predicted to be 0.9, since the quantile we took was 90%. Movies with less than a certain number of ratings were ignored.

## Results

Bootstrap Averages With Continuous Ratings:

We decided to split the dataset into training and test with two different methods to see which one would work better. The first was by rows, so the model would need to find averages by each movie, while the second was by columns, so averages were found for each user. Although bootstrap by users does slightly better on coverage, splitting data by columns or rows does not seem to make a drastic impact.

Compared to other models, the interval sizes generated by Bootstrap averages were very small. The reason for this is due to the fact that the calculated averages of each movie/user are very similar, many only decimals apart. Thus, the quantiles will also be close in number so the size, the difference between the quantiles, are very small. This is also why our coverage is slim since the majority of test data will not be able to fall in such a small interval. This is a huge problem as we want our coverage to meet the 90% mark.

The graphs also show an immense amount of outliers. We believe the reason for this is due to the large number of movies with little known ratings. Because the number of ratings are so different for each movie, the coverage for each will reflect on this diversity, producing many outliers. To test this, we increased the variable r, the minimum number of ratings movies needed. If they did not exceed r, they were dropped from the database. The following two graphs prove that this is the case: with a higher variable r, the less outliers appear on the plots.

Although increasing r does produce fewer outliers, it does not fix the problem of our small coverage and size. We suspected the problem was calculating the averages and how similar they were to each other. Therefore, we decided to just get rid of that step and calculate quantiles from the list of resampled, whole number data. It worked.

Bootstrap with Linear Regression:

Surprisingly, combining both Bootstrap and Linear Regression results in a lot fewer outliers than the other methods. This does come with the downside that every point is within the range of 0.0 and 1.0, definitely not as confident as the Bootstrap with Averages. Taking a look at the sizes now, the maximum size being a 4 and the lowest being 0, it reflects almost exactly what the coverage showed. This broad spectrum of both quantiles and coverages shows that this method of combining Bootstrap and Linear Regression is not the most optimal or accurate in any sense of the word. The average coverage and average sizes were both around the center of the spectrum, showing almost a bell-curve like representation with bootstrap and linear regression.

Even with the changes in R, the amount of movies dropped from the analysis for not having enough ratings, the only data pieces that changed were that the size dropped slightly, to around 2.1 instead of 2.34. If R was between 0 and 100, the linear regression model wouldn’t be fitted correctly, allowing for a complete misjudgement of the general trend.

Outside of the R changes and the general inaccuracies of the coverage and sizes, the runtime of the program was a completely different factor that could range from an hour to a few days. This runtime primarily comes from the CVXPY library, where for each movie that had enough ratings, each user who rated it, and for each bootstrap resampling, a problem would be created, fitted to a linear regression model, solved, and predicted, the process taking around 10µs. It is definitely optimized to do complex equations singularly, but the repetition that is required exponentially increases the time taken. This difference can be shown with the difference of the bootstrap resampling amount, one being 25 and the other being 100. When the bootstrap resampling amount was 25, meaning that there would be 25 resamples per user, the program had a runtime of around 16 hours. When the resampling amount was 100, the program had a predicted runtime of 3 days, assuming that it had the same R values as the 25 resample.

Overall, this method resulted in the highest size distribution and the lowest average coverage out of the other methods mentioned, not including the runtime, storage cost, and maintenance the system requires.

Conformal Inference with Averages:

Interestingly, there are many outliers below the Coverage graph. This is due to the fact that there are a high number of movies with not many ratings – for example, some movies that have only been rated by < 10 people. These movies easily become outliers as their low amount of ratings pull their average either very high, or really low – and thus the recommendations for them are not as accurate. An alternate method of presenting the testing coverages and sizes include only calculating the accuracy of the model on the movies which have more than 100 ratings in the validation dataset:

Note how in this case, the mean in the Coverage box-plot is actually somewhat higher than 0.9, though the actual average coverage is 0.9. This is because matplotlib (the python library used for plotting the data) is plotting all of the average coverages per movie – not the total coverage. In other words, it is plotting the “mean of means”, which is not the actual mean of the entire data as a whole.

A differing perspective on conformal inference includes taking the quantiles from the validation set on a per-movie basis – each movie has a different quantile calculated on its accuracy in the validation set. This should, theoretically, make the resulting prediction bounds more accurate as they would be more fine tuned for each movie. Another addition that could be made is increasing the bounds based on the amount of ratings in the validation set. For example, a movie which has only one rating in the validation set would have larger bounds because the model would have less confidence, because it only ran through validation on only one data-point for that specific movie. This is done by multiplying the quantile per-movie by 1 + 1/n, with n being the amount of ratings in the validation set for that movie. Using this technique, the previously mentioned coverage outliers almost completely disappear – as the bounds are more suited for movies with low amounts of ratings. Indeed, the following data corroborates that fact:

As seen in these box and whisker charts, the amount of coverage outliers is extremely low, and the overall average coverage is 5% higher due to the “fine-tuning” (movie-specific quantiles) with this method.

Conformal Inference with Linear Regression:

We split the data by rows (by movies) into the 3 sets: training, validation and testing

Most of the data below uses the following set sizes:

(m1 = 5000, m2 = 500, m3 = 500, n= 1000 a = 0.1, z = 60)

m1, m2 and m3 are the sizes of the training, validation and testing sets, respectively. n is the number of movies, and a is the alpha level. z is the cutoff level for the number of movie ratings (movies with < z ratings are ignored). We chose m1 to be much larger than m2 and m3 since empirical evidence seems to show placing most of the 6000 available users into the training set produces the lowest average sizes for a given coverage level.

Each regression takes in a user feature vector, which contains their gender and age. Empirical evidence seems to show that including additional information (e.g occupation, ZIP code) isn’t very helpful.

Each algorithm differs by the loss function of the regression done. Quantile Regression is not conformal in the same way as the other three, but it is still similar. Here are descriptions of each algorithm:

Least-Squares: This method aps a vector of user features to a rating for a particular movie by minimizing the mean of the squares of the rating errors.

Least-Abs-Mean: This method does the same as above, but instead it minimizes the mean absolute value instead of the mean square of the errors.

Quantile Reg: This method ses gradient boost regression (scikit-learn) to produce a 10% and 90% quantile for the value of a rating. It does not do conformal inference in the same way as the first two methods.

Class-Conformal: This method takes in a user feature vector and outputs probabilities of a rating of a 1, 2, 3, 4 and 5. In the validation set, this method found the 10% quantile for the probability of the true rating (that is, if the 10% quantile was x, then 90% of the time, the true rating will be predicted by the model to have a probability of $\geq$ x). In the testing set, this was then used to make discrete sets of possible ratings (such as {1, 2, 4}).

Least-Squares, Least-Abs-Mean and Class-Conformal were very similar. They all had close to an average size of 2.8 and coverage of 90%. Performance improves considerably when z is increased; however, since the model should be able to predict ratings in all movies, not just those with a large number of ratings, we chose to keep z at 60, which is already quite high.

An important note should be made about the sizes: these sizes are not directly comparable to some of the other methods since they describe discrete sets rather than continuous intervals. That is, these methods produced sets such as {1, 2, 4} with a size of 3, rather than an interval like [1.2, 3.9] with size 2.7. In fact, the former way generally resulted in much larger sizes. When these methods were done the latter way, we found sizes of around 1.9, which is more comparable to the sizes of the other methods.

The motivation for choosing other methods of linear regression was that least-squares seeks to minimize the sum of the squared error, while we actually seek to minimize the 90% percentile of the error. It wasn’t clear that least-squares regression is the best option. Moreover, we hypothesized that using other loss functions for regression would make the expansion of the user-feature vectors successful (including more information, such as their profession or ZIP code). Unfortunately, this did not happen, although further research needs to be done to investigate the relationship between information in the user-feature vectors and the success of a model.

## Conclusion

In terms of both size and coverage, linear regression with conformal inference was the most successful method. This was expected, since both averaging methods don’t take into account user information. Coverage ended up being quite low for both bootstrap methods, and we suspect this is a result of the bootstrap sampling changing the distribution. Both bootstrap methods were also very slow compared to the conformal inference methods. Conformal inference with averaging was reasonably successful, but did not produce intervals as small as those of conformal inference with linear regression. However, averaging was significantly faster than the other methods. For five possible ratings, conformal inference with linear regression managed to give sets of size 2.8 with 90% coverage. While this is reasonably good, there is still significant room for improvement.

## Future Directions

Fairness: The first step is to delve further into evaluating the bias in our data and seeing if it translates into the predictions we make. We currently only analyzed gender vs genre bias in our pre-existing database. We plan to use the same methods to analyze the predictions our linear regression models made and then compare them. Other potential biases like age, occupation, and location still need to be tested. These objectives also open social and philosophical questions that we hope to further discuss and examine such as whether predictions should even use gender as a factor. If movies were recommended solely by what the user previously rated high, a model could be unbiased to their gender but still recommend good results. The only problem is, this requires data a new user to the streaming service wouldn’t have. In addition, if we were to erase any gender bias in our dataset, would it significantly weaken our model? But one could argue that with bias, the recommended movies could contribute to gender stereotypes. Still, even if perfect predictions could be made without gender bias, wouldn’t it be helpful in some cases? For example, what if there is an educational documentary on feminine products and female empowerment. Would it be useful to add this to a female user’s list even if they don’t usually watch documentaries? One could argue that this should also be recommended to males as all genders need to be aware of diverse ideas and that society shouldn’t taboo these topics and further divide genders into two distinct categories. This leads into another problem: If you only feed people what they like, then it can get kind of one-sided and build a community of ignorance. Perhaps instead of being biased, a recommendation system can have some degree of randomness to keep the selection diverse. On the other hand, this will not attract revenue as many customers may not enjoy it as much. These sort of ethical questions are quite interesting to us and we hope to look into them in the future.

Computational Statistics:

While the various methods of implementing recommender systems vary in their accuracy and strength, they also vary in their speed. In our own methods, we found that the conformal inference methods were much faster than the bootstrap methods. Further analysis needs to be done to investigate the relationship between accuracy and time with our methods and other methods, such as those involving neural networks and machine learning.

Application: We plan to continue this project after the program ends by implementing the best models we researched to produce a working recommendation system on a website or other application. We’ll first need to find some simple problem that our application could help with that will attract users to use it. Perhaps it will be a website that will find similar movies for a user to watch that also uses our uncertainty quantification methods or we could use a different database entirely.

Neural Networks: We also plan to investigate modern methods of creating recommender systems, such as using neural networks. These methods are expected to be more powerful than our current methods and should provide better predictions.