Fergus Duniho wrote on Thu, Jul 21, 2016 04:16 PM UTC:
Since some people here are mathematicians, I thought I would share my thoughts on sorting the results and ask for advice on how to do it better. I chose not to sort by mean first, because a game with a single Excellent rating would end up with a higher mean than a game that got several Excellent ratings and one lower rating. Given that the latter game has attracted more attention, it doesn't seem fair to rank the game with one rating (that happens to be Excellent) above it. Mode and median are less prone to this problem. I chose to go with mode first, because games with the same mode can be distinguished by the size of the mode (the number of ratings at that value), and this gives an indication of popularity. If I went with median first and tried to break ties between equal medians by sample size, it wouldn't have the same effect, since a larger sample size increases the chance of there being ratings below the median.
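To make the comparison concrete, here is a rough Python sketch of sorting by mean versus sorting by mode and then mode size. The game names, the 1 to 5 encoding (Poor = 1 through Excellent = 5), and the helper `mode_and_size` are made up for illustration, and the helper assumes a single mode.

```python
from collections import Counter
from statistics import mean

# Hypothetical games with ratings encoded 1 (Poor) to 5 (Excellent).
games = {
    "One Vote": [5],
    "Several Votes": [5, 5, 5, 4],
}

def mode_and_size(ratings):
    """Return (mode, number of ratings at the mode).

    Assumes a single mode; ties between values are not handled here.
    """
    value, count = Counter(ratings).most_common(1)[0]
    return value, count

# Sort descending by mean, then descending by (mode, mode size).
by_mean = sorted(games, key=lambda g: mean(games[g]), reverse=True)
by_mode = sorted(games, key=lambda g: mode_and_size(games[g]), reverse=True)

print(by_mean)  # ['One Vote', 'Several Votes'] since 5 > 4.75
print(by_mode)  # ['Several Votes', 'One Vote'] since (5, 3) > (5, 1)
```

Sorting on the tuple puts the mode value first and uses the mode size only to break ties.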
The main problem with using mode is that a sample may have no mode or multiple modes, and then a single mode cannot be calculated. Currently, the methods the sorting code uses for handling multiple modes are inconsistent. For two modes, it favors the lower mode if there are more Poor and BelowAverage ratings than there are Good and Excellent ratings, or the higher mode if there are more of the latter. If these are equal, it returns a value of Average. For three modes and for five modes (which also covers the case of no mode), it returns the median of the modes. But for four modes, it returns the mean of the modes. One thought I have is to make this all consistent by always returning the median of the modes. For one mode, this would be the mode itself; for two, it would be the mean of the two modes; for three or five, it would be the middle mode; and for four, it would be the mean of the two middle modes. Looking over the raw scores column, I see that most games do have single modes, a small number have two modes, and I didn't notice any with three or more.
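The "always take the median of the modes" idea could be sketched roughly as follows in Python. This is only an illustration, not the code the site runs: the function name `median_of_modes` and the 1 to 5 encoding (Poor = 1 through Excellent = 5) are assumptions.

```python
from collections import Counter
from statistics import median

def median_of_modes(ratings):
    """Return the median of the most frequent rating values.

    ratings: list of ints from 1 (Poor) to 5 (Excellent), an assumed encoding.
    """
    if not ratings:
        raise ValueError("no ratings to summarize")
    counts = Counter(ratings)
    top = max(counts.values())
    # Every value that ties for the highest count is a mode.
    modes = sorted(value for value, count in counts.items() if count == top)
    return median(modes)

print(median_of_modes([5, 5, 4]))           # 5   (one mode)
print(median_of_modes([1, 1, 5, 5]))        # 3.0 (two modes: their mean)
print(median_of_modes([1, 1, 3, 3, 4, 4]))  # 3   (three modes: the middle one)
print(median_of_modes([1, 2, 4, 5]))        # 3.0 (four modes: mean of 2 and 4)
print(median_of_modes([1, 2, 3, 4, 5]))     # 3   (five modes: the middle one)
```

With an even number of modes the result can fall between two rating values, so a sort that expects whole ratings would need to allow for halves.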
One drawback to using mode first is that not every rating has an effect on determining the mode. So, for example, ratings such as 1, 1, 1, 1, 4, 4, 4, 5, 5, 5 would have a mode of 1 even though the 4s and 5s together outnumber the 1s (the median here is 4). If I solve this problem by using mode only when it holds a majority, that gives the same value as using median. After all, whenever the mode accounts for over 50% of the ratings, the median will equal the mode. So I have thought of dropping mode and using median instead, or perhaps of sorting by median before mode. A median is affected by all ratings, but not with the fine-grained precision of a mean. For the small samples of ratings each game has, this could be more suitable than relying on mode or mean first.
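The example above can be checked with a short Python sketch using the standard `statistics` module (`multimode` needs Python 3.8 or later). The helper `majority_value` is a made-up name for illustration, and ratings are assumed to be encoded 1 (Poor) to 5 (Excellent).

```python
from collections import Counter
from statistics import median, multimode

def majority_value(ratings):
    """Return the rating held by more than half the raters, or None."""
    value, count = Counter(ratings).most_common(1)[0]
    return value if 2 * count > len(ratings) else None

ratings = [1, 1, 1, 1, 4, 4, 4, 5, 5, 5]
print(multimode(ratings))       # [1]
print(median(ratings))          # 4.0
print(majority_value(ratings))  # None (4 of 10 is not a majority)

# When one value does hold a majority, the median lands on it.
lopsided = [5, 5, 5, 4, 1]
print(majority_value(lopsided), median(lopsided))  # 5 5
```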
In general, I want the order to reflect both average rating and popularity. One thought was to total up the scores and sort those, but this won't work well when Poor and BelowAverage count as 1 and 2 points, since even low ratings would add to a game's total. Alternatively, I could shift the points so that Poor is -2, BelowAverage is -1, Average is 0, Good is 1, and Excellent is 2, then sort the totals.
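As a rough Python sketch of the shifted-total idea, assuming ratings are stored as 1 (Poor) to 5 (Excellent) so that subtracting 3 gives the -2 to 2 scale. The function name `shifted_total` and the sample games are hypothetical.

```python
def shifted_total(ratings):
    """Sum ratings after shifting 1..5 to -2..2 (Average becomes 0)."""
    return sum(r - 3 for r in ratings)

# Hypothetical games with ratings encoded 1 (Poor) to 5 (Excellent).
games = {
    "Lone Fan": [5],
    "Crowd Favorite": [5, 5, 5, 4],
    "Widely Panned": [1, 1, 1],
    "Single Shrug": [3],
}

# Raw totals let three Poor ratings tie with one Average rating.
print(sum(games["Widely Panned"]), sum(games["Single Shrug"]))  # 3 3

# Shifted totals separate them and reward well-liked, much-rated games.
ranked = sorted(games, key=lambda g: shifted_total(games[g]), reverse=True)
print(ranked)  # ['Crowd Favorite', 'Lone Fan', 'Single Shrug', 'Widely Panned']
```

One thing to watch with this scale is that Average ratings contribute nothing, so a game with many Average ratings totals the same as a game with no ratings at all.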