
(To be updated) Recommender systems: the classic collaborative filtering algorithm

Dataset

The classic MovieLens dataset (the excerpt below matches the README of the 1M release):

All ratings are contained in the file "ratings.dat" and are in the
following format:

UserID::MovieID::Rating::Timestamp

- UserIDs range between 1 and 6040
- MovieIDs range between 1 and 3952
- Ratings are made on a 5-star scale (whole-star ratings only)
- Timestamp is represented in seconds since the epoch as returned by time(2)
- Each user has at least 20 ratings
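
A minimal sketch of reading records in this format into the (user, item) pairs that the rest of the post works with. The names Rating, parse_rating_line and load_pairs are hypothetical, and the two sample lines are only illustrations of the documented format; in practice the lines would come from ratings.dat.

```python
from collections import namedtuple

# One parsed record of the form UserID::MovieID::Rating::Timestamp
Rating = namedtuple("Rating", ["user", "movie", "rating", "timestamp"])

def parse_rating_line(line):
    """Parse one 'UserID::MovieID::Rating::Timestamp' record."""
    user, movie, rating, timestamp = line.strip().split("::")
    return Rating(int(user), int(movie), int(rating), int(timestamp))

def load_pairs(lines):
    """Keep only (user, movie) pairs; the rating value itself is ignored."""
    return [(r.user, r.movie) for r in map(parse_rating_line, lines)]

# Illustrative sample lines in the documented format
sample = ["1::1193::5::978300760\n", "2::661::3::978302109\n"]
pairs = load_pairs(sample)
```

Dropping the rating value is a choice that fits the set-based, co-occurrence view of the data used below; a rating-aware variant would keep the third field.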

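Experiment design

The experiment uses K-fold cross-validation: the user behavior data is split at random into K roughly equal parts, one of which serves as the test set while the remaining K-1 parts form the training set. In the code, the number of folds is the parameter M and the index of the test fold is k. Collaborative filtering only looks at item/user co-occurrence, so each user's interaction sequence is stored as a set.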

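# Split the dataset into a training set and a test set
import random
from collections import defaultdict

def SplitData(data, M, k, seed):
    """Assign each (user, item) pair to one of M folds; fold k is the test set."""
    test = []
    train = []
    random.seed(seed)
    for user, item in data:
        # randint is inclusive on both ends, so draw from 0..M-1 to get exactly M folds
        if random.randint(0, M - 1) == k:
            test.append([user, item])
        else:
            train.append([user, item])
    # Store each user's items as a set
    train_ = defaultdict(set)
    test_ = defaultdict(set)
    for user, item in train:
        train_[user].add(item)
    for user, item in test:
        test_[user].add(item)
    return train_, test_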
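
The cross-validation loop itself is not shown in the post. The following is a sketch of how the M folds could be run, assuming the same seed is reused for every k so that each interaction lands in exactly one test fold. The helper split_fold is a hypothetical compact variant of the split above, and the toy data is made up.

```python
import random
from collections import defaultdict

def split_fold(data, M, k, seed):
    """Hypothetical helper: fold k of M is the test set, the rest is training."""
    rng = random.Random(seed)  # same seed => same fold assignment for every k
    train, test = defaultdict(set), defaultdict(set)
    for user, item in data:
        target = test if rng.randrange(M) == k else train
        target[user].add(item)
    return train, test

# Toy interactions: 5 users x 8 items (made-up data)
data = [(u, i) for u in range(5) for i in range(8)]
M = 4
test_sizes = []
for k in range(M):
    train, test = split_fold(data, M, k, seed=42)
    n_train = sum(len(items) for items in train.values())
    n_test = sum(len(items) for items in test.values())
    test_sizes.append(n_test)
    # A model would be trained on `train` and scored on `test` here;
    # the M scores are then averaged.
```

Because fold membership is drawn at random per interaction, the folds are only approximately equal in size.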

Evaluation metrics

Four metrics are used: recall, precision, coverage, and novelty (measured through average popularity).

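Recall: the share of the items that should have been recommended that actually were, i.e. how much of what the user likes was retrieved. Formally, let $R(u)$ be the $N$ items recommended to user $u$ and $T(u)$ the set of items the user likes in the test set; then

$$Recall = \frac{\sum_{u} |R(u) \cap T(u)|}{\sum_{u} |T(u)|}$$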

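Precision: the share of the recommended items that are correct, i.e. how many of the recommendations hit. With the same notation:

$$Precision = \frac{\sum_{u} |R(u) \cap T(u)|}{\sum_{u} |R(u)|}$$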
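
A small worked example may help fix the two definitions. The sketch below pools hits over all users (total hits divided by total test items for recall, and by total recommended items for precision), which is also how the Metric function later in this post accumulates its counts. The function name recall_precision and the toy sets are made up for illustration.

```python
def recall_precision(recommended, relevant):
    """recommended: {user: list of items}; relevant: {user: set of test items}."""
    hits = sum(len(set(recommended[u]) & relevant.get(u, set())) for u in recommended)
    n_relevant = sum(len(relevant.get(u, set())) for u in recommended)
    n_recommended = sum(len(items) for items in recommended.values())
    return hits / n_relevant, hits / n_recommended

# R(u): three recommendations per user; T(u): items liked in the test set
R = {"u1": ["a", "b", "c"], "u2": ["a", "d", "e"]}
T = {"u1": {"a", "c", "f", "g"}, "u2": {"d"}}

# Hits: u1 -> {a, c}, u2 -> {d}, so 3 hits out of 5 test items and 6 recommendations
recall, precision = recall_precision(R, T)
```

Here recall is 3/5 and precision is 3/6. Recommending more items per user can only keep or raise the number of hits, so recall tends to rise with N while precision usually falls.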

Coverage: the share of all items that appear in at least one recommendation list. It reflects how well the recommender surfaces long-tail items.
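
Written as a formula, under the assumption that $U$ denotes the set of users and $I$ the set of all items (both symbols are introduced here; the Coverage function later in this post takes $I$ to be the items seen in the training set):

$$Coverage = \frac{\left|\bigcup_{u \in U} R(u)\right|}{|I|}$$

A recommender that shows every user the same $N$ best-sellers has a coverage of only $N / |I|$, however accurate it is.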

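Novelty: measured by the average popularity of the recommended items; the higher the average popularity, the lower the novelty. $Popularity(x)$ is the number of times item $x$ appears across all user sequences: the more occurrences, the more popular the item.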
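
As a formula, the average computed by the Popularity function later in this post can be written as follows. The logarithm comes from that implementation rather than from the definition above; presumably it is there because item popularity is long-tailed, and the log keeps a few very popular items from dominating the mean.

$$\overline{Popularity} = \frac{\sum_{u} \sum_{i \in R(u)} \log\left(1 + Popularity(i)\right)}{\sum_{u} |R(u)|}$$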

# Evaluation metrics: recall and precision
def Metric(train, test, N, all_recommend_list):  # N: number of items recommended per user
    hit = 0
    recall_all = 0     # denominator of recall
    precision_all = 0  # denominator of precision
    for user in train.keys():
        tu = test.get(user, set())  # items the user likes in the test set
        rank = all_recommend_list[user][0:N]  # top-N (item, score) pairs
        for item, pui in rank:
            if item in tu:
                hit += 1
        recall_all += len(tu)
        precision_all += N
    recall = hit / (recall_all * 1.0)
    precision = hit / (precision_all * 1.0)
    return recall, precision

# Evaluation metric: coverage
def Coverage(train, test, N, all_recommend_list):  # N: number of items recommended per user
    recommend_items = set()
    all_items = set()
    for user in train.keys():
        for item in train[user]:
            all_items.add(item)
        rank = all_recommend_list[user][0:N]  # top-N (item, score) pairs for this user
        for item, pui in rank:
            recommend_items.add(item)
    coverage = len(recommend_items) / (len(all_items) * 1.0)
    return coverage


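# Evaluation metric: novelty (average popularity of the recommended items)
import math

def Popularity(train, test, N, recommend_res):  # N: number of items recommended per user
    # Count how many users interacted with each item in the training set
    item_popularity = dict()
    for user, items in train.items():
        for item in items:
            if item not in item_popularity:
                item_popularity[item] = 0
            item_popularity[item] += 1
    popularity = 0
    n = 0
    for user in train.keys():
        rank = recommend_res[user][0:N]
        for item, pui in rank:
            # The log damps the effect of a few very popular items
            popularity += math.log(1 + item_popularity[item])
            n += 1
    popularity /= n * 1.0
    return popularity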
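
All three functions above expect the recommendations as a dict that maps each user to a list of (item, score) pairs sorted by descending score, since they slice the first N entries and unpack each entry into two values. The sketch below builds such a structure with a hypothetical most-popular baseline. It is not a collaborative filtering algorithm and is only meant to show the expected input shape; the function name and the toy data are made up.

```python
from collections import Counter

def most_popular_recommendations(train, N):
    """Hypothetical baseline: the N most popular items a user has not seen yet."""
    counts = Counter(item for items in train.values() for item in items)
    ranked = counts.most_common()  # [(item, count), ...] sorted by count, descending
    return {
        user: [(item, cnt) for item, cnt in ranked if item not in seen][:N]
        for user, seen in train.items()
    }

# Toy training set: user -> set of items
train = {"u1": {"a", "b"}, "u2": {"a", "c"}, "u3": {"a", "b", "d"}}
recs = most_popular_recommendations(train, N=2)
```

A user who has already seen most of the catalogue may get fewer than N items back, in which case the fixed N in the precision denominator slightly understates precision.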