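Graph algorithms are nothing new, and open-source libraries have long offered relatively simple implementations of many of them. In recent years, researchers have been looking for ways to compensate for the limitations of deep learning, especially its lack of interpretability and its weakness at causal reasoning, and graph neural networks (GNNs) have become a hot topic as a result. As GNNs have spread, many graph-based frameworks have emerged, such as the well-known DGL. The mainstream deep learning frameworks PyTorch and TensorFlow have also begun to support graph algorithms.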
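Because graph data has unique strengths, more and more companies are exploring how to combine it with machine learning, both to improve their algorithms and to drive the next generation of graph databases. Many companies have built graph databases and graph analytics platforms in-house, but mature open-source tools are still relatively scarce. For companies without the capacity to build their own, the key question is how to do machine learning on graph data at all.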
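The rest of this article presents one solution: building a machine learning classifier with the graph database Neo4j and scikit-learn. We will cover everything from the underlying theory to a concrete implementation.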
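Neo4j is a graph database. The mainstream graph databases on the market today include TigerGraph, Neo4j, Amazon Neptune, JanusGraph, and ArangoDB. In recent years Neo4j has held a leading position in the graph database field, and it is especially prominent in knowledge graph applications.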
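Neo4j uses the Cypher query language for graph analysis. Installation is simple and is essentially a one-click process. Neo4j Desktop can be downloaded from: https://neo4j.com/download/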
In addition, the book "Graph Algorithms" walks through example algorithms implemented on Neo4j. Its authors, Amy Hodler and Mark Needham, are also Neo4j employees.
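Graph databases excel at analyzing relationships between heterogeneous data points, for example in fraud detection or in the friendships of a social network. When predicting relationships in a social network, the core task is to identify connections that are likely to form, and link prediction algorithms are at the heart of that task. Neo4j's graph algorithms library supports several link prediction algorithms, so we can learn link prediction from the ground up, load data into Neo4j, and build a link prediction model with scikit-learn.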
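The concept of link prediction dates back to 2004, when Jon Kleinberg and David Liben-Nowell published the paper "The Link Prediction Problem for Social Networks". They proposed that new relationships likely to appear in the future can be predicted by analyzing the similarity of nodes in the existing network.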
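Link prediction is not limited to social networks. It applies to many other settings, such as predicting relationships between members of terrorist organizations, interactions between molecules in biological networks, and potential co-authorships in citation networks.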
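At its core, link prediction is about forecasting events that may happen in the future, for example whether two people in a citation network are likely to co-author a paper.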
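Kleinberg and Liben-Nowell described a family of link prediction measures, including common neighbors, Adamic Adar, and preferential attachment. Each one computes a similarity score for a pair of nodes, and that score is used to predict future relationships.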
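Common neighbors: a simple measure that counts the neighbors two nodes share. The more neighbors they have in common, the more likely they are to become connected.
Adamic Adar (AA): takes the degree of each common neighbor into account. Common neighbors are weighted by their degree, so that a shared neighbor with few connections counts for more than a shared hub, and the weights are summed to give the similarity score.
Preferential attachment: scores a pair of nodes from their degrees. Nodes with higher degree are more likely to form new connections.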
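To make the three measures concrete, here is a minimal pure-Python sketch on a tiny hand-made graph. The graph, the node names, and the function names are hypothetical and exist only for illustration; this is not how Neo4j implements the measures. It assumes the usual definitions: Adamic Adar sums 1/log(degree) over the common neighbors, and preferential attachment multiplies the two degrees.

```python
import math

# Hypothetical undirected toy graph as an adjacency map (illustrative data only).
graph = {
    "A": {"B", "C", "D"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"A", "C", "E"},
    "E": {"D"},
}

def common_neighbors(g, x, y):
    """Number of neighbors shared by x and y."""
    return len(g[x] & g[y])

def adamic_adar(g, x, y):
    """Sum of 1 / log(degree) over the shared neighbors of x and y.

    A shared neighbor always has degree >= 2 (it touches both x and y),
    so the logarithm is never zero.
    """
    return sum(1.0 / math.log(len(g[z])) for z in g[x] & g[y])

def preferential_attachment(g, x, y):
    """Product of the degrees of x and y."""
    return len(g[x]) * len(g[y])

# Score the unconnected pair (B, D).
cn = common_neighbors(graph, "B", "D")
aa = adamic_adar(graph, "B", "D")
pa = preferential_attachment(graph, "B", "D")
print(cn, round(aa, 4), pa)
```

Here B and D share the neighbors A and C, so the common-neighbors score is 2, and since both shared neighbors have degree 3 the Adamic Adar score is 2/ln(3).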
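Neo4j's graph algorithms library supports six link prediction algorithms: Adamic Adar, Common Neighbors, Preferential Attachment, Resource Allocation, Same Community, and Total Neighbors.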
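Link prediction measures can be used to assess whether two nodes have a potential connection. There are two ways to apply them: use the scores directly, for example by predicting a link whenever a score exceeds a chosen threshold, or use the scores as features for a supervised learning model.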
Next, we will look in detail at how to build a link prediction model with the supervised learning approach.
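Once we have settled on supervised learning, two key questions need answering:
Which machine learning model to choose: the link prediction measures all compute some form of similarity between nodes, so the resulting features are likely to be correlated with one another. We therefore want a model that is not too sensitive to correlated features, such as a gradient boosting classifier or a random forest classifier.
How to split the data into training and test sets: because the examples in a graph are connected to each other, a simple random split can leak information from the test set into the training set. Instead, we split the graph at a point in time, and make sure the training and test networks have a similar structure.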
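The time-based split can be sketched in a few lines of plain Python. The edge list, the author names, and the helper `split_by_year` are hypothetical; the sketch only assumes that each co-authorship edge carries the year of the first collaboration, and it uses 2006 as the cutoff year.

```python
# Hypothetical co-authorship edges: (author1, author2, year of first collaboration).
edges = [
    ("Ann", "Bob", 2001),
    ("Bob", "Cid", 2004),
    ("Ann", "Dee", 2006),
    ("Cid", "Dee", 2009),
]

def split_by_year(edge_list, cutoff_year):
    """Put edges created before the cutoff in the training graph, the rest in the test graph."""
    train = [e for e in edge_list if e[2] < cutoff_year]
    test = [e for e in edge_list if e[2] >= cutoff_year]
    return train, test

train_edges, test_edges = split_by_year(edges, 2006)
print(len(train_edges), len(test_edges))
```

Because every training edge predates every test edge, the model never sees a relationship from the period it is evaluated on.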
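We will work through link prediction hands-on with the DBLP citation network dataset, building a machine learning model that predicts whether two authors will collaborate.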
First, import the data with the following Cypher statements:

```cypher
// Create constraints
CREATE CONSTRAINT ON (a:Article) ASSERT a.index IS UNIQUE;
CREATE CONSTRAINT ON (a:Author) ASSERT a.name IS UNIQUE;
CREATE CONSTRAINT ON (v:Venue) ASSERT v.name IS UNIQUE;

// Import the data in batches
CALL apoc.periodic.iterate(
  'UNWIND ["dblp-ref-0.json", "dblp-ref-1.json", "dblp-ref-2.json", "dblp-ref-3.json"] AS file
   CALL apoc.load.json("https://github.com/mneedham/link-prediction/raw/master/data/" + file)
   YIELD value WITH value
   RETURN value',
  'MERGE (a:Article {index:value.id})
   SET a += apoc.map.clean(value,["id","authors","references", "venue"],[0])
   WITH a, value.authors as authors, value.references AS citations, value.venue AS venue
   MERGE (v:Venue {name: venue})
   MERGE (a)-[:VENUE]->(v)
   FOREACH(author in authors |
     MERGE (b:Author{name:author})
     MERGE (a)-[:AUTHOR]->(b))
   FOREACH(citation in citations |
     MERGE (cited:Article {index:citation})
     MERGE (a)-[:CITED]->(cited))',
  {batchSize: 1000, iterateList: true}
);
```
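Create a CO_AUTHOR relationship between authors, storing the number of collaborations and the year of the first collaboration as properties:

```cypher
// Two authors are co-authors if they both wrote the same paper
MATCH (a1)<-[:AUTHOR]-(paper)-[:AUTHOR]->(a2:Author)
WITH a1, a2, paper
ORDER BY a1, paper.year
// The first paper in year order gives the year of the first collaboration
WITH a1, a2, collect(paper)[0].year AS year, count(*) AS collaborations
MERGE (a1)-[coauthor:CO_AUTHOR {year: year}]-(a2)
SET coauthor.collaborations = collaborations;
```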
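Split the data by year to create the training graph and the test graph:

```cypher
// Training subgraph: collaborations that started before 2006
MATCH (a)-[r:CO_AUTHOR]->(b)
WHERE r.year < 2006
MERGE (a)-[:CO_AUTHOR_EARLY {year: r.year}]-(b);

// Test subgraph: collaborations that started in 2006 or later
MATCH (a)-[r:CO_AUTHOR]->(b)
WHERE r.year >= 2006
MERGE (a)-[:CO_AUTHOR_LATE {year: r.year}]-(b);
```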
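Install the required libraries:

```bash
# The PyPI package is named scikit-learn; it is imported as sklearn
pip install py2neo==4.1.3 pandas scikit-learn
```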
Create the database connection:

```python
from py2neo import Graph
import pandas as pd

# Replace the credentials with those of your own Neo4j instance
graph = Graph("bolt://localhost", auth=("neo4j", "neo4jPassword"))
```
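Create the training dataset:

```python
# Positive examples: pairs of authors who collaborated in the training graph
train_existing_links = graph.run("""
MATCH (author:Author)-[:CO_AUTHOR_EARLY]->(other:Author)
RETURN id(author) AS node1, id(other) AS node2, 1 AS label
""").to_data_frame()

# Negative examples: pairs 2 to 3 hops apart that have not collaborated
train_missing_links = graph.run("""
MATCH (author:Author)
WHERE (author)-[:CO_AUTHOR_EARLY]-()
MATCH (author)-[:CO_AUTHOR_EARLY*2..3]-(other)
WHERE not((author)-[:CO_AUTHOR_EARLY]-(other))
RETURN id(author) AS node1, id(other) AS node2, 0 AS label
""").to_data_frame()

# Remove duplicates, then downsample the negatives to match the positives
train_missing_links = train_missing_links.drop_duplicates()
train_missing_links = train_missing_links.sample(n=len(train_existing_links))

# pd.concat replaces DataFrame.append, which was removed in pandas 2.0
training_df = pd.concat([train_missing_links, train_existing_links], ignore_index=True)
training_df['label'] = training_df['label'].astype('category')
```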
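Create the test dataset:

```python
# Positive examples: pairs of authors who collaborated in the test graph
test_existing_links = graph.run("""
MATCH (author:Author)-[:CO_AUTHOR_LATE]->(other:Author)
RETURN id(author) AS node1, id(other) AS node2, 1 AS label
""").to_data_frame()

# Negative examples: pairs 2 to 3 hops apart that have not collaborated
test_missing_links = graph.run("""
MATCH (author:Author)
WHERE (author)-[:CO_AUTHOR_LATE]-()
MATCH (author)-[:CO_AUTHOR_LATE*2..3]-(other)
WHERE not((author)-[:CO_AUTHOR_LATE]-(other))
RETURN id(author) AS node1, id(other) AS node2, 0 AS label
""").to_data_frame()

# Remove duplicates, then downsample the negatives to match the positives
test_missing_links = test_missing_links.drop_duplicates()
test_missing_links = test_missing_links.sample(n=len(test_existing_links))

# pd.concat replaces DataFrame.append, which was removed in pandas 2.0
test_df = pd.concat([test_missing_links, test_existing_links], ignore_index=True)
test_df['label'] = test_df['label'].astype('category')
```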
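The way the negative examples are built can be sketched without a database. The toy graph and the helpers `hop_distances` and `negative_pairs` below are hypothetical, and the sketch makes one simplifying assumption: it uses shortest-path hop distance, whereas the Cypher query matches variable-length paths. The idea is the same: take pairs that are 2 to 3 hops apart but not directly linked, deduplicate them, and downsample them to the number of positive examples.

```python
import random
from collections import deque

# Hypothetical toy co-authorship graph (undirected adjacency map): a simple chain.
graph = {
    "A": {"B"},
    "B": {"A", "C"},
    "C": {"B", "D"},
    "D": {"C", "E"},
    "E": {"D"},
}

def hop_distances(g, source, max_hops):
    """Breadth-first search returning {node: hops} for nodes within max_hops of source."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if dist[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in g[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist

def negative_pairs(g, min_hops=2, max_hops=3):
    """Unordered, deduplicated pairs whose hop distance lies in [min_hops, max_hops]."""
    pairs = set()
    for source in g:
        for node, hops in hop_distances(g, source, max_hops).items():
            if min_hops <= hops <= max_hops:
                pairs.add(tuple(sorted((source, node))))
    return sorted(pairs)

# Positive examples are the existing edges, stored once per unordered pair.
positives = sorted({tuple(sorted((a, b))) for a in graph for b in graph[a]})
negatives = negative_pairs(graph)

# Downsample the negatives so both classes have the same size.
random.seed(0)
sampled_negatives = random.sample(negatives, k=min(len(positives), len(negatives)))
print(len(positives), len(negatives), len(sampled_negatives))
```

Restricting negatives to nearby pairs keeps the classification task realistic, since distant pairs are trivially unlikely to connect, and balancing the classes keeps accuracy from being dominated by the negative class.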
Use a random forest classifier:

```python
from sklearn.ensemble import RandomForestClassifier

# 30 trees, each limited to depth 10; fixed seed for reproducible results
classifier = RandomForestClassifier(n_estimators=30, max_depth=10, random_state=0)
```
Define a function that generates the features:

```python
def apply_graphy_features(data, rel_type):
    # Compute three link prediction scores for every pair of nodes
    query = """
    UNWIND $pairs AS pair
    MATCH (p1) WHERE id(p1) = pair.node1
    MATCH (p2) WHERE id(p2) = pair.node2
    RETURN pair.node1 AS node1,
           pair.node2 AS node2,
           algo.linkprediction.commonNeighbors(p1, p2, {relationshipQuery: $relType}) AS cn,
           algo.linkprediction.preferentialAttachment(p1, p2, {relationshipQuery: $relType}) AS pa,
           algo.linkprediction.totalNeighbors(p1, p2, {relationshipQuery: $relType}) AS tn
    """
    pairs = [{"node1": pair[0], "node2": pair[1]}
             for pair in data[["node1", "node2"]].values.tolist()]
    params = {"pairs": pairs, "relType": rel_type}

    # to_data_frame must be called; the parentheses were missing
    features = graph.run(query, params).to_data_frame()
    return pd.merge(data, features, on=["node1", "node2"])

# Training features come from the early graph, test features from the full graph
training_df = apply_graphy_features(training_df, "CO_AUTHOR_EARLY")
test_df = apply_graphy_features(test_df, "CO_AUTHOR")
```
Train the model:

```python
# Feature columns: common neighbors, preferential attachment, total neighbors
columns = ["cn", "pa", "tn"]

X = training_df[columns]
y = training_df["label"]
classifier.fit(X, y)
```
Evaluate the model:

```python
from sklearn.metrics import recall_score, precision_score, accuracy_score

def evaluate_model(predictions, actual):
    # Compare the predicted labels with the true labels
    accuracy = accuracy_score(actual, predictions)
    precision = precision_score(actual, predictions)
    recall = recall_score(actual, predictions)

    metrics = ["accuracy", "precision", "recall"]
    values = [accuracy, precision, recall]
    return pd.DataFrame(data={'metric': metrics, 'value': values})

predictions = classifier.predict(test_df[columns])
y_test = test_df["label"]

evaluate_model(predictions, y_test)
```
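For readers less familiar with the three metrics, the following sketch computes them by hand from the confusion counts. The label lists are made up for illustration and are not results from the DBLP model; `basic_metrics` is a hypothetical helper, not part of scikit-learn.

```python
def basic_metrics(actual, predicted):
    """Accuracy, precision, and recall for binary labels (1 = link, 0 = no link)."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)  # links correctly predicted
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)  # links predicted that do not exist
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)  # real links that were missed
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)  # non-links correctly rejected

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall

# Made-up labels for eight candidate pairs (illustrative only).
actual = [1, 1, 1, 1, 0, 0, 0, 0]
predicted = [1, 1, 0, 0, 1, 0, 0, 0]

accuracy, precision, recall = basic_metrics(actual, predicted)
print(accuracy, round(precision, 3), recall)
```

In this example the model finds 2 of the 4 real links (recall 0.5) and 2 of its 3 predicted links are real (precision 2/3), which shows why accuracy alone does not tell the whole story.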
Run the triangle count algorithm:

```cypher
// Triangle counts and clustering coefficients on the training graph
CALL algo.triangleCount('Author', 'CO_AUTHOR_EARLY', {write: true, writeProperty: 'trianglesTrain', clusteringCoefficientProperty: 'coefficientTrain'});

// The same on the full graph, used for the test set
CALL algo.triangleCount('Author', 'CO_AUTHOR', {write: true, writeProperty: 'trianglesTest', clusteringCoefficientProperty: 'coefficientTest'});
```
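As a rough illustration of what the two node properties hold, the sketch below counts triangles and computes the local clustering coefficient on a hypothetical toy graph, then derives min/max pair features from them. It assumes the standard definition of the local clustering coefficient, 2T / (k(k-1)) for a node with k neighbors and T triangles; the helper names are made up, and Neo4j's implementation may differ in detail.

```python
from itertools import combinations

# Hypothetical undirected toy graph with two triangles: A-B-C and A-C-D.
graph = {
    "A": {"B", "C", "D"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"A", "C"},
}

def triangles(g, node):
    """Number of triangles through node: pairs of its neighbors that are linked to each other."""
    return sum(1 for u, v in combinations(sorted(g[node]), 2) if v in g[u])

def clustering_coefficient(g, node):
    """Fraction of the node's neighbor pairs that are themselves connected."""
    k = len(g[node])
    if k < 2:
        return 0.0  # fewer than two neighbors means no possible triangle
    return 2.0 * triangles(g, node) / (k * (k - 1))

# Pair features for (A, B): min and max over the two nodes.
min_triangles = min(triangles(graph, "A"), triangles(graph, "B"))
max_triangles = max(triangles(graph, "A"), triangles(graph, "B"))
min_coeff = min(clustering_coefficient(graph, "A"), clustering_coefficient(graph, "B"))
max_coeff = max(clustering_coefficient(graph, "A"), clustering_coefficient(graph, "B"))
print(min_triangles, max_triangles, round(min_coeff, 3), max_coeff)
```

Taking the minimum and maximum over the pair turns two per-node values into features that do not depend on the order of the nodes.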
Add the new features to the DataFrames, then retrain and re-evaluate the model:

```python
def apply_triangles_features(data, triangles_prop, coefficient_prop):
    # For each pair, take the min and max of the two nodes' stored properties
    query = """
    UNWIND $pairs AS pair
    MATCH (p1) WHERE id(p1) = pair.node1
    MATCH (p2) WHERE id(p2) = pair.node2
    RETURN pair.node1 AS node1,
           pair.node2 AS node2,
           apoc.coll.min([p1[$triangles], p2[$triangles]]) AS minTriangles,
           apoc.coll.max([p1[$triangles], p2[$triangles]]) AS maxTriangles,
           apoc.coll.min([p1[$coefficient], p2[$coefficient]]) AS minCoeff,
           apoc.coll.max([p1[$coefficient], p2[$coefficient]]) AS maxCoeff
    """
    pairs = [{"node1": pair[0], "node2": pair[1]}
             for pair in data[["node1", "node2"]].values.tolist()]
    params = {"pairs": pairs, "triangles": triangles_prop, "coefficient": coefficient_prop}

    # to_data_frame must be called; the parentheses were missing
    features = graph.run(query, params).to_data_frame()
    return pd.merge(data, features, on=["node1", "node2"])

training_df = apply_triangles_features(training_df, "trianglesTrain", "coefficientTrain")
test_df = apply_triangles_features(test_df, "trianglesTest", "coefficientTest")

# Retrain with the original and the new features
columns = ["cn", "pa", "tn", "minTriangles", "maxTriangles", "minCoeff", "maxCoeff"]
X = training_df[columns]
y = training_df["label"]
classifier.fit(X, y)

# Evaluate on the test set
predictions = classifier.predict(test_df[columns])
y_test = test_df["label"]

evaluate_model(predictions, y_test)
```
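With the steps above, we have built a link prediction model on top of a graph database and machine learning. We hope this material inspires further research on, and applications of, graph algorithms and link prediction.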