ONgDB节点相似度计算与组织机构实体消歧方案思路

 

Here’s the table of contents:

  1. 一、Jaccard相似度 - algo.similarity.jaccard - Mode.WRITE
    1. 1、创建测试数据
    2. 2、运行相似度计算
    3. 3、相似度计算结果
    4. 4、algo.similarity.jaccard.stream - Mode.READ
  2. 二、余弦相似度 - algo.similarity.cosine - Mode.WRITE
    1. 1、创建测试数据
    2. 2、运行相似度计算
    3. 3、algo.similarity.cosine.stream - Mode.READ
  3. 三、Pearson相似度 - algo.similarity.pearson - Mode.WRITE
    1. 1、创建测试数据
    2. 2、运行相似度计算
    3. 3、algo.similarity.jaccard.stream - Mode.READ
  4. 四、欧式距离 - algo.similarity.euclidean - Mode.WRITE
    1. 1、创建测试数据
    2. 2、运行相似度计算
    3. 3、algo.similarity.jaccard.stream - Mode.READ
  5. 五、重叠相似度 - algo.similarity.overlap - Mode.WRITE
    1. 1、创建测试数据
    2. 2、运行相似度计算
    3. 3、algo.similarity.jaccard.stream - Mode.READ
  6. 组织机构相似度计算
  7. 一、数据模型
    1. 1、LABEL
    2. 2、RELATIONSHIP
  8. 二、创建测试数据
  9. 三、计算组织机构相似性
  10. 四、计算组织机构相似性并生成相似关系
  11. 备注

组织机构实体消歧实现方案可借鉴这个思路实现【后续会继续记录一个组织机构消歧实现方案】

一、Jaccard相似度 - algo.similarity.jaccard - Mode.WRITE

1、创建测试数据

CREATE (a:Person {name:'Alice'})
CREATE (b:Person {name:'Bob'})
CREATE (c:Person {name:'Charlie'})
CREATE (d:Person {name:'Dana'})
CREATE (i1:Item {name:'p1'})
CREATE (i2:Item {name:'p2'})
CREATE (i3:Item {name:'p3'})
CREATE (a)-[:LIKES]->(i1),
 (a)-[:LIKES]->(i2),
 (a)-[:LIKES]->(i3),
 (b)-[:LIKES]->(i1),
 (b)-[:LIKES]->(i2),
 (c)-[:LIKES]->(i3)

2、运行相似度计算

不支持并发计算支持写入

MATCH (p:Person)-[:LIKES]->(i:Item)
WITH {item:id(p), categories: collect(distinct id(i))} as userData
WITH collect(userData) as data
CALL algo.similarity.jaccard(data, {write:true,showComputations:true,similarityCutoff:0.1}) yield p25, p50, p75, p90, p95, p99, p999, p100, nodes, similarityPairs, computations RETURN *

3、相似度计算结果

生成了一条SIMILAR关系线,更新了score属性;很明确可以看到人物节点之间的相似性得分。

graph

4、algo.similarity.jaccard.stream - Mode.READ

并发计算,不支持写入

MATCH (p:Person)-[:LIKES]->(i:Item)
WITH {item:id(p), categories: collect(distinct id(i))} as userData
WITH collect(userData) as data
call algo.similarity.jaccard.stream(data,{topK:4,concurrency:4,similarityCutoff:-0.1}) yield item1, item2, count1, count2, intersection, similarity RETURN * ORDER BY item1,item2

jaccard

二、余弦相似度 - algo.similarity.cosine - Mode.WRITE

1、创建测试数据

CREATE (a:Person {name:'Alice'})
CREATE (b:Person {name:'Bob'})
CREATE (c:Person {name:'Charlie'})
CREATE (d:Person {name:'Dana'})
CREATE (i1:Item {name:'p1'})
CREATE (i2:Item {name:'p2'})
CREATE (i3:Item {name:'p3'})
CREATE (a)-[:LIKES {stars:1}]->(i1),
 (a)-[:LIKES {stars:2}]->(i2),
 (a)-[:LIKES {stars:5}]->(i3),
 (b)-[:LIKES {stars:1}]->(i1),
 (b)-[:LIKES {stars:3}]->(i2),
 (c)-[:LIKES {stars:4}]->(i3)

2、运行相似度计算

MATCH (i:Item) WITH i ORDER BY id(i) MATCH (p:Person) OPTIONAL MATCH (p)-[r:LIKES]->(i)
WITH {item:id(p), weights: collect(coalesce(r.stars,0))} as userData
WITH collect(userData) as data
CALL algo.similarity.cosine(data, {write:true,showComputations:true,similarityCutoff:0.1}) yield p25, p50, p75, p90, p95, p99, p999, p100, nodes, similarityPairs, computations RETURN *

3、algo.similarity.cosine.stream - Mode.READ

CALL algo.similarity.cosine.stream([{item:id, weights:[weights]}], {similarityCutoff:-1,degreeCutoff:0})
YIELD item1, item2, count1, count2, intersection, similarity - computes cosine distance

三、Pearson相似度 - algo.similarity.pearson - Mode.WRITE

1、创建测试数据

CREATE (a:Person {name:'Alice'})
CREATE (b:Person {name:'Bob'})
CREATE (c:Person {name:'Charlie'})
CREATE (d:Person {name:'Dana'})
CREATE (i1:Item {name:'p1'})
CREATE (i2:Item {name:'p2'})
CREATE (i3:Item {name:'p3'})
CREATE (i4:Item {name:'p4'})
CREATE (a)-[:LIKES {stars:1}]->(i1),
 (a)-[:LIKES {stars:2}]->(i2),
 (a)-[:LIKES {stars:3}]->(i3),
 (a)-[:LIKES {stars:4}]->(i4),
 (b)-[:LIKES {stars:2}]->(i1),
 (b)-[:LIKES {stars:3}]->(i2),
 (b)-[:LIKES {stars:4}]->(i3),
 (b)-[:LIKES {stars:5}]->(i4),
 (c)-[:LIKES {stars:3}]->(i1),
 (c)-[:LIKES {stars:4}]->(i2),
 (c)-[:LIKES {stars:4}]->(i3),
 (c)-[:LIKES {stars:5}]->(i4),
 (d)-[:LIKES {stars:3}]->(i2),
 (d)-[:LIKES {stars:2}]->(i3),
 (d)-[:LIKES {stars:5}]->(i4)

2、运行相似度计算

MATCH (i:Item) WITH i ORDER BY id(i) MATCH (p:Person) OPTIONAL MATCH (p)-[r:LIKES]->(i)
WITH {item:id(p), weights: collect(coalesce(r.stars,0))} as userData
WITH collect(userData) as data
CALL algo.similarity.pearson(data,{similarityCutoff:-0.1}) yield p25, p50, p75, p90, p95, p99, p999, p100, nodes, similarityPairs, computations RETURN *

3、algo.similarity.jaccard.stream - Mode.READ

MATCH (i:Item) WITH i ORDER BY i MATCH (p:Person) OPTIONAL MATCH (p)-[r:LIKES]->(i)
WITH p, i, r ORDER BY id(p), id(i) WITH {item:id(p), weights: collect(coalesce(r.stars,$missingValue))} as userData
WITH collect(userData) as data
call algo.similarity.pearson.stream(data,{similarityCutoff:-0.1}) yield item1, item2, count1, count2, intersection, similarity RETURN item1, item2, count1, count2, intersection, similarity ORDER BY item1,item2

四、欧式距离 - algo.similarity.euclidean - Mode.WRITE

1、创建测试数据

CREATE (a:Person {name:'Alice'})
CREATE (b:Person {name:'Bob'})
CREATE (c:Person {name:'Charlie'})
CREATE (d:Person {name:'Dana'})
CREATE (i1:Item {name:'p1'})
CREATE (i2:Item {name:'p2'})
CREATE (i3:Item {name:'p3'})
CREATE (a)-[:LIKES {stars:1}]->(i1),
 (a)-[:LIKES {stars:2}]->(i2),
 (a)-[:LIKES {stars:5}]->(i3),
 (b)-[:LIKES {stars:1}]->(i1),
 (b)-[:LIKES {stars:3}]->(i2),
 (c)-[:LIKES {stars:4}]->(i3)

2、运行相似度计算

MATCH (i:Item) WITH i ORDER BY id(i) MATCH (p:Person) OPTIONAL MATCH (p)-[r:LIKES]->(i)
WITH {item:id(p), weights: collect(coalesce(r.stars,0))} AS userData
WITH collect(userData) AS data
CALL algo.similarity.euclidean(data, {similarityCutoff:-0.1}) YIELD p25, p50, p75, p90, p95, p99, p999, p100, nodes, similarityPairs, computations RETURN *

3、algo.similarity.jaccard.stream - Mode.READ

MATCH (i:Item) WITH i ORDER BY id(i) MATCH (p:Person) OPTIONAL MATCH (p)-[r:LIKES]->(i)
WITH {item:id(p), weights: collect(coalesce(r.stars,$missingValue))} AS userData
WITH collect(userData) AS data
CALL algo.similarity.euclidean.stream(data,{similarityCutoff:-0.1}) YIELD item1, item2, count1, count2, intersection, similarity RETURN item1, item2, count1, count2, intersection, similarity ORDER BY item1,item2

五、重叠相似度 - algo.similarity.overlap - Mode.WRITE

1、创建测试数据

CREATE (a:Person {name:'Alice'})
CREATE (b:Person {name:'Bob'})
CREATE (c:Person {name:'Charlie'})
CREATE (d:Person {name:'Dana'})
CREATE (i1:Item {name:'p1'})
CREATE (i2:Item {name:'p2'})
CREATE (i3:Item {name:'p3'})
CREATE (a)-[:LIKES]->(i1),
 (a)-[:LIKES]->(i2),
 (a)-[:LIKES]->(i3),
 (b)-[:LIKES]->(i1),
 (b)-[:LIKES]->(i2),
 (c)-[:LIKES]->(i3)

2、运行相似度计算

MATCH (p:Person)-[:LIKES]->(i:Item)
WITH {item:id(p), categories: collect(distinct id(i))} as userData
WITH collect(userData) as data
CALL algo.similarity.overlap(data, {similarityCutoff:-0.1}) yield p25, p50, p75, p90, p95, p99, p999, p100, nodes, similarityPairs, computations RETURN p25, p50, p75, p90, p95, p99, p999, p100, nodes, similarityPairs, computations

3、algo.similarity.jaccard.stream - Mode.READ

MATCH (p:Person)-[:LIKES]->(i:Item)
WITH {item:id(p), categories: collect(distinct id(i))} as userData
WITH collect(userData) as data
call algo.similarity.overlap.stream(data,{similarityCutoff:-0.1}) yield item1, item2, count1, count2, intersection, similarity RETURN item1, item2, count1, count2, intersection, similarity ORDER BY item1,item2

组织机构相似度计算

一、数据模型

1、LABEL

  • LABEL:组织机构簇心
  • LABEL:组织机构
  • LABEL:中文名称/中文简称/中文拼音简称/英文名称/英文简称/成立日期/法人代表/总经理/其它负责人/联系人/电子邮箱/注册地址/办公地址/联系地址/注册所在城市/联系所在城市/注册所在区县/网址/联系电话/注册地址邮编/联系地址邮编/传真/统一社会信用代码/交易代码/所属行业/注册资本

2、RELATIONSHIP

  • 关联别名【中文名称/中文简称/中文拼音简称/英文名称/英文简称】
  • 关联日期【成立日期】
  • 关联人【法人代表/总经理/其它负责人/联系人】
  • 关联邮箱【电子邮箱】
  • 关联地址【注册地址/办公地址/联系地址】
  • 关联城市【注册所在城市/联系所在城市】
  • 关联区县【注册所在区县】
  • 关联网址【网址】
  • 关联电话【联系电话】
  • 关联邮编【注册地址邮编/联系地址邮编】
  • 关联传真【传真】
  • 关联信用代码【统一社会信用代码】
  • 关联交易代码【交易代码】
  • 关联行业【所属行业】
  • 关联资本【注册资本(元)】

弃用数据:货币单位/企业属性/简介/主营业务等纯文本数据与特有性不强的数据

二、创建测试数据

  • 嘉实基金
    MERGE (n0:组织机构 {name:'嘉实基金'}) SET n0:中文名称
    MERGE (n1:组织机构 {name:'嘉实JS'}) SET n1:中文简称
    MERGE (n2:组织机构 {name:'JIA SHI JI JIN'}) SET n2:中文拼音简称
    MERGE (n3:组织机构 {name:'HARVEST FUND'}) SET n3:英文名称
    MERGE (n4:组织机构 {name:'HARVEST Inc.'}) SET n4:英文简称
    MERGE (n5:成立日期 {name:'20180101'})
    MERGE (n6:人物 {name:'赵总'}) SET n6:法人代表
    MERGE (n7:人物 {name:'李总'}) SET n7:总经理
    MERGE (n8:电子邮箱 {name:'jsfund@jsfund.cn'})
    MERGE (n9:城市 {name:'北京'})
    MERGE (n10:区县 {name:'东城区'})
    MERGE (n11:网址 {name:'www.jsfund.cn'})
    MERGE (n12:联系电话 {name:'010-65215000'})
    MERGE (n13:邮编 {name:'10000'}) SET n13:注册地址邮编
    MERGE (n14:邮编 {name:'10001'}) SET n14:联系地址邮编
    MERGE (n15:传真 {name:'5000'})
    MERGE (n16:交易代码 {name:'10901'})
    MERGE (n17:所属行业 {name:'金融业'})
    MERGE (n18:注册资本 {name:'10000000'})
    MERGE (n0)-[:关联别名]->(n1)
    MERGE (n0)-[:关联别名]->(n2)
    MERGE (n0)-[:关联别名]->(n3)
    MERGE (n0)-[:关联别名]->(n4)
    MERGE (n0)-[:关联别名]->(n4)
    MERGE (n0)-[:关联日期]->(n5)
    MERGE (n0)-[:关联人]->(n6)
    MERGE (n0)-[:关联人]->(n7)
    MERGE (n0)-[:关联邮箱]->(n8)
    MERGE (n0)-[:关联城市]->(n9)
    MERGE (n0)-[:关联城市]->(n9)
    MERGE (n0)-[:关联区县]->(n10)
    MERGE (n0)-[:关联网址]->(n11)
    MERGE (n0)-[:关联电话]->(n12)
    MERGE (n0)-[:关联邮编]->(n13)
    MERGE (n0)-[:关联邮编]->(n14)
    MERGE (n0)-[:关联传真]->(n15)
    MERGE (n0)-[:关联交易代码]->(n16)
    MERGE (n0)-[:关联行业]->(n17)
    MERGE (n0)-[:关联资本]->(n18)
    
  • 嘉实
    MERGE (n0:组织机构 {name:'嘉实'}) SET n0:中文名称
    MERGE (n1:组织机构 {name:'嘉fund'}) SET n1:中文简称
    MERGE (n2:组织机构 {name:'JIA SHI'}) SET n2:中文拼音简称
    MERGE (n3:组织机构 {name:'HARVEST FUND'}) SET n3:英文名称
    MERGE (n4:组织机构 {name:'HARVEST Inc.'}) SET n4:英文简称
    MERGE (n5:成立日期 {name:'20180102'})
    MERGE (n6:人物 {name:'学军总'}) SET n6:法人代表
    MERGE (n7:人物 {name:'丹总'}) SET n7:总经理
    MERGE (n8:电子邮箱 {name:'jsfund@jsfund.cn'})
    MERGE (n9:城市 {name:'北京'})
    MERGE (n10:区县 {name:'东城区'})
    MERGE (n11:网址 {name:'www.jsfund.cn'})
    MERGE (n12:联系电话 {name:'010-65215000'})
    MERGE (n13:邮编 {name:'10001'}) SET n13:注册地址邮编
    MERGE (n14:邮编 {name:'10001'}) SET n14:联系地址邮编
    MERGE (n15:传真 {name:'5000'})
    MERGE (n16:交易代码 {name:'10901'})
    MERGE (n17:所属行业 {name:'基金业'})
    MERGE (n18:注册资本 {name:'10002000'})
    MERGE (n0)-[:关联别名]->(n1)
    MERGE (n0)-[:关联别名]->(n2)
    MERGE (n0)-[:关联别名]->(n3)
    MERGE (n0)-[:关联别名]->(n4)
    MERGE (n0)-[:关联别名]->(n4)
    MERGE (n0)-[:关联日期]->(n5)
    MERGE (n0)-[:关联人]->(n6)
    MERGE (n0)-[:关联人]->(n7)
    MERGE (n0)-[:关联邮箱]->(n8)
    MERGE (n0)-[:关联城市]->(n9)
    MERGE (n0)-[:关联城市]->(n9)
    MERGE (n0)-[:关联区县]->(n10)
    MERGE (n0)-[:关联网址]->(n11)
    MERGE (n0)-[:关联电话]->(n12)
    MERGE (n0)-[:关联邮编]->(n13)
    MERGE (n0)-[:关联邮编]->(n14)
    MERGE (n0)-[:关联传真]->(n15)
    MERGE (n0)-[:关联交易代码]->(n16)
    MERGE (n0)-[:关联行业]->(n17)
    MERGE (n0)-[:关联资本]->(n18)
    
  • 嘉实远见
    MERGE (n0:组织机构 {name:'嘉实远见'}) SET n0:中文名称
    MERGE (n1:组织机构 {name:'远见'}) SET n1:中文简称
    MERGE (n2:组织机构 {name:'JIA SHI YUAN JIAN'}) SET n2:中文拼音简称
    MERGE (n3:组织机构 {name:'HARVEST YJ'}) SET n3:英文名称
    MERGE (n4:组织机构 {name:'HARVEST YJ Inc.'}) SET n4:英文简称
    MERGE (n5:成立日期 {name:'20190102'})
    MERGE (n6:人物 {name:'M总'}) SET n6:法人代表
    MERGE (n7:人物 {name:'K总'}) SET n7:总经理
    MERGE (n8:电子邮箱 {name:'jsfundyj@jsfund.cn'})
    MERGE (n9:城市 {name:'北京'})
    MERGE (n10:区县 {name:'东城区'})
    MERGE (n11:网址 {name:'www.jsfundyj.cn'})
    MERGE (n12:联系电话 {name:'010-65315002'})
    MERGE (n13:邮编 {name:'10003'}) SET n13:注册地址邮编
    MERGE (n14:邮编 {name:'10002'}) SET n14:联系地址邮编
    MERGE (n15:传真 {name:'5001'})
    MERGE (n16:交易代码 {name:'10501'})
    MERGE (n17:所属行业 {name:'金融科技'})
    MERGE (n18:注册资本 {name:'5002000'})
    MERGE (n0)-[:关联别名]->(n1)
    MERGE (n0)-[:关联别名]->(n2)
    MERGE (n0)-[:关联别名]->(n3)
    MERGE (n0)-[:关联别名]->(n4)
    MERGE (n0)-[:关联别名]->(n4)
    MERGE (n0)-[:关联日期]->(n5)
    MERGE (n0)-[:关联人]->(n6)
    MERGE (n0)-[:关联人]->(n7)
    MERGE (n0)-[:关联邮箱]->(n8)
    MERGE (n0)-[:关联城市]->(n9)
    MERGE (n0)-[:关联城市]->(n9)
    MERGE (n0)-[:关联区县]->(n10)
    MERGE (n0)-[:关联网址]->(n11)
    MERGE (n0)-[:关联电话]->(n12)
    MERGE (n0)-[:关联邮编]->(n13)
    MERGE (n0)-[:关联邮编]->(n14)
    MERGE (n0)-[:关联传真]->(n15)
    MERGE (n0)-[:关联交易代码]->(n16)
    MERGE (n0)-[:关联行业]->(n17)
    MERGE (n0)-[:关联资本]->(n18)
    

    相似性计算2

三、计算组织机构相似性

不支持并发计算支持写入

MATCH (p:`组织机构`:`中文名称`)-[]->(i)
WITH {item:id(p), categories: collect(distinct id(i))} as orgData
WITH collect(orgData) as data
CALL algo.similarity.jaccard(data, {write:false,showComputations:true,similarityCutoff:0.1}) yield p25, p50, p75, p90, p95, p99, p999, p100, nodes, similarityPairs, computations RETURN *

四、计算组织机构相似性并生成相似关系

并发计算,不支持写入

MATCH (p:`组织机构`:`中文名称`)-[]->(i)
WITH {item:id(p), categories: collect(distinct id(i))} as orgData
WITH collect(orgData) as data
call algo.similarity.jaccard.stream(data,{topK:4,concurrency:4,similarityCutoff:-0.1}) yield item1, item2, count1, count2, intersection, similarity RETURN * ORDER BY item1,item2
  • 14860811 嘉实基金 14860812 嘉实
  • 14860811 嘉实基金 14860838 嘉实远见
  • 14860812 嘉实 14860838 嘉实远见

相似度计算结果1

备注

neo4j-graph-algorithms包相似度计算源码位置: algo/src/main/java/org/neo4j/graphalgo/similarity