AMiner
AMiner is a big-data platform for mining and serving science and technology intelligence, built by a team led by Tang Jie, professor in the Department of Computer Science and Technology at Tsinghua University. It is a new-generation analytics and mining platform with fully independent intellectual property.
Academic Social Network dataset
Dataset: https://www.aminer.cn/aminernetwork
The dataset covers paper metadata, paper citations, author profiles, and author collaborations. 2,092,356 papers and the 8,024,869 citations between them are stored in AMiner-Paper.rar; 1,712,433 authors are stored in AMiner-Author.zip; and 4,258,615 collaboration relations are stored in AMiner-Coauthor.zip.
| FileName | Node | Number | Size |
| --- | --- | --- | --- |
| AMiner-Paper.rar | Paper / Citation | 2,092,356 / 8,024,869 | 509 MB |
| AMiner-Author.zip | Author | 1,712,433 | 167 MB |
| AMiner-Coauthor.zip | Collaboration | 4,258,615 | 31.5 MB |
Together with the supplementary data, there are four dataset files in total.
Converting and linking the data into triples
After downloading the four dataset files to a local directory, a Python script reads, processes, and links them to generate entity CSV and relation CSV files.
Script code: https://github.com/xyjigsaw/Aminer2KG
The script generates the following parts:
- author2csv.py includes
- e_author.csv: author entity
  - e_affiliation.csv: affiliation entity
- e_concept.csv: concept entity
- r_author2affiliation.csv: relation between author and affiliation
- r_author2concept.csv: relation between author and concept
- author2paper2csv.py includes
- r_author2paper.csv: relation between author and paper
- paper2csv.py includes
- e_paper.csv: paper entity
- e_venue.csv: venue entity
- r_paper2venue.csv: relation between paper and venue
- r_citation.csv: relation between papers
- r_coauthor.csv: relation between authors
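The raw AMiner-Author file uses a line-prefixed block format, with tags such as `#index`, `#n` (name), `#a` (affiliations), and `#t` (research interests) documented on the download page. A minimal sketch of the kind of parsing a script like author2csv.py has to do (the actual script in the repo may differ):

```python
def parse_authors(text):
    """Parse AMiner-Author blocks (#index/#n/#a/#t tags) into dicts.

    Field tags follow the format documented on the AMiner download page;
    counters such as #pc/#cn/#hi are skipped for brevity.
    """
    authors, cur = [], None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("#index"):      # a new author block starts here
            if cur:
                authors.append(cur)
            cur = {"authorID": line[7:].strip()}
        elif line.startswith("#n"):        # author name
            cur["authorName"] = line[3:].strip()
        elif line.startswith("#a"):        # semicolon-separated affiliations
            cur["affiliations"] = [a.strip() for a in line[3:].split(";") if a.strip()]
        elif line.startswith("#t"):        # semicolon-separated research interests
            cur["concepts"] = [t.strip() for t in line[3:].split(";") if t.strip()]
    if cur:
        authors.append(cur)
    return authors
```

From dicts like these, writing e_author.csv and the r_author2affiliation.csv / r_author2concept.csv relation files is a matter of assigning IDs to the affiliation and concept strings and emitting one row per pair.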
Summary:
| File | Type | Name | Count | Size |
| --- | --- | --- | --- | --- |
| e_author.csv | Entity | Author | 1,712,432 | 70M |
| e_affiliation.csv | Entity | Affiliation | 624,750 | 54M |
| e_concept.csv | Entity | Concept | 4,055,686 | 131M |
| e_paper.csv | Entity | Paper | 2,092,355 | 1,495M |
| e_venue.csv | Entity | Venue | 264,839 | 19M |
| r_author2affiliation.csv | Relation | Author-Affiliation | 1,287,287 | 28M |
| r_author2concept.csv | Relation | Author-Concept | 14,589,981 | 339M |
| r_author2paper.csv | Relation | Author-Paper | 5,192,998 | 108M |
| r_citation.csv | Relation | Citation | 8,024,873 | 167M |
| r_coauthor.csv | Relation | Coauthor | 4,258,946 | 120M |
| r_paper2venue.csv | Relation | Paper-Venue | 2,092,355 | 45M |
That gives five entity types and six relation types in total.
At this point, the triple data for the AMiner academic social network knowledge graph has been generated.
Importing into Neo4j
Place the eleven CSV files above into the Neo4j database's import folder.
Execute the following Cypher statements one by one in the Neo4j Desktop console.
They cover entity-node import, index creation on entity IDs, relation import, and index creation on name properties.
```cypher
USING PERIODIC COMMIT 5000
LOAD CSV WITH HEADERS FROM "file:///e_author.csv" AS line
CREATE (:AUTHOR {authorID:line.authorID, authorName:line.authorName, pc:line.pc, cn:line.cn, hi:line.hi, pi:line.pi, upi:line.upi});

USING PERIODIC COMMIT 5000
LOAD CSV WITH HEADERS FROM "file:///e_affiliation.csv" AS line
CREATE (:AFFILIATION {affiliationID:line.affiliationID, affiliationName:line.affiliationName});

USING PERIODIC COMMIT 5000
LOAD CSV WITH HEADERS FROM "file:///e_concept.csv" AS line
CREATE (:CONCEPT {conceptID:line.conceptID, conceptName:line.conceptName});

USING PERIODIC COMMIT 5000
LOAD CSV WITH HEADERS FROM "file:///e_paper.csv" AS line
CREATE (:PAPER {paperID:line.paperID, paperTitle:line.title, paperYear:line.year, paperAbstract:line.abstract});

USING PERIODIC COMMIT 5000
LOAD CSV WITH HEADERS FROM "file:///e_venue.csv" AS line
CREATE (:VENUE {venueID:line.venueID, venueName:line.name});

CREATE INDEX ON :AUTHOR(authorID);
CREATE INDEX ON :AFFILIATION(affiliationID);
CREATE INDEX ON :CONCEPT(conceptID);
CREATE INDEX ON :PAPER(paperID);
CREATE INDEX ON :VENUE(venueID);

USING PERIODIC COMMIT 5000
LOAD CSV WITH HEADERS FROM "file:///r_author2affiliation.csv" AS line
MATCH (from:AUTHOR {authorID:line.START_ID}), (to:AFFILIATION {affiliationID:line.END_ID})
MERGE (from)-[:AUTHOR2AFFILIATION {type:line.TYPE}]->(to);

USING PERIODIC COMMIT 10000
LOAD CSV WITH HEADERS FROM "file:///r_author2concept.csv" AS line
MATCH (from:AUTHOR {authorID:line.START_ID}), (to:CONCEPT {conceptID:line.END_ID})
MERGE (from)-[:AUTHOR2CONCEPT {type:line.TYPE}]->(to);

USING PERIODIC COMMIT 5000
LOAD CSV WITH HEADERS FROM "file:///r_author2paper.csv" AS line
MATCH (from:AUTHOR {authorID:line.START_ID}), (to:PAPER {paperID:line.END_ID})
MERGE (from)-[:AUTHOR2PAPER {type:line.TYPE, author_pos:line.author_position}]->(to);

USING PERIODIC COMMIT 5000
LOAD CSV WITH HEADERS FROM "file:///r_citation.csv" AS line
MATCH (from:PAPER {paperID:line.START_ID}), (to:PAPER {paperID:line.END_ID})
MERGE (from)-[:CITATION {type:line.TYPE}]->(to);

USING PERIODIC COMMIT 5000
LOAD CSV WITH HEADERS FROM "file:///r_coauthor.csv" AS line
MATCH (from:AUTHOR {authorID:line.START_ID}), (to:AUTHOR {authorID:line.END_ID})
MERGE (from)-[:COAUTHOR {type:line.TYPE, n_cooperation:line.n_cooperation}]->(to);

USING PERIODIC COMMIT 5000
LOAD CSV WITH HEADERS FROM "file:///r_paper2venue.csv" AS line
MATCH (from:PAPER {paperID:line.START_ID}), (to:VENUE {venueID:line.END_ID})
MERGE (from)-[:PAPER2VENUE {type:line.TYPE}]->(to);

CREATE INDEX ON :AUTHOR(authorName);
CREATE INDEX ON :AFFILIATION(affiliationName);
CREATE INDEX ON :CONCEPT(conceptName);
CREATE INDEX ON :PAPER(paperTitle);
CREATE INDEX ON :VENUE(venueName);
```
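One detail worth stressing for the relation imports: LOAD CSV binds columns by header name, so each relation CSV must carry exactly the column names the Cypher references (line.START_ID, line.END_ID, line.TYPE). If a header differs, LOAD CSV yields null for that column, the MATCH finds nothing, and the import silently reports no changes. A minimal sketch of the expected file shape (the IDs are made up for illustration):

```python
import csv
import io

# Write one relation row with exactly the headers the Cypher expects.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["START_ID", "END_ID", "TYPE"])
writer.writeheader()
writer.writerow({"START_ID": "p100", "END_ID": "v200", "TYPE": "paper2venue"})  # hypothetical IDs

# Reading it back by header name mirrors what LOAD CSV WITH HEADERS does.
buf.seek(0)
rows = list(csv.DictReader(buf))
```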
Preview:
Knowledge graph embedding
This part trains embeddings from the tens of millions of triples above, and PyTorch-BigGraph (PBG) offers a satisfying solution. PBG is a distributed system for large-scale graph embedding that can handle graphs with up to billions of entities and trillions of edges. Graph partitioning, distributed multi-threaded execution, and batched negative sampling give PBG the capacity to process very large graphs.
To verify that PBG embeds the scholar data effectively, the original data was split into training, test, and validation sets at a 98:1:1 ratio (hence the rel9811 directory) and fed to PBG for training and evaluation.
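The split itself can be sketched as a shuffle plus slicing (the function name and seed here are my own, not from the repo):

```python
import random

def split_triples(triples, ratios=(0.98, 0.01, 0.01), seed=42):
    """Shuffle (head, relation, tail) triples and slice them into
    train/valid/test partitions according to the given ratios."""
    rng = random.Random(seed)          # fixed seed for a reproducible split
    shuffled = triples[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = int(n * ratios[0])
    n_valid = int(n * ratios[1])
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]  # remainder goes to test
    return train, valid, test
```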
Training commands:
```shell
torchbiggraph_import_from_tsv --lhs-col=0 --rel-col=1 --rhs-col=2 \
    new_config.py rel9811/train.txt rel9811/valid.txt rel9811/test.txt

torchbiggraph_train new_config.py -p edge_paths=rel9811/train_p

torchbiggraph_eval new_config.py -p edge_paths=rel9811/test_p \
    -p relations.0.all_negs=true -p num_uniform_negs=0

torchbiggraph_export_to_tsv new_config.py \
    --entities-output entity_embeddings.tsv \
    --relation-types-output relation_types_parameters.tsv
```
Training parameters:
| Name | Meaning | Value |
| --- | --- | --- |
| num_epochs | number of training epochs | 20 |
| num_uniform_negs | number of uniformly sampled negatives | 500 |
| num_batch_negs | number of in-batch negatives | 500 |
| batch_size | training batch size | 10000 |
| loss_fn | loss function | softmax |
| lr | learning rate | 0.1 |
| num_partitions | number of partitions | 1 |
| dimension | embedding dimension | 50 |
| operator | embedding operator | TransE |
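These parameters live in new_config.py, the config module passed to every command above. A plausible shape for it, mapping the table onto PBG's config schema, is sketched below; the paths and the single entity/relation layout are assumptions, and the real config in the Aminer2KG repo may differ (PBG calls the TransE-style operator "translation"):

```python
def get_torchbiggraph_config():
    # PBG reads the config by calling this function from the module.
    return dict(
        # I/O paths (assumed for illustration)
        entity_path="rel9811",
        edge_paths=["rel9811/train_p"],
        checkpoint_path="model/aminer",
        # one entity type and one relation type covering all triples
        entities={"all": {"num_partitions": 1}},
        relations=[{"name": "all_edges", "lhs": "all", "rhs": "all",
                    "operator": "translation"}],  # TransE-style operator
        # hyper-parameters from the table above
        dimension=50,
        num_epochs=20,
        num_uniform_negs=500,
        num_batch_negs=500,
        batch_size=10000,
        loss_fn="softmax",
        lr=0.1,
    )
```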
PBG runs its distributed computation on CPUs, so no GPU is needed. The experiment was therefore run on a multi-core server with the following configuration: Xeon(R) E5-2630 v3 @ 2.40GHz processor, 256 GB DDR4 memory. Embedding and evaluation completed within three hours, with the following results:
| Name | Meaning | Result |
| --- | --- | --- |
| Hits@1 | rate at which the true entity is ranked first | 0.6702 |
| Hits@10 | rate at which it ranks within the top 10 | 0.8179 |
| Hits@50 | rate at which it ranks within the top 50 | 0.8884 |
| MRR | mean reciprocal rank | 0.7243 |
| AUC | area under the ROC curve | 0.9674 |
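For reference, Hits@k and MRR follow directly from the 1-based rank that the true entity receives in each evaluation query. A small sketch of the computation (not PBG's internal code):

```python
def ranking_metrics(ranks):
    """Compute Hits@k and MRR from the 1-based rank of the true entity
    in each evaluation query."""
    n = len(ranks)

    def hits_at(k):
        # fraction of queries where the true entity ranks within the top k
        return sum(1 for r in ranks if r <= k) / n

    # mean of the reciprocal ranks
    mrr = sum(1.0 / r for r in ranks) / n
    return {"hits@1": hits_at(1), "hits@10": hits_at(10),
            "hits@50": hits_at(50), "mrr": mrr}
```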
This completes the embedding of the triples.
Project code: https://github.com/xyjigsaw/Aminer2KG
More content at OmegaXYZ.
All code on this site is licensed under Apache 2.0.
Articles on this site are licensed under CC BY-NC-SA 4.0.
© 2020 OmegaXYZ. All rights reserved. Please credit the source when reposting.
Hi, did you manage to solve this problem?
Hi, could you send me a copy of the knowledge graph embedding results? I don't think my laptop can handle the training, haha. Email: 2721599586@qq.com
About this AMiner article: I entered the code line by line but no knowledge graph came out. How did you produce the final preview graph in Neo4j?
You need to run the scripts first to generate the CSV files and place them in the import folder. What error are you getting?
I followed the steps, but no knowledge graph formed at the end; clicking a label only shows individual nodes. Is entering the Cypher statements enough to produce the graph, or do I also need to run a query?
USING PERIODIC COMMIT 5000
LOAD CSV WITH HEADERS FROM "file:///r_paper2venue.csv" AS line
MATCH (FROM:PAPER{paperID:line.START_ID}), (TO:VENUE{venueID:line.END_ID})
MERGE (FROM)-[PAPER2VENUE: PAPER2VENUE{type:line.TYPE}]->(TO)
After entering these statements, the output pane shows (no changes, no records); I wonder whether that's where the problem is. Thanks!
The graph appears directly; no extra query is needed. Try clicking a relationship, and check whether the r_*.csv files actually contain data.
It seems no relationships were created, although the CSV files do contain data.
I get the following warning:
This query builds a cartesian product between disconnected patterns.
If a part of a query contains multiple disconnected patterns, this will build a cartesian product between all those parts. This may produce a large amount of data and slow down query processing. While occasionally intended, it may often be possible to reformulate the query that avoids the use of this cross product, perhaps by adding a relationship between the different parts or by using OPTIONAL MATCH (identifier is: (to))
match (from:AUTHOR{authorID:row.START_ID}),to:AFFILIATION{AFFILIATIONID:row.END_ID})
I tried it and it works fine on my end.
Great, thanks! This article helped me a lot. One last question: which Neo4j version are you using? (I just realized we went to the same undergrad school! 0.0)
What a coincidence! Feel free to add me on QQ: 644327005
I have the same problem as you: clicking the related keys under Property Keys shows no data.
USING PERIODIC COMMIT 5000
This statement seems to cause problems on a Linux server; prefixing the query with :auto fixes it. Like the commenter above, I can't get the relationships created.
Problem solved! It turned out the column headers in my CSV files didn't match the Cypher code!
Help! I've gone through the headers in both the code and the CSV files, but the relation import still keeps failing.