FREYA | WP2 User Story 7: As a data center, I want to see the citations of publications that use my repository for the underlying data, so that I can demonstrate the impact of our repository. | |
---|---|---|
It is important for repositories of scientific data to monitor and report on the impact of the data they store. One useful proxy of that impact is secondary citations, i.e. citations of publications which use the deposited data. This notebook visualises these citations by means of a force-directed graph.
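This notebook uses the DataCite GraphQL API to retrieve the citations of the datasets whose DOIs are listed in the query variables below.

Goal: By the end of this notebook, for a given list of datasets, you should be able to display the citation count of each dataset and an interactive force-directed graph of the datasets and their citations.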
%%capture
# Install required Python packages
!pip install gql requests pyvis jsonpickle
# Prepare the GraphQL client
import requests
from IPython.display import display, Markdown
from gql import gql, Client
from gql.transport.requests import RequestsHTTPTransport
_transport = RequestsHTTPTransport(
    url='https://api.datacite.org/graphql',
    use_json=True,
)
client = Client(
    transport=_transport,
    fetch_schema_from_transport=True,
)
Define the GraphQL query to retrieve the citations of the given datasets:
# Generate the GraphQL query to retrieve the citations of the datasets with the DOIs listed in "ids"
query_params = {
    "ids" : ["10.5061/dryad.234","10.15468/n6ftyd","10.1594/pangaea.314690"]
}
query = gql("""query getDatasetCitations($ids: [String!]) {
  datasets(ids: $ids) {
    nodes {
      id
      titles {
        title
      }
      citationCount
      citations {
        nodes {
          id
          publisher
          titles {
            title
          }
          citationCount
        }
      }
    }
  }
}
""")
Run the above query via the GraphQL client
# Pass the query variables as a plain dict; the client serialises them to JSON itself
data = client.execute(query, variable_values=query_params)
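The result held in `data` is a plain nested dictionary that mirrors the shape of the query. The sketch below walks such a dictionary without any network access. It uses a made-up `sample_data` value (the DOIs, titles and counts are placeholders, not real API output) and a hypothetical helper named `summarise`.

```python
# Hypothetical sample shaped like the query result; every DOI, title and
# count here is a placeholder, not real API output.
sample_data = {
    "datasets": {
        "nodes": [
            {
                "id": "https://doi.org/10.1234/example-dataset",
                "titles": [{"title": "Example dataset"}],
                "citationCount": 2,
                "citations": {
                    "nodes": [
                        {
                            "id": "https://doi.org/10.1234/paper-a",
                            "publisher": "Example Publisher",
                            "titles": [{"title": "Paper A"}],
                            "citationCount": 5,
                        },
                        {
                            "id": "https://doi.org/10.1234/paper-b",
                            "publisher": "Example Publisher",
                            "titles": [{"title": "Paper B"}],
                            "citationCount": 0,
                        },
                    ]
                },
            }
        ]
    }
}


def summarise(data):
    """Return (dataset id, citationCount, number of citation nodes returned)."""
    rows = []
    for dataset in data["datasets"]["nodes"]:
        returned = len(dataset["citations"]["nodes"])
        rows.append((dataset["id"], dataset["citationCount"], returned))
    return rows


print(summarise(sample_data))
```

Comparing `citationCount` with the number of citation nodes actually returned is a cheap way to notice when a response contains only part of a dataset's citation list, should that ever happen.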
# Get total citation counts for each dataset in the query
datasets = data['datasets']
tableBody=""
for dataset in datasets['nodes']:
    dataset_id = dataset['id']
    # Drop the resolver prefix ("https:", "", "doi.org") to obtain the bare DOI
    doi = "/".join(dataset_id.split("/")[3:])
    titles = []
    for title in dataset['titles']:
        titles.append(title['title'])
    citationCount = dataset['citationCount']
    tableBody += "[%s](%s) | [**%s**](%s/%s)\n" % (', '.join(titles), dataset_id, citationCount, "https://search.datacite.org/works",doi)
if tableBody:
    display(Markdown("| Dataset | Citation Count|\n|---|---|\n%s" % tableBody))
Dataset | Citation Count |
---|---|
Effects of varying food-availability on ecology and distribution of smallest benthic organisms in sediments of the arctic Fram Strait during POLARSTERN cruise ARK-XV/2, supplement to: Schewe, Ingo; Soltwedel, Thomas (2003): Benthic response to ice-edge-induced particle flux in the Arctic Ocean. Polar Biology, 26(9), 610-620 | 295 |
Data from: Towards a worldwide wood economics spectrum | 44 |
rmca-albertine-rift-cichlids | 206 |
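The table cell derives each DOI by splitting the `id` on `/` and dropping the first three parts, which implies ids of the form `https://doi.org/<DOI>`. A slightly more defensive variant is sketched below. The names `DOI_RESOLVER`, `doi_from_id` and `table_row` are hypothetical, and the resolver prefix is an assumption inferred from that split, not something the notebook states.

```python
DOI_RESOLVER = "https://doi.org/"  # assumed prefix of every `id` value


def doi_from_id(identifier):
    """Strip the resolver prefix; return the value unchanged if it is absent."""
    if identifier.startswith(DOI_RESOLVER):
        return identifier[len(DOI_RESOLVER):]
    return identifier


def table_row(titles, identifier, citation_count):
    """Build one Markdown table row in the same layout as the cell above."""
    doi = doi_from_id(identifier)
    return "[%s](%s) | [**%s**](https://search.datacite.org/works/%s)\n" % (
        ", ".join(titles), identifier, citation_count, doi)


print(doi_from_id("https://doi.org/10.5061/dryad.234"))
```

Unlike the positional split, this version leaves an identifier that is already a bare DOI untouched instead of truncating it.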
Plot an interactive force-directed graph connecting the datasets to their citations (first-degree). The citations of those citations (second-degree) are represented by the size of each citation node, which grows with its citation count.
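The plotting cell that follows first builds two intermediate structures: a map from each (dataset, citation) pair to the citation's own citation count, and a map from each DOI to the set of DOIs it is connected to. The sketch below isolates that step so it can be run without `pyvis` or network access. The function name `build_edges` and the `sample` response are hypothetical, and the sample DOIs are placeholders.

```python
from collections import defaultdict


def build_edges(data):
    """Return ({(dataset_doi, citation_doi): citation_count}, {doi: neighbours})."""
    def doi(identifier):
        # Same rule as the notebook: drop "https:", "" and "doi.org"
        return "/".join(identifier.split("/")[3:])

    edge_counts = {}
    neighbours = defaultdict(set)
    for dataset in data["datasets"]["nodes"]:
        dataset_doi = doi(dataset["id"])
        for citation in dataset["citations"]["nodes"]:
            citation_doi = doi(citation["id"])
            edge_counts[(dataset_doi, citation_doi)] = citation["citationCount"]
            # Record the link in both directions for the hover text
            neighbours[dataset_doi].add(citation_doi)
            neighbours[citation_doi].add(dataset_doi)
    return edge_counts, dict(neighbours)


# Hypothetical sample: two datasets that share one citing paper.
sample = {"datasets": {"nodes": [
    {"id": "https://doi.org/10.1234/ds-1", "citations": {"nodes": [
        {"id": "https://doi.org/10.1234/paper-a", "citationCount": 3},
        {"id": "https://doi.org/10.1234/paper-b", "citationCount": 0}]}},
    {"id": "https://doi.org/10.1234/ds-2", "citations": {"nodes": [
        {"id": "https://doi.org/10.1234/paper-a", "citationCount": 3}]}},
]}}

edge_counts, neighbours = build_edges(sample)
print(len(edge_counts), sorted(neighbours["10.1234/paper-a"]))
```

A paper that cites several of the datasets ends up with several neighbours, which is what pulls those datasets together in the force-directed layout.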
from pyvis.network import Network
import pandas as pd
from IPython.display import IFrame
import math
# Colour swatch for the network nodes
dataset_node_colour = "#FB8072"
citation_node_colour = "#80B1D3"
got_net = Network(height="750px", width="100%", bgcolor="#ffffff", font_color="black", notebook=True)
got_net.options.edges.inherit_colors(False)
# set the physics layout of the network
got_net.barnes_hut()
# ------------------------------
# Initialise intermediate data structure to store: (src, trg) -> citation count of the target, where:
# src - dataset or citation; trg - citation
srcTrg2Count = {}
# Initialise intermediate data structure to store: src --> Set of connected trg's
# Note that the number of connected trg's will determine the size of each src node
src2OtherTrgs = {}
datasets = data['datasets']
# Populate srcTrg2Count
allNodes = set()
for node in datasets['nodes']:
    nodeSet = set()
    datasetDOI = "/".join(node['id'].split("/")[3:])
    nodeSet.add(datasetDOI)
    for citation in node['citations']['nodes']:
        citationDOI = "/".join(citation['id'].split("/")[3:])
        citationCount = citation['citationCount']
        nodeSet.add(citationDOI)
        if datasetDOI not in src2OtherTrgs:
            src2OtherTrgs[datasetDOI] = set()
        src2OtherTrgs[datasetDOI].add(citationDOI)
        if citationDOI not in src2OtherTrgs:
            src2OtherTrgs[citationDOI] = set()
        src2OtherTrgs[citationDOI].add(datasetDOI)
        srcTrg2Count[(datasetDOI, citationDOI)] = citationCount
    nodes = sorted(list(nodeSet))
    allNodes.update(nodes)
# Populate data structures needed for the graph
sources, targets, weights = [], [], []
for pair in srcTrg2Count:
    if srcTrg2Count[pair] >= 0:
        sources.append(pair[0])
        targets.append(pair[1])
        weights.append(srcTrg2Count[pair])
edge_data = zip(sources, targets, weights)
for e in edge_data:
    src = e[0]
    dst = e[1]
    w = e[2]
    src_node_size = 5 * math.log2(len(src2OtherTrgs[src]) * 5000)
    got_net.add_node(src, src, title="Dataset: %s;" % src, color=dataset_node_colour, size=src_node_size)
    # We're adding 1 below to make edges representing 0 citations of the target appear in the force-directed graph
    dst_node_size = 10 * math.log2((w+1) * 10)
    got_net.add_node(dst, dst, title="Citation: %s; Number of citations: %d;" % (dst, w), color=citation_node_colour, size=dst_node_size)
    got_net.add_edge(src, dst, value=1)
neighbor_map = got_net.get_adj_list()
# add neighbor data to node hover data
for node in got_net.nodes:
    node["title"] += " Neighbours:<br>" + "<br>".join(neighbor_map[node["id"]])
got_net.show("out.html")
display(Markdown("N.B. Click on the plot, then use down/up mouse scroll to zoom in/out respectively.<br>When zoomed in, you will notice the DOI label against each node.<br>Click on any node to see the list of 'neighbour' citations, and on the citation node to also see the number of its citations."))
IFrame(src="./out.html", width=1000, height=800)
N.B. Click on the plot, then use down/up mouse scroll to zoom in/out respectively.
When zoomed in, you will notice the DOI label against each node.
Click on any node to see the list of 'neighbour' citations, and on the citation node to also see the number of its citations.
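Node sizes in the plot are logarithmic: a dataset node is sized by its number of neighbours and a citation node by its own citation count. The sketch below reproduces the two formulas from the plotting cell as standalone functions so their effect can be inspected. The function names `dataset_node_size` and `citation_node_size` are hypothetical; the constants are copied from the notebook.

```python
import math


def dataset_node_size(n_neighbours):
    """Size of a dataset node, as in the plotting cell: 5 * log2(n * 5000)."""
    return 5 * math.log2(n_neighbours * 5000)


def citation_node_size(citation_count):
    """Size of a citation node: 10 * log2((w + 1) * 10).

    The +1 keeps the logarithm defined for citations that have
    themselves never been cited (w = 0).
    """
    return 10 * math.log2((citation_count + 1) * 10)


# Print how the citation node size grows with the citation count
for count in (0, 1, 10, 100, 1000):
    print(count, round(citation_node_size(count), 1))
```

Because of the logarithm, a citation that has been cited 1,000 times is drawn about four times as large as one that has never been cited, which keeps heavily cited papers from dominating the plot.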