FREYA | WP2 User Story 6: As a researcher, I am looking for more information about another researcher with a common name, but don’t know his/her ORCID ID. |
---|---|
It is important to be able to locate a researcher of interest even when their ORCID ID is unknown. For example, a reader of a scientific publication may wish to find out more about one of the authors, but the publisher has not linked that author's name to an ORCID ID.
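This notebook uses the DataCite GraphQL API to disambiguate a researcher name via a funnel approach: first retrieve all researchers matching the name together with their affiliations, then choose one affiliation and repeat the search restricted to it, this time also retrieving each researcher's works.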
Goal: By the end of this notebook, you should be able to successfully disambiguate a researcher name of interest.
%%capture
# Install required Python packages
!pip install gql requests
# Prepare the GraphQL client
import requests
from IPython.display import display, Markdown
from gql import gql, Client
from gql.transport.requests import RequestsHTTPTransport
_transport = RequestsHTTPTransport(
    url='https://api.datacite.org/graphql',
    use_json=True,
)
client = Client(
    transport=_transport,
    fetch_schema_from_transport=True,
)
Define the GraphQL query to find researchers whose name matches "John AND Smith", together with their affiliations:
# Generate the GraphQL query to retrieve up to 100 researchers matching query "John and Smith"
query_params = {
    "query" : "John AND Smith",
    "max_researchers" : 100,
    "query_end_cursor" : ""
}
query_str = """query getResearchersByName(
    $query: String!,
    $max_researchers: Int!,
    $query_end_cursor : String!
    )
{
  people(query: $query, first: $max_researchers, after: $query_end_cursor) {
    totalCount
    pageInfo {
      hasNextPage
      endCursor
    }
    nodes {
      id
      givenName
      familyName
      name
      affiliation {
        name
      }
    }
  }
}
"""
Run the above query via the GraphQL client
# Parse the query once; it is reused for every page
query = gql(query_str)
# Initialise overall data dict that will store results
data = {}
# Keep retrieving results until there are no more results left
while True:
    # Pass the variables as a dict
    res = client.execute(query, variable_values=query_params)
    people = res["people"]
    if "people" not in data:
        # First page: keep the whole response (totalCount, pageInfo, nodes)
        data = res
    else:
        # Later pages: append the new nodes to those already collected
        data["people"]["nodes"].extend(people["nodes"])
    pageInfo = people["pageInfo"]
    # Advance the cursor, or stop once the last page has been read
    if pageInfo["hasNextPage"] and pageInfo["endCursor"] is not None:
        query_params["query_end_cursor"] = pageInfo["endCursor"]
    else:
        break
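The paging loop above can also be written as a small reusable helper. The following is a minimal sketch, not part of the original notebook: `fetch_all_people`, its `max_pages` safety cap, and the `fake_execute` stand-in with its made-up two-page data are all hypothetical names introduced for illustration. The helper only assumes that the response has the `people { pageInfo { hasNextPage endCursor } nodes }` shape requested by the query.

```python
from typing import Any, Callable, Dict, List


def fetch_all_people(
    execute: Callable[[Dict[str, Any]], Dict[str, Any]],
    params: Dict[str, Any],
    max_pages: int = 50,
) -> List[Dict[str, Any]]:
    """Follow endCursor until hasNextPage is false; return all people nodes."""
    params = dict(params)  # work on a copy so the caller's dict is untouched
    nodes: List[Dict[str, Any]] = []
    for _ in range(max_pages):  # hard cap guards against an endless loop
        people = execute(params)["people"]
        nodes.extend(people["nodes"])
        page_info = people["pageInfo"]
        if not page_info["hasNextPage"] or page_info["endCursor"] is None:
            break
        params["query_end_cursor"] = page_info["endCursor"]
    return nodes


# Stand-in for the real API: two pages keyed by cursor (made-up data).
_PAGES = {
    "": {
        "nodes": [{"name": "A"}, {"name": "B"}],
        "pageInfo": {"hasNextPage": True, "endCursor": "c1"},
    },
    "c1": {
        "nodes": [{"name": "C"}],
        "pageInfo": {"hasNextPage": False, "endCursor": None},
    },
}


def fake_execute(variables: Dict[str, Any]) -> Dict[str, Any]:
    """Return the page that corresponds to the requested cursor."""
    return {"people": _PAGES[variables["query_end_cursor"]]}


start_params = {"query": "x", "max_researchers": 2, "query_end_cursor": ""}
all_nodes = fetch_all_people(fake_execute, start_params)
print([node["name"] for node in all_nodes])  # ['A', 'B', 'C']
```

Against the live API, the same helper could be called with a wrapper such as `lambda v: client.execute(gql(query_str), variable_values=v)` in place of `fake_execute`. Copying `params` keeps the original query parameters reusable for later cells.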
List the affiliations and the corresponding researcher names in tabular format. This allows the user to select one of the affiliations to use in a more detailed query (see below) that also retrieves publications.
# Collect names and affiliations for the researchers found
# Test if fieldValue matches (case-insensitively) a Solr-style query (with " AND " representing the logical AND, and " " representing the logical OR)
def testIfPresentCaseInsensitive(solrQuery, fieldValueLowerCase):
    for orTerms in solrQuery.split(" AND "):
        present = False
        for term in orTerms.split(" "):
            if term.lower() in fieldValueLowerCase:
                present = True
                break
        if not present:
            return False
    return True

people = data['people']
# Map each affiliation name to the set of researcher names found for it
af2Names = {}
totalCount = 0
for node in people['nodes']:
    id = node['id']
    name = node['name']
    # TODO: Remove if we manage to search only individual fields
    if not testIfPresentCaseInsensitive(query_params['query'], name.lower()):
        continue
    totalCount += 1
    for af in node['affiliation']:
        affiliation = af['name']
        if affiliation not in af2Names:
            af2Names[affiliation] = set()
        af2Names[affiliation].add(name)

# Build one Markdown table row per affiliation
tableBody = ""
for af,names in sorted(af2Names.items()):
    tableBody += af + " | " + ', '.join(names) + "\n"
display(Markdown("Total number of researchers found: **%d**<br>The list of researchers by affiliation is as follows:" % totalCount))
display(Markdown(""))
display(Markdown("| Affiliation | Researcher Names |\n|---|---|\n%s" % tableBody))
Total number of researchers found: 210
The list of researchers by affiliation is as follows:
Affiliation | Researcher Names |
---|---|
American Chemical Society | John Smith
American Science and Engineering, Inc. | Henry John Peter Smith
Bank Street College of Education | John Smith
Bedford Institute of Oceanography | John Smith
Beecham Pharmaceuticals | John Smith
Birkenhead High School Academy | John Arthur Smith
Bureau of Ocean Energy Management, Pacific OCS Region | John Smith
CU Sports Medicine and Performance | John-Rudolph Smith
Charles Sturt University - Wagga Wagga Campus | John Smith
Church of Norway | John Arthur Smith
City College of New York | John Smith Del Rosario
Colorado School of Mines | John Smith
Cornell University | John-David Smith
CottonInfo | John Smith
Drew University | John Smith
East Carolina University | John Smith
Fairleigh Dickinson University | John Smith
Federation of Liberian Youth - FLY | John Solunta Smith Jr
Fire Risk Assessment Network | John Smith
Flagburn Health Center | John Smith
Fluent Technology | John Smith
George Washington University | John Smith
Georgia State University | John Smith
GlaxoSmithKline Plc | John Smith
Lipscomb University | John Smith
London University | John Smith
Louisiana State University | John F. Smith
MSG Software (USA), Inc. | Henry John Peter Smith
Manhattan College | Henry John Peter Smith
Michigan State University | John Smith
Millersville University | John Smith
NASA Langley Research Center | John Smith
New South Wales Department of Primary Industries Agriculture | John Smith
Northeastern University | Henry John Peter Smith
Northwestern University | John F. Smith
Nova Scotia Health Authority South Western Nova Scotia | John Smith
OCS Energy Consultant | John Smith
Ohio State University | John R. Smith
Oxford University Press | John Arthur Smith
Peking University | John Solunta Smith Jr
Pennsylvania State University | John Smith
Proof Read My File | John Smith
RMIT University City Campus | John Smith
Retired | John Arthur Smith
Rutgers New Jersey Medical School | John Smith Del Rosario
Rutgers University Camden | John Smith
Sample invited position | John Smith
Sigma Xi the Scientific Research Society | John Smith
TPE Associates Inc | Henry John Peter Smith
Technical Support | John Smith
Tennessee Technological University | John Smith
The New School for Social Research | John Smith
The University of St Andrews | Christopher John Smith
Tufts University | Henry John Peter Smith
Ulster Univeristy | John Smith
Ulster University | John Smith
University College London | John Smith
University at Buffalo | John Smith
University of Arizona | Smith, John E. 3rd
University of California Davis | John R Smith
University of Cambridge | John Arthur Smith
University of Central Missouri | John Smith
University of Colorado | John Smith
University of Colorado Boulder | JOHN SMITH, John Smith
University of Liverpool | John Arthur Smith, Quintin-John Smith
University of Michigan | John R. Smith
University of Missouri Columbia | John Smith
University of Ottawa | John Smith
University of Oxford | Christopher John Smith
University of Pennsylvania | John F. Smith, John Smith
University of St Andrews | Christopher John Smith
University of Strathclyde | John Smith
University of Toledo | John-David Smith
University of Toronto | John Smith
University of Virginia | Smith, John E. 3rd
University of York | John Smith
Vanderbilt University | John Smith
Virginia Commonwealth University | John Lee Smith
Visidyne, Inc. | Henry John Peter Smith
Yale University | John Smith
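The client-side name filter used to build this table matches each query term as a substring of the name, so "John" would also match a name such as "Johnathan". The sketch below contrasts that behaviour with a whole-word variant. It is an illustration only: `matches_substring` mirrors the notebook's check in condensed form, `matches_tokens` is a hypothetical alternative that is not part of the notebook, and "Johnathan Smithson" is a made-up name used to show the difference.

```python
import re


def matches_substring(solr_query: str, value: str) -> bool:
    """AND across ' AND ' groups, OR within a group, substring match."""
    value = value.lower()
    return all(
        any(term.lower() in value for term in group.split())
        for group in solr_query.split(" AND ")
    )


def matches_tokens(solr_query: str, value: str) -> bool:
    """Same AND/OR structure, but each term must equal a whole word."""
    tokens = set(re.findall(r"\w+", value.lower()))
    return all(
        any(term.lower() in tokens for term in group.split())
        for group in solr_query.split(" AND ")
    )


names = [
    "Smith, John E. 3rd",   # appears in the table above
    "John-David Smith",     # appears in the table above
    "Johnathan Smithson",   # made-up name, for contrast only
]
for name in names:
    print(
        name,
        matches_substring("John AND Smith", name),
        matches_tokens("John AND Smith", name),
    )
```

Both functions accept the first two names. Only the substring version accepts the made-up third name, which is the kind of false positive a whole-word check would filter out, at the cost of also rejecting deliberate partial matches.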
# Generate the GraphQL query to retrieve all researchers matching query "John and Smith" and affiliation "University of Arizona", now with works
name_query = "John AND Smith"
affiliation_query = "\"University of Arizona\""
query_params1 = {
    "query" : name_query + " AND " + affiliation_query,
    "max_researchers" : 10,
    "query_end_cursor" : ""
}
query_str = """query getResearchersByName(
    $query: String!,
    $max_researchers: Int!,
    $query_end_cursor : String!
    )
{
  people(query: $query, first: $max_researchers, after: $query_end_cursor) {
    totalCount
    pageInfo {
      hasNextPage
      endCursor
    }
    nodes {
      id
      givenName
      familyName
      name
      affiliation {
        name
      }
      works(first: 3) {
        nodes {
          id
          publicationYear
          publisher
          titles {
            title
          }
          creators {
            id
            name
            affiliation {
              id
              name
            }
          }
          subjects {
            subject
          }
        }
      }
    }
  }
}
"""
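The combined query string is built by joining the name terms and a quoted affiliation phrase with " AND ". A small helper can do the quoting so that multi-word affiliations are always treated as one phrase. This is a sketch under assumptions: `quote_phrase` and `build_people_query` are hypothetical helpers, and the backslash escaping of embedded quotes assumes the search backend follows the usual Solr-style convention, which the notebook does not confirm.

```python
from typing import Iterable, Optional


def quote_phrase(text: str) -> str:
    """Wrap text in double quotes so it is searched as a single phrase."""
    # Assumption: backslash escapes are honoured by the search backend.
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def build_people_query(
    name_terms: Iterable[str], affiliation: Optional[str] = None
) -> str:
    """Join name terms, and optionally a quoted affiliation, with AND."""
    parts = list(name_terms)
    if affiliation:
        parts.append(quote_phrase(affiliation))
    return " AND ".join(parts)


combined = build_people_query(["John", "Smith"], "University of Arizona")
print(combined)  # John AND Smith AND "University of Arizona"
```

The result is the same string that the cell above assembles from `name_query` and `affiliation_query`, so it could be assigned to the `"query"` entry of the parameter dict unchanged.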
Run the above query via the GraphQL client
# Parse the query once; it is reused for every page
query = gql(query_str)
# Initialise overall data dict that will store results
data1 = {}
# Keep retrieving results until there are no more results left
while True:
    # Pass the variables as a dict
    res = client.execute(query, variable_values=query_params1)
    people = res["people"]
    if "people" not in data1:
        # First page: keep the whole response (totalCount, pageInfo, nodes)
        data1 = res
    else:
        # Later pages: append the new nodes to those already collected
        data1["people"]["nodes"].extend(people["nodes"])
    pageInfo = people["pageInfo"]
    # Advance the cursor of this query's own parameters (query_params1),
    # or stop once the last page has been read
    if pageInfo["hasNextPage"] and pageInfo["endCursor"] is not None:
        query_params1["query_end_cursor"] = pageInfo["endCursor"]
    else:
        break
from textwrap import shorten
# Collect all relevant details for the researchers found
tableBody = set()
people = data1['people']
for node in people['nodes']:
    id = node['id']
    firstName = node['givenName']
    surname = node['familyName']
    name = node['name']
    # TODO: Remove if we manage to search only individual fields
    if not testIfPresentCaseInsensitive(name_query, name.lower()):
        continue
    orcidHref = ""
    if id is not None and id != "":
        orcidHref = "["+ name +"]("+ id +")"
    affiliations = []
    for affiliation in node['affiliation']:
        affiliations.append(affiliation['name'])
    works = ""
    if 'works' in node:
        for work in node['works']['nodes']:
            titles = []
            for title in work['titles']:
                titles.append(shorten(title['title'], width=50, placeholder="..."))
            creators = []
            cnt = 0
            for creator in work['creators']:
                cnt += 1
                # Restrict display to the first author only
                if (cnt > 1):
                    creators[-1] += " et al."
                    break
                if creator['id'] is not None:
                    creators.append("[" + creator['name'] + "](" + creator['id'] + ")")
                else:
                    creators.append(creator['name'])
            works += '; '.join(creators) + " (" + str(work['publicationYear']) + ") ["+ ', '.join(titles) +"]("+ work['id'] + ") *" + work['publisher'] + "*<br>"
    tableBody.add(firstName + " | " + surname + " | " + orcidHref + " | " + '<br>'.join(sorted(affiliations)) + " | " + works)
display(Markdown("| First Name | Surname | Link to ORCID | Affiliations | Works | \n|---|---|---|---|---|\n%s" % '\n'.join(tableBody)))
First Name | Surname | Link to ORCID | Affiliations | Works |
---|---|---|---|---|
John E | Smith | Smith, John E. 3rd | University of Arizona<br>University of Virginia | Smith, John Edward (2020) CS_216516.sf3 Harvard Dataverse<br>Smith, John Edward (2020) human N2Aus PKA phosphorylation Harvard Dataverse<br>Lostal, William et al. (2019) Titin splicing regulates cardiotoxicity... American Association for the Advancement of Science (AAAS)
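As the TODO comments in the code note, the search is not restricted to individual fields, so a returned record is not guaranteed to list the chosen affiliation. A final client-side check can confirm this before a candidate's ORCID ID is accepted. The sketch below is an illustration only: `summarise_candidates` is a hypothetical helper, and the sample nodes use placeholder ORCID IDs, not real identifiers.

```python
from typing import Any, Dict, List


def summarise_candidates(
    nodes: List[Dict[str, Any]], affiliation: str
) -> List[Dict[str, Any]]:
    """Keep only people who list the given affiliation (case-insensitive)."""
    wanted = affiliation.lower()
    candidates = []
    for node in nodes:
        names = [af["name"] for af in node.get("affiliation") or []]
        if any(name.lower() == wanted for name in names):
            candidates.append(
                {
                    "name": node["name"],
                    "orcid": node.get("id"),
                    "affiliations": sorted(names),
                }
            )
    return candidates


# Sample data shaped like the API response; the ids are placeholders.
sample_nodes = [
    {
        "id": "https://orcid.org/0000-0000-0000-0000",
        "name": "Smith, John E. 3rd",
        "affiliation": [
            {"name": "University of Virginia"},
            {"name": "University of Arizona"},
        ],
    },
    {
        "id": "https://orcid.org/0000-0000-0000-0001",
        "name": "John Smith",
        "affiliation": [{"name": "Yale University"}],
    },
]

result = summarise_candidates(sample_nodes, "University of Arizona")
for candidate in result:
    print(candidate["name"], candidate["orcid"], candidate["affiliations"])
```

With real data, `data1['people']['nodes']` would take the place of `sample_nodes`. If exactly one candidate remains, its `orcid` value is the identifier the funnel was looking for; if several remain, the works listed for each can help tell them apart.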