Funding Data in REST API

Funder data ingest

graph LR;
  skos(Elsevier SKOS File);
  validate(Validation);
  registry-db(Registry DB);
  registry-api(Registry API http://data.crossref.org/fundingdata/registry);
  rest-api(REST API https://api.crossref.org/v1/funders);
  matching(Work Funder Matching);
  repo(GitHub Repo);
  skos --> validate;
  validate -- Ingest and merge --> registry-db;
  registry-db --> registry-api;
  registry-api -- periodic ingest --> rest-api;
  registry-api -- manual --> repo;
  registry-db --> matching;

This covers the ingestion workflow from the registry to the REST API.

For now, links point to the develop branch of the REST API repository.

The funder registry contains the funder data. Its URL is stored under [:location :cr-funder-registry] in the defaults file. The start-up task :update-funders runs the function update-funders, which checks the file named by [:res :funder-update] in the defaults file, currently funder-update.date. That file holds the date on which the funder index was last updated. On a first run it contains no date. The function write-last-funder-update in the schedule file writes the current date and time to funder-update.date once an update of the REST API funder index is done.

A cron job, update-funders-hourly-work-trigger, runs hourly and compares the modification time of the funder registry (the URL at [:location :cr-funder-registry] in the defaults file) with the date in funder-update.date. If the registry was modified after the funder index was last updated, the funder index is updated and funder-update.date is rewritten with the time of that update. The cycle repeats on the next run of the cron job.
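The check-then-update cycle described above can be sketched as follows. This is a minimal Python illustration, not the REST API's actual Clojure code: the function names, the ISO 8601 format assumed for funder-update.date, and the update_index callback are all assumptions made for the example.

```python
from datetime import datetime, timezone
from pathlib import Path
from typing import Callable, Optional


def read_last_update(path: Path) -> Optional[datetime]:
    """Return the stored timestamp, or None on a first run (missing or empty file)."""
    if not path.exists():
        return None
    text = path.read_text().strip()
    # Assumption: the date is stored as an ISO 8601 string.
    return datetime.fromisoformat(text) if text else None


def write_last_update(path: Path, when: datetime) -> None:
    """Record when the funder index was last updated."""
    path.write_text(when.isoformat())


def maybe_update_funders(
    path: Path,
    registry_modified: datetime,
    now: datetime,
    update_index: Callable[[], None],
) -> bool:
    """Update the index only if the registry changed after the last update."""
    last = read_last_update(path)
    if last is None or registry_modified > last:
        update_index()                 # hypothetical stand-in for the real ingest
        write_last_update(path, now)   # remember when this update happened
        return True
    return False
```

An hourly scheduler would call maybe_update_funders with the registry's current modification time. A run that finds no newer registry leaves both the index and the date file untouched.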

The data from the registry is in RDF. When the cron job triggers an update, the REST API creates a default model from the RDF file using the Jena library, reads the model through Jena's API, and creates JSON that is pushed to the backend. In the backend, the JSON is stored as type funder in the index named funder. The mappings for the funder index are listed here.
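To make the RDF-to-JSON step concrete, here is a small Python sketch that uses the standard library's XML parser instead of Jena. The RDF/XML snippet is a simplified, hypothetical sample rather than an excerpt from the registry file, and the output covers only the id, primary-name and name fields.

```python
import json
import xml.etree.ElementTree as ET

NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "skos": "http://www.w3.org/2004/02/skos/core#",
    "skosxl": "http://www.w3.org/2008/05/skos-xl#",
}
RDF_ABOUT = "{%s}about" % NS["rdf"]
LITERAL_PATH = "skosxl:%s/skosxl:Label/skosxl:literalForm"

# Hypothetical, simplified sample of one funder resource.
SAMPLE_RDF = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:skos="http://www.w3.org/2004/02/skos/core#"
         xmlns:skosxl="http://www.w3.org/2008/05/skos-xl#">
  <skos:Concept rdf:about="http://dx.doi.org/10.13039/100000076">
    <skosxl:prefLabel><skosxl:Label>
      <skosxl:literalForm>Directorate for Biological Sciences</skosxl:literalForm>
    </skosxl:Label></skosxl:prefLabel>
    <skosxl:altLabel><skosxl:Label>
      <skosxl:literalForm>BIO</skosxl:literalForm>
    </skosxl:Label></skosxl:altLabel>
  </skos:Concept>
</rdf:RDF>"""


def labels(concept, kind):
    """Collect the literal forms under skosxl:prefLabel or skosxl:altLabel."""
    return [e.text for e in concept.findall(LITERAL_PATH % kind, NS)]


def concept_to_doc(concept):
    """Turn one skos:Concept element into a JSON-ready funder document."""
    uri = concept.get(RDF_ABOUT)
    pref = labels(concept, "prefLabel")
    return {
        "id": uri.rsplit("/", 1)[-1],          # last path segment of the URI
        "primary-name": pref[0] if pref else None,
        "name": labels(concept, "altLabel"),
    }


def parse_funders(rdf_xml):
    root = ET.fromstring(rdf_xml)
    return [concept_to_doc(c) for c in root.findall("skos:Concept", NS)]


if __name__ == "__main__":
    print(json.dumps(parse_funders(SAMPLE_RDF), indent=2))
```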

RDF Relationships

The funder file that is indexed in the REST API is in RDF and is stored here. The following relationships are stored in JSON documents in the funder index of Elasticsearch. The prefix before the colon in each relationship names the namespace that the element belongs to.

| Namespace | URL |
| --- | --- |
| rdf | http://www.w3.org/1999/02/22-rdf-syntax-ns |
| skos | http://www.w3.org/2004/02/skos/core |
| skosxl | http://www.w3.org/2008/05/skos-xl |
| svf | http://data.crossref.org/fundingdata/xml/schema/grant/grant-1.2 |
| dct | http://purl.org/dc/terms |

| Function | Relationship | ES Field | ES Field Definition |
| --- | --- | --- | --- |
| get-labels | skosxl:prefLabel, skosxl:literalForm | primary-name | Preferred form of the name. |
| get-labels | skosxl:altLabel, skosxl:literalForm | name | Alternate form of the name. |
| broader | skos:broader | parent, ancestor | parent: the immediate ancestor of the resource, defined by the skos:broader relationship. For example, for resource 100000154, the parent is the resource listed in the skos:broader relationship of that resource, which is 100000076. ancestor-ids: the ids of all resources above the current one in the hierarchy. The function broader collects all objects of the skos:broader relationship, going up the hierarchy. For example, for resource 100000154, its immediate parent 100000076 is the object of the skos:broader predicate. For resource 100000076 in turn, 100000001 is the object of the skos:broader predicate. ancestor-ids therefore gathers the skos:broader objects of the current resource and of every resource above it, so the ancestor-ids for 100000154 are "100000001", "100000076". |
| narrower | skos:narrower | child, descendant | child and descendant are the inverse of parent and ancestor-ids. Unlike those two, child and descendant-ids hold the same result: the objects of the skos:narrower predicate defined on the resource. For resource 100000030, both are ["100000125" "100000130" "100005195" "100005217" "100005218" "100005220" "100005222" "100005224" "100005258" "100005260" "100005262" "100005265" "100006087" "100006088" "100006089" "100006090" "100013154" "100014009"]. |
| res->doi | rdf:resource | doi | Stores the funder prefix/URI of the resource. |
| res->id | rdf:resource | id, _id | Stores the id of the resource, which is the DOI without the prefix. |
| affiliated | svf-el:affilWith | affiliated | Contains the ids of any organizations affiliated with the resource. This is indexed, but no code currently displays it. |
| replaces | dct:replaces | replaces | Contains the id of the organization that the current resource replaces. |
| replaced-by | dct:isReplacedBy | replaced-by | Contains the id of the organization that the current resource is replaced by. |
| build-hierarchy | skos:broader, skos:narrower | hierarchy | Builds the hierarchy of all resources associated with the current resource. The tree starts at the top-level ancestor of the current resource and includes the parent, the siblings of the parent, and the siblings of the current resource. The hierarchy is used in the Metadata Search for facet views of the funders. For example, for resource 100000152 the hierarchy is: `"hierarchy":{"100000001":{"100010608":{"name":"Office of Inspector General","id":"100010608"},"100000179":{"name":"Office of the Director","id":"100000179","more":true},"name":"National Science Foundation","100005716":{"name":"National Science Board","id":"100005716"},"100000081":{"name":"Directorate for Education and Human Resources","id":"100000081","more":true},"100000083":{"name":"Directorate for Computer and Information Science and Engineering","id":"100000083","more":true},"100005441":{"name":"Office of Budget, Finance and Award Management","id":"100005441","more":true},"100000076":{"name":"Directorate for Biological Sciences","id":"100000076","more":true,"100000154":{"name":"Division of Integrative Organismal Systems","id":"100000154"},"100000153":{"name":"Division of Biological Infrastructure","id":"100000153"},"100000152":{"name":"Division of Molecular and Cellular Biosciences","id":"100000152"},"100000156":{"name":"Division of Emerging Frontiers","id":"100000156"},"100000155":{"name":"Division of Environmental Biology","id":"100000155"}},"100000084":{"name":"Directorate for Engineering","id":"100000084","more":true},"id":"100000001","100000088":{"name":"Directorate for Social, Behavioral and Economic Sciences","id":"100000088","more":true},"100005447":{"name":"Office of Information and Resource Management","id":"100005447","more":true},"100000086":{"name":"Directorate for Mathematical and Physical Sciences","id":"100000086","more":true},"100000085":{"name":"Directorate for Geosciences","id":"100000085","more":true}}}` |
| build-hierarchy | skos:broader, skos:narrower | hierarchy-names | A flattened hashmap of the ids and names of all resources associated with the current resource, generated from the hierarchy. It contains the top-level ancestor of the current resource, the parent, the siblings of the parent, the siblings of the current resource, and a more signifier indicating that a sibling resource has further resources (children/descendants) beneath it. For example, for resource 100000152, hierarchy-names contains the id and name of the top-level ancestor 100000001, the parent 100000076, the siblings of the parent, and the siblings of 100000152. This field is used in the Metadata Search for facet views of the funders. |
| get-country-literal-name | svf:country | country | Country name, obtained from the GeoNames URI associated with the resource. |
| ancestors and res->id | | level | Position of the current resource in the hierarchy: the number of the resource's ancestors plus 1. This seems to be used in the Crossref Metadata Search service for funder indentation in the facets. |
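
As an illustration of how ancestor-ids and level follow from the skos:broader links, here is a minimal Python sketch. The BROADER map is hand-built from the example ids in the table. The function names are hypothetical and do not mirror the actual implementation.

```python
# skos:broader links for the example resources, written as id -> parent ids.
BROADER = {
    "100000154": ["100000076"],
    "100000076": ["100000001"],
    "100000001": [],
}


def doi_to_id(doi):
    """Strip the prefix from a funder DOI such as "10.13039/100000076"."""
    return doi.split("/")[-1]


def ancestor_ids(funder_id, broader):
    """Walk skos:broader links upward, nearest ancestor first."""
    result = []
    seen = {funder_id}                    # guards against cycles in the data
    frontier = list(broader.get(funder_id, []))
    while frontier:
        current = frontier.pop(0)
        if current in seen:
            continue
        seen.add(current)
        result.append(current)
        frontier.extend(broader.get(current, []))
    return result


def level(funder_id, broader):
    """Number of ancestors plus 1, so a top-level funder is at level 1."""
    return len(ancestor_ids(funder_id, broader)) + 1
```

With this map, ancestor_ids("100000154", BROADER) yields 100000076 and 100000001, and level("100000076", BROADER) is 2, which agrees with the level value in the indexed document for that funder.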

Note on parents in the funder relationship

Note that while in RDF a funder can have multiple parents, Cayenne currently assumes at most one parent per funder. For a given funder, the first object of the broader relationship is ingested as the only parent, and the remaining parents from the RDF are dropped by the ingestion process. This affects all fields in Cayenne.
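
The effect of the single-parent assumption can be sketched in a few lines of Python. The ids "A", "B" and "C" are made up for the example, and the function names are hypothetical. The sketch only mirrors the behaviour described in the note above.

```python
def first_parent_only(broader):
    """Keep the first broader object per funder and report the ones dropped."""
    kept, dropped = {}, {}
    for funder_id, parents in broader.items():
        kept[funder_id] = parents[0] if parents else None
        dropped[funder_id] = parents[1:]
    return kept, dropped


def single_chain(funder_id, parent_of):
    """Follow the single kept parent upward to the top-level funder."""
    chain = []
    current = parent_of.get(funder_id)
    while current is not None and current not in chain:
        chain.append(current)
        current = parent_of.get(current)
    return chain


# Hypothetical funder "C" with two parents in the RDF.
kept, dropped = first_parent_only({"C": ["A", "B"], "A": [], "B": []})
```

Here kept["C"] is "A" and dropped["C"] is ["B"], so every field derived from the parent chain of "C" reflects only the lineage through "A".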

Funder data structure

NB: The hierarchy, hierarchy-names, and descendants parts of the funder structure are used in various bits of functionality in the Crossref Metadata Search service.
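
The relationship between hierarchy and hierarchy-names can be illustrated with a short Python sketch that flattens a nested hierarchy into an id-to-name map. The function name is hypothetical and the tree is a trimmed version of the sample data. Unlike the indexed documents, the sketch does not emit a "more" entry in the flattened map.

```python
def flatten_names(tree):
    """Collect {id: name} for every funder node nested in a hierarchy tree."""
    names = {}
    for key, value in tree.items():
        if isinstance(value, dict):          # nested funder node keyed by its id
            if "name" in value:
                names[key] = value["name"]
            names.update(flatten_names(value))
    return names


# Trimmed example tree in the same shape as the "hierarchy" field.
EXAMPLE_TREE = {
    "100000001": {
        "name": "National Science Foundation",
        "id": "100000001",
        "100000076": {
            "name": "Directorate for Biological Sciences",
            "id": "100000076",
            "more": True,
            "100000152": {
                "name": "Division of Molecular and Cellular Biosciences",
                "id": "100000152",
            },
        },
    }
}
```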

The document that is indexed in Elasticsearch is as follows:

{
  "_index": "funder",
  "_type": "funder",
  "_id": "100000076",
  "_version": 1,
  "found": true,
  "_source": {
    "hierarchy-names": {
      "100005441": "Office of Budget, Finance and Award Management",
      "100000155": "Division of Environmental Biology",
      "100010608": "Office of Inspector General",
      "100000156": "Division of Emerging Frontiers",
      "100000084": "Directorate for Engineering",
      "100000153": "Division of Biological Infrastructure",
      "100014072": "National Coordination Office",
      "100000001": "National Science Foundation",
      "100000076": "Directorate for Biological Sciences",
      "100014074": "Integrative and Collaborative Education and Research",
      "100000179": "Office of the Director",
      "100000081": "Directorate for Education and Human Resources",
      "100000085": "Directorate for Geosciences",
      "100000154": "Division of Integrative Organismal Systems",
      "100014073": "National Nanotechnology Coordinating Office",
      "100000083": "Directorate for Computer and Information Science and Engineering",
      "100014411": "Center for Unmanned Aircraft Systems",
      "100005447": "Office of Information and Resource Management",
      "more": null,
      "100005716": "National Science Board",
      "100000088": "Directorate for Social, Behavioral and Economic Sciences",
      "100014591": "BioXFEL Science and Technology Center",
      "100014071": "Large Facilities Office",
      "100000152": "Division of Molecular and Cellular Biosciences",
      "100000086": "Directorate for Mathematical and Physical Sciences"
    },
    "primary-name": "Directorate for Biological Sciences",
    "replaced-by": [],
    "parent": "10.13039/100000001",
    "name": [
      "BIO",
      "BIO/OAD"
    ],
    "descendant": [
      "100000152",
      "100000153",
      "100000154",
      "100000155",
      "100000156"
    ],
    "doi": "10.13039/100000076",
    "level": 2,
    "token": [
      "directorate",
      "for",
      "biological",
      "sciences",
      "bio",
      "bio/oad"
    ],
    "id": "100000076",
    "affiliated": [],
    "replaces": [],
    "child": [
      "100000154",
      "100000153",
      "100000152",
      "100000156",
      "100000155"
    ],
    "hierarchy": {
      "100000001": {
        "100014074": {
          "name": "Integrative and Collaborative Education and Research",
          "id": "100014074"
        },
        "100010608": {
          "name": "Office of Inspector General",
          "id": "100010608"
        },
        "100000179": {
          "name": "Office of the Director",
          "id": "100000179",
          "more": true
        },
        "100014411": {
          "name": "Center for Unmanned Aircraft Systems",
          "id": "100014411"
        },
        "name": "National Science Foundation",
        "100005716": {
          "name": "National Science Board",
          "id": "100005716"
        },
        "100000081": {
          "name": "Directorate for Education and Human Resources",
          "id": "100000081",
          "more": true
        },
        "100000083": {
          "name": "Directorate for Computer and Information Science and Engineering",
          "id": "100000083",
          "more": true
        },
        "100005441": {
          "name": "Office of Budget, Finance and Award Management",
          "id": "100005441",
          "more": true
        },
        "100000076": {
          "name": "Directorate for Biological Sciences",
          "id": "100000076",
          "more": true,
          "100000154": {
            "name": "Division of Integrative Organismal Systems",
            "id": "100000154"
          },
          "100000153": {
            "name": "Division of Biological Infrastructure",
            "id": "100000153"
          },
          "100000152": {
            "name": "Division of Molecular and Cellular Biosciences",
            "id": "100000152"
          },
          "100000156": {
            "name": "Division of Emerging Frontiers",
            "id": "100000156"
          },
          "100000155": {
            "name": "Division of Environmental Biology",
            "id": "100000155"
          }
        },
        "100000084": {
          "name": "Directorate for Engineering",
          "id": "100000084",
          "more": true
        },
        "id": "100000001",
        "100014073": {
          "name": "National Nanotechnology Coordinating Office",
          "id": "100014073"
        },
        "100014591": {
          "name": "BioXFEL Science and Technology Center",
          "id": "100014591"
        },
        "100000088": {
          "name": "Directorate for Social, Behavioral and Economic Sciences",
          "id": "100000088",
          "more": true
        },
        "100005447": {
          "name": "Office of Information and Resource Management",
          "id": "100005447",
          "more": true
        },
        "100014071": {
          "name": "Large Facilities Office",
          "id": "100014071"
        },
        "100000086": {
          "name": "Directorate for Mathematical and Physical Sciences",
          "id": "100000086",
          "more": true
        },
        "100000085": {
          "name": "Directorate for Geosciences",
          "id": "100000085",
          "more": true
        },
        "100014072": {
          "name": "National Coordination Office",
          "id": "100014072"
        }
      }
    },
    "country": "United States",
    "ancestor": [
      "100000001"
    ]
  }
}

The response from the API is as follows:

{
  "status": "ok",
  "message-type": "funder",
  "message-version": "1.0.0",
  "message": {
    "hierarchy-names": {
      "100014074": "Integrative and Collaborative Education and Research",
      "100000153": "Division of Biological Infrastructure",
      "100010608": "Office of Inspector General",
      "100000179": "Office of the Director",
      "100014411": "Center for Unmanned Aircraft Systems",
      "100005716": "National Science Board",
      "100000081": "Directorate for Education and Human Resources",
      "100000083": "Directorate for Computer and Information Science and Engineering",
      "100005441": "Office of Budget, Finance and Award Management",
      "100000001": "National Science Foundation",
      "100000152": "Division of Molecular and Cellular Biosciences",
      "100000076": "Directorate for Biological Sciences",
      "100000084": "Directorate for Engineering",
      "100014073": "National Nanotechnology Coordinating Office",
      "100000155": "Division of Environmental Biology",
      "more": null,
      "100014591": "BioXFEL Science and Technology Center",
      "100000088": "Directorate for Social, Behavioral and Economic Sciences",
      "100005447": "Office of Information and Resource Management",
      "100014071": "Large Facilities Office",
      "100000086": "Directorate for Mathematical and Physical Sciences",
      "100000085": "Directorate for Geosciences",
      "100000156": "Division of Emerging Frontiers",
      "100000154": "Division of Integrative Organismal Systems",
      "100014072": "National Coordination Office"
    },
    "replaced-by": [],
    "work-count": 0,
    "name": "Directorate for Biological Sciences",
    "descendants": [
      "100000152",
      "100000153",
      "100000154",
      "100000155",
      "100000156"
    ],
    "descendant-work-count": 0,
    "id": "100000076",
    "tokens": [
      "directorate",
      "for",
      "biological",
      "sciences",
      "bio",
      "bio/oad"
    ],
    "replaces": [],
    "uri": "http://dx.doi.org/10.13039/100000076",
    "hierarchy": {
      "100000001": {
        "100014074": {},
        "100010608": {},
        "100000179": {
          "more": true
        },
        "100014411": {},
        "100005716": {},
        "100000081": {
          "more": true
        },
        "100000083": {
          "more": true
        },
        "100005441": {
          "more": true
        },
        "100000076": {
          "more": true,
          "100000154": {},
          "100000153": {},
          "100000152": {},
          "100000156": {},
          "100000155": {}
        },
        "100000084": {
          "more": true
        },
        "100014073": {},
        "100014591": {},
        "100000088": {
          "more": true
        },
        "100005447": {
          "more": true
        },
        "100014071": {},
        "100000086": {
          "more": true
        },
        "100000085": {
          "more": true
        },
        "100014072": {}
      }
    },
    "alt-names": [
      "BIO",
      "BIO/OAD"
    ],
    "location": "United States"
  }
}
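
Comparing the two samples above suggests how the indexed document maps onto the API message: several fields are renamed, the DOI becomes a URI, and the hierarchy loses its name and id entries. The Python sketch below is inferred only from those two samples and is not the REST API's actual code. The function names are hypothetical, and work-count and descendant-work-count are left out because they do not appear in the indexed document.

```python
def strip_hierarchy(node):
    """Keep nested funder ids and the "more" flag; drop "name" and "id"."""
    out = {}
    for key, value in node.items():
        if isinstance(value, dict):
            out[key] = strip_hierarchy(value)
        elif key == "more":
            out[key] = value
    return out


def to_api_message(source):
    """Map an indexed funder document (_source) to the API "message" body."""
    return {
        "id": source["id"],
        "uri": "http://dx.doi.org/" + source["doi"],
        "name": source["primary-name"],       # primary-name -> name
        "alt-names": source["name"],          # name -> alt-names
        "tokens": source["token"],            # token -> tokens
        "location": source["country"],        # country -> location
        "descendants": source["descendant"],  # descendant -> descendants
        "replaces": source["replaces"],
        "replaced-by": source["replaced-by"],
        "hierarchy": strip_hierarchy(source["hierarchy"]),
        "hierarchy-names": source["hierarchy-names"],
    }
```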