Description of dataset and data model
The first part of the chapter describes the original data model used, provided by the AMICa pathfinder project. This explanation is an essential starting point of the analysis since the representation of the engineering system under analysis is a reflection of the way the original data was modeled.
Regarding the original data sources, the AMICa pathfinder project focused on gathering the following types of data as a representation of the biofuels research ecosystem:
Research Projects: Through the European Commission's community research and development information service (CORDIS), several research projects were retrieved.
Patents: the OECD REGPAT Database provides patent data linked to geographical regions.
Scientific Publications: the Crossref database was used as a source of scientific publications.
Industry Facilities and Organizations: with the goal of accessing the names and specifications of organizations and facilities, several sources were used such as ETIP Bioenergy, Biofuels Digest, reegle and more.
Knowledge graph and data reconciliation services: the Global Research Identifier Database and DBpedia were used to enrich and reconcile the original data.
After compiling information from the sources above, all of the data was pre-processed using the open-source software OpenRefine, with the goal of building the graph database in Neo4j.
Originally, the data extracted from the above sources came in a format that was hard to work with: a mix of explicit information and expert knowledge related to biofuels research. To extract the knowledge structure behind this data, it was decided that each knowledge asset (patent, publication, project) can be viewed as a combination of the following categories of terms:
Feedstocks: the raw material used to fuel a machine or industrial process (e.g. wood, sugar, corn, waste).
Processing Technologies: The processes that feedstocks undergo in order to achieve a desired output, which in the case of biofuels are mainly chemical processes (fermentation, gasification, catalysis, etc.).
Outputs: The products that result from feedstocks undergoing a certain process (biogas, ethanol, biodiesel, etc.).
Each piece of knowledge extracted from the database will very likely contain at least one of the above types of terms. For example, there might be 4 patents in the database that contain the term “waste” and 17 scientific publications that contain the term “fermentation”.
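A count of this kind can be sketched in Python as follows. This is only an illustration: the asset records, the field names and the function `count_assets_by_term` are hypothetical and do not come from the AMICa database.

```python
from collections import Counter

# Hypothetical toy assets; identifiers, types and terms are invented for illustration.
toy_assets = [
    {"id": "patent-1", "type": "patent", "terms": {"waste", "gasification", "biogas"}},
    {"id": "patent-2", "type": "patent", "terms": {"waste", "fermentation"}},
    {"id": "publication-1", "type": "publication", "terms": {"fermentation", "ethanol"}},
]


def count_assets_by_term(assets):
    """Count how many assets of each type contain each term."""
    counts = Counter()
    for asset in assets:
        # "terms" is a set, so each term is counted at most once per asset.
        for term in asset["terms"]:
            counts[(asset["type"], term)] += 1
    return counts


term_counts = count_assets_by_term(toy_assets)
print(term_counts[("patent", "waste")])
print(term_counts[("publication", "fermentation")])
```

Each asset is treated here as an unordered set of terms, which is the bag-of-words view of the data.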
However, a bag-of-words approach to this data might be too simplistic. Feedstocks, technologies and outputs are intricately related to one another, and these relationships are information that should be preserved. As the project developed, the participants came to understand that the interesting aspect of these terms lies in their combinations:
“A key finding of our meetings and literature review is that exploring all relevant combinatorial possibilities between potential feedstocks (e.g. microalgae), processing technologies (e.g. microwave-assisted transesterification) and outputs (e.g. biodiesel), is a difficult but crucial task in the development of new sustainable biofuels.” (AMICa technical brief)
This ‘alternative’ approach is more useful in the project context, since one of the goals of the project was to understand whether there might be untapped areas of research. Take the following example: there are two feedstocks, F1 and F2, and one processing technology, PT. There is a high frequency of assets with the F1/PT pair, but a low frequency of assets with the F2/PT pair. This might mean that F2/PT is an untapped area of research. Such conclusions would be much harder to reach if the focus were not on the combinations.
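The F1/F2/PT example can be made concrete with a small Python sketch. It assumes that each asset is reduced to a set of feedstock terms and a set of processing technology terms. The toy data, the function names and the `max_count` threshold are illustrative assumptions, not the method used in the project.

```python
from collections import Counter
from itertools import product

# Hypothetical toy assets: three mention F1 with PT, one mentions F2 with PT.
toy_pair_assets = [
    {"feedstocks": {"F1"}, "technologies": {"PT"}},
    {"feedstocks": {"F1"}, "technologies": {"PT"}},
    {"feedstocks": {"F1"}, "technologies": {"PT"}},
    {"feedstocks": {"F2"}, "technologies": {"PT"}},
    {"feedstocks": {"F2"}, "technologies": set()},
]


def pair_counts(assets):
    """Count the assets in which each feedstock/technology pair co-occurs."""
    counts = Counter()
    for asset in assets:
        for pair in product(asset["feedstocks"], asset["technologies"]):
            counts[pair] += 1
    return counts


def rare_pairs(assets, max_count=1):
    """Return every possible pair that appears in at most max_count assets."""
    counts = pair_counts(assets)
    feedstocks = set().union(*(a["feedstocks"] for a in assets))
    technologies = set().union(*(a["technologies"] for a in assets))
    # Pairs that never co-occur have a count of 0 and are included as well.
    return sorted(
        pair for pair in product(feedstocks, technologies)
        if counts[pair] <= max_count
    )


print(pair_counts(toy_pair_assets))
print(rare_pairs(toy_pair_assets))
```

With this toy data the F1/PT pair is counted three times and the F2/PT pair once, so only F2/PT is flagged as a candidate for further inspection. A low count is a hint and not proof of an untapped area, since the combination may simply be technically unviable.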
One of the main reasons why a graph database was adopted for this project was its capacity to preserve these important relationships while maintaining acceptable response times and scalability. The main elements that make up this database are described below.
The first group of elements consists of the asset nodes. These come directly from the database; there are 5 types of asset nodes: patents, projects, facilities, organizations, and publications. Each of these node types contains attributes such as year, owner or abstract.
The second group of elements consists of asset attributes. These are loose elements that are closely related to asset nodes, such as countries, locations, types or asset owners. They can be understood as metadata for the asset nodes.
The final group of elements, and perhaps the most relevant for the project in scope, is the group of terms. These are the various output, processing technology and feedstock terms that appear in assets.
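The three groups of elements can be pictured with a minimal in-memory sketch in Python. This is not the Neo4j schema used in the project: the class names, the relationship labels (`CONTAINS`, `LOCATED_IN`) and the example values are all hypothetical and serve only to show how assets, attributes and terms could be linked.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Node:
    key: str    # unique name of the node
    group: str  # "asset", "attribute" or "term"
    kind: str   # e.g. "patent", "country", "feedstock"


@dataclass
class Graph:
    # Each edge is a (source, relationship, target) triple.
    edges: set = field(default_factory=set)

    def link(self, source, relationship, target):
        self.edges.add((source, relationship, target))

    def neighbours(self, source, kind):
        """Return the keys of the nodes of a given kind linked from source."""
        return sorted(
            target.key
            for origin, _, target in self.edges
            if origin == source and target.kind == kind
        )


# Hypothetical example: one patent, one attribute node and three term nodes.
graph = Graph()
patent = Node("patent-1", "asset", "patent")
graph.link(patent, "LOCATED_IN", Node("country-A", "attribute", "country"))
graph.link(patent, "CONTAINS", Node("microalgae", "term", "feedstock"))
graph.link(patent, "CONTAINS", Node("transesterification", "term", "processing_technology"))
graph.link(patent, "CONTAINS", Node("biodiesel", "term", "output"))

print(graph.neighbours(patent, "feedstock"))
print(graph.neighbours(patent, "country"))
```

Keeping terms as separate nodes, instead of plain text fields on each asset, is what allows combinations of feedstocks, technologies and outputs to be queried across assets.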
All of the above groups have relationships to each other. For example, a patent might contain several feedstock terms and output terms. Since an extensive description of these relationships is outside the scope of this thesis, the following illustration provides a simplified overview of the relationships between these nodes, as well as the attributes of each node.