Macro Level: The biofuel research system

Jupyter Notebook

1. Characterisation of years

Capability Matrices

As a first result, the whole database is considered without any filtering. The functions, built in Python, render two different capability matrices: a normalized version and an absolute version. In both, some rows and columns contain a noticeably higher number of technological assets; these correspond to the biofuel-related terms that are used most frequently. In general, the structures of both matrices are similar, which is expected since the normalized version is proportional to the absolute one.

As a validation of this approach, a clustered version of this capability matrix was also produced. Although visualization is difficult, two areas of higher density can be seen, in the top left and bottom left parts of the matrix. Inspecting the clustered term pairs shows that terms related from a scientific perspective are grouped together. Some examples of clusters follow:

  • Palm waste, food waste, organic waste, municipal solid waste, industrial waste.

  • Sawdust, woody biomass, wood waste.

  • Beverage waste, garden waste, brewery waste, biodegradable waste.

  • Mixed prairie grass, cereals/sugar, corn/barley, grain, agriculture, agricultural waste.

  • Sugarcane, cellulosic ethanol, corn, cellulosic biomass, yeast.

  • Rice straw, wheat straw.

Overall, the clustering tends to group terms that are related to one another. These relations result from the composition of certain feedstocks, the type of derivatives of certain raw materials, or the proximity of certain outputs.

Capability Matrices of years

Taking a year as the unit of analysis, the first step is to characterize a given year in terms of its technological capability. This methodology was first described in section 3. In this matrix, each row and each column represents a term from the dictionary of biofuel-relevant terms, and each value is the number of queried documents that contain both the row term and the column term. As a point of departure, all of the documents related to the year 2017 were queried. As a result, the following matrix is produced:

Here, the visualization is rather challenging because of the large size of the matrix: there are a total of 352 rows and columns. Another useful matrix is the normalized capability matrix, which allows for a fairer comparison between years. It is obtained by dividing the above matrix by the total number of documents present in the database for that year. In the case of 2017, where there are a total of 670 documents (patents, publications, projects), the normalized version of the matrix takes the following shape and characteristics:

  • Shape: 352x352

  • Max Value: 0.23

  • Min Value: 0.00

  • Mean: 2.47x10e-4

One important point is the large number of zero values in the matrices above. Moreover, there appear to be “areas” where the values are higher.
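The counting and normalization steps described above can be sketched as follows. This is a minimal illustration, not the notebook's actual functions: the names `capability_matrix` and `summarize`, the toy documents, and the choice to store on the diagonal the number of documents containing a single term are all assumptions.

```python
import numpy as np

def capability_matrix(documents, terms):
    """Count, for each pair of terms, the documents that contain both."""
    index = {term: i for i, term in enumerate(terms)}
    matrix = np.zeros((len(terms), len(terms)), dtype=int)
    for doc_terms in documents:
        # Keep only dictionary terms, once each, as matrix indices.
        present = sorted({index[t] for t in doc_terms if t in index})
        for i in present:
            for j in present:
                matrix[i, j] += 1  # assumption: diagonal = documents with the term
    return matrix

def summarize(matrix):
    """Shape, extremes, mean and share of zero entries of a matrix."""
    return {
        "shape": matrix.shape,
        "max": float(matrix.max()),
        "min": float(matrix.min()),
        "mean": float(matrix.mean()),
        "zero_share": float(np.mean(matrix == 0)),
    }

# Hypothetical toy data: each document is the set of terms found in it.
terms = ["ethanol", "fermentation", "sugar", "biogas"]
documents = [
    {"ethanol", "fermentation"},
    {"ethanol", "fermentation", "sugar"},
    {"sugar"},
    {"biogas"},
]

counts = capability_matrix(documents, terms)
normalized = counts / len(documents)  # divide by the documents of that year
summary = summarize(normalized)
```

Even in this four-term example, more than a third of the entries are zero, which hints at why the full 352x352 matrices are so sparse.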

Capability Lists

Once a capability matrix could be produced for each year, this matrix was transformed into a vector, or list, using the function previously described in the methodology. Below, a visualization of two years, 2012 and 2013, is provided. In this visualization, each row corresponds to the normalized capability list of one of the years.

These lists, which can also be referred to as “spectrums of capability”, show the same behavior as the corresponding capability matrices, with a large number of empty entries. In this case, however, there are many entries where both 2012 and 2013 have a relatively high number of documents.
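One plausible way to turn a capability matrix into a capability list is sketched below. The actual function is the one described in the methodology; here it is assumed that, because the matrix is symmetric, only the upper triangle (diagonal included) is kept. The name `capability_list` is hypothetical.

```python
import numpy as np

def capability_list(matrix):
    """Flatten the upper triangle (diagonal included) of a symmetric matrix."""
    rows, cols = np.triu_indices(matrix.shape[0])
    return matrix[rows, cols]

# Hypothetical normalized capability matrix for three terms.
matrix = np.array([
    [0.5, 0.2, 0.0],
    [0.2, 0.4, 0.1],
    [0.0, 0.1, 0.3],
])
vector = capability_list(matrix)
```

Under this assumption, a 352x352 matrix yields a list of 62,128 entries, one per unordered term pair plus the diagonal.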

2. Year correlation matrix

After visualizing the capability matrix of a given year and characterizing it as an adjacency matrix, or list, the next goal is to compare different years with one another. To do this, we must first choose which years to include in the comparison. In the database, the number of documents over time is not regular; on the contrary, there is a large number of years with few or no documents. The figure below represents the number of documents per year in the database. Until 1997, the number of documents is close or equal to zero, so the year-to-year comparison focuses on the period from 1997 to 2018.

To compare two years, the Pearson correlation index between their capability lists was used (as described in the methodology). For example, the years 2012 and 2013 (as seen in the figure above) have a Pearson correlation of 0.90, which indicates a very strong relationship between them. Applying the same methodology to every year between 1997 and 2018 produces a year correlation matrix, as seen in the figure below. In this figure, the lighter the color, the higher the similarity in terms of research. For example, the year 2003 is more similar to 2000 than to 2002. Recent years are on average more similar to one another than less recent years, although there seem to be some exceptions.
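The year correlation matrix can be sketched with NumPy's `corrcoef`, which computes the Pearson correlation between every pair of rows. The function name `year_correlation_matrix` and the toy capability lists are assumptions made for illustration.

```python
import numpy as np

def year_correlation_matrix(capability_lists):
    """Pearson correlation between every pair of years.

    capability_lists maps a year to its 1-D capability list;
    all lists must have the same length.
    """
    years = sorted(capability_lists)
    stacked = np.vstack([capability_lists[year] for year in years])
    return years, np.corrcoef(stacked)

# Hypothetical capability lists (real ones have one entry per term pair).
lists = {
    2011: np.array([0.0, 0.1, 0.3, 0.0]),
    2012: np.array([0.0, 0.2, 0.6, 0.0]),
    2013: np.array([0.4, 0.0, 0.0, 0.1]),
}
years, corr = year_correlation_matrix(lists)
```

Note that a year whose capability list is constant (for example, all zeros because it has no documents) has an undefined correlation, which is one more reason to restrict the comparison to years with enough documents.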

After producing the year correlation matrix, clustering was applied to it as a way of identifying “clusters” of years that are more closely related to one another. Hierarchical clustering with average distance (average linkage) was the chosen methodology. Its application led to the figure below, where the clustering algorithm orders the matrix so that years that are more similar appear closer together. A dendrogram is also shown as a visual aid to that same algorithm. In general, more recent years (2010 onwards) form a cluster of their own. The results of this clustering confirm that, from a year-to-year correlation perspective, the period 2010-2017 shows few changes, the period 2005-2009 shows a medium level of changes, and all the remaining years are characterized by relatively large year-to-year changes.
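The reordering step can be sketched with SciPy's hierarchical clustering. This is an illustration under assumptions: the conversion of a correlation into a distance as `1 - correlation` is not stated in the text, and `cluster_order` and the toy matrix are hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import leaves_list, linkage
from scipy.spatial.distance import squareform

def cluster_order(corr):
    """Order years so that highly correlated years sit next to each other."""
    distance = 1.0 - corr              # assumed correlation-to-distance conversion
    np.fill_diagonal(distance, 0.0)    # remove floating-point noise on the diagonal
    condensed = squareform(distance, checks=False)
    tree = linkage(condensed, method="average")  # average linkage
    return leaves_list(tree), tree

# Hypothetical correlation matrix: years 0 and 2 are similar, as are 1 and 3.
corr = np.array([
    [1.0, 0.2, 0.9, 0.1],
    [0.2, 1.0, 0.3, 0.8],
    [0.9, 0.3, 1.0, 0.2],
    [0.1, 0.8, 0.2, 1.0],
])
order, tree = cluster_order(corr)
reordered = corr[np.ix_(order, order)]  # matrix with similar years adjacent
```

The returned `tree` can be passed to `scipy.cluster.hierarchy.dendrogram` to draw the dendrogram alongside the reordered matrix.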

3. Correlation of years over time

Following the comparison of all of the years with one another, it is interesting to understand whether the relationships and similarities between those years are connected to a chronological timeline, i.e. are consecutive years more closely related? To help assess this question, the correlation of consecutive years was also studied. The figure below represents how each year is correlated with the previous year. For example, 2005 has a 0.5 correlation with 2004, whereas 2006 has a 0.35 correlation with 2005. The correlation tends to rise over time; in other words, more recent years are more related to each other. Before 2007, on the other hand, the correlation between years follows a less obvious pattern.
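Given a year correlation matrix, the series of consecutive-year correlations is simply the first off-diagonal. A minimal sketch follows; the function name is hypothetical, and the toy matrix reuses the two values quoted above (0.5 and 0.35), with the remaining entry invented.

```python
import numpy as np

def consecutive_correlations(years, corr):
    """Correlation of each year with the year immediately before it."""
    result = {}
    for position in range(1, len(years)):
        # Skip gaps so that only truly consecutive years are compared.
        if years[position] - years[position - 1] == 1:
            result[years[position]] = float(corr[position, position - 1])
    return result

years = [2004, 2005, 2006]
corr = np.array([
    [1.00, 0.50, 0.20],   # 0.20 is a made-up value
    [0.50, 1.00, 0.35],
    [0.20, 0.35, 1.00],
])
series = consecutive_correlations(years, corr)
```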

4. Comparing Years

The final result of the Macro level analysis concerns the comparison of two years, and particularly the understanding of the intrinsic differences that may result in a high or low similarity between them. Taking the years 2017 and 2010 as a point of departure and as a proof of concept, the first step was to build their capability matrices. Again, due to the high number of terms and term pairs, little can be understood from viewing the matrices side by side.

To visualize the areas of the matrix that differ from one year to the other, the difference between the normalized capability matrices of both years was also produced. Knowing that these years have a relatively high Pearson correlation (i.e. they are similar in terms of capability), the matrix of differences serves as a simple way of directly comparing two years.

To understand and compare the years at a term-pair level, tables of the most frequent term pairs for the years 2010 and 2017 were produced. When observing these tables, some factors stand out:

  • The number of documents in 2010 is far greater than the number of documents in 2017.

  • The pair ethanol-fermentation is the top term pair in both years, appearing in at least 17% of all of the technological assets of each year.

  • In general, the most used term-pairs are made of output-processing technology terms (e.g. ethanol-hydrolysis) and output-feedstock terms (e.g. waste-ethanol or sugar-ethanol).

  • Some pairs seem to diminish in importance: for example, sugar-fermentation is important in both years, but sugar-ethanol is not present in the top 10 for the year 2017.

Top term pairs for 2010:

Top term pairs for 2017:

Finally, as a way of directly comparing these years, a table of the term pairs and their evolution from 2010 to 2017 was created. The main question was: if term pair A-B was in x% of assets in year X, what was that percentage in year X+Y? Moreover, which term pairs differed most in usage between these two years? The table below is one possible answer to that question; it lists the term pairs whose usage differed the most between these two years. For example, the term pair “bio-oil-pyrolysis” appears in only 0.26% of the documents in 2010, against 1.15% in 2017. The following observations stand out:

  • The term pairs with the most important differences in usage are not necessarily the term pairs with the most usage in each of the years, with the exception of the pair “ethanol-fermentation”.

  • Most term pairs contain output terms such as “ethanol”, “biodiesel”, “biogas” etc.

  • There is a relative balance between the number of term pairs that decreased in usage and the number of term pairs that increased in usage.

Top term-pairs with the most important differences in usage between 2010 and 2017.
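A table of this kind can be derived from the difference between two normalized capability matrices. The sketch below is a minimal illustration with hypothetical names (`largest_changes`) and toy matrices; only the bio-oil/pyrolysis shares (0.26% and 1.15%) come from the text, the other values are invented.

```python
import numpy as np

def largest_changes(norm_start, norm_end, terms, top=10):
    """Term pairs whose share of documents changed most between two years."""
    difference = norm_end - norm_start
    rows, cols = np.triu_indices(len(terms), k=1)  # each pair once, no diagonal
    ranked = np.argsort(-np.abs(difference[rows, cols]), kind="stable")[:top]
    return [
        (
            terms[rows[i]],
            terms[cols[i]],
            float(norm_start[rows[i], cols[i]]),
            float(norm_end[rows[i], cols[i]]),
            float(difference[rows[i], cols[i]]),
        )
        for i in ranked
    ]

terms = ["bio-oil", "pyrolysis", "ethanol"]
norm_2010 = np.array([
    [0.0, 0.0026, 0.010],
    [0.0026, 0.0, 0.020],
    [0.010, 0.020, 0.0],
])
norm_2017 = np.array([
    [0.0, 0.0115, 0.010],
    [0.0115, 0.0, 0.021],
    [0.010, 0.021, 0.0],
])
changes = largest_changes(norm_2010, norm_2017, terms, top=2)
```

Each returned tuple holds the two terms, the share in the first year, the share in the second year, and the signed difference, so increases and decreases can be told apart.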

After analyzing the evolution of the correlation of years over time and providing visualizations that allow two years to be compared, the study re-focuses on how the usage of different biofuel-related terms evolves over time. To do this, the same framework of term division is respected: biofuel-related terms can be feedstocks, processing technologies, or outputs. Because the volume of documents is inconsistent over time, the analysis focuses on the normalized quantity of terms rather than their absolute values. Three different graphs were produced, one for each type of term. The terms chosen to represent each group were selected due to their high occurrence in that group.

The same behavior can generally be observed across the three types of terms: until the year 2000, the normalized quantity of terms is rather “turbulent”, while after that year the normalized usage of the different terms behaves more regularly. Moreover, there are some spikes in terms such as “sugar” or “ethanol”, which mean that all of the documents for that particular year in the database contain that term.
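The normalized usage of a term, and the origin of the spikes, can be illustrated with a short sketch. The function name and the toy documents are assumptions; the point is that a year with very few documents reaches a normalized usage of 1.0 as soon as all of them mention the term.

```python
def normalized_term_usage(documents_by_year, term):
    """Share of each year's documents that contain the given term."""
    usage = {}
    for year, documents in sorted(documents_by_year.items()):
        if not documents:
            continue  # a year without documents has no defined share
        usage[year] = sum(term in doc for doc in documents) / len(documents)
    return usage

# Hypothetical data: one document in 1995, four in 2016, none in 1996.
documents_by_year = {
    1995: [{"sugar"}],
    1996: [],
    2016: [
        {"sugar", "ethanol"},
        {"ethanol"},
        {"biogas"},
        {"ethanol", "waste"},
    ],
}
sugar_usage = normalized_term_usage(documents_by_year, "sugar")
```

Here the single 1995 document produces a spike of 1.0 for “sugar”, while the better-populated year 2016 gives a more moderate 0.25.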

6. Contextual Relationships


Taking the inflation-adjusted evolution of the price of the barrel of oil in $US from the following source, its evolution was compared with the relative presence of terms over the years in the database of assets. As a first visual tool, a double-axis plot was produced with the normalized usage of three example outputs (biogas, bioplastic, and butanol) and the price of oil from 1990 to 2017.

One can first observe a rise in term usage over the years and, at the same time, a steady increase in the price of oil until about 2014. Moreover, some patterns in the evolution of the price of oil seem to repeat themselves in the usage of terms, such as the usage of the term “biogas” between 2000 and 2005 and the evolution of the price of oil between 2003 and 2008. A chronological visualization has drawbacks, however: there is no way of consistently comparing all of the terms and their correlation with the price of oil. To achieve this comparison, the evolution of the price of oil was compared to every term in the database. For each term, the Pearson correlation between its usage in every year and the price of oil was calculated. The following table ranks the 10 terms with the highest positive correlation with the price of oil. For example, the usage of the term “butanol” has a correlation of 0.85 with the price of oil, and the term “bioplastic” a correlation of 0.80. A table of the terms with the strongest negative correlations was also produced and can be consulted in the repo for this project.

Top 10 terms with the highest positive correlation with the price of oil from 1990 to 2017:

When looking at the table of results above, one can observe that the majority of the terms most “influenced” by the price of oil are in fact output terms, such as butanol, biodiesel, and biobutanol. Moreover, the very low p-values of the Pearson correlations indicate a high level of confidence in the relationships expressed in the table.
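A ranking of this kind, including the p-values, can be sketched with `scipy.stats.pearsonr`, which returns both the correlation coefficient and its p-value. The function name `rank_terms_by_correlation` and all the numbers below are invented for illustration; they are not the project's data.

```python
from scipy.stats import pearsonr

def rank_terms_by_correlation(term_usage, price, top=10):
    """Rank terms by the Pearson correlation of their usage with a price series.

    term_usage maps a term to its yearly normalized usage, aligned
    year by year with the price series.
    """
    ranking = []
    for term, usage in term_usage.items():
        if len(set(usage)) < 2:
            continue  # correlation is undefined for a constant series
        r, p_value = pearsonr(usage, price)
        ranking.append((term, float(r), float(p_value)))
    ranking.sort(key=lambda row: row[1], reverse=True)
    return ranking[:top]

# Hypothetical yearly series (same years for prices and usages).
oil_price = [20, 30, 50, 80, 100]
term_usage = {
    "butanol": [0.01, 0.02, 0.03, 0.05, 0.06],
    "wood": [0.06, 0.05, 0.05, 0.02, 0.01],
    "unused term": [0.0, 0.0, 0.0, 0.0, 0.0],
}
ranking = rank_terms_by_correlation(term_usage, oil_price)
```

Sorting in ascending order instead gives the table of the strongest negative correlations. A low p-value only indicates that the correlation is unlikely to be due to chance; it does not by itself show that the price drives the term usage.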


The second example is based on a more traditional asset, the price of sugar. The same approach as for the price of oil was applied, now taking the evolution of the price of sugar over time from the DataBank. As with oil, the double-axis chart is rather poor at expressing the relationship between the price of the kilo of sugar in $US and terms such as “sugar”, “sugarcane”, or “wood”. For this reason, a table with the top 10 terms with the highest Pearson correlation index was produced.

When observing the term ranking, some interesting observations can be made. Firstly, almost all of the terms in this top ranking are in fact feedstock terms, i.e. raw materials used for biofuel production. Secondly, the two terms with the highest correlation with the price of sugar, “sugarcane” and “cellulosic sugars”, are closely related to it. Finally, flowering plants such as “jatropha” and “sorghum” also have an important Pearson correlation index with the price of the kilo of sugar.

Top 10 terms with the highest correlation with the price of the kilo of sugar from 1990 to 2017:
