In the final part of the complementary analysis, two unsupervised learning methods are applied to the dataset at the meso level. Instead of focusing on term-pairs, this approach is based on simple term frequency. This simpler representation makes it possible to compare the results with those of the term-pair approach used throughout the analysis.
The first step of the analysis is the creation of a matrix X, where each row represents a country and each column a term related to biofuels. Each entry of the matrix corresponds to the number of times that particular term was used in documentation located in that country. Each row was then normalized by the number of documents produced by the corresponding country. For example, the row for the US was divided by the total number of documents located in the US. As a result, X has the shape (137, 352), which corresponds to 137 countries and 352 terms.
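The construction and normalization of X can be illustrated with a minimal sketch. The function name, the input dictionaries, and the toy counts below are hypothetical and serve only to show the row-wise normalization; they are not the data or the code used in the analysis.

```python
import numpy as np


def build_term_frequency_matrix(term_counts, doc_counts, countries, terms):
    """Return a (n_countries, n_terms) matrix of term counts per document.

    term_counts: {country: {term: number of occurrences}} (hypothetical input)
    doc_counts:  {country: number of documents located in that country}
    """
    term_index = {term: j for j, term in enumerate(terms)}
    X = np.zeros((len(countries), len(terms)), dtype=float)
    for i, country in enumerate(countries):
        for term, count in term_counts.get(country, {}).items():
            X[i, term_index[term]] = count
        # Normalize the row by the number of documents of that country.
        X[i] /= doc_counts[country]
    return X


# Toy example with made-up numbers (2 countries, 3 terms).
countries = ["Denmark", "US"]
terms = ["ethanol", "biodiesel", "biogas"]
term_counts = {
    "Denmark": {"ethanol": 4, "biogas": 6},
    "US": {"ethanol": 50, "biodiesel": 30},
}
doc_counts = {"Denmark": 2, "US": 100}

X = build_term_frequency_matrix(term_counts, doc_counts, countries, terms)
print(X.shape)  # (2, 3)
```

With the full dataset, the same procedure would yield a matrix of shape (137, 352).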
After creating this matrix, the t-SNE algorithm was applied to it. The t-Distributed Stochastic Neighbor Embedding algorithm is a way of visualizing high-dimensional data in two or three dimensions (source). For example, each row of matrix X has 352 dimensions, which cannot be visualized directly. The algorithm maps this matrix to a (137, 2) or (137, 3) matrix while trying to keep similar rows close to each other, which makes visualization possible. Its two main parameters are the number of iterations and the perplexity. In general, the authors of the algorithm suggest typical perplexity values between 5 and 50. The graphs below show the application of this algorithm with different values of perplexity. In them, each data point corresponds to one country, and the color scale provides context as to the “richness” of that country (using the GDP per capita). The application does not result in any noticeable clustering. However, from a perplexity of about 30, countries with a higher GDP per capita appear to be concentrated in the center of the plot, while countries with a lower GDP per capita appear scattered around them.
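The projection step can be sketched as follows, assuming scikit-learn's `TSNE` implementation (the source does not state which implementation was used). The matrix here is random placeholder data with the same shape as X, and the perplexity values are illustrative only.

```python
import numpy as np
from sklearn.manifold import TSNE

# Placeholder data with the same shape as X (137 countries, 352 terms).
# These are random numbers, not the term frequencies from the analysis.
rng = np.random.default_rng(0)
X = rng.random((137, 352))

# Run t-SNE for several perplexity values (illustrative choices).
# The number of iterations is left at the library default.
embeddings = {}
for perplexity in (5, 30, 50):
    tsne = TSNE(n_components=2, perplexity=perplexity, random_state=0)
    embeddings[perplexity] = tsne.fit_transform(X)

for perplexity, Y in embeddings.items():
    print(perplexity, Y.shape)  # each embedding has shape (137, 2)
```

Each resulting array can then be drawn as a scatter plot, with one point per country colored by GDP per capita.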
The final result presented in this section comes from applying hierarchical clustering in two different ways: first using term frequency, and then using term-pair frequency. In both cases, hierarchical clustering with average linkage (the average distance between clusters) was applied.
The first figure presents the dendrogram that resulted from applying hierarchical clustering to the term frequency matrix X. When the dendrogram is cut at the level where clusters contain 20 or fewer countries, the group that contains Denmark also contains the following countries: Austria, Belgium, Czech Republic, El Salvador, Finland, France, Germany, Greece, Hungary, Italy, Netherlands, Norway, Poland, Russia, Spain, Sweden, Switzerland, and the United Kingdom.
In the second figure, the same algorithm was applied to a matrix containing term-pairs, as in the main analysis presented in this section. When the dendrogram is cut at the same level, Denmark’s cluster also contains: Canada, Finland, France, Germany, India, Italy, Netherlands, China, Poland, Portugal, South Korea, Spain, Sweden, Taiwan, the United Kingdom and the US.
Comparing both dendrograms, one can notice that the term-pair application generally produces a more balanced clustering: clusters are more similar in size, and the merge distances (y-axis) are more evenly distributed.
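The clustering and cutting steps described above can be sketched as follows, assuming SciPy's hierarchical clustering routines and a Euclidean distance (the source specifies average linkage but not the distance metric). The helper `cut_with_max_cluster_size` is a hypothetical name, and it encodes only one possible reading of "cutting where clusters have 20 or fewer countries". The data is a random placeholder, not the matrix from the analysis.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage


def cut_with_max_cluster_size(Z, max_size):
    """Return flat cluster labels from the coarsest cut of the linkage Z
    in which no cluster has more than max_size members."""
    n = Z.shape[0] + 1  # number of original observations
    for k in range(1, n + 1):
        labels = fcluster(Z, t=k, criterion="maxclust")
        if np.bincount(labels).max() <= max_size:
            return labels
    # Fallback: every observation in its own cluster.
    return np.arange(1, n + 1)


# Placeholder data with the same shape as X (random, not the real matrix).
rng = np.random.default_rng(0)
X = rng.random((137, 352))

# Average linkage; the Euclidean metric is an assumption.
Z = linkage(X, method="average", metric="euclidean")
labels = cut_with_max_cluster_size(Z, max_size=20)

# Cluster sizes give a rough view of how balanced the clustering is.
sizes = np.bincount(labels)[1:]
print(sizes)

# Row indices that share a cluster with the first row (e.g. one country).
same_cluster = np.flatnonzero(labels == labels[0])
print(same_cluster)
```

Applying the same function to the term frequency matrix and to the term-pair matrix, and comparing the resulting `sizes` arrays, gives a simple numerical counterpart to the visual comparison of the two dendrograms.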