Tools and data management
In this part of the thesis, some of the more technical, high-level tools are introduced. These consist mainly of the programming languages, libraries, and protocols that were essential to reaching the goals of the analysis.
The first tool worth mentioning is the graph database in which all of the patents, scientific publications, and projects were stored, a database developed in the context of the AMICa pathfinder project at DTU. Graph databases are a special type of database that uses network structures composed of nodes and edges to represent and store data, as opposed to the relational database model, where data is separated into different tables. The advantage of graph databases lies in the relationships between records, which are stored explicitly.
The database used is managed in Neo4j, an open-source graph database management system that makes it easy not only to run a database server on a machine, but also to interact with and query the data directly.
To query the data, Neo4j requires the use of a programming language known as Cypher. Cypher can be understood as the graph-database equivalent of SQL, and it allows for fast relational queries over the data. As a simple example of the Cypher language, let us retrieve all of the technological assets located in Denmark:
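The following query is a minimal sketch of what such a retrieval could look like; the node labels, relationship type, and property names (Asset, LOCATED_IN, Country, name, title, type) are assumptions for illustration, since the actual database schema is not reproduced here.

```cypher
// Hypothetical schema: (:Asset)-[:LOCATED_IN]->(:Country)
MATCH (a:Asset)-[:LOCATED_IN]->(c:Country {name: 'Denmark'})
RETURN a.title AS title, a.type AS type
LIMIT 10
```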
Such a query returns the matching assets as a table of records, one row per match.
The main programming language used to analyze and handle the data was Python, specifically version 2.7. This choice was due not only to the author's familiarity with the language, but also to its power and flexibility in data handling.
As a way of interfacing with the Neo4j database, the open-source py2neo library was used. It offers a particularly convenient way of interacting with the original database from the Python environment, since it allows one to write Cypher queries directly in the Python console and to extract the results in a convenient format (a NumPy matrix or a pandas DataFrame), as sketched below.
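A minimal sketch of this workflow follows, assuming a py2neo v3-style API; the connection URI, the credentials, and the node labels and properties in the query are placeholders rather than the actual schema.

```python
from py2neo import Graph
import pandas as pd

# Connect to a local Neo4j server (URI and credentials are placeholders).
graph = Graph("bolt://localhost:7687", user="neo4j", password="password")

# A Cypher query written directly from Python; the labels and
# properties are hypothetical, as in the example above.
query = """
MATCH (a:Asset)-[:LOCATED_IN]->(c:Country {name: 'Denmark'})
RETURN a.title AS title, a.type AS type
"""

# Convert the result records into a pandas DataFrame for further analysis.
df = pd.DataFrame(graph.run(query).data())
print(df.head())
```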
For handling the data, an extensive set of Python libraries was used; the most important ones are listed below, with a short sketch combining them after the list:
Numpy: An essential package for scientific computing that provides an efficient multi-dimensional array object, along with the related numpy matrix class.
Pandas: Provides easy-to-read and easy-to-handle data structures (most notably the DataFrame) that are especially useful for table visualizations.
Math: The standard-library module that provides mathematical functions not available out of the box in the Python language itself.
Itertools: A standard-library module that provides building blocks for efficient looping and iteration.
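As a brief illustration of how these libraries work together, the following sketch builds a small matrix with numpy and wraps it in a pandas table; all names and values are invented purely for illustration.

```python
import math
from itertools import combinations

import numpy as np
import pandas as pd

# Invented, illustrative data: a small co-occurrence matrix between asset types.
types = ["patent", "publication", "project"]

# Fill a symmetric matrix, iterating over pairs with itertools.
matrix = np.zeros((len(types), len(types)))
for (i, a), (j, b) in combinations(enumerate(types), 2):
    matrix[i, j] = matrix[j, i] = math.sqrt((i + 1) * (j + 1))  # placeholder values

# Wrap the matrix in a pandas DataFrame for a readable table view.
df = pd.DataFrame(matrix, index=types, columns=types)
print(df)
```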
For visualizations, three toolkits were used recurrently, with a short sketch after the list:
Matplotlib: An easy-to-use tool for producing visualizations such as graphs, bar plots, and others.
Seaborn: Based on matplotlib, seaborn provides a more visually appealing and statistics-focused visualization library.
Plotly: Used to create dynamic, web-based visualizations.
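As an illustration of the first two toolkits, the following sketch draws a simple bar plot with matplotlib, styled through seaborn; the countries and counts are invented for illustration only.

```python
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

# Invented, illustrative data: number of assets per country.
countries = ["Denmark", "Sweden", "Norway"]
counts = [120, 95, 60]

sns.set()  # apply seaborn's default styling on top of matplotlib

positions = np.arange(len(countries))
plt.bar(positions, counts)
plt.xticks(positions, countries)
plt.ylabel("Number of assets")
plt.title("Assets per country (illustrative data)")
plt.show()
```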
If you wish to download a comprehensive list of the libraries used, please visit this link.
To present the data in a narrative format, Jupyter notebooks were used. These notebooks constitute an interesting way of presenting not only the code, but also a narrative form for the analysis, written in Markdown. Notebooks are a popular tool in data science and an easy way of presenting data-science procedures.
To store the code for the analysis, GitHub was used to keep everything in cloud storage under version control. Moreover, by creating a repository for this project, the code is available 24/7 to anyone who wants to consult it or request modifications.