The main technology themes of the project are cloud-based infrastructure, containers, and linked data. In summary, the project aims to provide easily deployable components for building custom linked data services on top of an infrastructure-as-a-service platform.
At a high level, the platform's data can be divided into three parts: external data, internal data, and disseminated data. Data sources constitute the external data. These can come in all shapes and sizes; the only requirement is that the data can be represented as a graph. There are no restrictions on the data models used by external data.
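To make the "representable as a graph" requirement concrete, here is a minimal sketch showing how an arbitrary record can be expressed as a set of (subject, predicate, object) triples. The identifiers below are invented for the example and are not part of the platform.

```python
# Illustrative only: any keyed record can be flattened into graph triples.
# "book1" and the field names are made-up identifiers for this sketch.
record = {"id": "book1", "title": "Kalevala", "author": "Lönnrot"}

# Use the record's id as the subject and each remaining field as a predicate.
graph = {("book1", key, value) for key, value in record.items() if key != "id"}
```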
Internal data refers to the data that is handled within the platform, and it is represented as a set of graphs. Internal data consists of provenance data and working data. Provenance data describes the processes that are used to create and modify internal data. This includes, for example, descriptions of workflow pipelines and their respective steps, as well as the services and source code used to implement those steps. Provenance data also includes descriptions generated from the execution of workflows; it can be used to answer questions about where, how, and when a certain graph was created. Finally, working data is the data on which the processes operate, which is further divided into manually mapped (ingested) and automatically generated datasets.
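The "where, how, and when" questions above can themselves be answered from graph data. The following sketch records basic provenance triples for a named graph; the predicate and identifier names are illustrative (loosely inspired by PROV-style vocabularies), not the platform's actual model.

```python
# Hypothetical provenance recording: all names below are assumptions.
provenance = set()

def record_activity(graph_id, activity, agent, timestamp):
    """Attach basic who/how/when provenance triples to a named graph."""
    provenance.add((graph_id, "generatedBy", activity))
    provenance.add((activity, "wasAssociatedWith", agent))
    provenance.add((activity, "endedAtTime", timestamp))

record_activity("graph:dataset1", "activity:ingest-42",
                "agent:ingestion-service", "2016-05-01T12:00:00Z")

# "How was graph:dataset1 created?" becomes a lookup over the triples.
how = [a for (g, p, a) in provenance
       if g == "graph:dataset1" and p == "generatedBy"]
```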
Workflow and provenance data are generated automatically, based on platform-specific models. The data model for the working data, on the other hand, can and must be decided by the platform administrators.
Disseminated data represents the subset of internal data that is distributed, in some shape and format, outside the platform. The same internal data can, for example, be distributed in two different formats using two different standards.
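As a sketch of this idea, the snippet below serializes the same internal triples in two formats: a simplified N-Triples-style text and a JSON document view. Both serializers are deliberately naive (no escaping, literals only) and exist only to illustrate the one-dataset, many-formats point.

```python
import json

# The same internal triples, disseminated in two different formats.
# URIs below are example.org placeholders, not real platform data.
triples = [
    ("http://example.org/book/1", "http://purl.org/dc/terms/title", "Kalevala"),
]

def to_ntriples(triples):
    # Simplified N-Triples-style serialization (literals only, no escaping).
    return "\n".join(f'<{s}> <{p}> "{o}" .' for s, p, o in triples)

def to_json(triples):
    # Simplified JSON document view of the same data.
    return json.dumps([{"subject": s, "predicate": p, "object": o}
                       for s, p, o in triples])
```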
The workflow component serves as the central point for managing, scheduling, and monitoring the processes that handle ingestion, enrichment, and dissemination of the platform's working data.
Ingestion processes handle the extraction of data from its external source and transform (and load) it into the platform's internal graph representation of the working data. Enrichment processes are used to generate new data based on the current state of the working data. Implementations of enrichment processes can range from simple link generation based on existing data to machine-learning-based approaches.
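The following sketch shows both kinds of process operating on a shared triple set: an ingestion step that transforms external tabular records into triples, and an enrichment step that does simple link generation. All function and property names are hypothetical.

```python
# Hypothetical ingestion + enrichment steps over shared working data.
working_data = set()

def ingest(rows):
    """Ingestion: transform external records into internal graph triples."""
    for row in rows:
        subject = f"person:{row['id']}"
        working_data.add((subject, "name", row["name"]))
        working_data.add((subject, "city", row["city"]))

def enrich():
    """Enrichment: simple link generation based on a shared property value."""
    by_city = {}
    for s, p, o in working_data:
        if p == "city":
            by_city.setdefault(o, []).append(s)
    for members in by_city.values():
        for a in members:
            for b in members:
                if a < b:  # avoid self-links and duplicates
                    working_data.add((a, "sameCityAs", b))

ingest([{"id": "1", "name": "Alice", "city": "Helsinki"},
        {"id": "2", "name": "Bob", "city": "Helsinki"}])
enrich()
```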
Separating the different concerns of data processing, combined with shared internal data storage, allows workflows to be reused on different data on an on-demand basis.
The graph management components are in charge of storing and distributing the internal data of the platform, which is stored exclusively in graph format. They also maintain the provenance of the data, which is a key feature of the platform.
The components provide efficient update and query interfaces for the platform's internal use, as well as a UI that visualizes the state of the internal data for platform administrators.
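A minimal in-memory sketch of such update and query interfaces is shown below. A real deployment would use a proper triple store behind these interfaces; the class and method names here are assumptions for illustration only.

```python
# Toy in-memory graph store; not the platform's actual implementation.
class GraphStore:
    def __init__(self):
        self.graphs = {}  # graph id -> set of (subject, predicate, object)

    def update(self, graph_id, triples):
        """Update interface: add triples to a named graph."""
        self.graphs.setdefault(graph_id, set()).update(triples)

    def query(self, graph_id, s=None, p=None, o=None):
        """Query interface: return triples matching a pattern (None = wildcard)."""
        return [(ts, tp, to)
                for ts, tp, to in self.graphs.get(graph_id, set())
                if (s is None or ts == s)
                and (p is None or tp == p)
                and (o is None or to == o)]

store = GraphStore()
store.update("graph:work", {("doc:1", "title", "Report")})
```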
These components handle the distribution of the public subset of the working data; they can be APIs or applications. Possible implementations include SPARQL endpoints, REST(ful) document-based APIs, simple data dumps, a CKAN data registry, and a platform provenance browser.
Platform components communicate with each other using a dedicated messaging service. Such a component allows for the implementation of persistent, fire-and-forget messaging and publish/subscribe communication within the platform.
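The sketch below illustrates these two properties with a toy topic-based bus: publishing is fire-and-forget (the publisher never waits for consumers), and messages published before any subscriber exists are persisted in a queue and delivered later. The class and topic names are invented for the example.

```python
from collections import defaultdict, deque

# Toy message bus; illustrative only, not the platform's messaging service.
class MessageBus:
    def __init__(self):
        self.subscribers = defaultdict(list)  # topic -> callbacks
        self.pending = defaultdict(deque)     # topic -> undelivered messages

    def subscribe(self, topic, callback):
        self.subscribers[topic].append(callback)
        # Persistence: deliver messages published before this subscriber existed.
        while self.pending[topic]:
            callback(self.pending[topic].popleft())

    def publish(self, topic, message):
        """Fire-and-forget: the publisher does not wait for consumers."""
        if self.subscribers[topic]:
            for callback in self.subscribers[topic]:
                callback(message)
        else:
            self.pending[topic].append(message)

bus = MessageBus()
bus.publish("workflow.finished", {"graph": "graph:dataset1"})  # no subscriber yet
received = []
bus.subscribe("workflow.finished", received.append)  # gets the queued message
```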
Service discovery provides the backbone for the dynamic composition of the platform components. The idea is that one should be able to add new internal services (e.g. an NER plugin for workflows) without having to restart the whole platform.
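A minimal registry sketch makes the idea concrete: services register themselves at startup, and workflows look them up by capability rather than by hard-coded addresses, so new services can join a running platform. The capability name and endpoint URL below are hypothetical.

```python
# Toy service registry; names and URLs below are assumptions for illustration.
class ServiceRegistry:
    def __init__(self):
        self.services = {}  # capability -> list of registered endpoints

    def register(self, capability, endpoint):
        """Called by a service at startup; no platform restart required."""
        self.services.setdefault(capability, []).append(endpoint)

    def lookup(self, capability):
        """Called by workflows to find services by capability."""
        return self.services.get(capability, [])

registry = ServiceRegistry()
# A new NER plugin announces itself to the running platform.
registry.register("ner", "http://ner-service:5000/annotate")
```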
Platform Deployment includes the necessary configurations and tools for setting up and maintaining the whole ATTX component-based platform or its individual components. This component also addresses issues related to load balancing and high-availability configurations.