6.3. The data pipeline and technologies used

The data platform and the other technological solutions of the reference architecture will form an uninterrupted data pipeline that gradually refines the raw data from information resources into information and forecasting models. Data is refined in the different components of the data platform as follows:

Figure 10. The data utilisation process

Source systems. Data is imported into the data platform from source systems via APIs in accordance with the City’s API policies. Datasets are saved in the data platform’s data lake, typically as files that form distinct, logical entities. For example, the data collected on a single day can be stored as a single file.
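The one-file-per-day convention can be sketched as follows. This is only an illustration under stated assumptions: the directory layout, the JSON file format, the source name "traffic_counter" and the function names are hypothetical and are not prescribed by the reference architecture, and a temporary local directory stands in for the data lake.

```python
import json
import tempfile
from datetime import date
from pathlib import Path


def daily_file_path(lake_root: Path, source: str, day: date) -> Path:
    """Build a path such as <lake_root>/raw/<source>/2024/03/2024-03-05.json.

    The layout is an assumption made for this sketch only.
    """
    return lake_root / "raw" / source / f"{day:%Y}" / f"{day:%m}" / f"{day.isoformat()}.json"


def save_daily_dataset(lake_root: Path, source: str, day: date, records: list) -> Path:
    """Write all records collected on one day into a single raw file."""
    path = daily_file_path(lake_root, source, day)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(records, ensure_ascii=False), encoding="utf-8")
    return path


# Made-up records; in practice they would be fetched from a source system API.
lake = Path(tempfile.mkdtemp())  # temporary directory standing in for the data lake
sample = [{"sensor": "a1", "value": 12.5}, {"sensor": "a2", "value": 9.0}]
saved = save_daily_dataset(lake, "traffic_counter", date(2024, 3, 5), sample)
```

Keeping the date in both the directory structure and the file name makes each file a self-describing logical entity, which simplifies reading the data one file at a time in the refinement phase.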

Data refinement. Data content is read one file at a time and refined for further processing. Refinement includes verifying the accuracy of the data, correcting possible errors, combining the data with previous files to form larger datasets and, if necessary, aggregating, anonymising or pseudonymising the data. The refinement phase encompasses all the tasks needed to model the data into a form that is ready for the next phase, the analytics phase. Data processed in this way is often called a data product. Data products are stored in the data platform’s dedicated area for refined data. The tools used are the data refinement tools offered by the data platform and scripting languages such as Python.
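The refinement steps described above (verification, error correction, pseudonymisation, combination and aggregation) might look roughly like this in Python. The field names "user_id", "district" and "value", the validation rules and the keyed-hash pseudonymisation are assumptions made for this sketch; they do not describe the City's actual data or tooling.

```python
import hashlib
import hmac
from collections import defaultdict

# Hypothetical key; a real deployment would load it from a secrets store.
SECRET_KEY = b"replace-with-a-managed-secret"


def pseudonymise(identifier: str, key: bytes = SECRET_KEY) -> str:
    """Replace an identifier with a keyed hash so the same person maps to the same token."""
    return hmac.new(key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()[:16]


def refine_file(records: list) -> list:
    """Validate, correct and pseudonymise the records of one raw file."""
    refined = []
    for rec in records:
        try:
            # Correction: accept a decimal comma and stray whitespace.
            value = float(str(rec["value"]).replace(",", "."))
            district = rec["district"].strip().title()
            user = pseudonymise(rec["user_id"])
        except (KeyError, ValueError, AttributeError):
            continue  # Verification: drop records that cannot be parsed.
        if value < 0:
            continue  # Verification: negative values are treated as invalid here.
        refined.append({"user": user, "district": district, "value": value})
    return refined


def aggregate_by_district(rows: list) -> dict:
    """Aggregate refined rows into a total value per district."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["district"]] += row["value"]
    return dict(totals)


# Two made-up daily files, processed one at a time and combined into one dataset.
day1 = [{"user_id": "u1", "district": " kallio ", "value": "3,5"},
        {"user_id": "u2", "district": "Töölö", "value": -1}]
day2 = [{"user_id": "u1", "district": "Kallio", "value": 1.5},
        {"user_id": "u3", "district": "Töölö", "value": "n/a"}]

data_product = refine_file(day1) + refine_file(day2)
totals = aggregate_by_district(data_product)
```

Note that a keyed hash is pseudonymisation, not anonymisation: anyone holding the key can recompute the token for a known identifier, so the key has to be protected accordingly.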

Analytics and machine learning. Data produces value only when it is utilised. This utilisation can include, for example, the creation of reports and analyses from data products or the use of machine learning and AI to create forecasting models or learning algorithms. The resulting output can be saved as new data products. The methods and tools used for analysis and modelling are chosen based on the use case. For example, the open-source programming languages Python and R offer extensive libraries for data processing and modelling. The City also aims to publish the algorithms it develops as open source, allowing interest groups to verify the functionality and accuracy of the models used.
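As a deliberately simple illustration of a forecasting model built on a data product, the sketch below fits a straight-line trend to a series with ordinary least squares and extrapolates it. The series and the function names are hypothetical; a real use case would choose its method and libraries based on the data and would evaluate the model before use.

```python
def fit_linear_trend(values: list) -> tuple:
    """Fit y = intercept + slope * x by ordinary least squares, with x = 0, 1, 2, ..."""
    n = len(values)
    if n < 2:
        raise ValueError("at least two observations are needed to fit a trend")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def forecast(values: list, steps_ahead: int = 1) -> float:
    """Extrapolate the fitted trend the given number of steps past the last observation."""
    slope, intercept = fit_linear_trend(values)
    return intercept + slope * (len(values) - 1 + steps_ahead)


# Made-up monthly figures taken from a hypothetical data product.
monthly_visits = [10.0, 12.0, 14.0, 16.0]
next_month = forecast(monthly_visits)
```

Publishing a model of this kind as open source would let interest groups rerun it on the same data product and check the forecast themselves.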

Distribution. The datasets produced as the output of the analytics phase are saved in the data sharing layer, either in a relational database or in some other appropriate format, so that they can be transferred via APIs to the next phase, which is data utilisation.
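A minimal sketch of the distribution step, assuming the sharing layer is a relational database: the output of the analytics phase is written to a table, and a read function returns it in a shape that an API could serialise. An in-memory SQLite database stands in for the real database, and the table name, columns and function names are hypothetical.

```python
import sqlite3


def publish_to_sharing_layer(conn: sqlite3.Connection, totals: dict) -> None:
    """Store per-district totals in a table of the sharing layer."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS district_totals ("
        "district TEXT PRIMARY KEY, total REAL NOT NULL)"
    )
    # INSERT OR REPLACE keeps the table consistent when a dataset is republished.
    conn.executemany(
        "INSERT OR REPLACE INTO district_totals (district, total) VALUES (?, ?)",
        list(totals.items()),
    )
    conn.commit()


def read_for_api(conn: sqlite3.Connection) -> list:
    """Return the shared dataset as a list of dictionaries, ready for JSON serialisation."""
    cursor = conn.execute(
        "SELECT district, total FROM district_totals ORDER BY district"
    )
    return [{"district": district, "total": total} for district, total in cursor]


# In-memory database used only for this example.
conn = sqlite3.connect(":memory:")
publish_to_sharing_layer(conn, {"Kallio": 5.0, "Alppila": 2.0})
payload = read_for_api(conn)
```

Separating the write path from the read path in this way means the API layer only ever sees published tables, not the intermediate data of the analytics phase.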

Data utilisation. The output of the analytics phase can be utilised in many ways, and the same output can be utilised in several different ways simultaneously. The output can simply be published as a report, but it is becoming increasingly common to create interactive graphical visualisations that raise the level of abstraction of the data and tell stories that help readers understand phenomena and trends. In addition to reporting and visualisation, data can be shared via APIs with external institutions for research use or made available to various applications. For example, open data can be used in construction project planning together with the pricing of building permits to encourage the preparation of unified plans for specific street sections or areas.

Draft