The Importance of Data Quality for a Data Exchange Platform

The global economy relies ever more heavily on data, and the ecosystem surrounding data is growing rapidly in size and scope. Data is used to frame and enact policies, drive better governance, run businesses and perform data science. India Urban Data Exchange (IUDX) forms part of this data ecosystem by facilitating the seamless exchange of data, in particular the data that drives smart cities in India.

Data hosted on the IUDX platform comes from diverse sources, such as IoT sensors, processed data from sensor clouds, crowd-sourced data from citizen applications and data entered manually by data engineers. As of now, IUDX hosts over 100 different datasets from 21 cities, spread over 8 different domains, and the collection is growing continuously. IUDX is a data exchange platform that acts as an intermediary between a data provider, such as a local urban body, and a data consumer, such as an application developer. It facilitates seamless and authorised data exchange between these two entities by providing a platform based on open APIs and open data models. In such a data ecosystem, the quality of data has a bearing on what can be done with the data and, subsequently, on the service provided to the public. Monitoring and assessing data quality is therefore of paramount importance to improving the quality of the overall data ecosystem.

Although a data exchange is neither a provider/creator nor a consumer of data, it is strategically positioned to assess the quality of the data available on the platform. Such a data quality assessment framework would benefit:

  • Data Providers: It would enable detection of sources of errors in data, such as sensor failure and outage, incorrect deployment, vandalism and calibration errors. This would lead to improvement in the available data and hence better revenues for the data providers.
  • Data Consumers or Application Developers: It would enable consumers/app developers to quickly assess the quality of the data to be used in their applications. It would also lead to better applications, eventually leading to better outreach and revenues for the application developers.
  • Application Users: It would help application users, who benefit more from applications built on higher quality data. This includes policy makers who use the data to draft policies.
  • City Administrators: It would help city administrators monitor the state of installations by assessing the underlying quality of data.

A Data Quality Assessment (DQA) tool is therefore an important component of a data exchange platform. A DQA tool would operate on a layer above the main components of the IUDX framework, i.e., the authorisation, resource access and catalogue services. That is, the DQA tool would take its inputs from the resource server, which hosts the data. Note that the IUDX architecture allows distributed resource servers, which implies that the data access service may be hosted by the data providers themselves or by other external entities. The DQA tool should be able to assess data from such resource servers as well, and hence it is better for it to operate at the output of a resource access service rather than at the input. This modular design allows flexibility in deployment: the tool can be plugged into the existing architecture, and it can also be easily integrated into other platforms. The image below presents two ways of using the DQA tool. Stream A shows how a data provider would use the tool with the IUDX resource server, and stream B shows how the tool would be used with an external server hosting the data.

Fig.1: DQA Tool Modularity
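
To make the idea of operating at the output of a resource access service concrete, here is a minimal Python sketch. It is only an illustration, not the IUDX implementation: the function names, the field names (observationDateTime, pm10) and the completeness check are all assumptions made for this example. The point is that the assessment code only sees records, so it does not matter which server produced them.

```python
from typing import Callable, Dict, Iterable, List

Record = Dict[str, object]  # one observation, as returned by some resource access service


def assess_completeness(fetch: Callable[[], Iterable[Record]],
                        required_fields: List[str]) -> Dict[str, int]:
    """Count how many fetched records carry a value for every required field.

    `fetch` is any callable that returns records; the assessment does not
    know or care which server the records came from.
    """
    records = list(fetch())
    complete = sum(
        all(record.get(field) is not None for field in required_fields)
        for record in records
    )
    return {"records": len(records), "complete_records": complete}


# Hypothetical stand-in for stream A: data served by an IUDX resource server.
def stream_a() -> List[Record]:
    return [
        {"observationDateTime": "2023-01-01T00:00:00", "pm10": 41.0},
        {"observationDateTime": "2023-01-01T00:15:00", "pm10": None},
    ]


# Hypothetical stand-in for stream B: data served by an external server.
def stream_b() -> List[Record]:
    return [{"observationDateTime": "2023-01-01T00:00:00", "pm10": 38.5}]


fields = ["observationDateTime", "pm10"]
result_a = assess_completeness(stream_a, fields)
result_b = assess_completeness(stream_b, fields)
print("Stream A:", result_a)
print("Stream B:", result_b)
```

In a real deployment the two stand-in functions would be replaced by calls to the respective resource access APIs; the assessment function itself would stay unchanged.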

The output of the DQA tool is a data quality report that provides scores for the different quality metrics being evaluated. The report would be available in both machine-readable formats, such as JSON, and human-readable formats. This report provides information at a glance to the end user and enables them to make a more informed decision on the selection and usage of the data. For example, it may tell a data science engineer which pre-processing tools are likely to be needed for a given dataset. Apart from specific metrics, the report may cover general statistical evaluations of a given dataset. Currently, IUDX provides data quality assessment for temporal data sources. The data quality metrics depend on the type of data source being evaluated. The metrics are selected to provide an unambiguous and objective assessment of domain-agnostic characteristics of these data sources. A sample data quality assessment report for air quality monitoring resources is available here.
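
As a rough illustration of what a machine-readable report for a temporal data source could look like, the Python sketch below computes a few simple, domain-agnostic checks on a list of timestamps and prints them as JSON. The metric names, the "twice the median interval" rule for a long gap and the sample timestamps are assumptions made for this example; they are not the metrics used by the IUDX DQA tool.

```python
import json
from datetime import datetime
from statistics import median
from typing import Dict, List


def quality_report(timestamps: List[str]) -> Dict[str, float]:
    """Build a small report from ISO 8601 timestamps of a temporal data source."""
    times = [datetime.fromisoformat(t) for t in timestamps]
    total = len(times)
    unique = sorted(set(times))

    # Seconds between consecutive distinct observations.
    gaps = [(later - earlier).total_seconds()
            for earlier, later in zip(unique, unique[1:])]
    typical = median(gaps) if gaps else 0.0

    # Assumption for this sketch: a gap is "long" if it exceeds twice the median.
    long_gaps = sum(1 for gap in gaps if typical and gap > 2 * typical)

    duplicate_fraction = (total - len(unique)) / total if total else 0.0
    return {
        "records": total,
        "duplicate_timestamp_fraction": round(duplicate_fraction, 3),
        "median_interval_seconds": typical,
        "long_gap_count": long_gaps,
    }


# Made-up sample: a sensor reporting every 15 minutes, with one repeated
# packet and one hour-long outage.
sample = [
    "2023-01-01T00:00:00",
    "2023-01-01T00:15:00",
    "2023-01-01T00:15:00",
    "2023-01-01T00:30:00",
    "2023-01-01T01:30:00",
]
report = quality_report(sample)
print(json.dumps(report, indent=2))
```

The same dictionary could be rendered into a human-readable format, such as a PDF or a web page, without changing how the metrics are computed.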

To summarise, at IUDX we are working to provide users with metrics to measure data quality so that they can make an informed choice while developing applications. This also helps data providers review the quality of the data generated by their systems and take remedial action where needed. An upcoming blog post will provide detailed descriptions of the metrics used in the data quality assessment of temporal data sources.

More links:

GitHub Repository for the IUDX Data Quality Assessment tool

Sample DQA Report in PDF format

IUDX Catalogue

More about IUDX

Names of the authors:

Novoneel Chakraborty, YLT Fellow

Jyotirmoy Dutta, Senior Research Fellow