Defining Metrics to Assess the Quality of Data

20 December Blog

IUDX

In the previous post, we explored why a Data Quality Assessment (DQA) framework is important for a data exchange platform such as IUDX. We looked at how such a DQA framework/tool would be modularly designed and positioned in the existing IUDX architecture, and at the benefits to both upstream and downstream stakeholders. In this post, we discuss the metrics that such a framework uses to assess data quality in the context of a data exchange platform.

On what basis can data quality be judged? This question becomes especially pertinent when one considers that different aspects or features of data matter to different application domains. As an example, the WHO may collect data about the number of positive cases of a certain virus, and its primary focus would be on the count, gender, demographics, etc. of the patients. It would not necessarily be as concerned with the timeliness of the data, as long as it falls within certain acceptable bounds. However, for a time-critical application such as ambulance location data, timeliness becomes one of the primary data quality parameters. These different features of data quality, such as timeliness or outliers, are classified as different data quality dimensions in the existing literature.

One of the primary types of data hosted on IUDX is temporal in nature, i.e. data where every data packet is accompanied by a timestamp. This includes, but is not limited to, data collected from sensors and data from vehicle trackers. Thus, our primary focus is on evaluating data quality for such temporal datasets. The data quality metrics that we choose for this evaluation must remain domain agnostic, since that generalises the applicability of the assessment framework to all temporal data sources irrespective of the application domain. With this context, five data quality (DQ) dimensions are relevant for our analysis. These dimensions, which can be applied and tailored for the widest variety of use-cases, are:

1. Timeliness
2. Consistency
3. Uniqueness
4. Validity
5. Completeness

These dimensions provide a reference point for us to further define quantifiable metrics applicable to the class of temporal datasets available on IUDX. We want to refine these metrics in such a way that each of them can be normalised to a value between 0 and 1, with 1 representing the highest quality. The timeliness of data packets is of particular importance for temporal data. It is also important to understand whether the data packets are missing any important data attributes or contain undefined data attributes. Other important parameters that we want to measure are whether the data adheres to the specified data formats and structure, and whether duplicate data is present. Thus, we arrive at the following data quality metrics that form the basis of our Data Quality Assessment framework:

1. Inter-Arrival Time Regularity: This metric conveys how uniform the time interval is between the receipt of two consecutive packets.
2. Data Source Uptime: This metric evaluates whether the data is actively received. An outage in reception may point to an anomalous event, such as device downtime.
3. Duplicate Detection: This metric evaluates the percentage of duplicate data packets that are present in the dataset.
4. Absence of Unknown Attributes: This metric checks whether there are any additional attributes present in the dataset apart from the list of required attributes defined in the schema.
5. Adherence to Attribute Formats: This metric assesses the adherence of the data to its expected format as defined in the data schema.
6. Adherence to Mandatory Attributes: This metric checks whether all the required attributes defined in the schema are present in the dataset.
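
To make the first three metrics concrete, here is a minimal Python sketch of how they could each be reduced to a score between 0 and 1. The scoring rules are assumptions made for illustration, not the formulas used by the IUDX tool: a gap counts as regular if it lies within a relative tolerance of the median gap, a gap longer than twice the median counts as an outage, and a packet counts as a duplicate if its key fields repeat an earlier packet. The function names, the `tolerance` and `outage_factor` parameters, and the field names `id` and `observationDateTime` are hypothetical.

```python
from datetime import datetime, timedelta
from statistics import median


def inter_arrival_times(timestamps):
    """Seconds between consecutive packets, after sorting by time."""
    ordered = sorted(timestamps)
    return [(b - a).total_seconds() for a, b in zip(ordered, ordered[1:])]


def regularity_score(timestamps, tolerance=0.1):
    """Fraction of gaps within a relative `tolerance` of the median gap (assumed rule)."""
    gaps = inter_arrival_times(timestamps)
    if not gaps:
        return 1.0
    typical = median(gaps)
    regular = sum(1 for g in gaps if abs(g - typical) <= tolerance * typical)
    return regular / len(gaps)


def uptime_score(timestamps, outage_factor=2.0):
    """Share of the observed time span that was not spent in an outage.

    A gap longer than `outage_factor` times the median gap is treated as an
    outage, and the excess over the median gap is counted as downtime
    (assumed rule).
    """
    gaps = inter_arrival_times(timestamps)
    total = sum(gaps)
    if total == 0:
        return 1.0
    typical = median(gaps)
    downtime = sum(g - typical for g in gaps if g > outage_factor * typical)
    return 1.0 - downtime / total


def duplicate_score(packets, key_fields):
    """1 minus the fraction of packets whose key fields repeat an earlier packet."""
    if not packets:
        return 1.0
    seen = set()
    duplicates = 0
    for packet in packets:
        key = tuple(packet.get(field) for field in key_fields)
        if key in seen:
            duplicates += 1
        else:
            seen.add(key)
    return 1.0 - duplicates / len(packets)


# Hypothetical sample: packets expected every 10 s, with one 40 s gap.
start = datetime(2023, 1, 1)
sample_times = [start + timedelta(seconds=s) for s in (0, 10, 20, 30, 70, 80)]

# Hypothetical packets; the third repeats the second.
sample_packets = [
    {"id": "sensor-1", "observationDateTime": "2023-01-01T00:00:00"},
    {"id": "sensor-1", "observationDateTime": "2023-01-01T00:00:10"},
    {"id": "sensor-1", "observationDateTime": "2023-01-01T00:00:10"},
    {"id": "sensor-1", "observationDateTime": "2023-01-01T00:00:20"},
]

regularity = regularity_score(sample_times)
uptime = uptime_score(sample_times)
uniqueness = duplicate_score(sample_packets, ["id", "observationDateTime"])
```

Using the median gap as the reference keeps the sketch independent of any declared reporting interval. If a data source publishes its expected interval, that value could be used in place of the median.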

Fig.1: The Data Quality Metrics
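
The three schema-based metrics (4 to 6 in the list above) can be sketched in a similar way. The Python example below is illustrative only. It assumes a simplified, hypothetical schema expressed as a list of required attribute names plus a mapping from attribute name to expected Python type, and it scores each metric as the fraction of packets (or attribute values) that pass the check. An actual schema would likely be richer, for example a JSON Schema document, and the IUDX tool may compute these metrics differently.

```python
def unknown_attribute_score(packets, known_attributes):
    """Fraction of packets that contain no attribute outside `known_attributes`."""
    if not packets:
        return 1.0
    known = set(known_attributes)
    clean = sum(1 for packet in packets if set(packet) <= known)
    return clean / len(packets)


def mandatory_attribute_score(packets, required_attributes):
    """Fraction of packets that carry every required attribute with a non-null value."""
    if not packets:
        return 1.0
    complete = sum(
        1
        for packet in packets
        if all(packet.get(name) is not None for name in required_attributes)
    )
    return complete / len(packets)


def format_score(packets, expected_types):
    """Fraction of present attribute values whose type matches the expected one."""
    checked = 0
    valid = 0
    for packet in packets:
        for name, expected in expected_types.items():
            if name in packet:
                checked += 1
                if isinstance(packet[name], expected):
                    valid += 1
    return valid / checked if checked else 1.0


# Hypothetical schema: attribute names and types are made up for this sketch.
EXPECTED_TYPES = {"id": str, "observationDateTime": str, "speed": (int, float)}
REQUIRED = ["id", "observationDateTime"]

schema_packets = [
    {"id": "a", "observationDateTime": "2023-01-01T00:00:00", "speed": 12.5},
    {"id": "b", "observationDateTime": "2023-01-01T00:00:10", "speed": "fast"},
    {"id": "c", "speed": 7, "debug": "on"},  # unknown attribute, missing timestamp
    {"id": "d", "observationDateTime": "2023-01-01T00:00:30"},
]

no_unknown = unknown_attribute_score(schema_packets, EXPECTED_TYPES)
mandatory = mandatory_attribute_score(schema_packets, REQUIRED)
formats = format_score(schema_packets, EXPECTED_TYPES)
```

Here the format check only compares Python types. Checking value formats, such as whether a timestamp string is valid ISO 8601, would need an additional parsing step.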

These metrics cover the majority of the quantifiable attributes associated with temporal data. At IUDX, we are working to provide users with metrics to measure data quality so that they can make an informed choice while developing applications. This also enables data providers to receive feedback on the quality of the data generated by their systems and take remedial action if needed. To make this process more accessible for the end-user, IUDX has developed a tool that generates PDF reports describing the quality of a dataset based on the metrics defined above. We are also working to integrate data quality reporting within the IUDX catalogue. In the future, data quality could be offered as a service, which means that an exchange consumer could request a data quality assessment on demand.
