Many of the keystone applications running a “smart” city will require data sharing between parties with heterogeneous interests. As an example, let’s imagine an “intelligent” transport management system (ITMS) that might optimise end-to-end transit times, measure and mitigate air pollution, manage city-wide transport carbon emissions, minimise traffic congestion and signal waiting times, support the planning and building of new infrastructure, and so on. Such a system would use data – regarding trip origins and destinations, estimated arrival times, traffic densities, ongoing roadworks, air pollution, etc. – shared between individuals, cab aggregators, last-mile transport providers, public bus operators, city planners, and others.
Privacy concerns emerge as soon as data is shared. While it is possible to anonymise data before releasing it to another party (or to the public), how can we ensure that downstream analyses, especially those leveraging powerful machine learning and computational capability, cannot be used to compromise the privacy of individuals and businesses? Such concerns also arise in the context of data that is “held” (and not shared) by companies. Breaches of such data expose businesses to liability and risk when they violate the confidentiality of individuals – even if the data is held in an “anonymised” form.
“Differential Privacy” (DP), considered by some to be the “gold standard” for privacy, addresses these types of issues. It is a framework that allows datasets and analyses to be released in a manner that preserves individual privacy in a quantified way, within a “budget”. It achieves this through the addition of controlled “noise” to the data or analysis before release. Noise is usually the bane of information systems – consider the unpleasantness of interference in a telephone call! – but in DP it is harnessed for a useful purpose – namely, to mask the identity of the individuals who contributed to the dataset.
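To make the idea of “noise within a budget” concrete, here is a minimal sketch of the Laplace mechanism applied to a simple counting query. The function name and parameters are illustrative rather than taken from any particular library; the parameter `epsilon` plays the role of the privacy budget, with smaller values giving stronger privacy at the cost of noisier answers.

```python
import numpy as np

def laplace_count(true_count, epsilon):
    """Release a count with Laplace noise scaled to the privacy budget epsilon.

    A counting query has sensitivity 1: adding or removing one individual
    changes the true count by at most 1. Drawing noise from Laplace(0, 1/epsilon)
    then gives an epsilon-differentially-private release; a smaller epsilon
    (a tighter budget) means more noise and stronger privacy.
    """
    sensitivity = 1.0
    noise = np.random.laplace(loc=0.0, scale=sensitivity / epsilon)
    return true_count + noise

# Illustrative use: suppose 132 respondents in the raw data answered “yes”.
print(laplace_count(132, epsilon=0.5))   # e.g. 129.4 -- differs from run to run
```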
This “noise” can be added at various points in the creation and processing of data, as illustrated in the figure above depicting a citizen survey. Let’s say that the survey in question is for a research study on driving habits. There may be a question asking whether a particular respondent has ever violated a red-light signal. Some respondents might not want this fact about them to be known, and might be wary of revealing potentially incriminating information. The challenge then is – how do we collect this information, say towards researching and developing better traffic management outcomes – while assuring respondents that their privacy will be protected? “Randomized response” (RR) is a technique by which the “true” response of a survey participant is masked so as to offer them “plausible deniability” if required. While RR has been in use since the 1960s, it can be viewed as a specific instance of a “mechanism” for Differential Privacy. Like all “mechanisms” that implement DP, RR operates within a privacy budget, related to the degree to which the true responses are perturbed.
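A minimal sketch of classic randomized response, assuming fair coin flips (the function names are illustrative). Each respondent’s recorded answer may be the truth or the outcome of a coin flip, which provides the plausible deniability described above, yet the analyst can still recover a good estimate of the overall rate.

```python
import random

def randomized_response(true_answer: bool) -> bool:
    """Classic randomized response with fair coins.

    With probability 1/2 the respondent answers truthfully; otherwise they
    report the outcome of a second coin flip. Any single "yes" can therefore
    be blamed on the coin. With fair coins this corresponds to a privacy
    budget of epsilon = ln(3).
    """
    if random.random() < 0.5:
        return true_answer           # report the truth
    return random.random() < 0.5     # report a random answer

def estimate_true_rate(responses):
    """Unbiased estimate of the underlying "yes" rate from the noisy answers.

    Observed rate = 0.5 * true_rate + 0.25, so true_rate = 2 * observed - 0.5.
    """
    observed = sum(responses) / len(responses)
    return 2 * observed - 0.5

# Simulate 10,000 respondents, 30% of whom have actually run a red light.
truths = [random.random() < 0.3 for _ in range(10_000)]
noisy = [randomized_response(t) for t in truths]
print(estimate_true_rate(noisy))   # close to 0.30, yet no individual answer is trustworthy on its own
```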
Even when plausible deniability is not the main concern, thorny privacy violations can arise from seemingly innocuous data releases. Latanya Sweeney, in her landmark article “Only You, Your Doctor, and Many Others May Know”, described how she successfully carried out several “re-identification attacks”, including the identification of the medical records of the then-Governor of the US State of Massachusetts. She accomplished this by combining a publicly available voter list with a publicly available dataset of medical information about state employees. The medical dataset had been “de-identified” according to the best practices of the time – names, addresses and the like had been removed – but birth date, gender and ZIP (postal) code remained. Sweeney found that the Governor’s combination of these attributes was unique in both datasets; she was consequently able to locate his specific medical records despite the de-identification.
It should be emphasised that both datasets used in the attack were released with legitimate intent. Voters’ lists are commonly made public in many jurisdictions in order to aid transparency, allow public scrutiny to correct errors and omissions, and so on. The medical dataset, for its part, had been released for research purposes by the Massachusetts Group Insurance Commission, and contained patient diagnoses along with some demographic information. Since the identities of individual patients were not present in the dataset, the publication was considered “harmless”.
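The mechanics of such a linkage attack are, at their core, just a join on the shared quasi-identifiers. Here is a minimal sketch; the column names and records below are entirely invented for illustration and are not drawn from the actual datasets.

```python
import pandas as pd

# Hypothetical fragments of two public releases; all values are made up.
voter_list = pd.DataFrame({
    "name":       ["A. Citizen", "B. Resident"],
    "birth_date": ["1950-01-15", "1962-03-14"],
    "sex":        ["M", "F"],
    "zip":        ["02138", "02139"],
})

medical_data = pd.DataFrame({
    "birth_date": ["1950-01-15", "1962-03-14"],
    "sex":        ["M", "F"],
    "zip":        ["02138", "02142"],
    "diagnosis":  ["condition X", "condition Y"],
})

# Names were removed from the medical release, but birth date, sex and ZIP
# remain. Joining on those quasi-identifiers re-attaches an identity to any
# record whose combination of attributes is unique in both tables.
linked = voter_list.merge(medical_data, on=["birth_date", "sex", "zip"])
print(linked[["name", "diagnosis"]])   # -> A. Citizen, condition X
```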
In her article Sweeney describes other attacks, including automated approaches that combine different sources of readily available information to perform re-identification. Her research was suppressed for a decade in the academic world, for fear of misuse in the absence of an appropriate mitigation. Despite this, it was hugely influential in the policy world, and had a direct impact on US health privacy policy under the Health Insurance Portability and Accountability Act of 1996 (HIPAA).
Sweeney herself identifies Differential Privacy, created in 2006 by Cynthia Dwork and her collaborators, as the tool that “guarantees limited re-identification”. As mentioned previously, DP allows datasets and analyses to be released while mitigating re-identification, through the controlled application of noise before the query result is released. The specific “randomization mechanism” used depends on the type of query, the form in which the data is released (is it the whole table? aggregated statistics? an image?) and the intended downstream application(s).
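As one illustration of how the mechanism is calibrated to the query rather than being one-size-fits-all, here is a hedged sketch of a differentially private mean, to contrast with the count example earlier. The function name and clipping bounds are assumptions for illustration; the key point is that the sensitivity, and hence the noise scale, is derived from the query itself (here, the value range divided by the number of records).

```python
import numpy as np

def dp_mean(values, lower, upper, epsilon):
    """Sketch of an epsilon-DP mean over values clipped to [lower, upper].

    For a mean of n clipped values, one individual's record can shift the
    result by at most (upper - lower) / n, so the Laplace noise is scaled to
    that sensitivity instead of the sensitivity of 1 used for a simple count.
    (This sketch treats the number of records n as public.)
    """
    clipped = np.clip(values, lower, upper)
    sensitivity = (upper - lower) / len(clipped)
    noise = np.random.laplace(0.0, sensitivity / epsilon)
    return float(np.mean(clipped)) + noise

# Hypothetical query: average commute time in minutes, clipped to [0, 120].
print(dp_mean([25, 40, 55, 90, 15], lower=0, upper=120, epsilon=1.0))
```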
There are many emerging applications of the framework. The first large-scale application of DP was in the ‘Disclosure Avoidance System’ of the 2020 US Census. The US Census Bureau is obligated by law to protect the confidentiality of individuals and businesses; while the Bureau has a long history of protecting released data through various measures, striking a balance between the “confidentiality” and the “accuracy” of released statistics is seen as a challenging task, especially in a landscape of high-performance computing, big data, AI, and emerging quantum computing technologies.