India is home to more than a billion people who live across 29 States, speak more than a thousand languages and practise as many religions as are known to mankind. We are divided into States, districts, sub-districts, blocks, gram panchayats, municipalities and so on.

This diversity in the number of people, and in their cultural, economic and social ideologies, produces an enormous amount of unstructured and incoherent data, spread across a million datasets.

This data has the potential to solve numerous problems we face as a nation today. Some of the major scams unearthed have come to light courtesy of the Right to Information (RTI) Act, which has brought previously inaccessible data into the public domain.

As more and more data is brought forward, governance is bound to become more transparent. However, to utilise this power, it is imperative that we democratise this data.

How to democratise

Democratisation in the real sense would mean that anybody equipped with an everyday device such as a mobile phone, tablet or a basic computer can draw interesting and insightful analysis from these datasets. However, those who have got their hands dirty in data analysis realise that a large amount of time is spent in cleaning, standardising and linking these datasets.

Experts put this number at 70 per cent of the total time. Data in the current state is unstructured, spread across multiple formats, disaggregated at several levels and is fraught with errors.

The first and foremost hurdle is the absence of a data ecosystem in India. A large amount of data development activity is still taken up solely by the Government. However, its incentives are not in line with the task of clean and complete data development.

There is scarce involvement of the private sector in data compilation, generation and maintenance. In the developed nations, data is considered premium and we find several reputable organisations involved in the task of data curation. To get this system right, the private sector should be incentivised to take up this task in a big way.

The second hurdle is that most of the data collected is essentially extrapolated from the data available for earlier years. To ensure that the figures pass review without raising any eyebrows, a simple compounding of 10-15 per cent is applied year on year.

However, if the base year data is faulty then the data extrapolated on top of it will automatically be faulty. Hence it is necessary that there is a paradigm shift in the mindset of reviewers and we enable the public at large to challenge these datasets.
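The arithmetic behind this point can be illustrated with a minimal sketch. The figures below are made up for illustration and are not drawn from any real dataset: a base-year value of 1,000 is assumed to be the truth, 1,200 is assumed to be what was recorded, and a 12 per cent growth rate is picked from the 10-15 per cent range mentioned above. The function name `extrapolate` is hypothetical.

```python
def extrapolate(base_value, annual_growth, years):
    """Return the series obtained by compounding base_value once per year."""
    series = [base_value]
    for _ in range(years):
        series.append(series[-1] * (1 + annual_growth))
    return series


# Illustrative (made-up) numbers: the true base is 1000, but 1200 was recorded.
true_series = extrapolate(1000.0, 0.12, 5)
faulty_series = extrapolate(1200.0, 0.12, 5)

for year, (true_value, faulty_value) in enumerate(zip(true_series, faulty_series)):
    abs_error = faulty_value - true_value
    rel_error = abs_error / true_value
    print(f"year {year}: absolute error {abs_error:8.1f}, relative error {rel_error:.0%}")
```

Under these assumptions the relative error never shrinks, and the absolute gap widens every year, so no amount of careful compounding repairs a wrong base.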

The third hurdle is the process of data collection, which is done by multiple agencies at different times over several years. This introduces inconsistency in the data terminologies and methodologies deployed to collect the data. The data collection process is still primarily pen- and paper-based.

Improve methodologies

Collecting data on paper adds a digitisation step that is both costly and a source of several errors.

An immediate solution to this could be to deploy the latest data collection methodologies via mobile phones, tablets and computers powered by the internet and telecom. The mobile network penetration level is more than 75 per cent at the national level and with the right technology infrastructure this data collection can be made seamless.

The last hurdle is the negligible use of technology in data cleaning and processing. There is a lot of scope for using the latest advancements in technology, such as data mining techniques based on machine learning, logical implementations and natural language processing, powered by Big Data technologies and cloud deployment.

These can help clean and disseminate this data to a wide audience with ease. The combination of the right techniques will help us deliver coherent data ready for analysis.
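As a rough illustration of what automated cleaning and linking can look like, here is a minimal sketch that uses simple fuzzy string matching from Python's standard library, a much lighter technique than the machine learning methods mentioned above. Everything in it is an assumption made for the example: the canonical district list, the two "agency" datasets and their values are made up, and `standardise` is a hypothetical helper.

```python
import difflib

# Hypothetical canonical spellings that both datasets should be mapped to.
CANONICAL_DISTRICTS = ["Bengaluru Urban", "Mysuru", "Belagavi"]


def standardise(name, canonical=CANONICAL_DISTRICTS, cutoff=0.6):
    """Map a messy district name to its closest canonical spelling, or None."""
    cleaned = " ".join(name.strip().split()).title()
    matches = difflib.get_close_matches(cleaned, canonical, n=1, cutoff=cutoff)
    return matches[0] if matches else None


# Made-up records from two agencies that spell the same districts differently.
agency_a = {"Belgavi": 120, " mysuru": 95}      # e.g. number of schools
agency_b = {"BELAGAVI": 4.8, "Mysuru ": 3.0}    # e.g. population in millions

linked = {}
for source, field in ((agency_a, "schools"), (agency_b, "population")):
    for raw_name, value in source.items():
        key = standardise(raw_name)
        if key is None:
            continue  # in practice, unmatched names would be flagged for review
        linked.setdefault(key, {})[field] = value

print(linked)
```

A cutoff that is too low will merge districts that are genuinely different, so a real pipeline would need human review of borderline matches.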

If the democratisation of data is brought about, it will empower us to ask more and more questions of this data which, in turn, will expose hidden loopholes and bring about true transparency in governance.

Policymakers will be able to drive relevant and impactful policy based on true and correct data.

The writer is the co-founder and chief operating officer of InnovAccer Inc
