Which state in India has the highest average deposits per bank branch? Which state gets the most CSR investment? Which districts in India are the most malnourished? Where are the most girl babies missing? How have trends in cultivation changed over the last five decades? Which sector and which state contribute the most to our GDP?
Unless we know answers to such questions, how will we know what our policy, governance and social impact initiatives should work on? You can’t manage what you can’t measure.
In the complex world of wicked social problems and tricky policy conundrums, data provides an objective, scientific and dispassionate means to understand problems and then find solutions.
Let’s appreciate the gargantuan nature of data collection, management and analysis in a country such as ours: vast, populous, ancient, complex, multi-lingual, democratic, newly literate, and with fragile infrastructure. Imagine the data management for 1.3 billion Indians scattered over 3.2 million sq km, with a variety of tongues and varying literacy, livelihoods, food choices, financial conditions, agricultural practices and health parameters. Then add industries and companies: the businesses they are in, the taxes they file, the number of people they employ, the value they add and much more. Sounds daunting? Well, this is just the tip of the databerg that we as a country maintain. It is worth recognising the enormous work done to put in place data and statistical mechanisms that produce and maintain large databases. Thanks to this, we at least have numbers to fight over today.
Government data is meant to provide an accurate picture of the state of the nation. This data is important to multiple stakeholders, including businesses and civil society groups. Government data is the most comprehensive data set available, is free (especially with the Open Data Policy), is the recognised official data of the country and is immense.
Does that mean that all is well on the data front? No, there are huge challenges to solve. Let us look at the entire data management exercise in four key stages:
Data collection: While this is being done well, two things warrant attention here:
1. Technology tools and training for the field staff who actually measure and record data
2. Technology-enabled automated recording of data
Database management: This needs the most reform and action. A central data architecture that guides data table definitions, adoption of common standards, protocols and uniform vocabulary across ministries, departments and central/state governments is sadly missing. Silos of data where no datasets sync with each other make meaningful analysis extremely difficult.
Data analysis: With the changing economic structure, the growing role of technology, lower costs and new ways of engagement, many of the historical models of data interpretation and analysis need re-examination.
Publication: Most ministries publish reports with varying degrees of usability. While Niti Aayog has been bringing data and analysis from varied sources together, there needs to be uniformity in data presentation and analysis.
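To make the database management point concrete, here is a minimal sketch of what a shared vocabulary buys. It assumes two departments publish state-wise tables under different spellings of the same state; the mapping table, the function names and all the figures are hypothetical and only illustrate the idea.

```python
# Hypothetical shared vocabulary: variant spellings mapped to one agreed name.
CANONICAL_STATE = {
    "orissa": "Odisha",
    "odisha": "Odisha",
    "pondicherry": "Puducherry",
    "puducherry": "Puducherry",
    "uttaranchal": "Uttarakhand",
    "uttarakhand": "Uttarakhand",
}


def canonical_state(name):
    """Return the agreed spelling for a state name, or the trimmed input if unknown."""
    key = " ".join(name.lower().split())  # lower-case and collapse stray whitespace
    return CANONICAL_STATE.get(key, name.strip())


def join_by_state(left, right):
    """Combine two {state: value} tables, keeping states present in both."""
    left_norm = {canonical_state(k): v for k, v in left.items()}
    right_norm = {canonical_state(k): v for k, v in right.items()}
    return {
        state: (left_norm[state], right_norm[state])
        for state in sorted(left_norm)
        if state in right_norm
    }


# Made-up figures, purely for illustration.
health_table = {"Orissa": 1.0, "Puducherry": 2.0}
banking_table = {"Odisha": 10.0, "Pondicherry ": 20.0}

# Without a shared vocabulary, not a single row lines up.
naive_matches = set(health_table) & set(banking_table)

# With it, both rows sync.
joined = join_by_state(health_table, banking_table)
```

In this toy case the naive match finds nothing in common, while the normalised join recovers both states. A real architecture would fix such vocabularies once, centrally, rather than leave every analyst to rebuild them.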
Considering that the users of data are from all over, usability and user-friendliness become critical aspects to examine. What can be done to improve effective usage?
Provide metadata: Data libraries such as RBI’s provide extensive explanations of each of their data tables. Generally speaking, however, there is no explanation of what the fields mean, how they are measured, or sometimes even what the units are.
Mention time period: There are many datasets where there is no mention of what time period the data pertains to. In the absence of this information, data becomes pretty much unusable.
Explain the changing base: How do you compare data across time periods when the basis of measurement changes? There are multiple grey areas—changing state boundaries, divergent definitions of rural-urban, changing methodology of measuring and many more. Comparability of data over time is lost in such instances unless appropriate explanation of the change is provided.
Better codification: How does one derive the industry classification code from the Corporate Identification Number (CIN) in the famed MCA-21 database? Going by the codification, Infosys, the IT behemoth, would be a health and social work company. No large-scale meaningful analysis is possible when the codification logic no longer works. Similar issues can be found across datasets.
Uniformity in presentation formats: Data is currently published by various departments across ministries without following common guidelines. Each report follows a different format and becomes its own silo. Sometimes a report that has followed one format for a few editions changes the format and the data fields abruptly, with no explanation of how to reconcile the old and the new. Suddenly you are left with data tables that no longer sync.
Eliminate human error: It is difficult to imagine that large-scale manual data entry from the filed reports still takes place. Spelling errors and even misplaced decimal points turn up repeatedly in the same dataset. In one case, an unheard-of company appeared to be spending a few thousand crores on CSR, until it turned out that a decimal point in the wrong place had inflated a few lakhs of CSR spending to a few thousand crores.
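Two of the issues above, the codification embedded in the company identifier (CIN) and misplaced decimal points, lend themselves to simple automated checks. The sketch below is illustrative only. It assumes the commonly described 21-character CIN layout (listing status, five-digit industry code, state code, year of incorporation, company type, registration number); the sample CIN, the company names, the CSR amounts and the 1,000x threshold are all hypothetical.

```python
import re
from statistics import median

# Assumed CIN layout: L/U, 5-digit industry code, 2-letter state code,
# 4-digit year of incorporation, 3-letter company type, 6-digit registration number.
CIN_PATTERN = re.compile(
    r"^(?P<listing>[LU])(?P<industry>\d{5})(?P<state>[A-Z]{2})"
    r"(?P<year>\d{4})(?P<company_type>[A-Z]{3})(?P<registration>\d{6})$"
)


def parse_cin(cin):
    """Split a CIN into named parts, or return None if it does not fit the assumed layout."""
    match = CIN_PATTERN.match(cin.strip().upper())
    return match.groupdict() if match else None


def flag_magnitude_outliers(amounts, factor=1000.0):
    """Return the names whose amount is at least `factor` times the median amount."""
    typical = median(amounts.values())
    if typical <= 0:
        return []  # no meaningful baseline to compare against
    return sorted(name for name, value in amounts.items() if value >= factor * typical)


# Hypothetical CIN, made up for illustration.
parts = parse_cin("U85110KA2001PTC012345")

# Hypothetical CSR spending in rupees lakh; Company E mimics a misplaced decimal point.
csr_spend_lakh = {
    "Company A": 12.0,
    "Company B": 45.5,
    "Company C": 30.0,
    "Company D": 8.0,
    "Company E": 3_500_000.0,
}
suspects = flag_magnitude_outliers(csr_spend_lakh)
```

Here `parts["industry"]` comes out as `"85110"`, the kind of code that lands an IT company under health and social work, and `suspects` contains only `"Company E"`. A flag like this does not prove an error; it only tells a human reviewer where to look before the figure is published.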
None of the above are difficult problems to fix. As Nandan Nilekani says, India will be data rich before it becomes economically rich. Using this unprecedented availability of data effectively and productively is the need of the hour. This requires the definition of a coherent data architecture, database design, standards and protocols in collection, sharing, integrating and publishing, apart from being able to easily perform analytics on the data.
India’s IT companies can be used to rein in the data chaos before it gets out of hand. The recent news that the government is initiating the revamp of its official statistics is a welcome step in the right direction. The data that mainstream business uses, especially on money matters such as taxes, banking and stocks, gets much more attention and therefore speedier improvements. Unfortunately, the data used by the development sector, on subjects such as health, water, gender and climate, is lagging behind in quality improvements.
While a huge push is underway to define privacy and policy for individuals’ data for their own empowerment, can we also pay attention to defining standards and architecture for aggregated data for our nation’s empowerment?
The writer is Head of India Data Insights, a Sattva Consulting initiative.