The State of Data Quality in 2022: Growth and Complexity
According to a survey conducted by TDWI, 83% of organizations make decisions largely based on data. Organizations also take data quality seriously: 97% said they consider it somewhat or extremely important. That is excellent news, but the task remains challenging.
In addition to evolving data and data flows, organizations depend on more and more data sources, whether in the cloud or on-prem. This might result from mergers and acquisitions, sustained growth over time, or the growing adoption of data democratization and new frameworks such as the data mesh. Such developments pressure governance and engineering teams to find and extract high-quality data from an ever-growing ocean of assets.
As a result, organizations are looking to prioritize, or are already prioritizing, the implementation of data-centric roles, processes, and tools to ensure the delivery of high-quality data. This is great, but there is a catch…
Adopting, or even maintaining, a manual approach to data quality just doesn’t cut it anymore in 2022. We explain why in this blog post.
If you want to get all the details, go ahead and download our whitepaper on automated data quality management.
Manual data quality: how it works and why it’s not sustainable
Data quality has had many iterations, but a traditional, more manual approach is still prevalent in many organizations, including Fortune 500 companies. Manual implementations depend on tools and technologies that require significant coding effort. By combining hand-written SQL rules with spreadsheets that document processes, organizations end up depending on a large number of tables listing data sources, attributes, and standard DQ rules.
Usually, this process looks like this:
- A business team and data stewards write down the requirements (business/DQ rules) in a spreadsheet.
- Developers then implement those business requirements using SQL or other programming languages in various business systems.
- When business requirements change, the business updates the spreadsheet, and developers re-implement rules in all relevant systems.
- When a data source is added to the data governance program, an analyst has to understand the data it contains and communicate to the development team which rules need to be implemented.
Scaling is impossible in this environment. Adding a new data source means adding more developers and maintaining more code, which can significantly impact the overall budget allocated to your DQM strategy. With code-based data quality, everything is done by developers, and it can take a full working day to deploy a single rule.
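To make that cost concrete, here is a hedged sketch of what a single hand-coded rule often looks like: the table, column, and check are hard-wired together, so every new source or changed requirement means another script to write, test, and redeploy. All names here (the `crm_customers` table, the `crm.db` file, the LIKE pattern) are hypothetical, not taken from any specific system.

```python
# Hypothetical example of a manually coded data quality rule: the check is
# hard-wired to one table and one column, so every new source or requirement
# change means another script like this to write, deploy, and maintain.
import sqlite3

EMAIL_PATTERN = r"%_@_%._%"  # crude SQL LIKE pattern for "looks like an email"

def count_invalid_emails(db_path: str) -> int:
    """Count rows in crm_customers whose email fails the validity check."""
    query = """
        SELECT COUNT(*)
        FROM crm_customers
        WHERE email IS NULL OR email NOT LIKE ?
    """
    with sqlite3.connect(db_path) as conn:
        (invalid,) = conn.execute(query, (EMAIL_PATTERN,)).fetchone()
    return invalid

if __name__ == "__main__":
    print("Invalid emails in CRM:", count_invalid_emails("crm.db"))
```

Multiply this by hundreds of rules across dozens of systems, each maintained by hand, and the scaling problem becomes obvious.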
A manual approach is heavily dependent on the following cost drivers, and even if you tackle them one at a time (which is practically impossible), costs still increase substantially:
- Initial rule implementation
- Making a change in business requirements
- A schema change in a system
- Connecting a new data source
What is automated data quality?
“AI-driven DQM automation refers to the use of embedded machine learning models to handle one or more DQM functions reliably, repeatedly, and accurately without needing direct human oversight and assistance.” – James Kobielus, TDWI
Automated, metadata-driven data quality is an important iteration in the evolution of data-focused organizations. It sits on the backbone of a data catalog that maintains an up-to-date version of enterprise metadata, and it combines AI and a rule-based approach to automate all aspects of data quality: configuration, measurement, and the delivery of quality data. Our findings put the adoption rate of AI-driven, metadata-based DQM automation at around 40% of survey respondents.
This adoption rate is also linked to the size and overall generated revenue of each company:
- Over $1 billion in revenue: 50% adoption rate
- $100 million – $999 million in revenue: 36% adoption rate
- Under $100 million in revenue: 23% adoption rate
Automated data quality pays off
Based on our latest State of Data Quality survey and daily interactions with clients, it is clear that automating specific processes can help organizations struggling with data quality. Companies that implement a wide stack of tools to modernize their DQM systems are more successful in managing data quality.
The bottom line is:
Organizations successful in data quality management have, on average, 70% of their processes automated. Moreover, these same organizations are more likely to invest in automation and modernize their data stack. This is a virtuous circle.
How does automated data quality work?
Compared to a traditional, manual approach, metadata-driven DQ is a more complex process, but one that can ultimately be automated. How can companies achieve that automation?
The simple answer is to follow this four-step process (a minimal code sketch follows the list):
- Catalog your data by connecting data sources and discovering data domains such as names, addresses, product codes, etc.
- Set up data quality rules for validation and standardization and map them to specific business domains.
- Automate metadata discovery: to maintain an accurate understanding of data domains, deploy data profiling and classification.
- Revise: continuously review data domain definitions, data quality rules, and AI suggestions for newly discovered data.
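The sketch below is a deliberately simplified illustration of those four steps in plain Python. The table names, domains, rules, and sample values are invented for the example; a real metadata-driven platform would back each step with a catalog, a profiling engine, and a rule library rather than in-memory dictionaries.

```python
# A minimal, illustrative sketch of the four-step flow above. All names and
# structures are hypothetical; a real metadata-driven DQ platform would use a
# catalog, profiling engine, and rule library rather than a few dicts.
import re

# 1. Catalog: data sources and the columns discovered in them.
catalog = {"crm.customers": ["full_name", "email", "postal_code"]}

# 2. Reusable rules mapped to business domains, not to physical columns.
rules_by_domain = {
    "email": lambda v: re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", v or "") is not None,
    "postal_code": lambda v: bool(re.fullmatch(r"\d{5}", v or "")),
}

# 3. Automated discovery: classify columns into domains (name-based here;
#    real systems also profile the data itself).
def classify(column: str) -> str | None:
    for domain in rules_by_domain:
        if domain in column:
            return domain
    return None

# 4. Apply the mapped rule to sample values and report failures for review.
sample = {"email": ["a@b.com", "not-an-email"], "postal_code": ["12345", "ABCDE"]}
for table, columns in catalog.items():
    for column in columns:
        domain = classify(column)
        if domain:
            failures = [v for v in sample.get(domain, []) if not rules_by_domain[domain](v)]
            print(f"{table}.{column} -> {domain}: {len(failures)} failing sample value(s)")
```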
The components of an automated data quality solution
Data Catalog
A data catalog is the primary building block of metadata-driven DQM because it stores connections to data sources, collects metadata from them, and creates an index of data assets. By using automated algorithms and AI, the catalog keeps metadata up to date and infers new metadata. It’s also where authorized users can quickly access a company’s most current and reliable business information.
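As a rough illustration of what the catalog indexes, here is a hypothetical shape of a single catalog entry; the field names are ours for the example, not any specific product's schema.

```python
# Hypothetical shape of a catalog entry: the catalog keeps connection details
# and the metadata collected from each asset, forming a searchable index.
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class CatalogAsset:
    source: str                    # e.g. "postgres://warehouse/crm"
    table: str                     # physical table name
    columns: dict[str, str]        # column name -> inferred business domain
    profiled_at: datetime | None = None
    tags: list[str] = field(default_factory=list)

asset = CatalogAsset(
    source="postgres://warehouse/crm",
    table="customers",
    columns={"email": "email", "zip": "postal_code"},
)
```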
Central Rule Library
In a metadata-driven system, data quality monitoring and enforcement are done by applying reusable data quality rules. Ideally, this is a central rule library, a collaborative ecosystem where business and technical users can define and enforce these rules in no-code or low-code environments.
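One way to picture such a library, under the assumption that rules carry only logic and are keyed by business domain, is a simple registry like the sketch below; the decorator and rule names are illustrative, not a specific product's API.

```python
# A minimal sketch of a central rule library: each rule carries only its logic,
# never a connection string, so the same rule can be bound to any column the
# catalog maps to its domain.
from typing import Callable

RULE_LIBRARY: dict[str, Callable[[str], bool]] = {}

def rule(domain: str):
    """Register a validation function for a business domain."""
    def register(fn: Callable[[str], bool]) -> Callable[[str], bool]:
        RULE_LIBRARY[domain] = fn
        return fn
    return register

@rule("email")
def valid_email(value: str) -> bool:
    return "@" in value and "." in value.split("@")[-1]

@rule("postal_code")
def valid_postal_code(value: str) -> bool:
    return value.isdigit() and len(value) == 5

# Any engine (batch job, streaming check, API) can look up and apply the rules:
print(RULE_LIBRARY["email"]("jane.doe@example.com"))  # True
```

Because the rules know nothing about where the data lives, the same check can be reused for every column classified into that domain.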
Business Glossary
Business glossaries let organizations document their most essential business terms and metadata assets and agree on their meaning, and they are crucial for implementing automation. Because the glossary is connected to both the data catalog and the central rule library, it plays a pivotal role in storing rules for detecting business domains and in mapping DQ rules to those domains.
Data Profiling
Data profiling is the computational part of the system. It detects changes in metadata, assigns business domains to data assets, and calculates statistics about data.
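A minimal, hypothetical profiling pass might look like the following: compute a few statistics per column and compare them with the previous run to flag drift. The statistics, thresholds, and storage are illustrative; real profilers track far richer metadata.

```python
# An illustrative profiling pass: compute simple statistics per column and
# compare them with the previous run to flag metadata drift.
from statistics import mean

def profile(values: list[str]) -> dict:
    non_null = [v for v in values if v not in (None, "")]
    return {
        "row_count": len(values),
        "null_ratio": 1 - len(non_null) / max(len(values), 1),
        "distinct_ratio": len(set(non_null)) / max(len(non_null), 1),
        "avg_length": mean(len(v) for v in non_null) if non_null else 0.0,
    }

previous = {"null_ratio": 0.01}
current = profile(["12345", "98765", "", "ABCDE"])

# Flag a change worth reviewing, e.g. a jump in the share of missing values.
if current["null_ratio"] - previous["null_ratio"] > 0.1:
    print("Null ratio drifted:", current["null_ratio"])
```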
The benefits of automated data quality
The need to scale and future-proof your DQM strategy stems from the growing complexity and volume of data. You need a system that can scale to process anything from terabytes to exabytes of data and accommodate the addition of more and more sources. Here are the most important benefits of automating the delivery of high-quality data with active metadata:
Automation saves time
Automation itself is the main benefit because it saves so much time. As soon as data assets are connected to the platform, the system runs automated data profiling, classification, and discovery processes. The information is then immediately assessed based on newly discovered or updated metadata.
Reusability
All configurations, rules, and subroutines are reusable and centrally defined in the rule library. You don’t have to configure the same rule twice: the rule carries only the logic, not the connection to the data source.
Less Resource Intensive
An automated DQ process relies on fewer people. There are no longer separate data quality rules to be reconfigured by an ever-growing team of developers. It means the end of manual data classification, such as linking tables to domains and adding notes, and of endless code maintenance.
Scalable & Future Proof
Because data is growing in complexity and volume, organizations need a system that can scale and process any current and future data sources and data types.
Flexible Delivery
Data quality powered by metadata allows results to be consumed at any level: from a single table to a business domain or an entire data source. On top of that, it enables the delivery of results in batch, real-time, or streaming modes.
Key Takeaways
Automation is the future of data management, and metadata plays a critical role in it. Whether you’re building a data platform that depends on metadata, like a data fabric, or just starting with your data quality management implementation, a metadata-driven DQ system is the more practical approach. It will help you save time and let you deploy repeatable data quality processes faster.