Integration of Data Quality Metrics with Machine Learning Models for Robust Decision-Making Systems

Robinson Ethan Clark

Authors

Robinson Ethan Clark, Data-Centric AI Reliability Engineer, USA

Keywords:

Data Quality, Machine Learning, Robustness, Decision-Making, Data Drift, Model Monitoring, Data-Centric AI

Abstract

The efficacy of Machine Learning (ML) models is fundamentally constrained by the quality of the data upon which they are trained and deployed. This paper explores the critical integration of systematic Data Quality (DQ) metrics into the ML pipeline to build robust decision-making systems. We argue that moving beyond ad-hoc data cleaning to a continuous, metrics-driven assessment of data across its lifecycle—encompassing dimensions such as accuracy, completeness, consistency, and timeliness—is essential for improving model performance, fairness, and reliability. We present a framework for embedding DQ checks at key stages: pre-modeling (data profiling), in-training (monitoring for drift and anomalies), and post-deployment (feedback loops). Through conceptual diagrams and analysis, we demonstrate how quantified DQ scores can be used to trigger automated remediation, weight training instances, enrich feature sets, and provide interpretable diagnostics for model predictions. This integration fosters greater trust in AI systems by making their dependence on data quality explicit and manageable.
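As a minimal sketch of how a quantified DQ score might feed the pipeline described above, the Python snippet below computes one dimension (completeness) per row and per dataset, then uses it to weight training instances and to gate a remediation step. The function names, the field list, the toy rows, and the 0.9 threshold are all illustrative assumptions and are not taken from the paper.

```python
from typing import Optional, Sequence

# A row maps field names to values; None marks a missing value (assumption).
Row = dict[str, Optional[float]]


def row_completeness(row: Row, fields: Sequence[str]) -> float:
    """Fraction of the expected fields that are present (not None) in one row."""
    present = sum(1 for f in fields if row.get(f) is not None)
    return present / len(fields)


def dataset_completeness(rows: Sequence[Row], fields: Sequence[str]) -> float:
    """Mean row completeness across the dataset."""
    return sum(row_completeness(r, fields) for r in rows) / len(rows)


def needs_remediation(score: float, threshold: float = 0.9) -> bool:
    """Hypothetical gate: flag the batch when its DQ score is below threshold."""
    return score < threshold


# Toy data, invented purely for illustration.
FIELDS = ["age", "income", "tenure"]
rows = [
    {"age": 34.0, "income": 52000.0, "tenure": 3.0},
    {"age": None, "income": 61000.0, "tenure": 5.0},
    {"age": 45.0, "income": None, "tenure": None},
]

# Per-row scores double as instance weights; the dataset score drives the gate.
weights = [row_completeness(r, FIELDS) for r in rows]
score = dataset_completeness(rows, FIELDS)

if needs_remediation(score):
    print(f"completeness {score:.2f} is below threshold: trigger remediation")
```

Under these assumptions, the `weights` list could be passed to a training routine that supports per-instance weights, for example the `sample_weight` argument that many scikit-learn estimators accept in `fit`. Other dimensions named in the abstract (accuracy, consistency, timeliness) would need their own scoring functions.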

