Design and Implementation of Data Engineering Workflows to Support Machine Learning and Artificial Intelligence in Real-Time Decision Systems

Authors

  • Rossi Osei Andersen, Research Scholar, China.

Keywords

Data Engineering, Machine Learning Pipelines, Real-Time Decision Systems, Data Workflows, Artificial Intelligence Infrastructure

Abstract

The rapid growth of machine learning (ML) and artificial intelligence (AI) applications has created strong demand for robust data engineering workflows that support real-time data processing and decision-making. Modern organizations increasingly rely on real-time analytics pipelines to turn large volumes of streaming and batch data into actionable insights. However, designing scalable, reliable, and efficient data engineering architectures for ML-driven decision systems remains a significant challenge. This study explores the design and implementation of data engineering workflows tailored to real-time ML and AI systems. It discusses the architectural components, workflow orchestration strategies, data ingestion mechanisms, and model deployment pipelines that real-time decision-making environments require.

The study also reviews existing literature on data pipelines, big data processing frameworks, and AI infrastructure. It proposes a conceptual architecture and workflow model that integrate data ingestion, transformation, feature engineering, and model serving within a scalable environment. The results highlight how automated pipelines and distributed processing frameworks significantly reduce decision latency and improve model performance. The study contributes to the understanding of how well-designed data engineering workflows can enhance operational efficiency and enable reliable real-time AI systems.
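
The four stages named in the abstract (data ingestion, transformation, feature engineering, and model serving) can be illustrated with a minimal in-process sketch. This is not the architecture proposed in the paper, only one possible reading of it under stated assumptions: the `Event` record, the rolling-mean feature, the threshold rule standing in for a trained model, and all function names are hypothetical and chosen for illustration.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, Dict, Iterable, Iterator, List, Tuple


@dataclass
class Event:
    """Hypothetical record shape; the abstract does not define one."""
    entity_id: str
    value: float


def ingest(raw_records: Iterable[dict]) -> Iterator[Event]:
    """Ingestion: parse raw records and skip malformed ones."""
    for rec in raw_records:
        try:
            yield Event(str(rec["entity_id"]), float(rec["value"]))
        except (KeyError, TypeError, ValueError):
            continue  # a real pipeline would route these to a dead-letter store


def transform(events: Iterable[Event]) -> Iterator[Event]:
    """Transformation: drop events with negative values (assumed validity rule)."""
    for ev in events:
        if ev.value >= 0:
            yield ev


class RollingMeanFeature:
    """Feature engineering: per-entity mean over the last `window` values."""

    def __init__(self, window: int = 3) -> None:
        self.window = window
        self._history: Dict[str, Deque[float]] = {}

    def update(self, ev: Event) -> float:
        hist = self._history.setdefault(ev.entity_id, deque(maxlen=self.window))
        hist.append(ev.value)
        return sum(hist) / len(hist)


def serve(feature: float, threshold: float = 10.0) -> str:
    """Model serving: a threshold rule used as a placeholder for a trained model."""
    return "alert" if feature > threshold else "ok"


def run_pipeline(
    raw_records: Iterable[dict], window: int = 3, threshold: float = 10.0
) -> List[Tuple[str, float, str]]:
    """Chain the four stages and emit one decision per valid event."""
    features = RollingMeanFeature(window)
    decisions = []
    for ev in transform(ingest(raw_records)):
        feat = features.update(ev)
        decisions.append((ev.entity_id, feat, serve(feat, threshold)))
    return decisions


if __name__ == "__main__":
    sample = [
        {"entity_id": "a", "value": 4},
        {"entity_id": "a", "value": "bad"},  # malformed, skipped at ingestion
        {"entity_id": "a", "value": 20},
        {"entity_id": "b", "value": -1},     # dropped at transformation
        {"entity_id": "a", "value": 12},
    ]
    for decision in run_pipeline(sample):
        print(decision)
```

Because each stage is a generator or a small stateful object, an event flows through to a decision as soon as it arrives, which is the property a real-time system needs. In a distributed deployment, each stage would typically be backed by a streaming framework rather than run in a single process.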

Published

2024-12-21

How to Cite

Design and Implementation of Data Engineering Workflows to Support Machine Learning and Artificial Intelligence in Real-Time Decision Systems. (2024). International Journal of Computing Science and Systems (IJCSS), 5(1), 22-28. https://ijcss.com/index.php/about/article/view/IJCSS_0501004