As digital transformation takes hold, a new generation of applications that depend on a distributed data stack is going into production. These apps require a radically new approach to deal with much higher data volumes and real-time data pipelines. This article is inspired by the book The Digital Transformation by Thomas M. Siebel.
Abstraction to Models to AI
Abstraction is a powerful concept. Consider an example: to calculate the current in a circuit, one does not generally use Maxwell's equations. Instead, one uses an abstraction in the form of resistors, capacitors and inductors (models) derived from those equations. In traditional software engineering too, strong abstraction through modular design and encapsulation helps create maintainable code in which it is easy to make isolated changes and improvements. In computer networks, abstraction into the five-layer model (Physical, Data Link, Network, Transport and Application) helps in designing robust systems, and it made possible the internet as we know it today.
An AI application consists of many subsystems and subprocesses dealing with data collection, data mapping, resource management, serving infrastructure, etc. In fact, the actual AI/ML code is only a tiny fraction of the entire system (Fig 1). A plethora of systems revolve around the core AI/ML code with the aim of gleaning actionable information out of raw data. An abstraction around these multiple systems helps to segregate them and work on them independently. Out of these abstractions evolve models (think resistors and capacitors), creating a model-driven architecture. A model-driven architecture provides abstraction layers so that one does not have to worry about the detailed mechanics of computational processes like ETL, pipeline management, load balancing and encryption (think Maxwell's equations).
The abstraction of an AI application consists of a) Data Collection and Ingestion b) Data Storage c) Data Processing and Analytics d) Data Output, Reporting and Visualization
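As a rough illustration of this four-layer abstraction, the sketch below wires the layers together as plain Python functions. Everything in it is hypothetical: the sensor records, the function names and the use of a simple mean in place of a real AI/ML model are assumptions made only to show how each layer hands its result to the next.

```python
from statistics import mean

# Layer 1: collection and ingestion (hypothetical in-memory source).
def ingest():
    return [
        {"sensor": "pump-1", "temp_c": 71.0},
        {"sensor": "pump-1", "temp_c": 75.0},
        {"sensor": "pump-2", "temp_c": 64.0},
    ]

# Layer 2: storage (a dict keyed by sensor stands in for a real store).
def store(records):
    table = {}
    for rec in records:
        table.setdefault(rec["sensor"], []).append(rec["temp_c"])
    return table

# Layer 3: processing and analytics (a mean stands in for an ML model).
def analyze(table):
    return {sensor: mean(values) for sensor, values in table.items()}

# Layer 4: output, reporting and visualization (a plain-text report).
def report(summary):
    return "\n".join(f"{s}: {v:.1f} C" for s, v in sorted(summary.items()))

def run_pipeline():
    # Each layer only sees the output of the layer before it.
    return report(analyze(store(ingest())))

print(run_pipeline())
# pump-1: 73.0 C
# pump-2: 64.0 C
```

Because each layer only depends on the shape of the data it receives, any one function could be replaced (for example, swapping the dict for a database) without touching the others.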
The 1st Layer - Data Collection and Ingestion
AI is only as good as the data served to it, and there is absolutely no dearth of data. However, this data is often siloed and unstructured. The data collection layer should also be able to handle both streaming and batch data sources.
The data collection layer of an AI stack is composed of software that interfaces with physical devices, as well as web-based services that supply third-party data, ranging from enterprise systems like SAP and PI Server to weather APIs.
The interface point of this abstraction is the Data Aggregator. This is a single point of access for all kinds of data from various sources, such as machine data and employee records. Building such an aggregator is often challenging, since it involves interfacing with multiple systems like ERP systems, PI systems and external data. Identifying these siloed data sources and integrating them using various tools is often the first daunting step in building an AI pipeline.
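One way to picture the Data Aggregator is as a registry of source adapters behind a single read interface. The sketch below is only an assumption about how such an aggregator might look: the class name `DataAggregator`, the `erp_batch` and `sensor_stream` readers and their records are all hypothetical, and real ERP or PI connectors would replace the stand-in functions.

```python
from typing import Callable, Dict, Iterable, Iterator

class DataAggregator:
    """Single point of access over many named sources (hypothetical design)."""

    def __init__(self) -> None:
        self._sources: Dict[str, Callable[[], Iterable[dict]]] = {}

    def register(self, name: str, reader: Callable[[], Iterable[dict]]) -> None:
        # A reader is any callable returning an iterable of records,
        # so batch lists and streaming generators look the same here.
        self._sources[name] = reader

    def read(self, name: str) -> Iterator[dict]:
        for record in self._sources[name]():
            yield {"source": name, **record}  # tag each record with its origin

    def read_all(self) -> Iterator[dict]:
        for name in self._sources:
            yield from self.read(name)

def erp_batch():
    # Stand-in for a batch export from an ERP system.
    return [{"employee_id": 7, "shift": "night"}]

def sensor_stream():
    # Stand-in for a stream of machine readings.
    for reading in (20.5, 21.0):
        yield {"temp_c": reading}

agg = DataAggregator()
agg.register("erp", erp_batch)
agg.register("sensors", sensor_stream)
records = list(agg.read_all())
print(records)
```

Consumers only call `read` or `read_all`, so adding a new silo means registering one more reader rather than changing downstream code.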
The 2nd Layer - Data Storage
ML/AI workloads require efficient data storage to yield rapid results. During training, many parts of the workflow are repeated over time, which calls for low latency and high throughput. Often the data must also be written to disk. Low latency and high throughput ensure that the system's CPUs and GPUs (“The Compute”) are kept fed to their full capacity, so the training process runs faster.
Many of the solutions available for such a system are based on a two-tier architecture: a flash-based, parallel, scale-out file system for active data processing, and storage for capacity and long-term retention. In a traditional organization with a private network, storage is often designed for capacity and long-term retention rather than as a parallel, flash-based system. When building an AI pipeline, it becomes imperative to assess how much in-house storage can be used for AI/ML applications and the ROI of provisioning new flash-based storage facilities.
Another important aspect is the software stack for storage in AI applications. Traditionally, relational (SQL) databases are used for storage in most applications, but they do not always scale well for AI applications. Key-value and other NoSQL stores like Cassandra, HBase and Elasticsearch are often better suited to iterative applications. There is usually no “one size fits all” database that is optimized for all data types. This results in the need for a multiplicity of database technologies including, but not limited to, relational, NoSQL, DFS, key-value stores and graph databases.
The beauty of a model-driven paradigm is that the details of databases and the nuances of data do not matter to the application. We might have a model called relational database that, in turn, serves as a placeholder for any relational database system like Oracle, Postgres, Aurora, Spanner or SQL Server. Similarly, a key-value store model might contain Cassandra, HBase, Cosmos DB or DynamoDB. If, for instance, one has to migrate from Oracle to Postgres, one simply needs to change the API endpoint.
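The placeholder idea can be sketched with a small interface that application code depends on, with interchangeable backends behind it. The example below is a minimal sketch under stated assumptions: `KeyValueModel`, `DictStore`, `SqliteStore`, `open_store` and `save_reading` are hypothetical names, and an in-memory dict and SQLite stand in for production systems such as Cassandra or DynamoDB.

```python
import sqlite3
from typing import Optional, Protocol

class KeyValueModel(Protocol):
    """The model: what the application is allowed to rely on."""
    def put(self, key: str, value: str) -> None: ...
    def get(self, key: str) -> Optional[str]: ...

class DictStore:
    """Backend 1: plain in-memory dict."""
    def __init__(self) -> None:
        self._data = {}
    def put(self, key, value):
        self._data[key] = value
    def get(self, key):
        return self._data.get(key)

class SqliteStore:
    """Backend 2: SQLite table used as a key-value store."""
    def __init__(self, path=":memory:"):
        self._conn = sqlite3.connect(path)
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS kv (k TEXT PRIMARY KEY, v TEXT)"
        )
    def put(self, key, value):
        with self._conn:  # commits on success
            self._conn.execute(
                "INSERT OR REPLACE INTO kv (k, v) VALUES (?, ?)", (key, value)
            )
    def get(self, key):
        row = self._conn.execute(
            "SELECT v FROM kv WHERE k = ?", (key,)
        ).fetchone()
        return row[0] if row else None

BACKENDS = {"dict": DictStore, "sqlite": SqliteStore}

def open_store(endpoint: str) -> KeyValueModel:
    # Migrating backends means changing only this endpoint name.
    return BACKENDS[endpoint]()

def save_reading(store: KeyValueModel, sensor: str, value: float) -> None:
    # Application code: knows the model, not the database behind it.
    store.put(sensor, str(value))

for endpoint in ("dict", "sqlite"):
    store = open_store(endpoint)
    save_reading(store, "pump-1", 73.0)
    print(endpoint, store.get("pump-1"))
# dict 73.0
# sqlite 73.0
```

The design choice here is that `save_reading` never imports a database driver; only `open_store` knows which concrete system sits behind the model.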
The 3rd Layer - Data Processing and Analysis
Buzzwords like machine learning, recommender systems, deep learning, image recognition, sentiment analysis and natural language processing all lie within the gamut of data processing and analysis. Most consider this to be the central component of an AI stack. However, as mentioned before, the actual AI/ML code is only a tiny fraction of the entire system.
Many of the ideas of machine learning and deep learning have been around for decades. However, they are yielding big results now due to increased data availability and cheaper access to computation. The figure shows how the performance of models increases with the amount of data available. Using this data, and with cheap access to computation, we are today able to build larger neural networks and other sophisticated models. The increase in computational power comes with the deployment of powerful CPUs, GPUs and TPUs, aka “The Compute”. The flexibility, power and self-learning capabilities of these algorithms (without needing to create “hand engineered features”), together with an increased amount of data, are what really differentiate the current wave of artificial intelligence.
Data Processing pipeline forms an integral part of this abstraction. This typically involves
From a model-based perspective, the above data processing pipeline and AI algorithms can be built as a ‘Microservice and Application’ that is accessed through APIs and deployed on a private or public cloud, run “on the metal” in a private data center or, in the case of edge analytics, run at the point of data collection itself (within the data capture hardware).
These algorithms are trained using tools and platforms, often called ‘Integrated Development Tools’, like Spark, Jupyter Notebooks, Kubeflow and TensorFlow. Many providers, like AWS and Google Cloud, offer a host of pre-trained AI services for computer vision, natural language processing, recommendations and forecasting. For example, Amazon SageMaker is used to quickly build, train and deploy machine learning models at scale, or to build custom models with support for the popular open-source frameworks.
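To make the idea of an algorithm exposed as a microservice more concrete, here is a minimal sketch using only the Python standard library. The `/predict` route, the JSON payload shape and the `predict` function (which just averages its inputs in place of a trained model) are all assumptions for illustration; a production service would typically use a dedicated web framework and a real model.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def predict(features):
    """Hypothetical stand-in for a trained model: averages the inputs."""
    return sum(features) / len(features) if features else 0.0

def handle_predict(body: bytes):
    """Pure request logic, kept separate from the HTTP transport."""
    try:
        payload = json.loads(body)
        features = [float(x) for x in payload["features"]]
    except (ValueError, KeyError, TypeError):
        return 400, {"error": "expected JSON with a 'features' list"}
    return 200, {"prediction": predict(features)}

class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/predict":
            status, result = 404, {"error": "unknown endpoint"}
        else:
            length = int(self.headers.get("Content-Length", 0))
            status, result = handle_predict(self.rfile.read(length))
        body = json.dumps(result).encode()
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

def make_server(port: int = 0) -> HTTPServer:
    # Port 0 asks the operating system for any free port.
    return HTTPServer(("127.0.0.1", port), PredictHandler)

print(handle_predict(b'{"features": [1, 2, 3]}'))
# (200, {'prediction': 2.0})
```

Calling `make_server(8080).serve_forever()` would start the service locally. Because clients only see the HTTP contract, the same handler could in principle run in a cloud container, on bare metal or on an edge device.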
The 4th Layer - Data Output, Report and Visualization
This encompasses the part of the stack that communicates insights from the data in the form of dashboards, charts and other graphics. An AI pipeline needs to enable a varied and rich suite of data reporting and visualization tools like Excel, Tableau, Oracle BI and Alteryx. Often each functional department within an organization has its own templates and specifications for reports and data visualization. Abstracting the visualization layer helps to convert these templates and specifications into real-time dashboards using products and frameworks like Tableau, ReactJS and AngularJS. An AI stack must be able to incorporate these myriad products and frameworks.
The application suite sits on top of the AI suite, filtering requisite actionable intelligence to dashboards. Typically, these applications are packaged as SaaS (Software as a Service) as shown below.
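A small sketch can show what abstracting the reporting layer might mean in practice: the same insight is rendered through department-specific templates behind one function. The department names, template strings and the `insight` values below are invented for illustration, and real dashboards would be driven by tools like those named above rather than by text templates.

```python
from string import Template

# Hypothetical per-department report templates.
TEMPLATES = {
    "operations": Template("[$sensor] mean temperature: $mean_c C"),
    "finance": Template("$sensor,estimated_cost_usd=$cost_usd"),
}

def render(department: str, insight: dict) -> str:
    # One entry point; the template decides which fields are shown and how.
    return TEMPLATES[department].substitute(insight)

# Hypothetical output of the processing layer.
insight = {"sensor": "pump-1", "mean_c": "73.0", "cost_usd": "120.50"}

print(render("operations", insight))
# [pump-1] mean temperature: 73.0 C
print(render("finance", insight))
# pump-1,estimated_cost_usd=120.50
```

Adding a department then means adding a template, not changing the analytics that produce the insight.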
The Complete AI Stack
Often when companies start developing ML applications, each functional department tries to create its own pipeline for its use case from a combination of microservices from AWS, Google, etc. Thomas M. Siebel, in his book The Digital Transformation, calls this “Do It Yourself AI”. This build-it-yourself approach requires numerous integrations of underlying components that are not designed to work together, resulting in a degree of complexity that overwhelms even the best development teams. Often a better approach is a structured, modular design: a modular pipeline aggregating all company data sources. Such a structured approach avoids duplication of effort across multiple functional departments.
One such example of a model-based AI stack is the C3 AI Appliance Stack. On the left we have the data collection and ingestion abstraction. The storage and compute form part of the storage abstraction of the AI pipeline. The Data Processing and Analysis abstraction is split into stream services, batch services, data services, UI services, etc. Lastly, on the right we have the data output, report and visualization abstraction.
By following such a modular, layered, abstracted approach to building AI pipelines, one can pick and choose microservices from multiple vendors. Once a microservice becomes obsolete, one can simply unplug it and plug in a new one. Such a modular approach frees us from vendor lock-in and proprietary technology, and encourages us to use open-source technologies wherever available. Such a model-based pipeline keeps AI applications ticking with better performance, precision and economic benefit.