Monitoring AI Pipelines' Output as a Product

Hila Fox
Machines talk, we tech.
5 min read · Apr 23, 2021


Our Mission

Augury’s mission is to guarantee “Machine Health” for manufacturing plants, and we do that by monitoring machines. On each machine we install an in-house IoT device whose sensors continuously sample specific metrics and upload the data to the cloud. This data is passed through our AI pipelines to create specific diagnoses of the machine’s health, and these AI-driven diagnoses are then passed on to our customers.

If it wasn’t clear, the AI-driven diagnosis is the core of our product. It enables us to give our customers fast, high-quality feedback with as little manual work as possible.

Where do I come into the picture

I’m part of a squad that consumes the AI-driven events from our pipelines and distributes them across our ecosystem for different usages. For us, these events are not just a technical concept but the product we build our features around. On top of distributing the insights to our customers, we also distribute them to our in-house analysts. The analysts use this data to make the clear-cut decisions and communicate them to our customers; they use tailor-made graphs for their analysis, and their insights feed the training of our ML models.

Our main flow consumes the events, decides which event is relevant to which product, and moves it onwards. This means we depend on the liveness of the pipelines (a consumer-supplier relationship). The pipelines’ liveness is owned by the Algo and Data teams, but as a product squad we want visibility into our own product’s liveness.
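To make the shape of that flow concrete, here is a minimal sketch of the consume-and-route step. It is only an illustration: the event fields, destination names, and publisher interface are assumptions, not our actual implementation.

```python
from typing import Any, Dict

def route_event(event: Dict[str, Any], publisher) -> None:
    """Decide which product an AI-driven event is relevant to and move it onwards."""
    # Illustrative rule only; the real decision uses data generated in the pipelines.
    if event.get("needs_analyst_review"):
        # Routed to the in-house analysts' tooling (tailor-made graphs, ML training feedback).
        publisher.publish("analyst-insights", event)
    else:
        # Routed to the customer-facing product as an AI-driven diagnosis.
        publisher.publish("customer-diagnoses", event)
```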

Types of monitoring

According to the Google SRE book, there are two types of monitoring:

  • White box monitoring — Monitoring based on metrics exposed by the internals of the system, including logs, interfaces like the Java Virtual Machine Profiling Interface, or an HTTP handler that emits internal statistics.
  • Black box monitoring — Testing externally visible behavior as a user would see it.

To monitor the interaction between the two domains, we need to treat the output of the AI pipelines as a black box. In my domain (the downstream), we don’t really care whether events stop arriving because of a CPU problem in one of the pipelines, or because a change was made and one of the output variables now exports the wrong information. All we care about is how it affects our end product and the data we produce from it.

Decide on metrics

Our downstream events are pretty consistent, meaning we can expect a specific load within X amount of time. We also have a main flow with well-defined steps. Each step has product meaning as well as a technical output: an async event, a log, or even persisted data. So I started by adding a new metric reported to Prometheus: a simple counter that increments by 1 at each step, with a tag holding the step’s name. To improve our visibility further, I added an extra tag with the pipeline name, which helped us differentiate between issues in different pipelines.

The first step was called “init” and meant “flow consumed event”; let’s say the rest of the steps are called A through Z. This enabled me to see the patterns in our consumption rates and which pipeline is “heavier”. The other steps helped us understand how that data is distributed in the system between the different pipelines.
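As a rough sketch, this is what such a counter can look like with the Python prometheus_client library. The metric name, label names, and step names are illustrative, and the handler is simplified to the point of pseudocode around the business logic.

```python
from prometheus_client import Counter, start_http_server

# One time series per (pipeline, step) combination; names are illustrative.
FLOW_STEPS = Counter(
    "ai_flow_steps_total",
    "Events that reached each step of the main flow",
    ["pipeline", "step"],
)

def handle_event(event):
    pipeline = event["pipeline_name"]  # assumed field on the consumed event

    # "init" means the flow consumed the event.
    FLOW_STEPS.labels(pipeline=pipeline, step="init").inc()

    # ... business logic for step A ...
    FLOW_STEPS.labels(pipeline=pipeline, step="step_a").inc()

    # ... and so on through the remaining steps ...

if __name__ == "__main__":
    # Expose /metrics so Prometheus can scrape the counters.
    start_http_server(8000)
```

In Grafana, graphing the per-pipeline rate of this counter makes the consumption patterns, and the “heavier” pipelines, visible at a glance.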

It smells like coupling

I’m sure you can smell it: we have coupling between the Algo team and our flow. Between the Algo and Data teams, more than 10 people work on the pipelines or their output. This means that a bug in their domain has an impact on our domain, and a bug in the product can mess up the end goal of an AI diagnosis.

Moreover, the event we consume is packed with data, meaning that every change to its schema forces a change in our domain. In the flow we make decisions based on data that is generated in the pipelines.

Due to this coupling, we work closely with the Algo and Data teams to keep quality high on all sides. It’s pretty common for Algo initiatives to be finalized in our domain, and for initiatives on our side to require their expertise. This means we should all be mindful of the performance and quality of the end product produced from the AI events.

Read more on how we plan to improve consistency between the domains using Protobuf.

Being proactive

After the metrics were in place, it was time to create dashboards and alerts in Grafana. Because the things that impact performance the most are the pipelines and changes to them, we decided the best approach was a dashboard per pipeline, with alerts tuned to its scale: alerting when we have too many or too few events at each expected step. This enables each data scientist to monitor their own projects and defines clear ownership. All our alerts report to a Slack channel monitored by the Algo and Data teams and by my squad, so everyone involved can engage in a single, specific channel.
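As an example of the “too few events” case, the alert condition boils down to a threshold on the counter’s increase over a time window. The PromQL below is only a sketch, assuming the illustrative metric and label names from the earlier snippet; the pipeline name, window, and threshold are made up, and the “too many events” case simply flips the comparison.

```promql
sum(increase(ai_flow_steps_total{pipeline="pipeline_a", step="init"}[1h])) < 100
```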

Taking it to the next step

To enable better debugging, we added deployment tags to our pipelines and relevant services, and we also post a report to Slack on each deployment. Together, these make it easy to identify what happened when, and who made the change.
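For illustration, a deployment report can be as small as a message posted to the shared Slack channel from the deploy job. The sketch below assumes a Slack incoming webhook is configured; the environment variable, service name, and message format are made up for the example.

```python
import os
import requests

# Assumed: a Slack incoming webhook pointing at the shared monitoring channel.
SLACK_WEBHOOK_URL = os.environ["SLACK_DEPLOY_WEBHOOK_URL"]

def report_deployment(service: str, version: str, author: str) -> None:
    """Post a short deployment report so alerts can be correlated with changes."""
    text = f"Deployed {service} {version} (by {author})"
    response = requests.post(SLACK_WEBHOOK_URL, json={"text": text}, timeout=5)
    response.raise_for_status()

if __name__ == "__main__":
    # Typically called from the CI/CD pipeline right after a deploy finishes.
    report_deployment("main-flow-consumer", "1.2.3", "hila")
```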

To Summarize

In my opinion, our Algo and Data teams have one of the most important jobs in the company: creating and maintaining our insight lifeline. While they monitor queue backlogs, CPU consumption, and more, I verify the distribution within the system and that the insights reach our customers and analysts.

Sometimes, coupling is inevitable. To stay on top of things as they happen and to make sure we can scale well, we built the right dashboards and notification channels based on the metrics that represent the actual output of our AI engine.


Hila Fox

Software Architect @ Augury. Experienced with high-scale distributed systems and domain-driven design.