Where Software Architecture Meets Ownership

Hila Fox
Machines talk, we tech.
6 min read · Sep 18, 2021


In the past few weeks, I have led an initiative that has caused me to tackle some important questions. These questions relate to software architecture and ownership between squads or even fleets (R&D groups in the Augurian world). But before we dive into the details, I want to share what I do and how the domain I work in is structured. BTW the diagrams in this post are kept simple to avoid over-complication.

How it is today

Currently, Augury employs me as a backend developer and squad leader. Our squad is responsible (amongst other things) for the consumption of our AI engine insights (which we call ‘detections’) and dispatching them to relevant consumers. Our current architecture looks like this:

And this is how the ownership is currently structured, meaning the health fleet (the R&D group my squad is in) is the owner of the detection management and all other features. Specifically, my squad is the owner of the detection management service.

The core of Augury is the AI engine and the insights it generates. Our customers are paying us for these insights and how we alert them on issues, which means that the recall and precision of this entire flow are crucial for us. We are always looking for ways to improve it and make it scalable as the company grows.

Now, this all looks pretty reasonable. But over time the detection management service has taken on more and more features, which has turned it into a complicated service with several responsibilities.

A new opportunity

One step forward, the AI fleet saw a great opportunity to enrich and improve our detection flow by adding new logic layers, and this is where it gets complicated.

As you can see in the diagram above, our detection flow consists of several steps, and I have listed the important ones. We:

  • Validate new event — meaning that the AI engine is recognizing something we were not aware of yet.
  • Smart logic layer — we have logic that relies on different entities/states and filters out irrelevant detections.
  • Persist and send event — if a detection passes all of the steps, meaning it’s relevant, we will persist it in the service as a source of truth and send out a “detection.propagated” event so all the consumers can act accordingly.
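To make the steps above concrete, here is a minimal Python sketch of such a flow. It is an illustration only, not Augury's actual code: the `Detection` fields, the confidence threshold, and the `store` and `publish` stand-ins are all assumptions made for the example. Only the step names and the “detection.propagated” event name come from the post.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Set, Tuple


@dataclass(frozen=True)
class Detection:
    # Hypothetical fields, chosen only for this example.
    machine_id: str
    kind: str
    confidence: float


# A step returns the detection to keep it, or None to drop it.
Step = Callable[[Detection], Optional[Detection]]


def make_new_event_validator(seen: Set[Tuple[str, str]]) -> Step:
    """Step 1: keep only detections we were not aware of yet."""
    def validate(detection: Detection) -> Optional[Detection]:
        key = (detection.machine_id, detection.kind)
        if key in seen:
            return None
        seen.add(key)
        return detection
    return validate


def smart_logic_layer(detection: Detection) -> Optional[Detection]:
    """Step 2: filter out irrelevant detections (assumed rule: a threshold)."""
    return detection if detection.confidence >= 0.8 else None


def run_flow(
    detection: Detection,
    steps: List[Step],
    store: List[Detection],
    publish: Callable[[str, Detection], None],
) -> bool:
    """Run all steps; persist and send the event only if every step passes."""
    current = detection
    for step in steps:
        result = step(current)
        if result is None:
            return False  # dropped by this step
        current = result
    store.append(current)                      # Step 3a: source of truth
    publish("detection.propagated", current)   # Step 3b: notify consumers
    return True
```

Because each step is just a function, expanding the smart logic layer means adding functions to the `steps` list, which is part of why the ownership of that list matters so much.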

The AI fleet saw potential in the smart logic layer (this is a simplification, but it is enough for this post). As you can see, it’s right in the middle of it all. So what’s the problem?

What’s the problem?

The AI fleet wants to expand the smart layer in the detection manager, and this holds some challenges:

  1. Technology: the product fleets (including the Health fleet) in Augury develop their backend services in Go, while the AI fleet, due to its nature, writes code mainly in Python.
  2. Expandability: there are a lot of plans for this layer. Adding more complexity to a service that already has several responsibilities will make it hard to keep expanding this new feature in the future.
  3. Autonomy & Ownership: we want the AI fleet to take full ownership of what they are developing. As long as it’s not their domain and ownership, it will always be halfway.

Rethinking our current architecture

As the company grows, we have started to see the detections coming out of the AI engine as a cross-cutting concern. So many product features, and potentially many fleets, consume them that we started to think it may be wrong for a product fleet to hold part of the detection flow, and that perhaps the AI fleet should own it. This makes sense because the AI fleet is a platform fleet.

Considering DDD guidelines, we can try to define how the fleets interact with each other. We can argue that the “propagated detections” are an aggregate (even an aggregate root). This aggregate can generate a domain event in a bounded context that I’ll attempt to name, maybe “detection generation”? Anywhoo, once we figured that out, it became clear that the detection flow, including the aggregate’s source of truth (persistence), should be owned by the AI fleet, and that we want it in a separate, clean microservice written in Python.
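For readers less familiar with the DDD terms, an aggregate that generates a domain event could look roughly like the sketch below. This is a hedged illustration, not the real service: the class name `PropagatedDetection`, its fields, and the `pull_events` helper are hypothetical. Only the “detection.propagated” event name is taken from this post.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List


@dataclass(frozen=True)
class DomainEvent:
    name: str
    detection_id: str
    occurred_at: datetime


@dataclass
class PropagatedDetection:
    """Aggregate root for a detection (hypothetical shape for illustration)."""
    detection_id: str
    machine_id: str
    status: str = "pending"
    # Events recorded by the aggregate, waiting to be published.
    _events: List[DomainEvent] = field(default_factory=list, init=False, repr=False)

    def propagate(self) -> None:
        """Mark the detection as propagated and record the domain event once."""
        if self.status == "propagated":
            return  # idempotent: do not emit the event twice
        self.status = "propagated"
        self._events.append(
            DomainEvent(
                name="detection.propagated",
                detection_id=self.detection_id,
                occurred_at=datetime.now(timezone.utc),
            )
        )

    def pull_events(self) -> List[DomainEvent]:
        """Hand recorded events to the caller and clear the internal list."""
        events, self._events = self._events, []
        return events
```

The point of the pattern is that state changes and the events describing them live in one place, so whoever owns the aggregate also owns its source of truth and its events.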

I am familiar with the domain. Once I mapped out our smart logic layer to understand the dependencies, I could use our DDD guidelines to define bounded contexts and aggregates. I figured out that 60% of it is already the new expansion layer the AI fleet wants to add. Well, if it walks like a duck…

Why not just write the code in the microservice? Why migrate existing logic?

I think this is an excellent question (of course, I asked it :) ). Let’s take another look at the detection management service (I have added it below). I would like to suggest some possible solutions and explain why they are not the right fit for us:

  1. Just take out the other responsibilities and expand this service: as I said, this is a service in Go, which means that our DS and DE peeps will have a harder time working with it.
  2. Add the new logic as a microservice between the engine and the detection manager: it won’t work. The new layer should handle only detections that the first phase has already validated. Also, we structure our microservices with DDD. A new microservice is usually created around a cross-cutting concern or an aggregate, and here we have neither.
  3. Create a microservice per phase: even though it sounds exciting, this would be over-engineering for the situation. We still have too many unknowns to take this step, even though it sometimes seems like what we will need in the future.

The architecture we decided on

So we decided to keep only the “other responsibilities” in the detection manager and to write a new microservice in Python that will hold the “detection flow” and support all of the new logic we want to add. On top of that, the new service will be the one to propagate the “detection.propagated” domain event. All the features will consume it directly from the AI engine.
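The fan-out described here, where one service publishes the event and every feature consumes it, can be sketched with a tiny in-memory bus. This is only a stand-in for whatever messaging infrastructure is actually used, which the post does not specify. The `InMemoryBus` class, the consumer names, and the payload fields are assumptions for the example.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, Dict, List

Handler = Callable[[Dict[str, str]], None]


class InMemoryBus:
    """Toy publish/subscribe bus; a real system would use a message broker."""

    def __init__(self) -> None:
        self._handlers: DefaultDict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, event_name: str, handler: Handler) -> None:
        self._handlers[event_name].append(handler)

    def publish(self, event_name: str, payload: Dict[str, str]) -> None:
        # Deliver the payload to every handler registered for this event.
        for handler in self._handlers.get(event_name, []):
            handler(payload)


bus = InMemoryBus()

# Hypothetical consumers: each feature keeps its own view of the detections.
alerting_inbox: List[Dict[str, str]] = []
detection_manager_inbox: List[Dict[str, str]] = []
bus.subscribe("detection.propagated", alerting_inbox.append)
bus.subscribe("detection.propagated", detection_manager_inbox.append)

# The new Python service is the only publisher of this event.
bus.publish("detection.propagated", {"detection_id": "det-1", "machine_id": "pump-1"})
```

With this shape, the detection manager becomes just one more consumer of the event rather than the place where the flow lives.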

What have I learned?

Creating a situation in which a squad/fleet can take ownership over a technical element is complex. Sometimes the quickest solution is not the one that will give you the long-term impact you need.

After creating the design for this initiative, I decided to push for a deeper conversation between the two fleets about the interfaces and boundaries we share. This is vital because uncertainty around those boundaries can drive us to make the wrong architectural decisions.

On top of ownership optimization, there was another interesting topic that I didn’t touch on: the resources and roles that the initiative will require. For example, the detection manager also exposes an API for the detections, which our front end consumes. There are some adjustments that I chose not to make because we have no frontend resources assigned to this initiative. That is technical debt we decided to live with for now.

Closing words

  • Always #question_it, maybe the current architecture is not the one you need for tomorrow.
  • Optimize for long-term solutions that enable ownership
  • Don’t over-engineer solutions you are not 100% positive you need
  • Take the time to understand the boundaries between different teams
  • Perform research on domains you intend to touch. You might find surprising things.

Thank you for reading. I hope you liked it. Feel free to share thoughts and ideas around the topics I have shared.

Hila Fox
Machines talk, we tech.

Software Architect @ Augury. Experienced with high scale distributed systems and domain driven design.