Abnormal AI Innovation: Inside the Fault-Tolerant Scoring Engine
How Abnormal engineered a resilient, self-healing AI detection platform that maintains high precision even when dependencies fail.
Edward Li, Albert Liu

In the world of high-throughput real-time data processing, the complexity needed to detect increasingly sophisticated threats is the enemy of stability. Modern detection platforms powered by AI are not monolithic applications but sprawling ecosystems of interconnected microservices. They orchestrate a constant flow of data from dozens of sources—databases, internal APIs, and third-party lookups—to enrich events and make intelligent decisions.
This creates the core challenge of partial failure and dependency. What happens when a critical service experiences an outage? What if a database that stores employee information goes down, or a service that provides domain reputation data times out? For a complex email security ML system, the options are to either shut down the detection pipeline—allowing malicious messages to hit customers' inboxes—or continue operating in a highly degraded state, where safe messages might be incorrectly filtered out. As we scale our infrastructure and our web of dependencies grows, this fragility slows velocity, forcing engineers into a reactive posture of firefighting instead of focusing on innovation.
In the first post of this series, we introduced a way to explicitly model dependencies with a Signals DAG (Directed Acyclic Graph) built with our internal framework, Compass. The second post discussed how the Signals DAG was used to build high-scale aggregation systems. This post explores another critical use case for Compass: enabling system resiliency in our real-time scoring engine. The same dependency graph used for aggregate features is the key to building a truly fault-tolerant detection platform, because to gracefully handle a failure, we must first understand exactly what depends on the failing component.
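To make the idea concrete, here is a minimal sketch of using an explicit dependency graph to answer "what depends on the failing component?" The graph, signal names, and the traversal are illustrative assumptions, not the Compass implementation:

```python
# Hypothetical signal dependency graph: edges point from a signal to the
# signals that consume it. Names are illustrative, not Abnormal's actual DAG.
DOWNSTREAM = {
    "employee_db": ["sender_profile"],
    "whois": ["domain_reputation"],
    "sender_profile": ["impersonation_score"],
    "domain_reputation": ["impersonation_score", "link_risk"],
}

def affected_signals(failed_source: str) -> set[str]:
    """Return every signal transitively downstream of a failed dependency."""
    affected, stack = set(), [failed_source]
    while stack:
        node = stack.pop()
        for child in DOWNSTREAM.get(node, []):
            if child not in affected:
                affected.add(child)
                stack.append(child)
    return affected

print(sorted(affected_signals("whois")))
# -> ['domain_reputation', 'impersonation_score', 'link_risk']
```

With the graph modeled explicitly, a single Whois outage can be mapped to the exact set of tainted features, and everything outside that set remains trustworthy.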
The Pivot: A Paradigm Shift to Fault-Tolerant Scoring
At Abnormal AI, our mission to stop the world's most advanced email attacks requires an AI platform that operates at immense scale and with unwavering reliability. Our customers expect our protection to be always on, like a power plant. The fragility of a conventional detection pipeline is simply unacceptable when fighting a smart and dynamic human adversary. Beyond monitoring and alerts, we needed a “backup generator” for the entirety of our detection logic to gracefully handle outages.
To solve this, we fundamentally re-architected a core piece of our infrastructure to create the Fault-Tolerant Scoring (FTS) framework. This isn't a patch or a temporary fix; it's a new, secure-by-default paradigm for our real-time detection engine. It's a cornerstone of our AI-vs-AI strategy, designed to make our defenses more robust and resilient than the attacks they are built to stop. The FTS framework ensures that even when parts of our distributed system fail, the overall system continues to function with high precision, making best-effort decisions and automatically healing itself.
How the Fault-Tolerant Scoring Framework Functions
The FTS framework is built on several core principles that work in concert to contain the "blast radius" of an outage and maintain the integrity of our detection verdicts.
Intelligent Failure Propagation
At the heart of FTS is the concept of treating failure as a first-class data type. Previously, a failed lookup for an auxiliary signal¹ might return misleading default values or raise an exception that stops the scoring process entirely. Now, leveraging the Signals DAG maintained by Compass, our system intelligently handles these failures. When an upstream data source like our EmployeeDB or a Whois API lookup fails, the system propagates a special failure attribute to all downstream dependencies. This explicitly modeled graph allows us to map out all affected attributes and quarantine the failure before it corrupts healthy parts of the service.
Dynamic Model and Rule Evaluation
With failure now an explicit state, our decisioning components can intelligently adapt. Our models and our rules engine, REEL, now perform a crucial pre-check before executing. If any of a model's required input features are marked with the special failure state, the model is intentionally skipped. This dynamic evaluation is amplified by our multi-model strategy. Abnormal maintains a diverse portfolio of detectors, including robust models with fewer dependencies that can function even when some data sources are unavailable. This allows the system to make a best-effort decision using high-integrity data from the healthy parts of the service, which is far better than making a wrong decision based on corrupted inputs.
Best-Effort Decisioning and Automated Retry Protocol
After making a best-effort decision from available, healthy signals, the FTS framework takes one final, crucial action. If a high-confidence attack is detected, we remediate it immediately—no waiting. For all other verdicts (e.g., Safe, Spam, Suspicious), we know the score was generated under degraded conditions. Instead of committing to such a verdict, the system raises a FAULT_TOLERANT_EXCEPTION that acts as a signal to automatically re-queue the message in our asynchronous retry queue. Once the upstream dependency is healthy again, the message is resubmitted from the queue for a full-fidelity, complete rescore, ensuring no message is left behind.
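Putting the pieces together, the flow might look like the sketch below. All names here—the failure sentinel, the Detector shape, the scoring placeholder, the thresholds—are assumptions for illustration, not Abnormal's actual API:

```python
from dataclasses import dataclass

class Failure:
    """Sentinel marking a signal whose upstream dependency failed
    (failure as a first-class data type)."""

FAILED = Failure()

class FaultTolerantException(Exception):
    """Raised when a non-attack verdict was produced under degraded
    conditions, so the message can be re-queued for a full rescore."""

@dataclass
class Detector:
    name: str
    required: list            # feature names this model needs
    threshold: float = 0.9    # score at or above this = high-confidence attack

def score_message(detectors, features):
    """Pre-check each detector's inputs; skip any that would see a failure."""
    degraded, best = False, 0.0
    for d in detectors:
        inputs = [features.get(name, FAILED) for name in d.required]
        if any(isinstance(v, Failure) for v in inputs):
            degraded = True   # quarantine: never feed tainted inputs to a model
            continue
        score = sum(inputs) / len(inputs)   # placeholder for real inference
        if score >= d.threshold:
            return "ATTACK"   # high confidence: remediate immediately
        best = max(best, score)
    if degraded:
        # Best-effort verdict only; hand off for an async full-fidelity rescore.
        raise FaultTolerantException("re-queue for rescore")
    return "SAFE" if best < 0.5 else "SUSPICIOUS"
```

During a simulated Whois outage, a robust detector with no Whois dependency can still fire on a high-confidence attack and return "ATTACK" immediately, while a safe-looking message under the same outage raises FaultTolerantException and is re-queued rather than committed.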
The Impact: A Paradigm Shift in Detection Resiliency
Implementing the Fault-Tolerant Scoring framework has transformed our engineering and detection posture from fragile to antifragile, producing a cycle of increasing reliability and faster innovation.
Drastic Reduction in Outage-Driven False Positives: The impact on stability has been profound. In simulations where our auxiliary signals¹—such as lookups to the EmployeeDB, domain reputation services (Whois), and historical sender statistics (Corpus Stats)—failed simultaneously, our False Discovery Rate (FDR) on safe messages plummeted from a catastrophic 73.4% to just 1.7%. This shift mitigates the risk of massive FP surges, protecting customer workflows and freeing our engineers from stressful, all-hands-on-deck firefighting.
Sustained High-Precision Attack Detection: Even in a degraded state, FTS continues to stop attacks. For example, during a simulated outage of our core behavioral analytics signals (RSA/BSA), the system still achieved a 57.6% recall on high-confidence attacks with a near-zero FDR of 0.17%. By harnessing the available signals, we ensure the best possible decision is made, delivering immediate protection to our customers.
Automated, Self-Healing System: With automated retry, the platform now heals itself, no manual intervention required. Messages processed during an outage are automatically re-evaluated once the system returns to a healthy state, ensuring complete coverage. This has eliminated an entire class of manual recovery efforts, amplifying the impact of our engineers.
Accelerated Engineering and Detection Velocity: By fortifying the platform at its foundation, we have empowered our detection engineers and data scientists to innovate with confidence. They can build more powerful and complex detectors without the cognitive load of worrying about upstream fragility. This directly accelerates our ability to develop and ship new protections to customers at a more scalable pace.
Conclusion
Building a truly AI-native defense platform requires more than just powerful models; it demands deep investment in the underlying engineering that makes those models reliable. The Fault-Tolerant Scoring framework embodies this principle. By treating failure not as an unexpected error but as an explicit state to be managed, we have created a more resilient, reliable, and ultimately more effective security platform.
If you're passionate about solving meaningful engineering challenges, we encourage you to browse open roles on our careers page.
To explore the email security solutions powered by Abnormal’s AI-native innovation, schedule a demo today.
¹ Auxiliary signals are external signals we incorporate into an event to provide additional data for our models to make a decision.
A foundational project like this is the result of deep collaboration across multiple engineering teams. Special thanks to Fang Deng, Jethro Kuan, Micah Zirn, Winston Lee, Justin Young, Dan Shiebler and the rest of the Abnormal Detection team for contributing to the ideas and the system described in this blog post.