
Fundamentals · 2026-05-20 · 9 min read

Fraud Risk Score Explained — How It's Calculated and What to Do With It

A fraud risk score is a probability estimate, not a verdict. Here's what goes into it, why two systems disagree on the same order, and how to operationalize it.

Almost every fraud system in ecommerce — Shopify's native analysis, Stripe Radar, Signifyd, Riskified, and every fraud app you've heard of — outputs a risk score. The score is the single most-used data point in fraud operations, the first thing merchants look at, and the trigger for most automated decisions.

It's also one of the most consistently misused signals in the entire fraud toolkit.

This guide unpacks what a risk score actually represents, where the inputs come from, why two scores can disagree wildly on the same order, and how to turn a score into a decision that doesn't cost legitimate revenue.

What a risk score actually represents

A fraud risk score is a probability estimate that a given order will turn out to be fraudulent. Different systems normalize the number differently — Shopify uses categorical labels (Low / Medium / High), Stripe Radar uses 0-100, fraud apps often use 1-10 or 0-1000 — but the underlying claim is the same.

A "high risk" score of 90 doesn't mean the order is 90% likely to be fraud. It means something closer to a ranking: "based on the patterns this model learned from past fraud, the signals on this order match fraud cases more strongly than about 90% of the orders the model has scored." The actual fraud rate of orders scored at 90 depends on the model's training data, the merchant's traffic mix, and many other contextual factors.

In practice, the actual fraud rate of "high risk" orders varies dramatically by merchant. We've seen Shopify stores where 35% of high-risk orders are confirmed fraud — and stores where less than 8% are. Same score system. Completely different outcomes. Because the score is calibrated against the system's overall training set, not against your specific traffic.

This is the first thing to internalize: your store's risk score is informative, but its decision threshold is yours to calibrate.
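One way to see how the score behaves on your own traffic is to measure the confirmed-fraud rate at each risk level. The following is a minimal sketch, assuming you can export orders with a risk level and a reviewed outcome; the field names `risk_level` and `outcome` and the outcome labels are hypothetical, not any platform's export format.

```python
from collections import defaultdict

def fraud_rate_by_level(orders):
    """Return {risk_level: (confirmed_fraud, resolved_orders, fraud_rate)}.

    Orders with outcome "unknown" are skipped because they carry no ground truth.
    """
    counts = defaultdict(lambda: [0, 0])  # level -> [fraud, resolved]
    for order in orders:
        if order["outcome"] == "unknown":
            continue
        counts[order["risk_level"]][1] += 1
        if order["outcome"] == "fraud":
            counts[order["risk_level"]][0] += 1
    return {
        level: (fraud, resolved, fraud / resolved)
        for level, (fraud, resolved) in counts.items()
    }

# Made-up sample data, for illustration only.
sample_orders = [
    {"risk_level": "high", "outcome": "fraud"},
    {"risk_level": "high", "outcome": "legit"},
    {"risk_level": "high", "outcome": "legit"},
    {"risk_level": "high", "outcome": "legit"},
    {"risk_level": "high", "outcome": "unknown"},
    {"risk_level": "low", "outcome": "legit"},
    {"risk_level": "low", "outcome": "legit"},
]
rates = fraud_rate_by_level(sample_orders)
print(rates)
```

If the rate you measure at "high" is far from what you assumed, that gap is the calibration work the rest of this guide describes.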

What goes into the score

Risk-scoring models combine signals from several categories. Different systems weight them differently, but the menu of inputs is roughly consistent:

Payment-network signals

AVS (Address Verification System) result
CVV check result
Card BIN reputation (the first 6 to 8 digits identify the issuing bank)
Decline history on the card
Time-since-card-issued vs. time-of-purchase

These come from the payment processor and card network. They are the strongest individual signals when they fail, but only weakly informative when they pass: a sophisticated fraudster holding full stolen card details will pass AVS/CVV.

Identity signals

Email reputation (age, frequency-of-use, breach-exposure)
Phone number quality (carrier type, country prefix consistency)
Shipping address validity
Billing-shipping match

Signal whether the identity attached to the order is genuine.

Behavioral signals

Time-on-site before checkout
Checkout completion speed
Navigation pattern
Prior cart abandonment
Device fingerprint
Browser characteristics

Signal whether the order looks like a human shopping or a bot/automation.

Network signals

IP reputation (residential / hosting / proxy / VPN / Tor)
IP-billing-country consistency
ISP / ASN

Signal where the buyer is physically located and what kind of infrastructure they are connecting through.

Velocity signals

Recent orders from this IP / device / email / card / shipping address
Velocity across cross-merchant networks

Signal whether the order is part of a pattern or isolated.
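A velocity signal is usually just a count over a sliding time window. Here is a minimal sketch, assuming orders are available as dictionaries with a `created_at` timestamp; the function name, field names, and the one-hour default window are illustrative assumptions, not how any particular fraud system computes it.

```python
from datetime import datetime, timedelta

def velocity(orders, field, value, now, window=timedelta(hours=1)):
    """Count orders whose `field` equals `value` within `window` before `now`."""
    cutoff = now - window
    return sum(
        1
        for order in orders
        if order.get(field) == value and cutoff <= order["created_at"] <= now
    )

# Made-up sample data using documentation-reserved IP ranges.
now = datetime(2026, 5, 20, 12, 0)
recent_orders = [
    {"ip": "203.0.113.7", "email": "a@example.com", "created_at": now - timedelta(minutes=5)},
    {"ip": "203.0.113.7", "email": "b@example.com", "created_at": now - timedelta(minutes=20)},
    {"ip": "203.0.113.7", "email": "c@example.com", "created_at": now - timedelta(hours=3)},
    {"ip": "198.51.100.4", "email": "a@example.com", "created_at": now - timedelta(minutes=30)},
]
ip_count = velocity(recent_orders, "ip", "203.0.113.7", now)
email_count = velocity(recent_orders, "email", "a@example.com", now)
print(ip_count, email_count)
```

The same function works for device, card, or shipping address by changing `field`; the window length is the main thing to tune.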

Cross-merchant signals

Email / card / device associated with fraud or disputes on other merchants
Shared blocklist hits

The bigger the underlying network, the more useful these signals are. Shopify operates a meaningful network across the platform.

A modern risk-scoring model takes dozens to hundreds of these signals, weights them using a statistical or machine-learning approach, and outputs a score.
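To make "weights them and outputs a score" concrete, here is a toy logistic scorer. It is a sketch only: the signal names, weights, and bias are invented for illustration, a real model would learn them from labeled orders, and this is not how Shopify, Stripe, or any fraud app actually computes its score.

```python
import math

# Hypothetical hand-picked weights; a production model would learn these.
WEIGHTS = {
    "avs_mismatch": 1.2,
    "cvv_fail": 1.8,
    "ip_is_proxy": 0.9,
    "billing_shipping_mismatch": 0.6,
    "orders_from_ip_last_hour": 0.4,  # applied per order counted
}
BIAS = -4.0  # keeps the score low when no signal fires

def risk_score(signals):
    """Map a dict of signal values to a 0-100 score via a logistic function."""
    z = BIAS + sum(
        WEIGHTS.get(name, 0.0) * float(value) for name, value in signals.items()
    )
    probability = 1.0 / (1.0 + math.exp(-z))
    return round(probability * 100)

clean_order = {"avs_mismatch": 0, "cvv_fail": 0, "ip_is_proxy": 0}
risky_order = {
    "avs_mismatch": 1,
    "cvv_fail": 1,
    "ip_is_proxy": 1,
    "orders_from_ip_last_hour": 3,
}
print(risk_score(clean_order), risk_score(risky_order))
```

Even in this toy version, the output only ranks orders relative to the weights chosen; whether a given score corresponds to a given fraud rate is a separate question that depends on the data behind it.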

Why two systems disagree on the same order

If you've wondered why Shopify's risk score and your fraud app's risk score disagree, the answer is training data.

Each system's model was trained on its own historical fraud data. Shopify's model has seen orders from across the entire Shopify platform: millions of stores, every conceivable vertical, every geography. A specialized fraud app (Shieldy, for instance) has trained on a narrower dataset that may resemble your traffic more closely.

The narrower-but-relevant model often outperforms the broader-but-generic one for your specific case — because the patterns of fraud in your vertical aren't necessarily the patterns of fraud across all retail.

When systems disagree, the higher score isn't automatically right. The question is which signals each system picked up on:

If your fraud app flags device-cluster signals that Shopify isn't seeing, and you operate in a vertical with active fraud rings (dropshipping, free-product giveaways), the app's signal is the one to weight more
If Shopify flags an order high-risk that your fraud app didn't catch, check whether it's a cross-merchant signal (card flagged on another store) — those are valuable

The biggest mistake: treating high-risk as "cancel"

A common error — especially in newer fraud-ops teams — is treating "high risk" as synonymous with "cancel." The math punishes this.

If 25% of your high-risk orders are actually fraud and 75% are false positives, then auto-cancelling all high-risk orders means you're cancelling three legitimate orders for every fraudulent one you stop. At even modest AOV, the margin lost on those cancelled legitimate orders can exceed the fraud losses avoided, and that is before counting customers who never come back.
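The arithmetic can be sketched in a few lines. The cost model here is an assumption made for illustration: a stopped fraud order saves the cost of goods plus a chargeback fee, and a cancelled legitimate order loses its margin. The AOV, margin, and fee values are made up, and lost customer lifetime value is ignored, which understates the cost of false positives.

```python
def auto_cancel_net(high_risk_orders, fraud_share, aov, margin, chargeback_fee):
    """Net effect of auto-cancelling every high-risk order.

    Positive means auto-cancel saved money; negative means it cost money.
    """
    fraud_orders = high_risk_orders * fraud_share
    legit_orders = high_risk_orders - fraud_orders
    # Assumed loss per fraud order: goods shipped at cost, plus the fee.
    loss_avoided = fraud_orders * (aov * (1 - margin) + chargeback_fee)
    # Assumed loss per false positive: the margin on the cancelled order.
    profit_lost = legit_orders * aov * margin
    return loss_avoided - profit_lost

# 100 high-risk orders, 25% actually fraud, $80 AOV, 40% margin, $15 fee.
net = auto_cancel_net(100, 0.25, 80.0, 0.40, 15.0)
print(round(net))
```

With these example inputs the result is negative (about -825 per 100 high-risk orders). Different margins and fees move the break-even point, so run it with your own numbers rather than reading this as a general result.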

The operational rule that consistently works: high risk = review, not cancel. Build a queue. Assign someone to triage. Make per-order decisions based on the underlying indicators, not the aggregate score.

For stores doing more volume than a person can triage, this still applies — but the review gets partially automated through Shopify Flow or equivalent:

High risk score → auto-tag, alert team, pause fulfillment
Within X hours, team reviews and either releases, contacts customer, or cancels
Cancellations documented with indicator-level reason → training data for future tuning

A surprisingly small team — even one person spending 30 min/day — can handle the high-risk queue for stores doing thousands of orders/month if the workflow is well-structured.
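The hold-review-decide workflow can be modeled as a small ticket object. This is a sketch of the process described above, not Shopify Flow syntax; the class, function names, and tag strings are hypothetical, and the review window is left to the caller because the right value depends on your fulfillment speed.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Optional

ALLOWED_DECISIONS = {"release", "contact_customer", "cancel"}

@dataclass
class ReviewTicket:
    order_id: str
    due_at: datetime
    tags: list = field(default_factory=lambda: ["high-risk", "fulfillment-paused"])
    decision: Optional[str] = None
    reason: Optional[str] = None

def open_review(order_id, now, review_window):
    """Tag the order, pause fulfillment, and set a review deadline."""
    return ReviewTicket(order_id=order_id, due_at=now + review_window)

def close_review(ticket, decision, reason=None):
    """Record the reviewer's decision; cancellations must state a reason."""
    if decision not in ALLOWED_DECISIONS:
        raise ValueError(f"unknown decision: {decision}")
    if decision == "cancel" and not reason:
        raise ValueError("cancellations need an indicator-level reason")
    ticket.decision = decision
    ticket.reason = reason
    return ticket

# The 4-hour window is an arbitrary placeholder, not a recommendation.
ticket = open_review("order-1001", datetime(2026, 5, 20, 9, 0), timedelta(hours=4))
close_review(ticket, "cancel", reason="AVS mismatch + freight-forwarder ZIP")
print(ticket)
```

Forcing a reason on every cancellation is what turns the queue into usable tuning data later.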

Tuning your threshold

If you've operated a fraud system for at least 60 days, you should be tuning your action thresholds based on actual outcomes.

The exercise:

For the last 60 days of high-risk orders, code each outcome:

Confirmed fraud — chargeback fired, customer admitted, or strong evidence after review
False positive — order was legitimate (paid, received, didn't dispute)
Unknown — cancelled in review without subsequent confirmation either way

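Calculate your true-positive rate, meaning the share of high-risk orders that turned out to be confirmed fraud (in classifier terms, the precision of the high-risk label):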

True positive rate	Recommendation
Below 15%	Auto-cancel is destructive — manual review only
15-30%	Mixed — auto-cancel high-confidence subset only
30-50%	Auto-cancel high-risk is profitable on basic math
Above 50%	Auto-cancel is clearly beneficial; consider tightening to catch more
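The calculation and the table lookup can be sketched as follows. Two assumptions are worth stating: "unknown" outcomes are excluded from the denominator, which can bias the rate if most cancellations end up unknown, and the table's overlapping edges (exactly 30%, exactly 50%) are assigned to the 30-50% band here. The outcome labels are hypothetical codes.

```python
def true_positive_rate(outcomes):
    """Share of resolved high-risk orders that were confirmed fraud."""
    fraud = outcomes.count("fraud")
    false_positives = outcomes.count("false_positive")
    resolved = fraud + false_positives
    if resolved == 0:
        raise ValueError("no resolved outcomes to measure")
    return fraud / resolved

def recommendation(rate):
    """Map the rate to the bands in the table above."""
    if rate < 0.15:
        return "Auto-cancel is destructive; manual review only"
    if rate < 0.30:
        return "Mixed; auto-cancel high-confidence subset only"
    if rate <= 0.50:
        return "Auto-cancel high-risk is profitable on basic math"
    return "Auto-cancel is clearly beneficial; consider tightening to catch more"

# Made-up coded outcomes for five high-risk orders.
coded = ["fraud", "false_positive", "false_positive", "false_positive", "unknown"]
rate = true_positive_rate(coded)
print(f"{rate:.0%}: {recommendation(rate)}")
```

Tracking the count of "unknown" outcomes alongside the rate is worthwhile, since a large unknown pile means the rate rests on little evidence.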

The right threshold also depends on AOV and margin. A store with $500 AOV at 40% margin can absorb a higher false-positive rate (because per-fraud loss is large). A store with $30 AOV at 20% margin needs a tighter threshold.

Most stores converge on a two-threshold system:

A "review" threshold (orders pause for human triage)
An "auto-cancel" threshold (orders blocked without review)

The auto-cancel threshold is set conservatively — only highest-confidence signals — and the review queue handles everything in between.
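A two-threshold policy reduces to a short routing function. This sketch assumes a 0-100 numeric score; the default thresholds of 60 and 95 are placeholders chosen for the example, not recommended values, and should come from your own measured outcomes.

```python
def route_order(score, review_threshold=60, auto_cancel_threshold=95):
    """Return "auto_cancel", "review", or "fulfill" for a 0-100 risk score."""
    if review_threshold > auto_cancel_threshold:
        raise ValueError("review threshold must not exceed auto-cancel threshold")
    if score >= auto_cancel_threshold:
        return "auto_cancel"
    if score >= review_threshold:
        return "review"
    return "fulfill"

for score in (20, 72, 98):
    print(score, route_order(score))
```

Keeping the auto-cancel threshold high means most flagged orders land in "review", which matches the conservative setup described above.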

Reading indicators, not just scores

For experienced fraud-ops teams, the aggregate score becomes less useful than the underlying indicators. After enough orders, certain combinations carry specific implications:

Indicator combination	What it usually means
AVS mismatch + IP outside billing country + freight-forwarder ZIP	Triangulation pattern
Clean AVS + clean CVV + first-order + high-value cart	Likely not fraud, but likely to dispute (friendly fraud risk)
Mass velocity + single IP + low cart values + payment declines	Card testing — block IP immediately
High-volume small-value + same device + different emails	Fraud ring — block at device level
Hotel WiFi while traveling + saved card from a different country	Almost certainly a real customer traveling

These pattern-based decisions outperform threshold-based decisions, but require enough volume and review experience to learn the patterns. The score is where you start; the indicators are where you graduate to.
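Pattern-based review can start as a lookup of indicator combinations. The sketch below encodes the table above as sets; the indicator identifiers are hypothetical names invented for this example, and a real rule set would grow out of your own review history.

```python
# Each entry: (indicators that must all be present, interpretation).
PATTERNS = [
    ({"avs_mismatch", "ip_outside_billing_country", "freight_forwarder_zip"},
     "Triangulation pattern"),
    ({"clean_avs", "clean_cvv", "first_order", "high_value_cart"},
     "Likely not fraud, but likely to dispute (friendly fraud risk)"),
    ({"mass_velocity", "single_ip", "low_cart_values", "payment_declines"},
     "Card testing: block IP immediately"),
    ({"high_volume_small_value", "same_device", "different_emails"},
     "Fraud ring: block at device level"),
    ({"hotel_wifi", "saved_card_different_country"},
     "Almost certainly a real customer traveling"),
]

def match_patterns(indicators):
    """Return interpretations whose required indicators are all present."""
    present = set(indicators)
    return [label for required, label in PATTERNS if required <= present]

order_indicators = {"mass_velocity", "single_ip", "low_cart_values", "payment_declines"}
print(match_patterns(order_indicators))
```

An order that matches no pattern returns an empty list and falls back to the score-based thresholds.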

When the score is wrong (systematically)

Risk scores are wrong all the time. There are systematic reasons:

Travelers — billing in one country, IP in another. Triggers cross-geography signals. Real customers.
Privacy-conscious customers — using VPNs for legitimate privacy reasons. Trigger anonymizing-service signals.
Mobile customers on iCloud Private Relay — rotating IPs, obscured signals. Real customers, often high-AOV.
New-but-legitimate customers in emerging markets — banking/identity infrastructure less mature → higher baseline signals
Paid-campaign one-time buyers — no history, no account context. Look risk-elevated even when legitimate

These systematic biases point to the same conclusion: don't use the score as a verdict, especially when you're trying to grow into new segments or markets.

How Shieldy uses risk scoring

Shieldy Fraud Filter combines Shopify's native risk score with additional signals:

Device fingerprinting — catches repeat fraudsters across rotating IPs/emails
IP-history tracking — flags IPs with prior chargebacks on your store
Cross-merchant blocklist — IPs/emails flagged across the Shopify ecosystem
Anonymization detection — beyond Shopify's basic VPN/proxy check
Behavioral signals — checkout velocity, navigation patterns

The result: where Shopify might flag an order as Medium-risk based on a single indicator, Shieldy can confirm it as High (across multiple signals) or release it as actually low (because the customer has good device history on your store).

You configure separate thresholds for review vs. auto-cancel in Shieldy — and the false-positive rate at each threshold is surfaced in the dashboard so you can tune.

A practical close

A risk score is a tool for prioritizing attention, not making decisions. The merchants who handle fraud well don't treat the score as a verdict. They use it to prioritize which orders deserve closer review, then make decisions based on indicators, customer history, order shape, and the cost of being wrong.

Merchants who handle fraud poorly either ignore the score (absorb obvious fraud) or auto-act on it (lose more in false positives than they save). Both extremes are expensive.

Tune your thresholds against measured outcomes. Read the indicators, not just the score. Layer additional signals — that's where Shieldy adds value beyond Shopify's native filter.