Fraud Risk Score Explained — How It's Calculated and What to Do With It
A fraud risk score is a probability estimate, not a verdict. Here's what goes into it, why two systems disagree on the same order, and how to operationalize it.

Almost every fraud system in ecommerce — Shopify's native analysis, Stripe Radar, Signifyd, Riskified, and every fraud app you've heard of — outputs a risk score. The score is the single most-used data point in fraud operations, the first thing merchants look at, and the trigger for most automated decisions.
It's also one of the most consistently misused signals in the entire fraud toolkit.
This guide unpacks what a risk score actually represents, where the inputs come from, why two scores can disagree wildly on the same order, and how to turn a score into a decision that doesn't cost legitimate revenue.
What a risk score actually represents
A fraud risk score is a model's estimate of how likely a given order is to turn out fraudulent. Different systems normalize the number differently (Shopify uses categorical labels such as Low / Medium / High, Stripe Radar uses 0-100, fraud apps often use 1-10 or 0-1000), but the underlying claim is the same.
A "high risk" score of 90 doesn't mean the order is 90% likely to be fraud. It means something closer to: "based on the patterns this model learned from past fraud, the signals on this order resemble fraud more strongly than most orders do." The actual fraud rate of orders scored at 90 depends on the model's training data, the merchant's traffic mix, and many other contextual factors.
In practice, the actual fraud rate of "high risk" orders varies dramatically by merchant. We've seen Shopify stores where 35% of high-risk orders are confirmed fraud, and stores where less than 8% are. Same score system, completely different outcomes, because the score is calibrated against the system's overall training set, not against your specific traffic.
This is the first thing to internalize: your store's risk score is informative, but its decision threshold is yours to calibrate.
What goes into the score
Risk-scoring models combine signals from several categories. Different systems weight them differently, but the menu of inputs is roughly consistent:
Payment-network signals
- AVS (Address Verification System) result
- CVV check result
- Card BIN reputation (the first 6 to 8 digits identify the issuing bank)
- Decline history on the card
- Time-since-card-issued vs. time-of-purchase
These come from the payment processor and card network. They are the strongest individual signals when they fail but only weakly informative when they pass: a fraudster holding complete stolen card details will pass both AVS and CVV.
Identity signals
- Email reputation (age, frequency-of-use, breach-exposure)
- Phone number quality (carrier type, country prefix consistency)
- Shipping address validity
- Billing-shipping match
Signal whether the identity attached to the order is genuine.
Behavioral signals
- Time-on-site before checkout
- Checkout completion speed
- Navigation pattern
- Prior cart abandonment
- Device fingerprint
- Browser characteristics
Signal whether the order looks like a human shopping or a bot/automation.
Network signals
- IP reputation (residential / hosting / proxy / VPN / Tor)
- IP-billing-country consistency
- ISP / ASN
Signal where the buyer appears to be located and what kind of infrastructure the connection runs through.
Velocity signals
- Recent orders from this IP / device / email / card / shipping address
- Velocity across cross-merchant networks
Signal whether the order is part of a pattern or isolated.
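As a rough illustration of a velocity signal, the sketch below counts how many recent orders share a key (an IP, device, email, card, or shipping address) inside a sliding time window. The class name `VelocityCounter`, the one-hour window, and the `"ip:..."` key format are assumptions made for this example, not any vendor's implementation.

```python
from collections import defaultdict, deque
from datetime import datetime, timedelta


class VelocityCounter:
    """Counts recent orders per key (e.g. "ip:203.0.113.7") in a sliding window."""

    def __init__(self, window: timedelta = timedelta(hours=1)):
        self.window = window
        self._seen = defaultdict(deque)  # key -> timestamps, oldest first

    def record(self, key: str, at: datetime) -> int:
        """Record one order for `key` and return the count inside the window.

        Assumes orders for a given key arrive in chronological order.
        """
        stamps = self._seen[key]
        stamps.append(at)
        # Drop timestamps that have aged out of the window.
        while stamps and at - stamps[0] > self.window:
            stamps.popleft()
        return len(stamps)


counter = VelocityCounter(window=timedelta(hours=1))
start = datetime(2024, 1, 1, 12, 0)
counts = [
    counter.record("ip:203.0.113.7", start + timedelta(minutes=m))
    for m in (0, 5, 10, 90)
]
print(counts)  # [1, 2, 3, 1]
```

Three orders in ten minutes push the count up; by minute 90 the earlier orders have left the window and the count resets. A real system would feed a count like this into the score rather than act on it directly.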
Cross-merchant signals
- Email / card / device associated with fraud or disputes on other merchants
- Shared blocklist hits
The bigger the underlying network, the more useful these signals are. Shopify's advantage here is that it sees orders from across the whole platform.
A modern risk-scoring model takes dozens to hundreds of these signals, weights them with a statistical or machine-learning approach, and outputs a score.
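To make "weights them and outputs a score" concrete, here is a toy sketch that combines a handful of signals with a logistic function. Every signal name, weight, and the bias value is hypothetical and hand-picked for illustration; production models learn their weights from labeled historical orders and use far more inputs.

```python
import math

# Hypothetical weights, hand-picked for illustration only.
WEIGHTS = {
    "avs_mismatch": 1.4,
    "cvv_failed": 1.8,
    "ip_is_proxy": 1.1,
    "billing_shipping_mismatch": 0.6,
    "orders_from_ip_last_hour": 0.5,  # applied per order
}
BIAS = -4.0  # baseline log-odds: most orders are legitimate


def risk_score(signals: dict) -> int:
    """Combine signals into a 0-100 score with a logistic function."""
    log_odds = BIAS + sum(
        WEIGHTS[name] * float(value)
        for name, value in signals.items()
        if name in WEIGHTS  # unknown signals are ignored
    )
    probability = 1.0 / (1.0 + math.exp(-log_odds))
    return round(probability * 100)


clean = {"avs_mismatch": 0, "cvv_failed": 0, "ip_is_proxy": 0}
risky = {
    "avs_mismatch": 1,
    "cvv_failed": 1,
    "ip_is_proxy": 1,
    "orders_from_ip_last_hour": 4,
}
print(risk_score(clean), risk_score(risky))
```

Even in this toy version, the output is only as well calibrated as the weights behind it, which is why the same number can mean different fraud rates on different stores.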
Why two systems disagree on the same order
If you've wondered why Shopify's risk score and your fraud app's risk score disagree, the answer is training data.
Each system's model was trained on its own historical fraud data. Shopify's model has seen orders from across the entire Shopify platform — millions of stores, every conceivable vertical, geographies from everywhere. A specialized fraud app (Shieldy, for instance) has trained on a narrower dataset that may resemble your traffic more closely.
The narrower-but-relevant model often outperforms the broader-but-generic one for your specific case — because the patterns of fraud in your vertical aren't necessarily the patterns of fraud across all retail.
When systems disagree, the higher score isn't automatically right. The question is which signals each system picked up on:
- If your fraud app flags device-cluster signals that Shopify isn't seeing, and you operate in a vertical with active fraud rings (dropshipping, free-product giveaways), the app's signal is the one to weight more
- If Shopify flags an order high-risk that your fraud app didn't catch, check whether it's a cross-merchant signal (card flagged on another store) — those are valuable
The biggest mistake: treating high-risk as "cancel"
A common error — especially in newer fraud-ops teams — is treating "high risk" as synonymous with "cancel." The math punishes this.
If 25% of your high-risk orders are actually fraud and 75% are false positives, then auto-cancelling all high-risk orders means cancelling three legitimate orders for every fraudulent one you stop. Each cancelled legitimate order costs you its margin and often the customer's future purchases, so at even modest AOV the false-positive loss can exceed the fraud loss you prevented.
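The arithmetic can be sketched as follows. The cost model is a deliberate simplification and an assumption of this example: a fraud order that ships costs the goods plus a chargeback fee, and a cancelled legitimate order costs only its margin. The $80 AOV, 40% margin, and $15 fee are illustrative inputs, not benchmarks.

```python
def auto_cancel_net(
    high_risk_orders: int,
    fraud_share: float,
    aov: float,
    margin: float,
    chargeback_fee: float = 15.0,
) -> float:
    """Net effect of auto-cancelling every high-risk order.

    Simplifying assumptions (for illustration only):
    - a fraud order that ships costs the cost of goods plus a chargeback fee
    - a cancelled legitimate order costs only its margin (no lifetime value)
    """
    fraud_orders = high_risk_orders * fraud_share
    legit_orders = high_risk_orders - fraud_orders
    loss_prevented = fraud_orders * (aov * (1 - margin) + chargeback_fee)
    margin_lost = legit_orders * aov * margin
    return loss_prevented - margin_lost


net = auto_cancel_net(100, fraud_share=0.25, aov=80.0, margin=0.40)
print(f"Net effect per 100 high-risk orders: {net:+.2f}")
```

With these inputs the result is negative (about -825 per 100 high-risk orders), and that is before counting the lifetime value of the customers you turned away. Changing the margin, fee, or fraud share can flip the sign, which is the reason to run it on your own numbers.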
The operational rule that consistently works: high risk = review, not cancel. Build a queue. Assign someone to triage. Make per-order decisions based on the underlying indicators, not the aggregate score.
For stores doing more volume than a person can triage, this still applies — but the review gets partially automated through Shopify Flow or equivalent:
- High risk score → auto-tag, alert team, pause fulfillment
- Within X hours, team reviews and either releases, contacts customer, or cancels
- Cancellations documented with indicator-level reason → training data for future tuning
A surprisingly small team — even one person spending 30 min/day — can handle the high-risk queue for stores doing thousands of orders/month if the workflow is well-structured.
Tuning your threshold
If you've operated a fraud system for at least 60 days, you should be tuning your action thresholds based on actual outcomes.
The exercise:
For the last 60 days of high-risk orders, code each outcome:
- Confirmed fraud — chargeback fired, customer admitted, or strong evidence after review
- False positive — order was legitimate (paid, received, didn't dispute)
- Unknown — cancelled in review without subsequent confirmation either way
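Then calculate the share of high-risk orders that were confirmed fraud. Strictly speaking this is the precision of the high-risk label, though it is often loosely called the true-positive rate:
| Confirmed-fraud share of high-risk orders | Recommendation |
|---|---|
| Below 15% | Auto-cancel is destructive; manual review only |
| 15-30% | Mixed; auto-cancel the high-confidence subset only |
| 30-50% | Auto-cancelling high-risk orders is profitable on basic math |
| Above 50% | Auto-cancel is clearly beneficial; consider lowering the score threshold to catch more |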
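A minimal sketch of the exercise, assuming you have exported one outcome label per high-risk order. The label strings, the choice to exclude Unknown orders from the denominator, and the handling of the band edges (exactly 15%, 30%, 50%) are assumptions of this example; the bands themselves come from the table above.

```python
from collections import Counter


def confirmed_fraud_share(outcomes: list[str]) -> float:
    """Share of decided high-risk orders that were confirmed fraud.

    `outcomes` holds one label per high-risk order: "fraud",
    "false_positive", or "unknown". Unknowns are excluded (an assumption).
    """
    counts = Counter(outcomes)
    decided = counts["fraud"] + counts["false_positive"]
    if decided == 0:
        raise ValueError("no decided outcomes to measure")
    return counts["fraud"] / decided


def recommendation(share: float) -> str:
    """Map the measured share onto the bands in the table above."""
    if share < 0.15:
        return "manual review only"
    if share < 0.30:
        return "auto-cancel high-confidence subset only"
    if share <= 0.50:
        return "auto-cancel high-risk is profitable on basic math"
    return "auto-cancel is clearly beneficial"


outcomes = ["fraud"] * 12 + ["false_positive"] * 38 + ["unknown"] * 10
share = confirmed_fraud_share(outcomes)
print(f"{share:.0%} -> {recommendation(share)}")
```

Here 12 confirmed frauds out of 50 decided orders gives 24%, which lands in the "mixed" band. If a large fraction of your orders end up Unknown, treat the result with caution, since the excluded orders could move the share in either direction.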
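The right threshold also depends on AOV and margin, because together they set the cost of each kind of mistake. A fraud order that ships costs you roughly the goods plus a chargeback fee; a wrongly cancelled order costs you its margin and possibly the customer. A store with $500 AOV at 40% margin and a store with $30 AOV at 20% margin sit at very different break-even points, so work out your own rather than borrowing someone else's.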
Most stores converge on a two-threshold system:
- A "review" threshold (orders pause for human triage)
- An "auto-cancel" threshold (orders blocked without review)
The auto-cancel threshold is set conservatively — only highest-confidence signals — and the review queue handles everything in between.
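The two-threshold setup can be sketched as a small routing function. The 0-100 scale and the default values of 60 and 95 are placeholders for this example; the right numbers come from your own outcome data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    review: int = 60       # hypothetical: at or above, hold for a human
    auto_cancel: int = 95  # hypothetical: at or above, block outright

    def __post_init__(self):
        if self.review > self.auto_cancel:
            raise ValueError("review threshold must not exceed auto_cancel")


def route(score: int, t: Thresholds = Thresholds()) -> str:
    """Return "fulfill", "review", or "cancel" for a 0-100 risk score."""
    if score >= t.auto_cancel:
        return "cancel"
    if score >= t.review:
        return "review"
    return "fulfill"


for s in (20, 72, 97):
    print(s, route(s))
```

Keeping the auto-cancel value high and letting the review band absorb everything in between mirrors the conservative setup described above.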
Reading indicators, not just scores
For experienced fraud-ops teams, the aggregate score becomes less useful than the underlying indicators. After enough orders, certain combinations carry specific implications:
| Indicator combination | What it usually means |
|---|---|
| AVS mismatch + IP outside billing country + freight-forwarder ZIP | Triangulation pattern |
| Clean AVS + clean CVV + first-order + high-value cart | Likely not fraud, but likely to dispute (friendly fraud risk) |
| Mass velocity + single IP + low cart values + payment declines | Card testing — block IP immediately |
| High-volume small-value + same device + different emails | Fraud ring — block at device level |
| Hotel WiFi while traveling + saved card from a different country | Almost certainly a real customer traveling |
These pattern-based decisions outperform threshold-based decisions, but require enough volume and review experience to learn the patterns. The score is where you start; the indicators are where you graduate to.
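One way to encode combinations like those in the table is as sets of indicators that must all be present. This is a sketch under assumptions: the indicator names are hypothetical labels invented for the example, and a real rule set would be larger and tuned on your own review history.

```python
# Each pattern is a set of indicators that must ALL be present.
# Indicator names are hypothetical labels for this example.
PATTERNS = [
    ({"avs_mismatch", "ip_outside_billing_country", "freight_forwarder_zip"},
     "triangulation pattern"),
    ({"mass_velocity", "single_ip", "low_cart_value", "payment_declines"},
     "card testing: block IP immediately"),
    ({"high_volume_small_value", "same_device", "different_emails"},
     "fraud ring: block at device level"),
    ({"hotel_wifi", "saved_card_other_country"},
     "almost certainly a real customer traveling"),
]


def match_patterns(indicators: set[str]) -> list[str]:
    """Return the label of every pattern fully contained in `indicators`."""
    return [label for required, label in PATTERNS if required <= indicators]


order = {"mass_velocity", "single_ip", "low_cart_value",
         "payment_declines", "ip_is_proxy"}
print(match_patterns(order))  # ['card testing: block IP immediately']
```

Extra indicators on the order (here `ip_is_proxy`) do not block a match, and an order that matches nothing simply falls back to the score-based thresholds.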
When the score is wrong (systematically)
Risk scores are wrong often, and many of the errors are systematic:
- Travelers — billing in one country, IP in another. Triggers cross-geography signals. Real customers.
- Privacy-conscious customers — using VPNs for legitimate privacy reasons. Trigger anonymizing-service signals.
- Mobile customers on iCloud Private Relay — rotating IPs, obscured signals. Real customers, often high-AOV.
- New-but-legitimate customers in emerging markets — banking/identity infrastructure less mature → higher baseline signals
- Paid-campaign one-time buyers — no history, no account context. Look risk-elevated even when legitimate
These systematic biases point to the same conclusion: don't use the score as a verdict, especially when you're trying to grow into new segments or markets.
How Shieldy uses risk scoring
Shieldy Fraud Filter combines Shopify's native risk score with additional signals:
- Device fingerprinting — catches repeat fraudsters across rotating IPs/emails
- IP-history tracking — flags IPs with prior chargebacks on your store
- Cross-merchant blocklist — IPs/emails flagged across the Shopify ecosystem
- Anonymization detection — beyond Shopify's basic VPN/proxy check
- Behavioral signals — checkout velocity, navigation patterns
The result: where Shopify might flag an order as Medium-risk based on a single indicator, Shieldy can confirm it as High (across multiple signals) or release it as actually low (because the customer has good device history on your store).
You configure separate thresholds for review vs. auto-cancel in Shieldy — and the false-positive rate at each threshold is surfaced in the dashboard so you can tune.
A practical close
A risk score is a tool for prioritizing attention, not for making decisions. The merchants who handle fraud well don't take the score at face value. They use it to prioritize which orders deserve closer review, then make decisions based on indicators, customer history, order shape, and the cost of being wrong.
Merchants who handle fraud poorly either ignore the score (absorb obvious fraud) or auto-act on it (lose more in false positives than they save). Both extremes are expensive.
Tune your thresholds against measured outcomes. Read the indicators, not just the score. Layer additional signals — that's where Shieldy adds value beyond Shopify's native filter.


