We Tested 6 AI Tools on Financial Data Accuracy

We gave 6 AI tools real financial data. Tome turned $2.1M into $21M. Kimi fabricated 78% of all numbers. Here's exactly what each tool did with your data.

We gave six AI presentation tools the same financial data. Specific numbers: starting ARR of $500K, ending ARR of $2.1M, TAM of $8.4B with a Gartner 2025 citation, ACV of $45, monthly churn of 12%, NPS of 72. We also included a detailed mortgage scenario with a $525,000 purchase price, a 30-year fixed rate at 6.75%, and a 15-year comparison.

Then we checked what the tools did with those numbers.

Tome output $21M for a $2.1M ending ARR. The same bug turned an $8.4B TAM into $84B. Gamma invented three competitors that don't exist and assigned them fabricated market caps. Kimi 2.5 fabricated approximately 47 of the 60 numbers in its final deck. LayerProof dropped the mortgage prompt entirely in both test runs.

The tools we tested: Gamma, Beautiful.ai, Tome, Canva Magic Studio, Kimi 2.5, and LayerProof.

None handled both test scenarios without errors. Every single one invented at least some numbers. The differences are in the type of failure, how detectable it is, and how dangerous it looks in a finished presentation.

The test setup

Two scenarios. First: a SaaS investor pitch with a specific metric set. Starting ARR $500K, ending ARR $2.1M over 18 months, YoY growth 340%, ACV $45, monthly churn 12%, NPS 72, TAM $8.4B citing Gartner 2025.

Second: a mortgage scenario. Purchase price $525,000, 15% down, 30-year fixed at 6.75%, $4,200/year property tax, $1,800/year homeowner's insurance, PMI at 0.55% annually. We asked each tool for the monthly PITI, the amortization balance at years 5, 10, 15, and 30, total interest over the life of the loan, and a comparison against a 15-year fixed at 5.95%.
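For reference, the correct answers take a few lines to compute. A minimal Python sketch of the scenario's reference math, assuming the standard fixed-rate amortization formula and PMI charged on the original loan balance (a simplification; real quotes also drop PMI once equity passes 20%):

```python
# Reference math for the mortgage scenario (simplifying assumptions:
# standard amortization; PMI charged on the original loan balance).
price, down = 525_000, 0.15
loan = price * (1 - down)                    # $446,250 financed

def monthly_pi(principal: float, annual_rate: float, years: int) -> float:
    """Fixed-rate monthly principal-and-interest payment."""
    r, n = annual_rate / 12, years * 12
    return principal * r * (1 + r) ** n / ((1 + r) ** n - 1)

pi_30 = monthly_pi(loan, 0.0675, 30)                          # ~$2,894
piti = pi_30 + 4_200 / 12 + 1_800 / 12 + loan * 0.0055 / 12
print(f"30-year PITI: ${piti:,.0f}/month")                    # ~$3,599
print(f"Total interest, 30-year: ${pi_30 * 360 - loan:,.0f}") # ~$596K
print(f"15-year P&I at 5.95%: ${monthly_pi(loan, 0.0595, 15):,.0f}")  # ~$3,754
```

Any tool that actually attempts the scenario should land within a few dollars of these figures.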

Two testers ran each tool and recorded results in a shared tracker. Every output number was checked against the inputs. Fabricated figures were flagged regardless of whether they sounded plausible.

What each tool did with the numbers

Gamma passed 7 of 9 SaaS input values. The one distortion: it changed $45 ACV to $45K without flagging the change, then added the phrase "demonstrates enterprise appeal" as if explaining the number it had just invented. On the competitor slide, it generated three companies. Acme Solutions Inc. with a $15B market cap. Global Systems Co. with $22B. Innovate Hub with $9B. None of these companies exist. All three market caps are fabricated.

Gamma.app slide showing fabricated competitor companies with invented market caps

The Gamma failure is easy to miss. Its hallucinations don't contradict each other within the deck. The fictional competitors appear on a clean slide with no indication that the data came from nowhere. If you don't search for the company names, the slide looks fine.

Beautiful.ai passed 5 of 9 inputs. It made the same $45 to $45K inflation. It also changed NPS 72 to "72%." NPS is a score, not a percentage. That unit error matters. For competitors, it used real companies: Asana listed at $4.1B (actual market cap at the time: around $1.6B), Monday.com at $7.4B (actual: around $3.8B), and ClickUp at $1B (ClickUp is private, last valued at $4B in 2021).

Beautiful.ai slide showing real companies with incorrect market cap figures

Real company names attached to wrong financials are harder to catch than fictional companies. Someone reviewing the deck might search "Acme Solutions Inc." and find nothing. Nobody double-checks Asana's market cap mid-review unless they know to look.

Tome passed 5 of 9 inputs. Its failure mode is different from every other tool's: a decimal point parsing bug. It output $21M for an ending ARR that was clearly $2.1M. It output $84B for a TAM that was clearly $8.4B. Both errors follow the same pattern: the decimal shifts one place, producing a 10x inflation. Tome then repeated these wrong numbers consistently across its own slides. The deck is internally coherent. It is also 10x off on two of its three biggest financial claims.

Tome slide showing $84B TAM instead of $8.4B, a 10x decimal parsing error

A pitch deck showing $21M ARR when the actual figure is $2.1M misrepresents the business by a factor of ten. That is not a rounding error.
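We can't see Tome's internals, so the cause is a guess, but both outputs match what a parser produces when it strips the decimal point and reattaches the magnitude suffix. A purely illustrative sketch of that class of bug:

```python
import re

def buggy_parse(figure: str) -> str:
    """Illustrative only: drop everything but the digits, keep the suffix.
    Losing the decimal point shifts the value one place, a 10x inflation."""
    suffix = figure[-1]                      # 'M' or 'B'
    digits = re.sub(r"[^0-9]", "", figure)   # "$2.1M" -> "21"
    return f"${int(digits)}{suffix}"

print(buggy_parse("$2.1M"))   # $21M, 10x the input
print(buggy_parse("$8.4B"))   # $84B
```

Whatever the actual mechanism, the giveaway is that both errors shift by exactly one decimal place, which is why this reads as a parsing bug rather than a hallucination.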

Canva Magic Studio was the most accurate on SaaS metrics. It preserved 8 of the 9 input values exactly, in both trials. The only miss: it dropped the Gartner attribution for the TAM figure. On competitors, it used real companies (Salesforce, HubSpot, Zendesk) with approximately correct market caps.

The problem: Canva dropped the mortgage scenario in both trials. No error message. No indication that the second prompt wasn't followed. It generated a SaaS pitch deck from the mortgage content and moved on.

Kimi 2.5 (Moonshot AI's model) preserved 8 of 9 core SaaS inputs. Then it filled the remaining slides with roughly 47 fabricated numbers. Of approximately 60 total numbers extracted from the output, about 12 matched the inputs. One was incorrectly derived. The other 47 were invented.

The fabricated figures included an LTV of $1,620. At 12% monthly churn and $45/month ACV, the mathematically correct LTV is around $375. Kimi's figure is more than 4x too high. It also showed Net Revenue Retention of 108%, which doesn't hold up against the stated 12% monthly churn rate: compounding 12% monthly churn leaves roughly 22% annual gross retention. NRR of 108% and gross retention of 22% don't coexist in a real business.
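Both checks are one line of arithmetic each. A minimal sketch, reading the $45 figure as monthly revenue per account (the same reading used above):

```python
# Sanity-check Kimi's two derived metrics against the stated inputs.
monthly_revenue = 45                  # the $45 ACV read as monthly revenue
monthly_churn = 0.12

ltv = monthly_revenue / monthly_churn               # simple LTV = revenue / churn
annual_gross_retention = (1 - monthly_churn) ** 12  # churn compounds monthly

print(f"LTV: ${ltv:,.0f}")                          # $375, vs Kimi's $1,620
print(f"Annual gross retention: {annual_gross_retention:.0%}")  # ~22%, vs a claimed 108% NRR
```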

Kimi 2.5 slide with fabricated statistics and metrics

Kimi also invented a fictional company called FlowSync, a fictional person called James Mitchell, and added statistics like "3.2 hours daily" and "67% of projects fail" with no sources cited. It dropped the mortgage prompt entirely.

LayerProof preserved the $45 ACV unchanged in both test runs, one of only two tools to do so (Canva was the other). Gamma, Beautiful.ai, and Tome all inflated the figure to $45K. Churn and NPS also passed through correctly.

But LayerProof dropped the mortgage prompt in both runs. It omitted the core ARR growth metrics in both runs: starting ARR, ending ARR, growth period, and YoY growth never appeared. In Run 2, it fabricated a TAM segment breakdown: SMB $4.1B, Mid-Market $2.9B, and Enterprise $1.4B, with the Enterprise row appearing twice. As displayed, the four rows add to $9.8B, not the $8.4B in the input. It generated four team members with AI-generated headshots, unprompted: Maria Chen, David Rodriguez, Priya Sharma, Alexandre Dubois. Multiple slides in both runs contained unfilled template placeholders.

Comparative scorecard

| Tool | Input values preserved | Input values distorted | Fabricated figures | Mortgage handled | Worst error |
| --- | --- | --- | --- | --- | --- |
| Canva Magic Studio | 8/9 | 0 | 3 competitor caps (plausible) | No | Dropped mortgage prompt entirely |
| Kimi 2.5 | 8/9 | 1 (ACV context only) | ~47 fabricated stats | No | 78% fabrication rate; LTV wrong by 4x; NRR incompatible with stated churn |
| Gamma | 7/9 | 1 ($45 ACV to $45K) | 3 fictional companies with invented market caps | Yes | Fictional companies presented as real competitors |
| Beautiful.ai | 5/9 | 2 (ACV inflated; NPS wrong units) | 3 real companies, fabricated market caps | Yes | Real names attached to wrong financials |
| Tome | 5/9 | 3 ($2.1M to $21M; $8.4B to $84B; ACV) | 0 competitors shown | Yes | Decimal parsing bug: two separate 10x errors |
| LayerProof | 3 to 4/9 (varies by run) | 0 | TAM segments, team bios, ARR chart | No | Dropped mortgage prompt; omitted ARR metrics; fabricated TAM segments that don't add to the total |

Three patterns worth knowing

1. Small dollar values get silently inflated. Gamma, Beautiful.ai, and Tome all changed the $45 ACV to $45K. No tool flagged the change. No tool asked for confirmation. The inflation appeared in the output alongside framing that made it sound intentional: Gamma added "demonstrates enterprise appeal." Only Canva and LayerProof passed the figure through untouched.

If your actual ACV is $45, this matters. The tools have an implicit assumption that small numbers are implausible and should be scaled up.

2. Internally consistent hallucinations are harder to catch than obvious errors. Gamma's fictional competitors don't contradict each other within the deck. Kimi's fabricated statistics sound authoritative. The math errors in Kimi's LTV and NRR only surface when you run the numbers yourself. A presenter reviewing for design and flow would find nothing to flag.

This is why zombie stats in presentations are so persistent. Numbers that look right, cite nothing, and come from nowhere circulate without anyone tracing them to a source.

3. Dropping a prompt is a quiet failure. Canva, Kimi, and LayerProof all ignored the mortgage scenario. Each tool generated a different kind of presentation instead, without any indication that the prompt wasn't followed. Someone who didn't already know what they'd asked for might not notice the switch.

In a financial reporting context, this failure is particularly costly. The mortgage analysis isn't in the deck. The user has no way to know that unless they check.

AI tools and financial data accuracy: what to verify before presenting

The term "ai invented numbers" might suggest obvious, detectable errors. Most of what we found is not obvious. Tome's 10x inflation is detectable if you know the source figure. Gamma's fictional companies are detectable if you search for them. Kimi's LTV overstatement requires recalculating the number yourself. Beautiful.ai's wrong market caps for real companies require looking up current financial data.

The failure mode that matters most is the one that passes a quick read.

Four things to check before presenting AI-generated financial content:

1. Check small dollar values for inflation. If any input was under $1,000, verify the output preserved the exact figure. In this test, three of the six tools silently scaled the $45 ACV up to $45K. The tool doesn't ask. The tool doesn't flag it. It just changes the number.

2. Verify all competitor figures independently. Every tool that generated a competitor slide either invented the companies outright (Gamma) or attached generated financials to real names (Beautiful.ai, Kimi; even Canva's plausible figures were generated, not sourced). No tool in this test handled competitor data correctly. Treat any competitor figure from an AI-generated slide as unverified until you've checked it yourself.

3. Check whether your full prompt was answered. Three of six tools dropped one of the two test scenarios with no warning. Review the output against what you actually asked for, not just against whether the output looks coherent.

4. Run the math on derived metrics. Kimi's LTV was 4x too high. Its NRR was incompatible with the stated churn rate. LayerProof's TAM segments didn't add to the stated TAM total. These errors don't show up on visual review. You have to calculate them; the sketch below shows what that looks like.
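Checks 1 and 4 can be partly scripted. A hypothetical sketch (the `check` helper is ours, not from any tool; the figures are this test's inputs and the outputs described above):

```python
# Recompute derived metrics and flag figures that fail basic consistency checks.
def check(name: str, claimed: float, expected: float, tol: float = 0.02) -> None:
    """Flag a claimed figure that drifts more than tol from the expected value."""
    ok = abs(claimed - expected) <= tol * abs(expected)
    print(f"{'PASS' if ok else 'FAIL'}  {name}: claimed {claimed:g}, expected {expected:g}")

# Check 1: small dollar values preserved exactly.
check("ACV ($)", claimed=45_000, expected=45)        # the $45 -> $45K inflation

# Check 4: derived metrics.
check("LTV ($)", claimed=1_620, expected=45 / 0.12)  # Kimi's LTV vs the math

# 12% monthly churn compounds to ~22% annual gross retention,
# which is impossible to square with a claimed 108% NRR.
print(f"Annual gross retention: {(1 - 0.12) ** 12:.0%}")

# Segment breakdowns should add to the stated total.
tam_segments = [4.1, 2.9, 1.4, 1.4]   # LayerProof Run 2, $B; Enterprise row listed twice
check("TAM ($B)", claimed=sum(tam_segments), expected=8.4)
```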

The findings from our earlier fact-check of 6 AI presentation makers match the pattern here: every tool has a failure profile, and the profiles are distinct enough that you can predict where each tool will break. Tome's decimal bug is systematic. Any number with a decimal point is at risk. Beautiful.ai's errors cluster around competitor data. Gamma fabricates when no competitor data is provided. Kimi fills content gaps with invented statistics that sound authoritative and cite nothing.

Canva is the outlier. Its SaaS accuracy was the best of any tool in this test, and its failures were mostly omissions rather than distortions. Missing a Gartner attribution is a real error. It is less dangerous than publishing a $21M ARR figure for a $2.1M business.

Knowing your tool's failure profile matters. If you use Tome with decimal-heavy financial inputs, check every figure. If you use Gamma or Kimi and haven't provided competitor data, treat the competitor slide as fabricated until verified.


FAQ

Which AI presentation tool has the best financial data accuracy?

In this test, Canva Magic Studio preserved 8 of 9 SaaS input values exactly across both trials, with no value distortions. It had the best accuracy on the metrics we provided, and the best internal consistency of all tools. However, it dropped the mortgage scenario entirely in both trials with no warning. Its failure mode is omission rather than fabrication, which is less dangerous but still a real problem when the omitted content is what you needed.

What does "ai invented numbers" mean in practice?

AI-invented numbers are figures that appear in the output but don't come from the input data, from any stated source, or from a correct calculation. In this test, Kimi 2.5 generated approximately 47 invented numbers, including an LTV 4.3x the mathematically correct figure and an NRR incompatible with the stated churn rate. Gamma invented three competitor companies with fabricated market caps. These are not rounding errors or plausible estimates. They are fabricated.

Can AI presentation tools handle mortgage or financial calculation scenarios accurately?

Poorly, based on this test. Gamma, Beautiful.ai, and Tome attempted the mortgage scenario. Canva, Kimi, and LayerProof dropped it entirely in both trials without flagging the gap. Even for the tools that attempted it, this study tracked only input-to-output value preservation, so the calculated outputs themselves were not independently verified.

Why does this matter for real presentations?

A pitch deck showing $21M ARR when the actual figure is $2.1M misrepresents a business by a factor of ten. Fabricated competitor market caps could mislead investors who don't verify financials independently. An LTV overstated by 4.3x would affect unit economics for any investor doing diligence. NRR incompatible with stated churn is a basic sanity check failure. These aren't cosmetic errors. They affect decisions.

Does LayerProof perform better than competitors on this?

On one specific metric, yes: LayerProof preserved the $45 ACV without inflating it to $45K, a figure Gamma, Beautiful.ai, and Tome all distorted, and that held across both test runs. On other measures, its accuracy in this test was below several competitors: it dropped the mortgage prompt in both runs, omitted all ARR growth metrics in both runs, generated a TAM segment breakdown whose displayed rows don't add to the stated total, and produced AI-generated team members with no prompting. We're publishing these results because accurate data matters more than how we look in it.
