Voice cloning corporate fraud: the verifiable defense for CFOs after the Arup case

Hong Kong, January 2024. An employee at the local branch of Arup, the global engineering firm with 18,000 staff, receives a suspicious email from the London-based CFO requesting a confidential transaction. To rule out a scam, the employee joins a video conference with the CFO and four other colleagues. The voices, the faces, even the mannerisms of the senior leaders are perfect. On the strength of that meeting the employee authorizes fifteen wire transfers totaling 25.6 million dollars across five different accounts. Every participant in that call was a deepfake, reconstructed from public footage available online.

The Arup case is not an outlier. It is the first documented incident in a new category of corporate fraud that combines voice cloning, real-time video deepfakes and BEC (Business Email Compromise) social engineering. Deloitte estimates that the total loss from AI-assisted fraud could reach 40 billion dollars by 2027 in financial services alone, with a 32% compound annual growth rate. For CFOs, treasurers and security leaders, the question is no longer “will this happen to us?” but “when, and what will we have in hand to prove what really happened?”

The answer is not another detection tool. Detection of fakes is a race that attackers win by construction. The problem must be flipped: certify at source the authorized voice of the CFO and of every officer who can authorize a wire transfer, so that a verifiable baseline exists against which any suspicious request can be measured. That is the difference between chasing deepfakes and making them irrelevant.

Anatomy of a voice cloning fraud: from Arup to the Crosetto case in Italy

The Arup case proved that the video conference, long treated as a “verification step” against a suspicious email, has itself become an attack vector. The mechanics are simple and reproducible. Attackers gather public material (interviews, conference talks, podcasts, LinkedIn videos) of the executive to be impersonated. With commercial tools they train a model that reproduces the target’s voice, intonation and cadence. For the live face they apply real-time face swapping to the video stream. A few minutes of source footage is enough for a convincing result on a call, especially given that videoconference audio is already compressed.

The pattern reaches well beyond Hong Kong. In February 2025 the Milan Public Prosecutor opened an investigation into a voice cloning scam against Italian entrepreneurs using the cloned voice of Italian Defense Minister Guido Crosetto: the fraudsters requested “urgent” wire transfers to free Italian journalists held abroad, faking a government reimbursement guarantee. At least one entrepreneur transferred close to one million euros to a foreign account. By 2026 the phenomenon had broadened, with the Bank of Italy issuing an official alert on video and audio deepfakes impersonating Governor Fabio Panetta and other public figures across European institutions.

The three core steps of the attack

Observed attacks follow a consistent playbook:

  1. Reconnaissance: attackers map the org chart on LinkedIn, identify the CFO, the treasurer and the administrative officer who executes wires, and collect public voice samples of the executives.
  2. Pretexting: a first email from a lookalike address creates the context (confidential acquisition, litigation, regulatory transaction requiring strict confidentiality).
  3. Real-time impersonation: a video conference or voice call closes the loop. The employee, already primed by the email, recognizes voice and face and approves the wires.

According to the World Economic Forum’s Global Cybersecurity Outlook 2024, more than 55% of organizations consider generative AI a primary accelerator of financial fraud. The attack surface is no longer just the mailbox: it is every audio and video channel used to authorize operations with economic effect.

Why detection tools fail on short signals

Most CISOs instinctively ask for a tool that flags deepfakes. The market does offer forensic detection solutions, but their error rate climbs sharply when the signal is short, of decent quality and already compressed by a videoconference codec. McAfee research has shown that three seconds of audio are enough to produce a voice clone with 85% accuracy; ten seconds push that figure above 95%. On the defensive side, detection tools reach high accuracy mostly in lab conditions: in production, on short and compressed clips, average accuracy measured by independent benchmarks drops below 70%.

The structural problem is asymmetric. Attackers need a convincing result for a few seconds: the time of a phone call or of a voice clip sent on WhatsApp. Defenders need to flag those few seconds with enough confidence to block a legitimate wire without false positives. The two curves cross in favor of the attacker. Each new generative model (ElevenLabs, Resemble, Tortoise, plus open source forks) pushes the indistinguishability frontier further.

The ENISA Threat Landscape 2024 ranks voice cloning and video manipulation among the threats with the highest growth rate observed in the past two years. The operational conclusion is straightforward: detection can serve as a second filter, but it cannot be the pillar of defense for operations with significant economic impact.

The cognitive limit of the human factor

Even assuming perfect detection tools, a human limit remains. The Arup employee saw and heard people they recognized, in an artificially induced sense of urgency. Experimental studies published in Royal Society Open Science show that people, even after specific training, recognize synthetic voices with around 73% accuracy in their native language, and even less in foreign languages or low-quality channels. The finance function cannot rely on perceptual capabilities that the technology has already surpassed.

The new attack surface: BEC plus voice cloning plus video deepfakes

Frauds documented in 2025-2026 show a convergence of three historically separate vectors. Classic Business Email Compromise (CEO impersonation via email to order urgent wires) was already a material loss item: the FBI Internet Crime Report 2023 recorded 2.9 billion dollars of BEC losses in the United States alone. Voice cloning adds the audio layer: a “confirmation” call that reassures the victim. Video deepfakes add the visual layer: a meeting with the executive’s face on screen.

The result is a multi-channel attack that neutralizes legacy controls based on “second confirmation through an alternative channel”. If the email is fake, the verification call is fake, and the approval video meeting is fake, second confirmation does not add safety: it multiplies it by zero.

Use case: Certified MiFID II communications
How source-certified recording protects banks and intermediaries from disputes and fraud targeting the voices of advisors and clients. Discover more →

Payment procedures under stress

Companies with mature payment processes typically deploy at least three controls: dual signature on wires, hierarchical approval thresholds, and a callback to a known number. All three controls fail if the voice on the alternative channel is cloned. European banks are upgrading biometric onboarding, but on the corporate side most treasurers still operate procedures written before 2023, when high-quality voice cloning required hours of recordings and technical skills out of reach for mass cybercrime.

What the regulation says: NIS2, DORA, AI Act and international frameworks

The European regulatory framework moved faster than expected, although in a fragmented way. NIS2 (Directive EU 2022/2555) imposes on essential and important entities cyber risk management measures that extend to protection against AI-assisted fraud in internal communications. For the financial sector, DORA (Regulation EU 2022/2554), operational since January 2025, requires banks, insurers and market infrastructures to maintain an ICT risk management framework with explicit reference to impersonation incidents and operational resilience testing.

The AI Act (Regulation EU 2024/1689) at Article 50 imposes transparency obligations for artificially generated content: anyone releasing a deepfake must label it as such, with limited exceptions. The rule does not stop fraud (criminals do not respect disclosure obligations), but it strengthens the evidentiary position of those who can prove that a piece of content is authentic and traceable to source.

International frameworks complete the picture. The ISO/IEC 27001 standard for information security management systems and ISO/IEC 27037 on digital evidence handling provide the technical reference for forensic acquisition. In the United States the Federal Rules of Evidence (Rules 901 and 902) cover the authentication of digital records and accept as self-authenticating those records that satisfy specific provenance and chain-of-custody requirements.

Table: how voice cloning changes each regulation

Regulation                     | Scope                            | Impact on anti-voice-cloning procedures
NIS2 (Directive EU 2022/2555)  | Essential and important entities | Technical and organizational measures against AI-assisted fraud in critical processes
DORA (Regulation EU 2022/2554) | Financial sector                 | Operational resilience testing including impersonation scenarios
AI Act art. 50                 | All sectors                      | Deepfake disclosure obligation; strengthens the position of those proving authenticity
ISO/IEC 27001 + 27037          | All sectors                      | Reference for ISMS and digital evidence handling
FRE Rules 901, 902 (US)        | US litigation                    | Authentication of digital records and self-authenticating evidence

What is verifiable defense against voice cloning?

Verifiable defense is a structured procedure that combines two elements: out-of-band verification of every payment request above threshold, and a certified voice baseline for authorized executives. The first element narrows the attack window. The second provides objective evidence, admissible in court and in cyber insurance claims, against which every suspicious request can be measured.

The idea is simple: instead of chasing deepfakes with detection tools that always lag behind, certify the authorized voice at source. Every executive with signature authority records a structured voice sample (standard sentences, reading of a fixed text, specific commands for wire authorization). That sample is acquired with forensic methodology, sealed with a qualified electronic timestamp and digital signature, and preserved as a baseline. When a suspicious request arrives, the comparison is no longer subjective (“it sounded like the CFO’s voice”): it becomes objective and documented. TrueScreen is the Data Authenticity Platform that enables this certified baseline, integrating it into a court-admissible chain of custody.
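
To make the sealing principle concrete, here is a minimal Python sketch assuming the open-source cryptography library: it hashes the captured sample, binds capture metadata to the hash and signs the bundle. The locally generated key and wall-clock timestamp are illustrative stand-ins for the qualified eIDAS trust services described above, not a description of TrueScreen's internal pipeline.

```python
# Minimal sketch of the sealing principle: hash the voice sample, bind
# capture metadata to the hash, and sign the bundle. A real deployment
# relies on a qualified trust service provider (eIDAS qualified
# timestamp and signature); this only illustrates the cryptographic idea.
import hashlib
import json
import time

from cryptography.hazmat.primitives.asymmetric import ed25519

def seal_voice_sample(audio_path: str, device_id: str) -> dict:
    """Produce a signed evidence record for a captured voice sample."""
    with open(audio_path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()

    record = {
        "sha256": digest,                 # fingerprint of the exact bytes
        "device_id": device_id,          # capture device, from app metadata
        "captured_at": int(time.time()), # stand-in for a qualified timestamp
    }
    payload = json.dumps(record, sort_keys=True).encode()

    signing_key = ed25519.Ed25519PrivateKey.generate()  # illustrative key
    record["signature"] = signing_key.sign(payload).hex()
    return record
```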

How the certified voice baseline works with TrueScreen

The process unfolds in three operational steps:

  1. Forensic acquisition of the sample: the executive records the voice sample through the TrueScreen App or the web portal, in a controlled environment. The recording captures metadata (device, geolocation, timestamp) and applies a forensic methodology that rules out any post-capture alteration.
  2. Certification with qualified seal: TrueScreen applies to the voice file a qualified electronic timestamp and a digital signature compliant with the eIDAS Regulation, preserving the evidence inside Digital Provenance with a traceable chain of custody.
  3. Operational use as baseline: in case of a suspicious request, the company can compare the received audio against the certified sample (a minimal comparison sketch follows this list). The baseline is admissible in court and carries evidentiary value for cyber insurance reimbursement claims.
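
The comparison in step 3 can be automated with standard speaker-verification techniques. Below is a minimal sketch under stated assumptions: extract_embedding is a hypothetical placeholder for a pretrained speaker-verification model (x-vector style), not a TrueScreen or library API, and the 0.75 similarity threshold is illustrative, not a calibrated value.

```python
# Sketch of the baseline comparison, assuming a speaker-verification
# model that maps audio to a fixed-size embedding. extract_embedding()
# is a placeholder, and the threshold is purely illustrative.
import numpy as np

def extract_embedding(audio_path: str) -> np.ndarray:
    """Placeholder: run a pretrained speaker-verification model here."""
    raise NotImplementedError

def matches_baseline(received_audio: str, baseline_audio: str,
                     threshold: float = 0.75) -> bool:
    a = extract_embedding(received_audio)
    b = extract_embedding(baseline_audio)
    # Cosine similarity between the two speaker embeddings.
    cosine = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    return cosine >= threshold  # below threshold: block and escalate
```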

The advantage over detection is clear: the baseline does not lose validity when a new generative model ships. Its strength does not lie in recognizing the false but in identifying the true with forensic certainty.

Operational example: an 800,000 euro wire request

On a video call, the CFO requests an urgent 800,000 euro wire to a new supplier. Company procedure prescribes three steps: (a) a callback to the CFO’s known number, recorded and compared against the certified voice baseline in TrueScreen; (b) verification of the new IBAN through a protected supplier database with dual signature by the procurement lead; (c) an operational limit of 250,000 euro per single transaction on new accounts, with manual escalation for higher amounts. If the voice in the callback does not match the certified baseline, the request is blocked and the incident response procedure kicks in.
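
As a hedged illustration of how such a policy can be encoded, the sketch below mirrors the thresholds from the example. The function and field names are assumptions, and the boolean inputs stand in for the outcomes of the callback voice comparison and the supplier-database check.

```python
# Illustrative approval policy for the example above: amounts mirror
# the procedure described in the text, names are hypothetical.
NEW_ACCOUNT_LIMIT_EUR = 250_000  # manual escalation above this amount

def approve_wire(amount_eur: float, iban_is_known: bool,
                 callback_voice_verified: bool, dual_signed: bool) -> str:
    if not callback_voice_verified:
        return "BLOCK: callback voice does not match certified baseline"
    if not iban_is_known and not dual_signed:
        return "BLOCK: new IBAN requires dual signature"
    if not iban_is_known and amount_eur > NEW_ACCOUNT_LIMIT_EUR:
        return "ESCALATE: amount above per-transaction limit on new accounts"
    return "APPROVE"

# The 800,000 euro request: even with a verified voice and dual
# signature, the amount on a new account forces manual escalation.
print(approve_wire(800_000, iban_is_known=False,
                   callback_voice_verified=True, dual_signed=True))
```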

Feature: TrueScreen App for forensic acquisition
Capture voice, video and documents with forensic methodology directly from a smartphone. Certified chain of custody and legal value. Discover more →

The value in litigation and insurance reimbursement

Cyber insurance policies are introducing specific exclusion clauses for social engineering and deepfake fraud: some exclude reimbursement if the company cannot show that reasonable controls were applied. Holding a certified voice baseline and a documented out-of-band verification procedure strengthens the insurance position. In court, the comparison between the received audio and the sample sealed with qualified electronic timestamp constitutes documentary technical evidence, weighed by the judge as an objective element rather than as subjective testimony. For a deeper dive on the difference between detection and source certification, the analysis on the limits of deepfake detection covers the topic in detail.

FAQ: voice cloning and corporate fraud

How many seconds of audio are needed to clone a voice?
According to McAfee, three seconds of decent quality audio are enough to produce a voice clone with 85% accuracy. Ten seconds push that figure above 95%. Commercial tools such as ElevenLabs and Resemble make this capability available to anyone. For CFOs and publicly exposed executives (interviews, podcasts, conference talks) the source material is already online.
How much has deepfake fraud risk grown in 2026?
Corporate fraud attempts using voice and video deepfakes grew by 300% in 2026 versus 2024, according to industry reports. Deloitte estimates that the total loss in financial services could reach 40 billion dollars by 2027. The Arup case (25.6 million dollars loss, 2024) is the first documented incident of multi-channel BEC plus video deepfake fraud.
Are AI detection tools reliable?
In lab conditions they reach high accuracy. In production, on short clips already compressed by videoconference codecs, average accuracy drops below 70%. Detection can act as a second filter, but it cannot be the pillar of defense for operations with significant economic impact. The opposite approach (certifying the authorized voice at source) offers an objective baseline that is independent of the evolution of generative models.
Is a certified voice baseline admissible as evidence in court?
A recording acquired with forensic methodology, sealed with a qualified electronic timestamp and a digital signature compliant with the eIDAS Regulation, qualifies as documentary technical evidence admissible in EU jurisdictions. In the United States, Federal Rules of Evidence 901 and 902 cover authentication of digital records and recognize self-authenticating records that satisfy specific provenance requirements. The evidentiary value rests on the traceable chain of custody and on the file’s immutability from the moment of acquisition.
Which regulations require controls against AI-assisted fraud?
In Europe the main ones are NIS2 (Directive EU 2022/2555) for essential and important entities, DORA (Regulation EU 2022/2554) for the financial sector operational since January 2025, and AI Act Article 50 (Regulation EU 2024/1689) for disclosure of artificially generated content. Standards ISO/IEC 27001 (ISMS) and ISO/IEC 27037 (digital evidence handling) provide the technical reference. All these frameworks strengthen the position of those who can prove the authenticity of their internal communications.

Voice certified at source: the foundation of verifiable defense

Evaluate a certification program for the authorized voices of your C-suite to protect payment processes, internal communications and your insurance position against voice cloning and deepfakes.
