Hondurasgate — How the Verification Was Done

This is the companion post to Hondurasgate — Findings. It explains the technical process. The findings post covers what was found; this one covers how.

Why independent verification was needed

Hondurasgate’s dossier presented its forensic verification as powered by Phonexia Voice Inspector — a commercial product used in over 60 countries by courts and intelligence agencies. Phonexia publicly denied this and initiated legal action. Hondurasgate admitted it had used “API endpoints” while displaying Phonexia branding without authorization, then replaced all mentions of Phonexia with “HG Forensics” — an internal engine with no auditable track record — within five hours of the exposure.

That left the identity of the speakers in the recordings unestablished by any credible method. The work described here was done to fill that gap: independent analysis using fully open-source tools that any third party can replicate without commercial licenses.

Speaker identification — the tool

WavLM-Base+ (microsoft/wavlm-base-plus-sv) is a model developed by Microsoft Research and described by Chen et al. (2022). It has 101 million parameters, is pre-trained with masked speech prediction, and is fine-tuned for speaker verification; its model card describes the fine-tuning as done on VoxCeleb1 with an X-vector head, and the WavLM paper reports results on the SUPERB benchmark. Its weights, architecture, and training methodology are documented in peer-reviewed literature. It is publicly available on HuggingFace at no cost.

Why this model instead of a commercial product: its architecture is auditable, its outputs are reproducible, and it was selected specifically because the analysis can be verified or challenged by any independent party — the opposite of “HG Forensics.”

How it works: WavLM converts an audio segment into a 256-dimensional embedding that represents the biometric characteristics of the speaker’s voice, the features that distinguish one person’s voice from another. To compare two audio samples, cosine similarity is computed between their embeddings. Cosine similarity can in principle range from -1 to 1; every score reported here falls between 0 and 1, and values near 1 indicate the same speaker.
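
The comparison step can be sketched in a few lines. This is a minimal illustration, not the code used for the analysis: the function name is hypothetical, and toy 3-dimensional vectors stand in for real speaker embeddings.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors stand in for real speaker embeddings.
same = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])        # parallel
unrelated = cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])   # orthogonal
opposite = cosine_similarity([1.0, 2.0, 3.0], [-1.0, -2.0, -3.0])  # anti-parallel
print(same, unrelated, opposite)
```

The score depends only on the direction of the two vectors, not their length, which is why embeddings are usually compared after L2 normalization.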

Preprocessing

The Hondurasgate recordings analyzed here are the original OGG/Opus files at 8,000 Hz mono, downloaded directly from hondurasgate.ch, not YouTube copies. All audio was converted to WAV at 16,000 Hz mono using ffmpeg before analysis. Normalizing the format gives every file the 16 kHz mono input the model expects and removes container and sample-rate differences from the comparison. It does not undo differences in microphone or codec, and upsampling 8 kHz audio adds no information above 4 kHz.
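
A conversion like the one described can be scripted as follows. This is a sketch under assumptions: the helper names and file names are hypothetical, and only the standard ffmpeg options `-i` (input), `-ar` (sample rate), `-ac` (channel count) and `-y` (overwrite output) are used.

```python
import subprocess

def build_ffmpeg_cmd(src, dst, rate_hz=16000, channels=1):
    """Build an ffmpeg command that resamples src to mono WAV at rate_hz."""
    return ["ffmpeg", "-y", "-i", src,
            "-ar", str(rate_hz), "-ac", str(channels), dst]

def convert(src, dst):
    """Run the conversion; raises CalledProcessError if ffmpeg fails."""
    subprocess.run(build_ffmpeg_cmd(src, dst), check=True)

# Hypothetical file names, shown only to illustrate the resulting command.
cmd = build_ffmpeg_cmd("recording.ogg", "recording_16k.wav")
print(" ".join(cmd))
```

Calling `convert("recording.ogg", "recording_16k.wav")` requires ffmpeg to be installed and on the PATH.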

Calibration — the critical step

A similarity score by itself means nothing. 0.80 could indicate the same speaker or a different one depending on the model, the audio conditions, and the specific speakers involved. Before analyzing any of the Hondurasgate recordings, the model was validated against known identities using the same speakers and similar audio conditions.

Three reference samples of Cossette López from documented public appearances were used (32.6s, 27.4s, and 90.9s). The reference embedding was the normalized average of all three, to reduce the effect of variation in acoustic conditions in any single sample.
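
One way to build such a reference is sketched below. The post does not say whether each sample embedding was normalized before averaging, so normalizing both before and after the mean is an assumption here, and the function names are hypothetical.

```python
import numpy as np

def l2_normalize(v):
    """Scale a vector to unit length."""
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v)

def reference_embedding(embeddings):
    """Average unit-length sample embeddings, then renormalize the mean."""
    unit = np.stack([l2_normalize(e) for e in embeddings])
    return l2_normalize(unit.mean(axis=0))

# Toy 2-dimensional vectors stand in for the three reference embeddings.
ref = reference_embedding([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
print(ref, np.linalg.norm(ref))
```

Normalizing each sample first keeps a long or loud recording from dominating the average simply because its embedding has a larger norm.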

Positive controls — same speaker, known identity:

Comparison	Score
López ref 1 vs ref 2	0.9509
López ref 1 vs ref 3	0.9486
López ref 2 vs ref 3	0.9802

Negative control — different speaker, known identity (JOH, 33.9s):

Comparison	Score
López ref 1 vs JOH	0.7184
López ref 2 vs JOH	0.6616

Discrimination gap established empirically: about 0.27 points between the mean same-speaker score (0.96) and the mean different-speaker score (0.69), and 0.23 points between the lowest same-speaker score (0.9486) and the highest different-speaker score (0.7184). The decision threshold was set at 0.88, derived from this specific dataset rather than from generic literature. This is the threshold that matters. A model is only as useful as the calibration around it.
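
The arithmetic behind the calibration can be reproduced directly from the control scores above. The variable names are illustrative; the numbers are the ones in the two control tables.

```python
from statistics import mean

same_speaker = [0.9509, 0.9486, 0.9802]   # López reference pairs
different_speaker = [0.7184, 0.6616]      # López references vs JOH
threshold = 0.88

mean_gap = mean(same_speaker) - mean(different_speaker)
range_gap = min(same_speaker) - max(different_speaker)
midpoint = (min(same_speaker) + max(different_speaker)) / 2

print(round(mean_gap, 4))    # 0.2699
print(round(range_gap, 4))   # 0.2302
print(round(midpoint, 4))    # 0.8335
print(max(different_speaker) < threshold < min(same_speaker))  # True
```

The 0.88 threshold sits above the midpoint between the two score ranges (0.8335), so it errs toward rejecting a match rather than accepting one.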

Results for Cossette López

Recording	Duration	Score	Result
“Threat: same bullet that killed Marlon, Salvador and Iroshka”	19.3s	0.9394	Same speaker, high confidence
“Offers millions to buy votes against Marlon”	28.4s	0.9246	Same speaker, high confidence

Both scores are above the decision threshold (0.88) and more than 0.20 points above the top of the negative control range (0.72). The result is unambiguous within the calibration established.

What this proves and what it does not:

Question	Answer
Is the voice Cossette López’s?	Yes, high confidence
Is the voice human (not AI)?	Yes, implied by high similarity with known human voice
Does she say what the audio title claims?	Not determinable by this method
Is the recording from the alleged context?	Not determinable by this method
Was the audio edited?	See edit detection below

Results for JOH and Asfura — Earshot

An independent report by Earshot, a nonprofit specializing in sonic investigation, was completed on May 13, 2026 and published on May 19 by Drop Site News. Earshot used Resemblyzer, a different tool based on a different architecture, without any coordination with the WavLM-Base+ analysis above.

Resemblyzer converts vocal characteristics (fundamental frequencies, harmonics, timbre, rhythm, cadence) into a numerical similarity score on a 0–1 scale. Reference samples were drawn from YouTube: a podcast for JOH, an interview for Asfura. Negative controls (decoys) were Halle Berry and Robert De Niro.

Recording	Speaker	Score	Decoy	Gap	Result
“Thanks to me he’s sitting in that chair”	JOH	0.67	0.47	+0.20	Probably authentic
“María Antonieta, US office funds, Milei, Mexico”	JOH	0.66	0.48	+0.18	Probably authentic
“ZEDEs, Roatán, Comayagua, military base, GE, Argentina”	Asfura	0.66	0.48	+0.18	Probably authentic

Earshot’s conclusion: recordings are “authentic recordings of the voices of Hernández and Asfura and probably not AI-generated” — moderate confidence, limited by telephone audio quality.

Earshot also identified speech artifacts in all three recordings — breath sounds, hesitations, ambient noise, and in one recording a microphone distortion caused by direct exhalation onto the diaphragm — not found in AI-generated audio in prior investigations.

The absolute scores are not comparable between WavLM-Base+ and Resemblyzer — different architectures, different scales. What is comparable is the discrimination gap above the negative control. Both show clear, consistent separation between the attributed voice and different speakers. Three independent analyses, two different tools, no coordination, published the same day.

Edit detection — three methods

Identifying the speaker does not answer whether the audio was assembled from fragments recorded on different occasions. A recording can contain someone’s real voice, saying real things at real moments, edited together to create a conversation that never happened as presented. This is a separate question requiring separate methods.

Three methods were applied in parallel. All three were also applied to an unedited reference recording (López ref 3, 90.9s) to establish baseline behavior — a control that is essential for interpreting the results.

Method 1 — ENF (Electric Network Frequency)

Principle: Honduras’s power grid operates at nominally 60 Hz. This frequency fluctuates slightly and continuously in a unique pattern over time. Recording devices capture this as a faint signal in the audio. A cut between two recordings made at different moments produces a phase discontinuity in the ENF signal that does not match the natural continuous variation.

Implementation: 1-second analysis windows, 0.5s step. 4th-order Butterworth bandpass filter, ±0.5 Hz around 60, 120, 180, and 240 Hz. Phase extracted via Hilbert transform. Detection threshold: |Δphase| > 0.8 radians between consecutive windows.
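
A detector with these parameters could look roughly like the sketch below. It is not the original code. Several details are assumptions the post does not specify: the function name, demodulating against the nominal carrier, taking each window's phase as the angle of the mean baseband value, and handling one harmonic per call. The demo uses a synthetic 60 Hz tone at a low sample rate, with a sign flip standing in for a splice.

```python
import numpy as np
from scipy.signal import butter, hilbert, sosfiltfilt

def enf_phase_hits(x, fs, f0=60.0, half_bw=0.5, win_s=1.0, step_s=0.5,
                   thresh_rad=0.8):
    """Return window-centre times (s) where the ENF phase jumps."""
    x = np.asarray(x, dtype=float)
    # N=4 Butterworth bandpass (SciPy doubles the order for bandpass);
    # second-order sections keep the very narrow band numerically stable.
    sos = butter(4, [f0 - half_bw, f0 + half_bw], btype="bandpass",
                 fs=fs, output="sos")
    narrow = sosfiltfilt(sos, x)
    # Analytic signal via Hilbert transform, then remove the nominal
    # carrier so only the slowly varying phase offset remains (assumption).
    t = np.arange(len(x)) / fs
    baseband = hilbert(narrow) * np.exp(-2j * np.pi * f0 * t)
    win, step = int(round(win_s * fs)), int(round(step_s * fs))
    starts = np.arange(0, len(x) - win + 1, step)
    phases = np.array([np.angle(baseband[s:s + win].mean()) for s in starts])
    # Wrap phase differences into (-pi, pi] before thresholding.
    dphi = np.angle(np.exp(1j * np.diff(phases)))
    centers = (starts + win / 2) / fs
    return [float(centers[k + 1])
            for k in np.flatnonzero(np.abs(dphi) > thresh_rad)]

# Synthetic demo: a clean 60 Hz tone, and a copy whose phase flips at 10.2 s.
fs = 1000
t = np.arange(20 * fs) / fs
clean = np.sin(2 * np.pi * 60.0 * t)
spliced = clean.copy()
spliced[int(10.2 * fs):] *= -1.0
clean_hits = enf_phase_hits(clean, fs)
splice_hits = enf_phase_hits(spliced, fs)
print(clean_hits, splice_hits)
```

To cover the harmonics listed above, the same function would be called with `f0` set to 120, 180 and 240. Real recordings carry a far weaker mains component than this synthetic tone, so hit counts on speech will be much noisier.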

Result: ENF hit rates were elevated in both recordings (4.82/s for “threat”, 4.05/s for “millions”) compared to the WAV reference baseline (1.39/s). However, the hits were distributed uniformly across both recordings rather than concentrated at specific moments. The likely cause is the codec: OGG/Opus frame boundaries introduce phase discontinuities that inflate the ENF hit count throughout the file. That pattern points to a codec artifact rather than a splice, since a real edit would be expected to produce a concentration of hits at one or two specific timestamps.
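
The post does not say how uniformity was assessed. One possible check, offered here as an assumption and not as the method actually used, is a Kolmogorov-Smirnov test of the hit timestamps against a uniform distribution over the recording. The function name and the synthetic hit times are illustrative.

```python
import numpy as np
from scipy.stats import kstest

def uniformity_pvalue(hit_times_s, duration_s):
    """KS-test p-value for hit times being uniform on [0, duration_s]."""
    scaled = np.asarray(hit_times_s, dtype=float) / duration_s
    return float(kstest(scaled, "uniform").pvalue)

duration = 20.0
spread_hits = np.linspace(0.1, 19.9, 90)    # hits throughout the file
clustered_hits = np.linspace(9.0, 9.5, 50)  # hits bunched around one moment
p_spread = uniformity_pvalue(spread_hits, duration)
p_clustered = uniformity_pvalue(clustered_hits, duration)
print(p_spread, p_clustered)
```

A very small p-value means the hits are concentrated somewhere, which is the pattern a splice would be expected to produce. A large p-value is consistent with hits spread across the whole file.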

Method 2 — Background noise profile

Principle: Every recording environment and device has a characteristic spectral noise fingerprint. Assembling recordings from different sources produces a change in that profile at the splice point.

Implementation: 2-second segments, no overlap. Power Spectral Density via Welch method. Background noise = 10th percentile of PSD per frequency band. Change detection: L2 distance with z-score > 2.5.
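
The sketch below is one reading of that description, not the original implementation. The band count, the Welch segment length, working in decibels, and taking the percentile within each band are all assumptions, as is the function name. The demo concatenates two synthetic noise recordings whose levels differ by 20 dB.

```python
import numpy as np
from scipy.signal import welch

def noise_profile_changes(x, fs, seg_s=2.0, n_bands=16, pct=10,
                          z_thresh=2.5):
    """Return boundary times (s) where the noise-floor profile changes."""
    x = np.asarray(x, dtype=float)
    seg = int(round(seg_s * fs))
    profiles = []
    for start in range(0, len(x) - seg + 1, seg):
        # Welch PSD of one non-overlapping segment, in dB.
        _, psd = welch(x[start:start + seg], fs=fs, nperseg=512)
        psd_db = 10.0 * np.log10(psd + 1e-20)
        # Noise floor per band: low percentile of the PSD inside the band.
        bands = np.array_split(psd_db, n_bands)
        profiles.append([np.percentile(b, pct) for b in bands])
    profiles = np.asarray(profiles)
    # L2 distance between consecutive profiles, then z-score the distances.
    dist = np.linalg.norm(np.diff(profiles, axis=0), axis=1)
    z = (dist - dist.mean()) / (dist.std() + 1e-12)
    return [float((k + 1) * seg_s) for k in np.flatnonzero(z > z_thresh)]

# Synthetic demo: 20 s of quiet noise followed by 20 s of louder noise.
rng = np.random.default_rng(0)
fs = 8000
quiet = rng.normal(0.0, 1.0, 20 * fs)
loud = rng.normal(0.0, 10.0, 20 * fs)
changes = noise_profile_changes(np.concatenate([quiet, loud]), fs)
print(changes)
```

Because the z-score is relative to the recording's own distances, a stationary recording can still produce an occasional flag, which is why comparing against an unedited control matters.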

Result: Zero changes in either Hondurasgate recording. Two minor changes in the unedited reference recording. This is the most robust indicator against assembly: if the audio were spliced from recordings made in different environments or on different devices, the spectral profile would change at the splice. It does not change.

Method 3 — Spectral flux

Principle: Natural speech produces gradual spectral changes between frames. An edit cut produces an abrupt flux spike at the moment of the splice.

Implementation: 30ms frames, 10ms step, Hanning window. Flux = sum of positive increments of spectrum between consecutive frames. Peak detection: z-score > 4.0; first and last 0.5s excluded.
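
A flux detector with these settings might be sketched as follows. The function name is hypothetical, and two details are assumptions: the flux is computed on FFT magnitudes, and the z-score is taken over the frames that remain after trimming the edges. The demo is a synthetic tone that switches frequency abruptly at 1.5 s.

```python
import numpy as np

def spectral_flux_hits(x, fs, frame_s=0.030, hop_s=0.010, z_thresh=4.0,
                       edge_s=0.5):
    """Return frame-centre times (s) of abrupt spectral flux peaks."""
    x = np.asarray(x, dtype=float)
    frame, hop = int(round(frame_s * fs)), int(round(hop_s * fs))
    window = np.hanning(frame)
    starts = np.arange(0, len(x) - frame + 1, hop)
    mags = np.array([np.abs(np.fft.rfft(window * x[s:s + frame]))
                     for s in starts])
    # Flux: sum of positive magnitude increments between consecutive frames.
    flux = np.maximum(np.diff(mags, axis=0), 0.0).sum(axis=1)
    times = (starts[1:] + frame / 2) / fs
    # Drop the first and last edge_s seconds before scoring.
    keep = (times >= edge_s) & (times <= len(x) / fs - edge_s)
    flux, times = flux[keep], times[keep]
    z = (flux - flux.mean()) / (flux.std() + 1e-12)
    return [float(t) for t in times[z > z_thresh]]

# Synthetic demo: a 440 Hz tone that jumps to 1200 Hz at 1.5 s, plus noise.
fs = 8000
n = 3 * fs
t = np.arange(n) / fs
rng = np.random.default_rng(0)
tone = np.where(np.arange(n) < int(1.5 * fs),
                np.sin(2 * np.pi * 440.0 * t),
                np.sin(2 * np.pi * 1200.0 * t))
x = 0.5 * tone + 0.01 * rng.normal(size=n)
hits = spectral_flux_hits(x, fs)
print(hits)
```

Natural speech also contains abrupt onsets such as plosives, so a detector like this fires on unedited audio too. That is why the hit rate on a known unedited control is the reference point for interpreting the counts.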

Recording	Duration	Flux hits	Hits/s	Noise Δ
Unedited reference (control)	90.9s	126	1.39/s	1
Hondurasgate — “threat”	19.3s	26	1.35/s	0
Hondurasgate — “millions”	28.4s	46	1.62/s	0

The hit rates are comparable across both Hondurasgate recordings and the unedited control (1.35/s to 1.62/s), so the number of hits scales roughly with duration. These are not anomalies; they are baseline behavior of the detector on compressed telephone audio.

Conclusion: no editing indicators detected in either recording.

What these methods cannot exclude

A professionally executed edit that carefully preserves ENF phase continuity and background noise profile could evade these methods. The ENF baseline should ideally use an unedited OGG/Opus recording rather than a WAV reference: OGG codec frame boundaries introduce phase discontinuities that inflate the ENF hit rate, as observed here (4.82/s and 4.05/s in the Hondurasgate recordings vs 1.39/s in the WAV control). The uniform distribution of hits is consistent with codec artifacts rather than edits, but without an Opus control it does not prove it. Finally, the date and location of the recordings cannot be determined by any of the methods used here. Confirming those would require correlating the ENF signal against Honduras’s ENEE power grid database for the period January–April 2026.

The analysis establishes what can be established with these tools. The limits are documented because they are real.

Sources · Phonexia denial and legal action: Contracorriente.red, 19 May 2026 · Earshot forensic report, completed 13 May 2026, published by Drop Site News 19 May 2026: dropsitenews.com · WavLM-Base+ model: huggingface.co/microsoft/wavlm-base-plus-sv · Chen et al. (2022), “WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing”, IEEE Journal of Selected Topics in Signal Processing · SUPERB benchmark: superbbenchmark.org · For the full findings: Hondurasgate — What the Audios Say · Step-by-step replication guide: How to Verify a Voice with Open-Source Tools
