📅 2026-01-11 23:00
🕒 Reading time: 9 min
🏷️ SCENE_CAST
The day after resolving NeuroPlay's MECE incident, a new consultation arrived: improving the accuracy of AI meeting minutes. Volume 31, "The Pursuit of Reproducibility," Episode 381, tells the story of deploying the optimal solution for each scene.
"Detective, our meetings are not being recorded. Or rather, they are recorded, but they're unusable. We introduced the AI meeting minutes tool 'donut AI,' but the accuracy is below 60%. It takes 2 hours to correct a 1-hour meeting record. At this point, it would be faster for someone to take handwritten notes."
Jennifer Kim, Product Manager at TechScribe Inc. from Silicon Valley, visited 221B Baker Street with an exhausted expression. In her hands were printouts of meeting minutes covered in red pen corrections, contrasting sharply with a hopeful project proposal titled "AI Minutes Revolution 2026."
"We're a B2B SaaS company. 120 employees. Annual revenue of 1.8 billion yen. 40 meetings per week. Sales, development, customer success, management meetings. We've tried AI meeting minutes for all meetings, but none are usable."
TechScribe Inc.'s Current Situation:
- Established: 2019 (B2B SaaS product development)
- Employees: 120
- Annual Revenue: 1.8 billion yen
- Weekly Meetings: 40
- Problem: AI meeting minutes accuracy below 60%; editing time double the meeting time
There was deep frustration in Jennifer's voice.
"The demo was perfect. The donut AI sales rep showed us a demo video with 95% accuracy transcription. 'Uses the latest Whisper API,' 'Handles technical terms,' 'Speaker identification 99%.' Everything looked wonderful. But when we actually implemented it, it was hell."
The Reality of AI Meeting Minutes Accuracy Collapse:

Case 1: Sales Meeting (5 participants, 60 minutes)
- Actual statement: "To improve client LTV, we'll refactor the onboarding flow"
- AI minutes: "To improve client erutiibii, we'll rifakutaringu the onboodingu furoo"
- Katakana representation rate: 78%
- Technical term misrecognitions: 12 per 60 minutes

Case 2: Development Meeting (8 participants, 90 minutes)
- Actual statement: "To resolve PostgreSQL's N+1 problem, implement Eager Loading"
- AI minutes: "To resolve posutoguresukiyuueru's enpurasichi problem, implement iigaaroodingu"
- Technical term misrecognitions: 23 per 90 minutes
- Speaker identification failures: 17 ("Speaker A" and "Speaker B" swap)

Case 3: Customer Success Meeting (4 participants, 45 minutes)
- Actual statement: "To improve churn rate from 3.2% to 2.5%, conduct an NPS survey"
- AI minutes: "To improve chaanreeto 3.2% to 2.5%, conduct enupiiesu survey"
- Metric name misrecognitions: 8 per 45 minutes
Monthly Editing Time Reality:
- 40 meetings/week × average 60 minutes = 2,400 minutes (40 hours) of meetings per week
- 1 hour of meeting → 2 hours of editing work
- Monthly editing time: 40 hours/week × 2 (editing ratio) × 4 weeks = 320 hours/month
- Staff: 3 people (sales assistant, development PMO, CS staff)
- Editing time per person: roughly 107 hours/month
Jennifer sighed deeply.
"There's another problem. We tried tools other than donut AI. Otter.ai, Notta, Rimo Voice. All with the same results. All 60% accuracy. Ideally, we need 80%. With 80%, we can fix it in 20 minutes. But with 60%, it takes 2 hours."
"Jennifer, do you think using the same AI model for all meetings will produce the same accuracy for all meetings?"
Jennifer showed a puzzled expression at my question.
"Huh, isn't that the case? I was told AI uses the latest Whisper API, so it should be highly accurate for any meeting."
Current Understanding (Universal AI Model):
- Expectation: one AI model handles all meetings
- Problem: the meeting's scene (context) is not considered
I explained the importance of deploying optimal solutions for each meeting type using Scene-Cast Theory.
"The problem is thinking 'use the same AI model for all meetings.' Scene-Cast Theory. By deploying the optimal object set (tools, models, settings) for each scene, we achieve reproducible accuracy improvements."
"Don't rely on universal AI. Deploy optimal solutions for each scene with Scene-Cast Theory."
"Meetings are always 'different plays performed on the same stage.' The key is deploying the right actors for each play."
"Apply Scene-Cast Theory's 3 steps: Scene Classification, Object Set Design, Deploy & Validate."
The three members began their analysis. Gemini developed "Scene-Cast Theory" on the whiteboard.
Scene-Cast Theory's 3 Steps:
1. Scene Classification: classify meeting types by their characteristics
2. Object Set Design: design the optimal combination for each scene
3. Deploy & Validate: measure and improve accuracy in actual operation
"Jennifer, let's first classify meetings by scene."
Step 1: Meeting Scene Classification (1 week)
Classification Axes:
- X-axis: technical term density (Low, Medium, High)
- Y-axis: speaker transition frequency (Low, Medium, High)
Scene Classification Results:
| Meeting Type | Technical Term Density | Speaker Transition | Weekly Count | Scene Name |
|---|---|---|---|---|
| Sales Meeting | Medium (LTV, CAC, MRR) | Medium (5 people) | 12 times | Scene A |
| Development Meeting | High (SQL, API, Git) | High (8 people) | 15 times | Scene B |
| CS Meeting | Medium (NPS, Churn) | Low (4 people) | 8 times | Scene C |
| Management Meeting | Low (qualitative discussion) | Low (3 people) | 5 times | Scene D |
Important Discoveries:
- Scene B (development meetings) has the lowest accuracy (high technical term density × high speaker transitions)
- Scene D (management meetings) has relatively high accuracy (low technical term density × low speaker transitions)
- donut AI applies the same model (Whisper Large V3) to all scenes

Essence of the Problem:
- The same model applied to all meetings ignores scene characteristics
- No technical term dictionary, so transcripts are full of katakana representations
- Insufficient speaker identification training, so "Speaker A" and "Speaker B" frequently swap
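In code form, Step 1 amounts to a lookup from the two classification axes to a scene label. Here is a minimal sketch in Python; the MeetingProfile type and SCENE_MAP table are hypothetical constructs built from the table above, not part of donut AI or any other tool:

```python
# Sketch of Step 1: map a meeting's axis ratings to a scene label.
from dataclasses import dataclass

@dataclass(frozen=True)
class MeetingProfile:
    name: str
    term_density: str         # "low" | "medium" | "high"
    speaker_transitions: str  # "low" | "medium" | "high"

# (term density, speaker transitions) -> scene, from the table above.
SCENE_MAP = {
    ("medium", "medium"): "Scene A",  # sales
    ("high", "high"): "Scene B",      # development
    ("medium", "low"): "Scene C",     # customer success
    ("low", "low"): "Scene D",        # management
}

def classify(profile: MeetingProfile) -> str:
    """Return the scene label for a meeting profile."""
    key = (profile.term_density, profile.speaker_transitions)
    try:
        return SCENE_MAP[key]
    except KeyError:
        raise ValueError(f"No scene defined for {key}; extend SCENE_MAP")

print(classify(MeetingProfile("weekly dev standup", "high", "high")))  # Scene B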
Step 2: Object Set Design (2 weeks)
Object Set for Scene A (Sales Meetings):
- AI Model: Whisper Large V3 + custom vocabulary (300 sales terms)
- Technical Term Dictionary: LTV, CAC, MRR, ARR, churn, onboarding
- Speaker Identification: pre-train voice prints of the 5 participants (a 5-minute audio sample each)
- Post-processing: automatic katakana-to-alphanumeric conversion script

Object Set for Scene B (Development Meetings):
- AI Model: Whisper Large V3 + custom vocabulary (500 technical terms)
- Technical Term Dictionary: PostgreSQL, N+1, Eager Loading, Git, API, Docker, Kubernetes
- Speaker Identification: pre-train voice prints of the 8 participants (a 5-minute audio sample each)
- Post-processing: technical term notation unification (e.g., posutoguresu → PostgreSQL)

Object Set for Scene C (CS Meetings):
- AI Model: Whisper Large V3 + custom vocabulary (200 CS terms)
- Technical Term Dictionary: NPS, churn rate, onboarding, retention
- Speaker Identification: pre-train voice prints of the 4 participants (a 5-minute audio sample each)
- Post-processing: metric numerical format unification

Object Set for Scene D (Management Meetings):
- AI Model: Whisper Large V3 (standard settings)
- Technical Term Dictionary: minimal (EBITDA, KPI, etc.; about 20 words)
- Speaker Identification: pre-train voice prints of the 3 participants (a 5-minute audio sample each)
- Post-processing: minimal
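Because the four object sets differ only in their parameters, they can be expressed as plain declarative configuration. A sketch follows; the ObjectSet fields, vocabulary file paths, and post-processor names are illustrative assumptions, not artifacts of any real tool:

```python
# Sketch of Step 2: each scene's object set as configuration.
from dataclasses import dataclass

@dataclass(frozen=True)
class ObjectSet:
    model: str                        # base ASR model
    vocabulary_file: str              # JSON technical-term dictionary
    speaker_profiles: int             # number of pre-enrolled voice prints
    post_processors: tuple[str, ...]  # post-processing steps, in order

OBJECT_SETS: dict[str, ObjectSet] = {
    "Scene A": ObjectSet("whisper-large-v3", "vocab/sales_300.json", 5,
                         ("katakana_to_alphanumeric",)),
    "Scene B": ObjectSet("whisper-large-v3", "vocab/dev_500.json", 8,
                         ("term_notation_unification",)),
    "Scene C": ObjectSet("whisper-large-v3", "vocab/cs_200.json", 4,
                         ("metric_format_unification",)),
    "Scene D": ObjectSet("whisper-large-v3", "vocab/management_20.json", 3,
                         ()),
}
```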
Step 3: Deploy & Validate (Months 1-2: Prototype Development)
Technical Configuration:
- Base Model: OpenAI Whisper Large V3
- Customization: no fine-tuning required; handled through prompt engineering
- Technical Term Dictionary: managed in JSON format, switched per scene
- Post-processing Script: Python + regular expressions for automatic conversion
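A minimal sketch of that prompt-engineering step, assuming the open-source openai-whisper package is used to run Whisper Large V3 (the file names are hypothetical). The idea is to seed the decoder with the canonical spellings from the scene's JSON dictionary via initial_prompt, which biases recognition toward them:

```python
# Sketch: bias Whisper toward a scene's technical terms via its initial
# prompt. Assumes `pip install openai-whisper`; file names are illustrative.
import json

import whisper

def transcribe_for_scene(audio_path: str, vocab_path: str) -> str:
    # Load the scene's technical-term dictionary (JSON, switched per scene).
    with open(vocab_path, encoding="utf-8") as f:
        vocabulary = json.load(f)
    # Whisper conditions on the initial prompt, so listing the correct
    # spellings nudges the decoder toward them. Note that Whisper keeps
    # only roughly the last 224 tokens of this prompt.
    glossary = "Glossary: " + ", ".join(vocabulary.values())
    model = whisper.load_model("large-v3")
    result = model.transcribe(audio_path, initial_prompt=glossary)
    return result["text"]
```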
Implementation Example (Scene B: Development Meetings)

Technical term dictionary (kept as JSON on disk; shown here as the Python dict it loads into):

```python
# Katakana romanization -> canonical spelling, for Scene B.
tech_vocabulary = {
    "posutoguresukiyuueru": "PostgreSQL",
    "enpurasichi": "N+1",
    "iigaaroodingu": "Eager Loading",
    "gitto": "Git",
    "eepiiāi": "API",
}
```

Post-processing script:

```python
# Replace each katakana romanization with its canonical spelling.
# Longer keys are applied first so a short key never clobbers part
# of a longer match.
def post_process(transcript: str, vocabulary: dict[str, str]) -> str:
    for katakana in sorted(vocabulary, key=len, reverse=True):
        transcript = transcript.replace(katakana, vocabulary[katakana])
    return transcript
```
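Since the technical configuration above calls for regular expressions, a slightly more robust variant compiles the whole dictionary into one alternation and rewrites the transcript in a single pass. A sketch, reusing tech_vocabulary from above:

```python
import re

def post_process_re(transcript: str, vocabulary: dict[str, str]) -> str:
    # One compiled alternation, longest keys first, single substitution pass.
    pattern = re.compile(
        "|".join(re.escape(k) for k in sorted(vocabulary, key=len, reverse=True))
    )
    return pattern.sub(lambda m: vocabulary[m.group(0)], transcript)

sample = ("To resolve posutoguresukiyuueru's enpurasichi problem, "
          "implement iigaaroodingu")
print(post_process_re(sample, tech_vocabulary))
# -> To resolve PostgreSQL's N+1 problem, implement Eager Loading
```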
Month 3: Effectiveness Measurement
KPI 1: Transcription Accuracy (Technical Term Accuracy Rate)
| Scene | Before | After | Improvement (relative) |
|---|---|---|---|
| Scene A (Sales) | 58% | 82% | +41% |
| Scene B (Development) | 54% | 79% | +46% |
| Scene C (CS) | 61% | 84% | +38% |
| Scene D (Management) | 68% | 88% | +29% |
Overall Average Accuracy:
- Before: 60%
- After: 83%
- Improvement: +23 points (+38% relative)
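The Improvement figures are relative gains over the Before value, not percentage-point differences, as this quick check reproduces:

```python
# Relative improvement = (after - before) / before, per scene.
accuracy = {
    "Scene A": (58, 82), "Scene B": (54, 79),
    "Scene C": (61, 84), "Scene D": (68, 88),
}
for scene, (before, after) in accuracy.items():
    print(f"{scene}: {(after - before) / before:+.0%}")
# Scene A: +41%  Scene B: +46%  Scene C: +38%  Scene D: +29%
```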
KPI 2: Editing Time
| Scene | Before | After | Reduction |
|---|---|---|---|
| Scene A (Sales) | 120 min | 25 min | 79% |
| Scene B (Development) | 180 min | 35 min | 81% |
| Scene C (CS) | 90 min | 18 min | 80% |
| Scene D (Management) | 90 min | 15 min | 83% |
Monthly Editing Time:
- Before: 320 hours/month
- After: 68 hours/month
- Time saved: 252 hours/month
Annual Impact:
Personnel Cost Reduction:
- Time saved: 252 hours/month × 12 months = 3,024 hours/year
- Staff hourly rate: 3,200 yen (annual salary of 6 million yen ÷ 1,875 hours)
- Personnel cost reduction: 3,024 hours × 3,200 yen ≈ 9.67 million yen/year

Investment:
- Custom vocabulary dictionary creation: 600,000 yen
- Voice print training data collection: 400,000 yen
- Post-processing script development: 1 million yen
- Total initial investment: 2 million yen
- Annual AI API cost: 1.8 million yen (Whisper API usage fees)

ROI:
- (9.67 million yen − 1.8 million yen) ÷ 2 million yen × 100 ≈ 394%
- Payback period: 2 million yen ÷ 7.87 million yen/year ≈ 0.25 years (3 months)
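As a sketch, the same arithmetic in a few lines, using only the figures above (monetary amounts in millions of yen):

```python
# Reproduce the ROI arithmetic above.
hours_saved_per_year = 252 * 12                   # 3,024 hours
hourly_rate = 3_200                               # yen/hour
saved = hours_saved_per_year * hourly_rate / 1e6  # ≈ 9.68M yen/year
annual_api_cost = 1.8
initial_investment = 2.0
net_annual_benefit = saved - annual_api_cost      # ≈ 7.87M yen/year
roi_pct = net_annual_benefit / initial_investment * 100
payback_months = initial_investment / net_annual_benefit * 12
print(f"ROI ≈ {roi_pct:.0f}%, payback ≈ {payback_months:.0f} months")
# -> ROI ≈ 394%, payback ≈ 3 months
```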
That night, I reflected on the essence of Scene-Cast Theory.
TechScribe Inc. held the illusion that "using the same AI model for all meetings" would work. However, sales meetings, development meetings, CS meetings, and management meetings are each different scenes. Technical term density, speaker transition frequency, and required accuracy all differ.
Using Scene-Cast Theory, we classified meetings into four scenes (A-D) and designed optimal object sets (AI model + technical term dictionary + speaker identification + post-processing) for each scene. As a result, accuracy improved from 60% to 83%, and editing time was reduced from 320 hours/month to 68 hours/month.
Annual personnel cost reduction of 9.67 million yen, ROI of 394%, payback period of 3 months.
The key is not pursuing "universal AI" but deploying "scene-optimized AI." Even with the same Whisper Large V3, accuracy improves dramatically just by adding a technical term dictionary and post-processing.
"Don't rely on universal AI. Deploy optimal solutions for each scene with Scene-Cast Theory. Different plays performed on the same stage require the right actors for each play. Reproducible accuracy improvement begins with understanding the scene."
The next case will also depict the moment of deploying optimal solutions for each scene.
"Scene-Cast Theory. Deploy optimal object sets for each scene. There is no universal solution. By understanding scene characteristics and deploying optimal solutions, reproducible results emerge."—From the Detective's Notes