📅 2026-01-11 23:00
🕒 Reading time: 9 min
🏷️ SCENE_CAST
The day after resolving NeuroPlay's MECE incident, a new consultation arrived: improving the accuracy of AI meeting minutes. Volume 31, "The Pursuit of Reproducibility," Episode 381, tells the story of deploying the optimal solution for each scene.
"Detective, our meetings are not being recorded. Or rather, they are recorded, but they're unusable. We introduced the AI meeting minutes tool 'donut AI,' but the accuracy is below 60%. It takes 2 hours to correct a 1-hour meeting record. At this point, it would be faster for someone to take handwritten notes."
Jennifer Kim, Product Manager at TechScribe Inc. from Silicon Valley, visited 221B Baker Street with an exhausted expression. In her hands were printouts of meeting minutes covered in red pen corrections, contrasting sharply with a hopeful project proposal titled "AI Minutes Revolution 2026."
"We're a B2B SaaS company. 120 employees. Annual revenue of 1.8 billion yen. 40 meetings per week. Sales, development, customer success, management meetings. We've tried AI meeting minutes for all meetings, but none are usable."
TechScribe Inc.'s Current Situation:
- Established: 2019 (B2B SaaS product development)
- Employees: 120
- Annual Revenue: 1.8 billion yen
- Weekly Meetings: 40
- Problem: AI meeting minutes accuracy below 60%; editing time double the meeting time
There was deep frustration in Jennifer's voice.
"The demo was perfect. The donut AI sales rep showed us a demo video with 95% accuracy transcription. 'Uses the latest Whisper API,' 'Handles technical terms,' 'Speaker identification 99%.' Everything looked wonderful. But when we actually implemented it, it was hell."
The Reality of AI Meeting Minutes Accuracy Collapse:

Case 1: Sales Meeting (5 participants, 60 minutes)
- Actual statement: "To improve client LTV, we'll refactor the onboarding flow"
- AI minutes: "To improve client erutiibii, we'll rifakutaringu the onboodingu furoo"
- Katakana representation rate: 78%
- Technical term misrecognitions: 12 per 60 minutes

Case 2: Development Meeting (8 participants, 90 minutes)
- Actual statement: "To resolve PostgreSQL's N+1 problem, implement Eager Loading"
- AI minutes: "To resolve posutoguresukiyuueru's enpurasichi problem, implement iigaaroodingu"
- Technical term misrecognitions: 23 per 90 minutes
- Speaker identification failures: 17 ("Speaker A" and "Speaker B" swap)

Case 3: Customer Success Meeting (4 participants, 45 minutes)
- Actual statement: "To improve churn rate from 3.2% to 2.5%, conduct an NPS survey"
- AI minutes: "To improve chaanreeto 3.2% to 2.5%, conduct enupiiesu survey"
- Metric name misrecognitions: 8 per 45 minutes
Monthly Editing Time Reality:
- 40 meetings/week × average 60 minutes = 2,400 minutes (40 hours) of meetings per week
- 1 hour of meeting → 2 hours of editing work
- Monthly editing time: 40 hours/week × 2 (editing ratio) × 4 weeks = 320 hours/month
- Staff: 3 people (sales assistant, development PMO, CS staff)
- Editing time per person: roughly 107 hours/month
Jennifer sighed deeply.
"There's another problem. We tried tools other than donut AI. Otter.ai, Notta, Rimo Voice. All with the same results. All 60% accuracy. Ideally, we need 80%. With 80%, we can fix it in 20 minutes. But with 60%, it takes 2 hours."
"Jennifer, do you think using the same AI model for all meetings will produce the same accuracy for all meetings?"
Jennifer showed a puzzled expression at my question.
"Huh, isn't that the case? I was told AI uses the latest Whisper API, so it should be highly accurate for any meeting."
Current Understanding (Universal AI Model):
- Expectation: one AI model handles all meetings
- Problem: the meeting's scene (context) is not considered
I explained the importance of deploying optimal solutions for each meeting type using Scene-Cast Theory.
"The problem is thinking 'use the same AI model for all meetings.' Scene-Cast Theory. By deploying the optimal object set (tools, models, settings) for each scene, we achieve reproducible accuracy improvements."
"Don't rely on universal AI. Deploy optimal solutions for each scene with Scene-Cast Theory."
"Meetings are always 'different plays performed on the same stage.' The key is deploying the right actors for each play."
"Apply Scene-Cast Theory's 3 steps: Scene Classification, Object Set Design, Deploy & Validate."
The three members began their analysis. Gemini developed "Scene-Cast Theory" on the whiteboard.
Scene-Cast Theory's 3 Steps:
1. Scene Classification: classify meeting types by their characteristics
2. Object Set Design: design the optimal combination for each scene
3. Deploy & Validate: measure and improve accuracy in actual operation
"Jennifer, let's first classify meetings by scene."
Step 1: Meeting Scene Classification (1 week)
Classification Axes:
- X-axis: technical term density (Low, Medium, High)
- Y-axis: speaker transition frequency (Low, Medium, High)
Scene Classification Results:
| Meeting Type | Technical Term Density | Speaker Transition | Weekly Count | Scene Name |
|---|---|---|---|---|
| Sales Meeting | Medium (LTV, CAC, MRR) | Medium (5 people) | 12 times | Scene A |
| Development Meeting | High (SQL, API, Git) | High (8 people) | 15 times | Scene B |
| CS Meeting | Medium (NPS, Churn) | Low (4 people) | 8 times | Scene C |
| Management Meeting | Low (qualitative discussion) | Low (3 people) | 5 times | Scene D |
Important Discoveries:
- Scene B (development meetings) has the lowest accuracy (high technical term density × high speaker transitions)
- Scene D (management meetings) has relatively high accuracy (low technical term density × low speaker transitions)
- donut AI applies the same model (Whisper Large V3) to all scenes

Essence of the Problem:
- The same model applied to all meetings ignores scene characteristics
- No technical term dictionary, so transcripts are full of katakana representations
- Insufficient speaker identification training, so "Speaker A" and "Speaker B" frequently swap
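In code form, Step 1 amounts to a lookup from the two classification axes to a scene label. Here is a minimal sketch in Python; the MeetingProfile type and SCENE_MAP table are hypothetical constructs built from the table above, not part of donut AI or any other tool:

```python
# Sketch of Step 1: map a meeting's axis ratings to a scene label.
from dataclasses import dataclass

@dataclass(frozen=True)
class MeetingProfile:
    name: str
    term_density: str         # "low" | "medium" | "high"
    speaker_transitions: str  # "low" | "medium" | "high"

# (term density, speaker transitions) -> scene, from the table above.
SCENE_MAP = {
    ("medium", "medium"): "Scene A",  # sales
    ("high", "high"): "Scene B",      # development
    ("medium", "low"): "Scene C",     # customer success
    ("low", "low"): "Scene D",        # management
}

def classify(profile: MeetingProfile) -> str:
    """Return the scene label for a meeting profile."""
    key = (profile.term_density, profile.speaker_transitions)
    try:
        return SCENE_MAP[key]
    except KeyError:
        raise ValueError(f"No scene defined for {key}; extend SCENE_MAP")

print(classify(MeetingProfile("weekly dev standup", "high", "high")))  # Scene B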
Step 2: Object Set Design (2 weeks)
Object Set for Scene A (Sales Meetings):
- AI Model: Whisper Large V3 + custom vocabulary (300 sales terms)
- Technical Term Dictionary: LTV, CAC, MRR, ARR, churn, onboarding
- Speaker Identification: pre-train voice prints of the 5 participants (a 5-minute audio sample each)
- Post-processing: automatic katakana-to-alphanumeric conversion script

Object Set for Scene B (Development Meetings):
- AI Model: Whisper Large V3 + custom vocabulary (500 technical terms)
- Technical Term Dictionary: PostgreSQL, N+1, Eager Loading, Git, API, Docker, Kubernetes
- Speaker Identification: pre-train voice prints of the 8 participants (a 5-minute audio sample each)
- Post-processing: technical term notation unification (e.g., posutoguresu → PostgreSQL)

Object Set for Scene C (CS Meetings):
- AI Model: Whisper Large V3 + custom vocabulary (200 CS terms)
- Technical Term Dictionary: NPS, churn rate, onboarding, retention
- Speaker Identification: pre-train voice prints of the 4 participants (a 5-minute audio sample each)
- Post-processing: metric numerical format unification

Object Set for Scene D (Management Meetings):
- AI Model: Whisper Large V3 (standard settings)
- Technical Term Dictionary: minimal (EBITDA, KPI, etc.; about 20 words)
- Speaker Identification: pre-train voice prints of the 3 participants (a 5-minute audio sample each)
- Post-processing: minimal
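Because the four object sets differ only in their parameters, they can be expressed as plain declarative configuration. A sketch follows; the ObjectSet fields, vocabulary file paths, and post-processor names are illustrative assumptions, not artifacts of any real tool:

```python
# Sketch of Step 2: each scene's object set as configuration.
from dataclasses import dataclass

@dataclass(frozen=True)
class ObjectSet:
    model: str                        # base ASR model
    vocabulary_file: str              # JSON technical-term dictionary
    speaker_profiles: int             # number of pre-enrolled voice prints
    post_processors: tuple[str, ...]  # post-processing steps, in order

OBJECT_SETS: dict[str, ObjectSet] = {
    "Scene A": ObjectSet("whisper-large-v3", "vocab/sales_300.json", 5,
                         ("katakana_to_alphanumeric",)),
    "Scene B": ObjectSet("whisper-large-v3", "vocab/dev_500.json", 8,
                         ("term_notation_unification",)),
    "Scene C": ObjectSet("whisper-large-v3", "vocab/cs_200.json", 4,
                         ("metric_format_unification",)),
    "Scene D": ObjectSet("whisper-large-v3", "vocab/management_20.json", 3,
                         ()),
}
```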
Step 3: Deploy & Validate (Months 1-2: Prototype Development)
Technical Configuration:
- Base Model: OpenAI Whisper Large V3
- Customization: no fine-tuning required; handled through prompt engineering
- Technical Term Dictionary: managed in JSON format, switched per scene
- Post-processing Script: Python + regular expressions for automatic conversion
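A minimal sketch of that prompt-engineering step, assuming the open-source openai-whisper package is used to run Whisper Large V3 (the file names are hypothetical). The idea is to seed the decoder with the canonical spellings from the scene's JSON dictionary via initial_prompt, which biases recognition toward them:

```python
# Sketch: bias Whisper toward a scene's technical terms via its initial
# prompt. Assumes `pip install openai-whisper`; file names are illustrative.
import json

import whisper

def transcribe_for_scene(audio_path: str, vocab_path: str) -> str:
    # Load the scene's technical-term dictionary (JSON, switched per scene).
    with open(vocab_path, encoding="utf-8") as f:
        vocabulary = json.load(f)
    # Whisper conditions on the initial prompt, so listing the correct
    # spellings nudges the decoder toward them. Note that Whisper keeps
    # only roughly the last 224 tokens of this prompt.
    glossary = "Glossary: " + ", ".join(vocabulary.values())
    model = whisper.load_model("large-v3")
    result = model.transcribe(audio_path, initial_prompt=glossary)
    return result["text"]
```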
Implementation Example (Scene B: Development Meetings)

Technical term dictionary (kept as JSON on disk; shown here as the Python dict it loads into):

```python
# Katakana romanization -> canonical spelling, for Scene B.
tech_vocabulary = {
    "posutoguresukiyuueru": "PostgreSQL",
    "enpurasichi": "N+1",
    "iigaaroodingu": "Eager Loading",
    "gitto": "Git",
    "eepiiāi": "API",
}
```

Post-processing script:

```python
# Replace each katakana romanization with its canonical spelling.
# Longer keys are applied first so a short key never clobbers part
# of a longer match.
def post_process(transcript: str, vocabulary: dict[str, str]) -> str:
    for katakana in sorted(vocabulary, key=len, reverse=True):
        transcript = transcript.replace(katakana, vocabulary[katakana])
    return transcript
```
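Since the technical configuration above calls for regular expressions, a slightly more robust variant compiles the whole dictionary into one alternation and rewrites the transcript in a single pass. A sketch, reusing tech_vocabulary from above:

```python
import re

def post_process_re(transcript: str, vocabulary: dict[str, str]) -> str:
    # One compiled alternation, longest keys first, single substitution pass.
    pattern = re.compile(
        "|".join(re.escape(k) for k in sorted(vocabulary, key=len, reverse=True))
    )
    return pattern.sub(lambda m: vocabulary[m.group(0)], transcript)

sample = ("To resolve posutoguresukiyuueru's enpurasichi problem, "
          "implement iigaaroodingu")
print(post_process_re(sample, tech_vocabulary))
# -> To resolve PostgreSQL's N+1 problem, implement Eager Loading
```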
Month 3: Effectiveness Measurement
KPI 1: Transcription Accuracy (Technical Term Accuracy Rate)
| Scene | Before | After | Improvement (relative) |
|---|---|---|---|
| Scene A (Sales) | 58% | 82% | +41% |
| Scene B (Development) | 54% | 79% | +46% |
| Scene C (CS) | 61% | 84% | +38% |
| Scene D (Management) | 68% | 88% | +29% |
Overall Average Accuracy:
- Before: 60%
- After: 83%
- Improvement: +23 points (+38% relative)
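The Improvement figures are relative gains over the Before value, not percentage-point differences, as this quick check reproduces:

```python
# Relative improvement = (after - before) / before, per scene.
accuracy = {
    "Scene A": (58, 82), "Scene B": (54, 79),
    "Scene C": (61, 84), "Scene D": (68, 88),
}
for scene, (before, after) in accuracy.items():
    print(f"{scene}: {(after - before) / before:+.0%}")
# Scene A: +41%  Scene B: +46%  Scene C: +38%  Scene D: +29%
```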
KPI 2: Editing Time
| Scene | Before | After | Reduction |
|---|---|---|---|
| Scene A (Sales) | 120 min | 25 min | 79% |
| Scene B (Development) | 180 min | 35 min | 81% |
| Scene C (CS) | 90 min | 18 min | 80% |
| Scene D (Management) | 90 min | 15 min | 83% |
Monthly Editing Time:
- Before: 320 hours/month
- After: 68 hours/month
- Time saved: 252 hours/month
Annual Impact:
Personnel Cost Reduction:
- Time saved: 252 hours/month × 12 months = 3,024 hours/year
- Staff hourly rate: 3,200 yen (annual salary of 6 million yen ÷ 1,875 hours)
- Personnel cost reduction: 3,024 hours × 3,200 yen ≈ 9.67 million yen/year

Investment:
- Custom vocabulary dictionary creation: 600,000 yen
- Voice print training data collection: 400,000 yen
- Post-processing script development: 1 million yen
- Total initial investment: 2 million yen
- Annual AI API cost: 1.8 million yen (Whisper API usage fees)

ROI:
- (9.67 million yen − 1.8 million yen) ÷ 2 million yen × 100 ≈ 394%
- Payback period: 2 million yen ÷ 7.87 million yen/year ≈ 0.25 years (3 months)
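As a sketch, the same arithmetic in a few lines, using only the figures above (monetary amounts in millions of yen):

```python
# Reproduce the ROI arithmetic above.
hours_saved_per_year = 252 * 12                   # 3,024 hours
hourly_rate = 3_200                               # yen/hour
saved = hours_saved_per_year * hourly_rate / 1e6  # ≈ 9.68M yen/year
annual_api_cost = 1.8
initial_investment = 2.0
net_annual_benefit = saved - annual_api_cost      # ≈ 7.87M yen/year
roi_pct = net_annual_benefit / initial_investment * 100
payback_months = initial_investment / net_annual_benefit * 12
print(f"ROI ≈ {roi_pct:.0f}%, payback ≈ {payback_months:.0f} months")
# -> ROI ≈ 394%, payback ≈ 3 months
```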
That night, I reflected on the essence of Scene-Cast Theory.
TechScribe Inc. held the illusion that "using the same AI model for all meetings" would work. However, sales meetings, development meetings, CS meetings, and management meetings are each different scenes. Technical term density, speaker transition frequency, and required accuracy all differ.
Using Scene-Cast Theory, we classified meetings into four scenes (A-D) and designed optimal object sets (AI model + technical term dictionary + speaker identification + post-processing) for each scene. As a result, accuracy improved from 60% to 83%, and editing time was reduced from 320 hours/month to 68 hours/month.
Annual personnel cost reduction of 9.67 million yen, ROI of 394%, payback period of 3 months.
The key is not pursuing "universal AI" but deploying "scene-optimized AI." Even with the same Whisper Large V3, accuracy improves dramatically just by adding a technical term dictionary and post-processing.
"Don't rely on universal AI. Deploy optimal solutions for each scene with Scene-Cast Theory. Different plays performed on the same stage require the right actors for each play. Reproducible accuracy improvement begins with understanding the scene."
The next case will also depict the moment of deploying optimal solutions for each scene.
"Scene-Cast Theory. Deploy optimal object sets for each scene. There is no universal solution. By understanding scene characteristics and deploying optimal solutions, reproducible results emerge."—From the Detective's Notes