I Built HelloBible's Voice of Customer Pipeline for $1 a Month.
4 scattered feedback channels (Zendesk, WhatsApp, Play Store, App Store) - 2,020 customer verbatims classified across 27 business-specific themes - for $1 in API calls. Built with Python and Claude Haiku. Runs monthly from a single command, with checkpoints so an interruption doesn't cost a re-run.
What this project actually proves
The business problem: HelloBible was making roadmap decisions on intuition. Which bugs are recurring? What do users actually want next? Who’s threatening to churn? Nobody could tell you without spending half a day in raw text - so people stopped checking.
The pipeline I built ends that. 2,020 verbatims classified across 27 business-specific themes, every month, for ~$1 in API costs. The team now has certainty about what bothers users and what they want - which means the roadmap goes in the right direction and the backlog gets fed by what’s actually said, not what’s loudest in the last standup.
The point isn’t that AI can classify customer feedback. The point is that one Product Ops person with a budget mindset can replace a function most companies pay six figures for.
The friction I found
HelloBible had feedback flowing in from 4 places, none of which talked to each other:
- Zendesk support tickets (1,145 in 2026)
- WhatsApp beta group (516 messages, 8 months of scrolling history)
- Google Play Store reviews (326)
- App Store iOS reviews (33 from 2026, out of 118 pasted in manually)
Tickets got read one by one. Store reviews were checked in each app’s interface. The WhatsApp group was scrolled manually when someone remembered to. No taxonomy, no priority signals, no shared way for the team to talk about users.
The decisions that depended on this data were getting made on intuition. Which bugs are recurring? What features do users actually want? Who’s threatening to churn? Nobody could tell you without spending half a day in raw text.
What I actually built
A 6-step Python pipeline that runs monthly:
- Export Zendesk via the Incremental API (full tickets + every comment in every conversation)
- Parse the WhatsApp `_chat.txt` export (with a robust regex that handles names containing colons - my first version broke on something like "~ Jean 3:16")
- Integrate monthly Google Play Store CSVs (with UTF-16 BOM handling, because of course they ship it that way)
- Append App Store iOS reviews (manually copy-pasted, because there’s no public reviews API for iOS)
- Classify every verbatim via Claude Haiku 4.5 against a 27-theme taxonomy I designed and iterated
- Output two artifacts: a CSV for the team to explore and filter, a Markdown executive synthesis for the CEO
The whole thing runs from a single command. State is checkpointed every 100 records. If it crashes mid-classification, it resumes from where it stopped.
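The checkpoint-and-resume behavior can be sketched roughly like this - a minimal version, with `classify` as a hypothetical stand-in for the real Haiku call, resuming by counting the records already written to the JSONL output:

```python
import json
from pathlib import Path

def classify(verbatim):
    # Stand-in for the real Haiku call (hypothetical placeholder)
    return {"theme": "bruit", "sentiment": "neutre"}

def run_resumable(records, out_path, checkpoint_every=100):
    out = Path(out_path)
    # Resume: records already on disk were classified in a previous run
    done = sum(1 for _ in out.open(encoding="utf-8")) if out.exists() else 0
    with out.open("a", encoding="utf-8") as f:
        for i, rec in enumerate(records):
            if i < done:
                continue  # skip what the interrupted run already paid for
            row = dict(rec, classification=classify(rec["text"]))
            f.write(json.dumps(row, ensure_ascii=False) + "\n")
            if (i + 1) % checkpoint_every == 0:
                f.flush()  # checkpoint: everything so far survives a crash
    return done  # how many records this run skipped
```

Because the output is append-only JSONL, "resume" is just "count lines and skip that many inputs" - no separate state file to keep in sync.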
What I’m actually selling
The reflex most teams have when they hear “let’s use AI to classify customer feedback”: reach for the biggest model on every record. Then get a four-figure monthly bill and quietly stop using it.
The reflex I have instead: deterministic pre-filter first, cheap model second, cache the parts that repeat, checkpoint everything, and only call the expensive model when you need to audit. ~$1 a month, end to end.
The technical details are in the accordion below. The point is the discipline.
The technical architecture
Stack
- Python 3 with `anthropic`, `requests`, `csv`, `json`, `re`, `pathlib`
- Zendesk Incremental Export API for tickets + Comments API for full conversations
- Anthropic Messages API with `claude-haiku-4-5-20251001` and `cache_control: {"type": "ephemeral"}` on the system prompt
- JSONL as intermediate format (append-friendly, diffable, debuggable)
- CSV UTF-8-sig for Google Sheets compatibility
- Markdown for the executive synthesis
- Local execution - scripts in iCloud, outputs in `~/Downloads/`
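The prompt-caching setup is the one piece worth seeing concretely. A hedged sketch of how the request is assembled (function name and the abridged prompt are illustrative, not the actual pipeline code): the `cache_control` marker goes on the system block so the large, identical taxonomy prompt is cached across calls and only the short verbatim is billed fresh each time.

```python
# Abridged placeholder: the real prompt carries the full 27-theme taxonomy
TAXONOMY_PROMPT = "You are a VoC classifier. Themes: bug.*, contenu.*, abonnement.* ..."

def build_classify_request(verbatim: str) -> dict:
    """Build the kwargs for client.messages.create(**kwargs)."""
    return {
        "model": "claude-haiku-4-5-20251001",
        "max_tokens": 300,
        "system": [
            {
                "type": "text",
                "text": TAXONOMY_PROMPT,
                # Cache the big shared prefix across all 2,020 calls
                "cache_control": {"type": "ephemeral"},
            }
        ],
        "messages": [{"role": "user", "content": verbatim}],
    }
```

Separating "build the request" from "send it" also makes the expensive part trivially unit-testable without an API key.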
Flow end-to-end
1. `export_zendesk_2026.py` → `zendesk_tickets_2026.jsonl` (1,145 tickets)
2. `consolidate_voc_2026.py` → `voc_unified_2026.jsonl` (2,020 verbatims)
3. `classify_voc_haiku.py` → `voc_classified_2026.jsonl` (theme + sentiment + signal)
4. [dedup + export CSV] → `voc_classified_2026_final.csv` (Google Sheets)
5. [executive synthesis] → `VoC_HelloBible_2026.md` (CEO read)
6. `run_monthly_voc.py` → orchestrates 1-5 every month
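The orchestrator's core logic is simple enough to sketch (names are illustrative): run the steps in order and stop at the first failure, so a broken export never feeds a half-empty file into classification.

```python
def run_monthly_voc(steps):
    """steps: list of (name, zero-arg callable), run in pipeline order."""
    completed = []
    for name, step in steps:
        try:
            step()
        except Exception as e:
            print(f"[voc] step '{name}' failed: {e}; stopping here.")
            break
        completed.append(name)
        print(f"[voc] step '{name}' done.")
    return completed
```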
The taxonomy
27 themes with hierarchical prefixes for rollups (`bug.*`, `contenu.*`, `abonnement.*`, etc.). Plus 4 sentiments, 7 signals, and a confidence field on every classification.
Some entries that show the domain encoding:
- `abonnement.paiement_local` for Wave / Orange Money - African-market payment friction that generic taxonomies miss
- `contenu.ia_qualite` (factual AI errors) vs `contenu.theologie` (doctrinal disagreements) - a subtle but important distinction for an app that generates AI Bible content
- `churn_risk` defined precisely as “the user threatens or announces leaving / unsubscribing”
- “Mise en garde théologique” (theological caution) flagged separately, because “the app replacing God” is a specific ethical churn risk that generic AI taxonomies miss entirely
I didn’t let the LLM “discover” the categories. The taxonomy is the product - it’s what becomes the team’s shared language for talking about users.
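The hierarchical prefixes are what make the taxonomy aggregate cleanly. A minimal sketch of the rollup: every leaf theme collapses to its top-level family by splitting on the first dot.

```python
from collections import Counter

def rollup(themes):
    """Aggregate leaf themes (e.g. 'bug.sync') into top-level families."""
    return Counter(theme.split(".")[0] for theme in themes)
```

This is why the prefixes matter: `bug.sync` and `bug.crash` stay distinct for the backlog but roll up into a single `bug` count for the executive synthesis.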
Cost engineering specifics
- Deterministic pre-filter saves ~14% of LLM calls for free
- Haiku at ~$0.80/M input + $4/M output tokens
- Ephemeral cache amortizes the ~1.5 KB taxonomy system prompt across all calls
- Total cost: ~$1 for 1,703 classified records
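The deterministic pre-filter is the cheapest win in the list: records that obviously don't need a model call get tagged with a skip reason before any tokens are spent. A sketch with hypothetical rules (the real pipeline's exact criteria aren't shown here):

```python
import re

# Hypothetical noise patterns - the real filter's rules may differ
AUTO_REPLY = re.compile(r"(out of office|message automatique)", re.I)

def prefilter(verbatim: str):
    """Return a skip reason for records that need no LLM call, else None."""
    text = verbatim.strip()
    if not text:
        return "empty"
    if len(text) < 5:
        return "too_short"
    if AUTO_REPLY.search(text):
        return "auto_reply"
    return None  # worth a Haiku call
```

A `None` result means "send it to the model"; anything else is logged as the record's `filter_reason` and costs nothing.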
Safety and idempotence
- Strict JSON validation on Haiku output → unknown theme falls back to `bruit`
- Exponential retry (5 attempts, 5s → 60s backoff) on rate limits
- Graceful fallback: if all retries fail, the record is saved with `filter_reason: "classification_failed"` - never silent data loss
- Checkpoints every 100 records (resumable state on disk)
- Append-mode on all outputs (no data loss after interruption)
- Self-deduplication post-hoc (a partial-stats bug introduced 51 duplicates on first run)
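The retry discipline from the list above can be sketched as a small wrapper (a minimal version; the real code distinguishes rate-limit errors from others, which this sketch glosses over). The injectable `sleep` is there purely so the backoff schedule is testable:

```python
import time

def with_retries(call, attempts=5, base_delay=5, max_delay=60, sleep=time.sleep):
    """Retry `call` with exponential backoff: 5s, 10s, ... capped at 60s."""
    delay = base_delay
    for attempt in range(1, attempts + 1):
        try:
            return call()
        except Exception:
            if attempt == attempts:
                raise  # caller records filter_reason: "classification_failed"
            sleep(delay)
            delay = min(delay * 2, max_delay)
```

The final `raise` is what feeds the graceful fallback: the record is saved with a failure marker instead of disappearing.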
What broke (the useful ones)
The bugs worth telling about - not the typos, the ones that taught me something:
- WhatsApp regex broke on colons in names like "~ Jean 3:16". The first split on `:` captured the Bible reference as the message body. Fix: split on the first `": "` pattern instead. Real lesson on domain edge cases in raw-text parsing.
- Haiku appending text after the JSON object caused `json.loads` to fail with "Extra data". Fix: regex-extract the first JSON object instead of parsing the whole response. Real lesson on LLM output validation.
- Hit the monthly Anthropic API limit at record 650/1,703 mid-classification. Bumped the limit, and the script resumed from the checkpoint - exactly what the resumable design was built for. Validation of the architecture.
- Partial stats + 51 duplicates after resume because the stats dict reset on each run. Caught it because Claude Opus flagged that my numbers didn’t add up arithmetically. Regenerated a clean deduplicated CSV.
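The first two fixes above are compact enough to sketch. Assumptions: the export lines look like `12/03/2026, 14:22 - Author: message` (the exact WhatsApp format varies by locale), and the classification JSON is a flat object, so a non-greedy brace match is enough.

```python
import json
import re

# Assumed line shape: "12/03/2026, 14:22 - ~ Jean 3:16: the message"
LINE = re.compile(r"^(?P<ts>[^-]+) - (?P<rest>.*)$")

def parse_whatsapp_line(line):
    m = LINE.match(line)
    if not m:
        return None  # continuation of a multi-line message
    # Split on the FIRST ': ' only - a bare ':' would cut "~ Jean 3:16"
    author, sep, body = m.group("rest").partition(": ")
    if not sep:
        return None  # system message with no author
    return m.group("ts").strip(), author, body

def extract_first_json(text):
    """Haiku sometimes appends prose after the object; keep just the JSON."""
    m = re.search(r"\{.*?\}", text, re.S)  # assumes a flat (non-nested) object
    return json.loads(m.group(0)) if m else None
```

Splitting on `": "` works because a colon inside a verse reference is followed by a digit, never a space.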
That last one is the most useful lesson: you need a second LLM to audit your first LLM’s outputs. I had Claude Opus check Haiku’s classifications for consistency, and it flagged real errors I would have missed. That’s how you build trust in an automated AI system - keep a smarter model in the audit loop.
What this changes operationally
Before the pipeline, deciding what to build next at HelloBible meant trusting whoever shouted loudest in the standup. Now the team has the data to answer three concrete questions every month:
- What’s bothering users right now? (filter the CSV by `sentiment: negative` and recurring themes)
- What do users actually want next? (filter by `signal: feature_request` and roll up by theme)
- Who’s about to leave? (filter by `signal: churn_risk` and surface in support)
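The churn question is the one support acts on the same day, and the filter is a few lines of stdlib (function name illustrative; note `utf-8-sig`, matching the pipeline's Google-Sheets-friendly output encoding):

```python
import csv

def churn_risks(csv_path):
    """Rows the support team should see first: users announcing they'll leave."""
    with open(csv_path, newline="", encoding="utf-8-sig") as f:
        return [row for row in csv.DictReader(f) if row.get("signal") == "churn_risk"]
```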
The exec gets a 1-page Markdown synthesis. The product team gets the full CSV to dig into. Same data, two artifacts, $1 a month to keep both fresh.
The taxonomy itself is now an asset - the shared vocabulary the team uses to talk about users. The pipeline keeps feeding it every month from a single Python command.
Want to talk about something like this?
Email me, send a LinkedIn message, or download the CV. Conversations are what this site is built for.