USER INSIGHT SYSTEM

I Built HelloBible's Voice of Customer Pipeline for $1 a Month.

Four scattered feedback channels (Zendesk, WhatsApp, Play Store, App Store) turned into 2,020 customer verbatims classified across 27 business-specific themes - for ~$1 in API calls. Built with Python and Claude Haiku. Runs monthly from a single command, with checkpoints so an interruption doesn't cost a re-run.

  • Verbatims classified: 2,020 - from 4 sources, unified
  • API cost: ~$1 - for 1,703 LLM classifications
  • Custom taxonomy: 27 themes, 4 sentiments, 7 signals
  • Stack: Python · Anthropic SDK · Claude Haiku 4.5 · Zendesk API · JSONL · CSV · Markdown

What this project actually proves

The business problem: HelloBible was making roadmap decisions on intuition. Which bugs are recurring? What do users actually want next? Who’s threatening to churn? Nobody could tell you without spending half a day in raw text - so people stopped checking.

The pipeline I built ends that. 2,020 verbatims classified across 27 business-specific themes, every month, for ~$1 in API costs. The team now has certainty about what bothers users and what they want - which means the roadmap goes in the right direction and the backlog gets fed by what’s actually said, not what’s loudest in the last standup.

The point isn't that AI can classify customer feedback. The point is that one Product Ops person with a budget mindset can replace a function most companies pay six figures for.


The friction I found

HelloBible had feedback flowing in from 4 places, none of which talked to each other:

  • Zendesk support tickets (1,145 in 2026)
  • WhatsApp beta group (516 messages, 8 months of scrolling history)
  • Google Play Store reviews (326)
  • App Store iOS reviews (118 pasted in manually, 33 of them from 2026)

Tickets got read one by one. Store reviews were checked in each app’s interface. The WhatsApp group was scrolled manually when someone remembered to. No taxonomy, no priority signals, no shared way for the team to talk about users.

So the decisions that depended on this data kept getting made on intuition: answering even a basic recurring-bug, feature-demand, or churn question cost half a day in raw text, so nobody asked.


What I actually built

A 6-step Python pipeline that runs monthly:

  1. Export Zendesk via the Incremental API (full tickets + every comment in every conversation)
  2. Parse the WhatsApp _chat.txt export (with a robust regex that handles names containing colons - my first version broke on something like "~ Jean 3:16"; see the parser sketch after this list)
  3. Integrate monthly Google Play Store CSVs (with UTF-16 BOM handling because of course they ship it that way)
  4. Append App Store iOS reviews (manually copy-pasted, because there’s no public reviews API for iOS)
  5. Classify every verbatim via Claude Haiku 4.5 against a 27-theme taxonomy I designed and iterated
  6. Output two artifacts: a CSV for the team to explore and filter, a Markdown executive synthesis for the CEO
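
Step 2 is where the parsing gets interesting. Here's a minimal sketch of the approach - the timestamp format and field names are illustrative, not the production script:

import re
from typing import Iterator

# iOS-style WhatsApp export line, e.g.:
#   [17/03/2026, 10:12:45] ~ Jean 3:16: the actual message
# (the timestamp format is an assumption - it varies by device and locale)
LINE_RE = re.compile(r"^\[(?P<ts>[^\]]+)\]\s(?P<rest>.*)$")

def parse_chat(path: str) -> Iterator[dict]:
    current = None
    with open(path, encoding="utf-8") as f:
        for raw in f:
            line = raw.rstrip("\n")
            m = LINE_RE.match(line)
            if m:
                if current:
                    yield current
                # Split on the FIRST ': ' only. Splitting on a bare ':'
                # cuts names like '~ Jean 3:16' inside the verse reference.
                name, sep, body = m.group("rest").partition(": ")
                if not sep:  # system messages have no 'author: ' prefix
                    name, body = None, m.group("rest")
                current = {"ts": m.group("ts"), "author": name, "text": body}
            elif current:
                current["text"] += "\n" + line  # continuation of a multi-line message
    if current:
        yield current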

The whole thing runs from a single command. State is checkpointed every 100 records. If it crashes mid-classification, it resumes from where it stopped.
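
A minimal sketch of that resume behavior, assuming each verbatim carries a stable id (classify here is a stand-in for the Haiku call detailed further down):

import json
from pathlib import Path

OUT = Path("voc_classified_2026.jsonl")  # append-friendly output

def classify(text: str) -> dict:
    raise NotImplementedError  # stand-in for the Haiku call shown later

def already_done() -> set:
    """IDs classified in a previous (possibly interrupted) run."""
    if not OUT.exists():
        return set()
    with OUT.open(encoding="utf-8") as f:
        return {json.loads(line)["id"] for line in f if line.strip()}

def run(verbatims: list) -> None:
    done = already_done()
    todo = [v for v in verbatims if v["id"] not in done]
    with OUT.open("a", encoding="utf-8") as out:
        for i, record in enumerate(todo, start=1):
            record["classification"] = classify(record["text"])
            out.write(json.dumps(record, ensure_ascii=False) + "\n")
            if i % 100 == 0:
                out.flush()  # checkpoint: everything so far survives a crash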


What I’m actually selling

The reflex most teams have when they hear “let’s use AI to classify customer feedback”: reach for the biggest model on every record. Then get a four-figure monthly bill and quietly stop using it.

The reflex I have instead: deterministic pre-filter first, cheap model second, cache the parts that repeat, checkpoint everything, and only call the expensive model when you need to audit. ~$1 a month, end to end.
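
For flavor, a sketch of what a deterministic pre-filter can look like - these exact rules are illustrative, not HelloBible's:

def prefilter(record: dict) -> str | None:
    """Label a record without the LLM when it's unambiguous, else return None.
    The rules below are illustrative; the real filter is tuned to the data."""
    text = record.get("text", "").strip()
    if len(text) < 3:
        return "bruit"  # empty or near-empty: the taxonomy's noise bucket
    if text.lower() in {"ok", "merci", "thanks", "amen"}:
        return "bruit"  # bare acknowledgements carry no signal
    if record.get("automated"):  # e.g. auto-generated notifications
        return "bruit"
    return None  # everything else goes to the cheap model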

The rest of the technical details are in the accordion below. The point is the discipline.


The technical architecture

Stack

  • Python 3 with anthropic, requests, csv, json, re, pathlib
  • Zendesk Incremental Export API for tickets + Comments API for full conversations
  • Anthropic Messages API with claude-haiku-4-5-20251001 and cache_control: {"type": "ephemeral"} on the system prompt (see the call sketch after this list)
  • JSONL as intermediate format (append-friendly, diffable, debuggable)
  • CSV UTF-8-sig for Google Sheets compatibility
  • Markdown for the executive synthesis
  • Local execution - scripts in iCloud, outputs in ~/Downloads/
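
The classification call, sketched with the real SDK shape (the prompt text and max_tokens are placeholders):

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

TAXONOMY_PROMPT = "..."  # the ~1.5 kB taxonomy + strict JSON output instructions

def classify(verbatim: str) -> str:
    response = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=200,
        system=[{
            "type": "text",
            "text": TAXONOMY_PROMPT,
            # ephemeral cache: the taxonomy is written once, then read
            # at a fraction of the input price on every subsequent call
            "cache_control": {"type": "ephemeral"},
        }],
        messages=[{"role": "user", "content": verbatim}],
    )
    return response.content[0].text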

Flow end-to-end

1. export_zendesk_2026.py     → zendesk_tickets_2026.jsonl     (1,145 tickets)
2. consolidate_voc_2026.py    → voc_unified_2026.jsonl         (2,020 verbatims)
3. classify_voc_haiku.py      → voc_classified_2026.jsonl      (theme + sentiment + signal)
4. [dedup + export CSV]       → voc_classified_2026_final.csv  (Google Sheets)
5. [executive synthesis]      → VoC_HelloBible_2026.md         (CEO read)
6. run_monthly_voc.py         → orchestrates 1-5 every month
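
The orchestrator itself can stay dumb - roughly this shape (the two bracketed steps get hypothetical script names here):

import subprocess
import sys

STEPS = [
    "export_zendesk_2026.py",
    "consolidate_voc_2026.py",
    "classify_voc_haiku.py",
    "dedup_export_csv.py",   # hypothetical name for step 4
    "build_synthesis.py",    # hypothetical name for step 5
]

for step in STEPS:
    print(f"→ {step}")
    if subprocess.run([sys.executable, step]).returncode != 0:
        sys.exit(f"{step} failed - fix and re-run; checkpoints make the re-run cheap")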

The taxonomy

27 themes with hierarchical prefixes for rollups (bug.*, contenu.*, abonnement.*, etc.). Plus 4 sentiments, 7 signals, and a confidence field on every classification.

Some entries that show the domain encoding:

  • abonnement.paiement_local for Wave / Orange Money - African market payment friction that generic taxonomies miss
  • contenu.ia_qualite (factual AI errors) vs contenu.theologie (doctrinal disagreements) - a subtle but important distinction for an app that generates AI Bible content
  • churn_risk defined precisely as “the user threatens or announces leaving / unsubscribing”
  • “Mise en garde théologique” (theological warning) flagged separately because “the app replacing God” is a specific ethical churn risk that generic AI taxonomies miss entirely

I didn’t let the LLM “discover” the categories. The taxonomy is the product - it’s what becomes the team’s shared language for talking about users.
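
Concretely, one classified record in the JSONL looks like this - the theme code is a real entry from the taxonomy above; the ID, text, and the sentiment/signal label names are illustrative:

record = {
    "id": "zendesk-88412",                 # hypothetical source ID
    "source": "zendesk",
    "text": "Je ne peux pas payer avec Orange Money...",
    "theme": "abonnement.paiement_local",  # one of the 27 themes
    "sentiment": "negatif",                # one of 4 (label name assumed)
    "signal": "churn_risk",                # one of 7
    "confidence": 0.85,                    # confidence field per the taxonomy spec
}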

Cost engineering specifics

  • Deterministic pre-filter saves ~14% of LLM calls for free
  • Haiku at ~$0.80/M input + $4/M output tokens
  • Ephemeral cache amortizes the ~1.5 kB taxonomy system prompt across all calls
  • Total cost: ~$1 for 1,703 classified records
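
The back-of-envelope, with assumed per-record token counts (the real counts vary with verbatim length):

records        = 1_703
in_per_record  = 250   # verbatim + wrapper, assumption
out_per_record = 60    # one small JSON object, assumption
in_price       = 0.80 / 1_000_000   # $/token, per the rates above
out_price      = 4.00 / 1_000_000

cost = records * (in_per_record * in_price + out_per_record * out_price)
print(f"~${cost:.2f}")  # ≈ $0.75, before cache savings on the system prompt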

Safety and idempotence

  • Strict JSON validation on Haiku output → unknown theme falls back to the bruit (noise) bucket
  • Exponential retry (5 attempts, 5s → 60s backoff) on rate limits - sketched after this list
  • Graceful fallback: if all retries fail, the record is saved with filter_reason: "classification_failed" - never silent data loss
  • Checkpoints every 100 records (resumable state on disk)
  • Append-mode on all outputs (no data loss after interruption)
  • Self-deduplication post-hoc (a partial-stats bug introduced 51 duplicates on first run)
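
The retry-and-fallback wrapper, sketched - the exception type is the real SDK's; classify is the Haiku call from the sketch above:

import time
import anthropic

def classify_with_retry(record: dict, attempts: int = 5) -> dict:
    delay = 5.0
    for attempt in range(attempts):
        try:
            record["classification"] = classify(record["text"])  # Haiku call from above
            return record
        except anthropic.RateLimitError:
            if attempt == attempts - 1:
                break  # out of attempts, fall through to the graceful path
            time.sleep(delay)
            delay = min(delay * 2, 60.0)  # doubles toward the 60 s cap
    # Never drop data silently: keep the record, mark why it's unclassified
    record["filter_reason"] = "classification_failed"
    return record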

What broke (the useful ones)

The bugs worth telling about - not the typos, the ones that taught me something:

  • The WhatsApp regex broke on colons in names like "~ Jean 3:16". The first split on : captured the Bible reference as the message body. Fix: split on the first ": " pattern instead. Real lesson on domain edge cases in raw text parsing.
  • Haiku appending text after the JSON object made json.loads fail with "Extra data". Fix: regex-extract the first JSON object instead of parsing the whole response (sketch after this list). Real lesson on LLM output validation.
  • Hit the monthly Anthropic API limit at record 650/1,703 mid-classification. Bumped the limit, the script resumed from the checkpoint - exactly what the resumable design was built for. Validation of the architecture.
  • Partial stats + 51 duplicates after resume because the stats dict reset on each run. Caught it because Claude Opus flagged that my numbers didn’t add up arithmetically. Regenerated a clean deduplicated CSV.
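
The JSON-extraction fix from the second bullet, sketched - the non-greedy match assumes a flat classification object with no nested braces, which holds here:

import json
import re

def extract_json(raw: str) -> dict:
    """Haiku sometimes appends prose after the JSON object, so a bare
    json.loads(raw) dies with 'Extra data'. Take the first {...} instead."""
    match = re.search(r"\{.*?\}", raw, re.DOTALL)
    if not match:
        raise ValueError(f"no JSON object in model output: {raw[:80]!r}")
    return json.loads(match.group(0))

(json.JSONDecoder().raw_decode would also work and handles nested objects, at the cost of a couple of lines.)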

That last one is the most useful lesson: you need a second LLM to audit your first LLM’s outputs. I had Claude Opus check Haiku’s classifications for consistency, and it flagged real errors I would have missed. That’s how you build trust in an automated AI system - keep a smarter model in the audit loop.
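
The audit loop can be as simple as sampling classified records and asking the stronger model to flag mismatches - this sketch's prompt wording, sample size, and model string are assumptions, not the original script:

import json
import random
import anthropic

client = anthropic.Anthropic()

def audit_sample(records: list, n: int = 50) -> str:
    sample = random.sample(records, min(n, len(records)))
    payload = json.dumps(
        [{"text": r["text"], "theme": r["theme"]} for r in sample],
        ensure_ascii=False,
    )
    response = client.messages.create(
        model="claude-opus-4-1",  # model string is an assumption
        max_tokens=1500,
        messages=[{
            "role": "user",
            "content": "Review these theme assignments against the verbatims. "
                       "Flag any that look wrong, and say why:\n" + payload,
        }],
    )
    return response.content[0].text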


What this changes operationally

Before the pipeline, deciding what to build next at HelloBible meant trusting whoever shouted loudest in the standup. Now the team has the data to answer three concrete questions every month:

  • What’s bothering users right now? (filter the CSV by sentiment: negative and recurring themes)
  • What do users actually want next? (filter by signal: feature_request and roll up by theme)
  • Who’s about to leave? (filter by signal: churn_risk and surface in support)
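
The team runs these filters in Google Sheets; scripted, they're three lines of pandas (column and label names as in the record sketch above, i.e. assumed):

import pandas as pd

df = pd.read_csv("voc_classified_2026_final.csv")

pains = df[df["sentiment"] == "negatif"]["theme"].value_counts()       # what hurts
wants = df[df["signal"] == "feature_request"]["theme"].value_counts()  # what's asked for
churn = df[df["signal"] == "churn_risk"][["source", "text"]]           # who's leaving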

The exec gets a 1-page Markdown synthesis. The product team gets the full CSV to dig into. Same data, two artifacts, $1 a month to keep both fresh.

The taxonomy itself is now an asset - the shared vocabulary the team uses to talk about users. The pipeline keeps feeding it every month from a single Python command.

Want to talk about something like this?

Email me, send a LinkedIn message, or download the CV. Conversations are what this site is built for.

LinkedIn ↗ Download CV