Methodology
How we find signals, score them, and turn them into articles.
The Content Pipeline
- 1. Collect
RSS feeds + HTML scraping from 16 Japanese government and industry sources. robots.txt respected. Rate-limited.
- 2. Extract
Body text extracted via trafilatura (HTML) or pypdf (PDFs). Minimum 200 characters of extracted text to proceed.
- 3. Classify
Keyword matching against 7 categories and 16 investment themes. LLM assist for low-confidence items.
- 4. Score
7-axis scoring (0–5 each, max 35). Only items scoring ≥15 proceed.
- 5. Generate
Claude API with journalist-voice prompt. Deterministic humanize pass applied automatically.
- 6. Review
Human editor reviews draft. [VERIFY] markers block publishing until resolved.
- 7. Publish
Manual publish command moves draft to published/. Translations generated for all 4 locales.
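As an illustration of the Extract and Classify steps, the sketch below applies the 200-character minimum and then does simple keyword matching, flagging low-confidence items for LLM assistance. The category names, keyword lists, the 0.6 confidence threshold, and the way confidence is computed are all assumptions made for this example; the actual pipeline's categories and thresholds are not documented here.

```python
from dataclasses import dataclass
from typing import Optional

MIN_BODY_CHARS = 200        # minimum from the Extract step
CONFIDENCE_THRESHOLD = 0.6  # assumed cut-off, not taken from the pipeline

# Hypothetical keyword lists; the real pipeline has 7 categories and 16 themes.
CATEGORY_KEYWORDS = {
    "energy": ["hydrogen", "offshore wind", "nuclear"],
    "semiconductors": ["semiconductor", "wafer", "foundry"],
}


@dataclass
class Classification:
    category: Optional[str]
    confidence: float
    needs_llm: bool  # True when keyword matching alone is not decisive


def has_enough_text(body: str) -> bool:
    """Return True if the extracted body meets the minimum length."""
    return len(body.strip()) >= MIN_BODY_CHARS


def classify(body: str) -> Classification:
    """Pick the category with the most keyword hits (plain substring counts)."""
    lowered = body.lower()
    hits = {
        category: sum(lowered.count(keyword) for keyword in keywords)
        for category, keywords in CATEGORY_KEYWORDS.items()
    }
    total = sum(hits.values())
    if total == 0:
        # No keyword matched at all: leave the decision to the LLM step.
        return Classification(None, 0.0, True)
    best = max(hits, key=hits.get)
    # Assumed confidence measure: share of all hits won by the top category.
    confidence = hits[best] / total
    return Classification(best, confidence, confidence < CONFIDENCE_THRESHOLD)
```

In this sketch an item that matches two categories equally gets a confidence of 0.5 and is routed to the LLM, while an item matching only one category is classified directly.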
Signal Scoring
Only signals scoring 15 or above out of 35 proceed to the article generation stage.
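The threshold can be expressed as a small gate function. This is a minimal sketch: the seven axes are passed as a plain sequence of integers because the axis names are not listed here, and the function names are hypothetical.

```python
from typing import Sequence

NUM_AXES = 7
MAX_PER_AXIS = 5
THRESHOLD = 15  # out of a maximum of 7 * 5 = 35


def total_score(axis_scores: Sequence[int]) -> int:
    """Validate the per-axis scores and return their sum."""
    if len(axis_scores) != NUM_AXES:
        raise ValueError(f"expected {NUM_AXES} axis scores, got {len(axis_scores)}")
    for score in axis_scores:
        if not 0 <= score <= MAX_PER_AXIS:
            raise ValueError(f"axis score {score} is outside 0..{MAX_PER_AXIS}")
    return sum(axis_scores)


def qualifies(axis_scores: Sequence[int]) -> bool:
    """Return True if the signal proceeds to article generation."""
    return total_score(axis_scores) >= THRESHOLD
```

A signal scoring 2 on every axis totals 14 and is dropped; raising any single axis to 3 brings it to 15 and lets it through.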
Article Generation
Qualifying signals are drafted using a custom Claude API prompt designed around a journalist's voice rather than consulting language. The prompt explicitly bans the patterns that make AI writing detectable: uniform sentence length, filler transitions, vague quantifiers, and marketing copy. A deterministic post-processing pass then replaces any remaining AI-typical phrases using a curated list of 80+ substitutions.
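A deterministic substitution pass of this kind could look roughly like the sketch below. The three phrase pairs are invented for illustration and are not taken from the curated list; the function name `humanize` is also an assumption.

```python
import re

# Hypothetical entries; the curated list has 80+ substitutions.
SUBSTITUTIONS = {
    "delve into": "examine",
    "leverage": "use",
    "utilize": "use",
}


def humanize(text: str) -> str:
    """Replace listed phrases, longest first, so the result is always the same."""
    for phrase in sorted(SUBSTITUTIONS, key=len, reverse=True):
        replacement = SUBSTITUTIONS[phrase]
        pattern = re.compile(r"\b" + re.escape(phrase) + r"\b", re.IGNORECASE)

        def swap(match, replacement=replacement):
            # Keep a leading capital if the matched phrase started a sentence.
            if match.group(0)[0].isupper():
                return replacement[0].upper() + replacement[1:]
            return replacement

        text = pattern.sub(swap, text)
    return text
```

Because the pass is a fixed list of regex replacements applied in a fixed order, running it twice on the same draft gives the same output, which is what makes it deterministic rather than another model call.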
Human Review Gate
Every draft is saved to a staging directory and reviewed by a human editor before publication. Drafts containing unverified factual claims (marked [VERIFY] by the generation system) are blocked from publishing until the editor resolves them. No article is ever auto-published.
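The publish block can be sketched as a check for the literal `[VERIFY]` marker. The function names are hypothetical, and the sketch assumes the marker always appears exactly as `[VERIFY]` in the draft text.

```python
from typing import List

VERIFY_MARKER = "[VERIFY]"


def unresolved_markers(draft: str) -> List[int]:
    """Return the 1-based line numbers that still contain a [VERIFY] marker."""
    return [
        number
        for number, line in enumerate(draft.splitlines(), start=1)
        if VERIFY_MARKER in line
    ]


def can_publish(draft: str) -> bool:
    """A draft is publishable only when no marker remains."""
    return not unresolved_markers(draft)
```

Returning line numbers rather than a bare boolean gives the editor a list of places to look before the manual publish command is allowed to run.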
Translations
English articles are the authoritative source. Translations into Hindi, French, and Simplified Chinese are generated via the Claude API with explicit instructions to preserve company names, yen figures, and proper nouns unchanged. Translation quality is spot-checked for each language.
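One way part of a spot-check could be automated is to confirm that every yen figure in the English source also appears, unchanged, in the translation. The sketch below is an assumption about how such a check might work, not a description of the existing review process; it only recognizes figures written with the ¥ sign, and the function name is hypothetical.

```python
import re
from collections import Counter
from typing import List

# Matches figures such as ¥950, ¥1,200 or ¥3.5 (the ¥ sign followed by digits).
YEN_FIGURE = re.compile(r"¥\d+(?:,\d{3})*(?:\.\d+)?")


def missing_yen_figures(source: str, translation: str) -> List[str]:
    """Return yen figures present in the source but absent from the translation."""
    in_source = Counter(YEN_FIGURE.findall(source))
    in_translation = Counter(YEN_FIGURE.findall(translation))
    # Counter subtraction keeps only figures the translation has too few of.
    return sorted((in_source - in_translation).elements())
```

An empty result means every ¥ figure survived translation; a non-empty one points the reviewer at figures that were reformatted or dropped, for example a thousands separator changed to a period.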