Methodology | State of Technology

State of Technology (SoT) is an event-first news reader. It does not rank articles by clicks, likes, or paid placement. It groups public coverage into story events, then ranks those events by independent corroboration, coverage velocity, and freshness.

Pipeline

Each scheduler pass follows the same path. The live site reads the finished cache, not the raw feeds.

01 Source ingest
SoT polls public RSS feeds and bounded public sitemaps from news desks, AI labs, security publications, practitioner sources, and filtered Hacker News queries.
Sources -> entries

02 URL cleanup
Tracking parameters are stripped, canonical URLs are normalized, and exact duplicate URLs are collapsed before clustering.
Entries -> canonical URLs

03 Candidate matching
The clusterer compares titles with token overlap and local embeddings, then sends plausible pairs through deterministic same-event checks.
Entries -> possible pairs

04 Event gates
Candidate merges must survive checks for shared anchors, event facets, conflicting story types, and publish-time span.
Pairs -> story clusters

05 Ranking
Story clusters are scored by independent source corroboration, coverage velocity, and recency decay.
Clusters -> edition order

06 Evidence dossier
Each cluster gets deterministic why-ranked reasons, source mix, first/last article times, and a compact coverage timeline.
Scores -> reader evidence

07 Cached API
The scheduler writes a fresh cluster cache. Site pages read that cache, so reader requests never trigger live clustering.
Edition -> pages
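
The last step, where the scheduler writes a cache and pages only read it, can be sketched as a write-then-swap. This is a minimal illustration, not the project's code: the JSON file format and the function names write_cache_atomically and read_cache are assumptions made for the example.

```python
import json
import os
import tempfile
from pathlib import Path


def write_cache_atomically(cache_path: Path, edition: dict) -> None:
    """Scheduler side: write the edition to a temp file, then swap it in."""
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    # Keep the temp file in the same directory so the rename stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=cache_path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            json.dump(edition, tmp)
        # os.replace swaps the file in one step (atomic on POSIX), so a
        # reader sees either the old edition or the new one, never a partial file.
        os.replace(tmp_name, cache_path)
    except BaseException:
        os.unlink(tmp_name)
        raise


def read_cache(cache_path: Path) -> dict:
    """Reader side: load the finished cache only; no clustering happens here."""
    with cache_path.open(encoding="utf-8") as fh:
        return json.load(fh)
```

Because readers never call the clustering code, a slow or failed scheduler pass leaves the previous edition in place instead of breaking page loads.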

Clustering Guardrails

Exact duplicates

Articles with the same canonical URL are treated as the same source article identity and collapsed before event clustering.
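
As a rough sketch of how URL cleanup and exact-duplicate collapsing might look, the snippet below uses only the Python standard library. The tracking-parameter list and the normalization rules (lowercased host, trimmed trailing slash, sorted query, dropped fragment) are assumptions for illustration; the actual rules are not documented here.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical tracking parameters; the real list is not documented.
TRACKING_PREFIXES = ("utm_",)
TRACKING_KEYS = {"fbclid", "gclid", "ref"}


def canonicalize(url: str) -> str:
    """Strip tracking parameters and normalize the parts that vary cosmetically."""
    parts = urlsplit(url.strip())
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if not key.lower().startswith(TRACKING_PREFIXES)
        and key.lower() not in TRACKING_KEYS
    ]
    path = parts.path.rstrip("/") or "/"
    # Sort the remaining query pairs and drop the fragment entirely.
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, urlencode(sorted(query)), "")
    )


def collapse_duplicates(urls):
    """Keep one entry per canonical URL, preserving first-seen order."""
    seen = {}
    for url in urls:
        seen.setdefault(canonicalize(url), url)
    return list(seen)
```

For example, `https://example.com/story/?utm_source=x&id=7#top` and `https://example.com/story?id=7&fbclid=abc` both canonicalize to `https://example.com/story?id=7` and collapse to a single entry.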

Shared event evidence

Normal semantic merges need enough title similarity or embedding similarity, plus specific anchors such as product names, policy actions, labs, or model names.
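
The rule above (similarity by title or embedding, plus a shared specific anchor) could be expressed roughly as follows. Jaccard overlap stands in for "token overlap" here, and the stopword list, both thresholds, and the function names are assumptions chosen for the example rather than the project's values. The embedding similarity is passed in as a precomputed number.

```python
import re

# Hypothetical stopword list and thresholds, for illustration only.
STOPWORDS = {"the", "a", "an", "of", "to", "in", "for", "and", "on", "with"}


def title_tokens(title: str) -> set:
    """Lowercase alphanumeric tokens with stopwords removed."""
    return {t for t in re.findall(r"[a-z0-9]+", title.lower()) if t not in STOPWORDS}


def jaccard(a: set, b: set) -> float:
    """Token overlap as |intersection| / |union|."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def has_shared_evidence(title_a, title_b, anchors_a, anchors_b,
                        embedding_sim, min_overlap=0.4, min_embedding=0.8):
    """Similar titles OR similar embeddings, AND at least one shared anchor."""
    overlap = jaccard(title_tokens(title_a), title_tokens(title_b))
    similar = overlap >= min_overlap or embedding_sim >= min_embedding
    return similar and bool(set(anchors_a) & set(anchors_b))
```

With the made-up titles "Acme ships Orion-2 model" and "Acme launches Orion-2 model for developers", four of seven distinct tokens are shared, so the pair passes the overlap threshold, but it only counts as shared evidence if both entries also carry a common anchor such as `orion-2`.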

False-merge vetoes

Pairs can be rejected when event facets conflict, when the combined time span is too wide, or when a broad company name is the only thing in common.
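
A minimal sketch of the three vetoes, assuming a simple cluster record. The Cluster fields, the 72-hour span limit, and the broad-anchor set are all hypothetical; only the three rejection reasons come from the description above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Assumed values for illustration; the real limits are not documented.
BROAD_ANCHORS = {"acme"}          # company names too generic to merge on alone
MAX_SPAN = timedelta(hours=72)    # widest allowed combined publish-time span


@dataclass
class Cluster:
    facets: dict          # e.g. {"story_type": "launch"}
    anchors: set
    first_seen: datetime
    last_seen: datetime


def veto_reason(a: Cluster, b: Cluster):
    """Return a reason string if the merge should be rejected, else None."""
    # 1. Conflicting values on any facet both clusters define.
    for key in sorted(a.facets.keys() & b.facets.keys()):
        if a.facets[key] != b.facets[key]:
            return f"facet conflict: {key}"
    # 2. Combined time span of the would-be merged cluster is too wide.
    span = max(a.last_seen, b.last_seen) - min(a.first_seen, b.first_seen)
    if span > MAX_SPAN:
        return "time span too wide"
    # 3. The only anchors in common are broad company names.
    shared = a.anchors & b.anchors
    if shared and shared <= BROAD_ANCHORS:
        return "only a broad anchor is shared"
    return None
```

Returning a reason string rather than a bare boolean keeps the decision deterministic and easy to log when a merge is refused.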

Repair pass

A conservative repair pass can merge split same-event clusters when lexical overlap, embeddings, anchors, and publish-time proximity all agree.

Ranking Formula

rank_score = corroboration_weight * (1 + spike_score) * recency_decay

corroboration_weight = source-count boost with single-source damping
spike_score = recent arrivals / prior arrivals
recency_decay = exponential age decay from first article time
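
To make the formula concrete, here is one possible reading of it in code. The overall product matches the formula above, but the shape of each term is an assumption: a log2 source-count boost, a 0.5 damping factor for single-source stories, a denominator floored at 1 to avoid dividing by zero, and a 12-hour half-life are illustrative choices, not documented values.

```python
import math


def corroboration_weight(source_count: int, single_source_damping: float = 0.5) -> float:
    """Assumed form: logarithmic boost, damped when only one source reports."""
    weight = 1.0 + math.log2(max(source_count, 1))
    return weight * single_source_damping if source_count <= 1 else weight


def spike_score(recent_arrivals: int, prior_arrivals: int) -> float:
    """Recent arrivals over prior arrivals; the floor of 1 is an assumption."""
    return recent_arrivals / max(prior_arrivals, 1)


def recency_decay(age_hours: float, half_life_hours: float = 12.0) -> float:
    """Exponential decay measured from the first article time."""
    return 0.5 ** (age_hours / half_life_hours)


def rank_score(source_count, recent_arrivals, prior_arrivals, age_hours) -> float:
    return (
        corroboration_weight(source_count)
        * (1 + spike_score(recent_arrivals, prior_arrivals))
        * recency_decay(age_hours)
    )
```

Under these assumptions, a cluster with 4 sources, 6 recent and 3 prior arrivals, first seen 12 hours ago scores 3.0 * (1 + 2.0) * 0.5 = 4.5.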

Corroboration

Distinct source count is the primary ranking lever. Single-source stories still appear, but their weight is damped; coverage from multiple distinct sources is treated as a stronger signal and ranks higher.

Velocity

Recent article arrivals are compared against the prior window to identify accelerating stories. Trending labels require at least two independent sources.
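
The window comparison and the two-source requirement might be sketched like this. The 3-hour window, the spike threshold of 2.0, and treating distinct source names as "independent" are assumptions for the example; only the comparison of recent to prior arrivals and the minimum of two sources come from the text.

```python
from datetime import datetime, timedelta


def window_counts(article_times, now, window=timedelta(hours=3)):
    """Count arrivals in the most recent window and in the window before it."""
    recent = sum(1 for t in article_times if now - window < t <= now)
    prior = sum(1 for t in article_times if now - 2 * window < t <= now - window)
    return recent, prior


def is_trending(articles, now, min_spike=2.0, min_sources=2):
    """articles is a list of (source_name, published_at) pairs."""
    sources = {source for source, _ in articles}
    if len(sources) < min_sources:
        return False  # one outlet publishing repeatedly is not a trend
    recent, prior = window_counts([t for _, t in articles], now)
    # Floor the denominator at 1 so a brand-new story does not divide by zero.
    return recent / max(prior, 1) >= min_spike
```

The source check runs first so that a single outlet republishing the same story cannot earn a trending label on volume alone.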

Recency

Older stories decay over time so the front page reflects the current edition instead of yesterday's already-saturated cluster.

Evidence

Story detail pages expose the same ranking inputs as reader-facing evidence: why-ranked reasons, source mix, timeline, and first/last article timestamps.
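
One way to assemble such a dossier from a cluster's articles is shown below. The article dictionary keys (source_type, published_at), the output field names, and the hourly timeline buckets are assumptions for illustration; the sketch also assumes every cluster holds at least one article.

```python
from collections import Counter
from datetime import datetime


def build_dossier(articles, reasons):
    """articles: non-empty list of dicts with 'source_type' and 'published_at'."""
    times = sorted(a["published_at"] for a in articles)
    # Bucket arrivals by hour for a compact coverage timeline.
    hourly = Counter(t.replace(minute=0, second=0, microsecond=0) for t in times)
    return {
        "why_ranked": list(reasons),
        "source_mix": dict(Counter(a["source_type"] for a in articles)),
        "first_article_at": times[0].isoformat(),
        "last_article_at": times[-1].isoformat(),
        "timeline": [(hour.isoformat(), n) for hour, n in sorted(hourly.items())],
    }
```

Everything in the dossier is derived from the same inputs the ranker uses, so the page can show its reasoning without a second data source.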

Limits

SoT uses English public feeds and sitemaps in a rolling live window. It can miss stories when a source does not publish usable feed or sitemap metadata, and it can split an event when headlines describe the same incident with very different language.

SoT is inspired by the useful public behavior of event-grouped news products and fast public conversation, but it does not have access to private social ranking systems and does not try to reproduce them.

Algorithm Credits

The clustering rules, ranking formula, and repair gates are project code. The non-obvious algorithm layer depends on these open-source projects and model weights.

FastEmbed

Runs local text embeddings for candidate story matching without sending article titles to a hosted embedding API.

BAAI/bge-small-en-v1.5

The English embedding model used for title similarity in the semantic clustering tier.

NumPy

Used for vector normalization and cosine similarity calculations inside the clustering pipeline.
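
For readers unfamiliar with the calculation, vector normalization followed by cosine similarity can be written in a few lines of NumPy. This is a generic sketch with hand-written vectors standing in for title embeddings; the function names are illustrative and it does not reproduce the project's code.

```python
import numpy as np


def normalize_rows(vectors: np.ndarray) -> np.ndarray:
    """Scale each row to unit length, guarding against all-zero rows."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.clip(norms, 1e-12, None)


def cosine_matrix(vectors) -> np.ndarray:
    """Pairwise cosine similarity: dot products of unit-length rows."""
    unit = normalize_rows(np.asarray(vectors, dtype=np.float64))
    return unit @ unit.T


# Toy 2-d vectors in place of real embeddings.
similarities = cosine_matrix([[1.0, 0.0], [0.0, 1.0], [2.0, 0.0]])
```

Rows 0 and 2 point in the same direction and score 1.0 despite different lengths, while rows 0 and 1 are orthogonal and score 0.0. Normalizing once up front means every pairwise comparison reduces to a single matrix product.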


