State of Technology is an event-first news reader. It does not rank articles by clicks, likes, or paid placement. It groups public coverage into story events, then ranks those events by independent corroboration, coverage velocity, and freshness.
Pipeline
Each scheduler pass follows the same path. The live site reads the finished cache, not the raw feeds.
01
Source ingest
SoT polls public RSS feeds and bounded public sitemaps from news desks, AI labs, security publications, practitioner sources, and filtered Hacker News queries.
Sources -> entries
02
URL cleanup
Tracking parameters are stripped, canonical URLs are normalized, and exact duplicate URLs are collapsed before clustering.
Entries -> canonical URLs
03
Candidate matching
The clusterer compares titles with token overlap and local embeddings, then sends plausible pairs through deterministic same-event checks.
Entries -> possible pairs
04
Event gates
Candidate merges must survive checks for shared anchors, event facets, conflicting story types, and publish-time span.
Pairs -> story clusters
05
Ranking
Story clusters are scored by independent source corroboration, coverage velocity, and recency decay.
Clusters -> edition order
06
Evidence dossier
Each cluster gets deterministic why-ranked reasons, source mix, first/last article times, and a compact coverage timeline.
Scores -> reader evidence
07
Cached API
The scheduler writes a fresh cluster cache. Site pages read that cache, so reader requests never trigger live clustering.
Edition -> pages
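As an illustration of the URL cleanup step (02), here is a minimal sketch using only the Python standard library. The tracking-parameter list, the trailing-slash rule, and the function names `canonicalize` and `collapse_duplicates` are assumptions made for this example; the actual SoT rules are not listed on this page.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Assumed tracking parameters; the real list used by SoT is not documented here.
TRACKING_PREFIXES = ("utm_",)
TRACKING_KEYS = {"fbclid", "gclid"}


def canonicalize(url: str) -> str:
    """Strip tracking parameters and normalize a URL for duplicate detection."""
    parts = urlsplit(url.strip())
    # Keep only query parameters that are not known tracking parameters.
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if not key.lower().startswith(TRACKING_PREFIXES)
        and key.lower() not in TRACKING_KEYS
    ]
    query.sort()  # stable parameter order so equivalent URLs compare equal
    path = parts.path.rstrip("/") or "/"
    # Lowercase scheme and host, drop the fragment.
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, urlencode(query), "")
    )


def collapse_duplicates(urls):
    """Return one canonical URL per distinct article, in first-seen order."""
    seen = {}
    for url in urls:
        seen.setdefault(canonicalize(url), url)
    return list(seen)
```

Under these assumptions, `https://Example.com/story/?utm_source=x&id=7#top` and `https://example.com/story?id=7&fbclid=abc` both reduce to `https://example.com/story?id=7`, so only one entry reaches the clusterer.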
Clustering Guardrails
Exact duplicates
Articles with the same canonical URL are treated as one article and collapsed before event clustering.
Shared event evidence
Normal semantic merges need enough title similarity or embedding similarity, plus specific anchors such as product names, policy actions, labs, or model names.
False-merge vetoes
Pairs can be rejected when event facets conflict, when the combined time span is too wide, or when a broad company name is the only thing in common.
Repair pass
A conservative repair pass can merge split same-event clusters when lexical overlap, embeddings, anchors, and publish-time proximity all agree.
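The guardrails above can be read as a chain of deterministic checks on a candidate pair. The sketch below is one possible shape for that chain, not SoT's code: the `Entry` fields, the thresholds, the `BROAD_ANCHORS` set, and the use of Jaccard overlap for title similarity are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta

# Illustrative thresholds only; the real values are not published on this page.
MIN_TITLE_OVERLAP = 0.3
MIN_EMBEDDING_SIM = 0.8
MAX_SPAN = timedelta(hours=48)
BROAD_ANCHORS = {"acme"}  # hypothetical company name too broad to justify a merge


@dataclass
class Entry:
    title: str
    published: datetime
    anchors: set = field(default_factory=set)  # e.g. product or model names
    story_type: str = "news"


def token_overlap(a: str, b: str) -> float:
    """Jaccard overlap of lowercase, whitespace-split title tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def same_event(a: Entry, b: Entry, embedding_sim: float = 0.0) -> bool:
    """Return True only if the pair survives every gate."""
    similar = (
        token_overlap(a.title, b.title) >= MIN_TITLE_OVERLAP
        or embedding_sim >= MIN_EMBEDDING_SIM
    )
    if not similar:
        return False
    # Require at least one shared anchor that is more specific than a broad name.
    if not ((a.anchors & b.anchors) - BROAD_ANCHORS):
        return False
    # Veto conflicting story types.
    if a.story_type != b.story_type:
        return False
    # Veto pairs whose publish times are too far apart.
    return abs(a.published - b.published) <= MAX_SPAN
```

Each gate can only reject, so adding a new veto never makes merging more permissive. That ordering is a design choice of this sketch.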
Ranking Formula
rank_score = corroboration_weight * (1 + spike_score) * recency_decay
corroboration_weight = source-count boost with single-source damping
spike_score = recent arrivals / prior arrivals
recency_decay = exponential age decay from first article time
Corroboration
Distinct source count is the primary ranking lever. Single-source stories are useful, but coverage from multiple independent sources is a stronger signal.
Velocity
Recent article arrivals are compared against the prior window to identify accelerating stories. Trending labels require at least two independent sources.
Recency
Older stories decay over time so the front page reflects the current edition instead of yesterday's already-saturated cluster.
Evidence
Story detail pages expose the same ranking inputs as reader-facing evidence: why-ranked reasons, source mix, timeline, and first/last article timestamps.
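To make the ranking formula concrete, here is a small worked sketch. The overall product matches the formula above, but the specific shapes are assumptions: a log2 source-count boost, a 0.5 damping factor for single-source stories, a 12-hour half-life, a floor of 1 on prior arrivals to avoid division by zero, and a spike threshold of 1.0 for the trending label.

```python
import math

# Assumed constants for illustration; SoT's real values are not stated here.
HALF_LIFE_HOURS = 12.0
SINGLE_SOURCE_DAMPING = 0.5


def corroboration_weight(n_sources: int) -> float:
    """Source-count boost, damped when only one source covers the story."""
    weight = math.log2(1 + n_sources)
    return weight * SINGLE_SOURCE_DAMPING if n_sources <= 1 else weight


def spike_score(recent: int, prior: int) -> float:
    """Recent arrivals divided by prior-window arrivals (prior floored at 1)."""
    return recent / max(prior, 1)


def recency_decay(age_hours: float) -> float:
    """Exponential decay measured from the first article time."""
    return 0.5 ** (age_hours / HALF_LIFE_HOURS)


def rank_score(n_sources: int, recent: int, prior: int, age_hours: float) -> float:
    return (
        corroboration_weight(n_sources)
        * (1 + spike_score(recent, prior))
        * recency_decay(age_hours)
    )


def is_trending(n_sources: int, recent: int, prior: int, min_spike: float = 1.0) -> bool:
    """Trending requires at least two independent sources, per the Velocity rule."""
    return n_sources >= 2 and spike_score(recent, prior) >= min_spike
```

With these assumed constants, a story with 3 sources, 4 recent arrivals against 2 prior, first seen 12 hours ago scores log2(4) * (1 + 2) * 0.5 = 3.0. A brand-new single-source story with no arrivals in either window scores 0.5.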
Limits
SoT uses English public feeds and sitemaps in a rolling live window. It can miss stories when a source does not publish usable feed or sitemap metadata, and it can split an event when headlines describe the same incident with very different language.
SoT takes inspiration from event-grouped news products and fast-moving public conversation. It has no access to private social ranking systems and does not try to reproduce them.
Algorithm Credits
The clustering rules, ranking formula, and repair gates are project code. The non-obvious algorithm layer depends on these open-source projects and model weights.
FastEmbed
Runs local text embeddings for candidate story matching without sending article titles to a hosted embedding API.
BAAI/bge-small-en-v1.5
The English embedding model used for title similarity in the semantic clustering tier.
NumPy
Used for vector normalization and cosine similarity calculations inside the clustering pipeline.
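The normalization and cosine similarity step can be sketched with NumPy as follows. In SoT the vectors would be title embeddings from the model credited above; here they are toy 2-D stand-ins, and the function names and the 0.8 threshold are assumptions for the example.

```python
import numpy as np


def cosine_matrix(vectors: np.ndarray) -> np.ndarray:
    """L2-normalize each row, then take pairwise dot products."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / np.clip(norms, 1e-12, None)  # guard against zero vectors
    return unit @ unit.T


def candidate_pairs(vectors, threshold: float = 0.8):
    """Return index pairs (i, j) with i < j whose cosine similarity meets the threshold."""
    sims = cosine_matrix(np.asarray(vectors, dtype=np.float32))
    i, j = np.triu_indices(len(sims), k=1)  # upper triangle, excluding the diagonal
    keep = sims[i, j] >= threshold
    return [(int(a), int(b)) for a, b in zip(i[keep], j[keep])]
```

For the toy input `[[1, 0], [2, 0], [0, 1]]`, the first two rows point in the same direction and form the only candidate pair, `(0, 1)`. Pairs found this way would still need to pass the deterministic event gates before merging.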
