State of Technology

A public front page for technology news that treats the event as the unit: one story, many sources, ranked by corroboration and velocity.

X: @NiravJ3, email: niravjoshi3000@gmail.com

Colophon

Core clustering uses FastEmbed with BAAI/bge-small-en-v1.5 embeddings and NumPy cosine math.

2026 State of Technology

Built by Nirav Joshi

Methodology

Tuesday, June 16, 2026

How SoT Builds a News Edition

The public pipeline behind source ingest, event clustering, ranking, and core algorithm credits.

State of Technology is an event-first news reader. It does not rank articles by clicks, likes, or paid placement. It groups public coverage into story events, then ranks those events by independent corroboration, coverage velocity, and freshness.

Pipeline

Each scheduler pass follows the same path. The live site reads the finished cache, not the raw feeds.

  1. Source ingest

    SoT polls public RSS feeds and bounded public sitemaps from news desks, AI labs, security publications, practitioner sources, and filtered Hacker News queries.

    Sources -> entries

  2. URL cleanup

    Tracking parameters are stripped, canonical URLs are normalized, and exact duplicate URLs are collapsed before clustering.

    Entries -> canonical URLs

  3. Candidate matching

    The clusterer compares titles with token overlap and local embeddings, then sends plausible pairs through deterministic same-event checks.

    Entries -> possible pairs

  4. Event gates

    Candidate merges must survive checks for shared anchors, event facets, conflicting story types, and publish-time span.

    Pairs -> story clusters

  5. Ranking

    Story clusters are scored by independent source corroboration, coverage velocity, and recency decay.

    Clusters -> edition order

  6. Evidence dossier

    Each cluster gets deterministic why-ranked reasons, source mix, first/last article times, and a compact coverage timeline.

    Scores -> reader evidence

  7. Cached API

    The scheduler writes a fresh cluster cache. Site pages read that cache, so reader requests never trigger live clustering.

    Edition -> pages
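
The cache handoff in the last step can be sketched as follows. This is a minimal illustration, not SoT's actual code: the JSON layout and the helper names `write_edition_cache` and `read_edition_cache` are assumptions. The scheduler side writes to a temporary file and renames it into place, so a page request never reads a half-written edition.

```python
import json
import os
import tempfile


def write_edition_cache(clusters, cache_path):
    """Scheduler side: write the finished edition in one step (hypothetical helper)."""
    directory = os.path.dirname(os.path.abspath(cache_path))
    # Write to a temp file in the same directory, then rename it over the old cache.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump({"clusters": clusters}, handle)
        os.replace(tmp_path, cache_path)  # atomic when both paths share a filesystem
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def read_edition_cache(cache_path):
    """Page side: read the cached edition; no clustering happens here."""
    with open(cache_path, encoding="utf-8") as handle:
        return json.load(handle)["clusters"]
```

The design point is the separation: only the scheduler calls the writer, and page handlers only ever call the reader.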

Clustering Guardrails

Exact duplicates

Articles with the same canonical URL are treated as the same source article identity and collapsed before event clustering.
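
A rough sketch of URL cleanup and exact-duplicate collapse, using only the Python standard library. The list of tracking parameters and the function names are assumptions; the methodology does not publish the real normalization rules.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed tracking parameters, for illustration only.
TRACKING_PREFIXES = ("utm_",)
TRACKING_NAMES = {"fbclid", "gclid"}


def canonicalize_url(url):
    """Lowercase scheme and host, drop tracking params and fragment, sort the query."""
    parts = urlsplit(url.strip())
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if not key.lower().startswith(TRACKING_PREFIXES)
        and key.lower() not in TRACKING_NAMES
    ]
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((
        parts.scheme.lower(),
        parts.netloc.lower(),
        path,
        urlencode(sorted(query)),
        "",  # drop the fragment
    ))


def collapse_exact_duplicates(entries):
    """Keep the first entry seen for each canonical URL."""
    seen = {}
    for entry in entries:
        seen.setdefault(canonicalize_url(entry["url"]), entry)
    return list(seen.values())
```

Under these assumed rules, `https://Example.com/story/?utm_source=x&id=7#top` and `https://example.com/story?id=7&fbclid=abc` both reduce to `https://example.com/story?id=7` and count as one article.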

Shared event evidence

Normal semantic merges need enough title similarity or embedding similarity, plus specific anchors such as product names, policy actions, labs, or model names.
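
One way to read this rule is "similar enough by either measure, and at least one specific anchor in common". The sketch below assumes Jaccard overlap as the token measure and takes the embedding similarity as a precomputed number; every threshold and name in it is hypothetical.

```python
import re

# All thresholds are assumptions chosen for illustration.
MIN_TOKEN_OVERLAP = 0.5
MIN_EMBEDDING_SIMILARITY = 0.8
MIN_SHARED_ANCHORS = 1


def title_tokens(title):
    """Lowercased alphanumeric tokens of a headline."""
    return set(re.findall(r"[a-z0-9]+", title.lower()))


def token_overlap(title_a, title_b):
    """Jaccard overlap between the two titles' token sets."""
    a, b = title_tokens(title_a), title_tokens(title_b)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def has_shared_event_evidence(title_a, title_b, embedding_similarity,
                              anchors_a, anchors_b):
    """True when the pair is similar by either measure AND shares an anchor."""
    similar_enough = (
        token_overlap(title_a, title_b) >= MIN_TOKEN_OVERLAP
        or embedding_similarity >= MIN_EMBEDDING_SIMILARITY
    )
    shared_anchors = set(anchors_a) & set(anchors_b)
    return similar_enough and len(shared_anchors) >= MIN_SHARED_ANCHORS
```

Requiring an anchor in addition to similarity is what keeps two look-alike headlines about different products from merging.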

False-merge vetoes

Pairs can be rejected when event facets conflict, when the combined time span is too wide, or when a broad company name is the only thing in common.
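
The three vetoes could look roughly like this for a single pair of articles. The span limit, the set of "broad" names (placeholders here), and the article dictionary shape are all assumptions; for a pair, the combined span is simply the gap between the two publish times.

```python
from datetime import timedelta

# Assumed values; the methodology does not publish the real limits.
MAX_CLUSTER_SPAN = timedelta(hours=72)
BROAD_ANCHORS = {"acme", "globex"}  # placeholder names standing in for broad companies


def merge_veto_reason(article_a, article_b):
    """Return a reason string if the pair should not merge, else None."""
    facet_a, facet_b = article_a["facet"], article_b["facet"]
    if facet_a and facet_b and facet_a != facet_b:
        return "conflicting event facets"

    span = abs(article_a["published"] - article_b["published"])
    if span > MAX_CLUSTER_SPAN:
        return "time span too wide"

    shared = set(article_a["anchors"]) & set(article_b["anchors"])
    # Veto when every shared anchor is a broad company name.
    if shared and shared <= BROAD_ANCHORS:
        return "only a broad company name in common"
    return None
```

Returning a reason instead of a bare boolean makes rejected merges easy to log and audit.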

Repair pass

A conservative repair pass can merge split same-event clusters when lexical overlap, embeddings, anchors, and publish-time proximity all agree.

Ranking Formula

rank_score = corroboration_weight * (1 + spike_score) * recency_decay

corroboration_weight = source-count boost with single-source damping
spike_score = recent arrivals / prior arrivals
recency_decay = exponential age decay from first article time
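
The formula above leaves the exact shape of each term open. A minimal sketch, assuming a log-based source boost, a 12-hour half-life, and a guard for an empty prior window (all three are assumptions, as are the constant values):

```python
import math

# Assumed constants; the methodology names the terms but not their values.
HALF_LIFE_HOURS = 12.0
SINGLE_SOURCE_DAMPING = 0.5


def corroboration_weight(distinct_sources):
    """Log-style source-count boost, damped when only one source reports."""
    if distinct_sources <= 0:
        return 0.0
    weight = 1.0 + math.log2(distinct_sources)
    return weight * SINGLE_SOURCE_DAMPING if distinct_sources == 1 else weight


def spike_score(recent_arrivals, prior_arrivals):
    """Recent arrivals divided by prior arrivals, guarding an empty prior window."""
    return recent_arrivals / max(prior_arrivals, 1)


def recency_decay(age_hours):
    """Exponential decay from first article time; halves every HALF_LIFE_HOURS."""
    return 0.5 ** (max(age_hours, 0.0) / HALF_LIFE_HOURS)


def rank_score(distinct_sources, recent_arrivals, prior_arrivals, age_hours):
    """rank_score = corroboration_weight * (1 + spike_score) * recency_decay."""
    return (
        corroboration_weight(distinct_sources)
        * (1 + spike_score(recent_arrivals, prior_arrivals))
        * recency_decay(age_hours)
    )
```

With these assumed constants, a 12-hour-old cluster with four sources, six recent arrivals, and three prior arrivals scores 3.0 * (1 + 2.0) * 0.5 = 4.5, while a brand-new single-source story with no arrivals in either window scores 0.5.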

Corroboration

Distinct source count is the primary ranking lever. Single-source stories still appear, but their weight is damped; coverage from multiple independent sources is treated as a stronger signal.

Velocity

Recent article arrivals are compared against the prior window to identify accelerating stories. Trending labels require at least two independent sources.
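
A sketch of how a trending label might be assigned. Only the two-source minimum comes from the text above; the window length, the spike threshold, and the use of distinct source names as a stand-in for independence are assumptions.

```python
from datetime import timedelta

# Assumed window and threshold; only the two-source rule is from the methodology.
WINDOW = timedelta(hours=3)
MIN_SPIKE = 1.5
MIN_INDEPENDENT_SOURCES = 2


def is_trending(articles, now):
    """articles: list of (source_name, published_datetime) tuples for one cluster."""
    # Count arrivals in the most recent window and in the window before it.
    recent = sum(1 for _, ts in articles if now - WINDOW <= ts <= now)
    prior = sum(1 for _, ts in articles if now - 2 * WINDOW <= ts < now - WINDOW)
    sources = {source for source, _ in articles}
    spike = recent / max(prior, 1)
    return len(sources) >= MIN_INDEPENDENT_SOURCES and spike >= MIN_SPIKE
```

The source check means a single outlet publishing several follow-ups in quick succession cannot earn the label on its own.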

Recency

Older stories decay over time so the front page reflects the current edition instead of yesterday's already-saturated cluster.

Evidence

Story detail pages expose the same ranking inputs as reader-facing evidence: why-ranked reasons, source mix, timeline, and first/last article timestamps.
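
As an illustration of what such a dossier might contain, the sketch below assembles the four pieces named above from a cluster's articles. The field names, the `source_type` attribute, and the function name are hypothetical.

```python
from collections import Counter


def build_dossier(articles, reasons):
    """articles: dicts with 'source', 'source_type', 'title', 'published' (datetime)."""
    if not articles:
        raise ValueError("a cluster needs at least one article")
    # Sort by publish time so the timeline and first/last stamps are deterministic.
    ordered = sorted(articles, key=lambda a: a["published"])
    return {
        "why_ranked": list(reasons),
        "source_mix": dict(Counter(a["source_type"] for a in ordered)),
        "first_article_at": ordered[0]["published"].isoformat(),
        "last_article_at": ordered[-1]["published"].isoformat(),
        "timeline": [
            {"at": a["published"].isoformat(), "source": a["source"], "title": a["title"]}
            for a in ordered
        ],
    }
```

Because the dossier is derived only from the cluster's own articles and scores, the same inputs always produce the same reader-facing evidence.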

Limits

SoT uses English public feeds and sitemaps in a rolling live window. It can miss stories when a source does not publish usable feed or sitemap metadata, and it can split an event when headlines describe the same incident with very different language.

SoT is inspired by the useful public behavior of event-grouped news products and fast public conversation, but it does not have access to private social ranking systems and does not try to reproduce them.

Algorithm Credits

The clustering rules, ranking formula, and repair gates are project code. The non-obvious algorithm layer depends on these open-source projects and model weights.

FastEmbed

Runs local text embeddings for candidate story matching without sending article titles to a hosted embedding API.

BAAI/bge-small-en-v1.5

The English embedding model used for title similarity in the semantic clustering tier.

NumPy

Used for vector normalization and cosine similarity calculations inside the clustering pipeline.
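
For readers unfamiliar with the calculation, normalization followed by a matrix product gives all pairwise cosine similarities at once. This is a generic sketch with toy vectors, not SoT's code; in the pipeline the rows would be title embeddings, and the function name is an assumption.

```python
import numpy as np


def cosine_similarity_matrix(embeddings):
    """Pairwise cosine similarity for an (n_titles, dim) array of embeddings."""
    vectors = np.asarray(embeddings, dtype=np.float64)
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / np.clip(norms, 1e-12, None)  # avoid dividing by zero
    return unit @ unit.T
```

Entry `[i, j]` of the result is the cosine similarity between title `i` and title `j`; parallel vectors score 1 and orthogonal vectors score 0.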
