
News curator

This software extracts new information from daily news and produces compact daily digests that report only novelties, excluding background or evergreen information. A reader well versed in a specific domain can then quickly scan the dense information that matters from selected article feeds.

This software is distributed under the GNU GPL v3.

Why this project

This repository is built as an end-to-end AI engineering project around a practical task (daily news monitoring):

  • ingestion of articles from RSS feeds,
  • content extraction from raw HTML,
  • LLM-based takeaway generation,
  • embedding-based deduplication with pgvector,
  • daily report generation,
  • offline evaluation scripts for model comparison.

Engineering choices

Here is the implemented workflow, with comments on the engineering choices:

  1. The user provides a list of article feeds.
    • The feeds may cover a specific topic, such as AI, or multiple topics.
  2. The articles are downloaded and stored in PostgreSQL.
    • PostgreSQL was chosen as a general-purpose database that can also serve as a vector database (see below).
    • Downloading is performed by a script designed to run as an idempotent scheduled job.
  3. For each downloaded article, a takeaway capturing the core new information it conveys is generated and stored in PostgreSQL with the pgvector extension.
    • A takeaway is generated by an LLM from the article's title and content.
    • The embedding model BAAI/bge-large-en-v1.5 is used for strong retrieval performance.
    • The vector database is PostgreSQL with extension pgvector, so as to rely on a consolidated database for articles and takeaways.
  4. The daily review is generated from the takeaway store, with deduplication.
    • The takeaways generated for today (or another target date) are compiled in a Markdown review, with links to the original articles for more information.
    • Today's takeaways are deduplicated by selecting one takeaway (the longest) among all similar takeaways covering the same piece of news.
    • Takeaways that are similar to past takeaways (from previous days) are filtered out, since they have already been included in a past review.
    • Similar takeaways are detected in two stages: (1) a cosine-similarity threshold between the embeddings in the takeaway store, then (2) a finer second pass with a cross-encoder.
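The two-stage deduplication above can be sketched in a few lines. This is a minimal illustration only: the thresholds and the `cross_encoder_score` callable are stand-ins for the repository's tuned values and actual cross-encoder model.

```python
import math

SIM_THRESHOLD = 0.85   # hypothetical; the repo derives its own thresholds
CE_THRESHOLD = 0.7     # hypothetical cross-encoder cutoff

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def deduplicate(takeaways, embeddings, cross_encoder_score):
    """Keep one takeaway (the longest) per group of similar takeaways.

    cross_encoder_score(a, b) -> float stands in for a real cross-encoder.
    """
    kept = []  # indices of surviving takeaways
    for i in range(len(takeaways)):
        dup_of = None
        for j in kept:
            # Stage 1: cheap embedding-similarity pre-filter.
            if cosine(embeddings[i], embeddings[j]) >= SIM_THRESHOLD:
                # Stage 2: finer cross-encoder confirmation.
                if cross_encoder_score(takeaways[i], takeaways[j]) >= CE_THRESHOLD:
                    dup_of = j
                    break
        if dup_of is None:
            kept.append(i)
        elif len(takeaways[i]) > len(takeaways[dup_of]):
            kept[kept.index(dup_of)] = i  # prefer the longest duplicate
    return [takeaways[k] for k in kept]
```

The same logic, applied against past takeaways instead of today's, yields the cross-day filtering described above.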

Limitations and future work:

  • The similarity thresholds used to select duplicates (and duplicate candidates for cross-encoder evaluation) were derived from manual investigation. They should instead be derived from a larger set of selected articles covering the same news, by a program included in the repository.
  • Integration tests run on SQLite (faster and easier to run temporarily), while production vector similarity uses PostgreSQL + pgvector.

Architecture

The core module follows Domain-Driven Design (DDD):

  • src/news_curator/domain/: domain entities (Article, Takeaway, Review),
  • src/news_curator/application/: use cases and abstract interfaces,
  • src/news_curator/infrastructure/: fetchers, persistence, LLM adapters,
  • src/news_curator/presentation/: CLI entry points.

Additional directories:

  • eval/: evaluation scripts for takeaway quality,
  • ops/: operational scripts (PostgreSQL, LLM start, pipeline helpers),
  • tests/: unit and integration tests,
  • files/reviews/: generated daily reviews.

Tech stack and tools:

  • Fetching: Playwright
  • HTML parsing: Trafilatura
  • Storage: PostgreSQL (article store) with pgvector (takeaway store) and SQLAlchemy
  • LLM: OpenAI-compatible API (OpenAI, vLLM)
  • Bi-encoder: embedding generation for the vector store (BAAI/bge-large-en-v1.5)
  • Cross-encoder: semantic textual similarity for evaluation and review deduplication
  • Packaging: Docker, Poetry
  • Quality: Pytest, Black, mypy

Installation

1) Prerequisites

  • Python 3.13.x
  • Poetry (Ubuntu package: python3-poetry)
  • Docker + Docker Compose v2 (Ubuntu packages: docker.io and docker-compose-v2)
  • (Optional) pyenv

2) Install dependencies

# Optional, if using pyenv:
pyenv install 3.13.11
pyenv local 3.13.11

poetry env use 3.13.11
poetry install

# Install Playwright browsers.
poetry run playwright install

# Ubuntu 24.04, if Playwright requires extra system libraries:
sudo apt-get install libgstreamer-plugins-bad1.0-0 libavif16

Quick demo

This workflow runs with its own resources (database, LLM inference engine), isolated from the development workflow. It is provided to demonstrate an end-to-end run of the review generation pipeline.

1) Start and initialize the demo PostgreSQL

poetry run ops/scripts/demo/pg_start
poetry run ops/scripts/demo/pg_init

2) Start a local LLM server (vLLM)

ENV_FILE=examples/.env.demo poetry run ops/scripts/llm_start

3) Run the demo pipeline

poetry run ops/scripts/demo/run_pipeline

The demo pipeline uses examples/demo_ai.yaml and writes the review to examples/demo_review_2026-03-06.md.

The 2026-03-06 review should look like:

1. Microsoft and Google confirm Anthropic Claude remains available to non-defense customers despite Defense Department's supply-chain risk designation.   [ https://techcrunch.com/2026/03/06/microsoft-anthropic-claude-remains-available-to-customers-except-the-defense-depar>
2. Anthropic's Claude identified 22 vulnerabilities in Firefox, 14 of which are high-severity, in just two weeks, significantly contributing to Firefox's security updates.   [ https://techcrunch.com/2026/03/06/anthropics-claude-found-22-vulnerabilities-in-firefox-over-two>
3. City Detect, using AI to monitor building health, raises $13M Series A to automate building inspections, enabling cities to track and address issues faster than human crews.   [ https://techcrunch.com/2026/03/06/city-detect-uses-ai-to-help-cities-stay-safe-and-clean/ ]

The configuration file examples/demo_ai.yaml includes 6 articles, but 2 articles are from the previous day (2026-03-05), and one of the articles from 2026-03-06 (https://techcrunch.com/2026/03/06/after-europe-whatsapp-will-let-rival-ai-companies-offer-chatbots-in-brazil/) reports news already covered by an article from 2026-03-05 (https://www.globalbankingandfinance.com/meta-allow-ai-rivals-whatsapp-bid-stave-off-eu-action/). It is therefore detected as a duplicate, already covered in the 2026-03-05 review, and excluded from the review.

4) Stop or remove demo PostgreSQL resources

poetry run ops/scripts/demo/pg_stop
poetry run ops/scripts/demo/pg_remove

Development workflow

1) Configure environment

cp examples/.env .env

Then edit .env if needed. The default values are sufficient to run the local pipeline. The evaluation part (LLM-as-a-judge) requires an OpenAI key or a local OpenAI-compatible API.

2) Start and initialize PostgreSQL

poetry run ops/scripts/pg_start
poetry run ops/scripts/pg_init
poetry run ops/scripts/pg_init_test

ops/scripts/pg_init_test creates an additional database named news_curator_test, with pgvector enabled. It is required by certain PostgreSQL integration tests.

To stop PostgreSQL later:

poetry run ops/scripts/pg_stop

To fully clean PostgreSQL resources (container, network, image, volume):

poetry run ops/scripts/pg_remove

3) Start a local LLM server (vLLM)

poetry run ops/scripts/llm_start

Run the pipeline

1) Download articles

Save a YAML file listing RSS feeds, e.g., ai.yaml:

feeds:
  - name: TechCrunch
    url: https://techcrunch.com/category/artificial-intelligence/feed/
  - name: MIT
    url: https://news.mit.edu/topic/mitartificial-intelligence2-rss.xml
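A feed config of this shape could be loaded with a small helper. The sketch below assumes PyYAML; the `Feed` dataclass and `load_feeds` name are illustrative, not the repository's actual loader.

```python
from dataclasses import dataclass

import yaml  # PyYAML


@dataclass
class Feed:
    name: str
    url: str


def load_feeds(text: str) -> list[Feed]:
    """Parse a feed config of the shape shown above into Feed objects."""
    data = yaml.safe_load(text)
    return [Feed(name=f["name"], url=f["url"]) for f in data["feeds"]]
```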

To download and store the articles from these sources, run:

poetry run python src/news_curator/presentation/run_download_articles.py -c ai.yaml

2) Extract takeaways

From the articles now in the article store, generate the takeaways for each article and store them in the takeaway store with:

poetry run python src/news_curator/presentation/run_extract_takeaways.py --llm-chat-completions

3) Generate daily review

The review for a certain date is generated with:

poetry run python src/news_curator/presentation/run_generate_review.py --date 2026-03-04

Write today's review to a Markdown file:

mkdir -p files/reviews
DATE=$(date -I)
poetry run python src/news_curator/presentation/run_generate_review.py --date "$DATE" > "files/reviews/review-${DATE}.md"

4) Run all steps with helper script

This generates yesterday's and today's reviews:

poetry run ops/scripts/run_pipeline

Evaluation workflow

The eval/takeaways/ scripts A) compare takeaway quality across models and B) support LLM-as-a-judge scoring. They are used to compare various LLMs and select the best for the task.

For both evaluations A and B, a number of articles (e.g., 200) are first extracted from the article store.

Evaluation A

  • Generate a takeaway per article, with a high-quality reference model and with the production model.
  • Compare, using a cross-encoder, the reference takeaways with the production takeaways, and assign semantic similarity scores.
  • Use the scores to compare multiple production models, and select the one closest to the reference LLM.
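Evaluation A boils down to scoring (reference, production) pairs and summarizing the scores. A minimal sketch, where `score_pair` stands in for a real cross-encoder and the statistics match the report columns:

```python
import statistics


def summarize(scores):
    """Summary statistics matching the compare_takeaways report columns."""
    p25, median, p75 = statistics.quantiles(scores, n=4)
    return {
        "min": min(scores),
        "p25": p25,
        "median": median,
        "mean": statistics.mean(scores),
        "p75": p75,
        "max": max(scores),
        "stddev": statistics.stdev(scores),
    }


def evaluate(ref_takeaways, prod_takeaways, score_pair):
    """score_pair(ref, prod) -> similarity score, e.g. from a cross-encoder."""
    return summarize([score_pair(r, p) for r, p in zip(ref_takeaways, prod_takeaways)])
```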

Example

This is example output for evaluation A, with similarity scores between production takeaways and reference takeaways (higher is better). It is available in examples/eval/takeaways/compare_takeaways.md and is generated by eval.takeaways.compare_takeaways.

Reference model: gpt-5

Scoring model: cross-encoder/stsb-roberta-large

| Production model | Articles | Min | P25 | Median | Mean | P75 | Max | Stddev |
|---|---|---|---|---|---|---|---|---|
| unsloth/gemma-3-12b-it-GGUF | 100 | 0.3102 | 0.6412 | 0.6885 | 0.6698 | 0.7283 | 0.8031 | 0.0987 |
| Qwen/Qwen2.5-3B-Instruct | 100 | 0.3404 | 0.5990 | 0.6550 | 0.6464 | 0.7025 | 0.8433 | 0.0900 |
| Qwen/Qwen2.5-0.5B-Instruct | 100 | 0.2631 | 0.5644 | 0.6202 | 0.6074 | 0.6784 | 0.7887 | 0.1057 |

Evaluation B

  • Provide a strong LLM-as-a-judge with the articles and corresponding takeaways.
  • Let the LLM-as-a-judge return a numerical score and a written explanation of its evaluation.
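The judge interaction reduces to a prompt template and a response parser. The sketch below uses an illustrative prompt format and answer convention (the repository's actual prompt may differ); the 1-4 scale matches the score distribution reported in the example.

```python
import re

# Illustrative judge prompt; the repository's actual prompt may differ.
JUDGE_PROMPT = """\
You are grading a takeaway extracted from a news article.

Article:
{article}

Takeaway:
{takeaway}

Rate the takeaway from 1 (poor) to 4 (excellent) and explain briefly.
Answer as: SCORE: <n> EXPLANATION: <text>
"""


def parse_judgement(response: str) -> tuple[int, str]:
    """Extract the numeric score and written explanation from a judge reply."""
    m = re.search(r"SCORE:\s*([1-4])\s*EXPLANATION:\s*(.*)", response, re.S)
    if not m:
        raise ValueError("unparseable judge response")
    return int(m.group(1)), m.group(2).strip()
```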

Example

This is example output for evaluation B, with mean score and score distribution. It is available in examples/eval/takeaways/llm_as_judge.md and is generated by eval.takeaways.report_llm_as_judge.

| Production model | Judge model | Articles | Mean | Score 1 (%) | Score 2 (%) | Score 3 (%) | Score 4 (%) |
|---|---|---|---|---|---|---|---|
| unsloth/gemma-3-12b-it-GGUF | gpt-5 | 100 | 3.0400 | 5.0% | 14.0% | 53.0% | 28.0% |
| Qwen/Qwen2.5-3B-Instruct | gpt-5 | 100 | 2.3900 | 19.0% | 26.0% | 52.0% | 3.0% |
| Qwen/Qwen2.5-0.5B-Instruct | gpt-5 | 100 | 2.0400 | 28.0% | 43.0% | 26.0% | 3.0% |

1) Build an evaluation article dataset

mkdir -p data/eval/takeaways

poetry run python -m eval.takeaways.extract_dataset_articles --num-articles 200 --output data/eval/takeaways/articles.parquet

2) Generate takeaways with a reference model

This step is needed for evaluation A only.

poetry run python -m eval.takeaways.generate_dataset_takeaways \
  --input data/eval/takeaways/articles.parquet \
  --output data/eval/takeaways/ref_takeaways.parquet \
  --llm-id gpt-5-nano \
  --llm-base-url https://api.openai.com/v1/ \
  --llm-api-key "$OPENAI_API_KEY"

3) Generate takeaways with a production model

This step is needed for evaluations A and B. It can be run with different candidate production models to compare them and select the best.

poetry run python -m eval.takeaways.generate_dataset_takeaways \
  --input data/eval/takeaways/articles.parquet \
  --output data/eval/takeaways/prod_takeaways.parquet \
  --llm-id "$PROD_LLM_ID" \
  --llm-base-url "$PROD_LLM_BASE_URL"

4) Compare reference vs production takeaways (semantic similarity)

This is evaluation A.

poetry run python -m eval.takeaways.compare_takeaways \
  --ref data/eval/takeaways/ref_takeaways.parquet \
  --prod data/eval/takeaways/prod_takeaways_model_a.parquet \
  --prod data/eval/takeaways/prod_takeaways_model_b.parquet \
  --markdown-output data/eval/takeaways/compare_takeaways.md

5) Score takeaways with an LLM judge

This is evaluation B.

poetry run python -m eval.takeaways.llm_as_judge \
  --input data/eval/takeaways/prod_takeaways.parquet \
  --output data/eval/takeaways/prod_takeaways_judged.parquet \
  --llm-id gpt-5-nano \
  --llm-base-url https://api.openai.com/v1/ \
  --llm-api-key "$OPENAI_API_KEY"

6) Compare multiple LLM-as-a-judge evaluations

This step only reads Parquet files with scores from step 5) and computes summary statistics and score distributions for evaluation B.

poetry run python -m eval.takeaways.report_llm_as_judge \
  --input data/eval/takeaways/prod_takeaways_model_a_judged.parquet \
  --input data/eval/takeaways/prod_takeaways_model_b_judged.parquet \
  --markdown-output data/eval/takeaways/report_llm_as_judge.md
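The per-model summary in this report amounts to a mean score and a score histogram in percent. A minimal sketch of that computation (the key names are illustrative):

```python
from collections import Counter


def score_distribution(scores):
    """Mean and per-score percentages, matching the report columns."""
    counts = Counter(scores)
    n = len(scores)
    dist = {"mean": sum(scores) / n}
    for s in (1, 2, 3, 4):
        dist[f"score_{s}_pct"] = 100.0 * counts.get(s, 0) / n
    return dist
```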

Quality checks

Run formatting and tests:

poetry run black --check .
poetry run mypy
poetry run pytest

License

GNU GPL v3.