News curator
This software extracts new information from daily news and produces compact daily digests that report only novelties, excluding background or evergreen content. A reader well versed in a specific domain can thus quickly absorb the dense information that matters from selected article feeds.
This software is distributed under the GNU GPL v3.
Why this project
This repository is built as an end-to-end AI engineering project around a practical task (daily news monitoring):
- ingestion of articles from RSS feeds,
- content extraction from raw HTML,
- LLM-based takeaway generation,
- embedding-based deduplication with pgvector,
- daily report generation,
- offline evaluation scripts for model comparison.
Engineering choices
Here is the implemented workflow, with comments on the engineering choices:
- The user provides a list of article feeds.
- The feeds may cover a specific topic, such as AI, or multiple topics.
- The articles are downloaded and stored in PostgreSQL.
- PostgreSQL was chosen as a general-purpose database that can also serve as a vector database (see below).
- The download is a script designed to be run as an idempotent scheduled job.
- A takeaway captures the core new information conveyed by each downloaded article, and is stored in PostgreSQL with the pgvector extension.
- A takeaway is generated by an LLM from the article's title and content.
- The embedding model BAAI/bge-large-en-v1.5 is used for strong retrieval performance.
- The vector database is PostgreSQL with the pgvector extension, so that a single consolidated database serves both articles and takeaways.
- The daily review is generated from the takeaway store, with deduplication.
- The takeaways generated for today (or another target date) are compiled in a Markdown review, with links to the original articles for more information.
- Today's takeaways are deduplicated by selecting one takeaway (the longest) among all similar takeaways covering the same piece of news.
- Takeaways that are similar to past takeaways (from previous days) are filtered out, since they have already been included in a past review.
- Similar takeaways are detected using (1) a cosine-similarity threshold between the embeddings in the takeaway store, and (2) a second, finer filtering pass with a cross-encoder.
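As a rough illustration of the first, threshold-based pass (this is a standalone sketch, not the repository's actual implementation, and the 0.9 threshold is a placeholder), duplicates can be clustered greedily by cosine similarity while keeping only the longest takeaway in each cluster:

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


def deduplicate(takeaways: list[str], embeddings: list[list[float]],
                threshold: float = 0.9) -> list[str]:
    """Greedy dedup: among similar takeaways, keep only the longest one."""
    kept: list[int] = []  # indices of retained takeaways
    for i in range(len(takeaways)):
        duplicate_of = None
        for j in kept:
            if cosine_similarity(embeddings[i], embeddings[j]) >= threshold:
                duplicate_of = j
                break
        if duplicate_of is None:
            kept.append(i)
        elif len(takeaways[i]) > len(takeaways[duplicate_of]):
            # Replace the retained takeaway with the longer variant.
            kept[kept.index(duplicate_of)] = i
    return [takeaways[i] for i in kept]
```

In the actual pipeline, candidates passing this threshold are additionally re-checked with a cross-encoder before being treated as duplicates.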
Limitations and future work:
- The similarity thresholds used to select duplicates, and potential duplicates before cross-encoder evaluation, were derived from manual investigation. They should instead be derived from a larger curated set of articles covering the same news, with the calibration implemented as a program included in the repository.
- Integration tests run on SQLite (faster and easier to run temporarily), while production vector similarity uses PostgreSQL + pgvector.
Architecture
The core module follows a Domain-Driven Design (DDD):
- src/news_curator/domain/: domain entities (Article, Takeaway, Review)
- src/news_curator/application/: use cases and abstract interfaces
- src/news_curator/infrastructure/: fetchers, persistence, LLM adapters
- src/news_curator/presentation/: CLI entry points
Additional directories:
- eval/: evaluation scripts for takeaway quality
- ops/: operational scripts (PostgreSQL, LLM start, pipeline helpers)
- tests/: unit and integration tests
- files/reviews/: generated daily reviews
Tech stack and tools:
- Fetching: Playwright
- HTML parsing: Trafilatura
- Storage: PostgreSQL (article store) with pgvector (takeaway store) and SQLAlchemy
- LLM: OpenAI-compatible API (OpenAI, vLLM)
- Bi-encoder: BAAI/bge-large-en-v1.5 embeddings for the vector store
- Cross-encoder: semantic textual similarity for evaluation and review deduplication
- Packaging: Docker, Poetry
- Quality: Pytest, Black, mypy
Installation
1) Prerequisites
- Python 3.13.x
- Poetry (Ubuntu package: python3-poetry)
- Docker + Docker Compose v2 (Ubuntu packages: docker.io and docker-compose-v2)
- (Optional) pyenv
2) Install dependencies
# Optional, if using pyenv:
pyenv install 3.13.11
pyenv local 3.13.11
poetry env use 3.13.11
poetry install
# Install Playwright browsers.
poetry run playwright install
# Ubuntu 24.04, if Playwright requires extra system libraries:
sudo apt-get install libgstreamer-plugins-bad1.0-0 libavif16
Quick demo
This workflow is isolated from the development workflow (database, LLM inference engine). It is provided to demonstrate an end-to-end run of the review generation pipeline.
1) Start and initialize the demo PostgreSQL
poetry run ops/scripts/demo/pg_start
poetry run ops/scripts/demo/pg_init
2) Start a local LLM server (vLLM)
ENV_FILE=examples/.env.demo poetry run ops/scripts/llm_start
3) Run the demo pipeline
poetry run ops/scripts/demo/run_pipeline
The demo pipeline uses examples/demo_ai.yaml and writes the review to examples/demo_review_2026-03-06.md.
The 2026-03-06 review should look like:
1. Microsoft and Google confirm Anthropic Claude remains available to non-defense customers despite Defense Department's supply-chain risk designation. [ https://techcrunch.com/2026/03/06/microsoft-anthropic-claude-remains-available-to-customers-except-the-defense-depar>
2. Anthropic's Claude identified 22 vulnerabilities in Firefox, 14 of which are high-severity, in just two weeks, significantly contributing to Firefox's security updates. [ https://techcrunch.com/2026/03/06/anthropics-claude-found-22-vulnerabilities-in-firefox-over-two>
3. City Detect, using AI to monitor building health, raises $13M Series A to automate building inspections, enabling cities to track and address issues faster than human crews. [ https://techcrunch.com/2026/03/06/city-detect-uses-ai-to-help-cities-stay-safe-and-clean/ ]
The configuration file examples/demo_ai.yaml includes 6 articles, but 2 articles are from the previous day (2026-03-05), and one of the articles from 2026-03-06 (https://techcrunch.com/2026/03/06/after-europe-whatsapp-will-let-rival-ai-companies-offer-chatbots-in-brazil/) reports news already covered by an article from 2026-03-05 (https://www.globalbankingandfinance.com/meta-allow-ai-rivals-whatsapp-bid-stave-off-eu-action/). It is therefore detected as a duplicate, already covered in the 2026-03-05 review, and excluded from the review.
4) Stop or remove demo PostgreSQL resources
poetry run ops/scripts/demo/pg_stop
poetry run ops/scripts/demo/pg_remove
Development workflow
1) Configure environment
cp examples/.env .env
Then edit .env if needed. The default values are sufficient to run the local pipeline. The evaluation part (LLM-as-a-judge) requires an OpenAI key or a local OpenAI-compatible API.
2) Start and initialize PostgreSQL
poetry run ops/scripts/pg_start
poetry run ops/scripts/pg_init
poetry run ops/scripts/pg_init_test
ops/scripts/pg_init_test creates an additional database named news_curator_test, with pgvector enabled. It is required to run certain PostgreSQL integration tests.
To stop PostgreSQL later:
poetry run ops/scripts/pg_stop
To fully clean PostgreSQL resources (container, network, image, volume):
poetry run ops/scripts/pg_remove
3) Start a local LLM server (vLLM)
poetry run ops/scripts/llm_start
Run the pipeline
1) Download articles
Save a YAML file listing RSS feeds, e.g., ai.yaml:
feeds:
- name: TechCrunch
url: https://techcrunch.com/category/artificial-intelligence/feed/
- name: MIT
url: https://news.mit.edu/topic/mitartificial-intelligence2-rss.xml
To download and store the articles from these sources, run:
poetry run python src/news_curator/presentation/run_download_articles.py -c ai.yaml
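Conceptually, feed ingestion boils down to reading each feed's RSS XML and extracting the item titles and links; the repository then fetches and parses the full articles with Playwright and Trafilatura. A minimal standard-library sketch of the RSS step (not the repository's actual code):

```python
import xml.etree.ElementTree as ET


def parse_rss_items(rss_xml: str) -> list[dict[str, str]]:
    """Extract title/link pairs from the <item> elements of an RSS 2.0 feed."""
    root = ET.fromstring(rss_xml)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
        })
    return items


sample = """<rss version="2.0"><channel><title>Demo feed</title>
<item><title>First article</title><link>https://example.com/a</link></item>
<item><title>Second article</title><link>https://example.com/b</link></item>
</channel></rss>"""
```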
2) Extract takeaways
From the articles now in the article store, generate the takeaways for each article and store them in the takeaway store with:
poetry run python src/news_curator/presentation/run_extract_takeaways.py --llm-chat-completions
3) Generate daily review
The review for a certain date is generated with:
poetry run python src/news_curator/presentation/run_generate_review.py --date 2026-03-04
Write today's review to a Markdown file:
mkdir -p files/reviews
DATE=$(date -I)
poetry run python src/news_curator/presentation/run_generate_review.py --date "$DATE" > "files/reviews/review-${DATE}.md"
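The review itself is essentially a numbered Markdown list of takeaways with links back to the source articles, as in the demo output shown earlier. A minimal rendering sketch (the real format is produced by run_generate_review.py; the field names here are illustrative):

```python
def render_review(takeaways: list[dict[str, str]]) -> str:
    """Render takeaways as a numbered Markdown list with article links."""
    lines = [
        f"{i}. {t['text']} [ {t['url']} ]"
        for i, t in enumerate(takeaways, start=1)
    ]
    return "\n".join(lines)
```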
4) Run all steps with helper script
This generates yesterday's and today's reviews:
poetry run ops/scripts/run_pipeline
Evaluation workflow
The eval/takeaways/ scripts A) compare takeaway quality across models and B) support LLM-as-a-judge scoring. They are used to compare various LLMs and select the best for the task.
For both evaluations A and B, a number of articles (e.g., 200) are first extracted from the article store.
Evaluation A
- Generate a takeaway per article, with a high-quality reference model and with the production model.
- Compare, using a cross-encoder, the reference takeaways with the production takeaways, and assign semantic similarity scores.
- Use the scores to compare multiple production models, and select the one closest to the reference LLM.
Example
This is an example for evaluation A, with similarity scores between production takeaways and reference takeaways (higher is better). This example is available in file examples/eval/takeaways/compare_takeaways.md and generated by eval.takeaways.compare_takeaways.
Reference model: gpt-5
Scoring model: cross-encoder/stsb-roberta-large
| Production model | Articles | Min | P25 | Median | Mean | P75 | Max | Stddev |
|---|---|---|---|---|---|---|---|---|
| unsloth/gemma-3-12b-it-GGUF | 100 | 0.3102 | 0.6412 | 0.6885 | 0.6698 | 0.7283 | 0.8031 | 0.0987 |
| Qwen/Qwen2.5-3B-Instruct | 100 | 0.3404 | 0.5990 | 0.6550 | 0.6464 | 0.7025 | 0.8433 | 0.0900 |
| Qwen/Qwen2.5-0.5B-Instruct | 100 | 0.2631 | 0.5644 | 0.6202 | 0.6074 | 0.6784 | 0.7887 | 0.1057 |
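The summary columns in the table above are standard descriptive statistics over the per-article similarity scores. A sketch of how such a row could be computed with the standard library (the repository's exact percentile convention may differ; statistics.quantiles uses the exclusive method by default):

```python
import statistics


def summarize_scores(scores: list[float]) -> dict[str, float]:
    """Descriptive statistics for a list of per-article similarity scores."""
    q1, _, q3 = statistics.quantiles(scores, n=4)  # quartiles
    return {
        "min": min(scores),
        "p25": q1,
        "median": statistics.median(scores),
        "mean": statistics.mean(scores),
        "p75": q3,
        "max": max(scores),
        "stddev": statistics.stdev(scores),
    }
```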
Evaluation B
- Provide a strong LLM-as-a-judge with the articles and corresponding takeaways.
- Let the LLM-as-a-judge return a numerical score and a written explanation of its evaluation.
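As a hedged illustration of the mechanics (the actual prompt and scoring scale live in eval/takeaways/), a judge call typically builds a prompt from the article and its takeaway, then parses a numeric score out of the judge's reply:

```python
import re


def build_judge_prompt(article: str, takeaway: str) -> str:
    """Hypothetical judge prompt; the real one is defined in eval/takeaways/."""
    return (
        "Rate from 1 (poor) to 4 (excellent) how well the takeaway captures "
        "the article's core new information. Reply as 'Score: N' followed by "
        f"a short justification.\n\nArticle:\n{article}\n\nTakeaway:\n{takeaway}"
    )


def parse_judge_score(reply: str) -> int:
    """Extract the integer score from a reply such as 'Score: 3 ...'."""
    match = re.search(r"Score:\s*([1-4])", reply)
    if match is None:
        raise ValueError(f"No score found in judge reply: {reply!r}")
    return int(match.group(1))
```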
Example
This is an example for evaluation B, with mean score and score distribution. This example is available in file examples/eval/takeaways/llm_as_judge.md and generated by eval.takeaways.report_llm_as_judge.
| Production model | Judge model | Articles | Mean | Score 1 (%) | Score 2 (%) | Score 3 (%) | Score 4 (%) |
|---|---|---|---|---|---|---|---|
| unsloth/gemma-3-12b-it-GGUF | gpt-5 | 100 | 3.0400 | 5.0% | 14.0% | 53.0% | 28.0% |
| Qwen/Qwen2.5-3B-Instruct | gpt-5 | 100 | 2.3900 | 19.0% | 26.0% | 52.0% | 3.0% |
| Qwen/Qwen2.5-0.5B-Instruct | gpt-5 | 100 | 2.0400 | 28.0% | 43.0% | 26.0% | 3.0% |
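The per-score percentage columns above are simply the distribution of judge scores across the evaluated articles. A sketch of the aggregation, assuming the 1-4 scale shown in the table:

```python
from collections import Counter


def score_distribution(scores: list[int]) -> dict[int, float]:
    """Percentage of takeaways receiving each judge score (1-4)."""
    counts = Counter(scores)
    total = len(scores)
    return {s: 100.0 * counts.get(s, 0) / total for s in range(1, 5)}
```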
1) Build an evaluation article dataset
mkdir -p data/eval/takeaways
poetry run python -m eval.takeaways.extract_dataset_articles --num-articles 200 --output data/eval/takeaways/articles.parquet
2) Generate takeaways with a reference model
This step is needed for evaluation A only.
poetry run python -m eval.takeaways.generate_dataset_takeaways \
--input data/eval/takeaways/articles.parquet \
--output data/eval/takeaways/ref_takeaways.parquet \
--llm-id gpt-5-nano \
--llm-base-url https://api.openai.com/v1/ \
--llm-api-key "$OPENAI_API_KEY"
3) Generate takeaways with a production model
This step is needed for evaluation A and B. It can be run with different candidate production models to compare them and select the best.
poetry run python -m eval.takeaways.generate_dataset_takeaways \
--input data/eval/takeaways/articles.parquet \
--output data/eval/takeaways/prod_takeaways.parquet \
--llm-id "$PROD_LLM_ID" \
--llm-base-url "$PROD_LLM_BASE_URL"
4) Compare reference vs production takeaways (semantic similarity)
This is evaluation A.
poetry run python -m eval.takeaways.compare_takeaways \
--ref data/eval/takeaways/ref_takeaways.parquet \
--prod data/eval/takeaways/prod_takeaways_model_a.parquet \
--prod data/eval/takeaways/prod_takeaways_model_b.parquet \
--markdown-output data/eval/takeaways/compare_takeaways.md
5) Score takeaways with an LLM judge
This is evaluation B.
poetry run python -m eval.takeaways.llm_as_judge \
--input data/eval/takeaways/prod_takeaways.parquet \
--output data/eval/takeaways/prod_takeaways_judged.parquet \
--llm-id gpt-5-nano \
--llm-base-url https://api.openai.com/v1/ \
--llm-api-key "$OPENAI_API_KEY"
6) Compare multiple LLM-as-a-judge evaluations
This step only reads Parquet files with scores from step 5) and computes summary statistics and score distributions for evaluation B.
poetry run python -m eval.takeaways.report_llm_as_judge \
--input data/eval/takeaways/prod_takeaways_model_a_judged.parquet \
--input data/eval/takeaways/prod_takeaways_model_b_judged.parquet \
--markdown-output data/eval/takeaways/report_llm_as_judge.md
Quality checks
Run formatting and tests:
poetry run black --check .
poetry run mypy
poetry run pytest
License
GNU GPL v3.