
Word teacher

Word teacher provides an API that generates cloze quizzes for vocabulary learning in English, French and Spanish.

When a quiz is requested for a given word to learn, the system generates:

  • a quiz sentence (cloze sentence) where the target word is replaced by [MASK],
  • a dictionary-style definition, generated by an LLM guided by a short definition randomly extracted from a WordNet,
  • and, when needed, a translation of the definition to the learner's native language.

The goal is to produce natural, unambiguous contexts so the missing word can be inferred reliably.

The system also supports multi-word expressions, not just single words (e.g., "ghost ant"), as well as varying forms of the word (plurals, conjugated verbs).

A GUI, not included in this repository, is expected to display the cloze sentence. The missing word can be rendered as empty squares (one per letter), optionally revealing the first letter for extra guidance.
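As a rough illustration of that rendering, a GUI-side helper might look like the following sketch (the function name and the square character are illustrative, not part of this repository):

```python
def render_cloze(sentence: str, solution: str, reveal_first: bool = False) -> str:
    """Replace [MASK] with one empty square per letter of the solution,
    optionally revealing the first letter as a hint."""
    squares = "".join("□" for _ in solution)
    if reveal_first:
        squares = solution[0] + squares[1:]
    return sentence.replace("[MASK]", squares)
```

For example, `render_cloze("The deer shed its [MASK] in spring.", "antler", reveal_first=True)` yields `"The deer shed its a□□□□□ in spring."`.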

This software is distributed under the GNU GPL v3.

Why this project

This repository is designed as an end-to-end AI engineering project:

  • LLM-driven generation of quiz sentences and definitions,
  • MLM-based scoring to rank quiz candidates by ambiguity,
  • API serving with FastAPI,
  • PostgreSQL persistence for both the word store and pre-generated quizzes,
  • observability with Langfuse,
  • optional fine-tuning (LoRA) and offline evaluation pipelines.

Design choices

The core workflow of the project is the generation of a quiz from a target word:

  1. A short definition is first retrieved from a WordNet for the target language (English, French, or Spanish).
    • Many words have multiple senses in the WordNet. One definition is selected at random, and the quiz is generated for that definition.
    • If the word is not present in the WordNet, the LLM itself selects a definition.
    • This approach makes it possible to cover multiple senses of a word over time while still selecting one specific sense for each quiz.
  2. The LLM generates several candidate cloze sentences for the word.
    • A correctness check verifies that each candidate contains exactly one [MASK] and that the expected solution is valid, including acceptable variants such as inflected forms.
    • A masked language model (MLM) then scores the candidates by measuring how strongly the context points to the intended solution.
    • The candidate with the best MLM score is selected, which helps reduce ambiguity.
  3. The LLM generates a dictionary-style definition in the language of the word.
  4. If the word is not in the learner's native language, the LLM also generates a translation.
  5. All LLM calls are traced with Langfuse.
    • Langfuse and its dependencies are launched locally with Docker Compose.
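The correctness check in step 2 can be sketched as follows; this is a simplified stand-in for the repository's actual validation logic, with a hypothetical `variants` set standing in for the accepted inflected forms:

```python
import re


def check_candidate(sentence: str, solution: str, variants: set[str]) -> bool:
    """Validate one candidate cloze sentence: it must contain exactly one
    [MASK] token, and the proposed solution must be the target word or
    one of its accepted variants (e.g., inflected forms)."""
    if len(re.findall(re.escape("[MASK]"), sentence)) != 1:
        return False
    return solution.lower() in {v.lower() for v in variants}
```

Candidates failing either condition are discarded before MLM scoring.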

Multiple LLMs can be evaluated and compared with offline metrics:

  • Cloze validity: the sentence contains exactly one [MASK].
  • Solution validity: the returned solution is correct.
  • Expected-word logit(s): higher is better because the MLM considers the intended solution more likely.
  • Logit margin against the top predicted token(s): higher is better because it indicates a less ambiguous sentence.
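The margin metric can be sketched in isolation, assuming the MLM's per-token logits at the [MASK] position have already been extracted into a plain mapping (a hypothetical shape, not the repository's actual data structure):

```python
def logit_margin(logits: dict[str, float], expected: str) -> float:
    """Margin between the expected token's logit and the best competing
    token's logit at the [MASK] position. A positive margin means the MLM
    prefers the intended solution, i.e., the sentence is less ambiguous."""
    best_other = max(v for token, v in logits.items() if token != expected)
    return logits[expected] - best_other
```

A candidate whose margin is higher points more strongly to the intended solution, which is why the best-margin candidate is kept.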

The repository also includes a fine-tuning pipeline to improve cloze sentence generation:

  1. Hundreds of difficult or rare words are extracted from specific Wiktionary pages for English, French, and Spanish.
  2. Sentences containing these words are extracted from Project Gutenberg texts.
    • This provides candidate sentences drawn from rich, literary language.
  3. For each word, the selected cloze sentence is the candidate that maximizes the MLM score of the solution.
    • In practice, this keeps the least ambiguous sentence among the candidates.
    • The resulting dataset is exported as a Parquet file and as a Hugging Face dataset.
  4. Supervised fine-tuning is then carried out with unsloth.
    • Each LLM input is a word in English, French, or Spanish.
    • The corresponding target is the cloze sentence and its solution, in JSON format.
    • The training uses standard LoRA fine-tuning with roughly 900 examples (300 per language).
    • Example evaluations of a fine-tuned model are shown later in this README.
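A single supervised fine-tuning pair described in step 4 might be assembled as below; the prompt template and field names are illustrative assumptions, not the exact format used by the training scripts:

```python
import json


def make_training_example(word: str, lang: str, sentence: str, solution: str) -> dict:
    """Build one fine-tuning pair: the input is the target word, the
    target is the cloze sentence and its solution serialized as JSON."""
    prompt = f"Word ({lang}): {word}"  # hypothetical prompt template
    completion = json.dumps(
        {"sentence": sentence, "solution": solution}, ensure_ascii=False
    )
    return {"prompt": prompt, "completion": completion}
```

With ~900 such pairs (300 per language), standard LoRA fine-tuning is then applied.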

The final product is an API:

  1. It allows adding words and requesting quizzes.
  2. The word store and quiz store are in a PostgreSQL database.
    • By default, PostgreSQL runs in a Docker container.
  3. To reduce response latency, a scheduled job can pre-generate quizzes and store them for later serving by the API.

Architecture

The codebase follows Domain-Driven Design (DDD):

  • src/word_teacher/domain/: domain entities and business rules,
  • src/word_teacher/application/: use cases,
  • src/word_teacher/infrastructure/: LLM and MLM access, and persistence (word and quiz stores),
  • src/word_teacher/presentation/: FastAPI layer and dependency wiring.
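The layering can be sketched with a toy example; every name below (`Quiz`, `QuizStore`, `get_quiz`) is illustrative, not the repository's actual API:

```python
from dataclasses import dataclass
from typing import Optional, Protocol


@dataclass(frozen=True)
class Quiz:
    """Domain entity (domain/): one cloze quiz for one word sense."""
    word: str
    lang: str
    sentence: str  # contains exactly one [MASK]
    definition: str


class QuizStore(Protocol):
    """Port implemented in infrastructure/ (e.g., PostgreSQL-backed)."""
    def oldest(self) -> Optional[Quiz]: ...


def get_oldest_quiz(store: QuizStore) -> Optional[Quiz]:
    """Use case (application/): serve the oldest pre-generated quiz.
    The presentation/ layer wires a concrete store into this function."""
    return store.oldest()
```

The application layer depends only on the `QuizStore` protocol, so the PostgreSQL implementation can be swapped for an in-memory one in tests.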

Around this core module word_teacher, the repository includes the following directories:

  • eval/: offline evaluation of the LLMs used for quiz generation,
  • training/: LoRA fine-tuning to generate better cloze sentences (less ambiguous, with fewer generation errors),
  • ops/: deployment scripts (DevOps),
  • tests/: unit and integration tests, API smoke tests.

Tech stack and tools:

  • Storage: PostgreSQL (SQLAlchemy)
  • Lexical sources: WordNets (English, French, Spanish), Wiktionary, Project Gutenberg (Hugging Face dataset)
  • LLM: OpenAI-compatible API (vLLM)
  • MLM (masked language model): online selection and offline evaluation
  • Fine-tuning: unsloth, TRL
  • API: FastAPI
  • Observability: Langfuse
  • Packaging: Docker Compose, Poetry
  • Quality: Pytest, mypy, Ruff

Quickstart (local development)

1) Prerequisites

  • Python 3.13.x
  • Poetry
  • Docker + Docker Compose v2
  • (Optional) pyenv to install/manage Python versions

2) Install dependencies

# If using pyenv:
pyenv install 3.13.11
pyenv local 3.13.11

poetry env use 3.13.11
poetry install

3) Configure environment

cp .env.example .env

Edit .env following the instructions in the file. The configuration works out of the box once the SET placeholders (Langfuse-related secrets) are replaced.

4) Download WordNet data (local API mode)

The WordNet data for English, French and Spanish provide short definitions for the words. Make sure the exported $WN_DATA_DIR below matches the WN_DATA_DIR value in .env, and run:

export WN_DATA_DIR="${HOME}/.wn_data/"
poetry run python -m wn -d "$WN_DATA_DIR" download "omw-en"
poetry run python -m wn -d "$WN_DATA_DIR" download "omw-es"
poetry run python -m wn -d "$WN_DATA_DIR" download "omw-fr"

5) Start infrastructure

# Creates the Docker network "word_teacher_net".
ops/scripts/prepare

# Deploying the PostgreSQL DB in a Docker container.
ops/scripts/pg_start
poetry run python ops/scripts/pg_init

# Deploying Langfuse in a Docker container.
ops/scripts/langfuse_start

6) Start the LLM engine and the API

# GPU inference with vLLM (default mode).
poetry run ops/scripts/llm_start --gpu

# API.
poetry run ops/scripts/api_start

You can now open:

  • http://localhost:8000/docs (FastAPI docs), the port is set by API_PORT in .env.
  • http://localhost:3000 (Langfuse), the port is set by LANGFUSE_BASE_URL_PORT in .env.

7) Optional: seed sample words

poetry run ops/scripts/pg_seed

This adds English, French and Spanish words to the word store. It is also possible to add words through the API (see below).

API usage examples

Add words:

curl -X POST "http://localhost:8000/add_words" \
  -H "Content-Type: application/json" \
  --data '[
    {"word": "antler", "lang": "en"},
    {"word": "sistre", "lang": "fr"},
    {"word": "hipálage", "lang": "es"}
  ]'

Generate one quiz (here, with optional query parameter lang for English):

curl "http://localhost:8000/generate_quiz?lang=en"

Pre-generate multiple quizzes:

time poetry run python src/word_teacher/presentation/cron_generate_quizzes.py --num-quizzes 10

Read the oldest pre-generated and stored quiz:

curl "http://localhost:8000/get_quiz"

Notes on quiz generation:

  • GET /generate_quiz generates a quiz and returns it (no storage).
  • cron_generate_quizzes.py pre-generates quizzes and inserts them into the quiz store.
  • GET /get_quiz reads the oldest pre-generated quiz from the quiz store (?drop=true also removes it).
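A client calling these endpoints mostly needs to build the right URLs; a minimal sketch, assuming the default base URL from .env (the helper name is hypothetical):

```python
from urllib.parse import urlencode

BASE_URL = "http://localhost:8000"  # adjust to API_PORT in .env


def quiz_url(endpoint: str, **params: str) -> str:
    """Build a request URL for the quiz endpoints shown above,
    e.g., generate_quiz with an optional lang query parameter."""
    query = urlencode(params)
    return f"{BASE_URL}/{endpoint}" + (f"?{query}" if query else "")
```

For example, `quiz_url("generate_quiz", lang="en")` and `quiz_url("get_quiz", drop="true")` reproduce the curl calls above.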

Deployment on a server with Docker API image

The API can also run in Docker. However, the LLM server must still be started locally.

ops/scripts/prepare

# PostgreSQL
ops/scripts/pg_start
poetry run python ops/scripts/pg_init

# Langfuse
ops/scripts/langfuse_start

# LLM (local)
poetry run ops/scripts/llm_start --gpu

# API (Docker image)
ops/scripts/api_docker

Note: ops/scripts/llm_start --cpu requires llama.cpp (llama-server) installed separately. Also set PROD_LLM_USE_RESPONSES_API=0 with llama.cpp.

ML pipeline

Fine-tuning

The fine-tuning dataset is built from candidate rare words extracted from Wiktionary for English, French and Spanish. Quality sentences for roughly 900 words (300 per language) are extracted from Project Gutenberg (Hugging Face dataset manu/project_gutenberg).

mkdir -p training/data
cd training

poetry run python find_candidate_words.py
time poetry run python generate_sentence_training_parquet.py -i data/candidate_words.parquet -n 1000 -o data/sentence_training_dataset.parquet
time poetry run python generate_sentence_training_dataset.py -i data/sentence_training_dataset.parquet -o data/sentence_training_dataset
time poetry run python train_sentence.py -i data/sentence_training_dataset -m Qwen/Qwen2.5-3B-Instruct -o data/LoRA-Qwen2.5-3B-Instruct -O data/trained-Qwen2.5-3B-Instruct

# If only a LoRA adapter checkpoint was saved or stored, rebuild a merged model.
time poetry run python merge_lora_adapter.py -a data/LoRA-Qwen2.5-3B-Instruct -o data/trained-Qwen2.5-3B-Instruct -m Qwen/Qwen2.5-3B-Instruct --dtype float16 --device cpu

Evaluation

Generate quizzes and evaluate them:

cd eval/quiz/
mkdir -p data reports

time poetry run python generate_quizzes.py -i evaluation_words.json -n 5 -o data/quizzes-{session_id}.parquet
# Or English only.
time poetry run python generate_quizzes.py -i evaluation_words.json -l en -n 6 -o data/quizzes-{session_id}.parquet

poetry run python evaluate_quizzes.py -i data/quizzes-2602161632.parquet --report-md reports/report-{session_id}.md --report-json reports/report-{session_id}.json

evaluate_quizzes.py reports metrics to Langfuse under the session ID {session_id} and writes:

  • eval/quiz/reports/report-{session_id}.json (machine-readable metrics/metadata),
  • eval/quiz/reports/report-{session_id}.md (human-readable report).

Compare two evaluation reports:

cd eval/quiz/
poetry run python compare_reports.py -b reports/report-2603010701.json -c reports/report-2603021103.json -o reports/compare-2603010701-vs-2603021103.md --base-label base --candidate-label fine-tuned

Example metric comparison

Below is an example produced by compare_reports.py for Qwen/Qwen2.5-3B-Instruct as baseline.

Metrics:

  • Mask validity: sentence contains exactly one [MASK]
  • Valid solution: returned solution is correct
  • Zero mask: sentence contains no [MASK]
  • Multiple masks: sentence contains more than one [MASK]
  • Avg logit margin: expected-token (solution) logit minus predicted-token logits, according to the MLM (higher is better)
  • Avg predicted logits: expected-token logits, according to the MLM (higher is better)

Inputs:

  • Baseline: reports/report-2603010701.json
  • Candidate: reports/report-2603021103.json
  • Baseline Langfuse session: 2603010701
  • Candidate Langfuse session: 2603021103
  • Baseline model: Qwen/Qwen2.5-3B-Instruct
  • Candidate model: data/trained-Qwen2.5-3B-Instruct
Metric                  base        fine-tuned   Delta (candidate - base)
Mask validity %         85.60       89.60        +4.0000 (improved)
Valid solution %        80.00       96.80        +16.8000 (improved)
Zero mask %             8.80        6.40         -2.4000 (improved)
Multiple masks %        5.60        4.00         -1.6000 (improved)
Avg logit margin        -25.2521    -24.6688     +0.5833 (improved)
Avg predicted logits    39.7602     42.9442      +3.1839 (improved)
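compare_reports.py presumably computes the deltas along these lines; a sketch where the direction table and function names are assumptions, not the script's actual code:

```python
# Which direction counts as an improvement for each metric.
HIGHER_IS_BETTER = {
    "Mask validity %": True,
    "Valid solution %": True,
    "Zero mask %": False,        # error rates: lower is better
    "Multiple masks %": False,
    "Avg logit margin": True,
    "Avg predicted logits": True,
}


def compare(metric: str, base: float, candidate: float) -> tuple[float, bool]:
    """Return the delta (candidate - base) and whether the candidate
    improved, given the metric's direction."""
    delta = candidate - base
    improved = delta > 0 if HIGHER_IS_BETTER[metric] else delta < 0
    return delta, improved
```

Note that for the error-rate rows (Zero mask %, Multiple masks %) a negative delta is an improvement, which matches the table above.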

Quality checks

Run linting and tests:

poetry run ruff check src tests
poetry run mypy
poetry run pytest

Install the local git pre-commit hook (which runs the same linting and tests):

ops/scripts/install_git_hooks

Current limitations

This project was built to explore realistic AI engineering constraints, but some choices still reflect its role as a portfolio project. The main example is the prompt used for cloze sentence generation. The current prompt asks the model to return a single string containing exactly one [MASK]. A more production-oriented design would likely request a structured response with the text before and after [MASK], which would enforce exactly one [MASK] by construction. The current prompt is intentionally harder, because it makes the generation and fine-tuning problem more interesting.
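The structured alternative mentioned above could be modeled as follows (a sketch; the class and field names are hypothetical):

```python
from dataclasses import dataclass


@dataclass
class ClozeParts:
    """Structured response: the model returns the text before and after
    the blank, so the assembled sentence has exactly one [MASK] by
    construction rather than by validation."""
    before: str
    after: str
    solution: str

    def sentence(self) -> str:
        return f"{self.before}[MASK]{self.after}"
```

With this shape, the mask-validity check becomes unnecessary, at the cost of a less challenging generation task.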

There are also a few operational limitations:

  • The Docker workflow still requires manual database initialization via ops/scripts/pg_init.
  • The LLM engine is not packaged in Docker Compose; vLLM or another engine must be started separately.

License

GNU GPL v3.