Self-Hosted ML Data Annotation

Your users are already
labeling your data.
HiveGuard captures it.

Every bot verification challenge is a labeling task. HiveGuard sits in front of your web app — suspicious traffic gets an interactive challenge, and the user's answer labels a training item in your dataset. One feedback loop: protect your site, build your dataset, train a better model.

Runs on-premise, behind a firewall, or on any cloud. MIT-licensed. No per-solve fees, ever.

The HiveGuard Difference

Traditional CAPTCHAs
Waste the Work.

Traditional CAPTCHAs
Friction. Zero data value returned.
  • Static rules — no learning from outcomes, easily bypassed by modern bots
  • User work is wasted — solving a CAPTCHA produces zero reusable data
  • Depends on external services — Google reCAPTCHA, hCaptcha, Turnstile
  • No control over what challenges are shown or how data is used
  • Privacy concerns — fingerprinting and tracking by third-party scripts
$0 upfront / but zero data value returned
HiveGuard — Every challenge labels a training item
Protection that builds your dataset.
  • ML model trained on YOUR traffic — risk scoring improves with use
  • Every challenge labels your training data — bot check = annotation task
  • Fully self-hosted — data never leaves your infrastructure
  • You choose the datasets, modalities, and labeling tasks
  • Open-source, MIT-licensed — inspect, customize, extend without restrictions
$0 platform cost / + labeled training data
Features

Annotation That
Runs Itself.

🧠
Adaptive ML Scoring

Starts with heuristic rules, graduates to a trained ML model. Challenge outcomes (solved = human, expired = bot) feed back into a LogisticRegression classifier that improves with every interaction.
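The outcome-to-weights loop described above can be sketched in plain Python. This is a minimal stand-in for the LogisticRegression classifier; the toy features and training outcomes below are illustrative, not HiveGuard's actual inputs.

```python
import math

def train_logreg(X, y, lr=0.1, epochs=200):
    """Minimal logistic regression via gradient descent.
    X: feature vectors; y: 1 = human (solved), 0 = bot (expired)."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + b
            p = 1.0 / (1.0 + math.exp(-z))        # predicted P(human)
            err = p - yi
            w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
            b -= lr * err
    return w, b

def risk_score(w, b, x):
    """Risk = P(bot) = 1 - P(human)."""
    z = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1.0 - 1.0 / (1.0 + math.exp(-z))

# Toy challenge outcomes: [has_cookies, mouse_moved, req_rate]
X = [[1, 1, 0.02], [1, 1, 0.05], [0, 0, 0.9], [0, 0, 0.7]]
y = [1, 1, 0, 0]          # solved -> human, expired -> bot
w, b = train_logreg(X, y)
```

With clearly separated outcomes, a model like this learns to rank bot-like feature vectors above human-like ones after a few hundred updates.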

🖼️
Multi-Modal Challenges

Image grids, audio snippets, text classification, and more. Each challenge type is pluggable — serve the modality that makes sense for your audience and your labeling pipeline.
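What "pluggable" could look like in practice: a hypothetical Python interface for challenge types. The class and method names here are assumptions for illustration; HiveGuard's actual plug-in API may differ.

```python
from abc import ABC, abstractmethod

class ChallengeType(ABC):
    """Hypothetical plug-in interface (illustrative, not HiveGuard's API)."""
    modality: str

    @abstractmethod
    def render(self, ground_truth_item, unknown_item) -> dict:
        """Build the payload the widget displays."""

    @abstractmethod
    def check(self, ground_truth_item, answer) -> bool:
        """Verify the answer against the known item."""

class ImageGridChallenge(ChallengeType):
    modality = "grid"

    def render(self, ground_truth_item, unknown_item):
        return {"modality": self.modality,
                "prompt": "Select all images that match",
                "items": [ground_truth_item, unknown_item]}

    def check(self, ground_truth_item, answer):
        # Order-insensitive match against the known-good selection
        return set(answer) == set(ground_truth_item["matches"])
```

A new modality is then just another subclass registered with the widget, with no change to the scoring or consensus layers.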

🤝
Consensus Engine

Responses from multiple users are merged through majority voting. Ground-truth items verify that the human is real; unknown items accumulate votes until a high-confidence label emerges.
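Majority voting is simple to reason about. A minimal sketch of the merge step; the vote-count and agreement thresholds are illustrative parameters, not HiveGuard's defaults.

```python
from collections import Counter

def consensus(votes, min_votes=3, threshold=0.75):
    """Merge user responses for one unknown item by majority vote.
    Returns the winning label once enough votes agree, else None."""
    if len(votes) < min_votes:
        return None                     # keep collecting responses
    label, count = Counter(votes).most_common(1)[0]
    if count / len(votes) >= threshold:
        return label                    # high-confidence label emerged
    return None
```

An item voted ["cat", "cat", "cat", "dog"] converges to "cat" at 75% agreement, while ["cat", "dog", "cat"] stays open and keeps accumulating votes.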

🔄
Closed Feedback Loop

Every challenge outcome becomes training data. Retrain the risk model from the dashboard with one click. The model learns which request patterns (headers, UA, timing) correlate with bots vs. humans.

🎯
Flexible Deployment

Run as a reverse proxy, embed the widget standalone, or call the REST API directly. Works on-premise, in the cloud, or air-gapped. Single Docker Compose command to deploy.

📊
Dashboard & Export

Monitor challenge throughput, model accuracy, dataset growth, and consensus convergence. Export labeled datasets for your ML pipeline. Manage datasets, API keys, and proxy rules.
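A typical hand-off from the export to a training pipeline, assuming the export is a list of labeled records. The item_id/label/confidence field names are an assumed shape for illustration, not a documented schema.

```python
import json

def to_jsonl(records, path, min_confidence=0.9):
    """Keep only high-confidence labels from an exported dump (assumed
    shape: item_id/label/confidence) and write one JSON object per line."""
    kept = [r for r in records if r["confidence"] >= min_confidence]
    with open(path, "w") as f:
        for r in kept:
            f.write(json.dumps({"id": r["item_id"], "label": r["label"]}) + "\n")
    return len(kept)

export = [
    {"item_id": "img_001", "label": "cat", "confidence": 0.97},
    {"item_id": "img_002", "label": "dog", "confidence": 0.62},
]
n = to_jsonl(export, "train.jsonl")
```

JSONL is a common interchange format for annotation pipelines, so low-agreement items can be filtered out before they reach training.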

Use Cases

Built for ML Teams.

๐Ÿข
ML Teams

Turn existing web traffic into an annotation workforce. Label images, text, or audio without hiring annotators — every bot check is a labeling task. Scale with your traffic, not your budget.

🎓
Research Labs

Self-hosted annotation on your own infrastructure. Data never leaves your network. Deploy on campus servers with Docker — GDPR and ethics-board friendly.

🔐
On-Premise & Air-Gap

No SaaS dependency. The ML model trains locally from your own traffic data. Works completely offline — risk scoring, challenges, and consensus all self-contained.

🌐
Web Applications

Drop-in reverse proxy that protects login pages, APIs, and forms from automated traffic while building labeled datasets from real user interactions.

How It Works

The Feedback Loop.

01
Score & Intercept

Every inbound request is scored by the ML risk model (or heuristic fallback). Requests above the threshold are intercepted and served a verification challenge. Request features are captured for training.

12 features → risk score 0.0–1.0
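A heuristic fallback along these lines is easy to picture. The signals and weights below are assumptions for illustration; the actual 12 features are not enumerated on this page.

```python
def heuristic_risk(req):
    """Illustrative heuristic fallback scorer, clamped to 0.0-1.0.
    Signals and weights are assumptions, not HiveGuard's actual rules."""
    score = 0.0
    ua = req.get("user_agent", "")
    if not ua or "bot" in ua.lower():
        score += 0.4                              # missing or bot-like UA
    if not req.get("accept_language"):
        score += 0.2                              # real browsers send this
    if req.get("requests_per_minute", 0) > 60:
        score += 0.3                              # suspicious request rate
    if not req.get("cookies_enabled", False):
        score += 0.1
    return min(score, 1.0)

THRESHOLD = 0.5
req = {"user_agent": "curl/8.4.0", "requests_per_minute": 120}
needs_challenge = heuristic_risk(req) >= THRESHOLD
```

Anything scoring at or above the threshold is intercepted and served a challenge; everything else passes through to the upstream app.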
02
Challenge & Label

Each challenge pairs a ground-truth item (known answer, verifies the human) with an unknown item (collects a label). Correct ground-truth answers are recorded as "solved" — the model learns that this request pattern means human.

solved → human · expired → bot
03
Train & Improve

Retrain the model from the dashboard API. The classifier learns which request patterns correlate with bots vs. humans. Unknown item labels converge via consensus voting and are exported for your ML pipeline.

POST /dashboard/api/ml/train
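The one-click retrain can also be scripted against the same endpoint. A sketch using Python's standard library; the guard.example.com host and the X-API-Key header follow the verification example elsewhere on this page and are assumptions.

```python
import urllib.request

# Build the retrain request; the endpoint path comes from the step above.
req = urllib.request.Request(
    "https://guard.example.com/dashboard/api/ml/train",
    method="POST",
    headers={"X-API-Key": "your-api-key"},
)
# urllib.request.urlopen(req)  # uncomment to actually trigger a retrain
```

Wiring this into a cron job retrains on a schedule instead of on demand.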
Live Demo

Every Challenge Type,
Right Here.

Five modalities — the exact same widget your users will see.
Pick one and interact with it live.

🐝 HiveGuard — live challenge
Select all images that match the prompt — a classic reCAPTCHA-style image selection challenge.
Integration

Up in 5 Minutes.

Three lines of config. One Docker command. Annotation pipeline running.

# docker-compose.yml
services:
  hiveguard:
    image: ghcr.io/buiapp/hiveguard-internal:latest
    ports: ["8080:8080"]
    environment:
      UPSTREAM_URL: http://your-app:3000
      OWNER_ORG: acme
      PROTECTED_PATHS: '["/login", "/register", "/api/submit"]'
      DATABASE_URL: postgresql+asyncpg://user:pass@db/hiveguard
      REDIS_URL: redis://redis:6379
    depends_on: [db, redis]
# nginx.conf — point your domain at HiveGuard
upstream hiveguard {
    server localhost:8080;
}

server {
    listen 443 ssl;
    server_name app.example.com;

    location / {
        proxy_pass         http://hiveguard;
        proxy_set_header   Host $host;
        proxy_set_header   X-Real-IP $remote_addr;
    }
}
<!-- 1. Load the widget script once -->
<script src="https://guard.example.com/_hiveguard/widget.js" defer></script>

<!-- 2. Place the mount point -->
<!--    data-modality: pattern | grid | image | text | audio -->
<div
  data-hg-widget
  data-api-key="hg_your_widget_token"
  data-modality="pattern"
  data-callback="onHiveGuardPass"
></div>

<!-- 3. Handle the verified token -->
<script>
function onHiveGuardPass(token) {
  // pass token to your server for verification
  document.getElementById('hg-token').value = token;
  document.getElementById('my-form').submit();
}
</script>
# Verify a submitted challenge token server-side (Flask-style handler)
import httpx
from flask import abort, request

response = httpx.post(
    "https://guard.example.com/api/challenge/verify",
    headers={"X-API-Key": "your-api-key"},
    json={"token": request.form["hg-token"]},
)

if response.json()["success"]:
    # human verified — proceed
    create_account(request)
else:
    # bot or failed — reject / re-challenge
    abort(403)

Full documentation and OpenAPI spec available in the GitHub repository. Need help? Open an issue or start a discussion.

Ready to deploy?

Your Annotators
Are Already There