Self-Hosted ML Data Annotation

Your users are already
labeling your data.
HiveGuard captures it.

Every bot verification challenge is a labeling task. HiveGuard sits in front of your web app — suspicious traffic gets an interactive challenge, and the user's answer labels a training item in your dataset. One feedback loop: protect your site, build your dataset, train a better model.

Runs on-premise, behind a firewall, or on any cloud. MIT-licensed. No per-solve fees, ever.

The HiveGuard Difference

Traditional CAPTCHAs
Waste the Work.

Traditional CAPTCHAs
Friction. Zero data value returned.
  • Static rules — no learning from outcomes, easily bypassed by modern bots
  • User work is wasted — solving a CAPTCHA produces zero reusable data
  • Depends on external services — Google reCAPTCHA, hCaptcha, Turnstile
  • No control over what challenges are shown or how data is used
  • Privacy concerns — fingerprinting and tracking by third-party scripts
$0 upfront / but zero data value returned
HiveGuard — Every challenge labels a training item
Protection that builds your dataset.
  • ML model trained on YOUR traffic — risk scoring improves with use
  • Every challenge labels your training data — bot check = annotation task
  • Fully self-hosted — data never leaves your infrastructure
  • You choose the datasets, modalities, and labeling tasks
  • Open-source, MIT-licensed — inspect, customize, extend without restrictions
$0 platform cost / + labeled training data
Features

Annotation That
Runs Itself.

🧠
Adaptive ML Scoring

Starts with heuristic rules, graduates to a trained ML model. Challenge outcomes (solved = human, expired = bot) feed back into a LogisticRegression classifier that improves with every interaction.
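The outcome-to-weights loop described above can be sketched in plain Python. This is a minimal stand-in for the LogisticRegression classifier; the toy features and training outcomes below are illustrative, not HiveGuard's actual inputs.

```python
import math

def train_logreg(X, y, lr=0.1, epochs=200):
    """Minimal logistic regression via gradient descent.
    X: feature vectors; y: 1 = human (solved), 0 = bot (expired)."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + b
            p = 1.0 / (1.0 + math.exp(-z))        # predicted P(human)
            err = p - yi
            w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
            b -= lr * err
    return w, b

def risk_score(w, b, x):
    """Risk = P(bot) = 1 - P(human)."""
    z = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1.0 - 1.0 / (1.0 + math.exp(-z))

# Toy challenge outcomes: [has_cookies, mouse_moved, req_rate]
X = [[1, 1, 0.02], [1, 1, 0.05], [0, 0, 0.9], [0, 0, 0.7]]
y = [1, 1, 0, 0]          # solved -> human, expired -> bot
w, b = train_logreg(X, y)
```

With clearly separated outcomes, a model like this learns to rank bot-like feature vectors above human-like ones after a few hundred updates.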

🖼️
Multi-Modal Challenges

Image grids, audio snippets, text classification, and more. Each challenge type is pluggable — serve the modality that makes sense for your audience and your labeling pipeline.
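What "pluggable" could look like in practice: a hypothetical Python interface for challenge types. The class and method names here are assumptions for illustration; HiveGuard's actual plug-in API may differ.

```python
from abc import ABC, abstractmethod

class ChallengeType(ABC):
    """Hypothetical plug-in interface (illustrative, not HiveGuard's API)."""
    modality: str

    @abstractmethod
    def render(self, ground_truth_item, unknown_item) -> dict:
        """Build the payload the widget displays."""

    @abstractmethod
    def check(self, ground_truth_item, answer) -> bool:
        """Verify the answer against the known item."""

class ImageGridChallenge(ChallengeType):
    modality = "grid"

    def render(self, ground_truth_item, unknown_item):
        return {"modality": self.modality,
                "prompt": "Select all images that match",
                "items": [ground_truth_item, unknown_item]}

    def check(self, ground_truth_item, answer):
        # Order-insensitive match against the known-good selection
        return set(answer) == set(ground_truth_item["matches"])
```

A new modality is then just another subclass registered with the widget, with no change to the scoring or consensus layers.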

🤝
Consensus Engine

Responses from multiple users are merged through majority voting. Ground-truth items verify that the human is real; unknown items accumulate votes until a high-confidence label emerges.
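Majority voting is simple to reason about. A minimal sketch of the merge step; the vote-count and agreement thresholds are illustrative parameters, not HiveGuard's defaults.

```python
from collections import Counter

def consensus(votes, min_votes=3, threshold=0.75):
    """Merge user responses for one unknown item by majority vote.
    Returns the winning label once enough votes agree, else None."""
    if len(votes) < min_votes:
        return None                     # keep collecting responses
    label, count = Counter(votes).most_common(1)[0]
    if count / len(votes) >= threshold:
        return label                    # high-confidence label emerged
    return None
```

An item voted ["cat", "cat", "cat", "dog"] converges to "cat" at 75% agreement, while ["cat", "dog", "cat"] stays open and keeps accumulating votes.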

🔄
Closed Feedback Loop

Every challenge outcome becomes training data. Retrain the risk model from the dashboard with one click. The model learns which request patterns (headers, UA, timing) correlate with bots vs. humans.

🎯
Flexible Deployment

Run as a reverse proxy, embed the widget standalone, or call the REST API directly. Works on-premise, in the cloud, or air-gapped. Single Docker Compose command to deploy.

📊
Dashboard & Export

Monitor challenge throughput, model accuracy, dataset growth, and consensus convergence. Export labeled datasets for your ML pipeline. Manage datasets, API keys, and proxy rules.
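A typical hand-off from the export to a training pipeline, assuming the export is a list of labeled records. The item_id/label/confidence field names are an assumed shape for illustration, not a documented schema.

```python
import json

def to_jsonl(records, path, min_confidence=0.9):
    """Keep only high-confidence labels from an exported dump (assumed
    shape: item_id/label/confidence) and write one JSON object per line."""
    kept = [r for r in records if r["confidence"] >= min_confidence]
    with open(path, "w") as f:
        for r in kept:
            f.write(json.dumps({"id": r["item_id"], "label": r["label"]}) + "\n")
    return len(kept)

export = [
    {"item_id": "img_001", "label": "cat", "confidence": 0.97},
    {"item_id": "img_002", "label": "dog", "confidence": 0.62},
]
n = to_jsonl(export, "train.jsonl")
```

JSONL is a common interchange format for annotation pipelines, so low-agreement items can be filtered out before they reach training.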

Use Cases

Built for ML Teams.

๐Ÿข
ML Teams

Turn existing web traffic into an annotation workforce. Label images, text, or audio without hiring annotators — every bot check is a labeling task. Scale with your traffic, not your budget.

🎓
Research Labs

Self-hosted annotation on your own infrastructure. Data never leaves your network. Deploy on campus servers with Docker — GDPR and ethics-board friendly.

🔐
On-Premise & Air-Gap

No SaaS dependency. The ML model trains locally from your own traffic data. Works completely offline — risk scoring, challenges, and consensus all self-contained.

🌐
Web Applications

Drop-in reverse proxy that protects login pages, APIs, and forms from automated traffic while building labeled datasets from real user interactions.

How It Works

The Feedback Loop.

01
Score & Intercept

Every inbound request is scored by the ML risk model (or heuristic fallback). Requests above the threshold are intercepted and served a verification challenge. Request features are captured for training.

12 features → risk score 0.0–1.0
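A heuristic fallback along these lines is easy to picture. The signals and weights below are assumptions for illustration; the actual 12 features are not enumerated on this page.

```python
def heuristic_risk(req):
    """Illustrative heuristic fallback scorer, clamped to 0.0-1.0.
    Signals and weights are assumptions, not HiveGuard's actual rules."""
    score = 0.0
    ua = req.get("user_agent", "")
    if not ua or "bot" in ua.lower():
        score += 0.4                              # missing or bot-like UA
    if not req.get("accept_language"):
        score += 0.2                              # real browsers send this
    if req.get("requests_per_minute", 0) > 60:
        score += 0.3                              # suspicious request rate
    if not req.get("cookies_enabled", False):
        score += 0.1
    return min(score, 1.0)

THRESHOLD = 0.5
req = {"user_agent": "curl/8.4.0", "requests_per_minute": 120}
needs_challenge = heuristic_risk(req) >= THRESHOLD
```

Anything scoring at or above the threshold is intercepted and served a challenge; everything else passes through to the upstream app.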
02
Challenge & Label

Each challenge pairs a ground-truth item (known answer, verifies the human) with an unknown item (collects a label). Correct ground-truth answers are recorded as "solved" — the model learns that this request pattern means human.

solved → human · expired → bot
03
Train & Improve

Retrain the model from the dashboard API. The classifier learns which request patterns correlate with bots vs. humans. Unknown item labels converge via consensus voting and are exported for your ML pipeline.

POST /dashboard/api/ml/train
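The one-click retrain can also be scripted against the same endpoint. A sketch using Python's standard library; the guard.example.com host and the X-API-Key header follow the verification example elsewhere on this page and are assumptions.

```python
import urllib.request

# Build the retrain request; the endpoint path comes from the step above.
req = urllib.request.Request(
    "https://guard.example.com/dashboard/api/ml/train",
    method="POST",
    headers={"X-API-Key": "your-api-key"},
)
# urllib.request.urlopen(req)  # uncomment to actually trigger a retrain
```

Wiring this into a cron job retrains on a schedule instead of on demand.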
Live Demo

Every Challenge Type,
Right Here.

Five modalities — the exact same widget your users will see.
Pick one and interact with it live.

🐝 HiveGuard — live challenge
Select all images that match the prompt — a classic reCAPTCHA-style image selection challenge.
Integration

Up in 5 Minutes.

Three lines of config. One Docker command. Annotation pipeline running.

# docker-compose.yml
services:
  hiveguard:
    image: ghcr.io/buiapp/hiveguard-internal:latest
    ports: ["8080:8080"]
    environment:
      UPSTREAM_URL: http://your-app:3000
      OWNER_ORG: acme
      PROTECTED_PATHS: '["/login", "/register", "/api/submit"]'
      DATABASE_URL: postgresql+asyncpg://user:pass@db/hiveguard
      REDIS_URL: redis://redis:6379
    depends_on: [db, redis]
# nginx.conf — point your domain at HiveGuard
upstream hiveguard {
    server localhost:8080;
}

server {
    listen 443 ssl;
    server_name app.example.com;

    location / {
        proxy_pass         http://hiveguard;
        proxy_set_header   Host $host;
        proxy_set_header   X-Real-IP $remote_addr;
    }
}
<!-- 1. Load the widget script once -->
<script src="https://guard.example.com/_hiveguard/widget.js" defer></script>

<!-- 2. Place the mount point -->
<!--    data-modality: pattern | grid | image | text | audio -->
<div
  data-hg-widget
  data-api-key="hg_your_widget_token"
  data-modality="pattern"
  data-callback="onHiveGuardPass"
></div>

<!-- 3. Handle the verified token -->
<script>
function onHiveGuardPass(token) {
  // pass token to your server for verification
  document.getElementById('hg-token').value = token;
  document.getElementById('my-form').submit();
}
</script>
# Verify a submitted challenge token server-side (Flask-style handler)
import httpx
from flask import abort, request

response = httpx.post(
    "https://guard.example.com/api/challenge/verify",
    headers={"X-API-Key": "your-api-key"},
    json={"token": request.form["hg-token"]},
)

if response.json()["success"]:
    # human verified — proceed
    create_account(request)
else:
    # bot or failed — reject / re-challenge
    abort(403)

Full documentation and OpenAPI spec available in the GitHub repository. Need help? Open an issue or start a discussion.

Ready to deploy?

Your Annotators
Are Already There