OSINT Stack Architecture

Revision as of 16:50, 5 March 2026 by Claude (Flesh out full architecture: collectors, anomalywatch hub, entityhub, briefing, orchestration, quirks)

Personal open-source intelligence collection and analysis system running on the VPS. Designed for Hudson Valley regional monitoring with a focus on ICE/immigration enforcement, government contracts, aviation anomalies, and civic infrastructure.

Last updated: 2026-03-05.

Overview

The stack follows a hub-and-spoke architecture:

  • Domain collectors each monitor a narrow slice of open data and push signals to a central hub
  • Anomalywatch aggregates all signals, scores them, detects correlations, and maintains active investigations
  • Entityhub maintains a knowledge graph of entities (people, orgs, aircraft, locations) with cross-app mention tracking
  • Intel Dashboard (intel.tools.ejfox.com) provides a unified ops view

All apps run as Smallweb Deno apps behind a shared reverse proxy on port 7777. Internal requests go to http://localhost:7777 with the header Host: appname.tools.ejfox.com, which selects the target app.

Signal Flow

Domain Collector --> POST /api/report-alert --> Anomalywatch
                                                      |
                                              score + deduplicate
                                                      |
                                           Investigations + Correlations
                                                      |
                                             Intel Dashboard
                                                      |
                                            Briefing (claude-briefing.sh)

Entity resolution runs in parallel:

Collector --> GET /api/resolve?q=NAME&create=true --> Entityhub
          --> POST /api/entity/:id/mention          --> Entityhub

Domain Collectors

Aviation / Maritime / Movement

Skywatch (skywatch.tools.ejfox.com)
ADS-B data via adsb.lol. Monitors Hudson Valley airspace for military flights, ICE Air Operations, surveillance patterns, and unidentified aircraft. Flags: military callsigns, known ICE prefixes (ICE + mixed alphanumeric), low-altitude surveillance loops.
DB: /opt/docker/smallweb/data/skywatch/data/skywatch.db
AIS Collector / Riverwatch
AIS vessel tracking via aisstream.io WebSocket. Monitors Hudson River and surrounding waters. Flags: military vessels, law enforcement, tankers, unusual speeds (>25kt for unknown type, >15kt for known).
PM2 process: ais-collector
DB: ~/scripts/scripts/ais-collector/riverwatch.db
Masintwatch (masintwatch.tools.ejfox.com)
Nine MASINT (measurement and signature intelligence) sources: FAA NOTAMs, USGS streamflow, NRC nuclear events, EPA air quality, NWS/NEXRAD radar, USCG notices to mariners, USGS seismic, NASA FIRMS fire detection, NYISO grid load. Monitors for environmental anomalies near sensitive Hudson Valley sites (Stewart ANG, Indian Point, Chester ICE facility, West Point, Algonquin Pipeline).

Government / Financial

Contractwatch (contractwatch.tools.ejfox.com)
USASpending.gov federal contracts. Tags contracts with geo_region=hudson_valley for 9 NY counties (Orange, Ulster, Dutchess, Sullivan, Greene, Columbia, Putnam, Rockland, Westchester). Flags DHS/CBP/ICE awards.
DB: /opt/docker/smallweb/data/contractwatch/data/contractwatch.db
Key columns: recipient_name, award_amount, geo_region
Donorwatch (donorwatch.tools.ejfox.com)
FEC campaign finance. Monitors watched donors via name-pattern substring matching. Reference implementation for anomalywatch reporting + entityhub resolution patterns.
DB: /opt/docker/smallweb/data/donorwatch/data/donorwatch.db
Key table: watched_donors (name_pattern passed to FEC API)
Filingwatch (filingwatch.tools.ejfox.com)
SEC filings and regulatory documents.
990watch
IRS Form 990 nonprofit data. Monitors watched nonprofits for compensation anomalies, revenue changes, and new filings. Alert dedup: a pre-insert SELECT check prevents repeat alerts for the same finding (the NRA $5.57M and Candid $1.38M findings previously fired on every scan cycle).
Key table: watched_nonprofits

Location / Property

Countywatch (countywatch.tools.ejfox.com)
County government RSS feeds (Orange County Legislature, Planning Board, ZBA), Chester cit-e.net planning docs, local news. ICE keyword monitoring on meeting docs. ~30 feeds.
Key tables: news (column: first_seen), meetings
Egoscan (egoscan.tools.ejfox.com)
Domain/WHOIS monitoring and username enumeration. Alerts on new domain registrations, expiring domains (30 days or fewer), WHOIS changes, username appearances on new platforms.
Scan endpoints: /api/scan/core, /api/scan/quick
Changewatch (changewatch.tools.ejfox.com)
Web page change detection. Monitors government and organizational URLs for content changes. Each watched URL can optionally link to an entityhub entity.

Background Data (Socrata batch feeds)

These sources load bulk public records. Signals are tagged socrata:* or openfda:* and filtered from Top Signals in the intel dashboard and briefing -- they appear in a collapsed "Background Data" section instead.

  • Restaurant inspections (NYC DOH)
  • Remediation sites
  • Building permits
  • Liquor licenses
  • Hate crime reports
  • FDA recalls

Security / Infrastructure

Honeypot
HTTP honeypot logging all unauthorized access attempts. File-based logging (JSONL at data/honeypot-log.jsonl). 4,388+ attacks logged. Anomalywatch integration active.
Log rotation: archives >5MB, keeps last 7 days.

Central Hub: Anomalywatch

/opt/docker/smallweb/data/anomalywatch/main.ts (~2,100 lines)

Signal Scoring

Signals arrive via POST /api/report-alert. The scoring pipeline:

  1. Source weight multiplied by base score
  2. Geographic relevance bonus (hudson_valley tagged signals get a boost)
  3. Recency decay
  4. Final score stored as final_score

Source weight taxonomy:

  • intelligence sources (skywatch, donorwatch, contractwatch, changewatch, filingwatch, egoscan): full weight
  • background_data (socrata:*, openfda:*): low weight, filtered from top signals
  • system (osint-health, honeypot): excluded from briefings
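The four scoring steps and the weight taxonomy above can be sketched as a single pure function. This is a minimal illustration, not the code in main.ts: the numeric weights, the multiplicative geo boost, the exponential half-life, and the names classifySource, finalScore and inBriefing are all assumptions, since the page names the categories but not the numbers.

```typescript
// Hypothetical sketch of the scoring pipeline. Every numeric constant
// below is an assumption; the page only gives the category names.
type SourceClass = "intelligence" | "background_data" | "system";

const CLASS_WEIGHT: Record<SourceClass, number> = {
  intelligence: 1.0,    // "full weight"
  background_data: 0.2, // "low weight" (assumed value)
  system: 0.5,          // not specified on the page (assumed value)
};

const SYSTEM_SOURCES = ["osint-health", "honeypot"];
const GEO_MULTIPLIER = 1.25; // assumed boost for hudson_valley signals
const HALF_LIFE_HOURS = 24;  // assumed recency half-life

interface ScoredInput {
  source: string;
  baseScore: number;
  geoRegion?: string;
  ageHours: number;
}

function classifySource(source: string): SourceClass {
  if (source.startsWith("socrata:") || source.startsWith("openfda:")) {
    return "background_data";
  }
  if (SYSTEM_SOURCES.indexOf(source) !== -1) return "system";
  // Assumption: anything else is treated as an intelligence source.
  return "intelligence";
}

function finalScore(s: ScoredInput): number {
  const weighted = CLASS_WEIGHT[classifySource(s.source)] * s.baseScore; // step 1
  const geo = s.geoRegion === "hudson_valley" ? GEO_MULTIPLIER : 1;      // step 2
  const decay = Math.pow(0.5, s.ageHours / HALF_LIFE_HOURS);             // step 3
  return weighted * geo * decay;                                         // step 4
}

// System sources are excluded from briefings regardless of score.
function inBriefing(source: string): boolean {
  return classifySource(source) !== "system";
}
```

With these assumed constants, a fresh hudson_valley skywatch signal with base score 2 scores 2.5 and falls to 1.25 after 24 hours.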

Deduplication

Two-layer dedup (fires in order):

  1. Exact title + same source within 24h -- deduplicated, no new signal row
  2. Word overlap >= 0.7 within 2h -- deduplicated
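A rough sketch of the two dedup layers follows. The page does not define "word overlap", so Jaccard similarity over lowercase word sets is used here as an assumption, as is applying layer 2 across sources. SeenSignal, wordOverlap and isDuplicate are illustrative names.

```typescript
// Hypothetical sketch of the two-layer dedup check.
interface SeenSignal {
  title: string;
  source: string;
  at: number; // epoch milliseconds
}

const HOUR_MS = 60 * 60 * 1000;

function wordSet(title: string): Set<string> {
  return new Set(
    title.toLowerCase().split(/[^a-z0-9]+/).filter((w) => w.length > 0),
  );
}

// Jaccard similarity: shared words / total distinct words (assumed metric).
function wordOverlap(a: string, b: string): number {
  const wa = wordSet(a);
  const wb = wordSet(b);
  if (wa.size === 0 || wb.size === 0) return 0;
  let shared = 0;
  wa.forEach((w) => {
    if (wb.has(w)) shared++;
  });
  return shared / (wa.size + wb.size - shared);
}

function isDuplicate(incoming: SeenSignal, recent: SeenSignal[]): boolean {
  // Layer 1: exact title + same source within 24h.
  const exact = recent.some((r) =>
    r.source === incoming.source &&
    r.title === incoming.title &&
    Math.abs(incoming.at - r.at) <= 24 * HOUR_MS
  );
  if (exact) return true;
  // Layer 2: word overlap >= 0.7 within 2h (source not compared; assumption).
  return recent.some((r) =>
    Math.abs(incoming.at - r.at) <= 2 * HOUR_MS &&
    wordOverlap(r.title, incoming.title) >= 0.7
  );
}
```

Checking the cheap exact match first means the word-set comparison only runs for signals that survive layer 1.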

Correlations

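Computes Pearson r on z-score vectors rather than raw counts, and only flags a correlation when both series show z > 1.5 in the same window. This prevents batch-loaded Socrata sources from correlating with everything: they grow on predictable daily schedules, so raw-count r is approximately 1.0 while z-score r is approximately 0. (Pearson r is unchanged by a single global standardization of a series, so this distinction presumably depends on the z-scores being computed against a moving baseline rather than over the whole series.)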
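A minimal sketch of this check, assuming z-scores are computed against a trailing window (the page does not say how the baseline is chosen). rollingZ, pearson and flagCorrelation are illustrative names. No threshold on r itself is applied, because the page does not give one.

```typescript
// Hypothetical sketch: z-score each point against the preceding `window`
// points, then correlate the two z-score vectors.
function rollingZ(series: number[], window: number): number[] {
  const out: number[] = [];
  for (let i = window; i < series.length; i++) {
    const base = series.slice(i - window, i);
    const mean = base.reduce((a, b) => a + b, 0) / window;
    const variance =
      base.reduce((a, b) => a + (b - mean) * (b - mean), 0) / window;
    const sd = Math.sqrt(variance);
    // A perfectly flat baseline yields 0 to avoid dividing by zero.
    out.push(sd === 0 ? 0 : (series[i] - mean) / sd);
  }
  return out;
}

function pearson(x: number[], y: number[]): number {
  const n = Math.min(x.length, y.length);
  if (n < 2) return 0;
  let sx = 0;
  let sy = 0;
  for (let i = 0; i < n; i++) {
    sx += x[i];
    sy += y[i];
  }
  const mx = sx / n;
  const my = sy / n;
  let cov = 0;
  let vx = 0;
  let vy = 0;
  for (let i = 0; i < n; i++) {
    cov += (x[i] - mx) * (y[i] - my);
    vx += (x[i] - mx) * (x[i] - mx);
    vy += (y[i] - my) * (y[i] - my);
  }
  const denom = Math.sqrt(vx * vy);
  return denom === 0 ? 0 : cov / denom;
}

function flagCorrelation(
  a: number[],
  b: number[],
  window: number,
  zMin = 1.5,
): { r: number; coSpike: boolean } {
  const za = rollingZ(a, window);
  const zb = rollingZ(b, window);
  // Both series must exceed zMin at the same position.
  const coSpike = za.some((z, i) => z > zMin && zb[i] > zMin);
  return { r: pearson(za, zb), coSpike };
}
```

A steadily growing batch feed produces small, stable rolling z-scores, so it never satisfies the co-spike condition even when its raw counts track another series closely.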

Active Investigations

ID                | Name                            | Signal Types                | Focus
inv-mm9cq6jy-v6qi | Chester ICE Detention Complex   | aviation, government, crime | hudson_valley / Chester NY
inv-mm9cq6n7-i1n3 | Stewart ANG Regional Activity   | aviation                    | hudson_valley / Stewart Airport
inv-mm9cq6or-l8pt | Hudson Valley Federal Contracts | government                  | hudson_valley

New signals that match an investigation's keywords/types/geography are linked to it automatically via rows in the investigation_signals table.
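One way the auto-linking could work is sketched below. The matching rule (signal type and geography must match, plus at least one keyword when the investigation has any) is an assumption, and so are the keyword lists. Only the investigation IDs, signal types and regions come from the table above.

```typescript
// Hypothetical sketch of linking a new signal to investigations.
interface Investigation {
  id: string;
  signalTypes: string[];
  geoRegion: string;
  keywords: string[]; // illustrative values, not from the source
}

interface IncomingSignal {
  id: number;
  type: string;
  geoRegion?: string;
  title: string;
}

const INVESTIGATIONS: Investigation[] = [
  { id: "inv-mm9cq6jy-v6qi", signalTypes: ["aviation", "government", "crime"], geoRegion: "hudson_valley", keywords: ["chester"] },
  { id: "inv-mm9cq6n7-i1n3", signalTypes: ["aviation"], geoRegion: "hudson_valley", keywords: ["stewart"] },
  { id: "inv-mm9cq6or-l8pt", signalTypes: ["government"], geoRegion: "hudson_valley", keywords: [] },
];

function matches(inv: Investigation, sig: IncomingSignal): boolean {
  if (inv.signalTypes.indexOf(sig.type) === -1) return false;
  if (sig.geoRegion !== inv.geoRegion) return false;
  if (inv.keywords.length === 0) return true;
  const title = sig.title.toLowerCase();
  return inv.keywords.some((k) => title.indexOf(k) !== -1);
}

// Returns the rows that would be inserted into investigation_signals.
function linkRows(
  sig: IncomingSignal,
  investigations: Investigation[] = INVESTIGATIONS,
): Array<{ investigation_id: string; signal_id: number }> {
  return investigations
    .filter((inv) => matches(inv, sig))
    .map((inv) => ({ investigation_id: inv.id, signal_id: sig.id }));
}
```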

Key API Endpoints

  • GET /api/signals?hours=24&min_score=2.0&limit=20 -- recent high-score signals
  • GET /api/intelligence/summary -- aggregate stats (signals, anomalies counts)
  • GET /api/investigations -- active investigations
  • GET /api/correlations -- cross-source correlation matrix
  • GET /api/query?geo_region=hudson_valley&since=30d&sources=skywatch,contractwatch -- cross-signal join query
  • POST /api/report-alert -- collector ingestion endpoint
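The two parameterized GET endpoints above can be built with the standard URL API, as sketched here. The function names and option shapes are illustrative. Note that URLSearchParams percent-encodes the comma in the sources list, which the server is assumed to decode.

```typescript
// Hypothetical URL builders for the signals and cross-signal query endpoints.
function signalsUrl(
  base: string,
  opts: { hours: number; minScore: number; limit: number },
): string {
  const u = new URL("/api/signals", base);
  u.searchParams.set("hours", String(opts.hours));
  u.searchParams.set("min_score", String(opts.minScore));
  u.searchParams.set("limit", String(opts.limit));
  return u.toString();
}

function queryUrl(
  base: string,
  opts: { geoRegion: string; since: string; sources: string[] },
): string {
  const u = new URL("/api/query", base);
  u.searchParams.set("geo_region", opts.geoRegion);
  u.searchParams.set("since", opts.since);
  u.searchParams.set("sources", opts.sources.join(","));
  return u.toString();
}
```

For example, signalsUrl("http://localhost:7777", { hours: 24, minScore: 2.5, limit: 20 }) yields http://localhost:7777/api/signals?hours=24&min_score=2.5&limit=20.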

Entity Graph: Entityhub

/opt/docker/smallweb/data/entityhub/main.ts

Central knowledge graph. All collectors resolve named entities before recording signals.

Schema

  • entities -- canonical_name, entity_type (person/organization/aircraft/location/vessel), external IDs
  • entity_external_ids -- Wikidata QIDs, FAA reg numbers, FEC IDs, etc.
  • entity_mentions -- app_name, entity_id, context, timestamp
  • intel_alerts -- elevated alerts derived from mention patterns

Entity Resolution Pattern

Used by donorwatch (reference implementation), 990watch, AIS collector, contractwatch:

GET /api/resolve?q=NAME&create=true&type=TYPE
--> returns entity_id (existing or newly created)

POST /api/entity/:id/mention
--> { app_name, context, metadata }

Auto-enrichment fires on create=true for person/org/location: non-blocking Wikidata lookup with 500ms delay.
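The resolve-then-mention pattern can be sketched as below. This is not the donorwatch code: the response shape ({ entity_id }) is inferred from the "returns entity_id" note, and FetchLike, resolveUrl and resolveAndMention are illustrative names. The fetch function is injected so the sketch can be exercised without a network.

```typescript
// Hypothetical sketch of the entity resolution pattern.
type FetchLike = (
  url: string,
  init?: { method?: string; headers?: Record<string, string>; body?: string },
) => Promise<{ ok: boolean; status: number; json(): Promise<unknown> }>;

interface Mention {
  app_name: string;
  context: string;
  metadata?: Record<string, unknown>;
}

function resolveUrl(base: string, name: string, type: string): string {
  const q = new URLSearchParams({ q: name, create: "true", type });
  return `${base}/api/resolve?${q.toString()}`;
}

async function resolveAndMention(
  fetchFn: FetchLike,
  base: string,
  name: string,
  type: string,
  mention: Mention,
): Promise<string> {
  // Step 1: resolve (or create) the entity.
  const res = await fetchFn(resolveUrl(base, name, type));
  if (!res.ok) throw new Error(`resolve failed: ${res.status}`);
  const body = (await res.json()) as { entity_id?: string | number };
  if (body.entity_id === undefined) throw new Error("no entity_id in response");
  const id = String(body.entity_id);

  // Step 2: record the mention against that entity.
  const posted = await fetchFn(
    `${base}/api/entity/${encodeURIComponent(id)}/mention`,
    {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify(mention),
    },
  );
  if (!posted.ok) throw new Error(`mention failed: ${posted.status}`);
  return id;
}
```

In a collector, the global fetch could be passed as fetchFn, with base set to wherever entityhub is reachable (not stated on this page).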

Statistics (2026-03-02 post-cleanup)

  • 4,797 entities (down from 8,001 after NER noise removal)
  • Types: aircraft, organization, person, location, vessel
  • 3,163 unknown-type and 3,765 concept-type entities removed (NER noise from scrapbook bulk ingestion)

Smallweb Permissions

Entityhub DB owned by debian:debian -- safe to query directly from CLI. Sibling app DBs owned by www-data may create ownership issues on WAL files if queried as debian.

Intel Dashboard

intel.tools.ejfox.com -- /opt/docker/smallweb/data/intel/main.ts

Aggregates data from anomalywatch, skywatch, entityhub, countywatch DBs directly (requires "admin": true in smallweb.json for cross-app DB access).

Layout: header + stats bar + 3-column main (signals | sky activity | investigations+alerts) + 3-column bottom (health | county news | correlations).

Filters applied:

  • Excludes osint-health signals from top signals
  • Excludes socrata:* and openfda:* from top signals
  • Shows background data in separate collapsed section
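The three filters amount to a partition on the source tag. A small sketch, with DashSignal and partitionSignals as illustrative names and the field name final_score taken from the scoring section:

```typescript
// Hypothetical sketch of the dashboard's top-signal filtering.
interface DashSignal {
  source: string;
  title: string;
  final_score: number;
}

function isBackground(source: string): boolean {
  return source.startsWith("socrata:") || source.startsWith("openfda:");
}

function partitionSignals(
  signals: DashSignal[],
): { top: DashSignal[]; background: DashSignal[] } {
  const top = signals
    .filter((s) => s.source !== "osint-health" && !isBackground(s.source))
    .sort((a, b) => b.final_score - a.final_score); // highest score first
  const background = signals.filter((s) => isBackground(s.source));
  return { top, background };
}
```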

Briefing System

/home/debian/scripts/claude-briefing.sh

Generates a markdown briefing that Claude reads at session start. Pulls:

  • Top intelligence signals (min_score >= 2.0, excludes socrata/openfda)
  • Active investigation status
  • Skywatch daily stats + military activity
  • Countywatch recent news
  • Kanban board status
  • Background data section (socrata/openfda, collapsed)

Runs every 2h, output at ~/claude/briefing.md.

Anomaly Alerts

/home/debian/scripts/claude/anomaly-alerts.sh

Fires Discord alerts for high-score signals. Threshold: min_score=2.0. Excludes: redditwatch, osint-health, socrata:*, openfda:*. JSON validity guard prevents parse errors when anomalywatch is restarting.
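The script itself is shell, but the guard-then-filter logic can be illustrated in TypeScript. This sketch assumes the signals endpoint returns a bare JSON array of objects with source, title and final_score; that shape and the function names are assumptions.

```typescript
// Hypothetical sketch of the JSON validity guard and alert filter.
interface AlertSignal {
  source: string;
  title: string;
  final_score: number;
}

const EXCLUDED_SOURCES = ["redditwatch", "osint-health"];
const EXCLUDED_PREFIXES = ["socrata:", "openfda:"];

// Returns null instead of throwing when the body is not usable JSON,
// e.g. an HTML error page served while anomalywatch is restarting.
function parseSignals(raw: string): AlertSignal[] | null {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return null;
  }
  if (!Array.isArray(data)) return null;
  const out: AlertSignal[] = [];
  for (const item of data) {
    if (
      item !== null && typeof item === "object" &&
      typeof item.source === "string" &&
      typeof item.title === "string" &&
      typeof item.final_score === "number"
    ) {
      out.push({ source: item.source, title: item.title, final_score: item.final_score });
    }
  }
  return out;
}

function alertable(signals: AlertSignal[], minScore = 2.0): AlertSignal[] {
  return signals.filter((s) =>
    s.final_score >= minScore &&
    EXCLUDED_SOURCES.indexOf(s.source) === -1 &&
    !EXCLUDED_PREFIXES.some((p) => s.source.startsWith(p))
  );
}
```

A null result means "skip this run" rather than "no signals", so a restart never produces a misleading empty alert.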

Orchestration

osint-scan.sh

/home/debian/scripts/scripts/osint/osint-scan.sh

Tiered scan intervals via cron:

  • 5 min: honeypot, owntracks
  • 15 min: skywatch, masintwatch, countywatch, changewatch
  • 30 min: donorwatch, egoscan, contractwatch
  • 60 min: filingwatch, 990watch, transitwatch
  • 240 min: AIS collector (PM2-managed, runs continuously)
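The tiers can be expressed as a lookup table plus a divisibility check. The real script is driven by cron, so this single-dispatcher form is only an illustration. The app identifiers follow the names used on this page, and ais-collector is the PM2 process name.

```typescript
// Hypothetical sketch: which apps are due at a given minute count.
const SCAN_TIERS: Array<{ everyMin: number; apps: string[] }> = [
  { everyMin: 5, apps: ["honeypot", "owntracks"] },
  { everyMin: 15, apps: ["skywatch", "masintwatch", "countywatch", "changewatch"] },
  { everyMin: 30, apps: ["donorwatch", "egoscan", "contractwatch"] },
  { everyMin: 60, apps: ["filingwatch", "990watch", "transitwatch"] },
  { everyMin: 240, apps: ["ais-collector"] },
];

// `minute` is minutes elapsed since some fixed origin, e.g. midnight.
function appsDue(minute: number): string[] {
  const due: string[] = [];
  for (const tier of SCAN_TIERS) {
    if (minute % tier.everyMin === 0) due.push(...tier.apps);
  }
  return due;
}
```

At minute 15, for instance, only the 5-minute and 15-minute tiers fire.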

Internal routing: curl --max-time N -H "Host: app.tools.ejfox.com" "http://localhost:7777/api/scan?secret=..."

Note: Cloudflare 524s occur if scanning via public hostname -- always use localhost with Host header.
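As an illustration of the routing rule, the helper below assembles the curl argument list for one app. The function name and the choice to URL-encode the secret are assumptions. The default path is /api/scan, and Egoscan's /api/scan/core or /api/scan/quick can be passed instead.

```typescript
// Hypothetical helper that builds curl arguments for an internal scan call.
function scanCurlArgs(
  app: string,
  secret: string,
  maxTimeSec: number,
  path = "/api/scan",
): string[] {
  return [
    "--max-time", String(maxTimeSec),
    // The Host header selects the Smallweb app behind the shared proxy.
    "-H", `Host: ${app}.tools.ejfox.com`,
    // Always localhost, never the public hostname (avoids Cloudflare 524s).
    `http://localhost:7777${path}?secret=${encodeURIComponent(secret)}`,
  ];
}
```

Returning an argument array rather than a single string avoids shell-quoting problems if the secret contains special characters.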

health-check.sh

/home/debian/scripts/scripts/osint/health-check.sh

Checks recent activity in each app DB using rolling windows (last 4h for skywatch, last 8h for riverwatch) rather than calendar-day counts. This prevents false midnight/8am "nothing today" alerts.
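The rolling-window idea is sketched below in TypeScript for illustration (the actual check is a shell script querying each DB). Only the two windows stated above are included, and isStale is an illustrative name.

```typescript
// Hypothetical sketch of a rolling-window staleness check.
const WINDOW_HOURS: Record<string, number> = {
  skywatch: 4,
  riverwatch: 8,
};

function isStale(app: string, lastRowAt: Date, now: Date): boolean {
  const hours = WINDOW_HOURS[app];
  if (hours === undefined) throw new Error(`no window configured for ${app}`);
  return now.getTime() - lastRowAt.getTime() > hours * 60 * 60 * 1000;
}
```

At 00:30, a skywatch row written at 22:00 the previous evening is 2.5 hours old and passes, whereas a calendar-day count would report nothing for "today".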

Weekly Rollup

/home/debian/scripts/weekly-rollup.sh

Sunday 9 AM. Queries anomalywatch /api/intelligence/summary, wiki stats, kanban, entityhub, captures DB. Saves to ~/claude/rollups/YYYY-WWW.md.

Token / Auth Patterns

Anomalywatch bearer token loading order (used by all collectors):

  1. /opt/docker/smallweb/data/_shared/.anomalywatch-token
  2. Fallback: /opt/docker/smallweb/data/anomalywatch/data/.api-token
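The loading order can be sketched with an injected file reader, which keeps the example independent of any runtime. loadToken and authHeader are illustrative names. Treating an empty file as missing is an assumption.

```typescript
// Hypothetical sketch of the token loading order.
const TOKEN_PATHS = [
  "/opt/docker/smallweb/data/_shared/.anomalywatch-token",
  "/opt/docker/smallweb/data/anomalywatch/data/.api-token",
];

// The reader is expected to throw when the file is missing or unreadable.
type ReadFile = (path: string) => string;

function loadToken(read: ReadFile, paths: string[] = TOKEN_PATHS): string {
  for (const p of paths) {
    try {
      const token = read(p).trim();
      if (token.length > 0) return token; // empty file: try the next path
    } catch {
      // Missing or unreadable: fall through to the next path.
    }
  }
  throw new Error("no anomalywatch token found");
}

function authHeader(token: string): Record<string, string> {
  return { Authorization: `Bearer ${token}` };
}
```

In a Deno app, Deno.readTextFileSync could be passed as the reader.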

Collector scan secret: OSINT_SECRET env var (set in smallweb env or .env file).

Known Issues / Quirks

  • localhost in Deno postgres client resolves to ::1 (IPv6) first; postgres-central binds only to 127.0.0.1. Always use explicit "127.0.0.1" in PG_CONFIG.
  • Smallweb app DB files owned by www-data. Running sqlite3 as debian creates -shm/-wal files with wrong ownership, breaking app writes. Check ownership after any direct CLI queries.
  • Cron environment does not load .bashrc or NVM. Scripts that call claude CLI or use env vars from .bashrc must export PATH and env vars explicitly.
  • Cloudflare tunnel adds latency -- anomalywatch HTTPS reports from collectors need at least 30s timeout (not 10s).
  • AIS collector anomalywatch reports must use HTTPS (not localhost) because Deno fetch cannot override Host headers for HTTP.