OSINT Stack Architecture
Personal open-source intelligence collection and analysis system running on the VPS. Designed for Hudson Valley regional monitoring with a focus on ICE/immigration enforcement, government contracts, aviation anomalies, and civic infrastructure.
Last updated: 2026-03-05.
Overview
The stack follows a hub-and-spoke architecture:
- Domain collectors each monitor a narrow slice of open data and push signals to a central hub
- Anomalywatch aggregates all signals, scores them, detects correlations, and maintains active investigations
- Entityhub maintains a knowledge graph of entities (people, orgs, aircraft, locations) with cross-app mention tracking
- Intel Dashboard (intel.tools.ejfox.com) provides a unified ops view
All apps run as Smallweb Deno apps under a shared reverse proxy on port 7777. Internal routing uses http://localhost:7777 + Host: appname.tools.ejfox.com header.
Signal Flow
Domain Collector --> POST /api/report-alert --> Anomalywatch
                                                |
                                                score + deduplicate
                                                |
                                                Investigations + Correlations
                                                |
                                                Intel Dashboard
                                                |
                                                Briefing (claude-briefing.sh)
Entity resolution runs in parallel:
Collector --> GET /api/resolve?q=NAME&create=true --> Entityhub
          --> POST /api/entity/:id/mention --> Entityhub
Domain Collectors
Aviation / Maritime / Movement
- Skywatch (skywatch.tools.ejfox.com)
- FAA ADS-B data via adsb.lol. Monitors Hudson Valley airspace for military flights, ICE Air Operations, surveillance patterns, and unidentified aircraft. Flags: military callsigns, known ICE prefixes (ICE+mixed alphanumeric), low-altitude surveillance loops.
- DB: /opt/docker/smallweb/data/skywatch/data/skywatch.db
- AIS Collector / Riverwatch
- AIS vessel tracking via aisstream.io WebSocket. Monitors Hudson River and surrounding waters. Flags: military vessels, law enforcement, tankers, unusual speeds (>25kt for unknown type, >15kt for known).
- PM2 process: ais-collector
- DB: ~/scripts/scripts/ais-collector/riverwatch.db
- Masintwatch (masintwatch.tools.ejfox.com)
- 9 MASINT (measurement and signature intelligence) sources: FAA NOTAMs, USGS streamflow, NRC nuclear events, EPA air quality, NWS/NEXRAD radar, USCG notices to mariners, USGS seismic, NASA FIRMS fire detection, NYISO grid load. Monitors for environmental anomalies near sensitive HV sites (Stewart ANG, Indian Point, Chester ICE facility, West Point, Algonquin Pipeline).
Government / Financial
- Contractwatch (contractwatch.tools.ejfox.com)
- USASpending.gov federal contracts. Tags contracts with geo_region=hudson_valley for 9 NY counties (Orange, Ulster, Dutchess, Sullivan, Greene, Columbia, Putnam, Rockland, Westchester). Flags DHS/CBP/ICE awards.
- DB: /opt/docker/smallweb/data/contractwatch/data/contractwatch.db
- Key columns: recipient_name, award_amount, geo_region
- Donorwatch (donorwatch.tools.ejfox.com)
- FEC campaign finance. Monitors watched donors via name-pattern substring matching. Reference implementation for anomalywatch reporting + entityhub resolution patterns.
- DB: /opt/docker/smallweb/data/donorwatch/data/donorwatch.db
- Key table: watched_donors (name_pattern passed to FEC API)
- Filingwatch (filingwatch.tools.ejfox.com)
- SEC filings and regulatory documents.
- 990watch
- IRS Form 990 nonprofit data. Monitors watched nonprofits for compensation anomalies, revenue changes, new filings. Alert dedup: pre-insert SELECT check prevents repeat alerts for same finding (NRA $5.57M, Candid $1.38M previously fired every scan cycle).
- Key table: watched_nonprofits
Location / Property
- Countywatch (countywatch.tools.ejfox.com)
- County government RSS feeds (Orange County Legislature, Planning Board, ZBA), Chester cit-e.net planning docs, local news. ICE keyword monitoring on meeting docs. ~30 feeds.
- Key tables: news (column: first_seen), meetings
- Egoscan (egoscan.tools.ejfox.com)
- Domain/WHOIS monitoring and username enumeration. Alerts on new domain registrations, expiring domains (30 days or fewer), WHOIS changes, username appearances on new platforms.
- Scan endpoints: /api/scan/core, /api/scan/quick
- Changewatch (changewatch.tools.ejfox.com)
- Web page change detection. Monitors government and organizational URLs for content changes. Each watched URL can optionally link to an entityhub entity.
Background Data (Socrata batch feeds)
These sources load bulk public records. Signals are tagged socrata:* or openfda:* and filtered from Top Signals in the intel dashboard and briefing -- they appear in a collapsed "Background Data" section instead.
- Restaurant inspections (NYC DOH)
- Remediation sites
- Building permits
- Liquor licenses
- Hate crime reports
- FDA recalls
Security / Infrastructure
- Honeypot
- HTTP honeypot logging all unauthorized access attempts. File-based logging (JSONL at data/honeypot-log.jsonl). 4,388+ attacks logged. Anomalywatch integration active.
- Log rotation: archives >5MB, keeps last 7 days.
Central Hub: Anomalywatch
/opt/docker/smallweb/data/anomalywatch/main.ts (~2,100 lines)
Signal Scoring
Signals arrive via POST /api/report-alert. The scoring pipeline:
- Source weight multiplied by base score
- Geographic relevance bonus (hudson_valley tagged signals get a boost)
- Recency decay
- Final score stored as final_score
Source weight taxonomy:
- intelligence sources (skywatch, donorwatch, contractwatch, changewatch, filingwatch, egoscan): full weight
- background_data (socrata:*, openfda:*): low weight, filtered from top signals
- system (osint-health, honeypot): excluded from briefings
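The scoring steps and weight taxonomy above can be sketched roughly as follows. This is an illustration, not the code in main.ts: the numeric weights, the geographic multiplier, the exponential half-life, and the names categorize and finalScore are all assumptions, since the source names the categories and steps but not the constants or the exact formula. Sources not listed in the taxonomy default to the intelligence category here, which is also an assumption.

```typescript
type SourceCategory = "intelligence" | "background_data" | "system";

// Hypothetical constants: the document does not give actual values.
const SOURCE_WEIGHT: Record<SourceCategory, number> = {
  intelligence: 1.0, // "full weight"
  background_data: 0.2, // "low weight"
  system: 0.5, // weight for system sources is not stated in the document
};
const GEO_BONUS = 1.25; // assumed multiplicative boost for hudson_valley
const HALF_LIFE_HOURS = 24; // assumed recency half-life

const SYSTEM_SOURCES = new Set(["osint-health", "honeypot"]);

function categorize(source: string): SourceCategory {
  if (source.startsWith("socrata:") || source.startsWith("openfda:")) {
    return "background_data";
  }
  if (SYSTEM_SOURCES.has(source)) return "system";
  return "intelligence"; // assumption: anything else is an intelligence source
}

interface IncomingSignal {
  source: string;
  baseScore: number;
  geoRegion?: string;
  ageHours: number;
}

function finalScore(s: IncomingSignal): number {
  // Step 1: source weight multiplied by base score.
  const weighted = s.baseScore * SOURCE_WEIGHT[categorize(s.source)];
  // Step 2: geographic relevance bonus.
  const geo = s.geoRegion === "hudson_valley" ? GEO_BONUS : 1.0;
  // Step 3: recency decay (exponential, halving every HALF_LIFE_HOURS).
  const decay = Math.pow(0.5, s.ageHours / HALF_LIFE_HOURS);
  return weighted * geo * decay;
}
```

With these placeholder constants, a fresh skywatch signal keeps its base score, a hudson_valley tag raises it, and a socrata:* signal with the same base score lands well below the 2.0 threshold used elsewhere in the stack.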
Deduplication
Two-layer dedup (fires in order):
- Exact title + same source within 24h -- deduplicated, no new signal row
- Word overlap >= 0.7 within 2h -- deduplicated
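A minimal sketch of the two layers, evaluated in the stated order. The document does not define "word overlap", so this sketch assumes Jaccard similarity over lowercase word sets. It also assumes the second layer is not restricted to the same source and that every entry in recent is older than the incoming signal. The names StoredSignal, wordOverlap, and isDuplicate are illustrative.

```typescript
interface StoredSignal {
  source: string;
  title: string;
  timestampMs: number;
}

const HOUR_MS = 60 * 60 * 1000;

function wordSet(title: string): Set<string> {
  return new Set(title.toLowerCase().split(/\W+/).filter((w) => w.length > 0));
}

// Assumed metric: Jaccard similarity (shared words / distinct words overall).
function wordOverlap(a: string, b: string): number {
  const wa = wordSet(a);
  const wb = wordSet(b);
  if (wa.size === 0 || wb.size === 0) return 0;
  let shared = 0;
  wa.forEach((w) => {
    if (wb.has(w)) shared++;
  });
  return shared / (wa.size + wb.size - shared);
}

function isDuplicate(incoming: StoredSignal, recent: StoredSignal[]): boolean {
  // Layer 1: exact title from the same source within 24h.
  const exact = recent.some((r) =>
    r.source === incoming.source &&
    r.title === incoming.title &&
    incoming.timestampMs - r.timestampMs <= 24 * HOUR_MS
  );
  if (exact) return true;
  // Layer 2: word overlap >= 0.7 within 2h.
  return recent.some((r) =>
    incoming.timestampMs - r.timestampMs <= 2 * HOUR_MS &&
    wordOverlap(r.title, incoming.title) >= 0.7
  );
}
```

Running the exact-match layer first keeps the cheaper string comparison ahead of the set arithmetic, and the shorter 2h window limits how long a near-match can suppress a reworded follow-up.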
Correlations
Computes Pearson r on z-score vectors (not raw counts). Only flags correlations where both series show z > 1.5 in the same window. This prevents batch-loaded Socrata sources from correlating with everything (they grow on predictable daily schedules, so raw count r is approximately 1.0 but z-score r is approximately 0).
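The idea can be sketched as below. Pearson r is unchanged by standardizing a whole series with one global mean and standard deviation, so this sketch assumes the z-scores are computed per point against a trailing baseline window; that is an assumption, not something the document states. The window length, the reading of "same window" as "same time index", and the names rollingZ, pearson, and correlate are likewise illustrative.

```typescript
// z-score of each point relative to the preceding `window` points.
function rollingZ(series: number[], window: number): number[] {
  const z: number[] = [];
  for (let i = window; i < series.length; i++) {
    const base = series.slice(i - window, i);
    const mean = base.reduce((a, b) => a + b, 0) / window;
    const variance =
      base.reduce((a, b) => a + (b - mean) * (b - mean), 0) / window;
    const sd = Math.sqrt(variance);
    z.push(sd === 0 ? 0 : (series[i] - mean) / sd);
  }
  return z;
}

function pearson(x: number[], y: number[]): number {
  const n = Math.min(x.length, y.length);
  if (n < 2) return 0;
  let mx = 0;
  let my = 0;
  for (let i = 0; i < n; i++) {
    mx += x[i] / n;
    my += y[i] / n;
  }
  let sxy = 0;
  let sxx = 0;
  let syy = 0;
  for (let i = 0; i < n; i++) {
    sxy += (x[i] - mx) * (y[i] - my);
    sxx += (x[i] - mx) * (x[i] - mx);
    syy += (y[i] - my) * (y[i] - my);
  }
  // A (near-)constant series carries no correlation information.
  if (sxx < 1e-12 || syy < 1e-12) return 0;
  return sxy / Math.sqrt(sxx * syy);
}

function correlate(a: number[], b: number[], window = 7, zMin = 1.5) {
  const za = rollingZ(a, window);
  const zb = rollingZ(b, window);
  // Gate: both series must exceed zMin at the same time index.
  const coSpike = za.some((z, i) => z > zMin && zb[i] > zMin);
  return { r: pearson(za, zb), coSpike };
}
```

Two series that both grow by a fixed amount each period have a raw Pearson r of 1, but their rolling z-scores are constant, so the guard in pearson returns 0 for them. That matches the behavior described above for batch-loaded sources.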
Active Investigations
| ID | Name | Signal Types | Focus |
|---|---|---|---|
| inv-mm9cq6jy-v6qi | Chester ICE Detention Complex | aviation, government, crime | hudson_valley / Chester NY |
| inv-mm9cq6n7-i1n3 | Stewart ANG Regional Activity | aviation | hudson_valley / Stewart Airport |
| inv-mm9cq6or-l8pt | Hudson Valley Federal Contracts | government | hudson_valley |
New signals matching investigation keywords/types/geography automatically link to the investigation_signals table.
Key API Endpoints
- GET /api/signals?hours=24&min_score=2.0&limit=20 -- recent high-score signals
- GET /api/intelligence/summary -- aggregate stats (signals, anomalies counts)
- GET /api/investigations -- active investigations
- GET /api/correlations -- cross-source correlation matrix
- GET /api/query?geo_region=hudson_valley&since=30d&sources=skywatch,contractwatch -- cross-signal join query
- POST /api/report-alert -- collector ingestion endpoint
Entity Graph: Entityhub
/opt/docker/smallweb/data/entityhub/main.ts
Central knowledge graph. All collectors resolve named entities before recording signals.
Schema
- entities -- canonical_name, entity_type (person/organization/aircraft/location/vessel), external IDs
- entity_external_ids -- Wikidata QIDs, FAA reg numbers, FEC IDs, etc.
- entity_mentions -- app_name, entity_id, context, timestamp
- intel_alerts -- elevated alerts derived from mention patterns
Entity Resolution Pattern
Used by donorwatch (reference implementation), 990watch, AIS collector, contractwatch:
GET /api/resolve?q=NAME&create=true&type=TYPE
--> returns entity_id (existing or newly created)
POST /api/entity/:id/mention
--> { app_name, context, metadata }
Auto-enrichment fires on create=true for person/org/location: non-blocking Wikidata lookup with 500ms delay.
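A collector-side sketch of the two calls in the resolution pattern. It only builds the request URL and body so the shape is easy to inspect; the helper names, the baseUrl parameter, and the Content-Type header are assumptions, and any authentication Entityhub may require is left out because the document does not describe it.

```typescript
type EntityType =
  | "person"
  | "organization"
  | "aircraft"
  | "location"
  | "vessel";

// Step 1: GET /api/resolve?q=NAME&create=true&type=TYPE
function buildResolveUrl(
  baseUrl: string,
  name: string,
  type: EntityType,
): string {
  const url = new URL("/api/resolve", baseUrl);
  url.searchParams.set("q", name);
  url.searchParams.set("create", "true");
  url.searchParams.set("type", type);
  return url.toString();
}

interface MentionRequest {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

// Step 2: POST /api/entity/:id/mention with { app_name, context, metadata }
function buildMentionRequest(
  baseUrl: string,
  entityId: string,
  appName: string,
  context: string,
  metadata: Record<string, unknown> = {},
): MentionRequest {
  const path = `/api/entity/${encodeURIComponent(entityId)}/mention`;
  return {
    url: new URL(path, baseUrl).toString(),
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ app_name: appName, context, metadata }),
  };
}
```

A collector would fetch the URL from buildResolveUrl, read the returned entity_id, then send the request described by buildMentionRequest. Using URLSearchParams for the query keeps names with spaces or ampersands from corrupting the q parameter.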
Statistics (2026-03-02 post-cleanup)
- 4,797 entities (down from 8,001 after NER noise removal)
- Types: aircraft, organization, person, location, vessel
- 3,163 unknown-type and 3,765 concept-type entities removed (NER noise from scrapbook bulk ingestion)
Smallweb Permissions
Entityhub DB owned by debian:debian -- safe to query directly from CLI. Sibling app DBs owned by www-data may create ownership issues on WAL files if queried as debian.
Intel Dashboard
intel.tools.ejfox.com -- /opt/docker/smallweb/data/intel/main.ts
Aggregates data from anomalywatch, skywatch, entityhub, countywatch DBs directly (requires "admin": true in smallweb.json for cross-app DB access).
Layout: header + stats bar + 3-column main (signals | sky activity | investigations+alerts) + 3-column bottom (health | county news | correlations).
Filters applied:
- Excludes osint-health signals from top signals
- Excludes socrata:* and openfda:* from top signals
- Shows background data in separate collapsed section
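These filters amount to a three-way split on the signal source. A rough sketch, assuming each signal carries a source string; the function names are illustrative, and osint-health signals are simply omitted from both lists here because the document only says they are excluded from top signals.

```typescript
interface SourcedSignal {
  source: string;
}

function isBackgroundData(source: string): boolean {
  return source.startsWith("socrata:") || source.startsWith("openfda:");
}

function partitionForDashboard<T extends SourcedSignal>(
  signals: T[],
): { top: T[]; background: T[] } {
  const top: T[] = [];
  const background: T[] = [];
  for (const s of signals) {
    if (s.source === "osint-health") continue; // excluded from top signals
    if (isBackgroundData(s.source)) {
      background.push(s); // rendered in the collapsed section
    } else {
      top.push(s);
    }
  }
  return { top, background };
}
```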
Briefing System
/home/debian/scripts/claude-briefing.sh
Generates a markdown briefing that Claude reads at session start. Pulls:
- Top intelligence signals (min_score >= 2.0, excludes socrata/openfda)
- Active investigation status
- Skywatch daily stats + military activity
- Countywatch recent news
- Kanban board status
- Background data section (socrata/openfda, collapsed)
Runs every 2h, output at ~/claude/briefing.md.
Anomaly Alerts
/home/debian/scripts/claude/anomaly-alerts.sh
Fires Discord alerts for high-score signals. Threshold: min_score=2.0. Excludes: redditwatch, osint-health, socrata:*, openfda:*. JSON validity guard prevents parse errors when anomalywatch is restarting.
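The script itself is shell, but the guard and filter logic can be illustrated in TypeScript. This sketch assumes the signals endpoint returns a JSON array of objects with source and final_score fields; that response shape, the title field, and the function name selectAlertable are assumptions.

```typescript
interface AlertSignal {
  source: string;
  title: string;
  final_score: number;
}

const EXCLUDED_SOURCES = ["redditwatch", "osint-health"];
const EXCLUDED_PREFIXES = ["socrata:", "openfda:"];

function selectAlertable(rawBody: string, minScore = 2.0): AlertSignal[] {
  let parsed: unknown;
  try {
    parsed = JSON.parse(rawBody);
  } catch {
    // Validity guard: a restarting service may return HTML or an empty body.
    return [];
  }
  if (!Array.isArray(parsed)) return [];
  return parsed.filter((s): s is AlertSignal =>
    typeof s === "object" && s !== null &&
    typeof s.source === "string" &&
    typeof s.final_score === "number" &&
    s.final_score >= minScore &&
    !EXCLUDED_SOURCES.includes(s.source) &&
    !EXCLUDED_PREFIXES.some((p) => s.source.startsWith(p))
  );
}
```

Returning an empty list on a parse failure means a restart produces no alerts for that cycle rather than a crash or a malformed Discord message.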
Orchestration
osint-scan.sh
/home/debian/scripts/scripts/osint/osint-scan.sh
Tiered scan intervals via cron:
- 5 min: honeypot, owntracks
- 15 min: skywatch, masintwatch, countywatch, changewatch
- 30 min: donorwatch, egoscan, contractwatch
- 60 min: filingwatch, 990watch, transitwatch
- 240 min: AIS collector (PM2-managed, runs continuously)
Internal routing: curl --max-time N -H "Host: app.tools.ejfox.com" "http://localhost:7777/api/scan?secret=..."
Note: Cloudflare 524s occur if scanning via public hostname -- always use localhost with Host header.
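For a Deno script that needs to trigger the same scan, the curl invocation can be assembled as an argument list. This is a sketch only: buildScanCurlArgs is a hypothetical helper, the secret is passed in rather than hard-coded, and the real URL may carry parameters that the elided example above does not show.

```typescript
function buildScanCurlArgs(
  app: string,
  secret: string,
  maxTimeSeconds: number,
): string[] {
  // Always target the local reverse proxy; the Host header selects the app.
  const url =
    `http://localhost:7777/api/scan?secret=${encodeURIComponent(secret)}`;
  return [
    "--max-time",
    String(maxTimeSeconds),
    "-H",
    `Host: ${app}.tools.ejfox.com`,
    url,
  ];
}
```

The resulting array can be handed to Deno.Command("curl", { args }), which avoids shell quoting issues with the secret. Delegating to curl also sidesteps the Host header restriction on fetch noted under Known Issues.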
health-check.sh
/home/debian/scripts/scripts/osint/health-check.sh
Checks recent activity in each app DB using rolling windows (last 4h for skywatch, last 8h for riverwatch) rather than calendar-day counts. This prevents false midnight/8am "nothing today" alerts.
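The rolling-window check reduces to comparing the newest row's timestamp with the current time. A minimal sketch, assuming the caller has already read the latest activity timestamp from the app DB; the isStale name and the lookup table are illustrative, and only the two windows stated above are included.

```typescript
const HOUR_IN_MS = 60 * 60 * 1000;

// Windows taken from the description above; other apps are not listed there.
const WINDOW_HOURS: Record<string, number> = {
  skywatch: 4,
  riverwatch: 8,
};

function isStale(app: string, lastActivityMs: number, nowMs: number): boolean {
  const hours = WINDOW_HOURS[app];
  if (hours === undefined) {
    throw new Error(`no health window configured for ${app}`);
  }
  return nowMs - lastActivityMs > hours * HOUR_IN_MS;
}
```

An aircraft seen at 23:30 still counts as recent activity at 00:15 the next day, whereas a calendar-day count would report zero rows for "today" at that moment.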
Weekly Rollup
/home/debian/scripts/weekly-rollup.sh
Sunday 9 AM. Queries anomalywatch /api/intelligence/summary, wiki stats, kanban, entityhub, captures DB. Saves to ~/claude/rollups/YYYY-WWW.md.
Token / Auth Patterns
Anomalywatch bearer token loading order (used by all collectors):
- /opt/docker/smallweb/data/_shared/.anomalywatch-token
- Fallback: /opt/docker/smallweb/data/anomalywatch/data/.api-token
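A sketch of that loading order with the file reader injected, so it can be exercised without touching the filesystem. The loadAnomalywatchToken name is illustrative, and treating an empty file as missing is an assumption.

```typescript
const TOKEN_PATHS = [
  "/opt/docker/smallweb/data/_shared/.anomalywatch-token",
  "/opt/docker/smallweb/data/anomalywatch/data/.api-token", // fallback
];

function loadAnomalywatchToken(readFile: (path: string) => string): string {
  for (const path of TOKEN_PATHS) {
    try {
      const token = readFile(path).trim();
      if (token.length > 0) return token; // assumption: empty file = missing
    } catch {
      // Missing or unreadable file: fall through to the next path.
    }
  }
  throw new Error("no anomalywatch token found in any known location");
}
```

In a Deno app this could be called as loadAnomalywatchToken(Deno.readTextFileSync), with read permission granted for both paths.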
Collector scan secret: OSINT_SECRET env var (set in smallweb env or .env file).
Known Issues / Quirks
- localhost in Deno postgres client resolves to ::1 (IPv6) first; postgres-central binds only to 127.0.0.1. Always use explicit "127.0.0.1" in PG_CONFIG.
- Smallweb app DB files are owned by www-data. Running sqlite3 as debian creates -shm/-wal files with wrong ownership, breaking app writes. Check ownership after any direct CLI queries.
- Cron environment does not load .bashrc or NVM. Scripts that call the claude CLI or use env vars from .bashrc must export PATH and env vars explicitly.
- Cloudflare tunnel adds latency -- anomalywatch HTTPS reports from collectors need at least 30s timeout (not 10s).
- AIS collector anomalywatch reports must use HTTPS (not localhost) because Deno fetch cannot override Host headers for HTTP.