Data Journalism

[Image: Data-Visualization-Sketch.jpg] Where data meets narrative: finding stories in datasets

Data Journalism uses data analysis, statistics, and visualization to uncover and tell stories. It transforms raw information into public knowledge.

Core Workflow

Phase | Activities | Tools
Acquisition | FOIA requests, scraping, APIs, leaks | requests, Selenium, wget
Cleaning | Parsing, deduplication, standardization | pandas, OpenRefine, csvkit
Analysis | Statistics, pattern detection, verification | SQL, R, Python
Visualization | Charts, maps, interactives | D3.js, Observable, Datawrapper
Publication | Story integration, documentation | HTML, static sites, CMS

Data Acquisition

Public Records

FOIA (Freedom of Information Act):

  • Federal agencies must respond within 20 business days
  • State equivalents vary (FOIL in NY, PRA in CA)
  • Track requests with MuckRock
  • Appeal denials - agencies often over-redact

Government Data Portals:

  • data.gov - Federal open data catalog
  • data.census.gov - Census Bureau data
  • Agency, state, and city portals (coverage and quality vary widely)

Web Scraping

# Basic scraping with Python
import requests
from bs4 import BeautifulSoup

url = 'https://example.com/records'  # placeholder target page
response = requests.get(url, timeout=30)
response.raise_for_status()          # fail loudly on HTTP errors
soup = BeautifulSoup(response.text, 'html.parser')
tables = soup.find_all('table')      # pull every table on the page

Ethics: Respect robots.txt, rate limit requests, don't overload servers. Scraping is not hacking, but be responsible.
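These courtesies can be sketched with the standard library alone. The robots.txt content and URLs below are invented for illustration; a real scraper would load the target site's own robots.txt (e.g. with RobotFileParser.set_url and read) rather than parsing a hard-coded string:

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: all agents barred from /private/.
rules = RobotFileParser()
rules.parse("User-agent: *\nDisallow: /private/\n".splitlines())

def polite_allowed(url, delay=1.0):
    """Return True if scraping the URL is permitted; sleep to rate-limit."""
    if not rules.can_fetch("*", url):
        return False
    time.sleep(delay)  # at most one request per `delay` seconds
    return True
```

The sleep-before-fetch pattern is crude but effective; for large jobs, a token-bucket limiter or the site's stated Crawl-delay is kinder to the server.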

APIs

  • AP Elections - Real-time election results
  • ProPublica Congress - Legislative data
  • OpenSecrets - Money in politics
  • Twitter/X - Social media analysis (API now restricted)
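Most of these APIs return results a page at a time. The sketch below shows cursor-style pagination with a stub standing in for the real HTTP call; the next_offset field and the stubbed responses are hypothetical, not any one API's actual schema:

```python
def paginate(fetch_page, limit=2):
    """Follow next_offset cursors until the API reports no more pages."""
    items, offset = [], 0
    while True:
        batch = fetch_page(offset=offset, limit=limit)
        items.extend(batch["results"])
        if batch.get("next_offset") is None:
            break
        offset = batch["next_offset"]
    return items

# Stubbed responses standing in for real HTTP calls (e.g. via requests).
_pages = {
    0: {"results": ["r1", "r2"], "next_offset": 2},
    2: {"results": ["r3"], "next_offset": None},
}

def fake_fetch(offset, limit):
    return _pages[offset]

print(paginate(fake_fetch))  # ['r1', 'r2', 'r3']
```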

Data Cleaning

"80% of data work is cleaning."

Common Problems

Problem | Solution
Inconsistent formats | Standardize dates, addresses, names
Missing values | Document, impute, or exclude
Duplicates | Deduplicate with fuzzy matching
Encoding issues | Force UTF-8, handle special characters
Truncation | Check for cut-off values at round numbers
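Two of these fixes, sketched with the standard library only; the date formats tried and the sample names are assumptions for illustration, not a universal cleaner:

```python
from datetime import datetime
from difflib import SequenceMatcher

def standardize_date(value):
    """Try a few common US formats and emit ISO 8601."""
    for fmt in ("%m/%d/%Y", "%Y-%m-%d", "%B %d, %Y"):
        try:
            return datetime.strptime(value, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {value!r}")

def is_fuzzy_dup(a, b, threshold=0.9):
    """Flag near-duplicates such as 'Acme Corp' vs 'Acme Corp.'."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

print(standardize_date("March 1, 2024"))        # 2024-03-01
print(is_fuzzy_dup("Acme Corp", "Acme Corp."))  # True
```

For serious fuzzy matching at scale, dedicated tools (OpenRefine's clustering, or libraries built for record linkage) outperform a pairwise ratio.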

Tools

OpenRefine: Visual data cleaning with clustering, faceting, and transformation. Essential for messy real-world data.

csvkit: Command-line CSV tools

csvstat data.csv          # Quick statistics
csvcut -c 1,3 data.csv    # Select columns
csvjoin -c id file1.csv file2.csv  # Merge on a shared 'id' column

pandas:

import pandas as pd

df = pd.read_csv('data.csv')
df = df.drop_duplicates()                # exact-match dedup
df['date'] = pd.to_datetime(df['date'])  # normalize the date column
df.to_csv('clean.csv', index=False)

Dataproofer

Automated data quality checks - "spellcheck for data."

  • Open source on GitHub
  • Knight Foundation-funded project
  • Checks for truncation, outliers, and formatting issues

Analysis Methods

Statistical Approaches

Descriptive: Mean, median, mode, standard deviation. What does the data look like?

Comparative: Year-over-year changes, group comparisons. How do things differ?

Geospatial: Location patterns, clustering, proximity analysis. Where is it happening?

Time Series: Trends, seasonality, anomalies. How has it changed?
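The first two approaches, sketched with Python's statistics module on invented monthly complaint counts:

```python
import statistics

# Invented monthly complaint counts for two years.
counts_2023 = [120, 135, 150, 128]
counts_2024 = [160, 172, 181, 155]

# Descriptive: what does the data look like?
mean_2023 = statistics.mean(counts_2023)
median_2023 = statistics.median(counts_2023)
stdev_2024 = statistics.stdev(counts_2024)

# Comparative: year-over-year change in totals.
yoy_pct = (sum(counts_2024) - sum(counts_2023)) / sum(counts_2023) * 100

print(f"2023 mean {mean_2023}, YoY change {yoy_pct:.1f}%")
```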

SQL for Journalism

-- Find top 10 campaign donors
SELECT donor_name, SUM(amount) as total
FROM contributions
WHERE election_cycle = 2024
GROUP BY donor_name
ORDER BY total DESC
LIMIT 10;

-- Join donor data with employer info
SELECT c.donor_name, c.amount, e.industry
FROM contributions c
JOIN employers e ON c.employer_id = e.id
WHERE e.industry = 'Oil & Gas';

R for Statistical Analysis

library(tidyverse)

data %>%
  filter(year >= 2020) %>%
  group_by(state) %>%
  summarize(
    total = sum(amount),
    count = n()
  ) %>%
  arrange(desc(total))

Visualization

Chart Selection

Data Type | Chart
Comparison | Bar chart
Time series | Line chart
Part-to-whole | Pie chart (sparingly), stacked bar
Distribution | Histogram, box plot
Relationship | Scatter plot
Geographic | Choropleth, dot map

D3.js

Primary visualization library for custom interactives. See Dataviz for project examples.

Datawrapper

Fast, publication-ready charts. No code required. Good for daily journalism.

Observable

Reactive notebooks for exploratory analysis and prototyping. See EJ's notebooks.

Verification

Never publish unverified data.

Verification Checklist

  • Source: Where did the data come from? Is it authoritative?
  • Methodology: How was it collected? What are the limitations?
  • Sample: Is it representative? What's excluded?
  • Recency: When was it collected? Is it current?
  • Corroboration: Can you verify key findings independently?
  • Human check: Do the numbers make sense to domain experts?

Red Flags

  • Round numbers that suggest truncation
  • Impossible values (negative ages, future dates)
  • Suspiciously perfect distributions
  • Missing data in politically sensitive areas
  • Data that perfectly confirms your hypothesis
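Several of these red flags can be checked mechanically. A sketch, with hypothetical field names and deliberately planted problems:

```python
from datetime import date

# Hypothetical records with deliberately planted problems.
records = [
    {"name": "A", "age": 34, "filed": date(2023, 5, 1), "amount": 1200},
    {"name": "B", "age": -2, "filed": date(2023, 6, 1), "amount": 5000},
    {"name": "C", "age": 51, "filed": date(2099, 1, 1), "amount": 9000},
]

def red_flags(rows, today=date(2024, 1, 1)):
    """Return (record name, reason) pairs for values that cannot be right."""
    flags = []
    for r in rows:
        if r["age"] < 0:
            flags.append((r["name"], "impossible age"))
        if r["filed"] > today:
            flags.append((r["name"], "future date"))
    # Truncation heuristic: a large share of suspiciously round amounts.
    round_share = sum(r["amount"] % 1000 == 0 for r in rows) / len(rows)
    if round_share > 0.5:
        flags.append(("*", "many round amounts (possible truncation)"))
    return flags

print(red_flags(records))
```

Checks like these catch entry and export errors, not deliberate fraud; flagged rows still need a human look before anything is published.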

Publication

Transparency

  • Publish methodology explaining data sources and analysis
  • Link to raw data when possible
  • Document cleaning and transformation steps
  • Acknowledge limitations and uncertainties

Accessibility

  • Provide alt text for visualizations
  • Include data tables for screen readers
  • Use colorblind-friendly palettes
  • Test on mobile devices

Ethics

Privacy: Aggregate when possible. Be careful with location data, medical records, and minors' information.

Harm: Consider who might be harmed by publication. Data about individuals can enable harassment.

Context: Raw numbers without context mislead. Provide rates, comparisons, and uncertainty.
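A quick sketch of why rates matter, with invented figures: the county with more raw incidents can still have the lower per-capita rate.

```python
# Invented incident and population figures for two counties.
counties = {
    "Alder": {"incidents": 950, "population": 1_200_000},
    "Birch": {"incidents": 430, "population": 310_000},
}

# A rate per 100,000 residents makes the comparison fair.
rates = {
    name: c["incidents"] / c["population"] * 100_000
    for name, c in counties.items()
}

for name, rate in rates.items():
    print(f"{name}: {rate:.1f} per 100k")
```

Here Alder has more than twice Birch's raw incidents, yet Birch's per-capita rate is far higher; the raw counts alone would have told the opposite story.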

Sources: Protect sources who provide leaked data. Use secure channels.

Resources

Books

  • The Functional Art - Alberto Cairo
  • Data Points - Nathan Yau
  • The Truthful Art - Alberto Cairo

Training

Communities

  • NICAR (IRE's computer-assisted reporting conference)
  • #ddj on social media
  • DataJournalism.com
  • Dataviz - Visualization projects and techniques
  • Journalism - Broader journalism practice
  • FOIA - Freedom of Information requests


Journalism & Investigations
Core Journalism · Investigations · Source Handling
Methods FOIA · Data Journalism · Dataviz · Documentation Discipline
Tools ArchiveBox · Scrapbook-core · Personal APIs
Culture Hacker Culture · PGP Communication Guide