Data Journalism
Data Journalism uses data analysis, statistics, and visualization to uncover and tell stories. It transforms raw information into public knowledge.
Core Workflow
| Phase | Activities | Tools |
|---|---|---|
| Acquisition | FOIA requests, scraping, APIs, leaks | requests, Selenium, wget |
| Cleaning | Parsing, deduplication, standardization | pandas, OpenRefine, csvkit |
| Analysis | Statistics, pattern detection, verification | SQL, R, Python |
| Visualization | Charts, maps, interactives | D3.js, Observable, Datawrapper |
| Publication | Story integration, documentation | HTML, static sites, CMS |
Data Acquisition
Public Records
FOIA (Freedom of Information Act):
- Federal agencies must respond within 20 business days
- State equivalents vary (FOIL in NY, PRA in CA)
- Track requests with MuckRock
- Appeal denials - agencies often over-redact
Government Data Portals:
- data.gov - Federal datasets
- Census.gov - Demographics, economic data
- SEC EDGAR - Corporate filings
- FEC - Campaign finance
Web Scraping
```python
# Basic scraping with Python
import requests
from bs4 import BeautifulSoup

url = "https://example.com/records"  # page to scrape
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
tables = soup.find_all("table")  # extract every HTML table on the page
```
Ethics: Respect robots.txt, rate limit requests, don't overload servers. Scraping is not hacking, but be responsible.
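The ethics points above can be sketched in code. This is a minimal example, assuming a hypothetical site policy; it uses only the standard library's `urllib.robotparser` to honor robots.txt rules and read the crawl delay before fetching anything.

```python
import urllib.robotparser

# A sample robots.txt (hypothetical site policy).
ROBOTS = """
User-agent: *
Disallow: /private/
Crawl-delay: 2
""".splitlines()

rp = urllib.robotparser.RobotFileParser()
rp.parse(ROBOTS)

def fetchable(urls, agent="*"):
    # Keep only URLs that robots.txt allows for this user agent.
    return [u for u in urls if rp.can_fetch(agent, u)]

urls = [
    "https://example.com/data/table.html",
    "https://example.com/private/internal.html",
]
fetchable(urls)        # only the /data/ URL survives
rp.crawl_delay("*")    # 2 - sleep at least this long between requests
```

In a real scraper you would `time.sleep()` the crawl delay between each `requests.get()` call.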
APIs
- AP Elections - Real-time election results
- ProPublica Congress - Legislative data
- OpenSecrets - Money in politics
- Twitter/X - Social media analysis (API now restricted)
Data Cleaning
"80% of data work is cleaning."
Common Problems
| Problem | Solution |
|---|---|
| Inconsistent formats | Standardize dates, addresses, names |
| Missing values | Document, impute, or exclude |
| Duplicates | Deduplicate with fuzzy matching |
| Encoding issues | Force UTF-8, handle special characters |
| Truncation | Check for cut-off values at round numbers |
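Fuzzy-matching deduplication, mentioned in the table, can be done without special libraries. A minimal sketch using the standard library's `difflib.SequenceMatcher`; the 0.85 similarity threshold and sample names are assumptions you would tune against your own data.

```python
from difflib import SequenceMatcher

def is_probable_duplicate(a, b, threshold=0.85):
    # Normalize case/whitespace, then compare similarity ratio (assumed cutoff).
    a, b = a.lower().strip(), b.lower().strip()
    return SequenceMatcher(None, a, b).ratio() >= threshold

records = ["Acme Corp.", "ACME Corp", "Acme Corporation", "Apex Industries"]
deduped = []
for name in records:
    # Keep a name only if it isn't a near-match of one already kept.
    if not any(is_probable_duplicate(name, kept) for kept in deduped):
        deduped.append(name)
# deduped -> ["Acme Corp.", "Acme Corporation", "Apex Industries"]
```

Note that "Acme Corporation" survives at this threshold; fuzzy matching always needs a human review pass before you merge records.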
Tools
OpenRefine: Visual data cleaning with clustering, faceting, and transformation. Essential for messy real-world data.
csvkit: Command-line CSV tools
```shell
csvstat data.csv              # Quick statistics
csvcut -c 1,3 data.csv        # Select columns
csvjoin file1.csv file2.csv   # Merge datasets
```
pandas:
```python
import pandas as pd

df = pd.read_csv('data.csv')
df.drop_duplicates(inplace=True)
df['date'] = pd.to_datetime(df['date'])
df.to_csv('clean.csv', index=False)
```
Dataproofer
Automated data quality checks - "spellcheck for data."
- GitHub Repository
- Knight Foundation funded project
- Checks for truncation, outliers, formatting issues
Analysis Methods
Statistical Approaches
Descriptive: Mean, median, mode, standard deviation. What does the data look like?
Comparative: Year-over-year changes, group comparisons. How do things differ?
Geospatial: Location patterns, clustering, proximity analysis. Where is it happening?
Time Series: Trends, seasonality, anomalies. How has it changed?
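The comparative approach above often boils down to year-over-year percent change. A minimal sketch in plain Python, with made-up yearly totals:

```python
# Yearly totals for some metric (hypothetical figures).
totals = {2021: 1200, 2022: 1380, 2023: 1242}

def yoy_change(totals):
    # Percent change from each year to the next, rounded to one decimal.
    years = sorted(totals)
    return {
        y: round((totals[y] - totals[p]) / totals[p] * 100, 1)
        for p, y in zip(years, years[1:])
    }

yoy_change(totals)  # {2022: 15.0, 2023: -10.0}
```

The same calculation is one line in pandas (`Series.pct_change()`), but being able to verify it by hand is part of the verification discipline below.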
SQL for Journalism
```sql
-- Find top 10 campaign donors
SELECT donor_name, SUM(amount) AS total
FROM contributions
WHERE election_cycle = 2024
GROUP BY donor_name
ORDER BY total DESC
LIMIT 10;

-- Join donor data with employer info
SELECT c.donor_name, c.amount, e.industry
FROM contributions c
JOIN employers e ON c.employer_id = e.id
WHERE e.industry = 'Oil & Gas';
```
R for Statistical Analysis
```r
library(tidyverse)

data %>%
  filter(year >= 2020) %>%
  group_by(state) %>%
  summarize(
    total = sum(amount),
    count = n()
  ) %>%
  arrange(desc(total))
```
Visualization
Chart Selection
| Data Type | Chart |
|---|---|
| Comparison | Bar chart |
| Time series | Line chart |
| Part-to-whole | Pie chart (sparingly), stacked bar |
| Distribution | Histogram, box plot |
| Relationship | Scatter plot |
| Geographic | Choropleth, dot map |
D3.js
Primary visualization library for custom interactives. See Dataviz for project examples.
Datawrapper
Fast, publication-ready charts. No code required. Good for daily journalism.
Observable
Reactive notebooks for exploratory analysis and prototyping. EJ's notebooks
Verification
Never publish unverified data.
Verification Checklist
- Source: Where did the data come from? Is it authoritative?
- Methodology: How was it collected? What are the limitations?
- Sample: Is it representative? What's excluded?
- Recency: When was it collected? Is it current?
- Corroboration: Can you verify key findings independently?
- Human check: Do the numbers make sense to domain experts?
Red Flags
- Round numbers that suggest truncation
- Impossible values (negative ages, future dates)
- Suspiciously perfect distributions
- Missing data in politically sensitive areas
- Data that perfectly confirms your hypothesis
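Several of these red flags can be screened for automatically before a human review. A minimal sketch, with hypothetical column names (`age`, `filed`, `amount`) and rules you would adapt to your dataset:

```python
from datetime import date

def sanity_flags(rows, today=date(2024, 6, 1)):
    # Flag impossible values and round-number truncation (assumed rules).
    flags = []
    for i, row in enumerate(rows):
        if row["age"] < 0 or row["age"] > 120:
            flags.append((i, "impossible age"))
        if row["filed"] > today:
            flags.append((i, "future date"))
        if row["amount"] >= 1000 and row["amount"] % 1000 == 0:
            flags.append((i, "round amount, possible truncation"))
    return flags

rows = [
    {"age": 34, "filed": date(2024, 1, 5), "amount": 2500},
    {"age": -2, "filed": date(2025, 3, 1), "amount": 10000},
]
sanity_flags(rows)  # three flags, all on the second row
```

Dataproofer (above) automates checks in this spirit; a few lines of your own code let you encode dataset-specific rules it can't know about.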
Publication
Transparency
- Publish methodology explaining data sources and analysis
- Link to raw data when possible
- Document cleaning and transformation steps
- Acknowledge limitations and uncertainties
Accessibility
- Provide alt text for visualizations
- Include data tables for screen readers
- Use colorblind-friendly palettes
- Test on mobile devices
Ethics
Privacy: Aggregate when possible. Be careful with location data, medical records, and minors' information.
Harm: Consider who might be harmed by publication. Data about individuals can enable harassment.
Context: Raw numbers without context mislead. Provide rates, comparisons, and uncertainty.
Sources: Protect sources who provide leaked data. Use secure channels.
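The "Context" point above is concrete: raw counts favor bigger places, so report rates. A minimal sketch with hypothetical figures:

```python
def per_capita_rate(events, population, per=100_000):
    # Events per 100,000 residents, the standard comparison unit.
    return round(events / population * per, 1)

# Hypothetical figures: the city has more incidents but a lower rate.
city_rate = per_capita_rate(900, 1_000_000)   # 90.0
town_rate = per_capita_rate(30, 20_000)       # 150.0
```

Reporting "900 incidents vs. 30" would invert the real story the rates tell.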
Resources
Books
- The Functional Art - Alberto Cairo
- Data Points - Nathan Yau
- The Truthful Art - Alberto Cairo
Training
Communities
- NICAR (IRE's computer-assisted reporting conference)
- #ddj on social media
- DataJournalism.com
Related
- Dataviz - Visualization projects and techniques
- Journalism - Broader journalism practice
- FOIA - Freedom of Information requests
| Journalism & Investigations | |
|---|---|
| Core | Journalism · Investigations · Source Handling |
| Methods | FOIA · Data Journalism · Dataviz · Documentation Discipline |
| Tools | ArchiveBox · Scrapbook-core · Personal APIs |
| Culture | Hacker Culture · PGP Communication Guide |