= Data Journalism =

[[File:Data-Visualization-Sketch.jpg|thumb|right|280px|Where data meets narrative: finding stories in datasets]]

'''Data Journalism''' uses data analysis, statistics, and visualization to uncover and tell stories. It transforms raw information into public knowledge.

== Core Workflow ==

{| class="wikitable"
! Phase !! Activities !! Tools
|-
| '''Acquisition''' || FOIA requests, scraping, APIs, leaks || requests, Selenium, wget
|-
| '''Cleaning''' || Parsing, deduplication, standardization || pandas, OpenRefine, csvkit
|-
| '''Analysis''' || Statistics, pattern detection, verification || SQL, R, Python
|-
| '''Visualization''' || Charts, maps, interactives || D3.js, Observable, Datawrapper
|-
| '''Publication''' || Story integration, documentation || HTML, static sites, CMS
|}
== Data Acquisition ==

=== Public Records ===

'''FOIA (Freedom of Information Act):'''
* Federal agencies must respond within 20 business days
* State equivalents vary (FOIL in NY, PRA in CA)
* Track requests with [https://www.muckrock.com/ MuckRock]
* Appeal denials - agencies often over-redact

'''Government Data Portals:'''
* [https://data.gov/ data.gov] - Federal datasets
* [https://www.census.gov/data.html Census.gov] - Demographics, economic data
* [https://www.sec.gov/edgar/searchedgar/companysearch.html SEC EDGAR] - Corporate filings
* [https://www.fec.gov/data/ FEC] - Campaign finance
=== Web Scraping ===

<pre>
# Basic scraping with Python: fetch a page and pull out its HTML tables
import requests
from bs4 import BeautifulSoup

url = 'https://example.com/records'  # page to scrape

response = requests.get(url, timeout=30)
response.raise_for_status()  # stop early on HTTP errors

soup = BeautifulSoup(response.text, 'html.parser')
tables = soup.find_all('table')  # every <table> element on the page
</pre>
'''Ethics:''' Respect robots.txt, rate-limit your requests, and don't overload servers. Scraping is not hacking, but be responsible; a minimal sketch of those habits follows.
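This sketch uses only the standard library's urllib.robotparser plus requests; the site, paths, and two-second delay are illustrative assumptions, not recommendations for any particular target.

<pre>
# Polite scraping: honor robots.txt and rate-limit requests.
# The base URL, paths, and 2-second delay are illustrative assumptions.
import time
import urllib.robotparser

import requests

BASE = 'https://example.gov'
pages = [f'{BASE}/reports?page={i}' for i in range(1, 6)]

robots = urllib.robotparser.RobotFileParser(f'{BASE}/robots.txt')
robots.read()

for url in pages:
    if not robots.can_fetch('*', url):   # skip anything robots.txt disallows
        continue
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    # ... parse response.text here ...
    time.sleep(2)                        # one request every two seconds
</pre>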
=== APIs ===

* '''AP Elections''' - Real-time election results
* '''ProPublica Congress''' - Legislative data
* '''OpenSecrets''' - Money in politics
* '''Twitter/X''' - Social media analysis (API now restricted)
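Most of these APIs follow the same general pattern: keyed HTTPS requests returning paginated JSON. A generic sketch of that loop, where the endpoint, parameters, and response fields are hypothetical placeholders rather than any specific provider's schema:

<pre>
# Generic paginated API fetch; endpoint and field names are hypothetical.
import requests

API_URL = 'https://api.example.org/v1/records'   # placeholder endpoint
params = {'cycle': 2024, 'page': 1}              # placeholder query parameters
headers = {'X-API-Key': 'YOUR_KEY'}              # most civic APIs require a key

rows = []
while True:
    resp = requests.get(API_URL, params=params, headers=headers, timeout=30)
    resp.raise_for_status()
    payload = resp.json()
    rows.extend(payload['results'])              # hypothetical response field
    if not payload.get('next_page'):             # hypothetical pagination flag
        break
    params['page'] += 1
</pre>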
== Data Cleaning ==

''"80% of data work is cleaning."''

=== Common Problems ===

{| class="wikitable"
! Problem !! Solution
|-
| Inconsistent formats || Standardize dates, addresses, names
|-
| Missing values || Document, impute, or exclude
|-
| Duplicates || Deduplicate with fuzzy matching (see the sketch below)
|-
| Encoding issues || Force UTF-8, handle special characters
|-
| Truncation || Check for cut-off values at round numbers
|}
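Fuzzy matching can be prototyped with nothing but the standard library. A difflib sketch follows; the 0.9 similarity threshold is an assumption, and every merge should still be reviewed by hand.

<pre>
# Fuzzy deduplication with the standard library's difflib.
# The 0.9 threshold is an assumption; always review matches manually.
from difflib import SequenceMatcher

names = ['Acme Corp.', 'ACME Corporation', 'Beta LLC', 'Acme Corp']

def similar(a, b):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

unique = []
for name in names:
    if not any(similar(name, kept) > 0.9 for kept in unique):
        unique.append(name)

print(unique)   # ['Acme Corp.', 'ACME Corporation', 'Beta LLC']
</pre>

Dedicated record-linkage tools scale better, but this is enough to see the idea.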
=== Tools ===

'''OpenRefine:''' Visual data cleaning with clustering, faceting, and transformation. Essential for messy real-world data.

'''csvkit:''' Command-line CSV tools
<pre>
csvstat data.csv              # Quick statistics
csvcut -c 1,3 data.csv        # Select columns
csvjoin file1.csv file2.csv   # Merge datasets
</pre>
'''pandas:'''
<pre>
import pandas as pd

df = pd.read_csv('data.csv')
df.drop_duplicates(inplace=True)          # remove exact duplicate rows
df['date'] = pd.to_datetime(df['date'])   # parse date strings into datetimes
df.to_csv('clean.csv', index=False)       # write out without the pandas index
</pre>
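The Common Problems table above says to document missing values before imputing or excluding them. A quick pandas audit of that, which generalizes to any CSV:

<pre>
# Audit missing values per column before deciding what to do with them.
import pandas as pd

df = pd.read_csv('data.csv')

missing = df.isna().sum().to_frame('missing')
missing['pct'] = (missing['missing'] / len(df) * 100).round(1)
print(missing.sort_values('pct', ascending=False))
</pre>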
=== Dataproofer ===

Automated data quality checks - "spellcheck for data."
* [https://github.com/dataproofer/Dataproofer GitHub Repository]
* Knight Foundation-funded project
* Checks for truncation, outliers, and formatting issues (see the sketch below)
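Checks of that kind are also easy to prototype by hand. A pandas sketch of a truncation flag and a 3-sigma outlier flag; the thresholds are assumptions, and this is not Dataproofer's actual implementation:

<pre>
# Hand-rolled versions of two Dataproofer-style checks.
# The 1% pile-up and 3-sigma thresholds are assumptions.
import pandas as pd

df = pd.read_csv('data.csv')
col = df['amount']                      # illustrative numeric column

# Truncation: many values piling up exactly at the column maximum.
at_max = (col == col.max()).mean()
if at_max > 0.01:
    print(f'Possible truncation: {at_max:.1%} of rows at {col.max()}')

# Outliers: values more than 3 standard deviations from the mean.
z = (col - col.mean()) / col.std()
print(df[z.abs() > 3])
</pre>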
== Analysis Methods ==

=== Statistical Approaches ===

'''Descriptive:''' Mean, median, mode, standard deviation. What does the data look like?

'''Comparative:''' Year-over-year changes, group comparisons. How do things differ?

'''Geospatial:''' Location patterns, clustering, proximity analysis. Where is it happening?

'''Time Series:''' Trends, seasonality, anomalies. How has it changed?
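In pandas, the descriptive and comparative passes are near one-liners; the file and column names here are placeholders:

<pre>
# Descriptive pass: what does the data look like?
import pandas as pd

df = pd.read_csv('contributions.csv')   # placeholder file
print(df['amount'].describe())          # count, mean, std, min, quartiles, max
print(df['amount'].median())

# Comparative pass: how do groups differ?
print(df.groupby('state')['amount'].sum().sort_values(ascending=False))
</pre>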
=== SQL for Journalism ===

<pre>
-- Find the top 10 campaign donors
SELECT donor_name, SUM(amount) AS total
FROM contributions
WHERE election_cycle = 2024
GROUP BY donor_name
ORDER BY total DESC
LIMIT 10;

-- Join donor data with employer info
SELECT c.donor_name, c.amount, e.industry
FROM contributions c
JOIN employers e ON c.employer_id = e.id
WHERE e.industry = 'Oil & Gas';
</pre>
=== R for Statistical Analysis ===

<pre>
library(tidyverse)

data %>%
  filter(year >= 2020) %>%
  group_by(state) %>%
  summarize(
    total = sum(amount),
    count = n()
  ) %>%
  arrange(desc(total))
</pre>
== Visualization ==

=== Chart Selection ===

{| class="wikitable"
! Data Type !! Chart
|-
| Comparison || Bar chart
|-
| Time series || Line chart
|-
| Part-to-whole || Pie chart (sparingly), stacked bar
|-
| Distribution || Histogram, box plot
|-
| Relationship || Scatter plot
|-
| Geographic || Choropleth, dot map
|}
=== D3.js ===
Primary visualization library for custom interactives. See [[Dataviz]] for project examples.

=== Datawrapper ===
Fast, publication-ready charts. No code required. Good for daily journalism.

=== Observable ===
Reactive notebooks for exploratory analysis and prototyping. [https://observablehq.com/@ejfox EJ's notebooks]
== Verification ==

'''Never publish unverified data.'''

=== Verification Checklist ===

* '''Source:''' Where did the data come from? Is it authoritative?
* '''Methodology:''' How was it collected? What are the limitations?
* '''Sample:''' Is it representative? What's excluded?
* '''Recency:''' When was it collected? Is it current?
* '''Corroboration:''' Can you verify key findings independently?
* '''Human check:''' Do the numbers make sense to domain experts?
=== Red Flags ===

* Round numbers that suggest truncation
* Impossible values (negative ages, future dates); see the sanity-check sketch below
* Suspiciously perfect distributions
* Missing data in politically sensitive areas
* Data that perfectly confirms your hypothesis
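Impossible values in particular are cheap to screen for programmatically. A sanity-check sketch; the column names and plausibility bounds are assumptions about a hypothetical dataset:

<pre>
# Screen for impossible values before analysis.
# The columns ('age', 'date') and the 0-120 age bounds are assumptions.
import pandas as pd

df = pd.read_csv('data.csv', parse_dates=['date'])

suspect = df[
    (df['age'] < 0)
    | (df['age'] > 120)
    | (df['date'] > pd.Timestamp.now())
]
print(f'{len(suspect)} suspect rows')
print(suspect)
</pre>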
== Publication ==

=== Transparency ===

* Publish methodology explaining data sources and analysis
* Link to raw data when possible
* Document cleaning and transformation steps (see the logging sketch below)
* Acknowledge limitations and uncertainties
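One lightweight way to document those steps is to log the row count after each transformation, so the published methodology can account for every dropped record. The cleaning steps shown are placeholders:

<pre>
# Log how many rows each cleaning step removes.
import pandas as pd

df = pd.read_csv('data.csv')
log = [('loaded', len(df))]

df = df.drop_duplicates()
log.append(('dropped exact duplicates', len(df)))

df = df.dropna(subset=['amount'])       # placeholder required column
log.append(('dropped rows missing amount', len(df)))

for step, rows in log:
    print(f'{step}: {rows} rows')
</pre>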
=== Accessibility ===

* Provide alt text for visualizations
* Include data tables for screen readers
* Use colorblind-friendly palettes
* Test on mobile devices
== Ethics ==

'''Privacy:''' Aggregate when possible. Be careful with location data, medical records, and minors' information.
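A common aggregation safeguard is small-cell suppression: blank out any group small enough that individuals could be re-identified. In this sketch, the threshold of 5 is a widely used convention rather than a rule, and the columns are placeholders:

<pre>
# Aggregate, then suppress small cells before publishing.
# The threshold of 5 is a common convention; columns are placeholders.
import pandas as pd

df = pd.read_csv('incidents.csv')
counts = df.groupby('zip_code').size().rename('incidents').reset_index()

counts.loc[counts['incidents'] < 5, 'incidents'] = None   # suppress small cells
counts.to_csv('publishable.csv', index=False)
</pre>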
'''Harm:''' Consider who might be harmed by publication. Data about individuals can enable harassment.

'''Context:''' Raw numbers without context mislead. Provide rates, comparisons, and uncertainty.

'''Sources:''' Protect sources who provide leaked data. Use secure channels.
== Resources ==

=== Books ===
* ''The Functional Art'' - Alberto Cairo
* ''Data Points'' - Nathan Yau
* ''The Truthful Art'' - Alberto Cairo

=== Training ===
* [https://www.ire.org/ IRE (Investigative Reporters and Editors)]
* [https://www.propublica.org/nerds ProPublica Data Institute]
* [https://knightcenter.utexas.edu/ Knight Center MOOCs]

=== Communities ===
* NICAR (IRE's computer-assisted reporting conference)
* #ddj on social media
* [https://datajournalism.com/ DataJournalism.com]

== Related ==
* [[Dataviz]] - Visualization projects and techniques
* [[Journalism]] - Broader journalism practice
* [[FOIA]] - Freedom of Information requests

[[Category:Journalism]]
[[Category:Data Visualization]]
[[Category:Investigation]]
{{Navbox Journalism}}