duckspending_connection()
defaults to attaching only the newest snapshot, which is what most
analyses want. When you need to compare months — to track when a grant
program changed, to watch obligations accrue over a quarter, or to
validate that historical years haven’t shifted — you can attach several
at once. This vignette shows the patterns and the caveats.
library(duckspending)
library(dplyr)
# Attach every snapshot listed in /snapshots.txt:
con <- duckspending_connection(snapshots = "all")
# Or a specific subset:
con <- duckspending_connection(snapshots = c("20260206", "20260306", "20260506"))Each attached date becomes a top-level schema named for the date
string, plus a latest schema that aliases the newest.
duckspending_describe() reports them all:
duckspending_describe() |>
distinct(schema)
#> # A tibble: 4 × 1
#> schema
#> <chr>
#> 1 20260206
#> 2 20260306
#> 3 20260506
#> 4 latestduckspending_tbl() accepts a snapshot =
argument to address any attached snapshot:
duckspending_tbl("subaward_search") # latest
duckspending_tbl("subaward_search", snapshot = "20260206") # FebruaryCost: attaching each snapshot downloads its catalog file (~10 MB) on first use. The package caches catalogs on disk, so the second R session that attaches the same snapshots is nearly free.
dplyr doesn’t know how to span schemas in a single
pipeline, so for diffs the simplest path is two collect()s
plus a base-R or dplyr join in memory:
feb <- duckspending_tbl("subaward_search", snapshot = "20260206") |>
filter(sub_fiscal_year == 2024L) |>
group_by(awarding_agency_id) |>
summarise(total_feb = sum(subaward_amount, na.rm = TRUE), n_feb = n()) |>
collect()
may <- duckspending_tbl("subaward_search", snapshot = "20260506") |>
filter(sub_fiscal_year == 2024L) |>
group_by(awarding_agency_id) |>
summarise(total_may = sum(subaward_amount, na.rm = TRUE), n_may = n()) |>
collect()
diff <- full_join(feb, may, by = "awarding_agency_id") |>
mutate(delta = coalesce(total_may, 0) - coalesce(total_feb, 0)) |>
arrange(desc(abs(delta)))
diffIf you want the diff to stay inside DuckDB (faster, less memory),
drop to SQL — see vignette("sql-escape-hatch") for the
pattern.
Each monthly snapshot is a fresh re-export of the entire USAspending PostgreSQL database, not a “what changed since last month” diff. Rows can change retroactively when agencies amend prior filings:
action_date itself gets corrected (sometimes by years,
when an agency migrates from a legacy system).The implication: the row-count delta between snapshots is not
“new activity that month.” If your question is “what activity
happened in March”, filter the latest snapshot by
action_date / fiscal_year and trust that. Use
cross-snapshot diffs to answer the different question — what’s
different about the record of past activity now versus
then?
For row-level diffs, use a stable transaction key.
unique_award_key identifies awards across snapshots;
published_fabs_id (assistance) and
detached_award_procurement_id (procurement) identify
specific actions. fain and piid are
not reliable diff keys — agencies do correct them.
How many FY2023 assistance transactions appeared between February and May? FY2023 was closed by Feb in the everyday sense, but late filings backfill.
feb_ids <- duckspending_tbl("source_assistance_transaction",
snapshot = "20260206") |>
filter(fiscal_year == 2023L) |>
select(published_fabs_id) |>
collect()
may_fy23 <- duckspending_tbl("source_assistance_transaction",
snapshot = "20260506") |>
filter(fiscal_year == 2023L) |>
select(published_fabs_id, awarding_agency_code,
federal_action_obligation, cfda_number) |>
collect()
backfilled <- anti_join(may_fy23, feb_ids, by = "published_fabs_id")
backfilled |>
group_by(awarding_agency_code) |>
summarise(n = n(),
total_obligation = sum(federal_action_obligation, na.rm = TRUE),
.groups = "drop") |>
arrange(desc(n))A period_of_performance_curr that moves later is
typically a no-cost extension. Earlier = early termination or
close-out.
end_dates <- function(snap) {
duckspending_tbl("source_assistance_transaction", snapshot = snap) |>
filter(fiscal_year >= 2020L) |>
group_by(unique_award_key) |>
summarise(end_date = max(period_of_performance_curr, na.rm = TRUE),
cfda_number = max(cfda_number, na.rm = TRUE),
.groups = "drop") |>
collect()
}
end_changes <- inner_join(
end_dates("20260206"),
end_dates("20260506"),
by = "unique_award_key",
suffix = c("_feb", "_may")
) |>
mutate(delta_days = as.integer(as.Date(end_date_may) - as.Date(end_date_feb))) |>
filter(!is.na(delta_days), delta_days != 0)
# Top extensions
end_changes |> arrange(desc(delta_days)) |> head(10)
# CFDA programs with the most revision activity
end_changes |>
group_by(cfda_number_may) |>
summarise(n_awards = n(),
median_shift_days = median(delta_days, na.rm = TRUE),
.groups = "drop") |>
arrange(desc(n_awards)) |>
head(20)The same pattern works for federal_action_obligation
(find awards whose published obligation amount was revised) and
action_date (find transactions whose posting date got
corrected).
Closed fiscal years should be approximately stable across snapshots. A quick sanity check:
feb_rows <- duckspending_tbl("appropriation_account_balances",
snapshot = "20260206") |>
count() |> pull(n)
may_rows <- duckspending_tbl("appropriation_account_balances") |> # latest
count() |> pull(n)
c(feb = feb_rows, may = may_rows, delta = may_rows - feb_rows)Small deltas are expected — agencies submit revised File A periodically. Large deltas in pre-FY2024 totals warrant a closer look at the upstream dump.