Querying across snapshots

duckspending_connection() defaults to attaching only the newest snapshot, which is what most analyses want. When you need to compare months — to track when a grant program changed, to watch obligations accrue over a quarter, or to validate that historical years haven’t shifted — you can attach several at once. This vignette shows the patterns and the caveats.

Attaching multiple snapshots

library(duckspending)
library(dplyr)

# Attach every snapshot listed in /snapshots.txt:
con <- duckspending_connection(snapshots = "all")

# Or a specific subset:
con <- duckspending_connection(snapshots = c("20260206", "20260306", "20260506"))

Each attached date becomes a top-level schema named for the date string, plus a latest schema that aliases the newest. duckspending_describe() reports them all:

duckspending_describe() |>
  distinct(schema)
#> # A tibble: 4 × 1
#>   schema
#>   <chr>
#> 1 20260206
#> 2 20260306
#> 3 20260506
#> 4 latest

duckspending_tbl() accepts a snapshot = argument to address any attached snapshot:

duckspending_tbl("subaward_search")                          # latest
duckspending_tbl("subaward_search", snapshot = "20260206")   # February

Cost: attaching each snapshot downloads its catalog file (~10 MB) on first use. The package caches catalogs on disk, so the second R session that attaches the same snapshots is nearly free.

Adding / removing snapshots at runtime

duckspending_attached(con)             # list currently-attached snapshots
duckspending_attach("20260306", con)   # attach mid-session
duckspending_detach("20260306", con)   # detach (catalog stays cached)

Cross-snapshot diff

dplyr doesn’t know how to span schemas in a single pipeline, so for diffs the simplest path is two collect()s plus a base-R or dplyr join in memory:

feb <- duckspending_tbl("subaward_search", snapshot = "20260206") |>
  filter(sub_fiscal_year == 2024L) |>
  group_by(awarding_agency_id) |>
  summarise(total_feb = sum(subaward_amount, na.rm = TRUE), n_feb = n()) |>
  collect()

may <- duckspending_tbl("subaward_search", snapshot = "20260506") |>
  filter(sub_fiscal_year == 2024L) |>
  group_by(awarding_agency_id) |>
  summarise(total_may = sum(subaward_amount, na.rm = TRUE), n_may = n()) |>
  collect()

diff <- full_join(feb, may, by = "awarding_agency_id") |>
  mutate(delta = coalesce(total_may, 0) - coalesce(total_feb, 0)) |>
  arrange(desc(abs(delta)))
diff

If you want the diff to stay inside DuckDB (faster, less memory), drop to SQL — see vignette("sql-escape-hatch") for the pattern.

Caveat: snapshots are not strictly additive

Each monthly snapshot is a fresh re-export of the entire USAspending PostgreSQL database, not a “what changed since last month” diff. Rows can change retroactively when agencies amend prior filings:

  • An action obligation can be revised — sometimes years after the fact.
  • A FAIN or PIID can be retroactively corrected, moving rows between recipients.
  • Period-of-performance dates shift (no-cost extensions are common).
  • action_date itself gets corrected (sometimes by years, when an agency migrates from a legacy system).
  • Historical fiscal years that were thought “closed” sometimes pick up late-filed records.

The implication: the row-count delta between snapshots is not “new activity that month.” If your question is “what activity happened in March”, filter the latest snapshot by action_date / fiscal_year and trust that. Use cross-snapshot diffs to answer the different question — what’s different about the record of past activity now versus then?

Worked examples: detecting specific kinds of revision

For row-level diffs, use a stable transaction key. unique_award_key identifies awards across snapshots; published_fabs_id (assistance) and detached_award_procurement_id (procurement) identify specific actions. fain and piid are not reliable diff keys — agencies do correct them.

Backfills into a closed fiscal year

How many FY2023 assistance transactions appeared between February and May? FY2023 was closed by Feb in the everyday sense, but late filings backfill.

feb_ids <- duckspending_tbl("source_assistance_transaction",
                            snapshot = "20260206") |>
  filter(fiscal_year == 2023L) |>
  select(published_fabs_id) |>
  collect()

may_fy23 <- duckspending_tbl("source_assistance_transaction",
                             snapshot = "20260506") |>
  filter(fiscal_year == 2023L) |>
  select(published_fabs_id, awarding_agency_code,
         federal_action_obligation, cfda_number) |>
  collect()

backfilled <- anti_join(may_fy23, feb_ids, by = "published_fabs_id")

backfilled |>
  group_by(awarding_agency_code) |>
  summarise(n = n(),
            total_obligation = sum(federal_action_obligation, na.rm = TRUE),
            .groups = "drop") |>
  arrange(desc(n))

End-date revisions (no-cost extensions, early terminations)

A period_of_performance_curr that moves later is typically a no-cost extension. Earlier = early termination or close-out.

end_dates <- function(snap) {
  duckspending_tbl("source_assistance_transaction", snapshot = snap) |>
    filter(fiscal_year >= 2020L) |>
    group_by(unique_award_key) |>
    summarise(end_date = max(period_of_performance_curr, na.rm = TRUE),
              cfda_number = max(cfda_number, na.rm = TRUE),
              .groups = "drop") |>
    collect()
}

end_changes <- inner_join(
  end_dates("20260206"),
  end_dates("20260506"),
  by = "unique_award_key",
  suffix = c("_feb", "_may")
) |>
  mutate(delta_days = as.integer(as.Date(end_date_may) - as.Date(end_date_feb))) |>
  filter(!is.na(delta_days), delta_days != 0)

# Top extensions
end_changes |> arrange(desc(delta_days)) |> head(10)

# CFDA programs with the most revision activity
end_changes |>
  group_by(cfda_number_may) |>
  summarise(n_awards = n(),
            median_shift_days = median(delta_days, na.rm = TRUE),
            .groups = "drop") |>
  arrange(desc(n_awards)) |>
  head(20)

The same pattern works for federal_action_obligation (find awards whose published obligation amount was revised) and action_date (find transactions whose posting date got corrected).

Validating time-travel consistency

Closed fiscal years should be approximately stable across snapshots. A quick sanity check:

feb_rows <- duckspending_tbl("appropriation_account_balances",
                             snapshot = "20260206") |>
  count() |> pull(n)

may_rows <- duckspending_tbl("appropriation_account_balances") |>  # latest
  count() |> pull(n)

c(feb = feb_rows, may = may_rows, delta = may_rows - feb_rows)

Small deltas are expected — agencies submit revised File A periodically. Large deltas in pre-FY2024 totals warrant a closer look at the upstream dump.

Disconnect

DBI::dbDisconnect(con, shutdown = TRUE)

The catalog files for every snapshot you’ve ever attached live in tools::R_user_dir("duckspending", "cache") and persist between sessions.