--- title: "Querying across snapshots" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Querying across snapshots} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include=FALSE} knitr::opts_chunk$set(eval = FALSE, collapse = TRUE, comment = "#>") ``` `duckspending_connection()` defaults to attaching only the newest snapshot, which is what most analyses want. When you need to compare months — to track when a grant program changed, to watch obligations accrue over a quarter, or to validate that historical years haven't shifted — you can attach several at once. This vignette shows the patterns and the caveats. ## Attaching multiple snapshots ```{r} library(duckspending) library(dplyr) # Attach every snapshot listed in /snapshots.txt: con <- duckspending_connection(snapshots = "all") # Or a specific subset: con <- duckspending_connection(snapshots = c("20260206", "20260306", "20260506")) ``` Each attached date becomes a top-level schema named for the date string, plus a `latest` schema that aliases the newest. `duckspending_describe()` reports them all: ```{r} duckspending_describe() |> distinct(schema) #> # A tibble: 4 × 1 #> schema #> #> 1 20260206 #> 2 20260306 #> 3 20260506 #> 4 latest ``` `duckspending_tbl()` accepts a `snapshot =` argument to address any attached snapshot: ```{r} duckspending_tbl("subaward_search") # latest duckspending_tbl("subaward_search", snapshot = "20260206") # February ``` Cost: attaching each snapshot downloads its catalog file (~10 MB) on first use. The package caches catalogs on disk, so the second R session that attaches the same snapshots is nearly free. ## Adding / removing snapshots at runtime ```{r} duckspending_attached(con) # list currently-attached snapshots duckspending_attach("20260306", con) # attach mid-session duckspending_detach("20260306", con) # detach (catalog stays cached) ``` ## Cross-snapshot diff `dplyr` doesn't know how to span schemas in a single pipeline, so for diffs the simplest path is two `collect()`s plus a base-R or `dplyr` join in memory: ```{r} feb <- duckspending_tbl("subaward_search", snapshot = "20260206") |> filter(sub_fiscal_year == 2024L) |> group_by(awarding_agency_id) |> summarise(total_feb = sum(subaward_amount, na.rm = TRUE), n_feb = n()) |> collect() may <- duckspending_tbl("subaward_search", snapshot = "20260506") |> filter(sub_fiscal_year == 2024L) |> group_by(awarding_agency_id) |> summarise(total_may = sum(subaward_amount, na.rm = TRUE), n_may = n()) |> collect() diff <- full_join(feb, may, by = "awarding_agency_id") |> mutate(delta = coalesce(total_may, 0) - coalesce(total_feb, 0)) |> arrange(desc(abs(delta))) diff ``` If you want the diff to stay inside DuckDB (faster, less memory), drop to SQL — see `vignette("sql-escape-hatch")` for the pattern. ## Caveat: snapshots are not strictly additive Each monthly snapshot is a fresh re-export of the entire USAspending PostgreSQL database, not a "what changed since last month" diff. Rows can change retroactively when agencies amend prior filings: - An action obligation can be revised — sometimes years after the fact. - A FAIN or PIID can be retroactively corrected, moving rows between recipients. - Period-of-performance dates shift (no-cost extensions are common). - `action_date` itself gets corrected (sometimes by years, when an agency migrates from a legacy system). - Historical fiscal years that were thought "closed" sometimes pick up late-filed records. The implication: **the row-count delta between snapshots is not "new activity that month."** If your question is "what activity happened in March", filter the latest snapshot by `action_date` / `fiscal_year` and trust that. Use cross-snapshot diffs to answer the different question — **what's different about the record of past activity now versus then?** ## Worked examples: detecting specific kinds of revision For row-level diffs, use a stable transaction key. `unique_award_key` identifies awards across snapshots; `published_fabs_id` (assistance) and `detached_award_procurement_id` (procurement) identify specific actions. `fain` and `piid` are *not* reliable diff keys — agencies do correct them. ### Backfills into a closed fiscal year How many FY2023 assistance transactions appeared between February and May? FY2023 was closed by Feb in the everyday sense, but late filings backfill. ```{r} feb_ids <- duckspending_tbl("source_assistance_transaction", snapshot = "20260206") |> filter(fiscal_year == 2023L) |> select(published_fabs_id) |> collect() may_fy23 <- duckspending_tbl("source_assistance_transaction", snapshot = "20260506") |> filter(fiscal_year == 2023L) |> select(published_fabs_id, awarding_agency_code, federal_action_obligation, cfda_number) |> collect() backfilled <- anti_join(may_fy23, feb_ids, by = "published_fabs_id") backfilled |> group_by(awarding_agency_code) |> summarise(n = n(), total_obligation = sum(federal_action_obligation, na.rm = TRUE), .groups = "drop") |> arrange(desc(n)) ``` ### End-date revisions (no-cost extensions, early terminations) A `period_of_performance_curr` that moves later is typically a no-cost extension. Earlier = early termination or close-out. ```{r} end_dates <- function(snap) { duckspending_tbl("source_assistance_transaction", snapshot = snap) |> filter(fiscal_year >= 2020L) |> group_by(unique_award_key) |> summarise(end_date = max(period_of_performance_curr, na.rm = TRUE), cfda_number = max(cfda_number, na.rm = TRUE), .groups = "drop") |> collect() } end_changes <- inner_join( end_dates("20260206"), end_dates("20260506"), by = "unique_award_key", suffix = c("_feb", "_may") ) |> mutate(delta_days = as.integer(as.Date(end_date_may) - as.Date(end_date_feb))) |> filter(!is.na(delta_days), delta_days != 0) # Top extensions end_changes |> arrange(desc(delta_days)) |> head(10) # CFDA programs with the most revision activity end_changes |> group_by(cfda_number_may) |> summarise(n_awards = n(), median_shift_days = median(delta_days, na.rm = TRUE), .groups = "drop") |> arrange(desc(n_awards)) |> head(20) ``` The same pattern works for `federal_action_obligation` (find awards whose published obligation amount was revised) and `action_date` (find transactions whose posting date got corrected). ## Validating time-travel consistency Closed fiscal years should be approximately stable across snapshots. A quick sanity check: ```{r} feb_rows <- duckspending_tbl("appropriation_account_balances", snapshot = "20260206") |> count() |> pull(n) may_rows <- duckspending_tbl("appropriation_account_balances") |> # latest count() |> pull(n) c(feb = feb_rows, may = may_rows, delta = may_rows - feb_rows) ``` Small deltas are expected — agencies submit revised File A periodically. Large deltas in pre-FY2024 totals warrant a closer look at the upstream dump. ## Disconnect ```{r} DBI::dbDisconnect(con, shutdown = TRUE) ``` The catalog files for every snapshot you've ever attached live in `tools::R_user_dir("duckspending", "cache")` and persist between sessions.