--- title: "Getting started with duckspending" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Getting started with duckspending} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set(eval = FALSE, collapse = TRUE, comment = "#>") ``` `duckspending` connects R to the public USAspending DuckLake archive at — a monthly mirror of the USAspending.gov PostgreSQL database stored as Parquet files on Cloudflare R2. You query it with DuckDB and `dplyr`; nothing else has to be installed on your machine and no database server has to run. This vignette walks from a clean R session through to a small analysis. ## Install ```{r, eval = FALSE} # Any of these or remotes::install_url(), renv::install("url:...) will work # pak::pak("url::https://usaspending.grant-witness.us/r-package/duckspending.tar.gz") install.packages( "https://usaspending.grant-witness.us/r-package/duckspending.tar.gz", type = "source", repos = NULL ) ``` For dplyr access you also want: ```{r} install.packages(c("dplyr", "dbplyr")) ``` ## Connect ```{r} library(duckspending) library(dplyr) con <- duckspending_connection() ``` The first call to `duckspending_connection()` does three things: 1. Reads `https://usaspending.grant-witness.us/snapshots.txt` to find the newest snapshot date. 2. Downloads that snapshot's ~10 MB DuckLake catalog file into `tools::R_user_dir("duckspending", "cache")` if it's not already cached. 3. Opens an in-memory DuckDB connection, attaches the catalog read-only, and projects a schema named `latest` over it. Subsequent default calls return the same cached connection without hitting the network. Catalog files persist between R sessions, so warm starts are sub-second. To see what's available: ```{r} duckspending_snapshots() # all dates published on R2, newest first duckspending_latest() # the newest date duckspending_describe() # tables + columns visible on this connection ``` ## First query After `duckspending_connection()` the `latest` schema has one view per user table. Use `duckspending_tbl()` for a lazy `dplyr` reference: ```{r} duckspending_tbl("subaward_search") |> filter(sub_fiscal_year == 2024L) |> group_by(recipient_country_name) |> summarise( total = sum(subaward_amount, na.rm = TRUE), n = n() ) |> arrange(desc(total)) |> head(20) |> collect() ``` For the headline tables there are explicit accessors that save typing: ```{r} duckspending_subawards() # subaward_search duckspending_assistance_transactions() # source_assistance_transaction duckspending_procurement_transactions() # source_procurement_transaction duckspending_agencies() # toptier_agency duckspending_recipients() # recipient_profile ``` Each returns a lazy `tbl_lazy` from the latest snapshot — `dplyr` verbs all work normally. ## Filter on the partition columns The four transaction/account tables are partitioned by `fiscal_year` and `awarding_agency_code`. Filtering on those columns at the start of the pipeline lets DuckDB skip whole Parquet files without reading them. ```{r} # Fast: prunes to one fiscal-year × agency partition (~1 file on R2) duckspending_procurement_transactions() |> filter(fiscal_year == 2024L, awarding_agency_code == "097") |> # DoD group_by(naics, naics_description) |> summarise( total = sum(federal_action_obligation, na.rm = TRUE), .groups = "drop" ) |> arrange(desc(total)) |> head(10) |> collect() ``` ```{r} # Slow: action_date is not a partition column, so this reads everything # for FY2024 + matching agency_name — full scan. duckspending_procurement_transactions() |> filter( action_date >= "2023-10-01", awarding_agency_name == "Department of Defense" ) |> ... ``` If you only know the agency name, look up the code first: ```{r} codes <- duckspending_tbl("toptier_agency") |> filter(name == "Department of Defense") |> pull(toptier_code) # "097" ``` For `subaward_search` the partition column is `awarding_agency_id` (the integer agency surrogate). Resolve it via `agency`: ```{r} dod_agency_ids <- duckspending_tbl("agency") |> inner_join( duckspending_tbl("toptier_agency") |> select(toptier_agency_id, toptier_name = name), by = "toptier_agency_id" ) |> filter(toptier_name == "Department of Defense") |> pull(id) duckspending_subawards() |> filter(awarding_agency_id %in% dod_agency_ids) ``` ## When to use `collect()` DuckDB streams aggregates and joins, but R's `dplyr` bindings materialise the final result into memory. If the final result is small (a grouped summary, a `head(20)`), `collect()` is safe. If you `collect()` a non-aggregated table — say all rows of `source_procurement_transaction` — R will run out of memory long before the query finishes. A safe pattern: 1. Filter early on partition columns. 2. Project only the columns you need with `select()`. 3. Aggregate or `head()`. 4. `collect()` at the end. ```{r} duckspending_assistance_transactions() |> filter(fiscal_year == 2024L, cfda_number == "47.041") |> # NSF Engineering select( awardee_or_recipient_legal, federal_action_obligation, place_of_performance_state_code ) |> group_by(place_of_performance_state_code) |> summarise( total = sum(federal_action_obligation, na.rm = TRUE), .groups = "drop" ) |> arrange(desc(total)) |> collect() ``` ## Inspect tables and columns `duckspending_describe()` returns a tibble combining the live schema with the published column descriptions from `descriptions.json`: ```{r} duckspending_describe() duckspending_describe(table = "subaward_search") ``` The Rd help topic `?duckspending_tables` renders the same descriptions dynamically — it always reflects the latest `descriptions.json`, so the help is current without a package release. ## Disconnect ```{r} DBI::dbDisconnect(con, shutdown = TRUE) ``` Data fetched from the database will also be cached, both in-RAM during a session, and on-disk across sessions. The on-disk cache will accumulate as you make different queries. The current size of the cache is printed when showing the connection object, and the cache automatically evicts least-recently-used items when disk usage is high. You can clear the cache manually with `duckspending_clear_cache()`. The cached catalog files in `tools::R_user_dir("duckspending", "cache")` persist. Delete that directory to start fresh. ## Where to next - `vignette("cross-snapshot")` — month-over-month and historical analysis using multiple snapshots at once. - `vignette("sql-escape-hatch")` — drop down to raw `DBI::dbGetQuery()` when `dplyr` isn't enough. - [`SCHEMA.md`](https://usaspending.grant-witness.us/schema.html) on the archive's website for table descriptions and partition layouts.