Getting started with duckspending

duckspending connects R to the public USAspending DuckLake archive at https://usaspending.grant-witness.us — a monthly mirror of the USAspending.gov PostgreSQL database stored as Parquet files on Cloudflare R2. You query it with DuckDB and dplyr; nothing else has to be installed on your machine and no database server has to run.

This vignette walks from a clean R session through to a small analysis.

Install

# Any of these or remotes::install_url(), renv::install("url:...) will work
# pak::pak("url::https://usaspending.grant-witness.us/r-package/duckspending.tar.gz")
install.packages(
  "https://usaspending.grant-witness.us/r-package/duckspending.tar.gz",
  type = "source",
  repos = NULL
)

For dplyr access you also want:

install.packages(c("dplyr", "dbplyr"))

Connect

library(duckspending)
library(dplyr)

con <- duckspending_connection()

The first call to duckspending_connection() does three things:

  1. Reads https://usaspending.grant-witness.us/snapshots.txt to find the newest snapshot date.
  2. Downloads that snapshot’s ~10 MB DuckLake catalog file into tools::R_user_dir("duckspending", "cache") if it’s not already cached.
  3. Opens an in-memory DuckDB connection, attaches the catalog read-only, and projects a schema named latest over it.

Subsequent default calls return the same cached connection without hitting the network. Catalog files persist between R sessions, so warm starts are sub-second.

To see what’s available:

duckspending_snapshots() # all dates published on R2, newest first
duckspending_latest() # the newest date
duckspending_describe() # tables + columns visible on this connection

First query

After duckspending_connection() the latest schema has one view per user table. Use duckspending_tbl() for a lazy dplyr reference:

duckspending_tbl("subaward_search") |>
  filter(sub_fiscal_year == 2024L) |>
  group_by(recipient_country_name) |>
  summarise(
    total = sum(subaward_amount, na.rm = TRUE),
    n = n()
  ) |>
  arrange(desc(total)) |>
  head(20) |>
  collect()

For the headline tables there are explicit accessors that save typing:

duckspending_subawards() # subaward_search
duckspending_assistance_transactions() # source_assistance_transaction
duckspending_procurement_transactions() # source_procurement_transaction
duckspending_agencies() # toptier_agency
duckspending_recipients() # recipient_profile

Each returns a lazy tbl_lazy from the latest snapshot — dplyr verbs all work normally.

Filter on the partition columns

The four transaction/account tables are partitioned by fiscal_year and awarding_agency_code. Filtering on those columns at the start of the pipeline lets DuckDB skip whole Parquet files without reading them.

# Fast: prunes to one fiscal-year × agency partition (~1 file on R2)
duckspending_procurement_transactions() |>
  filter(fiscal_year == 2024L, awarding_agency_code == "097") |> # DoD
  group_by(naics, naics_description) |>
  summarise(
    total = sum(federal_action_obligation, na.rm = TRUE),
    .groups = "drop"
  ) |>
  arrange(desc(total)) |>
  head(10) |>
  collect()
# Slow: action_date is not a partition column, so this reads everything
# for FY2024 + matching agency_name — full scan.
duckspending_procurement_transactions() |>
  filter(
    action_date >= "2023-10-01",
    awarding_agency_name == "Department of Defense"
  ) |>
  ...

If you only know the agency name, look up the code first:

codes <- duckspending_tbl("toptier_agency") |>
  filter(name == "Department of Defense") |>
  pull(toptier_code) # "097"

For subaward_search the partition column is awarding_agency_id (the integer agency surrogate). Resolve it via agency:

dod_agency_ids <- duckspending_tbl("agency") |>
  inner_join(
    duckspending_tbl("toptier_agency") |>
      select(toptier_agency_id, toptier_name = name),
    by = "toptier_agency_id"
  ) |>
  filter(toptier_name == "Department of Defense") |>
  pull(id)

duckspending_subawards() |>
  filter(awarding_agency_id %in% dod_agency_ids)

When to use collect()

DuckDB streams aggregates and joins, but R’s dplyr bindings materialise the final result into memory. If the final result is small (a grouped summary, a head(20)), collect() is safe. If you collect() a non-aggregated table — say all rows of source_procurement_transaction — R will run out of memory long before the query finishes.

A safe pattern:

  1. Filter early on partition columns.
  2. Project only the columns you need with select().
  3. Aggregate or head().
  4. collect() at the end.
duckspending_assistance_transactions() |>
  filter(fiscal_year == 2024L, cfda_number == "47.041") |> # NSF Engineering
  select(
    awardee_or_recipient_legal,
    federal_action_obligation,
    place_of_performance_state_code
  ) |>
  group_by(place_of_performance_state_code) |>
  summarise(
    total = sum(federal_action_obligation, na.rm = TRUE),
    .groups = "drop"
  ) |>
  arrange(desc(total)) |>
  collect()

Inspect tables and columns

duckspending_describe() returns a tibble combining the live schema with the published column descriptions from descriptions.json:

duckspending_describe()
duckspending_describe(table = "subaward_search")

The Rd help topic ?duckspending_tables renders the same descriptions dynamically — it always reflects the latest descriptions.json, so the help is current without a package release.

Disconnect

DBI::dbDisconnect(con, shutdown = TRUE)

Data fetched from the database will also be cached, both in-RAM during a session, and on-disk across sessions. The on-disk cache will accumulate as you make different queries. The current size of the cache is printed when showing the connection object, and the cache automatically evicts least-recently-used items when disk usage is high. You can clear the cache manually with duckspending_clear_cache().

The cached catalog files in tools::R_user_dir("duckspending", "cache") persist. Delete that directory to start fresh.

Where to next

  • vignette("cross-snapshot") — month-over-month and historical analysis using multiple snapshots at once.
  • vignette("sql-escape-hatch") — drop down to raw DBI::dbGetQuery() when dplyr isn’t enough.
  • SCHEMA.md on the archive’s website for table descriptions and partition layouts.