duckspending connects
R to the public USAspending DuckLake archive at https://usaspending.grant-witness.us — a monthly mirror
of the USAspending.gov PostgreSQL database stored as Parquet files on
Cloudflare R2. You query it with DuckDB and dplyr; nothing
else has to be installed on your machine and no database server has to
run.
This vignette walks from a clean R session through to a small analysis.
# Any of these or remotes::install_url(), renv::install("url:...) will work
# pak::pak("url::https://usaspending.grant-witness.us/r-package/duckspending.tar.gz")
install.packages(
"https://usaspending.grant-witness.us/r-package/duckspending.tar.gz",
type = "source",
repos = NULL
)For dplyr access you also want:
The first call to duckspending_connection() does three
things:
https://usaspending.grant-witness.us/snapshots.txt to find
the newest snapshot date.tools::R_user_dir("duckspending", "cache") if it’s not
already cached.latest over it.Subsequent default calls return the same cached connection without hitting the network. Catalog files persist between R sessions, so warm starts are sub-second.
To see what’s available:
After duckspending_connection() the latest
schema has one view per user table. Use duckspending_tbl()
for a lazy dplyr reference:
duckspending_tbl("subaward_search") |>
filter(sub_fiscal_year == 2024L) |>
group_by(recipient_country_name) |>
summarise(
total = sum(subaward_amount, na.rm = TRUE),
n = n()
) |>
arrange(desc(total)) |>
head(20) |>
collect()For the headline tables there are explicit accessors that save typing:
duckspending_subawards() # subaward_search
duckspending_assistance_transactions() # source_assistance_transaction
duckspending_procurement_transactions() # source_procurement_transaction
duckspending_agencies() # toptier_agency
duckspending_recipients() # recipient_profileEach returns a lazy tbl_lazy from the latest snapshot —
dplyr verbs all work normally.
The four transaction/account tables are partitioned by
fiscal_year and awarding_agency_code.
Filtering on those columns at the start of the pipeline lets DuckDB skip
whole Parquet files without reading them.
# Fast: prunes to one fiscal-year × agency partition (~1 file on R2)
duckspending_procurement_transactions() |>
filter(fiscal_year == 2024L, awarding_agency_code == "097") |> # DoD
group_by(naics, naics_description) |>
summarise(
total = sum(federal_action_obligation, na.rm = TRUE),
.groups = "drop"
) |>
arrange(desc(total)) |>
head(10) |>
collect()# Slow: action_date is not a partition column, so this reads everything
# for FY2024 + matching agency_name — full scan.
duckspending_procurement_transactions() |>
filter(
action_date >= "2023-10-01",
awarding_agency_name == "Department of Defense"
) |>
...If you only know the agency name, look up the code first:
codes <- duckspending_tbl("toptier_agency") |>
filter(name == "Department of Defense") |>
pull(toptier_code) # "097"For subaward_search the partition column is
awarding_agency_id (the integer agency surrogate). Resolve
it via agency:
dod_agency_ids <- duckspending_tbl("agency") |>
inner_join(
duckspending_tbl("toptier_agency") |>
select(toptier_agency_id, toptier_name = name),
by = "toptier_agency_id"
) |>
filter(toptier_name == "Department of Defense") |>
pull(id)
duckspending_subawards() |>
filter(awarding_agency_id %in% dod_agency_ids)collect()DuckDB streams aggregates and joins, but R’s dplyr
bindings materialise the final result into memory. If the final result
is small (a grouped summary, a head(20)),
collect() is safe. If you collect() a
non-aggregated table — say all rows of
source_procurement_transaction — R will run out of memory
long before the query finishes.
A safe pattern:
select().head().collect() at the end.duckspending_assistance_transactions() |>
filter(fiscal_year == 2024L, cfda_number == "47.041") |> # NSF Engineering
select(
awardee_or_recipient_legal,
federal_action_obligation,
place_of_performance_state_code
) |>
group_by(place_of_performance_state_code) |>
summarise(
total = sum(federal_action_obligation, na.rm = TRUE),
.groups = "drop"
) |>
arrange(desc(total)) |>
collect()duckspending_describe() returns a tibble combining the
live schema with the published column descriptions from
descriptions.json:
The Rd help topic ?duckspending_tables renders the same
descriptions dynamically — it always reflects the latest
descriptions.json, so the help is current without a package
release.
Data fetched from the database will also be cached, both in-RAM
during a session, and on-disk across sessions. The on-disk cache will
accumulate as you make different queries. The current size of the cache
is printed when showing the connection object, and the cache
automatically evicts least-recently-used items when disk usage is high.
You can clear the cache manually with
duckspending_clear_cache().
The cached catalog files in
tools::R_user_dir("duckspending", "cache") persist. Delete
that directory to start fresh.
vignette("cross-snapshot") — month-over-month and
historical analysis using multiple snapshots at once.vignette("sql-escape-hatch") — drop down to raw
DBI::dbGetQuery() when dplyr isn’t
enough.SCHEMA.md
on the archive’s website for table descriptions and partition
layouts.