| Title: | Access to USAspending Data in a DuckLake |
|---|---|
| Description: | Provides a connection to USAspending data in a DuckLake, allowing users to perform bulk analyses difficult to achieve with the USAspending API. |
| Authors: | Grant Witness [aut, cre] |
| Maintainer: | Grant Witness <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 0.1.0 |
| Built: | 2026-05-31 17:31:03 UTC |
| Source: | https://usaspending.grant-witness.us/r-package/duckspending |
S3 method for connections::connection_view(). Builds an
rscontract::rscontract_spec() that wires the pane's drill-down to
pane_object_list() and pane_object_columns() so the IDE sees a
filtered, user-facing view of the DuckLake.
connection_view.duckspending_connection( con, connection_code = "", host = "", name = "", connection_id = "" )connection_view.duckspending_connection( con, connection_code = "", host = "", name = "", connection_id = "" )
con |
A |
connection_code |
Optional connection code shown by the IDE. |
host, name, connection_id
|
The result of rscontract::rscontract_open(), invisibly.
Caches the snapshot's catalog locally (if needed), ATTACHes it under
a hidden alias, and creates a top-level schema named after the date
with one view per user table. After this call the snapshot is queryable
as <conn>.<YYYYMMDD>.<table>.
duckspending_attach(snapshot, conn = duckspending_connection())duckspending_attach(snapshot, conn = duckspending_connection())
snapshot |
Snapshot identifier (YYYYMMDD character, |
conn |
A |
If conn is the package's cached default connection (the one
duckspending_connection() returns with no args), the cache entry is
updated in place so subsequent default-arg callers see the new state.
conn, invisibly, with the new snapshot in its metadata.
Currently-attached snapshots on a connection
duckspending_attached(conn = duckspending_connection())duckspending_attached(conn = duckspending_connection())
conn |
A |
Character vector of YYYYMMDD dates, newest first.
Calls cache_httpfs_clear_cache() on the underlying DuckDB session to
delete all cached parquet blocks from disk. Useful when you want to force
fresh fetches or reclaim disk space.
duckspending_clear_cache(conn = duckspending_connection())duckspending_clear_cache(conn = duckspending_connection())
conn |
A |
Has no effect (with a message) if the connection was opened with
cache = FALSE or if the cache_httpfs extension failed to load.
conn, invisibly.
Opens a DuckDB connection with one or more monthly DuckLake snapshots
attached. Each attached snapshot is exposed as a top-level schema named
after its date ("20260206", "20260306", ...). A latest schema
mirrors the newest attached snapshot for convenience.
duckspending_connection( snapshots = "latest", base_url = default_base_url(), catalog_dir = tools::R_user_dir("duckspending", which = "cache"), download_catalog = TRUE, open_pane = TRUE, cache = .duckspending_cache_default(), new_conn = FALSE )duckspending_connection( snapshots = "latest", base_url = default_base_url(), catalog_dir = tools::R_user_dir("duckspending", which = "cache"), download_catalog = TRUE, open_pane = TRUE, cache = .duckspending_cache_default(), new_conn = FALSE )
snapshots |
Snapshot selector. One of:
|
base_url |
Base URL of the public archive. Override for testing
via |
catalog_dir |
Local cache directory for snapshot catalog files. |
download_catalog |
Logical; when |
open_pane |
Logical; when |
cache |
Logical; when |
new_conn |
Logical; when |
By default a single snapshot — the newest one announced by the public
manifest — is attached. To attach a specific set, pass snapshots = ....
The first call with default arguments caches its connection in a
package-level environment; subsequent default calls reuse it, so all
the other duckspending_* helpers that default conn = duckspending_connection()
share a single underlying DuckDB session. Calls with non-default
arguments are not cached.
A duckspending_connection (duckdb_connection subclass).
## Not run: con <- duckspending_connection() duckspending_tbl("subaward_search") # uses the cached default DBI::dbDisconnect(con, shutdown = TRUE) ## End(Not run)## Not run: con <- duckspending_connection() duckspending_tbl("subaward_search") # uses the cached default DBI::dbDisconnect(con, shutdown = TRUE) ## End(Not run)
Three modes:
With no table, returns one row per user-facing table in each
attached snapshot (plus the latest alias).
With a table, returns a column-level summary for that table in
the requested snapshot (defaults to latest).
duckspending_describe( table = NULL, snapshot = NULL, descriptions = duckspending_descriptions(), include_source = FALSE, conn = duckspending_connection() )duckspending_describe( table = NULL, snapshot = NULL, descriptions = duckspending_descriptions(), include_source = FALSE, conn = duckspending_connection() )
table |
Optional table name. When |
snapshot |
Snapshot selector for the column-level mode.
|
descriptions |
Optional list returned by
|
include_source |
If |
conn |
A |
Internal bookkeeping (DuckLake metadata, hidden _dl_* databases,
_snapshots marker, DuckDB metadata schemas) is filtered out.
A tibble.
Descriptions live in a JSON file at the bucket root next to the catalog
(default URL is derived from the production catalog URL). When the file
is unreachable (e.g. not yet published, transient network error),
returns NULL so callers can fall back to schema-only output.
duckspending_descriptions( url = NULL, dir = tools::R_user_dir("duckspending", which = "cache"), refresh = FALSE )duckspending_descriptions( url = NULL, dir = tools::R_user_dir("duckspending", which = "cache"), refresh = FALSE )
url |
URL or local path to a descriptions JSON. Defaults to
|
dir |
Directory to cache the file in. |
refresh |
Force a re-check this session (ignores the per-session memoisation). The conditional ETag/Last-Modified check still applies, so a still-fresh remote file isn't re-downloaded. |
The file is cached under tools::R_user_dir() and refreshed
conditionally via the ETag / Last-Modified headers: the first call
per R session sends a HEAD request, and a GET only follows when the
remote validators differ from the cached ones. Subsequent calls in the
same session reuse the cached parse without any HTTP traffic.
If the per-session check fails (e.g. no network), the failure is
silent: a cached copy is returned when available, NULL otherwise, and
the next call within the same session re-attempts the check — so a
transient outage doesn't permanently disable the help-text refresh.
Expected JSON shape:
{
"tables": {
"<schema>.<table>": {
"description": "...",
"columns": {"<column>": "..."}
}
}
}
A parsed list, or NULL if the file could not be fetched.
Drops the user-facing schema ("YYYYMMDD") and DETACHes the hidden
underlying catalog. Removes the snapshot from the connection metadata.
Refuses to detach the snapshot currently bound to latest.
duckspending_detach(snapshot, conn = duckspending_connection())duckspending_detach(snapshot, conn = duckspending_connection())
snapshot |
Snapshot identifier. |
conn |
A |
conn, invisibly.
Stamps attr(col, "label") on every column in data whose name has a
matching description in the USAspending descriptions dictionary. Both
Positron's Data Explorer and RStudio's View() honour attr(., "label")
and surface it under the column header — the same convention used by
the labelled and haven packages.
duckspending_label( data, table = NULL, descriptions = duckspending_descriptions() )duckspending_label( data, table = NULL, descriptions = duckspending_descriptions() )
data |
A data frame or tibble. |
table |
Optional table name. When supplied, only descriptions for
that table are used (unambiguous; recommended when you know the
source). When |
descriptions |
Optional list returned by |
Columns whose names aren't in the dictionary are left untouched.
data, with attr(., "label") set on matching columns.
Checks (in order): options(duckspending.label = ...), the
DUCKSPENDING_LABEL env var, and falls back to TRUE. The env var
is treated as falsy if set to "false", "0", "no", "off"
(case-insensitive); anything else (including unset-but-checked) is truthy.
duckspending_label_default()duckspending_label_default()
A single logical.
Convenience equivalent of duckspending_snapshots()[1]. Returns
NA_character_ if no manifest is reachable.
duckspending_latest( refresh = FALSE, base_url = default_base_url(), cache_dir = tools::R_user_dir("duckspending", which = "cache") )duckspending_latest( refresh = FALSE, base_url = default_base_url(), cache_dir = tools::R_user_dir("duckspending", which = "cache") )
refresh |
Logical; bypass the local cache and re-fetch. |
base_url |
Base URL (default: production). Override with
|
cache_dir |
Local cache directory. |
YYYYMMDD character scalar or NA_character_.
Fetches the small text manifest at
{base_url}/snapshots.txt and returns the date strings, newest first.
The manifest is cached in tools::R_user_dir("duckspending", "cache")
for one hour to keep repeat calls cheap.
duckspending_snapshots( refresh = FALSE, base_url = default_base_url(), cache_dir = tools::R_user_dir("duckspending", which = "cache"), ttl_seconds = 3600L )duckspending_snapshots( refresh = FALSE, base_url = default_base_url(), cache_dir = tools::R_user_dir("duckspending", which = "cache"), ttl_seconds = 3600L )
refresh |
Logical; bypass the local cache and re-fetch. |
base_url |
Base URL (default: production). Override with
|
cache_dir |
Local cache directory. |
ttl_seconds |
Cache TTL in seconds (default 1 hour). |
A character vector of YYYYMMDD dates, newest first. Empty character if the manifest is not found.
Each function below returns a non-collected dplyr table reference
(i.e. a tbl_lazy) for one of the headline tables in the archive.
They are thin wrappers around duckspending_tbl() that exist so the
table names users care about most are exported, autocompletable, and
discoverable via ?duckspending — you don't have to remember the
exact column-name spelling of subaward_search etc.
duckspending_subawards( snapshot = NULL, label = duckspending_label_default(), conn = duckspending_connection() ) duckspending_assistance_transactions( snapshot = NULL, label = duckspending_label_default(), conn = duckspending_connection() ) duckspending_procurement_transactions( snapshot = NULL, label = duckspending_label_default(), conn = duckspending_connection() ) duckspending_agencies( snapshot = NULL, label = duckspending_label_default(), conn = duckspending_connection() ) duckspending_recipients( snapshot = NULL, label = duckspending_label_default(), conn = duckspending_connection() )duckspending_subawards( snapshot = NULL, label = duckspending_label_default(), conn = duckspending_connection() ) duckspending_assistance_transactions( snapshot = NULL, label = duckspending_label_default(), conn = duckspending_connection() ) duckspending_procurement_transactions( snapshot = NULL, label = duckspending_label_default(), conn = duckspending_connection() ) duckspending_agencies( snapshot = NULL, label = duckspending_label_default(), conn = duckspending_connection() ) duckspending_recipients( snapshot = NULL, label = duckspending_label_default(), conn = duckspending_connection() )
snapshot |
Optional snapshot selector (YYYYMMDD, |
label |
Whether to attach column descriptions as |
conn |
A |
Everything is lazy: build a dplyr pipeline and call
dplyr::collect() when you want the result in R, or run
dplyr::show_query() to inspect the SQL. To use the legacy raw
tables not listed here, call duckspending_tbl() directly.
A tbl_lazy reference.
duckspending_subawards(): Sub-award table
(subaward_search) — one row per disclosed sub-award.
duckspending_assistance_transactions(): Federal Assistance (FABS)
transactions — source_assistance_transaction.
duckspending_procurement_transactions(): Federal Procurement (FPDS)
transactions — source_procurement_transaction.
duckspending_agencies(): Toptier agencies
(toptier_agency) — one row per department/agency at the top of
the federal hierarchy.
duckspending_recipients(): Recipient profile table
(recipient_profile) — one row per known funding recipient with
rolled-up totals.
This help topic is rendered dynamically from
<base>/descriptions.json on the public archive, so it always
reflects the descriptions currently shipping alongside the catalog –
no package release is required when descriptions change.
The table below lists every user-facing table in the archive
together with a short description. Column-level descriptions are
available via duckspending_describe().
If the descriptions JSON is unreachable (network outage, file not
yet published) the table will say so and you can still inspect the
schema interactively with duckspending_describe().
| Table | Description |
| agency | Joining table that pairs each subtier agency with its toptier parent. The `agency_id` foreign key on transaction tables resolves to a row here, from which you can reach `toptier_agency` and `subtier_agency`. |
| appropriation_account_balances | DATA Act File A: TAS-level totals for each agency reporting period — budgetary resources, obligations incurred, gross outlays, and unobligated balances. One row per submission × TAS. |
| award_category | Reference table mapping internal award type/category codes to broad human-readable labels (e.g. contract vs. grant). |
| budget_authority | Aggregated agency-level budget authority figures by fiscal year, sourced from OMB MAX. Used by summary displays — not transaction-level data. |
| bureau_title_lookup | Crosswalk of agency/bureau codes to human-readable bureau names for display in agency-hierarchy surfaces. |
| c_to_d_linkage_updates | Manual corrections to the linkage between File C financial-account rows and File D award rows. Applied during the USAspending ETL pipeline. |
| cgac | Common Government-wide Accounting Classification (CGAC) codes — the 2-digit agency codes used in TAS strings. Check `is_frec_agency` to determine whether to join via `cgac` or `frec`. |
| dabs_submission_window_schedule | Schedule of DATA Act reporting periods: open/close dates and revealed dates for each agency submission window. |
| disaster_emergency_fund_code | Disaster Emergency Fund Codes (DEFC) — tag obligations and outlays that trace to supplemental appropriations (COVID-19, hurricane relief, etc.). |
| duns | Crosswalk between legacy DUNS numbers and current Unique Entity Identifiers (UEI). The federal government completed the DUNS → UEI transition in April 2022; this table preserves the mapping. |
| external_data_load_date | Tracks when each external feed (SAM.gov, FPDS, FABS, etc.) was last loaded into USAspending. Useful for freshness checks. |
| external_data_type | Reference table naming the external data sources tracked by `external_data_load_date`. |
| federal_account | Federal Account groupings — the parent level above TAS. Each federal account aggregates related TASes under a single toptier agency and is the most useful grain for cross-TAS analysis. |
| financial_accounts_by_awards | DATA Act File C: the bridge between budget appropriations and specific awards. One row per submission × TAS × award per reporting period, with obligation and outlay amounts. Partitioned by `fiscal_year`, `awarding_agency_code` (note: the `awarding_agency_code` column is derived from `submission_attributes.toptier_code` and was added by this pipeline to support partition pruning). |
| financial_accounts_by_program_activity_object_class | DATA Act File B: how each TAS spent money, broken down by program activity and object class (personnel, contracts, grants, travel…). Includes obligations, outlays, and DEFC array tagging. Partitioned by `fiscal_year`, `awarding_agency_code` (the `awarding_agency_code` column is derived from `submission_attributes.toptier_code`). |
| frec | Federal Reporting Entity Codes — used by certain Treasury bureaus in place of CGAC in TAS strings. Join here when `cgac.is_frec_agency` is TRUE. |
| frec_map | FREC-to-CGAC reconciliation table used to map FREC-coded TASes back to their parent CGAC agency. |
| gtas_sf133_balances | GTAS/SF-133 government-wide budget execution balances by fiscal year/period and Treasury account. Independent of the DATA Act submission path; useful for cross-checking File A totals. |
| historic_parent_duns | Legacy parent-DUNS crosswalk for historical recipient matching. Captures parent organization relationships that pre-date the UEI transition. |
| historical_appropriation_account_balances | Pre-DATA Act TAS-level account balances, sourced from GTAS/SF-133 reports. Coverage varies by agency; provides the long-tail historical view that complements `appropriation_account_balances`. Partitioned by `reporting_fiscal_year`. |
| naics | North American Industry Classification System codes — used on procurement transactions to classify the type of product or service procured. |
| object_class | Object class codes (OMB Circular A-11) — classify federal spending by type of resource purchased (personnel, travel, contracts, grants, etc.). Referenced by File B. |
| office | Contracting offices and sub-organizations within subtier agencies, with their AAC office codes. |
| overall_totals | Top-line fiscal-year aggregates (total budget authority) used for trend headers on USAspending.gov. |
| parent_award | IDV (Indefinite Delivery Vehicle) and ordering-agreement hierarchy. Links parent contracts (GWACs, BPAs, MACs) to their child award IDs and provides aggregate rollups. |
| psc | Product and Service Codes — more granular than NAICS, specific to federal procurement. Used on contract transactions. |
| recipient_profile | Recipient-level summary records used for recipient search/profile surfaces. One row per recipient profile entity (recipient or parent-level depending on `recipient_level`), with 12-month rollups across contracts, grants, loans, and direct payments. |
| ref_city_county_state_code | City × county × state FIPS code reference used for place-of-performance and recipient-location matching. |
| ref_country_code | ISO and USAspending country codes used in place-of-performance and recipient-country fields on transactions. |
| ref_population_cong_district | ACS-derived population estimates by state × congressional district. Used by USAspending.gov for per-capita normalisations. |
| ref_population_county | ACS-derived population estimates by state × county. Used by USAspending.gov for per-capita normalisations. |
| ref_program_activity | Program activity codes — identify the specific activity or project funded within a TAS/budget account. Defined in agency budget justifications and referenced by File B. |
| references_cfda | Assistance Listings (formerly CFDA — Catalog of Federal Domestic Assistance). Defines each federal grant or assistance program, referenced by `cfda_number` on assistance transactions. |
| references_definition | Glossary table with plain-language and official definitions for USAspending terms (e.g. "obligation", "outlay"). Powers the definition tooltips on USAspending.gov. |
| source_assistance_transaction | Assistance transaction actions (grants, loans, direct payments, insurance) as submitted to the Financial Assistance Broker Submission (FABS). One row per action. Equivalent to USAspending's `transaction_search_fabs` view but without denormalisation. Partitioned by `fiscal_year`, `awarding_agency_code`. |
| source_procurement_transaction | Contract transaction actions (FPDS) — one row per contract action (base award, modification, IDV award, etc.). Equivalent to USAspending's `transaction_search_fpds` view but without denormalisation. Partitioned by `fiscal_year`, `awarding_agency_code`. |
| state_data | State-level metadata: FIPS codes, abbreviations, and ACS population. Used for state-level rollups and per-capita comparisons. |
| subaward_search | Subaward records reported by prime recipients to FSRS via SAM.gov. Covers grants ≥ $25K and contracts ≥ $30K. Many subawards are not reported and submissions lag prime award data by 30–90 days. Partitioned by `awarding_agency_id`. |
| submission_attributes | One row per DATA Act submission. The authoritative source of fiscal year/period metadata — joins to Files A/B/C via `submission_id`. Use `quarter_format_flag` to distinguish quarterly from monthly filers. |
| subtier_agency | Subtier agencies — bureaus, administrations, and offices within a department (e.g. NIH under HHS, Army under DoD). |
| toptier_agency | Cabinet-level agencies (Department of Defense, Department of Health and Human Services, etc.). Top of the agency hierarchy. The 3-digit (or 4-digit FREC) `toptier_code` is the partition key on the transaction tables. |
| treasury_appropriation_account | Treasury Account Symbol (TAS) definitions — the atomic unit of appropriation. Each row is one line of congressional appropriation authority, with budget function/subfunction classification. |
| zips_grouped | ZIP-to-county/congressional-district crosswalk used by geographic rollups. |
duckspending_describe() for live, column-level details on
a connected DuckLake; duckspending_descriptions() for the raw
parsed JSON.
Returns a non-collected dplyr table reference to <snapshot>.<name>,
where <snapshot> defaults to the connection's latest schema. Verifies
the table exists first so users get a clear error instead of a deferred
SQL failure.
duckspending_tbl( name, snapshot = NULL, label = duckspending_label_default(), conn = duckspending_connection() )duckspending_tbl( name, snapshot = NULL, label = duckspending_label_default(), conn = duckspending_connection() )
name |
Table name (no quoting needed). |
snapshot |
Optional snapshot selector. One of:
|
label |
Whether to attach column descriptions as |
conn |
A |
A tbl_lazy (dbplyr) reference to the table.
Thin convenience wrapper around connections::connection_view() that
dispatches to the duckspending_connection S3 method. Useful after
creating a connection with open_pane = FALSE, or to re-open the
pane after closing it.
duckspending_view(conn = duckspending_connection())duckspending_view(conn = duckspending_connection())
conn |
A |
conn, invisibly.