---
title: "Getting started with duckspending"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Getting started with duckspending}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(eval = FALSE, collapse = TRUE, comment = "#>")
```

`duckspending` connects R to the public USAspending DuckLake archive at
<https://usaspending.grant-witness.us> — a monthly mirror of the
USAspending.gov PostgreSQL database stored as Parquet files on Cloudflare
R2. You query it with DuckDB and `dplyr`; nothing else has to be installed
on your machine and no database server has to run.

This vignette walks from a clean R session through to a small analysis.

## Install

```{r, eval = FALSE}
# Any of these or remotes::install_url(), renv::install("url:...) will work
# pak::pak("url::https://usaspending.grant-witness.us/r-package/duckspending.tar.gz")
install.packages(
  "https://usaspending.grant-witness.us/r-package/duckspending.tar.gz",
  type = "source",
  repos = NULL
)
```

For dplyr access you also want:

```{r}
install.packages(c("dplyr", "dbplyr"))
```

## Connect

```{r}
library(duckspending)
library(dplyr)

con <- duckspending_connection()
```

The first call to `duckspending_connection()` does three things:

1. Reads `https://usaspending.grant-witness.us/snapshots.txt` to find the
   newest snapshot date.
2. Downloads that snapshot's ~10 MB DuckLake catalog file into
   `tools::R_user_dir("duckspending", "cache")` if it's not already
   cached.
3. Opens an in-memory DuckDB connection, attaches the catalog read-only,
   and projects a schema named `latest` over it.

Subsequent default calls return the same cached connection without
hitting the network. Catalog files persist between R sessions, so warm
starts are sub-second.

To see what's available:

```{r}
duckspending_snapshots() # all dates published on R2, newest first
duckspending_latest() # the newest date
duckspending_describe() # tables + columns visible on this connection
```

## First query

After `duckspending_connection()` the `latest` schema has one view per
user table. Use `duckspending_tbl()` for a lazy `dplyr` reference:

```{r}
duckspending_tbl("subaward_search") |>
  filter(sub_fiscal_year == 2024L) |>
  group_by(recipient_country_name) |>
  summarise(
    total = sum(subaward_amount, na.rm = TRUE),
    n = n()
  ) |>
  arrange(desc(total)) |>
  head(20) |>
  collect()
```

For the headline tables there are explicit accessors that save typing:

```{r}
duckspending_subawards() # subaward_search
duckspending_assistance_transactions() # source_assistance_transaction
duckspending_procurement_transactions() # source_procurement_transaction
duckspending_agencies() # toptier_agency
duckspending_recipients() # recipient_profile
```

Each returns a lazy `tbl_lazy` from the latest snapshot — `dplyr` verbs
all work normally.

## Filter on the partition columns

The four transaction/account tables are partitioned by `fiscal_year` and
`awarding_agency_code`. Filtering on those columns at the start of the
pipeline lets DuckDB skip whole Parquet files without reading them.

```{r}
# Fast: prunes to one fiscal-year × agency partition (~1 file on R2)
duckspending_procurement_transactions() |>
  filter(fiscal_year == 2024L, awarding_agency_code == "097") |> # DoD
  group_by(naics, naics_description) |>
  summarise(
    total = sum(federal_action_obligation, na.rm = TRUE),
    .groups = "drop"
  ) |>
  arrange(desc(total)) |>
  head(10) |>
  collect()
```

```{r}
# Slow: action_date is not a partition column, so this reads everything
# for FY2024 + matching agency_name — full scan.
duckspending_procurement_transactions() |>
  filter(
    action_date >= "2023-10-01",
    awarding_agency_name == "Department of Defense"
  ) |>
  ...
```

If you only know the agency name, look up the code first:

```{r}
codes <- duckspending_tbl("toptier_agency") |>
  filter(name == "Department of Defense") |>
  pull(toptier_code) # "097"
```

For `subaward_search` the partition column is `awarding_agency_id` (the
integer agency surrogate). Resolve it via `agency`:

```{r}
dod_agency_ids <- duckspending_tbl("agency") |>
  inner_join(
    duckspending_tbl("toptier_agency") |>
      select(toptier_agency_id, toptier_name = name),
    by = "toptier_agency_id"
  ) |>
  filter(toptier_name == "Department of Defense") |>
  pull(id)

duckspending_subawards() |>
  filter(awarding_agency_id %in% dod_agency_ids)
```

## When to use `collect()`

DuckDB streams aggregates and joins, but R's `dplyr` bindings materialise
the final result into memory. If the final result is small (a grouped
summary, a `head(20)`), `collect()` is safe. If you `collect()` a
non-aggregated table — say all rows of `source_procurement_transaction`
— R will run out of memory long before the query finishes.

A safe pattern:

1. Filter early on partition columns.
2. Project only the columns you need with `select()`.
3. Aggregate or `head()`.
4. `collect()` at the end.

```{r}
duckspending_assistance_transactions() |>
  filter(fiscal_year == 2024L, cfda_number == "47.041") |> # NSF Engineering
  select(
    awardee_or_recipient_legal,
    federal_action_obligation,
    place_of_performance_state_code
  ) |>
  group_by(place_of_performance_state_code) |>
  summarise(
    total = sum(federal_action_obligation, na.rm = TRUE),
    .groups = "drop"
  ) |>
  arrange(desc(total)) |>
  collect()
```

## Inspect tables and columns

`duckspending_describe()` returns a tibble combining the live schema with
the published column descriptions from `descriptions.json`:

```{r}
duckspending_describe()
duckspending_describe(table = "subaward_search")
```

The Rd help topic `?duckspending_tables` renders the same descriptions
dynamically — it always reflects the latest `descriptions.json`, so the
help is current without a package release.

## Disconnect

```{r}
DBI::dbDisconnect(con, shutdown = TRUE)
```


Data fetched from the database will also be cached, both in-RAM during a session,
and on-disk across sessions. The on-disk cache will accumulate as you make different queries.
The current size of the cache is printed when showing the connection object, and the cache
automatically evicts least-recently-used items when disk usage is high.  You can clear
the cache manually with `duckspending_clear_cache()`.

The cached catalog files in `tools::R_user_dir("duckspending", "cache")`
persist. Delete that directory to start fresh.


## Where to next

- `vignette("cross-snapshot")` — month-over-month and historical
  analysis using multiple snapshots at once.
- `vignette("sql-escape-hatch")` — drop down to raw `DBI::dbGetQuery()`
  when `dplyr` isn't enough.
- [`SCHEMA.md`](https://usaspending.grant-witness.us/schema.html) on the
  archive's website for table descriptions and partition layouts.