--- title: "Raw SQL escape hatch" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Raw SQL escape hatch} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set(eval = FALSE, collapse = TRUE, comment = "#>") ``` `duckspending_tbl()` and `dplyr` get you most of the way for ad-hoc queries. A few situations want raw SQL: - Cross-schema joins that span multiple snapshots in one query. - DuckDB-specific syntax (`AT (SNAPSHOT n)`, `QUALIFY`, list/struct expressions) that `dbplyr` doesn't translate. - A query you've already written and tested in DuckDB CLI and want to paste verbatim. The connection returned by `duckspending_connection()` is a regular `DBI` connection, so `DBI::dbGetQuery()` and friends work directly. ## Basic usage ```{r} library(duckspending) con <- duckspending_connection() DBI::dbGetQuery(con, " SELECT awarding_agency_code, COUNT(*) AS n, SUM(federal_action_obligation) AS total FROM latest.source_procurement_transaction WHERE fiscal_year = 2024 GROUP BY awarding_agency_code ORDER BY total DESC LIMIT 10 ") ``` Note the `latest.` schema prefix. Without it, you'd have to `USE latest.main` first — but that changes per-session state. Qualifying table names in the query is safer. ## Schema-qualified queries across snapshots The interesting raw-SQL use case is comparing snapshots inside the database, without bouncing data through R: ```{r} con <- duckspending_connection(snapshots = c("20260206", "20260506")) DBI::dbGetQuery(con, ' SELECT feb.awarding_agency_code, feb.total AS total_feb, may.total AS total_may, may.total - feb.total AS delta FROM ( SELECT awarding_agency_code, SUM(federal_action_obligation) AS total FROM "20260206".source_procurement_transaction WHERE fiscal_year = 2024 GROUP BY awarding_agency_code ) AS feb FULL OUTER JOIN ( SELECT awarding_agency_code, SUM(federal_action_obligation) AS total FROM "20260506".source_procurement_transaction WHERE fiscal_year = 2024 GROUP BY awarding_agency_code ) AS may USING (awarding_agency_code) ORDER BY ABS(delta) DESC NULLS LAST LIMIT 20 ') ``` ## Quoting digit-leading identifiers The schema names are date strings (`"20260206"`). SQL identifiers that begin with a digit **must** be double-quoted, or DuckDB will parse them as numeric literals. So: ```{sql, eval = FALSE} -- WRONG: parses as integer 20260206 SELECT * FROM 20260206.subaward_search LIMIT 5; -- RIGHT: double-quoted identifier SELECT * FROM "20260206".subaward_search LIMIT 5; ``` Inside `DBI::dbGetQuery()` strings, use single quotes around the SQL and literal double quotes around identifiers. R's `'...'` strings tolerate embedded `"` without escaping, which is why the example above uses single-quoted R strings: ```{r} DBI::dbGetQuery(con, 'SELECT COUNT(*) FROM "20260206".subaward_search') ``` If you'd rather use double-quoted R strings, escape the inner quotes: ```{r} DBI::dbGetQuery(con, "SELECT COUNT(*) FROM \"20260206\".subaward_search") ``` For safer identifier interpolation use `DBI::dbQuoteIdentifier()`: ```{r} snap <- "20260206" schema_lit <- DBI::dbQuoteIdentifier(con, snap) DBI::dbGetQuery(con, paste0("SELECT COUNT(*) FROM ", schema_lit, ".subaward_search")) ``` ## Parameterised values Always use parameterised queries for user-supplied values, never paste them into the SQL string: ```{r} agency_code <- "097" # DoD DBI::dbGetQuery(con, " SELECT recipient_name, SUM(federal_action_obligation) AS total FROM latest.source_procurement_transaction WHERE fiscal_year = 2024 AND awarding_agency_code = ? GROUP BY recipient_name ORDER BY total DESC LIMIT 20 ", params = list(agency_code)) ``` ## When to stay in `dplyr` Raw SQL is power-user mode. For most queries: - `duckspending_tbl()` + `dplyr` produces equivalent SQL and is refactor-safe (no string concatenation). - `dbplyr::sql()` lets you inject a SQL snippet inside an otherwise-`dplyr` pipeline when only one expression needs raw SQL. - `dplyr::show_query()` prints the generated SQL — useful for sanity checking what `dbplyr` is producing. ```{r} duckspending_tbl("subaward_search") |> filter(dbplyr::sql("list_contains(business_categories, 'small_business')")) |> count() |> show_query() ``` Drop into raw SQL when `dbplyr` can't express what you want, when you're porting a query verbatim from another tool, or when the SQL is the clearer way to communicate intent.