You will need the files in the r folder of this repository.
The script
extract_data.R contains the functions to extract data from the database.
The config
The config.yml file is our suggested way of passing filepaths in to the R functions, so that if you want to modify these filepaths you need only do so in this one central configuration file, instead of having to trawl through all your scripts doing find-and-replace. It is written in yaml format (similar to json).
In R, we use the yaml package to parse yaml files, storing them as named lists (dictionaries) so you can access the configuration values by name.
The mapping
We suggest that you have a mapping by which you convert the UK Biobank field IDs into human-readable names for ease of use. The database contains the data in raw form, with the original UKB Field IDs, which means you can apply any mapping of your choice.
These scripts expect this mapping to be in the form of Renaming_List.csv.
Our suggested format for a mapping file is as follows (the row in italics is here for documentation only!):
| Field_ID | Field_Description | NewVarName | Coded | Notes |
|---|---|---|---|---|
| UKB Field ID | Text description of the field | Human-readable name | Initials of person who named the field | Any notes |
| eid | Pseudonymised participant ID | ID | JC | Please note this “eid” is necessary for the automated renaming process |
| 53 | Date when a participant attended a UK Biobank assessment centre | Rec_DateAssess | JC | |
| 6150 | Vascular/heart problems diagnosed by doctor | HMH_HeartProbs | JC | |
| … | etc |
Required:
But we encourage the other columns as sensible bits of info to keep there!
First you will need to source extract_data.R so the functions are loaded for use.
source("/path/to/extract_data.R")
You need to use the DB_extract() function which takes the following form:
DB_extract(
extract_cols,
db = config$data$database,
name_map = config$cleaning$renaming,
withdrawals = config$cleaning$withdrawals
)
extract_cols
db
name_map
withdrawals
The arguments db, name_map and withdrawals take default values from your config.yml if you haven’t explicitly supplied alternatives.
Note that you will need to manually specify the instances and arrays for each field you want to extract. We recommend that you always look at the Data Showcase page for your fields of interest, and understand the columns available (described by the instance and array)
data <- DB_extract(
extract_cols = c(
"ID",
"Rec_DateAssess.0.0",
paste0("HMH_HeartProbs.0.", seq(0, 3, by=1))
)
)
In this example, we are extracting
You can of course extract the data in any way you prefer, DB_extract() is just a wrapper around functions provided in the duckDB and DBI packages.
If you are familiar with SQL, you may wish to submit queries directly using dbGetQuery from DBI.
Another useful R package for interfacing with databases is dbplyr which allows you to write dplyr style code that is converted into SQL queries for you.