UKB_database

Setup

You will need the files in the r folder of this repository.

The script

extract_data.R contains the functions to extract data from the database.

The config

The config.yml file is our suggested way of passing filepaths in to the R functions, so that if you want to modify these filepaths you need only do so in this one central configuration file, instead of having to trawl through all your scripts doing find-and-replace. It is written in yaml format (similar to json).

In R, we use the yaml package to parse yaml files, storing them as named lists (dictionaries) so you can access the configuration values by name.

The mapping

We suggest that you have a mapping by which you convert the UK Biobank field IDs into human-readable names for ease of use. The database contains the data in raw form, with the original UKB Field IDs, which means you can apply any mapping of your choice.

These scripts expect this mapping to be in the form of Renaming_List.csv.

Our suggested format for a mapping file is as follows (the row in italics is here for documentation only!):

Field_ID Field_Description NewVarName Coded Notes
UKB Field ID Text description of the field Human-readable name Initials of person who named the field Any notes
eid Pseudonymised participant ID ID JC Please note this “eid” is necessary for the automated renaming process
53 Date when a participant attended a UK Biobank assessment centre Rec_DateAssess JC  
6150 Vascular/heart problems diagnosed by doctor HMH_HeartProbs JC  
etc      

Required:

But we encourage the other columns as sensible bits of info to keep there!

Extracting the data

First you will need to source extract_data.R so the functions are loaded for use.

source("/path/to/extract_data.R")

Usage

You need to use the DB_extract() function which takes the following form:

DB_extract(
    extract_cols,
    db = config$data$database,
    name_map = config$cleaning$renaming,
    withdrawals = config$cleaning$withdrawals
)

Arguments

The arguments db, name_map and withdrawals take default values from your config.yml if you haven’t explicitly supplied alternatives.

Note that you will need to manually specify the instances and arrays for each field you want to extract. We recommend that you always look at the Data Showcase page for your fields of interest, and understand the columns available (described by the instance and array)

Example

data <- DB_extract(
    extract_cols = c(
        "ID", 
        "Rec_DateAssess.0.0",
        paste0("HMH_HeartProbs.0.", seq(0, 3, by=1))
    )
) 

In this example, we are extracting

Alternatives

You can of course extract the data in any way you prefer, DB_extract() is just a wrapper around functions provided in the duckDB and DBI packages.

If you are familiar with SQL, you may wish to submit queries directly using dbGetQuery from DBI.

Another useful R package for interfacing with databases is dbplyr which allows you to write dplyr style code that is converted into SQL queries for you.