You will need the scripts in the database folder.
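Convert your download with ukbconv, using the r option and the docs option.
Split the resulting .tab file into chunks, replacing ukbXXXXX with the name of your data file: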
split -l 25000 -d --additional-suffix=.tab ukbXXXXX.tab ukbXXXXX_
-l specifies how many lines per file (25k lines will split it into 21 files)
-d specifies numeric suffixes
--additional-suffix lets us specify that we want the files to have the extension ".tab"

Open data_toDB.R and make sure that the filepath sourced at the beginning points to your copy of dataset.R.
Update the parameters in data_toDB.R with your own values, according to the documentation in the file:

download_runID = The prefix of the .html, .tab and .R files (eg ukb12345)
file_path = The path to the folder containing the three ukbXXXXX files
db_name = The name of the database to write to
chunks = The number of chunks the .tab file has been split into
mapping = [Optional] A mapping from UKB field IDs to human-readable names
stata = Whether you want to write the data in Stata-usable format as well as R

Run data_toDB.R, which calls the ukb_db() function from dataset.R. This applies the .R file generated by ukbconv to each chunk, so the variables in the database will already have had that basic formatting (eg labelling factor levels) applied.

To extract data from your database in R, see Extracting data in R.
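The value for chunks has to match the number of files that split actually wrote, so it can be worth checking before running data_toDB.R. The following is a minimal sketch, assuming GNU coreutils (for --additional-suffix). It uses a small stand-in file, demo.tab, and a chunk size of 4 in place of ukbXXXXX.tab and 25000; both are made up for illustration.

```shell
#!/bin/sh
# Sketch: split a tab-delimited file into fixed-size chunks and count them.
# demo.tab and the chunk size of 4 are stand-ins for ukbXXXXX.tab and 25000.
set -eu

workdir=$(mktemp -d)
cd "$workdir"

# Build a small stand-in file: 1 header line plus 10 data lines.
printf 'f.eid\tf.52.0.0\n' > demo.tab
i=1
while [ "$i" -le 10 ]; do
    printf '%s\t%s\n' "$i" "$((i % 12 + 1))" >> demo.tab
    i=$((i + 1))
done

lines_per_chunk=4

# Same options as the split command above, applied to the stand-in file.
split -l "$lines_per_chunk" -d --additional-suffix=.tab demo.tab demo_

# Count the chunk files that were actually written.
chunks=$(ls demo_*.tab | wc -l | tr -d ' ')

# Compare with ceiling(total lines / lines per chunk).
total_lines=$(wc -l < demo.tab | tr -d ' ')
expected=$(( (total_lines + lines_per_chunk - 1) / lines_per_chunk ))

echo "chunks written: $chunks, expected: $expected"
ls demo_*.tab
```

On a real download, the same count (`ls ukbXXXXX_*.tab | wc -l`) gives the number to enter for chunks.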
If you want to access data from the database using Stata, you will additionally need to generate the .do and .dct files from the download using ukbconv.
This might make the whole database system seem superfluous, but there is a benefit: if your data download is too large for ukbconv to convert it to a Stata .dta file in one go, it doesn't matter. We only need the .do and .dct files, and ukbconv makes those first, after which you can kill the process.
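One way to avoid watching the conversion by hand is to wait for the two files and then stop the process. The sketch below shows that pattern only: fake_convert is a hypothetical stand-in for the real ukbconv call so that the script runs anywhere, and the check assumes that a non-empty file is a finished file, which you should confirm for your own download.

```shell
#!/bin/sh
# Sketch: stop a long-running conversion once its .do and .dct files exist.
# fake_convert is a hypothetical stand-in for the real ukbconv invocation.
set -eu

workdir=$(mktemp -d)
cd "$workdir"
prefix=ukbXXXXX   # placeholder prefix, as in the text above

fake_convert() {
    # Writes the two small files first, then keeps "converting" for a long time.
    echo "* dictionary" > "$prefix.dct"
    echo "* do-file" > "$prefix.do"
    exec sleep 600
}

fake_convert &
pid=$!

# Poll until both files are present and non-empty (give up after about 60 s).
tries=0
while [ ! -s "$prefix.do" ] || [ ! -s "$prefix.dct" ]; do
    tries=$((tries + 1))
    if [ "$tries" -gt 60 ]; then
        echo "timed out waiting for $prefix.do and $prefix.dct" >&2
        kill "$pid" 2>/dev/null || true
        exit 1
    fi
    sleep 1
done

# Both files exist, so the rest of the conversion is not needed.
kill "$pid"
wait "$pid" 2>/dev/null || true
echo "stopped conversion; kept $prefix.do and $prefix.dct"
```

With the real tool, the ukbconv command would replace the body of fake_convert; the polling loop and the kill stay the same.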
To extract the data, follow the procedure described in Extracting Data in Stata.
When running ukb_db() you can also pass in the path to a mapping sheet giving human-readable names to the UKB variables. This will produce a CSV file containing all unnamed variables and their descriptions, so you can add them to your mapping sheet.
The variables are saved in the database with the raw UKB variable names (eg f.52.0.0). When you extract the data, you can supply a mapping sheet (the default is our standard renaming spreadsheet Renaming_List_UPDATE_Sep2020_TEU.csv) and the selected names will be applied to the extracted data.
By using the raw names in the database we can be more flexible with our choice of renaming: if we want to change one of the variable names in the spreadsheet, or convert to an entirely new naming system, we don't have to regenerate the database.
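The idea of renaming at extraction time can be illustrated outside R. The sketch below is not how the extraction scripts work internally; it assumes a hypothetical two-column mapping file (raw name, new name), which is simpler than the real renaming spreadsheet, and made-up readable names. It rewrites only the header row of an extract, leaving unmapped columns with their raw names.

```shell
#!/bin/sh
# Sketch: apply a name mapping to the header of an extract stored under raw names.
# mapping.csv, extract.tab and the readable names are hypothetical examples.
set -eu

workdir=$(mktemp -d)
cd "$workdir"

# Hypothetical two-column mapping: raw UKB name, human-readable name.
cat > mapping.csv <<'EOF'
f.eid,ID
f.52.0.0,MonthOfBirth
EOF

# A tiny extract with raw names, as they are stored in the database.
printf 'f.eid\tf.52.0.0\tf.53.0.0\n1\t4\t2009-01-01\n' > extract.tab

awk -F'\t' -v OFS='\t' '
    # First file: load the mapping (comma-separated) into an array.
    NR == FNR { split($0, kv, ","); name[kv[1]] = kv[2]; next }
    # Header row of the extract: swap in mapped names, keep unmapped ones raw.
    FNR == 1  { for (i = 1; i <= NF; i++) if ($i in name) $i = name[$i] }
    { print }
' mapping.csv extract.tab > renamed.tab

head -n 1 renamed.tab
```

Changing a name then means editing mapping.csv and re-running the rename; the stored data (extract.tab here) is untouched, which is the flexibility described above.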
The R package duckdb is still under development, which means that unfortunately new versions of the package are often not backwards compatible. This means a database written under one version of duckdb cannot be read by a later version.
Please consider using some form of package management, for example renv, to keep control over package versions.