# The structure of a main dataset

After running the helper programs, a researcher will have a UK Biobank main dataset. This page describes what that dataset looks like, focusing in particular on the meanings of the column headers.

A main dataset is rectangular, with one participant per row. Each column header gives the Showcase Data-Field number that the data in that column relates to, together with the “instance index” and “array index” of that item. Broadly speaking, the instance index distinguishes data for a Data-Field that were gathered at different times, and the array index distinguishes multiple pieces of data for that field that were gathered at the same time.

The column headers display differently depending on the format to which the dataset has been converted (see Table 2.13 at the end of this section). The example given in Table 2.12 below shows a small portion of a sample dataset as it would appear in .csv format opened in Excel:

<table data-header-hidden><thead><tr><th width="117"></th><th width="133"></th><th width="125"></th><th width="125"></th><th width="118"></th><th width="118"></th><th width="114"></th><th width="116"></th><th width="119"></th><th width="117"></th><th></th></tr></thead><tbody><tr><td> eid</td><td> 53-0.0</td><td> 53-1.0</td><td> 53-2.0</td><td> 20002-0.0</td><td> 20002-0.1</td><td> 20002-1.0</td><td> 20002-1.1</td><td> 20002-2.0</td><td>20002-2.1</td><td> …</td></tr><tr><td> 1256847</td><td> 11/04/2007</td><td></td><td> 03/01/2017</td><td> 1077</td><td></td><td></td><td></td><td> 1077</td><td>1075</td><td></td></tr><tr><td> 8645816</td><td>29/10/2009</td><td></td><td></td><td></td><td></td><td></td><td></td><td></td><td></td><td></td></tr><tr><td>4652658 </td><td>15/08/2009</td><td></td><td></td><td></td><td></td><td></td><td></td><td></td><td></td><td></td></tr><tr><td>2328974</td><td> 12/07/2008</td><td>09/03/2013</td><td></td><td></td><td></td><td> 1002</td><td></td><td></td><td></td><td></td></tr><tr><td> 3315794</td><td> 22/02/2010</td><td> 01/12/2012</td><td> 19/11/2018</td><td> 1111</td><td></td><td> 1111</td><td></td><td> 1111</td><td>1065</td><td></td></tr><tr><td>9497726</td><td> 25/02/2006</td><td></td><td></td><td></td><td></td><td></td><td></td><td></td><td></td><td></td></tr><tr><td>4582852 </td><td> 06/06/2008</td><td></td><td> </td><td> 1222</td><td>1265</td><td></td><td></td><td></td><td></td><td></td></tr><tr><td>…</td><td></td><td></td><td></td><td></td><td></td><td></td><td></td><td></td><td></td><td> </td></tr></tbody></table>

*Table 2.12: A portion of a sample main dataset*

The eid is the encoded participant identifier for the project in question. The remaining column headers are in the format `F-I.A` where `F` is the Data-Field number, `I` is the instance index and `A` is the array index.
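The `F-I.A` convention can be split into its three parts mechanically. The following is a minimal sketch, assuming the .csv header style shown in Table 2.12; the names `ColumnKey` and `parse_header` are hypothetical and not part of any UK Biobank tool.

```python
import re
from typing import NamedTuple, Optional


class ColumnKey(NamedTuple):
    """The three parts of an F-I.A column header."""
    field: int      # Data-Field number (F)
    instance: int   # instance index (I)
    array: int      # array index (A)


# Matches headers such as "53-0.0" or "20002-2.1".
_HEADER_RE = re.compile(r"^(\d+)-(\d+)\.(\d+)$")


def parse_header(header: str) -> Optional[ColumnKey]:
    """Return the parsed key, or None for headers such as 'eid'."""
    match = _HEADER_RE.match(header.strip())
    if match is None:
        return None
    field, instance, array = (int(part) for part in match.groups())
    return ColumnKey(field, instance, array)


key = parse_header("20002-2.1")
print(key)
```

Here `key.field` is 20002, `key.instance` is 2 and `key.array` is 1, while `parse_header("eid")` returns `None`.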

Two Data-Fields are shown in the sample dataset: [Data-Field 53](http://biobank.ctsu.ox.ac.uk/crystal/field.cgi?id=53) (Date of attending assessment centre) and [Data-Field 20002](http://biobank.ctsu.ox.ac.uk/crystal/field.cgi?id=20002) (Non-cancer illness code, self-reported). In each case there are three “instances” of the variable (the first number after the -). Using the “Instances” tab on the Data-Field 53 page on Showcase, or clicking on the “2” of “Instancing [2](https://biobank.ctsu.ox.ac.uk/crystal/instance.cgi?id=2)” on the Data-Field 20002 page, we can see that these correspond to the visit type: 0 for the initial (baseline) visit, 1 for the repeat assessment and 2 for the first imaging assessment. Instance 3, corresponding to repeat imaging, is omitted here for space reasons.

The columns 53-0.0, 53-1.0 and 53-2.0 therefore hold the dates on which each participant attended that particular type of assessment centre visit. In the above example data, all participants attended a baseline assessment centre (this would always be the case), but only two (2328974 & 3315794) attended the repeat assessment, one of whom (3315794) also attended an imaging centre. The first participant (1256847) attended an imaging centre, but did not attend the repeat assessment.
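The attendance pattern described above can be read directly from the Data-Field 53 columns. This is a rough sketch using three rows copied from Table 2.12 as inline .csv text; `SAMPLE_CSV`, `INSTANCE_LABELS` and `visits_attended` are illustrative names, and a real dataset would be read from a file rather than a string.

```python
import csv
import io

# Three rows from Table 2.12, restricted to Data-Field 53.
SAMPLE_CSV = """eid,53-0.0,53-1.0,53-2.0
1256847,11/04/2007,,03/01/2017
2328974,12/07/2008,09/03/2013,
3315794,22/02/2010,01/12/2012,19/11/2018
"""

# Instance meanings for the assessment centre visits shown in the table.
INSTANCE_LABELS = {0: "baseline", 1: "repeat assessment", 2: "first imaging"}


def visits_attended(row):
    """Return the instance indices for which Data-Field 53 holds a date."""
    return [
        instance
        for instance in INSTANCE_LABELS
        if (row.get(f"53-{instance}.0") or "").strip()
    ]


attendance = {
    row["eid"]: visits_attended(row)
    for row in csv.DictReader(io.StringIO(SAMPLE_CSV))
}

for eid, instances in attendance.items():
    print(eid, [INSTANCE_LABELS[i] for i in instances])
```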

At each assessment centre visit a participant can self-report illnesses, and these are recorded in Data-Field 20002. The illnesses are coded using Coding 6, as indicated on the Data-Field 20002 page on Showcase. Clicking on the “6” of “Coding [6](https://biobank.ctsu.ox.ac.uk/crystal/coding.cgi?id=6)” on that page gives the meanings of these codes.

For example, the participant with eid 3315794 self-reported asthma (code 1111) at each of their three assessment centre visits. As the first condition reported at each visit, it is assigned array index 0 (the final number in 20002-0.0 etc.). At their imaging assessment visit (instance 2) they also reported hypertension (code 1065); being the second condition reported at that visit, it is assigned array index 1, i.e. it appears in the column with header 20002-2.1.

Note that in reality Data-Field 20002 has array indices running from 0 to 33 (indicating at least one participant self-reported 34 illness codes), and so the real dataset would be considerably wider than that shown above, even with only these two Data-Fields in it.
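Because the values for one visit are spread across many array columns, it is often convenient to gather them into a single list. The sketch below assumes a row held as a dictionary keyed by .csv-style headers; `values_for` is a hypothetical helper, and the row is the 3315794 example from Table 2.12.

```python
import re


def values_for(row, field, instance):
    """Collect non-empty values for one Data-Field and instance, in array-index order."""
    pattern = re.compile(rf"^{field}-{instance}\.(\d+)$")
    found = []
    for header, value in row.items():
        match = pattern.match(header)
        if match and (value or "").strip():
            found.append((int(match.group(1)), value.strip()))
    # Sort by array index so the first-reported item comes first.
    return [value for _, value in sorted(found)]


row = {
    "eid": "3315794",
    "20002-0.0": "1111", "20002-0.1": "",
    "20002-1.0": "1111", "20002-1.1": "",
    "20002-2.0": "1111", "20002-2.1": "1065",
}

baseline_codes = values_for(row, 20002, 0)
imaging_codes = values_for(row, 20002, 2)
print(baseline_codes, imaging_codes)
```

The same function works unchanged on the real field with array indices 0 to 33, since it matches on the header pattern rather than a fixed list of columns.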

Note also that because Data-Field 20002 is a self-report field (i.e. reported at an assessment centre), it can only have data for a particular instance index if Data-Field 53 has a value for that same instance index. For example, since the participant with eid 4582852 attended only the baseline assessment, they can only have values for Data-Field 20002 with instance index 0.
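This dependency can be used as a simple sanity check on a downloaded dataset. The following sketch, again assuming dictionary rows with .csv-style headers, lists any instance where Data-Field 20002 has a value but Data-Field 53 has no date; `unattended_instances_with_reports` is a hypothetical name, and the second row is deliberately invalid data made up to show a failure.

```python
import re

# Captures the instance index of any Data-Field 20002 column.
_FIELD_20002_RE = re.compile(r"^20002-(\d+)\.\d+$")


def unattended_instances_with_reports(row):
    """Instances with a Data-Field 20002 value but no Data-Field 53 date."""
    reported = set()
    for header, value in row.items():
        match = _FIELD_20002_RE.match(header)
        if match and (value or "").strip():
            reported.add(int(match.group(1)))
    return sorted(
        instance
        for instance in reported
        if not (row.get(f"53-{instance}.0") or "").strip()
    )


# Participant 4582852 from Table 2.12: baseline only, two codes reported.
valid_row = {
    "eid": "4582852", "53-0.0": "06/06/2008", "53-1.0": "", "53-2.0": "",
    "20002-0.0": "1222", "20002-0.1": "1265", "20002-1.0": "",
}

# Invented row that breaks the rule: a code at instance 1 without a visit date.
invalid_row = dict(valid_row, **{"20002-1.0": "1111"})

print(unattended_instances_with_reports(valid_row))
print(unattended_instances_with_reports(invalid_row))
```

An empty list means the row is consistent with the rule described above.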

The instance index is not exclusively used to refer to the assessment centre visit. For example, the “Diet by 24-hour recall” fields (see [Category 100090](http://biobank.ctsu.ox.ac.uk/crystal/label.cgi?id=100090)) use instance 0 to refer to the baseline assessment centre (as above), but then instances 1 to 4 refer to the four on-line cycles of this questionnaire. As another example, reports from the cancer register (see [Category 100092](http://biobank.ctsu.ox.ac.uk/crystal/label.cgi?id=100092)) are given a new instance index for each additional type of cancer reported.

As indicated above, the column headers appear slightly differently depending on which package you are using. The various output formats display the headers as follows:

| **File type** | **Column header** | **Notes**                                                                                        | **Example** |
| ------------- | ----------------- | ------------------------------------------------------------------------------------------------ | ----------- |
| csv & txt     | `F-I.A`           |                                                                                                  | 31-0.0      |
| R             | f.`F.I.A`         | with f. preceding all fields                                                                     | f.31.0.0    |
| SAS & Stata   | a\_`F_I_A`        | a indicates the type of variable, e.g. a will be n for numerical fields and s for string fields. | n\_31\_0\_0 |

where, as previously, `F` represents the field number, `I` the instance index and `A` the array index.
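The mapping in the table above can be expressed as a small conversion function. This is a sketch only: `convert_header` is a hypothetical helper, the type prefix for SAS and Stata must be supplied by the caller because it depends on the field's data type, and returning non-matching headers such as `eid` unchanged is an assumption that may not match the real converted files.

```python
import re

# Matches .csv/.txt headers in the F-I.A style, e.g. "31-0.0".
_CSV_HEADER_RE = re.compile(r"^(\d+)-(\d+)\.(\d+)$")


def convert_header(csv_header, target, type_prefix="n"):
    """Rewrite an F-I.A header in the style used for R, SAS or Stata."""
    match = _CSV_HEADER_RE.match(csv_header.strip())
    if match is None:
        # Assumption: leave headers such as "eid" untouched.
        return csv_header
    field, instance, array = match.groups()
    if target == "r":
        return f"f.{field}.{instance}.{array}"
    if target in ("sas", "stata"):
        # type_prefix is "n" for numerical fields and "s" for string fields.
        return f"{type_prefix}_{field}_{instance}_{array}"
    raise ValueError(f"unknown target format: {target!r}")


r_name = convert_header("31-0.0", "r")
stata_name = convert_header("31-0.0", "stata", type_prefix="n")
print(r_name, stata_name)
```

With the example from the table, `r_name` is `f.31.0.0` and `stata_name` is `n_31_0_0`.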

