## Getting Started
Please complete the following steps to get started as a contributor.
- You will need an account on the BYU Supercomputer. If you don’t currently have an active account, please go to https://marylou.byu.edu and click on “Request an Account.” Read the information on that page. Then request an account, listing me as a mentor. When it asks what type of work you plan to perform, explain briefly that you will execute custom Python scripts, that you expect to execute only single-core jobs, and that these jobs will typically require 1-16 GB of memory per job.
- If you haven’t already done so, create a GitHub account.
- Send an email to me with your GitHub user ID and request to be added as a contributor to the WishBuilder repository.
- After you receive access to the Supercomputer, log in to it. At the command line, enter the following command (but substitute your actual email address where it says your_email@example.com):

```bash
ssh-keygen -t rsa -b 4096 -C "your_email@example.com"
```
- When it asks you to “Enter a file in which to save the key,” press Enter. This uses the default file location.
- When it asks you to enter a passphrase, press Enter (twice).
- Now there should be a file at ~/.ssh/id_rsa.pub. Enter the following command to display the contents of this file. Then copy the output to your clipboard. This is your public key; it enables you to connect from Linux to GitHub without a password.

```bash
cat ~/.ssh/id_rsa.pub
```
- Go to https://github.com/settings/keys. This should display the SSH keys that are currently specified for your GitHub account. Click on “New SSH key”, enter a Title (maybe “FSL”), paste the public key from your clipboard, and click on “Add SSH key.”
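Optionally, you can confirm that the key works by testing the connection to GitHub; if everything is set up correctly, GitHub responds with a short greeting that includes your username:

```bash
ssh -T git@github.com
```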
## Preparing a Dataset
Please complete the following steps for each dataset that you prepare. Let me know if you have any questions or run into any problems.
- Examine the list of open issues. Each “issue” represents a dataset that needs to be prepared. Identify one issue that you would like to work on (and that nobody else is currently working on).
- Send an email to me indicating which issue you would like to work on.
- At the command line on the Supercomputer, clone the WishBuilder git repository:

```bash
git clone https://github.com/srp33/WishBuilder.git
cd WishBuilder
```
- Or if you previously cloned the WishBuilder git repository, make sure it is up to date:

```bash
git pull origin master
```
- Create a new branch on your copy of the git repository (see below). Replace `<new-branch-name>` with the ID of the dataset you are working with (the ID will be listed under the issue).

```bash
git checkout -b <new-branch-name>
```
- Create a new directory within your branched repository; the name of this directory should also be the ID of the dataset you are working with.
- Now `cd` into the new directory.
- Write a bash script called `download.sh` that downloads the data file(s) from the source location to the current directory. You can see an example here.
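For illustration, here is a minimal sketch of what `download.sh` might look like. The URL and file name are placeholders, not the actual source for your dataset:

```bash
#!/bin/bash

# Download the raw data file(s) into the current directory.
# The URL below is a placeholder; use the source location listed in the issue.
wget -O raw_data.txt.gz "https://example.com/path/to/raw_data.txt.gz"
```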
- Open the data file(s) in a text editor and examine them to understand how they are structured. (If the data file is too large for a text editor, use commands such as `head`, `tail`, and `less` to examine the file.)
- Using a text editor, create test files called `test_metadata.tsv` and `test_data.tsv`. Below you can learn about the purpose of these files and how they should be structured. You can see examples here.
- Write a bash script called `parse.sh`. This script should parse the downloaded data file(s) and reformat the data (as needed) into the output format described below. In most cases, `parse.sh` will invoke script(s) written in Python. The names of the output files must be `metadata.tsv.gz` and `data.tsv.gz`. Recommendation: work with a smaller version of the data file(s) initially, so it is easier to test. You can see an example here.
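As a rough sketch (assuming a hypothetical helper script named `parse.py` that reads the raw file and writes the two output files), `parse.sh` might look something like this:

```bash
#!/bin/bash

# Load Python if it is not already available (see Notes below).
module load python/3.5

# parse.py is a hypothetical helper script; name yours however you like.
# It should read the raw file(s) and write metadata.tsv and data.tsv.
python parse.py raw_data.txt.gz metadata.tsv data.tsv

# The output files must be gzipped.
gzip -f metadata.tsv data.tsv
```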
- Write a bash script called `install.sh` that installs any software that is necessary to execute `parse.sh`. If no extra software must be installed, it can be blank. You can see an example here.
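For example, if your parsing code depended on a Python package, `install.sh` might contain something like the following (pandas is only an illustration; the `--user` flag installs into your home directory, which is usually appropriate on a shared system):

```bash
#!/bin/bash

# Install any Python packages that parse.sh depends on.
# pandas is only an example; list whatever your scripts actually need.
pip install --user pandas
```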
- Compare `metadata.tsv.gz` against `test_metadata.tsv`. Make sure the metadata values were parsed correctly.
- Compare `data.tsv.gz` against `test_data.tsv`. Make sure the data values were parsed correctly.
- Create a bash script called `cleanup.sh`. Within that script, use the `rm` command to delete `metadata.tsv.gz`, `data.tsv.gz`, and any other non-script files. Please do not commit (see next step) any data files to GitHub.
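A minimal `cleanup.sh` might look like this; adjust the file list to match whatever non-script files your scripts produce (raw_data.txt.gz is a placeholder for your downloaded file):

```bash
#!/bin/bash

# Delete output and downloaded files so they are not committed to GitHub.
rm -f metadata.tsv.gz data.tsv.gz raw_data.txt.gz
```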
- Create a Markdown-formatted file called `description.md` that provides a brief description of the dataset. The first line of the file should be a 2nd-level header (starting with `##`) that briefly describes the dataset. The rest of the file should contain additional details about the dataset, including its source, what the data can be used for, etc. Please separate each paragraph with 2 newline characters. You can see an example here.
- Add, commit, and push your changes to the branch that you created earlier. Replace `<message>` with a brief message that describes the work you have done. Replace `<new-branch-name>` with the name of the branch you created previously.

```bash
git add --all
git commit -m "<message>"
git push origin <new-branch-name>
```
- Go here to create a GitHub pull request. Put “master” as the base branch and your new branch as the compare branch. Click on “Create pull request”. We will then check to make sure your code is working properly. If it is, we will integrate your code into the WishBuilder repository.
## Notes
- Python 3.5 is installed on the Supercomputer; use `module load python/3.5`.
- R is also installed on the Supercomputer; use `module load r/3.3`.
- As you write your parsing scripts, please make sure they use no more than 4 GB of memory.
- For larger datasets, avoid reading the whole file into memory. You can test your `parse.sh` script on the Supercomputer, but please request no more than 4 GB of memory.
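For instance, one memory-friendly pattern (a sketch, assuming gzipped, tab-delimited input with a placeholder file name) is to stream the file one line at a time:

```bash
# Process a gzipped file line by line instead of reading it all into memory.
zcat raw_data.txt.gz | while IFS=$'\t' read -r -a fields; do
    # Work with one row at a time; here we just print the first column.
    echo "${fields[0]}"
done
```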
- You can download files when you are executing code on the interactive nodes of the Supercomputer. But the compute nodes do not have access to the Internet.
- If you create temporary files, please store these within the same directory as your scripts (or a subdirectory). This will ensure that everything needed to process each dataset is contained within the same location.
- When you specify file or directory paths in your scripts, please use relative rather than absolute paths.
## Test files
We will use your test files to verify that your scripts are working properly. We will execute your scripts and then verify that the data values produced by your scripts match the data values in the test files, even though the format of these files will be different. You will need to create the test files using a text editor.
The following table shows how your test files should be structured. The files should be tab delimited and should contain a header line with column names as shown below.
| Sample | Variable | Value |
|---|---|---|
| TCGA-01-1234 | Age | 34 |
| TCGA-01-1234 | Sex | M |
| TCGA-01-1234 | BRCA1 | 1 |
| TCGA-01-1234 | BRCA2 | 0 |
| TCGA-02-5678 | Age | 92 |
| TCGA-02-5678 | Sex | F |
| TCGA-02-5678 | BRCA1 | 0 |
| TCGA-02-5678 | BRCA2 | 1 |
| … | … | … |
You should create two test files: test_metadata.tsv and test_data.tsv. The first (test_metadata.tsv) should contain metadata values as described in the GitHub issue for your dataset. The second (test_data.tsv) should contain regular data values as described in the GitHub issue.
Each of these files should have at least 8 lines of data (not including the header). These lines should contain data values that you have extracted by hand from the input file(s). Please include data values for at least two different samples and at least two different variables in each input file. Include at least one sample/variable from the beginning of each file and at least one from the end of each file. Also include at least one sample/variable from the far-left side of each input file and at least one from the far-right side of each input file.
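If the input file is large, commands along these lines can help you pull values from its corners (the file name is a placeholder, and the column ranges are only examples):

```bash
# First rows, leftmost columns.
head -n 3 raw_data.txt | cut -f 1-3

# Last rows, leftmost columns.
tail -n 3 raw_data.txt | cut -f 1-3

# First rows, rightmost two columns.
head -n 3 raw_data.txt | awk -F'\t' '{print $(NF-1), $NF}'
```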
## Output file format
Your scripts should produce two tab-delimited text files: metadata.tsv.gz and data.tsv.gz (see below). Geney will import these data files.
metadata.tsv.gz should be structured the same as test_metadata.tsv, except that it should contain all metadata values and should be gzipped.
The table below illustrates how data.tsv.gz should be structured. All of the sample names should be unique. All of the column names should be unique. The name of the first column should be “Sample”. This file should be gzipped.
| Sample | Age | Sex | BRCA1 | BRCA2 | … |
|---|---|---|---|---|---|
| TCGA-01-1234 | 34 | M | 1 | 0 | … |
| TCGA-02-5678 | 92 | F | 0 | 1 | … |
| … | … | … | … | … | … |
The sample identifiers listed in metadata.tsv.gz and data.tsv.gz should match: neither file should contain a sample identifier that is not present in the other (exclude all non-overlapping samples from both files).
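If it helps, here is one quick way to spot non-overlapping sample identifiers from the command line (assuming the sample identifier is the first column of each file). No output means the two files contain exactly the same set of identifiers:

```bash
# List sample identifiers that appear in only one of the two files.
comm -3 <(zcat metadata.tsv.gz | cut -f 1 | sort -u) \
        <(zcat data.tsv.gz | cut -f 1 | sort -u)
```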