split
scriptFirst, we will split our .fastq files using the run_BestRadSplit_PstI.sh
bash
script provided by the Miller Lab.
The script gets fed the R1 and R3 .fastq files and we give our output a name, typically the standard name. (e.g. SOMM185_ACTTGA
below)
sbatch -p high -t 24:00:00 ../run_BestRadSplit_PstI.sh SOMM185_R1_ACTTGA.fastq SOMM185_R3_ACTTGA.fastq SOMM185_ACTTGA
In a new screen, compress the output of the split
script:
srun -t 24:00:00 gzip *ACTTGA.fastq
align
scriptNext, we will make two lists of all of the .fastq file names and then paste them together. We’ll then feed this list to a run_align
script with a reference genome (available online).
ls *RA* | sed "s/\.fastq\.gz//g" > listA
ls *RB* | sed "s/\.fastq\.gz//g" > listB
Here, selected all of the files with RA and RB located inside the file name (ls *RA*
) and then substituted (s
) .fastq and .gz for a blank space (//
). This gives us just the file names without the extensions.
paste listA listB > list_ACTTGA
sbatch -t 24:00:00 ../run_align.sh list_ACTTGA ../../Mola/B_imp_ref_genome/GCF_000188095.1_BIMP_2.0_genomic.fna
To align, we combine the two lists into one list (in this case for plate ACTTGA
). We then sbatch
the run_align.sh
script providing it with our list and a reference genome (in this case Bombus impatiens).
To finish the split/align section, I like to organize the output files. This is especially handy when we’re working plate-by-plate rather than in a full run of all plates at once.
mkdir aligned_files filtered_bams slurm_outputs zipped_fastq
Then move the various file types into their respective folders (with mv
).
It’s also good to check the length of the bams file (should be 96) using ls filtered_bams/ | wc
.
In the end, 4 subdirectories, the list, and the two original .fastq files should remain in the plate directory.
There are various ways to filter read quality, and some of them will be done later on at the loci level when we call SNPs. However, we can subset to individuals on a minimum number of aligned reads. This helps filter out poorly sequenced individuals who might mess up downstream loci selection and be unusable anyway.
need to insert info on subsample script
For sibship analysis, we want loci that are the most informative within our specific “population” or sample of interest. For example, we know that workers in 2012 won’t be full-sibs of workers in 2016, so we don’t want to SNP call with all of those individuals mixed together. You can likely skip this step for other types of analyses and call SNPs first, then subset for downstream analyses.
Copy over the bamclst_lists.R file (This handy script was originally written by Ryan Peek).
cp ~dir/bamclst_lists.R /dir/.
Refer to the script for exact details.
AND/OR
.clst
file you can use for PCA plots (this requires output from pca_calc.sh
script as well)After subsetting to our bamlist to the subpopulation of interest, we can call SNPs using angsd
. The binary will need to be installed on your user cluster.
A basic angsd call looks like this:
~/bin/angsd -bam $list -GL 1 -out $output -doMaf 2 -minInd 20 -doMajorMinor 1 -SNP_pval 0.000001 -doGeno 4 -doPost 2 -postCutoff 0.95 -minMaf 0.005
Wrapping this in a shell script is probably the way to go. I have one called ~johnmola/scripts/genoget.sh
.
$list
)$output
).arg
).mafs
).geno
).gz
files. Just unzip.angsd
commands/callsangsd
calls use a numbering system. For instance, for the -doGeno
command, if you want angsd
to write the major and minor alleles, you give it the number “1”, if you want posterior probability, you provide “8”. However, if you want both of those values, you provide it 8+1 = “9”
See the angsd
wiki for more information. LINK
In order to run COLONY successfully, we need to select loci that are at least:
Additionally, it’s preferable that we customize the following as well:
In the Geno2Colony.pl
file, we will customize the following parameters. Nothing else needs to be edited!! We’ll go through them in detail.
$geno = $ARGV[0];
$list = $ARGV[1];
#SNP Filters
$max_mis = 0.25;
$min_maf = 0.01;
$max_chi = 3.84;
#SNP Drawing
$num = 300;
$dist = 10000;
#Colony
$type = 0;
$adrop = 0.05;
$gerror = 0.05;
$geno
is the list of genotypes for all of your individuals. This is the .geno
output from angsd.
$list
is a header file needed for matching the names to their genotypes. It’s basically just your bamlist, but preferably with human-readable names.
$max_mis
The maximum number of individuals that can be missing a retained SNP
$min_maf
The minimum minor allele frequency for a SNP to be retained
$max_chi
Basically assumes HWE if set at 3.84 (don’t change)
$num
Set how many loci you want. The script will randomly pull, but if you set the number really high, it will go until it runs out.
$dist
How far apart in base pairs should SNPs be. Typically 10,000. (To ensure linkage equilibrium)
$type
The type of marker. In this case 0 is the only appropriate answer for SNPs.
$adrop
The allele dropout rate. Assuming 0.05 is usually good.
$gerror
The genotype error rate. Assuming 0.05 is usually good.
Save the output of this perl script, we’ll feed it to COLONY next. I usually name it with a convention based on the first 5 settings. E.g. The name for the code chunk above would be: projectname_m25_maf01_n300_d10k.DAT
meaning the name of the project max missing = 0.25, maf = 0.01, 300 loci, 10000 bp apart.
Download COLONY here.
We are almost ready to feed our file to COLONY!
We first need to finish the .DAT file that we outputted from the Geno2Colony.pl
script. COLONY requires a header and footer of options to be appended onto our offspring genotypes, and then all fed in with one file.
While this step can be automated, I actually prefer some level of interaction here to ensure everything is correct before proceeding. I have saved a typical COLONY header and footer, and then copy and paste them using vim
to the beginning and end of the file, respectively. I then ensure the settings are what I want before running COLONY with the folowing command:
srun -t 24:00:00 ./colony2s.ifort.out IFN:./m25_maf10_n500_r1.DAT
./
in the front of your file location.PLACEHOLDER: I’ve noticed that I use subset, subsample, and subpopulation interchangeably…but to mean different things. Go back through and make clear. Might also be handy to have a glossary to terms that are unique (like bamlist)/weird.
bamlist
-Split
sbatch -p high -t 24:00:00 ../run_BestRadSplit_PstI.sh SOMM185_R1_ACTTGA.fastq SOMM185_R3_ACTTGA.fastq SOMM185_ACTTGA
Compress
srun -t 24:00:00 gzip *ACTTGA.fastq
Make lists
ls *RA* | sed "s/\.fastq\.gz//g" > listA ls *RB* | sed "s/\.fastq\.gz//g" > listB
paste listA listB > list_ACTTGA
Align
sbatch -t 24:00:00 ../run_align.sh list_ACTTGA ../../Mola/B_imp_ref_genome/GCF_000188095.1_BIMP_2.0_genomic.fna
Tidy
mkdir aligned_files filtered_bams slurm_outputs zipped_fastq
then mv
Check
ls filtered_bams/ | wc
Remove extension
ls *flt.bam | sed 's/\.bam//g' > sublist
Run sub sample
sbatch -t 24:00:00 sub_sample.sh sublist 100000
Make new bamlist
ls *_100000.bam > _subbamlist
Follow
bamclst_lists.R
Edit and run genoget.sh
sbatch -p high --mem 32G -t 72:00:00 ~/scripts/genoget.sh vm_30_bam vm3_30_out
Edit and run Geno2Colony.pl
perl ~/scripts/Geno2Colony.pl ../mp_v5_2017_03_16_noclones.geno ../v3_perl_headers > m25_maf01_n75
Add header and footer
vim
Run COLONY
srun -t 24:00:00 ./colony2s.ifort.out IFN:./m25_maf10_n500_r1.DAT