bams
samtools
Subsampling is a way to filter your data. The subsampling script (below) will select bamfiles that meet a certain threshold of alignments, and then randomly subsample to meet that exact number of alignments. So you are comparing all individuals with the exact same number of alignments.
_.sort.bam
), etc.
samtools flagstat ALAME_AH_2_R1.sort.flt.bam
94696 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
94696 + 0 mapped (100.00%:-nan%)
94696 + 0 paired in sequencing
47492 + 0 read1
47204 + 0 read2
94696 + 0 properly paired (100.00%:-nan%)
94696 + 0 with itself and mate mapped
0 + 0 singletons (0.00%:-nan%)
1458 + 0 with mate mapped to a different chr
655 + 0 with mate mapped to a different chr (mapQ>=5)
Look at sizes of alignments
du -hs *.sort.flt.bam | sort -h
Need to pick a threshold for sizes, we generally like alignments over 100,000, as it’s conservative but robust.
Example output (notice first few samples didn’t have very many reads)
332K NFAME_IH_3_R1.sort.flt.bam
468K NFAME_IH_2_R1.sort.flt.bam
8.0M ALAME_AH_2_R1.sort.flt.bam
25M NFMFA_SC_3_R1.sort.flt.bam
31M NFAME_IH_1_R1.sort.flt.bam
33M MFAME_GC_1_R1.sort.flt.bam
34M NFAME_RR_1_R1.sort.flt.bam
36M ALAME_AH_1_R1.sort.flt.bam
37M RUBIC_PH_2_R1.sort.flt.bam
38M ALAME_AC_2_R1.sort.flt.bam
38M ALAME_AH_3_R1.sort.flt.bam
38M MFAME_GC_2_R1.sort.flt.bam
38M NFMFA_SC_2_R1.sort.flt.bam
39M NFAME_RR_2_R1.sort.flt.bam
44M RUBIC_PH_3_R1.sort.flt.bam
#!/bin/bash -l
#SBATCH -o slurm_outs/02c_subsample-%j.out
#SBATCH -J subsample
list=$1
num=$2
wc=$(wc -l ${list} | awk '{print $1}')
x=1
while [ $x -le $wc ]
do
string="sed -n ${x}p ${list}"
str=$($string)
var=$(echo $str | awk -F"\t" '{print $1}')
set -- $var
c1=$1
count=$(samtools view -c ${c1}.bam)
if [ $num -le $count ]
then
frac=$(bc -l <<< $num/$count)
samtools view -bs $frac ${c1}.bam > ${c1}_${num}.bam
fi
x=$(( $x + 1 ))
done
Make a bam sublist
Run a few PCA with these different thresholds
ls *flt.bam | sed 's/\.bam//g' > sublist
Subset to a certain number, here 100k reads
sbatch sub_sample.sh sublist 100000
Create a new subsampled bamlist
ls *_100000.bam > _subbamlist