Illumina assembly using velvet

From Bioinformatics Core Wiki
Jump to: navigation, search

Single End assembly

Here we will assemble some Illumina E. coli data that are single-end and paired-end reads.

1. First, make your way into the Illumina directory:

cd bioinfo_course/illumina

2. For Illumina, ideally you want 20-30X kmer coverage, so sometimes you have to only take a subset of all the reads. In order to do that, we will use a custom script:

subset_fastq.pl

3. We need the number of records in the file, so we count the lines....

wc -l SRR001666_1.fastq

4. ...and we divide by four. Then we run the command:

subset_fastq.pl --input SRR001666_1.fastq --output SRR001666.single_subset.fastq --length 7047668 --fraction 2

5. Let's check the average quality values per base:

fastq_sanger_qual_hist.pl SRR001666.single_subset.fastq SRR001666.single_subset.histogram.out

6. Now let's run the first stage of the velvet assembly. We will use a kmer length of 25:

velveth
velveth assembly_single 25 -fastq -short SRR001666.single_subset.fastq

7. Now run velvetg and let it infer the expected coverage and coverage cutoff, and add the amos file option:

velvetg
velvetg assembly_single -exp_cov auto -cov_cutoff auto -amos_file yes

8. Use the count_fasta.pl script to get statistics about the fasta file of the assembly:

count_fasta.pl assembly_single/contigs.fa

9. Now let's look at the assembly using EagleView. First we would convert the velvet amos file (velvet_asm.afg) to ace format. Here is the command you would run:

amos2ace assembly_single/velvet_asm.afg

BUT DON'T RUN IT! The command takes 30 minutes to run. Instead, I have already run the command and the output file (velvet_asm.ace) is in this directory. Let's look at it using EagleView.

10. Now, let's do a lastz alignment (to E. coli) to display on the Genome Browser. First go into the directory:

cd lastz_single

11. Now run the lastz command, using a relative path for the file:

lastz ecoli.fasta assembly_single/contigs.fa --format=maf --output=velvet_assembly_single.maf --verbosity=1

12. We have to convert some of the lines in the maf file to work in the Genome Browser:

sed -i 's/chr1/escCol1.chr1/g' velvet_assembly_single.maf

13. Now we convert the maf file to bed format in order to display it on the browser:

maf2bed.pl velvet_assembly_single.maf velvet_assembly_single.bed escCol1

14. We have to add a header line to the bed file so we can upload it as a custom track:

nano velvet_assembly.bed

Add this line to the top: track name="velvet single assembly"

15. Now we copy the file over to our machine and load the file into the UCSC genome browser as a custom track.

Paired End assembly

16. First we do the subsetting. Start by counting the lines, as above:

wc -l SRR001666_1.fastq SRR001666_2.fastq

17. Do the subsetting. Soon we will compare the single ended assembly to the paired-end assembly. In order for the comparison to be fair, we must use the same total number of reads. Therefore each paired end file will contain 1/4 of the reads:

subset_fastq.pl --input SRR001666_1.fastq --output SRR001666_1.paired_subset.fastq --length 7047668 --fraction 4
subset_fastq.pl --input SRR001666_2.fastq --output SRR001666_2.paired_subset.fastq --length 7047668 --fraction 4

18. Velvet expects a single fastq file for paired-end reads, so we have to shuffle (forward, reverse, forward, reverse, etc.) the two files together:

shuffleSequences_fastq.pl SRR001666_1.paired_subset.fastq SRR001666_2.paired_subset.fastq SRR001666.paired_all_subset.fastq

19. Now do the assembly as above, except using shortPaired:

velveth assembly_paired 25 -fastq -shortPaired SRR001666.paired_all_subset.fastq

20. Run velvetg, with an insert length of 500. This length is from the Illumina run itself. It is necessary for paired-end reads to work.

velvetg assembly_paired -exp_cov auto -cov_cutoff auto -ins_length 500

21. Run count_fasta.pl to gets stats:

count_fasta.pl assembly_paired/contigs.fa

22. Now run lastz. Go into the lastz_paired directory:

cd lastz_paired

23. Run lastz:

lastz ecoli.fasta assembly_paired/contigs.fa --format=maf --output=velvet_assembly_paired.maf --verbosity=1

24. Correct the maf file for the Genome Browser:

sed -i 's/chr1/escCol1.chr1/g' velvet_assembly_paired.maf

25. Convert maf file to bed file:

maf2bed.pl velvet_assembly_paired.maf velvet_assembly_paired.bed escCol1

26. Add track line for Genome Browser to bed file:

nano velvet_assembly_paired.bed

27. Upload track to Genome Browser.