Data Analysis
From Bioinformatics Core Wiki
Perl Scripts
UAYOR (use at your own risk)! YMMV (your mileage may vary)! (email ucdbio at gmail d0t com with bug reports)
- fastqForensics.pl .. simple report on possible quality encoding formats for a fastq file (Joe Fass)
- count_fasta.pl .. obtain length histogram, GC-content, etc. for sequences in a fasta-format file (Brad Sickler / Joe Fass)
- Nx.pl .. calculate "Nx" stat for a set of sequences in fasta format (Joe Fass)
- fasta1line.pl .. remove newlines to put all sequence on one line following header line, for all sequences in a fasta-format file (Joe Fass)
- fakefastq.pl .. need fastq, and you only have fasta? fake it! (Joe Fass)
- rc.pl .. reverse complement a set of fasta-format sequences (Joe Fass)
- rcFastq.pl .. reverse complement a set of single-line fastq sequences (Joe Fass)
- trim.pl .. trim paired-end fastq files based on quality using a variety of trimming methods (Nikhil Joshi)
- subsequence.pl .. cut out a sub-sequence from sequences and qualities in a fasta/q-format file (Joe Fass)
- SeqQA.pl .. Sequence qualitative analysis for fasta and fastq files (Hans Vasquez-Gross)
- IllQ2SanQ.pl .. Convert Illumina (pipeline 1.3 and above) fastq format to Sanger fastq format (cat sequence.txt | ./IllQ2SanQ.pl > sequence.fastq) (Joe Fass)
- illTrim.pl .. trim Illumina read 3' ends at the first "bad" base .. takes and produces fastq (cat sequence.fastq | ./illTrim.pl > sequence.trimmed.fastq) (Joe Fass)
- trimBWAstyle.pl .. trimming script for oneline fastq, based on Heng Li's clipping algorithm implemented in bwa (for all-bad reads, substitutes one "N") (Joe Fass)
- trim.slidingWindow.pl .. trimming script for oneline fastq, using a sliding window; chucks reads that get trimmed too short (Joe Fass)
- subset_fastq.pl .. get subset of fastq records based on fraction or number of records desired (Nikhil Joshi)
- export2fastq.pl .. convert Illumina's "export.txt" format into fastq (no quality conversion, so equivalent to their "sequence.txt" files) (Joe Fass)
- fastq3pAdapterTrim.pl .. rudimentary 3'-adapter trimming; allows 1-mismatch down to a minimum length of adapter 5'-end (Joe Fass)
- SNPseqRetrieve.pl .. generate SNP sequences in a tab-separated-value format, including flanking sequence from read consensus or reference genome when no reads mapped (Joe Fass)
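For readers unfamiliar with the "Nx" statistic mentioned in the list above, here is a minimal Python sketch. It assumes the common definition (N50 is the length L such that sequences of length L or longer contain at least 50% of all bases); it is not the code of Nx.pl, and the function names nx and fasta_lengths are made up for illustration.

```python
def fasta_lengths(lines):
    """Yield the length of each sequence in an iterable of fasta-format lines."""
    length = None  # None until the first header line has been seen
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if length is not None:
                yield length
            length = 0
        elif length is not None:
            length += len(line)
    if length is not None:
        yield length


def nx(lengths, x=50):
    """Return the Nx length for a collection of sequence lengths.

    Assumed definition: the largest length L such that sequences of
    length >= L together contain at least x percent of all bases.
    """
    lengths = list(lengths)
    if not lengths:
        raise ValueError("no sequence lengths given")
    if not 0 < x <= 100:
        raise ValueError("x must be in the range (0, 100]")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Integer comparison avoids floating-point rounding at the threshold.
        if running * 100 >= total * x:
            return length
```

Usage might look like nx(fasta_lengths(open("contigs.fasta")), 50), where contigs.fasta is a placeholder file name.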
Complex Data Analysis
- Targeting data sets across a broad biological spectrum, from DNA and protein to complex, family, system, and population, as well as dynamic features such as expression and simulation. Activities include data mining, statistics, and functional and evolutionary analysis.
- Building custom tools and databases, and providing special programming/algorithm support to facilitate data analysis.
- MPI Blast
Work-Arounds
- for *nix command line
- for bioinformatics tools
- convert Solexa realign files to UCSC BED format
- loop in bash
- running qsub on every node but one
Projects
- Large-scale, novel pipeline using 454 data to mine SNPs from the macaque genome. SNP Pipeline
- Custom program designed to show gene expression and annotation overlaid on a circular representation of a bacterial genome. Circular Microarray
- Antibody family classification
- calculate protein family based on BLASTCLUST
- BASE (microarray database) .. quick start
- iProtein
- iCitrus
- Motif Analysis of Solexa Data
- Finding motifs using sissrs.pl
- Popular IDs
- Arabidopsis arenosa whole genome assembly
- base-calling using phred, sequence trimming using lucy, t_coffee and blastclust
- Using TheGPM
- GBrowse
- Blast2Go
- blastpipe
- next generation (454/solexa) sequence assembly using velvet
- next generation (solexa/ABI) sequence alignment using maq
- 454 result files
- phaseolus project
- consensus/SNP/indel calling with samtools
Tactics
- HOWTO grab sequences in a particular taxonomic branch from nr
- Loading large tab or comma-delimited files in R
One-Liners
- create fake sequences (e.g. for CAP3 / PCAP) from a fasta file
- remove blank spaces from filenames
- sort sequences in fasta format by length
- quick/dirty simulation of Illumina reads from a fasta reference
- remove extended ASCII characters from files (like genepix microarray .gpr files!)
- reformat fasta sequences or qualities to have fixed width
- find single character frequencies in a text file
- print every k-mer in a sequence
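As a rough illustration of the last two items above, the following Python sketch prints every k-mer in a sequence and counts single-character frequencies. It is not the wiki's own one-liner; the function names kmers and char_frequencies are hypothetical.

```python
from collections import Counter


def kmers(seq, k):
    """Yield every overlapping k-mer in seq, left to right."""
    if k < 1:
        raise ValueError("k must be at least 1")
    # A sequence shorter than k yields nothing.
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def char_frequencies(text):
    """Count every character in text, including whitespace and newlines."""
    return Counter(text)


if __name__ == "__main__":
    for kmer in kmers("ACGTAC", 3):
        print(kmer)
    for char, count in sorted(char_frequencies("ACGTAC").items()):
        print(char, count)
```

The k-mers overlap by k-1 bases, so a sequence of length n yields n-k+1 of them.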
Software Tips
FAQs
- RNAseq experimental design
- A guide to running the next-gen assembler Velvet, written by its author, Daniel Zerbino