Puneet Wadhwa's BIOINFORMATICS BLOG

Friday, March 24, 2006

Extracting ORF (CDS portion) from Sequence using Bioperl

Hey Readers:

I have been recently wrestling with creation of an ORF Database from Refseq Human Sequences, by extracting out the CDS portion of the sequence from the Genbank record. I needed this database for my project since we were analyzing our proprietary Incyte Gene Collection (IGC) for exact matches to Refseq sequences over the ORF (CDS) region.

Building this database was not easy, and I hope others could learn from this experience of mine...

I first downloaded the RAW Genbank files (.gbff) and then extracted them into a folder. I then searched these files using unix commands like grep etc. to find out which files contained Human species. And then, I wrote a regular expressions based C#.NET GBFF Parser, which used regular expressions to extract out the features from the record like gene name, definition, locus, CDS, species etc. I then grabbed the sequence from the Genbank record using my parser, and then stripped out the CDS part using my program, from the CDS begin to the CDS end locations on the sequence. This was NOT very easy and certainly not the best way.

I now have a surefire way of extracting ORF's from the Genbank records, and it is very elegant, uses very less amount of code, and has been written in Bioperl. Attached below, is the code to do the same. However, is you use or distribute this code, please leave its header and author information intact, and provide a link back to my website.

!/usr/bin/perl

#===========================================
# ORF DATABASE BUILDER UTILITY - Puneet Wadhwa #
# http://puneetwadhwa.blogspot.com * pwadhwa@gmail.com #
#-------------------------------------------
# The purpose of this utility is to fetch the Genbank #
# record live from the Genbank website, and strip out #
# and display the CDS sequence from the record. #
# Takes a list of NCBI Accessions as input and outputs a #
# fasta record. #
# USAGE: cat input_file ./GenerateOrfDatabase.pl #
# Input file contains NCBI accessions (one per line) for #
# which the CDS is to be extracted. #
#===============================================

use Bio::DB::GenBank;
use Bio::SeqIO;
use Bio::SeqIO::swiss;
use Bio::Seq;
use Bio::SeqIO::FTHelper;
use Bio::SeqFeature::Generic;
use Bio::Species;
use Bio::AnnotationI;
use Bio::FeatureHolderI;
use Bio::SeqFeatureI;

my $gb = new Bio::DB::GenBank;
my $seqout = new Bio::SeqIO(-fh => \*STDOUT, -format => 'fasta');

%acc_hash = 0;

while (<>)
{
chomp;
$acc = $_;

my $seq_obj = $gb->get_Seq_by_acc($acc);

foreach my $feat ( $seq_obj->top_SeqFeatures )
{
if ( $feat->primary_tag eq 'CDS' )
{
my $cds_obj = $feat->spliced_seq;
print ">".$seq_obj->display_id()."\n".$cds_obj->seq."\n";
}
}
}
0;

Wednesday, March 15, 2006

A thousand clones...

Hey Readers:

Following is the latest article about my project which appeared on my company website's blog. We have now finished the complete study and analysis of annotating the Incyte Gene Collection and have discovered 1072 novel genes which are not available in any other commercial collections.

Here is the article:

A thousand clones

To be exact,1072. That is the number of human Incyte Gene Collection (IGC) clones that we found to contain an exact match to an entire RefSeq CDS (January 6, 2006 release), but for which there is no exact match to an MGC clone (February 23, 2006 release). At the risk of sounding self-congratulatory: I told you so! (See previous blog.)In this collection, which I will refer to as the IGC Non-MGC Set, you will find both novel clones and NOVEL CLONES. In many cases, there are one or more close MGC counterparts that differ from the IGC/RefSeq sequence by only one or two base pairs. Some may well be legitimate single nucleotide polymorphisms (SNPs) that did not have the good fortune to be included in RefSeq. These may or may not be functionally distinct from the IGC clone. In other cases, the closest MGC counterpart is a splice variant of the IGC/RefSeq sequence and likely codes for a polypeptide with a distinct function. In still other cases, there is no MGC counterpart within the same UniGene cluster.Never mind all that, you might be thinking with some impatience. How many druggable genes are there? I have not yet attempted grouping into gene families, but I did quickly spot some caspases, cytochrome p450s, and an adenylate cyclase. I encourage you to have a look for yourself by downloading the new spreadsheet from our website by again navigating to Genomics > Mammalian Resources > cDNAs > Incyte Gene Collection and clicking on the data file icon under the ordering information for IHS1380. You will note that there are actually 1135 line items, because there are sometimes multiple RefSeq accessions corresponding to the same CDS.By the way, all of the IGC clones containing an exact match to an entire RefSeq CDS (a total of 4116 clones) can now be found using our online clone query when searching by RefSeq accession, gene symbol, or UniGene cluster. So next time you search on our website, you are even more likely to turn up a useful clone. More to come…

Thursday, March 02, 2006

Waking a sleeping giant: annotating the Incyte Gene Collection (IGC)

Hey Readers:

Following is an article about my project which appeared on my company website's blog. We have got some great results from mining the IGC collection, and I am very very excited to share them with you.

Here it goes:
--------------
With over one million cDNAs for human, rat, monkey, and dog, the IGC is a monster collection that, on statistical grounds, is certain to contain some good stuff. Consider this: of the more than 400,000 human cDNAs, Incyte has categorized 11,377 as full length and 16,756 as potentially full length. However, while these latter clone sets have been fully sequenced, they were never submitted to GenBank and no annotations (if they ever existed) were passed along to Open Biosystems. Currently, the only way to mine the IGC is by BLASTing query sequences online against our Incyte clone database. The IGC is a potentially valuable resource, but with largely (and frustratingly) unknown content.

To enable better exploitation of the IGC, we have begun a high throughput annotation program with these goals: (1) to associate human RefSeq accession numbers with IGC clones when this can be done with high confidence and (2) to discover which human RefSeq-associated IGC sequences are not found in the Mammalian Gene Collection (MGC). I’d like to share with you some preliminary results from our pilot study.

We began by filtering the ~28,000 full-length and potential full-length human cDNAs by size and selecting the set of 367 sequences that are 3 kilobases or longer. These sequences were then BLASTed against every CDS in human RefSeq and filtered for 100% identity. Even at this high stringency there were 118 “hits”, that is, IGC sequences that contain a complete human RefSeq CDS. The 118 coding sequences were then BLASTed against the MGC, yielding 47 hits at 100% identity. Taking into account a few cases in which the same IGC sequence corresponded to two or more RefSeq accessions, there were 64 RefSeq-certified IGC cDNAs not found in the MGC.

For 25 of the brave new 64, there was no MGC clone within the same UniGene cluster; for the 39 others, one or more MGC clones were found in the same cluster. However, spot-checking of these MGC sequences confirmed that they are either apparent splice variants of the RefSeq CDS or contain single nucleotide substitutions. So far, so good*. If you are curious about these preliminary results, a spreadsheet can be downloaded from our website by navigating to www.OpenBiosystems.com > Genomics > Mammalian Resources > cDNAs > Incyte Gene Collection and clicking on the data file icon under the ordering information for IHS1380.

We have already begun BLASTing the larger set of 28,000 against the RefSeq coding sequences. Our goal is to identify 1000 RefSeq-certified human IGC cDNAs that are not in the MGC. Eventually we hope to dig deeper by identifying IGC cDNAs that are not perfect matches to any RefSeq CDS, but represent plausible splice variants or SNPs. I am also intrigued by the possibility of identifying IGC cDNAs that are completely outside RefSeq—representing entirely novel genes. We shall see!

Thursday, February 09, 2006

What is the HAPMAP?

What Is the HapMap?

This is an article that recently appeared on The International HapMap project, http://www.hapmap.org/.

The HapMap is a catalog of common genetic variants that occur in human beings. It describes what these variants are, where they occur in our DNA, and how they are distributed among people within populations and among populations in different parts of the world. The International HapMap Project is not using the information in the HapMap to establish connections between particular genetic variants and diseases. Rather, the Project is designed to provide information that other researchers can use to link genetic variants to the risk for specific illnesses, which will lead to new methods of preventing, diagnosing, and treating disease.
Figure 1: When DNA sequences on a part of chromosome 7 from two random individuals are compared, two single nucleotide polymorphisms (SNPs) occur in about 2,200 nucleotides.
The DNA in our cells contains long chains of four chemical building blocks -- adenine, thymine, cytosine, and guanine, abbreviated A, T, C, and G. More than 6 billion of these chemical bases, strung together in 23 pairs of chromosomes, exist in a human cell. (See http://www.dnaftb.org/dnaftb/ for basic information about genetics.) These genetic sequences contain information that influences our physical traits, our likelihood of suffering from disease, and the responses of our bodies to substances that we encounter in the environment.
The genetic sequences of different people are remarkably similar. When the chromosomes of two humans are compared, their DNA sequences can be identical for hundreds of bases. But at about one in every 1,200 bases, on average, the sequences will differ (Figure 1). One person might have an A at that location, while another person has a G, or a person might have extra bases at a given location or a missing segment of DNA. Each distinct "spelling" of a chromosomal region is called an allele, and a collection of alleles in a person's chromosomes is known as a genotype.

Differences in individual bases are by far the most common type of genetic variation. These genetic differences are known as single nucleotide polymorphisms, or SNPs (pronounced "snips"). By identifying most of the approximately 10 million SNPs estimated to occur commonly in the human genome, the International HapMap Project is identifying the basis for a large fraction of the genetic diversity in the human species.

For geneticists, SNPs act as markers to locate genes in DNA sequences. Say that a spelling change in a gene increases the risk of suffering from high blood pressure, but researchers do not know where in our chromosomes that gene is located. They could compare the SNPs in people who have high blood pressure with the SNPs of people who do not. If a particular SNP is more common among people with hypertension, that SNP could be used as a pointer to locate and identify the gene involved in the disease.

However, testing all of the 10 million common SNPs in a person's chromosomes would be extremely expensive. The development of the HapMap will enable geneticists to take advantage of how SNPs and other genetic variants are organized on chromosomes. Genetic variants that are near each other tend to be inherited together. For example, all of the people who have an A rather than a G at a particular location in a chromosome can have identical genetic variants at other SNPs in the chromosomal region surrounding the A. These regions of linked variants are known as haplotypes (Figure 2).

In many parts of our chromosomes, just a handful of haplotypes are found in humans. [See The Origins of Haplotypes.] In a given population, 55 percent of people may have one version of a haplotype, 30 percent may have another, 8 percent may have a third, and the rest may have a variety of less common haplotypes. The International HapMap Project is identifying these common haplotypes in four populations from different parts of the world. It also is identifying "tag" SNPs that uniquely identify these haplotypes. By testing an individual's tag SNPs (a process known as genotyping), researchers will be able to identify the collection of haplotypes in a person's DNA. The number of tag SNPs that contain most of the information about the patterns of genetic variation is estimated to be about 300,000 to 600,000, which is far fewer than the 10 million common SNPs.

Once the information on tag SNPs from the HapMap is available, researchers will be able to use them to locate genes involved in medically important traits. Consider the researcher trying to find genetic variants associated with high blood pressure. Instead of determining the identity of all SNPs in a person's DNA, the researcher would genotype a much smaller number of tag SNPs to determine the collection of haplotypes present in each subject. The researcher could focus on specific candidate genes that may be associated with a disease, or even look across the entire genome to find chromosomal regions that may be associated with a disease. If people with high blood pressure tend to share a particular haplotype, variants contributing to the disease might be somewhere within or near that haplotype.

Friday, January 27, 2006

Sybase ADO.NET (Internal Error 30002) FIX

Hey Friends:

This post is more relevant to those who are using Sybase ADO.NET drivers in their .NET applications to connect to Sybase databases. I recently came across a wierd SYBASE Exception which had a description: Internal Error 30002, while I was trying to insert records into my Sybase database tables. I could not find any explanation on Sybase tech support pages, and also not much helpful information on the search engines.

I opened up a Tech support case with Sybase as my company has their support subscription, and they advised us to upgrade the .NET PC Client to the latest EBF (ebf13008). This magically solved all these internal error problems.

I hope some of you who may be facing these errors might get help.

Thanks,
Puneet Wadhwa

Puneet Wadhwa's BIOINFORMATICS BLOG

Friday, March 24, 2006

Extracting ORF (CDS portion) from Sequence using Bioperl

Wednesday, March 15, 2006

A thousand clones...

Thursday, March 02, 2006

Waking a sleeping giant: annotating the Incyte Gene Collection (IGC)

Thursday, February 09, 2006

What is the HAPMAP?

Friday, January 27, 2006

Sybase ADO.NET (Internal Error 30002) FIX

About

About Me

Links

Previous

Archives