ProtVec

ProtVec is a distributed representation of proteins learned from their sequences. First, we need a large corpus of protein sequences on which to train the representation. Then, we break each sequence into subsequences of three residues (3-grams, treated as words). Shifting the starting position by 0, 1, and 2 residues gives 3 lists of non-overlapping words per sequence.

../../_images/split_prot_seqs.png
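The splitting step can be sketched in a few lines of Python. This is a minimal illustration, not the original implementation: the function name `split_to_kmer_lists` and the example sequence are made up for this sketch, and trailing residues that do not fill a complete 3-gram are assumed to be dropped.

```python
def split_to_kmer_lists(sequence, k=3):
    """Return k lists of non-overlapping k-grams, one list per starting offset."""
    lists = []
    for offset in range(k):
        # Step through the sequence k residues at a time, starting at `offset`.
        # Assumption: a trailing fragment shorter than k is discarded.
        words = [sequence[i:i + k]
                 for i in range(offset, len(sequence) - k + 1, k)]
        lists.append(words)
    return lists


# Hypothetical 10-residue sequence, used only for illustration.
example = split_to_kmer_lists("MAFSAEDVLK")
print(example)
# [['MAF', 'SAE', 'DVL'], ['AFS', 'AED', 'VLK'], ['FSA', 'EDV']]
```

Each input sequence therefore contributes three "sentences" of 3-grams to the training corpus, which is where the factor of 3 in the corpus size comes from.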

Finally, we train the embedding with a Skip-gram model on 1,640,370 (546,790 × 3) sequences of 3-grams.

../../_images/skip_gram.png
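To make the Skip-gram objective concrete, the sketch below generates the (center word, context word) pairs that a Skip-gram model learns to predict from one list of 3-grams. It covers only pair generation, not the training itself. The function name `skipgram_pairs` is hypothetical, and the window size is an assumption, since the text above does not state the value that was used.

```python
def skipgram_pairs(words, window=2):
    """List (center, context) pairs for every word within `window` positions."""
    pairs = []
    for i, center in enumerate(words):
        lo = max(0, i - window)
        hi = min(len(words), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, words[j]))
    return pairs


# One list of non-overlapping 3-grams; window=1 is an illustrative choice.
pairs = skipgram_pairs(["MAF", "SAE", "DVL"], window=1)
print(pairs)
# [('MAF', 'SAE'), ('SAE', 'MAF'), ('SAE', 'DVL'), ('DVL', 'SAE')]
```

In practice an existing word2vec implementation would be trained directly on the lists of 3-grams rather than on hand-built pairs.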

Utilization

ProtVec can be applied to various tasks, such as protein family classification.

  • Each sequence is represented as the sum of the vector representations of its overlapping 3-grams

  • For each family, as many sequences as there are positive examples are selected randomly from Swiss-Prot to serve as negative examples
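The two steps above might look roughly like this in Python. This is a sketch under stated assumptions rather than the original pipeline: `embed_sequence`, `sample_negatives`, the toy 2-dimensional vectors, and the sequence identifiers are all hypothetical, and skipping 3-grams that have no trained vector is a choice made here, not something the text specifies.

```python
import random


def embed_sequence(sequence, vectors, k=3):
    """Sum the vectors of all overlapping k-grams in `sequence`."""
    dim = len(next(iter(vectors.values())))
    total = [0.0] * dim
    for i in range(len(sequence) - k + 1):
        vec = vectors.get(sequence[i:i + k])
        if vec is None:
            continue  # assumption: 3-grams without a trained vector are skipped
        total = [t + v for t, v in zip(total, vec)]
    return total


def sample_negatives(positive_ids, all_ids, seed=0):
    """Randomly pick as many non-family sequences as there are positives."""
    positives = set(positive_ids)
    candidates = [x for x in all_ids if x not in positives]
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return rng.sample(candidates, len(positives))


# Toy 2-dimensional vectors, for illustration only.
toy_vectors = {"MAF": [1.0, 0.0], "AFS": [0.0, 2.0], "FSA": [0.5, 0.5]}
print(embed_sequence("MAFSA", toy_vectors))
# [1.5, 2.5]

# Hypothetical identifiers: two family members and three other sequences.
negatives = sample_negatives(["P1", "P2"], ["P1", "P2", "N1", "N2", "N3"])
print(len(negatives))
# 2
```

Note that `random.Random.sample` raises `ValueError` if fewer candidates than positives are available, so very large families would need separate handling. The summed vectors of the positive and negative sequences can then be passed to any binary classifier.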