Statistics

Problem Statement for "DNAStrand"

Problem Statement

DEFINITION

One of the many things that people look at when they sequence DNA is the
varying occurences of the four nucleotides (adenine, guanine, thymine, and
cytosine) in different regions. Because of the chemical structure and
mutational properties of cytosine and guanine, they are rarely found next to
each other in a DNA strand, except in certain regions. These regions are known
as "CpG islands", and the other regions are called "Normal regions".

Almost always (and in this problem), the nucleotides are represented by a
single capital letter signifying their first letter: A = adenine, G = guanine,
T = thymine, C = cytosine. The regions are classified as: N = normal, R = CpG
island.

One of the puzzles facing geneticists and genomicists is identifying these CpG
islands because they often indicate the presence of a gene in that neighborhood
of the DNA strand. To aid the determination of such region boundaries, they
create a statistical model based on current data. The normal regions have a
certain probability of emitting A, another probability of emitting G, another
of T, and another of G. Similarly, the CpG islands have their own distinct
probabilities of emitting A, G, T, and C. For the purposes of this problem, if
a region "emits" a nucleotide at a position, then that nucleotide is the
nucleotide found at that position in the sequence upon analysis.

The other two probabilities involved are the probability that the DNA strand
will switch from a normal region to a CpG island, and vice versa.

Create a class DNAStrand which contains a method mostLikely, which takes as
parameters the probabilities of a normal region emitting A, G, T, and C, of an
island emitting A, G, T, and C, and the probabilities of a switch between
region types at a specific nucleotide in the sequence. This method should
determine the most likely state sequence that will emit this particular
sequence of DNA, and return that as a String, with 'N' representing a normal
region and 'R' representing a CpG island.

PROBLEM STATEMENT
Class name: DNAStrand
Method name: mostLikely
Parameters: int[], int[], String
Returns: String

The method signature is (make sure it is declared public): String mostLikely
(int[] normal, int[] island, String dnastrand);

normal will contain (in order):
Probability of emitting A in a normal region
Probability of emitting G in a normal region
Probability of emitting T in a normal region
Probability of emitting C in a normal region
Probability of switching to a CpG island for the next nucleotide

island will contain (in order):
Probability of emitting A in a island region
Probability of emitting G in a island region
Probability of emitting T in a island region
Probability of emitting C in a island region
Probability of switching to a normal region for the next nucleotide

dnastrand will contain only capital letters A,G,T,C and represents the sequence
of DNA that you will be examining.

NOTES
* Always assume that the first state is a normal region 'N'.

TopCoder will ensure the validity of the inputs.  Inputs are valid if all of
the following criteria are met:
* All probabilities will be between 1 and 97, inclusive.
* The sum of probabilities of emitting A,G,T,C (for each region) will be
exactly 100.
* dnastrand will contain only capital letters A,G,T,C.
* dnastrand will be between 1 and 25 characters in length, inclusive.
* There will be exactly one correct answer. (i.e., it will not be the case that
two state sequences will have the same probability of emitting the given
sequence).
* The difference between the probability of the correct answer and the
probability of the second-most-likely answer will be greater than 0.00001% of
the correct answer.

EXAMPLES

Example 1:
normal = {30, 20, 30, 20, 10} - 30% chance of emitting A or T, 20% chance of
emitting G or C and a 10% chance of switching to island state for the next
nucleotide (thus there is a 90% chance of not switching states)
island = {10, 40, 10, 40, 20} - 10% chance of emitting A or T, 40% chance of
emitting G or C and a 20% chance of switching to island state for the next
nucleotide (thus there is an 80% chance of not switching states)
dnastrand = "TGCC"
Sequence	Probability
"NNNN"		0.30*0.90*0.20*0.90*0.20*0.90*0.20 = 0.0017496
"NNNR"		0.30*0.90*0.20*0.90*0.20*0.10*0.40 = 0.0003888
"NNRN"		0.30*0.90*0.20*0.10*0.40*0.20*0.20 = 0.0000864
"NNRR"		0.30*0.90*0.20*0.10*0.40*0.80*0.40 = 0.0006912
"NRNN"		0.30*0.10*0.40*0.20*0.20*0.90*0.20 = 0.0000864
"NRNR"		0.30*0.10*0.40*0.20*0.20*0.10*0.40 = 0.0000192
"NRRN"		0.30*0.10*0.40*0.80*0.40*0.20*0.20 = 0.0001392
"NRRR"		0.30*0.10*0.40*0.80*0.40*0.80*0.40 = 0.0012288
Thus, the most likely sequence is "NNNN", and we return "NNNN".

Example 2:
normal = {4,14,62,20,44}
island = {39,15,4,42,25}
dnastrand =     "CCTGAGTTAGTCGT"
Method returns: "NNNRRRNNRRNRRN"

Example 3:
normal = {45,36,6,13,25}
island = {24,18,12,46,25}
dnastrand =     "CCGTACTTACCCAGGACCGCAGTCC"
Method returns: "NRRRRRRRRRRRNNNNRRRRRRRRR"

Example 4:
normal = {75,3,1,21,45}
island = {34,11,39,16,15}
dnastrand =     "TTAGCAGTGCG"
Method returns: "NRRRRRRRRRR"

Example 5:
normal = {26,37,8,29,16}
island = {31,13,33,23,25}
dnastrand =     "T"
Method returns: "N"

Definition

Class:
DNAStrand
Method:
mostLikely
Parameters:
int[], int[], String
Returns:
String
Method signature:
String mostLikely(int[] param0, int[] param1, String param2)
(be sure your method is public)

Constraints

    Examples


      This problem statement is the exclusive and proprietary property of TopCoder, Inc. Any unauthorized use or reproduction of this information without the prior written consent of TopCoder, Inc. is strictly prohibited. (c)2024, TopCoder, Inc. All rights reserved.
      This problem was used for: