How to Solve a Problem, with Examples

One of the fundamental tasks of a programmer is how to solve a programming problem, based on the specifications. Here is my attempt to solve such a problem, finally arriving at a validated answer. 

If you've been on hackerrank.com, you may have seen this problem before. It's Fraudulent Activity Notifications

TL;DR: given an array of arbitrary length (let's say it can go to 1x10^6). Each number in the array is expenditures for each day... It can range from 0 to 200 (and is always an integer). You are also given a number "d", that's the number of days you need to LOOK BACK to calculate the window. You need the median value of that window. If the next value is larger than or equal to twice the median, client is notified. Now, given arbitrary d, how many notifications had been generated, for a given array? 
Example: you're given
6 4
    10 20 30 40 50 60
6 is the size of the array: 6 entries.
Window size is 4.
First run, "10 20 30 40"
Median in this case is 20+30/2, so 25. Twice that is 50.
Since 50 = 50, there is ONE notification so far.
Next iteration, "20 30 40 50" compared to 60.
Median here is 70/2=35, so it's 70 compared to 60. No notification. 
And that's the end of the array. So total notification is 1. 
Keep in mind that you will have to calculate median for a subset of an array that has 100000 items... and a window of 10000, 20000, even 40000 items. (Though just to point out, if these are DAILY expenditures, this guy must be an alien, since 100000 days turns out to be 273 years, and 40000 prior daily expenditures are equal to 109 years!)

It would NOT be practical to actually declare the array of that size and actually do it in memory. Sorting an array of 40000 items is NOT practical, esp. if you need to do it 60000 times, even if you sort it with the best algorithms like QuickSort or MergeSort.

If you don't believe me, you are welcome to try it. Try the naive/brute-force solution and see how you fare. You will pass maybe... 3 out of 8 tests, because your solution will take minutes to run. Whereas it should run in less than 5 seconds.

So what do we do?


Let's analyze the data given. Expenditure each day is between 0 and 200. What if... we can sort it by the values themselves, instead of by their positional order?

For those of you who remember their computer science, this is called a bucket sort. Actually, we will use counting sort, an alternate version that uses the entire value. As there can only be values from 0 to 200, we simply declare a 201 item (counting) array, which I will abbreviate "cArr" and set everything to zero. We will use that to COUNT the values.

    let cArr = new Array(201); for (let t=0; t<=201; t++) cArr[t] = 0;

NOTE: There are many different ways to write this part. You can use fill(0) as well. It does not really matter.  As for why 201, remember arrays are 0-based. So 0 to 200 is 201 items. 

Now we need to populate the count array, with the initial "window" of values. For every value we see, we increment the count. In the end, we have a count of every value in the window. 
for (let j=0;j<d;j++) {
        cArr[expenditure[j]]++;
    }

The main loop can be written a couple different ways, but let's do the straight-forward way... we're counting from 0 to (length - d).  As we know the start and end, this would be another for loop. 

(Write your code here)

You can also start from d to expenditure.length. It comes out to the same thing, as long as you keep your variables straight. 

So what do we do in the main loop?

Consider this... cArr has a count of all the values in the array, and it's already sorted. We just need to pop it back out... to recreate the "window" of "d" items.

Or do we? Let's think about the problem... We don't actually need a separate array to get the median, as long as we know the start and end value of the windows. 

To do that, we can simply move the window making two simple changes to cArr at every iteration of the main loop (i)... We remove the number that's going OUTSIDE the window (i.e. the item way to the left, expenditure[i]) from cArr, and add the number that's coming INTO the window (expenditure [d+i]) to the cArr

    cArr[expenditure[i]]--;
    cArr[expenditure[d+i]]++

So we can iterate the window accurately, without copying arrays or such. That ought to save us a lot of time, as we don't have to copy 10000 (or more values) each and every time. 

But how do we calculate median from the counting sort array, cArr?

Median is a) if d is an "odd" number, we take the center value of the sorted array, else b) d is an "even" number, so we take the center TWO values of the sorted and average them.

So if the "window" is [10 20 30 40] (d=4)  then median is 20+30 / 2 = 25. If d=3, and the window is [10 20 30], then the median is simply 20.

We are comparing 2 * median to the current value (the value just right of the window), so there is no need to divide by 2 at all. That saves us at least two lines of code and many clock cycles, when d is even. If d is odd, we multiply by 2. That's a simplification also. 

Furthermore, we do NOT need to know any of the previous numbers. We just need to know if the array is sorted. And a counting array is "sorta" sorted. We can arrive at the median "count" (not value) very quickly... by actually counting the entries.

Let us use the 10 20 30 example again, cArr would be

[0,  //0
0, //1
...
1, // 10
0,// 11
...
1, // 20
0, // 21
...
1, //30
0, // 31
...
0 // 201  ]

d=3. To get the median, we divide d by 2 (integer) and add 1, and we have 2 (3/2 = 1.5, integer =1, add 1 = 2), which is the entry we're looking for, which in this case, is value 20.

If we just write a loop, finding where the count added up to 2 ((d/2)+1), that's the median count we're looking for, and index is the median.

But that's for d being an odd number. What if d is an even number?

Here we have a couple different ways to find this...  we can use the loop twice, once to find before, and once to find after. By before and after, I mean the two separate values that will be used to calculate the median.

Let us use the 10 20 30 40 example, cArr would be

[0,  //0
0, //1
...
1, // 10
0,// 11
...
1, // 20
0, // 21
...
1, //30
0, // 31
...
1, //40
0, // 41
...
0 // 201  ]

So "before" is 20, and "after" is 30. The logic still works. d=4, so d/2+1 = 3, that's "after". We need "before", and that's simply d/2.

We have two potential approaches here...

One is to simply calculate d/2+1 and d/2 every time with separate loops... It's just a sum loop, ran twice for slightly different targets.

Or we can try to make one loop do both by targeting (d/2)+1, but keeping a "previous count" variable that would have pointed at (d/2).  Either approach will work.

Food for thought: If d is an odd number, it's simple. If d is an even number, it's more complicated. Can we use the same algorithm for BOTH even and odd, instead of two separate logic branches? 

Now that we have the general algorithm, it is time to test the edge cases. Specifically, when there are only 201 bins, some bins are bound to have more than one number. 

Consider this dataset, [10 10 20 20 30 30 40 40 50 50], d=4

[0,  //0
0, //1
...
2, // 10
0,// 11
...
2, // 20
0, // 21
...
2, //30
0, // 31
...
2, //40
0, // 41
...
2, //50
0, // 51
...
0 // 201  ]

First iteration: [10 10 20 20] compared to 30... 2*median = 10+20 = 30

Second iteration: [ 10 20 20 30] compared to 30... 2* median = 40

And so on. But what does our algorithm from above say? Let's try the separate loop method:

d=4, so (d/2)+1 = 3 

we loop through and found 2+2=4, which is >=3. So 20 is correct.

and d/2 = 2, and loop search would have given us 10.

Looks like this algorithm will work fine if we wrote it as separate loops.

Can you finish the solution now?

Can you figure out a way to do it in a single loop by keeping a "previous" variable so you know where the previous one is? And what if you have a special case there?

Comments

Popular posts from this blog

Yet another take on recursion

Problem Solving for Programmers: a Neglected Topic?