About

I'm Mike Pope. I live in the Seattle area. I've been a technical writer and editor for over 30 years. I'm interested in software, language, music, movies, books, motorcycles, travel, and ... well, lots of stuff.







  12:45 AM

I got interested today in an issue that involves counting word frequency in an arbitrary list of words, so I played around with it a little. I believe that the general algorithm is not very complex:
  1. Parse text to find individual words.
  2. Add words to list.
  3. Sort list.
  4. Walk list, counting words and accumulating totals.
  5. Sort results by accumulated total.
  6. Serve and enjoy.
I realized, however, that I wasn't clear on what .NET structures to use to implement this algorithm. Step 1 I can do[2]; for Steps 2 and 3, I add the words to an array and then use Array.Sort(array).

Step 5 is the tricky one, it seems. You need a structure that will accommodate data like this:

the 5
jumped 4
fox 3
brown 2
quick 1
etc.

in other words, a two-field structure that allows sorting by one of the fields. The Array.Sort method supports only 1-dimensional arrays. SortedList looked promising, but it sorts only by key (word), not value (count), and you can't use count as key, since it's not unique.
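One possibility worth noting, offered as a sketch rather than a tested solution: Array.Sort also has an overload that takes two parallel one-dimensional arrays, a keys array and an items array, and reorders the items to follow the sorted keys. Assuming the counts and words have already been tallied into two arrays (the helper name SortByCountDescending and the sample values are made up for illustration), sorting by count might look like this:

```vbnet
Imports System

Module ParallelSortSketch
    ' Hypothetical helper: reorders both arrays so the highest count comes first.
    Sub SortByCountDescending(counts() As Integer, words() As String)
        ' Sorts counts ascending and moves each word along with its count.
        Array.Sort(counts, words)
        ' Reverse both arrays the same way so they stay in step.
        Array.Reverse(counts)
        Array.Reverse(words)
    End Sub

    Sub Main()
        ' Illustrative data only
        Dim words() As String = {"brown", "fox", "the", "quick", "jumped"}
        Dim counts() As Integer = {2, 3, 5, 1, 4}
        SortByCountDescending(counts, words)
        For i As Integer = 0 To words.Length - 1
            Console.WriteLine(words(i) & " " & counts(i))
        Next
    End Sub
End Module
```

One caveat: Array.Sort is not a stable sort, so words with the same count come out in no particular order. If ties need to be ordered by word, this approach would need a second pass.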

The only structure that came to mind was DataTable, which supports a DataView that allows sorting. So that's what I've used. I'd love to hear from folks about better ways to accomplish this task.
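For comparison, here is a rough sketch of one alternative. It is not what the listing below does, and it assumes a version of .NET that has generics (2.0 or later). The idea is to tally the words in a Dictionary, which removes the need to sort the word array first, then copy the pairs into a List and sort that with a comparison function. The names CountWords and CompareByCountThenWord are invented for this example.

```vbnet
Imports System
Imports System.Collections.Generic

Module DictionaryCountSketch
    ' Hypothetical comparison: higher counts first, then words in ordinal order.
    Function CompareByCountThenWord(a As KeyValuePair(Of String, Integer), _
                                    b As KeyValuePair(Of String, Integer)) As Integer
        Dim byCount As Integer = b.Value.CompareTo(a.Value)
        If byCount <> 0 Then Return byCount
        Return String.CompareOrdinal(a.Key, b.Key)
    End Function

    ' Hypothetical helper: returns (word, count) pairs sorted by count, then word.
    Function CountWords(words As IEnumerable(Of String)) As List(Of KeyValuePair(Of String, Integer))
        Dim counts As New Dictionary(Of String, Integer)
        For Each w As String In words
            Dim n As Integer = 0
            counts.TryGetValue(w, n)   ' n stays 0 if the word has not been seen yet
            counts(w) = n + 1
        Next

        Dim result As New List(Of KeyValuePair(Of String, Integer))(counts)
        result.Sort(AddressOf CompareByCountThenWord)
        Return result
    End Function

    Sub Main()
        Dim sample() As String = {"the", "fox", "the", "brown", "fox", "the"}
        For Each pair As KeyValuePair(Of String, Integer) In CountWords(sample)
            Console.WriteLine(pair.Value & " = " & pair.Key)
        Next
    End Sub
End Module
```

The comparison function plays the role of the "count DESC,word" sort expression used with the DataView, without needing a DataTable at all.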

You can give my primitive experiment a whirl here. Here's the code I'm using (except the sample formats the output slightly):
Dim i As Integer
Dim s As String = ""

' Characters to treat as word separators
Dim punctuation() As Char = {"."c, ","c, "!"c, "="c, "-"c, "_"c, ";"c, ":"c, _
    "("c, ")"c, "["c, "]"c, """"c}
Dim t As String = TextBox1.Text
t = t.ToLower()
For i = 0 To punctuation.Length - 1
    t = t.Replace(punctuation(i), " "c)
Next i
t = t.Replace(vbCrLf, " ")
t = t.Replace(vbTab, " ")

' Collapse runs of spaces into a single space
While t.IndexOf("  ") > -1
    t = t.Replace("  ", " ") ' double spaces
End While
' Trim last, so that punctuation replaced at either end leaves no empty word
t = t.Trim()


' Create array of all words
Dim wordArray() As String
wordArray = t.Split()
Array.Sort(wordArray)


' Create data table with two columns, word and count
Dim dt As New System.Data.DataTable("temp")
dt.Columns.Add("word", Type.GetType("System.String"))
dt.Columns.Add("count", Type.GetType("System.Int32"))


Dim dr As System.Data.DataRow


' Walk through word array, accumulating count of (sorted)
' words. As we get to each new word, write out the previous word
' and its accumulator to a data table
Dim arrayLength As Integer = wordArray.Length - 1
Dim accumulator As Integer = 0
Dim nextWord As String = ""
Dim currentWord As String = ""


For i = 0 To arrayLength
    nextWord = wordArray(i)
    If nextWord = currentWord Then
        accumulator += 1
    Else
        If i > 0 Then
            dr = dt.NewRow
            dr("word") = currentWord
            dr("count") = accumulator
            dt.Rows.Add(dr)
        End If
        currentWord = nextWord
        accumulator = 1
    End If
Next
' This should be in a sub, since it's repeated ...
dr = dt.NewRow
dr("word") = currentWord
dr("count") = accumulator
dt.Rows.Add(dr)


' Sort entries in data table by count (desc), then word
dt.DefaultView.Sort = "count DESC,word"


' Display results
Dim drv As System.Data.DataRowView
For Each drv In dt.DefaultView
    s &= "<br>" & drv("count") & " = " & drv("word")
Next
Label1.Text = s
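As the comment in the listing says, the row-building code appears twice and would be better in a sub. A minimal sketch of what that might look like follows. The names AddWordRow and CreateWordTable are invented here, and the sample rows are illustrative only. It relies on the Rows.Add overload that takes the column values directly, in column order.

```vbnet
Imports System
Imports System.Data

Module AddRowSketch
    ' Hypothetical helper standing in for the repeated row-building code.
    Sub AddWordRow(dt As DataTable, word As String, count As Integer)
        ' Rows.Add accepts the values for each column, in column order.
        dt.Rows.Add(word, count)
    End Sub

    ' Builds the same two-column table the listing uses.
    Function CreateWordTable() As DataTable
        Dim dt As New DataTable("temp")
        dt.Columns.Add("word", GetType(String))
        dt.Columns.Add("count", GetType(Integer))
        Return dt
    End Function

    Sub Main()
        Dim dt As DataTable = CreateWordTable()
        AddWordRow(dt, "fox", 3)
        AddWordRow(dt, "the", 5)
        AddWordRow(dt, "brown", 3)
        ' Same sort expression as the listing: count descending, then word
        dt.DefaultView.Sort = "count DESC,word"
        For Each drv As DataRowView In dt.DefaultView
            Console.WriteLine(drv("count").ToString() & " = " & drv("word").ToString())
        Next
    End Sub
End Module
```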

[2] Corey will be disappointed, but I'm not using regular expressions for this ...
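For anyone who does want the regular-expression route, here is a rough sketch of how the word-splitting step might look. It is an assumption about what counts as a word (runs of letters and apostrophes, with digits ignored), not a drop-in replacement for the punctuation list above, and the name ExtractWords is made up for the example.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module RegexSplitSketch
    ' Hypothetical helper: returns the lowercase words found in the source text.
    Function ExtractWords(source As String) As List(Of String)
        Dim words As New List(Of String)
        ' [a-z']+ matches runs of lowercase letters and apostrophes;
        ' everything else acts as a separator.
        For Each m As Match In Regex.Matches(source.ToLower(), "[a-z']+")
            words.Add(m.Value)
        Next
        Return words
    End Function

    Sub Main()
        For Each w As String In ExtractWords("The quick, brown fox; the FOX!")
            Console.WriteLine(w)
        Next
    End Sub
End Module
```

Because the pattern picks out the words directly, there are no empty entries to worry about and no loop to collapse double spaces.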
