mike's web log

 

Word frequency

posted at 12:45 AM

I got interested today in an issue that involves counting word frequency in an arbitrary list of words, so I played around with it a little. I believe that the general algorithm is not very complex:
  1. Parse text to find individual words.
  2. Add words to list.
  3. Sort list.
  4. Walk list, counting words and accumulating totals.
  5. Sort results by accumulated total.
  6. Serve and enjoy.
I realized, however, that I wasn't clear on which .NET structures to use to implement this algorithm. Step 1 I can do[2]; for Steps 2 and 3, I add the words to an array and then use Array.Sort(array).

Step 5 is the tricky one, it seems. You need a structure that will accommodate data like this:

the 5
jumped 4
fox 3
brown 2
quick 1
etc.

in other words, a two-field structure that allows sorting by one of the fields. The Array.Sort method supports only 1-dimensional arrays. SortedList looked promising, but it sorts only by key (word), not value (count), and you can't use count as key, since it's not unique.
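
One way to get a "two-field" sort out of plain arrays is the Array.Sort(keys, items) overload, which sorts one 1-dimensional array and reorders a second one to match. The following is only a minimal sketch: the module name and sample data are made up for illustration, and because the sort is not stable, words with equal counts come out in no guaranteed order.

```vbnet
Imports System

Module ParallelSortSketch
    Sub Main()
        ' Hypothetical sample data: words(i) occurs counts(i) times.
        Dim words() As String = {"brown", "fox", "jumped", "quick", "the"}
        Dim counts() As Integer = {2, 3, 4, 1, 5}

        ' Sort counts ascending; words is reordered in step with it.
        Array.Sort(counts, words)

        ' Reverse both arrays so the highest count comes first.
        Array.Reverse(counts)
        Array.Reverse(words)

        For i As Integer = 0 To words.Length - 1
            Console.WriteLine(words(i) & " " & counts(i).ToString())
        Next
    End Sub
End Module
```

This avoids a DataTable but gives no secondary sort by word, so it fits best when ties do not matter.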

The only structure that came to mind was DataTable, which supports a DataView that allows sorting. So that's what I've used. I'd love to hear from folks about better ways to accomplish this task.
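
A possible alternative, sketched here on the assumption that generics (.NET 2.0 or later) are available: count the words in a Dictionary, copy the pairs into a List, and sort that list with a custom comparison. The module, function, and sample text below are hypothetical names chosen for illustration, not part of the page's code.

```vbnet
Imports System
Imports System.Collections.Generic

Module WordCountSketch
    ' Orders pairs by count (descending), then by word (ascending, ordinal).
    Function CompareByCount(ByVal a As KeyValuePair(Of String, Integer), _
                            ByVal b As KeyValuePair(Of String, Integer)) As Integer
        Dim result As Integer = b.Value.CompareTo(a.Value)
        If result = 0 Then result = String.CompareOrdinal(a.Key, b.Key)
        Return result
    End Function

    Sub Main()
        Dim sample As String = "the quick brown fox jumped over the lazy dog the end"
        Dim counts As New Dictionary(Of String, Integer)

        ' Tally each word; the word list does not need to be sorted first.
        For Each word As String In sample.Split(" "c)
            If word.Length = 0 Then Continue For
            If counts.ContainsKey(word) Then
                counts(word) += 1
            Else
                counts(word) = 1
            End If
        Next

        ' Copy the pairs into a list and sort it with the comparison above.
        Dim pairs As New List(Of KeyValuePair(Of String, Integer))(counts)
        pairs.Sort(AddressOf CompareByCount)

        For Each pair As KeyValuePair(Of String, Integer) In pairs
            Console.WriteLine(pair.Value.ToString() & " = " & pair.Key)
        Next
    End Sub
End Module
```

With this shape, Steps 2 through 4 collapse into the dictionary tally, and only the final sort by count remains.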

You can give my primitive experiment a whirl here. Here's the code I'm using (except the sample formats the output slightly):
Dim i As Integer
Dim s As String


' Characters treated as word separators
Dim punctuation() As Char = {"."c, ","c, "!"c, "="c, "-"c, "_"c, ";"c, ":"c, _
    "("c, ")"c, "["c, "]"c, """"c}
Dim t As String = TextBox1.Text
t = t.ToLower()
t = t.Trim()
For i = 0 To punctuation.Length - 1
    t = t.Replace(punctuation(i), " "c)
Next i
t = t.Replace(vbcrlf, " ")
t = t.Replace(vbtab, " ")


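' Collapse runs of spaces into a single space
While t.IndexOf("  ") > -1
    t = t.Replace("  ", " ") ' double spaces
End While
' Replaced punctuation can leave a space at either end; trim again to avoid empty words
t = t.Trim()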


' Create array of all words
Dim wordArray() As String
wordArray = t.split
Array.Sort(wordArray)


' Create data table with two columns, word and count
Dim dt As New System.Data.DataTable("temp")
dt.Columns.Add("word", Type.GetType("System.String"))
dt.Columns.Add("count", Type.GetType("System.Int32"))


Dim dr As System.Data.DataRow


' Walk through word array, accumulating count of (sorted)
' words. As we get to each new word, write out the previous word
' and its accumulator to a data table
Dim arrayLength As Integer = wordArray.Length - 1
Dim accumulator As Integer = 0
Dim nextWord As String = ""
Dim currentWord As String = ""


For i = 0 To arrayLength
    nextWord = wordArray(i)
    If nextWord = currentWord Then
        accumulator += 1
    Else
        If i > 0 Then
            dr = dt.NewRow
            dr("word") = currentWord
            dr("count") = accumulator
            dt.Rows.Add(dr)
        End If
        currentWord = nextWord
        accumulator = 1
    End If
Next
' Write out the final word and its count.
' This should be in a sub, since it's repeated ...
dr = dt.NewRow
dr("word") = currentWord
dr("count") = accumulator
dt.Rows.Add(dr)


' Sort entries in data table by count (desc), then word
dt.DefaultView.Sort = "count DESC,word"


' Display results
Dim drv As System.Data.DataRowView
For Each drv In dt.DefaultView
    s &= "<br>" & drv("count") & " = " & drv("word")
Next
Label1.Text = s

[2] Corey will be disappointed, but I'm not using regular expressions for this ...
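
For comparison, a regular-expression version of Step 1 might look like the sketch below. It assumes a "word" is any run of lowercase letters, digits, or apostrophes after lowercasing, which is not quite the same rule as the punctuation list in the code above; the module name and sample text are made up for illustration.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module RegexSplitSketch
    Sub Main()
        Dim sample As String = "The quick, brown fox -- jumped!"

        ' Each match is one run of letters, digits, or apostrophes
        ' (an assumed definition of "word").
        For Each m As Match In Regex.Matches(sample.ToLower(), "[a-z0-9']+")
            Console.WriteLine(m.Value)
        Next
    End Sub
End Module
```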

