mike's web log

 


 




 

 


 
Word frequency revisited

posted at 12:02 AM

Having finished up a pile of work-work, I can now return to the interesting suggestions raised by my recent word frequency entry. Simon suggested using a custom class that implements IComparable. That was new to me, so I gave it a try. It wasn't immediately obvious what to do, but with a little poking around I found a number of examples, including some in the .NET QuickStarts. Who knew?

To jump ahead a moment, you can see the result of the effort first. I now have two word frequency pages, the original that uses a DataTable and a new one that uses the custom class and array sorting. Try them out:

Word frequency with data table
Word frequency with custom class implementing IComparable

The version with a data table is no longer interesting from an implementation POV, but I was curious about timings.[1]

The custom class is a simple class with a textbook implementation of CompareTo. The mildly fun twist is that I coded the CompareTo method to sort by two values (first frequency, descending, then word, ascending).

Eric had some other suggestions. One was to use a Hashtable, which I couldn't figure out how to do: putting instances of the custom class into a Hashtable worked, but Hashtable does not seem to support the Sort method. He's also got an implementation using generics, which are new in Whidbey and thus not possible yet in 1.1. If you're curious, though, have a look at his second comment.
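For what it's worth, here is one way the Hashtable idea might be made to work. This is only a sketch under my own assumptions, not Eric's code: the Hashtable does the counting (word as key, count as value), and because Hashtable has no Sort method, its entries are copied into an ArrayList and sorted there. The EntryComparer class and the sample sentence are made up for illustration.

```vbnet
Imports System
Imports System.Collections

Module HashtableWordCount

    ' Hypothetical comparer (not from the original post): orders
    ' DictionaryEntry items by count descending, then by word ascending.
    Class EntryComparer
        Implements IComparer

        Public Function Compare(ByVal x As Object, ByVal y As Object) As Integer _
            Implements IComparer.Compare
            Dim a As DictionaryEntry = CType(x, DictionaryEntry)
            Dim b As DictionaryEntry = CType(y, DictionaryEntry)
            ' Compare b to a so that larger counts come first
            Dim result As Integer = CInt(b.Value).CompareTo(CInt(a.Value))
            If result = 0 Then
                ' Same count: fall back to the word, ascending
                result = String.CompareOrdinal(CStr(a.Key), CStr(b.Key))
            End If
            Return result
        End Function
    End Class

    Sub Main()
        ' Made-up sample text, already lower-cased and free of punctuation
        Dim words() As String = "the cat and the dog and the bird".Split(" "c)

        ' Count each word: the key is the word, the value is its count
        Dim counts As New Hashtable()
        For Each w As String In words
            If counts.ContainsKey(w) Then
                counts(w) = CInt(counts(w)) + 1
            Else
                counts(w) = 1
            End If
        Next

        ' Hashtable cannot be sorted, so copy its entries to an ArrayList
        Dim entries As New ArrayList(counts)
        entries.Sort(New EntryComparer())

        For Each entry As DictionaryEntry In entries
            Console.WriteLine("{0} {1}", entry.Value, entry.Key)
        Next
    End Sub

End Module
```

The appeal of this approach is that the word array never needs to be sorted first; only the (usually much shorter) list of distinct words gets sorted at the end.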

Anyway, here's the code:
Sub Button1_Click(sender As Object, e As EventArgs)


Dim startTime As DateTime = DateTime.Now
Dim endTime As DateTime


Dim i As Integer
Dim s As String


Dim punctuation() As Char = {".", ",", "!", "=", "-", _
"_", ";", ":", "(", ")", "[", "]", """", "?", "/", "\", _
"@", "#", "$", "%", "&", "*", "<", ">", "|", _
"~", "‘", "`"}
Dim t As String = TextBox1.Text
t = t.ToLower()
t = t.Trim()
For i = 0 to punctuation.Length - 1
t = t.Replace(punctuation(i), " ")
Next i
t = t.Replace(vbcrlf, " ")
t = t.Replace(vbtab, " ")


' Dumb old so-called smart quotes, grrr
t = t.Replace(Chr(145), " ")
t = t.Replace(Chr(146), "'") ' smart apostrophe
t = t.Replace(Chr(147), " ")
t = t.Replace(Chr(148), " ")
t = t.Replace(Chr(151), " ")


t = t.Replace(vbcrlf, " ")
t = t.Replace(vbtab, " ")


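While t.IndexOf("  ") > -1
t = t.Replace("  ", " ") ' collapse double spaces into single spaces
End While
' Replacing punctuation can leave a leading or trailing space,
' which would show up as an empty "word" after the split
t = t.Trim()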


' Create array of all words
Dim wordArray() As String
wordArray = t.split
Array.Sort(wordArray)


Dim WordsByCount As New ArrayList()
' Walk through word array, accumulating count of (sorted)
' words. When we run out of words, write word and accumulator
' to new array of custom WordFrequency objects.
Dim arrayLength AS Integer = wordArray.Length - 1
Dim accumulator As Integer = 0
Dim nextWord As String = ""
Dim currentWord As String = ""


For i = 0 to arrayLength
nextWord = wordArray(i)
If nextWord = currentWord Then
accumulator += 1
Else
If i > 0 Then
WordsByCount.Add(New WordFrequency(currentWord, accumulator))
End If
currentWord = nextWord
accumulator = 1
End If
Next
WordsByCount.Add(New WordFrequency(currentWord, accumulator))


' Sort method invokes custom comparison method of objects in array
WordsByCount.Sort()


' Display results
s = "<table cellpadding=4>"
For Each wf As WordFrequency in WordsByCount
s &= "<tr>"
s &= "<td>" & wf.Frequency & "</td>"
s &= "<td>" & wf.Word & "</td>"
s &= "</" & "tr>"
Next
s &= "</table>"
Literal1.Text = s
labelWordCount.Text = wordArray.Length.ToString()
endTime = DateTime.Now
Dim timeDiff As TimeSpan = endTime.Subtract(startTime)
Dim totalSeconds As Double = (timeDiff.TotalMilliSeconds / 1000)
labelTime.text = totalSeconds.ToString("g")
End Sub


Class WordFrequency: Implements IComparable
Dim WordValue As String
Dim FrequencyValue As Integer


Public Sub New()
End Sub


Public Sub New(word As String, freq As Integer)
Me.Word = word
Me.Frequency = freq
End Sub


Public Property Word As String
Get
Return WordValue
End Get
Set (value As String)
WordValue = value
End Set
End Property


Public Property Frequency As Integer
Get
Return FrequencyValue
End Get


Set (value As Integer)
FrequencyValue = value
End Set
End Property


Public Function CompareTo (ByVal ObjectToCompare As Object) As Integer _
Implements IComparable.CompareTo
Dim WordFrequencyObject As WordFrequency = _
CType(ObjectToCompare, WordFrequency)
' Higher frequency sorts first (descending)
CompareTo = WordFrequencyObject.Frequency - Me.Frequency
If CompareTo = 0 Then
' Word frequencies are the same, so now compare words (ascending)
If WordFrequencyObject.Word < Me.Word Then
CompareTo = 1
ElseIf WordFrequencyObject.Word > Me.Word Then
CompareTo = -1
Else
CompareTo = 0
End If
End If
End Function
End Class
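
For comparison, once generics are available (Whidbey), the same two-key ordering could probably be written against IComparable(Of T) and List(Of T), which avoids the CType cast. This is a rough sketch of my own, not Eric's implementation; the class name WordFrequencyT and the sample data are hypothetical, and it will not compile on 1.1.

```vbnet
Imports System
Imports System.Collections.Generic

Module GenericWordFrequencyDemo

    ' Hypothetical generic counterpart of the WordFrequency class
    Class WordFrequencyT
        Implements IComparable(Of WordFrequencyT)

        Public Word As String
        Public Frequency As Integer

        Public Sub New(ByVal wordValue As String, ByVal frequencyValue As Integer)
            Me.Word = wordValue
            Me.Frequency = frequencyValue
        End Sub

        ' Frequency descending first, then word ascending
        Public Function CompareTo(ByVal other As WordFrequencyT) As Integer _
            Implements IComparable(Of WordFrequencyT).CompareTo
            Dim result As Integer = other.Frequency.CompareTo(Me.Frequency)
            If result = 0 Then
                result = String.CompareOrdinal(Me.Word, other.Word)
            End If
            Return result
        End Function
    End Class

    Sub Main()
        ' Made-up sample data
        Dim list As New List(Of WordFrequencyT)()
        list.Add(New WordFrequencyT("dog", 1))
        list.Add(New WordFrequencyT("the", 3))
        list.Add(New WordFrequencyT("cat", 1))
        list.Add(New WordFrequencyT("and", 2))

        ' Sort() picks up the IComparable(Of T) implementation
        list.Sort()

        For Each wf As WordFrequencyT In list
            Console.WriteLine("{0} {1}", wf.Frequency, wf.Word)
        Next
    End Sub

End Module
```

With the sample data above, the list should come out as the (3), and (2), then cat and dog (1 each) in alphabetical order.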

[1] The data table implementation seems to be marginally faster than the custom class, at least as implemented by me. I used a test of 10,996 words (the first two chapters of Dickens's David Copperfield), and in three trials got these timings, in seconds: DataTable - 7.23/6.0156/6.1093 (mean about 6.45); custom class - 8.1098/6.578/6.48 (mean about 7.06).
