A comparison of string cleaning methods

While looking at a few methods for parsing a string I came across a method for removing unwanted characters from a string. It involves looping through a string and using the Like Operator to determine if any given character is wanted.

It goes something like this:

Function CleanString(inputString As String) As String
  Dim thisCharacter As String
  Dim i As Long

  For i = 1 To Len(inputString)
    ' grab each character in turn
    thisCharacter = Mid$(inputString, i, 1)
    If thisCharacter Like "[0-9,A-Z,a-z,"" ""]" Then
      CleanString = CleanString & thisCharacter
    End If
  Next i
End Function

I like this because it doesn't require any workarounds I usually use, like creating an array of "bad" characters and comparing each character in the input string with each "bad" character to see if there's a match. That could be a lot of looping.

I thought there might be a faster way to do the same thing. Building the cleaned string one character at a time must be (relatively) time consuming, right?

So here's what I came up with. It assumes that the input string is valid, then uses the Replace function to strip out unwanted characters. It still loops through the string — that's the only way to avoid having to specify which characters I don't want, which is everything but letters, numbers and spaces.

Function CleanString2(inputString As String) As String

  Dim i As Long
  Dim thisCharacter As String

  ' assume valid string
  CleanString2 = inputString
  i = 1
  Do While i <= Len(CleanString2)
    thisCharacter = Mid$(CleanString2, i, 1)
    If Not thisCharacter Like "[0-9,A-Z,a-z,"" ""]" Then
      CleanString2 = Replace(CleanString2, thisCharacter, "")
    Else
      i = i + 1
    End If
  Loop

End Function

A Do While Loop is required because the string changes size whenever the Replace Function is used. If we used a For Loop, eventually the loop counter would be greater than the size of the string (assuming we did some replacements) and the Mid Function would throw an error. The Len Keyword dynamically changes with the size of the string.

This should be faster, because the more bad characters there are, the less iterations, right? Let's find out!

The Test

I reused a simple procedure that runs both of the above functions and times the result.

Public Declare Function timeGetTime Lib "winmm.dll" () As Long

Sub TestStringClean()

  Dim startTimeBuildString As Long
  Dim endTimeBuildString As Long
  Dim startTimeReplaceString As Long
  Dim endTimeReplaceString As Long
  Dim i As Long
  Dim retVal As String

  Const numberofLoops As Long = 100000
  Const stringtobefixed As String = "abcdefgeuhuep9Y*&#PO#M@##}{{:><?><}"

  startTimeBuildString = timeGetTime
  For i = 1 To numberofLoops
    retVal = CleanString(stringtobefixed)
  Next i
  endTimeBuildString = timeGetTime

  startTimeReplaceString = timeGetTime
  For i = 1 To numberofLoops
    retVal = CleanString2(stringtobefixed)
  Next i
  endTimeReplaceString = timeGetTime

  MsgBox "Time using Build String Method: " & ((endTimeBuildString - startTimeBuildString) / 1000) & vbNewLine & _
    "Time using Replace Method: " & ((endTimeReplaceString - startTimeReplaceString) / 1000)
  
End Sub

The Results

My assumption is that my method will be faster. Let's start at 10,000 iterations.

Build vs Replace at 10,000 iterations

Huh? Looks like the method I found is faster than the one I wrote. Let's try 100,000 iterations.

Build vs Replace at 100,000 iterations

Clearly the "build-as-you-go" method is faster. Just for kicks let's try 1,000,000 iterations, but at this point I assume my method will be slower.

Build vs Replace at 1,000,000 iterations

Conclusion

My method was considerably slower. I still like it as an example for checking for negative conditions, but in general I'm going to stick with the method I found somewhere online. Maybe it would be faster if rewritten to build the string instead of using the Replace Function?

About JP

I'm just an average guy who writes VBA code for a living. This is my personal blog. Excel and Outlook are my thing, with a sprinkle of Access and Word here and there. Follow this space to learn more about VBA. Keep Reading »


Related Articles:


Share This Article:

Share and bookmark this articledelicious buttonfacebook buttonlinkedin buttonstumbleupon buttontwitter button

comment bubble 11 Comment(s) on A comparison of string cleaning methods:

  1. I wonder if it is slower your way because Instr already loops through each character? In which case, you would be looping an additional number of times equal to the number of "dirty" characters (10 in your example if I count correctly).

    I wonder, have you ever played around with the timing methods themselves? The researcher in me wonders if it would not be better to randomize the order of calling while encapsulating as much as possible. so your code might be something like:

    Sub CompareMethods()
    
      Dim i As Long
      Dim tMethod1 As Single, tMethod2 As Single
      Dim Method1Count As Long, Method2Count As Long
      Dim normMethod1 As Single, normMethod2 As Single
      Dim whichMethod As Single
      
      For i = 1 To 20000
        
        whichMethod = Rnd
        
        ' 50/50 chance of method 1 or method 2
        If whichMethod > 0.5 Then
          tMethod1 = tMethod1 + Method1
          Method1Count = Method1Count + 1
        Else
          tMethod2 = tMethod2 + Method2
          Method2Count = Method2Count + 1
        End If
            
      Next i
      
      ' Normalize totals to account for randomness
      normMethod1 = tMethod1 / Method1Count * 10000
      normMethod2 = tMethod2 / Method2Count * 10000
      
      MsgBox "Time using Method 1: " & (normMethod1 / 1000) & vbNewLine & _
             "Time using Method 2: " & (normMethod2 / 1000)
    
      
    End Sub
    
    Function Method1() As Single
      Dim t As Single
      
      t = Timer
      
      'Stuff
      
      Method1 = Timer - t
      
    End Function
    
    Function Method2() As Single
      Dim t As Single
      
      t = Timer
      
      'Stuff
      
      Method2 = Timer - t
      
    End Function

    Since you won't get an exactly 50/50 distribution you can't simply show the sum of the times, which is why I normalized it. It wouldn't be too difficult to force it to be even, though that could invalidate the whole point of randomizing it in the first place.

    I don't imagine it will make a difference for this particular comparison, but you do so many speed tests that eventually you are bound to find one where order could make a difference. Just curious.

  2. I wonder if it would be faster and safer to just convert the string to a byte array then loop through the array. Well, I checked it and it is much slower. I was a bit surprised. I think doing the concatenating might get pretty slow with very large strings – but from doing what I did, it didn't seem to be that big of a deal. So the first code seems best.

    Here's what I threw together. (So you will know not to bother).

    Function CleanString3(inputString As String) As String
      
        Dim chrCharacters() As Byte
        Dim chrNewString() As Byte
        Dim i As Long, iChrIndex As Integer, iChr As Integer
        
        chrCharacters = inputString
        ReDim chrNewString(Len(inputString) - 1)
        
        iChrIndex = -1
        For i = 0 To UBound(chrCharacters)
            ' grab each character in turn
            If chrCharacters(i) > 0 Then
                iChr = chrCharacters(i)
                If (iChr > 47 And iChr < 58) Or (iChr > 64 And iChr < 91) _
                    Or (iChr > 96 And iChr < 123) Or (iChr = 32) Then
                    iChrIndex = iChrIndex + 1
                    chrNewString(iChrIndex) = chrCharacters(i)
                End If
            End If
        Next i
        
        ReDim Preserve chrNewString(iChrIndex)
        CleanString3 = StrConv(chrNewString, vbUnicode)
      
    End Function
  3. Two things… First, consider this line from your originally posted CelareString function…

    If thisCharacter Like "[0-9,A-Z,a-z,"" ""]" Then

    First off, the text inside the square brackets is not supposed to be a comma delimited list… doing that simply allows thisCharacter to contain a comma. Second, if the "" "" was meant to allow space characters, there is no need to put the space character inside doubled up the quote marks. So, this line of codde will do the intended parsing check…

    If thisCharacter Like "[0-9A-Za-z ]" Then

    Now, with that said, I think the follow code will be faster…

    Function CleanStringRick(ByVal S As String) As String
      Dim X As Long
      Const GoodCharacters As String = "ABCDEFGHIJKLMNOPQRSTUVWXYZ" & _
                                       "abcdefghijklmnopqrstuvwxyz" & _
                                       "0123456789 "
      For X = 1 To Len(S)
        If InStr(GoodCharacters, Mid(S, X, 1)) = 0 Then Mid(S, X) = "@"
      Next
      CleanStringRick = Replace(S, "@", "")
    End Function

    You can recombine the GoodCharacters constant statement into a single, non-contiued line of code if you want (I only did that to stop word wrapping by this blog's comment processor.

  4. I'm a little late to the party here, but would like to propose a third option.

    Regular Expressions allow a more granular check and do away with the looping over characters. The drawback is the time it takes to create the object in the first place, however, for matching over a large number of iterations (whilst maintaining the initialised object), the performance of the object far outweighs the overhead of the creation.

    The performance benefit of the removal of the looping of the string is more obvious as we increase the string length:

    Dim c As Object
    Function CleanStringKyle(s As String) As String
        If c Is Nothing Then Set c = CreateObject("vbscript.regexp")
        With c
            .Pattern = "[^a-z0-9A-Z ]*"
            .ignorecase = True
            .Global = True
          CleanStringKyle = .Replace(s, vbNullString)
        End With
    End Function

    Using the initial test, but doubling the string (I've also increased the granularity of the timing):

    Over 1k iterations:

    Time using Build String Method: 34.087379923108
    Time using replace Method: 32.5274066247104
    Time using Rick: 35.7700627916026
    Time using Kyle: 15.0754284571971

    Over 10k iterations:

    Time using Build String Method: 227.984564474732
    Time using replace Method: 320.648584779373
    Time using Rick: 361.783509975619
    Time using Kyle: 139.360143217686
    

    Over 100k iterations:

    Time using Build String Method: 2246.0398275105
    Time using replace Method: 3271.33138544905
    Time using Rick: 3595.46983040199
    Time using Kyle: 1385.56685574352
    

    Since there is no loop in the VBA, predictably, this method favours longer strings, for example if we quadruple the original string:
    for 1k iterations:

    Time using Build String Method: 68.5683229037636
    Time using replace Method: 50.8209493347803
    Time using Rick: 70.2795636358444
    Time using Kyle: 21.3296005825804
  5. I get different results from both JP and Kyle's tests. I used the setup that JP posted and ran the test 3 times coming up with these same results (varying a miniscule amount test-to-test.

    Time using Build String Method: 1.108
    Time using Replace Method: 2.232
    Time using Rick's Method: 1.683
    Time using Kyle's Method: 1.687

    So I get your build string method to be twice as fast as teh replace method, not marginally slower and both Kyle and my codes are virtually equal in speed about midway between the string and replace methods. I'm guessing the speed difference have to do with our particular computer systems. For comparison, I am using Win7 (64-bit) SP1 on a system using an Intel Core2 Extreme CPU Q6800 running at 2.93GHz.

    This is the exact code I used for my tests

    Public Declare Function timeGetTime Lib "winmm.dll" () As Long
    
    Sub TestStringClean()
      Dim startTimeBuildString As Long
      Dim endTimeBuildString As Long
      Dim startTimeReplaceString As Long
      Dim endTimeReplaceString As Long
      Dim startTimeRick As Long
      Dim endTimeRick As Long
      Dim startTimeKyle As Long
      Dim EndTimeKyle As Long
      Dim i As Long
      Dim retVal As String
      Const numberofLoops As Long = 100000
      Const stringtobefixed As String = "abcdefgeuhuep9Y*&#PO#M@##}{{:><?><}"
      
      startTimeBuildString = timeGetTime
      For i = 1 To numberofLoops
        retVal = CleanString(stringtobefixed)
      Next i
      endTimeBuildString = timeGetTime
      
      startTimeReplaceString = timeGetTime
      For i = 1 To numberofLoops
        retVal = CleanString2(stringtobefixed)
      Next i
      endTimeReplaceString = timeGetTime
      
      startTimeRick = timeGetTime
      For i = 1 To numberofLoops
        retVal = CleanStringRick(stringtobefixed)
      Next i
      endTimeRick = timeGetTime
      
      startTimeKyle = timeGetTime
      For i = 1 To numberofLoops
        retVal = CleanStringRick(stringtobefixed)
      Next i
      EndTimeKyle = timeGetTime
      
      MsgBox "Time using Build String Method: " & _
          ((endTimeBuildString - startTimeBuildString) / 1000) & vbNewLine & _
          "Time using Replace Method: " & _
          ((endTimeReplaceString - startTimeReplaceString) / 1000) & vbNewLine & _
          "Time using Rick's Method: " & _
          ((endTimeRick - startTimeRick) / 1000) & vbNewLine & _
          "Time using Kyle's Method: " & _
          ((EndTimeKyle - startTimeKyle) / 1000)
    End Sub
    
    Function CleanString(inputString As String) As String
      Dim thisCharacter As String
      Dim i As Long
      For i = 1 To Len(inputString)
        ' grab each character in turn
        thisCharacter = Mid$(inputString, i, 1)
        If thisCharacter Like "[0-9,A-Z,a-z,"" ""]" Then
          CleanString = CleanString & thisCharacter
        End If
      Next i
    End Function
    
    Function CleanString2(inputString As String) As String
      Dim i As Long
      Dim thisCharacter As String
      ' assume valid string
      CleanString2 = inputString
      i = 1
      Do While i <= Len(CleanString2)
        thisCharacter = Mid$(CleanString2, i, 1)
        If Not thisCharacter Like "[0-9,A-Z,a-z,"" ""]" Then
          CleanString2 = Replace(CleanString2, thisCharacter, "")
        Else
          i = i + 1
        End If
      Loop
    End Function
    
    Function CleanStringRick(ByVal s As String) As String
      Dim X As Long
      Const GoodCharacters As String = "ABCDEFGHIJKLMNOPQRSTUVWXYZ" & _
                                       "abcdefghijklmnopqrstuvwxyz" & _
                                       "0123456789 "
      For X = 1 To Len(s)
        If InStr(GoodCharacters, Mid(s, X, 1)) = 0 Then Mid(s, X) = "@"
      Next
      CleanStringRick = Replace(s, "@", "")
    End Function
    
    Function CleanStringKyle(s As String) As String
      Dim c As Object
      If c Is Nothing Then Set c = CreateObject("vbscript.regexp")
      With c
        .Pattern = "[^a-z0-9A-Z ]*"
        .ignorecase = True
        .Global = True
        CleanStringKyle = .Replace(s, vbNullString)
      End With
    End Function
    • Are you sure you're calling my function and not your own again Rick? ;)

      Also, mine is only faster when re-using the initialized object, so if you don't want to keep c out of the function, you can use static.

      Function CleanStringKyle(s As String) As String
        Static c As Object
        If c Is Nothing Then Set c = CreateObject("vbscript.regexp")
        With c
          .Pattern = "[^a-z0-9A-Z ]*"
          .ignorecase = True
          .Global = True
          CleanStringKyle = .Replace(s, vbNullString)
        End With
      End Function
      
      • The timings I get with static and your code:
        Time using Build String Method: 1.138
        Time using Replace Method: 2.329
        Time using Rick's Method: 1.775
        Time using Kyle's Method: 1.104

      • >> Are you sure you're calling my function
        >> and not your own again Rick?

        The code I posted is the code I used for my tests… if you look at the Sub named CleanStringKyle, you will see it is your code although I had modified it to move the "Dim c As Object" statement inside the Sub… you originally had it declared globally. I just copy/pasted the code I posted into a module again and ran it… I got pretty close to the results I posted originally. Then I changed the Dim to Static, ran it again and got this result…

        Time using Build String Method: 1.102
        Time using Replace Method: 2.27
        Time using Rick's Method: 1.71
        Time using Kyle's Method: 1.715

        I do not have an explanation why, but I get your code and my code coming in at a virtual tie on every test I run. Besides the statistics I gave earlier about my computer, here is the stats about my copy of Excel from the Excel Options/Resources/about Microsoft Office Excel 2007…

        Microsoft Office Excel 2007 (12.0.6661.5000)
        SP2 MSO (12.0.6662.5000)
        Part of Microsoft Office Ultimate 2007

        • Rick, you are calling your own function twice ;):

            startTimeRick = timeGetTime
            For i = 1 To numberofLoops
              retVal = CleanStringRick(stringtobefixed)
            Next i
            endTimeRick = timeGetTime
             
            startTimeKyle = timeGetTime
            For i = 1 To numberofLoops
              retVal = CleanStringRick(stringtobefixed)
            Next i
            EndTimeKyle = timeGetTime

          Should be:

            startTimeRick = timeGetTime
            For i = 1 To numberofLoops
              retVal = CleanStringRick(stringtobefixed)
            Next i
            endTimeRick = timeGetTime
             
            startTimeKyle = timeGetTime
            For i = 1 To numberofLoops
              retVal = CleanStringKyle(stringtobefixed)
            Next i
            EndTimeKyle = timeGetTime

          That's why in your tests our functions provide almost identical results ;)

          The declaring of c at object level was intentional since it makes the code faster for >1 iteration, static does the same thing – the object is held in memory and not re-initalised. Without doing this, my code would be quite slow due to the initialization overhead. My function also increases in performance in comparison to the other methods due to the lack of a loop.

          My code comes in at < 1 second if you also remove the redundant

          .ignoreCase=True
        • My function also increases in performance in comparison to the other methods due to the lack of a loop.

          This should read: My function also increases in performance in comparison to the other methods as the string length increases due to the lack of a loop.

        • @Kyle,

          Well don't I feel like an idiot. I cannot tell you how many times I looked at that code and it just never registered with me that I used the same subroutine call line twice… even after you told me I had! Anyway, I no longer have things set up to test (and I am too lazy to set it up again), but I accept your timing result as posted. Thanks for having kept knocking me on the side of my head until I finally noticed what you were saying.

This article is closed to any future comments.
Random Data Generator