While looking at a few methods for parsing a string I came across a method for removing unwanted characters from a string. It involves looping through a string and using the Like Operator to determine if any given character is wanted.
It goes something like this:
Function CleanString(inputString As String) As String
Dim thisCharacter As String
Dim i As Long
For i = 1 To Len(inputString)
' grab each character in turn
thisCharacter = Mid$(inputString, i, 1)
If thisCharacter Like "[0-9,A-Z,a-z,"" ""]" Then
CleanString = CleanString & thisCharacter
End If
Next i
End FunctionI like this because it doesn't require any workarounds I usually use, like creating an array of "bad" characters and comparing each character in the input string with each "bad" character to see if there's a match. That could be a lot of looping.
I thought there might be a faster way to do the same thing. Building the cleaned string one character at a time must be (relatively) time consuming, right?
So here's what I came up with. It assumes that the input string is valid, then uses the Replace function to strip out unwanted characters. It still loops through the string — that's the only way to avoid having to specify which characters I don't want, which is everything but letters, numbers and spaces.
Function CleanString2(inputString As String) As String
Dim i As Long
Dim thisCharacter As String
' assume valid string
CleanString2 = inputString
i = 1
Do While i <= Len(CleanString2)
thisCharacter = Mid$(CleanString2, i, 1)
If Not thisCharacter Like "[0-9,A-Z,a-z,"" ""]" Then
CleanString2 = Replace(CleanString2, thisCharacter, "")
Else
i = i + 1
End If
Loop
End FunctionA Do While Loop is required because the string changes size whenever the Replace Function is used. If we used a For Loop, eventually the loop counter would be greater than the size of the string (assuming we did some replacements) and the Mid Function would throw an error. The Len Keyword dynamically changes with the size of the string.
This should be faster, because the more bad characters there are, the less iterations, right? Let's find out!
The Test
I reused a simple procedure that runs both of the above functions and times the result.
Public Declare Function timeGetTime Lib "winmm.dll" () As Long
Sub TestStringClean()
Dim startTimeBuildString As Long
Dim endTimeBuildString As Long
Dim startTimeReplaceString As Long
Dim endTimeReplaceString As Long
Dim i As Long
Dim retVal As String
Const numberofLoops As Long = 100000
Const stringtobefixed As String = "abcdefgeuhuep9Y*&#PO#M@##}{{:><?><}"
startTimeBuildString = timeGetTime
For i = 1 To numberofLoops
retVal = CleanString(stringtobefixed)
Next i
endTimeBuildString = timeGetTime
startTimeReplaceString = timeGetTime
For i = 1 To numberofLoops
retVal = CleanString2(stringtobefixed)
Next i
endTimeReplaceString = timeGetTime
MsgBox "Time using Build String Method: " & ((endTimeBuildString - startTimeBuildString) / 1000) & vbNewLine & _
"Time using Replace Method: " & ((endTimeReplaceString - startTimeReplaceString) / 1000)
End SubThe Results
My assumption is that my method will be faster. Let's start at 10,000 iterations.

Huh? Looks like the method I found is faster than the one I wrote. Let's try 100,000 iterations.

Clearly the "build-as-you-go" method is faster. Just for kicks let's try 1,000,000 iterations, but at this point I assume my method will be slower.

Conclusion
My method was considerably slower. I still like it as an example for checking for negative conditions, but in general I'm going to stick with the method I found somewhere online. Maybe it would be faster if rewritten to build the string instead of using the Replace Function?
I wonder if it is slower your way because Instr already loops through each character? In which case, you would be looping an additional number of times equal to the number of "dirty" characters (10 in your example if I count correctly).
I wonder, have you ever played around with the timing methods themselves? The researcher in me wonders if it would not be better to randomize the order of calling while encapsulating as much as possible. so your code might be something like:
Sub CompareMethods() Dim i As Long Dim tMethod1 As Single, tMethod2 As Single Dim Method1Count As Long, Method2Count As Long Dim normMethod1 As Single, normMethod2 As Single Dim whichMethod As Single For i = 1 To 20000 whichMethod = Rnd ' 50/50 chance of method 1 or method 2 If whichMethod > 0.5 Then tMethod1 = tMethod1 + Method1 Method1Count = Method1Count + 1 Else tMethod2 = tMethod2 + Method2 Method2Count = Method2Count + 1 End If Next i ' Normalize totals to account for randomness normMethod1 = tMethod1 / Method1Count * 10000 normMethod2 = tMethod2 / Method2Count * 10000 MsgBox "Time using Method 1: " & (normMethod1 / 1000) & vbNewLine & _ "Time using Method 2: " & (normMethod2 / 1000) End Sub Function Method1() As Single Dim t As Single t = Timer 'Stuff Method1 = Timer - t End Function Function Method2() As Single Dim t As Single t = Timer 'Stuff Method2 = Timer - t End FunctionSince you won't get an exactly 50/50 distribution you can't simply show the sum of the times, which is why I normalized it. It wouldn't be too difficult to force it to be even, though that could invalidate the whole point of randomizing it in the first place.
I don't imagine it will make a difference for this particular comparison, but you do so many speed tests that eventually you are bound to find one where order could make a difference. Just curious.
I wonder if it would be faster and safer to just convert the string to a byte array then loop through the array. Well, I checked it and it is much slower. I was a bit surprised. I think doing the concatenating might get pretty slow with very large strings – but from doing what I did, it didn't seem to be that big of a deal. So the first code seems best.
Here's what I threw together. (So you will know not to bother).
Function CleanString3(inputString As String) As String Dim chrCharacters() As Byte Dim chrNewString() As Byte Dim i As Long, iChrIndex As Integer, iChr As Integer chrCharacters = inputString ReDim chrNewString(Len(inputString) - 1) iChrIndex = -1 For i = 0 To UBound(chrCharacters) ' grab each character in turn If chrCharacters(i) > 0 Then iChr = chrCharacters(i) If (iChr > 47 And iChr < 58) Or (iChr > 64 And iChr < 91) _ Or (iChr > 96 And iChr < 123) Or (iChr = 32) Then iChrIndex = iChrIndex + 1 chrNewString(iChrIndex) = chrCharacters(i) End If End If Next i ReDim Preserve chrNewString(iChrIndex) CleanString3 = StrConv(chrNewString, vbUnicode) End FunctionTwo things… First, consider this line from your originally posted CelareString function…
First off, the text inside the square brackets is not supposed to be a comma delimited list… doing that simply allows thisCharacter to contain a comma. Second, if the "" "" was meant to allow space characters, there is no need to put the space character inside doubled up the quote marks. So, this line of codde will do the intended parsing check…
Now, with that said, I think the follow code will be faster…
Function CleanStringRick(ByVal S As String) As String Dim X As Long Const GoodCharacters As String = "ABCDEFGHIJKLMNOPQRSTUVWXYZ" & _ "abcdefghijklmnopqrstuvwxyz" & _ "0123456789 " For X = 1 To Len(S) If InStr(GoodCharacters, Mid(S, X, 1)) = 0 Then Mid(S, X) = "@" Next CleanStringRick = Replace(S, "@", "") End FunctionYou can recombine the GoodCharacters constant statement into a single, non-contiued line of code if you want (I only did that to stop word wrapping by this blog's comment processor.
I'm a little late to the party here, but would like to propose a third option.
Regular Expressions allow a more granular check and do away with the looping over characters. The drawback is the time it takes to create the object in the first place, however, for matching over a large number of iterations (whilst maintaining the initialised object), the performance of the object far outweighs the overhead of the creation.
The performance benefit of the removal of the looping of the string is more obvious as we increase the string length:
Dim c As Object Function CleanStringKyle(s As String) As String If c Is Nothing Then Set c = CreateObject("vbscript.regexp") With c .Pattern = "[^a-z0-9A-Z ]*" .ignorecase = True .Global = True CleanStringKyle = .Replace(s, vbNullString) End With End FunctionUsing the initial test, but doubling the string (I've also increased the granularity of the timing):
Over 1k iterations:
Over 10k iterations:
Over 100k iterations:
Since there is no loop in the VBA, predictably, this method favours longer strings, for example if we quadruple the original string:
for 1k iterations:
I get different results from both JP and Kyle's tests. I used the setup that JP posted and ran the test 3 times coming up with these same results (varying a miniscule amount test-to-test.
Time using Build String Method: 1.108
Time using Replace Method: 2.232
Time using Rick's Method: 1.683
Time using Kyle's Method: 1.687
So I get your build string method to be twice as fast as teh replace method, not marginally slower and both Kyle and my codes are virtually equal in speed about midway between the string and replace methods. I'm guessing the speed difference have to do with our particular computer systems. For comparison, I am using Win7 (64-bit) SP1 on a system using an Intel Core2 Extreme CPU Q6800 running at 2.93GHz.
This is the exact code I used for my tests
Public Declare Function timeGetTime Lib "winmm.dll" () As Long Sub TestStringClean() Dim startTimeBuildString As Long Dim endTimeBuildString As Long Dim startTimeReplaceString As Long Dim endTimeReplaceString As Long Dim startTimeRick As Long Dim endTimeRick As Long Dim startTimeKyle As Long Dim EndTimeKyle As Long Dim i As Long Dim retVal As String Const numberofLoops As Long = 100000 Const stringtobefixed As String = "abcdefgeuhuep9Y*&#PO#M@##}{{:><?><}" startTimeBuildString = timeGetTime For i = 1 To numberofLoops retVal = CleanString(stringtobefixed) Next i endTimeBuildString = timeGetTime startTimeReplaceString = timeGetTime For i = 1 To numberofLoops retVal = CleanString2(stringtobefixed) Next i endTimeReplaceString = timeGetTime startTimeRick = timeGetTime For i = 1 To numberofLoops retVal = CleanStringRick(stringtobefixed) Next i endTimeRick = timeGetTime startTimeKyle = timeGetTime For i = 1 To numberofLoops retVal = CleanStringRick(stringtobefixed) Next i EndTimeKyle = timeGetTime MsgBox "Time using Build String Method: " & _ ((endTimeBuildString - startTimeBuildString) / 1000) & vbNewLine & _ "Time using Replace Method: " & _ ((endTimeReplaceString - startTimeReplaceString) / 1000) & vbNewLine & _ "Time using Rick's Method: " & _ ((endTimeRick - startTimeRick) / 1000) & vbNewLine & _ "Time using Kyle's Method: " & _ ((EndTimeKyle - startTimeKyle) / 1000) End Sub Function CleanString(inputString As String) As String Dim thisCharacter As String Dim i As Long For i = 1 To Len(inputString) ' grab each character in turn thisCharacter = Mid$(inputString, i, 1) If thisCharacter Like "[0-9,A-Z,a-z,"" ""]" Then CleanString = CleanString & thisCharacter End If Next i End Function Function CleanString2(inputString As String) As String Dim i As Long Dim thisCharacter As String ' assume valid string CleanString2 = inputString i = 1 Do While i <= Len(CleanString2) thisCharacter = Mid$(CleanString2, i, 1) If Not thisCharacter Like "[0-9,A-Z,a-z,"" ""]" Then CleanString2 = Replace(CleanString2, thisCharacter, "") Else i = i + 1 End If Loop End Function Function CleanStringRick(ByVal s As String) As String Dim X As Long Const GoodCharacters As String = "ABCDEFGHIJKLMNOPQRSTUVWXYZ" & _ "abcdefghijklmnopqrstuvwxyz" & _ "0123456789 " For X = 1 To Len(s) If InStr(GoodCharacters, Mid(s, X, 1)) = 0 Then Mid(s, X) = "@" Next CleanStringRick = Replace(s, "@", "") End Function Function CleanStringKyle(s As String) As String Dim c As Object If c Is Nothing Then Set c = CreateObject("vbscript.regexp") With c .Pattern = "[^a-z0-9A-Z ]*" .ignorecase = True .Global = True CleanStringKyle = .Replace(s, vbNullString) End With End FunctionAre you sure you're calling my function and not your own again Rick?
Also, mine is only faster when re-using the initialized object, so if you don't want to keep c out of the function, you can use static.
Function CleanStringKyle(s As String) As String Static c As Object If c Is Nothing Then Set c = CreateObject("vbscript.regexp") With c .Pattern = "[^a-z0-9A-Z ]*" .ignorecase = True .Global = True CleanStringKyle = .Replace(s, vbNullString) End With End FunctionThe timings I get with static and your code:
Time using Build String Method: 1.138
Time using Replace Method: 2.329
Time using Rick's Method: 1.775
Time using Kyle's Method: 1.104
>> Are you sure you're calling my function
>> and not your own again Rick?
The code I posted is the code I used for my tests… if you look at the Sub named CleanStringKyle, you will see it is your code although I had modified it to move the "Dim c As Object" statement inside the Sub… you originally had it declared globally. I just copy/pasted the code I posted into a module again and ran it… I got pretty close to the results I posted originally. Then I changed the Dim to Static, ran it again and got this result…
Time using Build String Method: 1.102
Time using Replace Method: 2.27
Time using Rick's Method: 1.71
Time using Kyle's Method: 1.715
I do not have an explanation why, but I get your code and my code coming in at a virtual tie on every test I run. Besides the statistics I gave earlier about my computer, here is the stats about my copy of Excel from the Excel Options/Resources/about Microsoft Office Excel 2007…
Microsoft Office Excel 2007 (12.0.6661.5000)
SP2 MSO (12.0.6662.5000)
Part of Microsoft Office Ultimate 2007
Rick, you are calling your own function twice
:
startTimeRick = timeGetTime For i = 1 To numberofLoops retVal = CleanStringRick(stringtobefixed) Next i endTimeRick = timeGetTime startTimeKyle = timeGetTime For i = 1 To numberofLoops retVal = CleanStringRick(stringtobefixed) Next i EndTimeKyle = timeGetTimeShould be:
startTimeRick = timeGetTime For i = 1 To numberofLoops retVal = CleanStringRick(stringtobefixed) Next i endTimeRick = timeGetTime startTimeKyle = timeGetTime For i = 1 To numberofLoops retVal = CleanStringKyle(stringtobefixed) Next i EndTimeKyle = timeGetTimeThat's why in your tests our functions provide almost identical results
The declaring of c at object level was intentional since it makes the code faster for >1 iteration, static does the same thing – the object is held in memory and not re-initalised. Without doing this, my code would be quite slow due to the initialization overhead. My function also increases in performance in comparison to the other methods due to the lack of a loop.
My code comes in at < 1 second if you also remove the redundant
This should read: My function also increases in performance in comparison to the other methods as the string length increases due to the lack of a loop.
@Kyle,
Well don't I feel like an idiot. I cannot tell you how many times I looked at that code and it just never registered with me that I used the same subroutine call line twice… even after you told me I had! Anyway, I no longer have things set up to test (and I am too lazy to set it up again), but I accept your timing result as posted. Thanks for having kept knocking me on the side of my head until I finally noticed what you were saying.