Lies, Damn Lies, Benchmarks and Carlos Perez

Carlos Perez is at it again. To one-up Cameron's benchmark (which found Java to be 10-15% faster than .NET), he schemed to create a benchmark that truly leaves .NET in the dust. From previous experience he knew that .NET's Regex implementation is much slower than Java's (though not quite as much as he made it seem). He also knows that .NET Regexes that use RegexOptions.Compiled cannot be unloaded from memory. I would assume he's smart enough to know that building a compiled Regex costs much more time than you gain back by running it just once.

The result: Carlos Perez' "Ultimate Java versus C# Benchmark".

Come on Carlos, do you really have to stoop this low to convince people to choose Java over .NET?

Comments

What workaround would you propose to fix this problem? Don't use RegexOptions.Compiled? Okay, try that and run the code again.

So, how did C# fare?

I don't think I've stooped any lower than this: http://crowbar.dnsalias.com:443/crowbar/000213.html

Posted by Carlos E. Perez at May 21, 2003 2:26 PM

I don't see how he stooped low.

Point one: he tested against MS .NET using the slowest JVMs (Sun's has consistently been slower than IBM's in recent months), not the fastest, as one of your links claimed.

He is testing against the fastest JVMs in the next version of the tests however :)

Even the stories you linked to yourself state that Java performs twice as well as MS .NET, so where is the stooping, huh?

Now, I could see the claim if Java had been 5 times faster than MS .NET according to Carlos' tests and that then got reduced to double, because that would be statistically significant. But is a difference of a factor of one statistically significant enough to base your claim of stooping on? Come on now... :)

Posted by Fred Grott at May 21, 2003 2:43 PM

I completely agree with you that the petstore debacle does not prove anything except for the fact that MSFT did a better job optimizing the problem domain. The fact that they still use that benchmark is indeed pretty low.

Workaround? Create a more real-life scenario. An application that creates a million semi-random Regexes to apply to the same string is hardly real-life. Real life would be a single Regex applied to a million different strings. Real life would be an application that also does non-computationally intensive work (like network or database access). How about a Java app that reads pages from freeroller.net and uses a regex to find all links, and a C# app that reads pages from dotnetweblogs.com and does the same? ;-)

You created that benchmark not as a "real-life" scenario, but to expose a specific part of .NET you already knew performed poorly compared to Java. Then you exaggerated that performance difference by introducing a known memory-leak in the main loop. That's pretty low in my book.

Posted by Luke Hutteman at May 21, 2003 2:47 PM
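
As a rough illustration of the "single Regex applied to many strings" scenario described above, here is a minimal Java sketch that compiles one pattern up front and reuses it to pull links out of several pages. The class name LinkFinder, the sample pages, and the deliberately naive href pattern are all assumptions made for illustration; none of this comes from either benchmark.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: one precompiled pattern, many input strings.
class LinkFinder {
    // Compiled once and reused for every page. The pattern is an assumption:
    // it only handles double-quoted href attributes and is not a real HTML parser.
    private static final Pattern HREF =
        Pattern.compile("href=\"([^\"]+)\"", Pattern.CASE_INSENSITIVE);

    // Returns every href value found in the given page text, in order.
    static List<String> findLinks(String page) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(page);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    public static void main(String[] args) {
        // Made-up sample pages standing in for downloaded weblog pages.
        String[] pages = {
            "<a href=\"http://example.com/a\">a</a> <a HREF=\"http://example.com/b\">b</a>",
            "<p>no links here</p>"
        };
        for (String page : pages) {
            System.out.println(findLinks(page));
        }
    }
}
```

Because the pattern lives in a static field, each additional page only pays for matching, not for building the pattern again.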

Matching a document, or possibly several documents, against a multitude of patterns, in this case regex patterns, is a real-life scenario.

The fact that I randomly generated the patterns is coincidental; after all, I needed a test dataset.

This technique is one way you can build a generalized spam filter, a news feed aggregator, etc. You are matching documents against a set of patterns, not the other way around. That's because of the nature of the problem: the documents are streamed to you, so you don't have the opportunity to do it the traditional way as you've suggested.

So, in short, it's real and it ain't contrived.

Posted by Carlos E. Perez at May 21, 2003 3:03 PM

Here's another question you've got to ask yourself.

The Regex class is a part of the core libraries of .NET. It's been known to be a poor performer previously (i.e. in 1.0). Why wasn't it fixed with version 1.1?

As a matter of fact, I benchmarked several other Java-based regex libraries (all open source), and all of them beat Microsoft's implementation. Can you explain to me why a paid Microsoft employee can't do better than a bunch of volunteers?

Face it, the technical sophistication of the CLR VM and the .NET libraries is completely lacking.

Posted by Carlos E. Perez at May 21, 2003 3:20 PM

I know you sometimes need to match documents against multiple patterns, but typically the number of patterns is constant, while the number of documents is not (and is therefore typically an order of magnitude greater than the number of patterns).

In case you didn't know - I've built a fairly popular news feed aggregator, and while it uses Regexes in several places, they're all pre-created with a constant pattern. If you dare try and use a real .NET application, you may want to give it a try - you might just like it ;-) (though I doubt you'll admit to that)

I'm not denying Java performs better than .NET in many cases. I'm just saying your benchmark that concludes Java is 7000x faster than C# is contrived.

Posted by Luke Hutteman at May 21, 2003 3:41 PM

"typically the number of patterns is constant"

So what's the value of that constant? Consider, say, DNA sequence matching: how many patterns do you think need a match? Doesn't 1,000,000 sound like a reasonable number?

"I'm just saying your benchmark that concludes Java is 7000x faster than C# is contrived."

Well duh!

Posted by Carlos E. Perez at May 21, 2003 3:53 PM

I have to side a little with Carlos here. Carlos chose an example that used M$'s notoriously slow regex package. However, I don't see why that is any worse than M$ choosing a problem space and then constraining the Java developers to optimize, not rewrite, a demo application. I think that the simple answer here is that Carlos' benchmark is perfectly valid for showing that Java's regex is 7,000x faster than .NET's. Also, I think that it is fairly conclusive that in fields that require a lot of string manipulation and matching (like biotech), Java is currently a clear choice over .NET. I am sure there are things that .NET does better than Java. However, I have yet to see a convincing argument (and I have tried it several times) to choose a .NET implementation over Java. I have, however, seen compelling cases to move from M$ technologies to Java. I think that all Carlos is attempting to do is fire a shot across the bow of the one attempt that M$ did make to make a case (the pet store).

BTW -- for a DNA application the patterns could easily vary as much as the dataset. I think, Luke, you are asking for an example that resembles the traditional garbage in garbage out IT application. You are essentially wanting a serious petstore comparison. I suggest you check out http://xpetstore.sourceforge.net and some of the other petstore (performance oriented impls) for that comparison.

Posted by Les at May 21, 2003 7:31 PM

Now let's see someone do Carlos' test on a comparably specced machine in Perl running on Linux/BSD...

Posted by Mark Allanson at May 22, 2003 1:54 AM

Hmmm.. Why would one compile regular expressions to match literal strings? Seems like overkill to me.

Posted by Doug at May 22, 2003 8:08 AM

At first I thought the alleged benchmark was a joke to show how things can go wrong when results are extrapolated, but unfortunately (for the author, I'll say), it seems to be sincere.

This code would never have survived any walk-throughs if I attended; real-world code would more likely have compiled the goal outside of the loop and then matched against the generated solution candidate inside.

Apart from sleazy overuse of boxing, .NET CLR 1.1 beats both Sun VMs, by 61% and 66% for 1.4.1 and 1.4.2-beta respectively, after moving one line and swapping the use of _doc and matchthis.

And yes, the Java program is smaller, mostly due to removal of whitespace. Disgusting.

Posted by Roland Kaufmann at May 22, 2003 1:27 PM

:]

That is definitely "The Ultimate Java Versus C# Benchmark". I better go switch to Java now...

Seriously Carlos, this is childish. A better name would be "The Ultimate Java Regex Versus C# Regex Benchmark"

But I bet you love the attention. FUD++;

Posted by Max at May 22, 2003 3:21 PM

"swapping the use of _doc and matchthis."

It's just unbelievable what people are suggesting to fix this problem.

(1) Changing the regex to string search.
(2) Matching a single regex multiple times.

Now this!

Simply unbelievable!

Posted by Carlos E. Perez at May 22, 2003 6:20 PM

"And yes, the Java program is smaller, mostly due to removal of whitespace. Disgusting."

Unbelievable! You can't see the sarcasm in the original statement? "BTW, did anyone notice? The Java version actually took less lines of code than the C# version, hmmm?"

Simply unbelievable!

Posted by Carlos E. Perez at May 22, 2003 6:22 PM

So Carlos... Did you have a point? Or are you just wasting blogspace and people's time?

Posted by Doug at May 22, 2003 6:54 PM

Doug,

Looks like I overlooked your question:

"Hmmm.. Why would one compile regular expressions to match literal strings? Seems like overkill to me."

No, the literal strings are just an example data set. I could change it to match something like A*C{3}(G|T)A+ and it'll work just fine. Sorry about that confusion.

Posted by Carlos E. Perez at May 23, 2003 10:50 AM
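
For readers who want to try the non-literal pattern mentioned above, the following is a small Java sketch, not the benchmark's code. It compiles A*C{3}(G|T)A+ once and checks it against a few made-up sequences; the class name DnaPatternDemo, the method names, and the sample strings are assumptions for illustration only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo of the pattern quoted in the comment above.
class DnaPatternDemo {
    // Zero or more A, exactly three C, then G or T, then one or more A.
    static final Pattern MOTIF = Pattern.compile("A*C{3}(G|T)A+");

    // True only if the whole sequence has the shape described by MOTIF.
    static boolean matchesWhole(String seq) {
        return MOTIF.matcher(seq).matches();
    }

    // Returns the first substring that fits MOTIF, or null if there is none.
    static String firstHit(String seq) {
        Matcher m = MOTIF.matcher(seq);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Made-up sequences, not real data.
        System.out.println(matchesWhole("AACCCGA"));  // true
        System.out.println(matchesWhole("ACCGA"));    // false: only two C in a row
        System.out.println(firstHit("GGACCCTAT"));    // ACCCTA
        System.out.println(firstHit("GATTACA"));      // null
    }
}
```

matches() requires the entire input to fit the pattern, while find() scans for the first fitting substring, which is closer to searching a streamed document for a motif.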

Carlos,
Excellent job! Contact me immediately!

Johnny Rocket
Sun Marketroid BS Division

Posted by Johnny Rocket at May 23, 2003 11:53 AM

Let AMAZON be the referee in the 'Java vs. .NET' debate -
http://dotnetweblogs.com/sbchatterjee/posts/6581.aspx

Posted by SBC at May 26, 2003 8:35 PM

I managed to speed up the .NET version by using Regex.IsMatch (static) and removing the Compiled option as it should never be used for this scenario anyway. The .NET regex engine still isn't as fast, but I'll live with that, it's acceptable for what I need to do most of the time and I get the MatchEvaluator which is incredibly useful.

What I think is interesting is using the right tool for the job; for this particular instance, IndexOf would be far better suited in both languages. On my machine the Java version returns 6299 milliseconds and the .NET version 3054 milliseconds. So for general usage .NET is fast enough for me :)

Posted by Duncan Godwin at May 28, 2003 7:14 PM
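
The "right tool for the job" point above can be sketched in Java, where String.indexOf plays the role of .NET's IndexOf. This is an illustration with hypothetical names (LiteralSearch, the sample strings), not the code that was timed above: for a literal needle, a plain substring search and a quoted regex give the same answer, but only the regex route has to build a pattern first.

```java
import java.util.regex.Pattern;

// Hypothetical comparison of two ways to look for a literal string.
class LiteralSearch {
    // Plain substring search: no pattern object is built at all.
    static boolean containsLiteral(String doc, String needle) {
        return doc.indexOf(needle) >= 0;
    }

    // Regex route: Pattern.quote makes every character in the needle literal,
    // so the result agrees with indexOf, at the cost of compiling a pattern.
    static boolean containsViaRegex(String doc, String needle) {
        return Pattern.compile(Pattern.quote(needle)).matcher(doc).find();
    }

    public static void main(String[] args) {
        String doc = "the quick brown fox";  // made-up sample text
        System.out.println(containsLiteral(doc, "brown"));   // true
        System.out.println(containsViaRegex(doc, "brown"));  // true
        // Without quoting, "a.c" as a regex would match "abc"; quoted, it does not.
        System.out.println(containsLiteral("abc", "a.c"));   // false
        System.out.println(containsViaRegex("abc", "a.c"));  // false
    }
}
```

No timings are claimed here; the sketch only shows that the two approaches agree on literal input.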

Duncan,

Care to share with us the results of your regex test?

Posted by Carlos E. Perez at May 29, 2003 11:22 AM

Why not do a benchmark with Swing vs Windows.Forms?

Posted by Brian Takita at August 26, 2003 5:58 PM