First off, you should know that this program still needs a lot of work. Once it matures, it should help you find cases of cut-and-paste programming, especially in large programming projects. It has already helped me a lot in my current project, so it's not completely useless, but its output isn't very readable yet and it still uses too much memory.
If you have ever maintained larger pieces of code, I'm sure you've run into the evils of replicated code. Copying a small piece of code once or twice may be a convenient way to get your program running, but having duplicates of complex or frequently recurring code to maintain can seriously bog down development and debugging.
Unpaste aims to detect such replication, even when small changes have been made in one copy but not in another: use of "const" in C and C++, indentation, renamed variables, and even straightforward porting between similar languages. The program accounts for such changes by ignoring as many details as it can. Input is converted to a canonical, simplified, C++-like internal format that hides differences between languages, and variable names are ignored when making the first rough matches between lines of code.
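To give a feel for what "ignoring details" means, here is a minimal sketch in Python. It is not unpaste's actual internal format or code; the `canonical` function, the `KEYWORDS` set and the `ID` placeholder are all made up for illustration. It drops indentation and "const", and replaces every identifier with the same placeholder, so that two lines differing only in those details come out identical.

```python
import re

# Hypothetical, deliberately tiny keyword list; a real tool would need
# a complete list for every language it supports.
KEYWORDS = {"if", "else", "for", "while", "return",
            "int", "char", "float", "double", "void"}

# An identifier or keyword, a run of digits, or any single
# non-whitespace character (operators, brackets, semicolons).
TOKEN = re.compile(r"[A-Za-z_]\w*|\d+|\S")

def canonical(line: str) -> str:
    """Reduce one line of C-like code to a rough canonical form."""
    out = []
    for tok in TOKEN.findall(line):
        if tok == "const":
            continue              # const qualifiers are ignored
        if tok in KEYWORDS:
            out.append(tok)       # keywords keep their identity
        elif tok[0].isalpha() or tok[0] == "_":
            out.append("ID")      # all variable names look alike
        else:
            out.append(tok)       # numbers and punctuation stay
    # Joining with single spaces discards the original indentation.
    return " ".join(out)
```

With this sketch, `canonical("    const int total = count + 1;")` and `canonical("int sum=n+1;")` both give `int ID = ID + 1 ;`, so the two lines would count as a rough match.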
Scalability is a prime goal for the matching algorithm. The price is speed: the program is slow because it reads its input files many times over. In return, it should eventually be able to handle inputs that are much larger than available memory.
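The trade between passes and memory can be illustrated with a small Python sketch. This is an assumption about one way to do it, not a description of unpaste's matching algorithm; the function name `find_duplicate_lines` and the `passes` parameter are hypothetical. Each pass re-reads every file but only remembers the lines whose checksum falls into that pass's slice, so roughly one slice's worth of lines is held in memory at a time.

```python
import zlib
from collections import defaultdict

def find_duplicate_lines(paths, passes=4):
    """Yield (text, [(path, line_no), ...]) for lines seen more than once.

    The files are read `passes` times. More passes means slower runs
    but less memory, because each pass only tracks one slice of lines.
    """
    for current in range(passes):
        seen = defaultdict(list)  # holds only this pass's slice
        for path in paths:
            with open(path, encoding="utf-8", errors="replace") as f:
                for line_no, line in enumerate(f, start=1):
                    text = " ".join(line.split())  # ignore spacing
                    if not text:
                        continue                   # skip blank lines
                    # Identical text always gets the same checksum, so
                    # all copies of a line land in the same pass.
                    if zlib.crc32(text.encode("utf-8")) % passes != current:
                        continue
                    seen[text].append((path, line_no))
        for text, places in seen.items():
            if len(places) > 1:
                yield text, places
```

Calling `list(find_duplicate_lines(["a.c", "b.c"], passes=8))` would read both files eight times. A more frugal version could store only checksums rather than the line text, at the cost of having to re-check candidate matches afterwards.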
For more information about unpaste, see its SourceForge page. The source code is also available from there.
Jeroen's Home Page