Towards a Deduplication Framework utilising Apache Spark
Niklas Wilcke
This paper presents a new framework called DeduPlication (DduP). DduP aims to solve large-scale deduplication problems on arbitrary data tuples, and in doing so tries to bridge the gap between big data, high performance, and duplicate detection.