← Back to Blog

Tools and Tips for Removing Duplicates in Large Data Sets

Discover tools and tips for efficiently removing duplicates in large data sets, with Australian tax and accounting examples for financial data cleaning.

Kate Cui, CPA

Introduction

Working with large data sets often comes with the challenge of handling duplicate entries. Whether you're managing accounts payable data for BAS preparation, cleaning a CRM export for a client database, or reconciling bank transaction files, efficiently removing duplicates is crucial for data accuracy and reliability. This guide covers the essential tools and techniques with Australian business use cases.


Why Deduplication Matters for Australian Businesses

Duplicate data doesn't just waste storage - it causes real financial errors:

  • BAS and GST reporting: Duplicate invoices inflate input tax credits, leading to incorrect BAS lodgements and potential ATO penalties
  • Payroll: Duplicate employee records can result in overpaid wages or incorrect PAYG withholding
  • Accounts receivable: Duplicate customer invoices mean overstating revenue and sending clients duplicate bills - not a good look
  • Inventory management: Duplicate SKU entries lead to incorrect stock counts and reorder triggers

The cost of bad data is conservatively estimated at 15-25% of revenue for companies with poor data hygiene. For an Australian SME turning over $2M, that's $300K-$500K in hidden costs from incorrect decisions based on dirty data.


Tools for Removing Duplicates

Microsoft Excel

  • Remove Duplicates Feature: Excel offers a straightforward way via Data > Remove Duplicates. Excellent for small to medium data sets (up to 10,000 rows).
  • Power Query (Get & Transform): For advanced users, Power Query provides robust deduplication options including fuzzy matching - ideal for Australian company name variations ("ABC Pty Ltd" vs "ABC Pty Limited").
  • Conditional Formatting: Use Home > Conditional Formatting > Highlight Cell Rules > Duplicate Values to visually review before removing. This is particularly helpful when you're unsure whether certain records are genuine duplicates.

Python with Pandas

The drop_duplicates() method is highly effective for large data sets:

import pandas as pd
df = pd.read_csv("your_data.csv")
df_cleaned = df.drop_duplicates(subset=["Invoice_Number"])
df_cleaned.to_csv("cleaned_data.csv", index=False)

For fuzzy matching, combine with the fuzzywuzzy library:

from fuzzywuzzy import fuzz, process

# Find near-duplicate company names
matches = process.extract("ABC Pty Ltd", company_names, limit=5, scorer=fuzz.token_sort_ratio)
matches_above_80 = [m for m in matches if m[1] >= 80]

SQL

The DISTINCT keyword filters out duplicate rows. For more complex scenarios, use ROW_NUMBER() with partitioning:

WITH deduped AS (
  SELECT *, ROW_NUMBER() OVER (PARTITION BY invoice_no ORDER BY created_at DESC) as rn
  FROM transactions
)
SELECT * FROM deduped WHERE rn = 1;

This approach is particularly useful for Australian GST reporting where you need to deduplicate supplier invoices but keep the most recent version (in case of amendments or credit notes).

OpenRefine

An open-source tool designed for data cleaning. Its faceting and clustering features make it easy to spot and eliminate duplicates with a visual interface. Particularly effective at catching fuzzy duplicates that exact-match approaches miss.


Choosing the Right Tool

Data SizeRecommended ToolBest For
< 10,000 rowsExcel Remove DuplicatesOne-off cleaning, non-technical users
10K - 100K rowsExcel Power QueryRegular monthly cleaning workflows
100K - 1M rowsPython PandasAutomated data pipelines, repeatable jobs
1M+ rowsSQL databaseProduction databases, transactional data
Any size (visual)OpenRefineExploration and fuzzy matching

Worked Example: Cleaning an Australian Accounts Payable Database

Consider a Melbourne-based construction business processing 50,000 supplier invoices per year. The data has:

  • 4,200 duplicate invoice numbers: Same supplier invoice entered twice due to MYOB/Xero import glitches - inflating input tax credits on BAS by approximately $52,500 (assuming 10% GST on average $1,250 invoices)
  • 800 near-duplicate supplier names: "ABC Electrical Pty Ltd" vs "ABC Electrical" vs "A.B.C. Electrical" - causing the AP team to set up duplicate supplier records
  • 300 transactions with duplicate line items but different batch IDs: Same materials cost recorded against both the project cost code and the general expense code

The Solution

Step 1: Use Excel's Remove Duplicates on the Invoice Number column to remove the 4,200 exact duplicates. This immediately corrects the BAS position.

Step 2: Export supplier names to Python and use fuzzywuzzy's token_sort_ratio to identify near-duplicate names. Merge the 800 near-duplicates into 620 unique suppliers, consolidating 180 duplicate supplier records.

Step 3: Use SQL's ROW_NUMBER() to handle the batch ID scenario - keep only the most recent entry per invoice-project combination.

Result: The AP team saves approximately 8 hours per month in reconciliation time. The BAS is now accurate. Supplier payment runs no longer include duplicate amounts.

Note: The above figures are illustrative. Actual duplication rates depend on data entry processes and system integration quality.


Tips for Effectively Removing Duplicates

  1. Understand Your Data: Before removing, understand which columns should be unique and which can have repeated values. An invoice number should be unique; a customer name may legitimately appear multiple times.
  2. Backup Your Data: Always create a backup before performing deduplication. For critical financial data, maintain a version history of your clean data set.
  3. Use Conditional Formatting to Preview: Visually review duplicates before removal - especially when you're uncertain about what constitutes a true duplicate.
  4. Validate Results: Cross-check against the original data set to ensure nothing was incorrectly removed. For financial data, run a control total (sum of invoice amounts before vs after) as a sanity check.
  5. Establish Prevention: Set up data validation rules to prevent duplicates at the point of entry. In Xero or MYOB, enable duplicate invoice detection. In SQL, add unique constraints on invoice number columns. Preventing duplicates is always more efficient than cleaning them later.
  6. Document Your Logic: When running deduplication on financial data, document which columns you deduplicated on and the rule (keep first, keep most recent, etc.). This matters if the ATO asks about your data controls.

Frequently Asked Questions

How can I remove duplicates in Excel without losing any data?

Use the Remove Duplicates feature under the Data tab. Select only the columns that define uniqueness to avoid removing legitimate records. Always back up your data first and validate results against the original set.

What's the advantage of using Python's Pandas library for removing duplicates?

Pandas handles large data sets efficiently (100K to 1M+ rows), provides extensive customisation via the subset and keep parameters, and integrates well with other data processing workflows.

Can Google Sheets handle large data sets for removing duplicates?

Google Sheets works well for small to moderately sized data sets (under 50,000 rows) but may struggle with very large ones. For larger data sets, use Excel Power Query, Python Pandas, or a SQL database.

How does the SQL ROW_NUMBER function help in removing duplicates?

ROW_NUMBER assigns a unique sequential number to each row within a partition, making it easy to identify duplicates by partition and keep only the first (or most recent) record.

Is there a tool that provides a visual approach to deduplication?

Yes, OpenRefine offers a visual and interactive approach to cleaning and deduplicating data with faceting and clustering features that catch fuzzy matches.

How do I handle near-duplicates like 'ABC Pty Ltd' vs 'ABC Pty Limited'?

Use Excel Power Query's Fuzzy Grouping, OpenRefine's clustering algorithms, or a Python fuzzy matching library like fuzzywuzzy. These tools handle variations in company names, addresses, and other text fields.

What are the risks of incorrect deduplication for Australian BAS or tax reporting?

Removing legitimate transactions during deduplication can cause BAS or tax return errors - overstated or understated GST, incorrect PAYG withholding totals, and unreconcilable ATO lodgements. Always validate against original source data.


Conclusion

Removing duplicates from large data sets is a critical skill for any business handling significant volumes of transaction data. The right tool depends on your data size, technical capability, and how often you need to run the process. For Australian businesses in particular, accurate deduplication directly impacts BAS accuracy, GST reporting, and financial statement integrity - making it a compliance concern as much as a data quality one.

By leveraging these tools and tips, you can ensure your data is clean, accurate, and ready for analysis or reporting. Start with the simplest tool that fits your data volume, and level up to Python or SQL when your data grows beyond Excel's capabilities.