
Designing a Tool for Searching Identifiers Across 100K+ Text Files

Currently I'm working with a large git repository. The local clone is a couple of gigabytes in size and contains on the order of hundreds of thousands of files, mostly plain-text source code. I expect it to hold on the order of millions of lines of code, although I haven't run cloc on it.

I'd like to quickly search the whole repository for all occurrences of a given identifier. When refactoring, this would let me be sure I've addressed every reference to the component I'm changing, and it would also make it easier to build a mental picture of the overall dependency chain.

I tried the usual approaches

I've tried the usual tools: searching with VS Code, grep from the command line, and BareGrep on Windows. All of them choke on the sheer number and size of the files. BareGrep performed the best, but it still takes tens of minutes to produce the full results.

Given that the whole repository can easily fit in RAM, and how fast and highly parallel modern processors are, it's kind of absurd that I can't get results in under a second. Well, I understand the “why”: opening and reading each file, no matter how small, requires calls to the OS, which in turn may have to hit the disk, and the aggregate overhead across hundreds of thousands of files is significant.
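To make the bottleneck concrete, here is a minimal sketch (in Python, with hypothetical names; not how any of the tools above are actually implemented) of what a per-file scan boils down to: one open and read per file, repeated hundreds of thousands of times.

    import os

    def naive_search(root, identifier):
        """Walk the tree and scan every file: one open()/read() per file."""
        hits = []
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                try:
                    with open(path, "r", errors="ignore") as f:
                        for lineno, line in enumerate(f, start=1):
                            if identifier in line:
                                hits.append((path, lineno, line.rstrip()))
                except OSError:
                    continue  # unreadable file, skip it
        return hits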

Thoughts and the solution

The best approach I can think of is indexing all the identifiers across all the source files ahead of time. While building the index may take tens of minutes, it should be an infrequent operation.
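As a rough sketch of what I mean by an index (the tokenization rule and helper names below are my own assumptions, not a finished design): split each file into identifier-like tokens and build an inverted map from identifier to the set of files containing it.

    import os
    import re
    from collections import defaultdict

    # Assumed definition of an "identifier": letters, digits, underscores.
    IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")

    def index_file(path):
        """Return the set of identifiers occurring in a single file."""
        with open(path, "r", errors="ignore") as f:
            return set(IDENTIFIER.findall(f.read()))

    def build_index(root):
        """Map each identifier to the set of files that contain it."""
        index = defaultdict(set)
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                try:
                    for ident in index_file(path):
                        index[ident].add(path)
                except OSError:
                    continue  # unreadable file, skip it
        return index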

Also, the index should support incremental updates. The vast majority of the files don't change over time, so I'd like to exploit that to keep the index in sync with the source files as painlessly as possible.
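One way to support that, sketched below under the assumption that file modification times are a good-enough change signal (a content hash would be more robust but slower), is to record the mtime seen at indexing time and only re-index files whose mtime has moved since.

    import os

    def changed_files(root, seen_mtimes):
        """Yield files that are new or modified since the last indexing pass.

        seen_mtimes is an assumed persisted map of path -> mtime recorded
        when the file was last indexed.
        """
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                try:
                    mtime = os.path.getmtime(path)
                except OSError:
                    continue  # file vanished or is unreadable
                if seen_mtimes.get(path) != mtime:
                    yield path, mtime

When a changed file is re-indexed, its old entries would be dropped from the index and replaced with the freshly extracted identifiers.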

Indexing a particular file is mostly independent of indexing the other files, so this can be done in parallel across however many cores are available, cutting down the total time taken.
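A sketch of that parallel pass, assuming a process pool sized to the machine's core count and the same hypothetical index_file helper as above (repeated here so the snippet stands on its own):

    import os
    import re
    from collections import defaultdict
    from concurrent.futures import ProcessPoolExecutor

    IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")

    def index_file(path):
        """Return the set of identifiers in one file (same helper as above)."""
        with open(path, "r", errors="ignore") as f:
            return set(IDENTIFIER.findall(f.read()))

    def build_index_parallel(paths, workers=None):
        """Index files across processes; merging the per-file sets is cheap."""
        index = defaultdict(set)
        with ProcessPoolExecutor(max_workers=workers or os.cpu_count()) as pool:
            results = pool.map(index_file, paths, chunksize=64)
            for path, identifiers in zip(paths, results):
                for ident in identifiers:
                    index[ident].add(path)
        return index

The per-file work (reading and tokenizing) dominates, while merging the resulting sets into one map is cheap, so the workload should scale reasonably well with core count.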

Once the index has been built and is in sync with the source files on disk, querying it gives us the (hopefully) much smaller subset of files that contain the identifier. From there, we can cheaply open those (perhaps a few hundred) files and find exactly where the occurrences are.
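The query path might then look like the sketch below: look up the identifier in the index, then scan only the candidate files for line-level matches. Again, the function name and index shape are my own assumptions rather than a committed interface.

    def find_occurrences(index, identifier):
        """Query the index, then scan only the candidate files line by line."""
        occurrences = []
        for path in sorted(index.get(identifier, ())):
            with open(path, "r", errors="ignore") as f:
                for lineno, line in enumerate(f, start=1):
                    if identifier in line:
                        occurrences.append((path, lineno, line.rstrip()))
        return occurrences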