Use hash func to boost file creation and lookup
Previously, SimpleFS used a sequential insertion method to create files, which worked efficiently when the filesystem contained only a small number of files. However, in real-world use cases, filesystems often manage a large number of files, making sequential search and insertion inefficient. Inspired by Ext4’s hash-based directory indexing, this change adopts a hash function to accelerate file indexing and improve scalability.
Change: Implemented hash-based file index lookup Improved scalability for large directory structures
hash_code = file_hash(file_name);
extent index = hash_code / SIMPLEFS_MAX_BLOCKS_PER_EXTENT block index = hash_code % SIMPLEFS_MAX_BLOCKS_PER_EXTENT;
inode
+-----------------------+
| i_mode = IFDIR | 0755 | block 123 (simplefs_file_ei_block)
| ei_block = 123 ----|---> +----------------+
| i_size = 4 KiB | | nr_files = 7 |
| i_blocks = 1 | |----------------|
+-----------------------+ 0 | ee_block = 0 |
(extent index = 0) | ee_len = 8 |
| ee_start = 84 |---> +-------------+ block 84(simplefs_dir_block)
| nr_file = 2 | |nr_files = 2 | (block index = 0)
|----------------| |-------------|
1 | ee_block = 8 | 0 | inode = 24 |
(extent index = 1) | ee_len = 8 | | nr_blk = 1 |
| ee_start = 16 | | (foo) |
| nr_file = 5 | |-------------|
|----------------| 1 | inode = 45 |
| ... | | nr_blk = 14 |
|----------------| | (bar) |
341 | ee_block = 0 | |-------------|
(extent index = 341) | ee_len = 0 | | ... |
| ee_start = 0 | |-------------|
| nr_file = 12 | 14 | 0 |
+----------------+ +-------------+ block 85(simplefs_dir_block)
|nr_files = 2 | (block index = 1)
|-------------|
0 | inode = 48 |
| nr_blk = 15 |
| (foo1) |
|-------------|
1 | inode = 0 |
| nr_blk = 0 |
| |
|-------------|
| ... |
|-------------|
14 | 0 |
+-------------+
Performance test
- Random create 30600 files into filesystem
legacy:
168140.12 msec task-clock # 0.647 CPUs utilized
111367 context-switches # 662.346 /sec
40917 cpu-migrations # 243.351 /sec
3736053 page-faults # 22.220 K/sec
369091680702 cycles # 2.195 GHz
168751830643 instructions # 0.46 insn per cycle
34044524391 branches # 202.477 M/sec
768151711 branch-misses # 2.26% of all branches
259.842753513 seconds time elapsed
23.000247000 seconds user
150.380145000 seconds sys
full_name_hash
167926.13 msec task-clock # 0.755 CPUs utilized
110631 context-switches # 658.808 /sec
43835 cpu-migrations # 261.037 /sec
3858617 page-faults # 22.978 K/sec
392878398961 cycles # 2.340 GHz
207287412692 instructions # 0.53 insn per cycle
42556269864 branches # 253.423 M/sec
840868990 branch-misses # 1.98% of all branches
222.274028604 seconds time elapsed
20.794966000 seconds user
151.941876000 seconds sys
- Random remove 30600 files into filesystem
legacy:
104332.44 msec task-clock # 0.976 CPUs utilized
56514 context-switches # 541.672 /sec
1174 cpu-migrations # 11.252 /sec
3796962 page-faults # 36.393 K/sec
258293481279 cycles # 2.476 GHz
153853176926 instructions # 0.60 insn per cycle
30434271757 branches # 291.705 M/sec
532967347 branch-misses # 1.75% of all branches
106.921706288 seconds time elapsed
16.987883000 seconds user
91.268661000 seconds sys
full_name_hash
83278.61 msec task-clock # 0.967 CPUs utilized
52431 context-switches # 629.585 /sec
1309 cpu-migrations # 15.718 /sec
3796501 page-faults # 45.588 K/sec
199894058328 cycles # 2.400 GHz
110625460371 instructions # 0.55 insn per cycle
20325767251 branches # 244.069 M/sec
490549944 branch-misses # 2.41% of all branches
86.132655220 seconds time elapsed
19.180209000 seconds user
68.476075000 seconds sys
- Random check (ls -la filename) 30600 files into filesystem
Use perf stat ls -la
to measure the query time for each file and sum up all elapsed times to calculate the total lookup cost.
Legacy : min time: 0.00171 s max time: 0.03799 s avg time: 0.00423332 s tot time: 129.539510 s
full_name_hash: min time: 0.00171 s max time: 0.03588 s avg time: 0.00305601 s tot time: 93.514040 s
Summary by cubic
Switched SimpleFS to hash-based directory indexing using full_name_hash to speed up file creation and lookup by mapping filenames to extent/block slots. On 30.6k files: create ~33% faster, delete ~12% faster, lookup ~41% faster.
-
New Features
- Hash-guided placement/lookup in create, link, symlink, rename, and unlink.
- Added fast __file_lookup and early-stop iteration using per-extent counts; reclaims empty extents on unlink/rename.
-
Refactors
- Added hash.c (simplefs_hash via full_name_hash).
Written for commit 304f288ff74da622abb395c43971995327ae8924. Summary will update automatically on new commits.