Setup a bufferpool/cache
Needs some design work too.
It basically needs to satisfy these invariants
-
A song on disk must have at most one copy in memory (this is due to an invariant that Modulo12 is a Read Only query language.
-
A songs under a directory must be in the cache post a query's execution. On the first time it runs, the cache is created.
-
The songs under sub directory must be addressible on their own
To illustrate points two and three, consider this hierarchy
Songs (Directory)
- Led Zep (Directory)
- Stairway to heaven, Rain Song, Over the hills and far away
- Greta Van Fleet
- Highway tune, Edge of Darkness, When the curtain falls
Cache(Songs) -> All the above songs Cache(Led Zep) -> { Stairway to heaven, Rain Song, Over the hills and far away } Cache(Greta Van Fleet) -> { Highway tune, Edge of Darkness, When the curtain falls }
- Cache must be invalidated on size and eviction on recency (least recently used, or least frequently used?). May consider TTL. Ideally these params should be configurable in some kind of spec file.
Some initial thoughts
We need a tree data structure here, since we have to deal with hierarchies, consider this ADT
sealed Trait DirectoryTree
object DirectoryNode {
case class SongFile(song: Song) extends DirectoryTree
case class DirectoryNode(folderPath: String, filesUnder: List[Directory]) extends DirectoryTree
}
This will work for the basic cases. Imagine we have this folder hierarchy
Songs (Directory)
- Led Zep (Directory)
- Stairway to heaven, Rain Song, Over the hills and far away
- Greta Van Fleet
- Highway tune, Edge of Darkness, When the curtain falls
Well the root node would be the songs as a diretory node, and then files under would be led zep, and greta van fleet directory node. for led zep, it would have 3 song files, and the same for greta van fleet.
There are multiple things to be very careful about
- If we have a directory above Songs, we will have to re parent the entire tree.
- If we query a completely different directory that is not in the parent/child hierarchy of songs (before root), then we essentially end up with two such trees. We are looking at a forest.
- Files can be added AFTER we build the cache, so we can have a cache coherence issue. There are ways around it (like do a file system walk after some time T threshold despite having a cache, and load the songs that are new). Regardless this is a trade off that we need to think about.
- It would also help to keep it in a map (basically a linked hash map). So we can have O(1) look ups for songs in case our query engine needs it.