Better logging of pipeline happenings
Is your feature request related to a problem? Please describe.
I ran my first indexing pipeline using swiftide today and noticed that the level of detail in the log messages was inconsistent.
I wanted to understand which parts of the code are slow, which configuration changes lead to faster or slower runs, and so on, so I added logging and saw that the information provided is a bit all over the place.
Originally I started out with a large-ish model (qwq) and then switched over to a small, quick model (phi4) so I could iterate faster on understanding the logging here. Redis is disabled as it filters out nodes which were not actually properly processed.
Code
use swiftide::{
    indexing::{
        self,
        loaders::FileLoader,
        transformers::{ChunkCode, Embed, MetadataQACode},
    },
    integrations::{fastembed::FastEmbed, ollama::Ollama, qdrant::Qdrant},
};
use tracing::{info, info_span};
use tracing_subscriber::EnvFilter;

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    dotenvy::dotenv()?;
    let directives = "info,swiftide_indexing=debug,swiftide_integrations=debug";
    let filter = EnvFilter::builder().parse(directives)?;
    tracing_subscriber::fmt().with_env_filter(filter).init();

    let _setup = info_span!("setup").entered();
    info!("Starting swiftide");
    let ollama = Ollama::builder().default_prompt_model("phi4").build()?;
    let loader = FileLoader::new(".").with_extensions(&["rs"]);
    // let redis_url = std::env::var("REDIS_URL")?;
    // let redis = Redis::try_from_url(redis_url, "gusteau")?;
    let rust_chunker = ChunkCode::try_for_language_and_chunk_size("rust", 10..2048)?;
    let fast_embed = FastEmbed::try_default()?;
    let qdrant = Qdrant::builder().batch_size(50).vector_size(1536).build()?;
    info!("Setup finished");
    drop(_setup);

    let _pipeline = info_span!("pipeline").entered();
    info!("Running pipeline");
    indexing::Pipeline::from_loader(loader)
        .with_default_llm_client(ollama)
        // .filter_cached(redis)
        .then_chunk(rust_chunker)
        .then(MetadataQACode::default())
        .then_in_batch(Embed::new(fast_embed))
        .then_store_with(qdrant)
        .run()
        .await?;
    info!("Pipeline finished");

    Ok(())
}
[package]
name = "gusteau"
version = "0.1.0"
edition = "2024"

[dependencies]
anyhow = "1.0.97"
# fix for missing feature flag in swiftide
chrono = { version = "0.4.40", features = ["now"] }
dotenvy = "0.15.7"
swiftide = { version = "0.22.3", features = [
    "fastembed",
    "ollama",
    "qdrant",
    "redis",
    "tree-sitter",
] }
tokio = { version = "1.44.1", features = ["macros", "rt-multi-thread"] }
tracing = "0.1.41"
tracing-subscriber = { version = "0.3.19", features = ["env-filter"] }
Logs:
Text logs
2025-03-16T08:45:09.998681Z INFO setup: gusteau: Starting swiftide
2025-03-16T08:45:10.122453Z INFO setup: gusteau: Setup finished
2025-03-16T08:45:10.122469Z INFO pipeline: gusteau: Running pipeline
2025-03-16T08:45:10.123034Z INFO pipeline:indexing_pipeline.run: swiftide_indexing::pipeline: Starting indexing pipeline with 12 concurrency
2025-03-16T08:45:10.123067Z DEBUG pipeline:indexing_pipeline.run:setup: swiftide_integrations::qdrant::persist: Setting up Qdrant storage
2025-03-16T08:45:10.123071Z INFO pipeline:indexing_pipeline.run:setup: swiftide_integrations::qdrant: Checking if collection swiftide exists
2025-03-16T08:45:10.124705Z WARN pipeline:indexing_pipeline.run:setup: swiftide_integrations::qdrant: Collection swiftide exists
2025-03-16T08:45:10.125017Z DEBUG pipeline:indexing_pipeline.run: swiftide_indexing::loaders::file_loader: Reading file: DirEntry { dent: Walkdir(DirEntry("./src/main.rs")), err: None }
2025-03-16T08:45:10.125055Z DEBUG pipeline:indexing_pipeline.run: swiftide_indexing::pipeline: Chunking node chunker="ChunkCode"
2025-03-16T08:45:10.125745Z DEBUG pipeline:indexing_pipeline.run: swiftide_indexing::pipeline: Transforming node node=Node { id: fcadae46-e237-37d3-b0b9-fc601d7e815f, path: "./src/main.rs", chunk: "use swiftide::{\n indexing::{\n self,\n loaders::FileLoader,\n transformers::{ChunkCode, Embed, MetadataQACode},\n },\n integrations::{fastembed::FastEmbed, ollama::Ollama, qdrant::Qdrant},\n};\nuse tracing::{info, info_span};\nuse tracing_subscriber::EnvFilter;\n\n#[tokio::main]\nasync fn main() -> anyhow::Result<()> {\n dotenvy::dotenv()?;\n let directives = \"info,swiftide_indexing=debug,swiftide_integrations=debug\";\n let filter = EnvFilter::builder().parse(directives)?;\n tracing_subscriber::fmt().with_env_filter(filter).init();\n\n let _setup = info_span!(\"setup\").entered();\n info!(\"Starting swiftide\");\n let ollama = Ollama::builder().default_prompt_model(\"phi4\").build()?;\n let loader = FileLoader::new(\".\").with_extensions(&[\"rs\"]);\n // let redis_url = std::env::var(\"REDIS_URL\")?;\n // let redis = Redis::try_from_url(redis_url, \"gusteau\")?;\n let rust_chunker = ChunkCode::try_for_language_and_chunk_size(\"rust\", 10..2048)?;\n let fast_embed = FastEmbed::try_default()?;\n let qdrant = Qdrant::builder().batch_size(50).vector_size(1536).build()?;\n info!(\"Setup finished\");\n drop(_setup);\n\n let _pipeline = info_span!(\"pipeline\").entered();\n info!(\"Running pipeline\");\n indexing::Pipeline::from_loader(loader)\n .with_default_llm_client(ollama)\n // .filter_cached(redis)\n .then_chunk(rust_chunker)\n .then(MetadataQACode::default())\n .then_in_batch(Embed::new(fast_embed))\n .then_store_with(qdrant)\n .run()\n .await?;\n info!(\"Pipeline finished\");\n\n Ok(())\n}", metadata: {}, vectors: "", sparse_vectors: "", embed_mode: SingleWithMetadata } transformer="MetadataQACode"
2025-03-16T08:45:10.126405Z DEBUG pipeline:indexing_pipeline.run:transformers.metadata_qa_code:prompt: swiftide_integrations::ollama::simple_prompt: [SimplePrompt] Request to ollama model="phi4" messages="{\n \"role\": \"user\",\n \"content\": \"# Task\\n\\nYour task is to generate questions and answers for the given code.\\n\\nGiven that somebody else might ask questions about the code, consider things like:\\n\\n- What does this code do?\\n- What other internal parts does the code use?\\n- Does this code have any dependencies?\\n- What are some potential use cases for this code?\\n- ... and so on\\n\\n# Constraints\\n\\n- Generate only 5 questions and answers.\\n- Only respond in the example format\\n- Only respond with questions and answers that can be derived from the code.\\n\\n# Example\\n\\nRespond in the following example format and do not include anything else:\\n\\n```\\nQ1: What does this code do?\\nA1: It transforms strings into integers.\\nQ2: What other internal parts does the code use?\\nA2: A hasher to hash the strings.\\n```\\n\\n\\n\\n# Code\\n\\n```\\nuse swiftide::{\\n indexing::{\\n self,\\n loaders::FileLoader,\\n transformers::{ChunkCode, Embed, MetadataQACode},\\n },\\n integrations::{fastembed::FastEmbed, ollama::Ollama, qdrant::Qdrant},\\n};\\nuse tracing::{info, info_span};\\nuse tracing_subscriber::EnvFilter;\\n\\n#[tokio::main]\\nasync fn main() -> anyhow::Result<()> {\\n dotenvy::dotenv()?;\\n let directives = \\\"info,swiftide_indexing=debug,swiftide_integrations=debug\\\";\\n let filter = EnvFilter::builder().parse(directives)?;\\n tracing_subscriber::fmt().with_env_filter(filter).init();\\n\\n let _setup = info_span!(\\\"setup\\\").entered();\\n info!(\\\"Starting swiftide\\\");\\n let ollama = Ollama::builder().default_prompt_model(\\\"phi4\\\").build()?;\\n let loader = FileLoader::new(\\\".\\\").with_extensions(&[\\\"rs\\\"]);\\n // let redis_url = std::env::var(\\\"REDIS_URL\\\")?;\\n // let redis = 
Redis::try_from_url(redis_url, \\\"gusteau\\\")?;\\n let rust_chunker = ChunkCode::try_for_language_and_chunk_size(\\\"rust\\\", 10..2048)?;\\n let fast_embed = FastEmbed::try_default()?;\\n let qdrant = Qdrant::builder().batch_size(50).vector_size(1536).build()?;\\n info!(\\\"Setup finished\\\");\\n drop(_setup);\\n\\n let _pipeline = info_span!(\\\"pipeline\\\").entered();\\n info!(\\\"Running pipeline\\\");\\n indexing::Pipeline::from_loader(loader)\\n .with_default_llm_client(ollama)\\n // .filter_cached(redis)\\n .then_chunk(rust_chunker)\\n .then(MetadataQACode::default())\\n .then_in_batch(Embed::new(fast_embed))\\n .then_store_with(qdrant)\\n .run()\\n .await?;\\n info!(\\\"Pipeline finished\\\");\\n\\n Ok(())\\n}\\n```\\n\"\n}"
2025-03-16T08:45:24.053127Z DEBUG pipeline:indexing_pipeline.run:transformers.metadata_qa_code:prompt: swiftide_integrations::ollama::simple_prompt: [SimplePrompt] Response from ollama response="```markdown\nQ1: What does this code do?\nA1: This code sets up and runs an asynchronous indexing pipeline for files using the swiftide library. It loads files with specific extensions, processes them into chunks and metadata, embeds the data in vectors, and stores it.\n\nQ2: What libraries or frameworks does this code depend on?\nA2: The code depends on several external crates: `swiftide` for indexing functionalities, `tracing` and `tracing_subscriber` for logging, `tokio` as an async runtime, `dotenvy` for environment variable management, and `anyhow` for error handling.\n\nQ3: Which components are used to process the files in this code?\nA3: The components used include `FileLoader` to load files from a directory, `ChunkCode::try_for_language_and_chunk_size` to chunk files into pieces of specific sizes, `MetadataQACode` to create metadata questions and answers, `FastEmbed` for embedding the data into vectors, and `Qdrant` as the vector database where embedded data is stored.\n\nQ4: What external tool or service does this code utilize for natural language processing?\nA4: This code utilizes the Ollama framework through its integration in swiftide, specifically with a pre-built default prompt model \"phi4\" for handling natural language processing tasks.\n\nQ5: What are potential use cases for this code?\nA5: Potential uses include building a search or recommendation system where documents need to be indexed, chunked into logical pieces, and embedded into vectors for efficient similarity searches. The pipeline could serve large datasets or facilitate content retrieval based on queries by computing embeddings from input data.\n```"
2025-03-16T08:45:24.053298Z DEBUG pipeline:indexing_pipeline.run: swiftide_indexing::pipeline: Batch transforming nodes batch_transformer="Embed" num_nodes=1
2025-03-16T08:45:24.142160Z DEBUG pipeline:indexing_pipeline.run: swiftide_indexing::pipeline: Batch Storing nodes storage="Qdrant" num_nodes=1
2025-03-16T08:45:24.142234Z DEBUG pipeline:indexing_pipeline.run:storage.qdrant.batch_store: swiftide_integrations::qdrant::persist: Storing batch of 1 nodes
2025-03-16T08:45:24.144132Z WARN pipeline:indexing_pipeline.run: swiftide_indexing::pipeline: Processed 1 nodes in 14 seconds elapsed_in_seconds=14
2025-03-16T08:45:24.152188Z INFO pipeline: gusteau: Pipeline finished
Describe the solution you'd like
Use TRACE level for detail (e.g. full prompt info), DEBUG level for summary information about each step, and INFO for whole-pipeline detail. Also some better error messages.
Let's go through these a bit at a time:
2025-03-16T08:45:10.123071Z INFO pipeline:indexing_pipeline.run:setup: swiftide_integrations::qdrant: Checking if collection swiftide exists
2025-03-16T08:45:10.124705Z WARN pipeline:indexing_pipeline.run:setup: swiftide_integrations::qdrant: Collection swiftide exists
This log message should explain why this is happening and why it's relevant, as well as why it's logged as a warning. What problem is happening here? How can I fix it? Can you link to a doc that explains this?
2025-03-16T08:45:10.125017Z DEBUG pipeline:indexing_pipeline.run: swiftide_indexing::loaders::file_loader: Reading file: DirEntry { dent: Walkdir(DirEntry("./src/main.rs")), err: None }
Information about the file loader's walk should be logged first ("discovering files in folder: ."), then perhaps a log message per file. I'm new to swiftide, but assuming that the files map to nodes, add the node id to the message so these can be traced through the pipeline.
2025-03-16T08:45:10.125055Z DEBUG pipeline:indexing_pipeline.run: swiftide_indexing::pipeline: Chunking node chunker="ChunkCode"
Which node? (Sure, there's only one here, but in a larger app that won't be the case.) Does it have an id or some details?
2025-03-16T08:45:10.125745Z DEBUG pipeline:indexing_pipeline.run: swiftide_indexing::pipeline: Transforming node node=Node { id: fcadae46-e237-37d3-b0b9-fc601d7e815f, path: "./src/main.rs", chunk: "use swiftide::{\n indexing::{\n self,\n loaders::FileLoader,\n transformers::{ChunkCode, Embed, MetadataQACode},\n },\n integrations::{fastembed::FastEmbed, ollama::Ollama, qdrant::Qdrant},\n};\nuse tracing::{info, info_span};\nuse tracing_subscriber::EnvFilter;\n\n#[tokio::main]\nasync fn main() -> anyhow::Result<()> {\n dotenvy::dotenv()?;\n let directives = \"info,swiftide_indexing=debug,swiftide_integrations=debug\";\n let filter = EnvFilter::builder().parse(directives)?;\n tracing_subscriber::fmt().with_env_filter(filter).init();\n\n let _setup = info_span!(\"setup\").entered();\n info!(\"Starting swiftide\");\n let ollama = Ollama::builder().default_prompt_model(\"phi4\").build()?;\n let loader = FileLoader::new(\".\").with_extensions(&[\"rs\"]);\n // let redis_url = std::env::var(\"REDIS_URL\")?;\n // let redis = Redis::try_from_url(redis_url, \"gusteau\")?;\n let rust_chunker = ChunkCode::try_for_language_and_chunk_size(\"rust\", 10..2048)?;\n let fast_embed = FastEmbed::try_default()?;\n let qdrant = Qdrant::builder().batch_size(50).vector_size(1536).build()?;\n info!(\"Setup finished\");\n drop(_setup);\n\n let _pipeline = info_span!(\"pipeline\").entered();\n info!(\"Running pipeline\");\n indexing::Pipeline::from_loader(loader)\n .with_default_llm_client(ollama)\n // .filter_cached(redis)\n .then_chunk(rust_chunker)\n .then(MetadataQACode::default())\n .then_in_batch(Embed::new(fast_embed))\n .then_store_with(qdrant)\n .run()\n .await?;\n info!(\"Pipeline finished\");\n\n Ok(())\n}", metadata: {}, vectors: "", sparse_vectors: "", embed_mode: SingleWithMetadata } transformer="MetadataQACode"
This jumped down a level of abstraction in the amount of detail it contains - most of this detail probably belongs at the lower trace level. Instead this might look better with just the type of node and transformer being processed.
2025-03-16T08:45:10.126405Z DEBUG pipeline:indexing_pipeline.run:transformers.metadata_qa_code:prompt: swiftide_integrations::ollama::simple_prompt: [SimplePrompt] Request to ollama model="phi4" messages="{\n \"role\": \"user\",\n \"content\": \"# Task\\n\\nYour task is to generate questions and answers for the given code.\\n\\nGiven that somebody else might ask questions about the code, consider things like:\\n\\n- What does this code do?\\n- What other internal parts does the code use?\\n- Does this code have any dependencies?\\n- What are some potential use cases for this code?\\n- ... and so on\\n\\n# Constraints\\n\\n- Generate only 5 questions and answers.\\n- Only respond in the example format\\n- Only respond with questions and answers that can be derived from the code.\\n\\n# Example\\n\\nRespond in the following example format and do not include anything else:\\n\\n```\\nQ1: What does this code do?\\nA1: It transforms strings into integers.\\nQ2: What other internal parts does the code use?\\nA2: A hasher to hash the strings.\\n```\\n\\n\\n\\n# Code\\n\\n```\\nuse swiftide::{\\n indexing::{\\n self,\\n loaders::FileLoader,\\n transformers::{ChunkCode, Embed, MetadataQACode},\\n },\\n integrations::{fastembed::FastEmbed, ollama::Ollama, qdrant::Qdrant},\\n};\\nuse tracing::{info, info_span};\\nuse tracing_subscriber::EnvFilter;\\n\\n#[tokio::main]\\nasync fn main() -> anyhow::Result<()> {\\n dotenvy::dotenv()?;\\n let directives = \\\"info,swiftide_indexing=debug,swiftide_integrations=debug\\\";\\n let filter = EnvFilter::builder().parse(directives)?;\\n tracing_subscriber::fmt().with_env_filter(filter).init();\\n\\n let _setup = info_span!(\\\"setup\\\").entered();\\n info!(\\\"Starting swiftide\\\");\\n let ollama = Ollama::builder().default_prompt_model(\\\"phi4\\\").build()?;\\n let loader = FileLoader::new(\\\".\\\").with_extensions(&[\\\"rs\\\"]);\\n // let redis_url = std::env::var(\\\"REDIS_URL\\\")?;\\n // let redis = 
Redis::try_from_url(redis_url, \\\"gusteau\\\")?;\\n let rust_chunker = ChunkCode::try_for_language_and_chunk_size(\\\"rust\\\", 10..2048)?;\\n let fast_embed = FastEmbed::try_default()?;\\n let qdrant = Qdrant::builder().batch_size(50).vector_size(1536).build()?;\\n info!(\\\"Setup finished\\\");\\n drop(_setup);\\n\\n let _pipeline = info_span!(\\\"pipeline\\\").entered();\\n info!(\\\"Running pipeline\\\");\\n indexing::Pipeline::from_loader(loader)\\n .with_default_llm_client(ollama)\\n // .filter_cached(redis)\\n .then_chunk(rust_chunker)\\n .then(MetadataQACode::default())\\n .then_in_batch(Embed::new(fast_embed))\\n .then_store_with(qdrant)\\n .run()\\n .await?;\\n info!(\\\"Pipeline finished\\\");\\n\\n Ok(())\\n}\\n```\\n\"\n}"
This contains too much detail and doesn't link to the node, meaning it can't be correlated - give the semantic view of this: "Generating QA metadata for chunk ... node id ... with ollama model ..."
2025-03-16T08:45:24.053127Z DEBUG pipeline:indexing_pipeline.run:transformers.metadata_qa_code:prompt: swiftide_integrations::ollama::simple_prompt: [SimplePrompt] Response from ollama response="```markdown\nQ1: What does this code do?\nA1: This code sets up and runs an asynchronous indexing pipeline for files using the swiftide library. It loads files with specific extensions, processes them into chunks and metadata, embeds the data in vectors, and stores it.\n\nQ2: What libraries or frameworks does this code depend on?\nA2: The code depends on several external crates: `swiftide` for indexing functionalities, `tracing` and `tracing_subscriber` for logging, `tokio` as an async runtime, `dotenvy` for environment variable management, and `anyhow` for error handling.\n\nQ3: Which components are used to process the files in this code?\nA3: The components used include `FileLoader` to load files from a directory, `ChunkCode::try_for_language_and_chunk_size` to chunk files into pieces of specific sizes, `MetadataQACode` to create metadata questions and answers, `FastEmbed` for embedding the data into vectors, and `Qdrant` as the vector database where embedded data is stored.\n\nQ4: What external tool or service does this code utilize for natural language processing?\nA4: This code utilizes the Ollama framework through its integration in swiftide, specifically with a pre-built default prompt model \"phi4\" for handling natural language processing tasks.\n\nQ5: What are potential use cases for this code?\nA5: Potential uses include building a search or recommendation system where documents need to be indexed, chunked into logical pieces, and embedded into vectors for efficient similarity searches. The pipeline could serve large datasets or facilitate content retrieval based on queries by computing embeddings from input data.\n```"
Too much detail - this belongs at the trace level. It also doesn't have any correlation to a node id / chunk. It would be good to see timing information on this.
2025-03-16T08:45:24.053298Z DEBUG pipeline:indexing_pipeline.run: swiftide_indexing::pipeline: Batch transforming nodes batch_transformer="Embed" num_nodes=1
2025-03-16T08:45:24.142160Z DEBUG pipeline:indexing_pipeline.run: swiftide_indexing::pipeline: Batch Storing nodes storage="Qdrant" num_nodes=1
2025-03-16T08:45:24.142234Z DEBUG pipeline:indexing_pipeline.run:storage.qdrant.batch_store: swiftide_integrations::qdrant::persist: Storing batch of 1 nodes
It would be good to see node ids here.
2025-03-16T08:45:24.144132Z WARN pipeline:indexing_pipeline.run: swiftide_indexing::pipeline: Processed 1 nodes in 14 seconds elapsed_in_seconds=14
Why is this at WARN level?
At a high level:
- keep node ids and log them
- log details at the trace level (i.e. anything which exists somewhere already, which you're storing into a db, or sending to / receiving from an API)
I wonder if it would be useful to use tracing spans and relate them using Span::follows_from somehow when a node gets chunked or batched? I'm not sure how this gets presented when logging, but if it's got some sort of reasonable default that would be super neat.
Please forgive the dryness of the above comments as fairly unfiltered user reaction on first use. I'm hoping that you see this as constructive rather than pure criticism.
I also understand that there's a larger otel change that you're considering. These ideas seem like they are small enough changes that would make things easier in the short term while the longer term solution takes form.
Hey Josh,
Thanks for pointing this out. I think over the months it has grown to a bit of a mess that is all over the place.
What I'd like to see is how nodes flow through the pipeline, how they get transformed, and when tracing, pinpoint specific nodes.
Do you think it would make more sense to trace the transformation steps, with nodes being part of the log in that step? Or should node transformations be individual spans (like now)? I think the latter just breaks with thousands of nodes anyway; then again, the representation is not chronological in the first case, if that makes sense.
Node ids are UUIDs lazily generated from the node content and path, which means that when a node is chunked, ids change. Currently, besides the path, there is no way to trace a node back to its parent (which no longer exists). I'm not sure how large the performance impact of storing parent ids in nodes would be, but I do think it's something to consider (debug/trace only is a fair compromise). That said, with most RAG solutions I've seen, transformations tend to happen after chunking.
Overall, I fully agree that it's an inconvenient mess right now and that fixing this up would make the library a lot nicer to use. Unfortunately I don't have the time right now to sift through it. It feels like quite a lot.
I also understand that there's a larger otel change that you're considering. These ideas seem like they are small enough changes that would make things easier in the short term while the longer term solution takes form.
It's probably less, I think? There's been some work on standardizing LLM-based applications, which would mean emitting some metrics/spans at certain LLM points.
Please forgive the dryness of the above comments as fairly unfiltered user reaction on first use. I'm hoping that you see this as constructive rather than pure criticism.
Hah, no worries. Good feedback is rare and I value yours.
Mostly my view was purely external - what do I see as a user; I haven't looked at the implementation details on this one yet.
In implementing this, intuitively there's likely a solution that isn't write-only like tracing events - e.g. the concept of a notification stream or work stream with a defined format for showing progress through the pipeline. At all times, it would be handy to be able to see a real-time view of how many nodes are waiting to be processed at each step, and to drill into the details of each one as much as you'd like. Doing this is probably a big internal change to the way things are processed (but again - I haven't looked at the internals on this yet, so maybe it's actually small).
This would allow situations like being able to dive into the pipeline and understand which parts are processed, which are waiting, and which items need more power (concurrency, model choice, etc.).
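To make the idea concrete, here's a minimal std-only sketch, assuming a made-up `PipelineEvent` type; a real version would presumably push these over a channel from each step rather than collect them in a slice:

```rust
use std::collections::HashMap;

// Hypothetical event type: each pipeline step emits one event when a node
// enters it and one when the node leaves, so an observer can keep a live
// count of how many nodes are waiting at each step.
#[derive(Debug)]
enum PipelineEvent {
    Enqueued { step: &'static str, node_id: u64 },
    Completed { step: &'static str, node_id: u64 },
}

// Fold the event stream into a "nodes currently pending" view per step.
fn pending_per_step(events: &[PipelineEvent]) -> HashMap<&'static str, i64> {
    let mut pending: HashMap<&'static str, i64> = HashMap::new();
    for event in events {
        match event {
            PipelineEvent::Enqueued { step, .. } => *pending.entry(step).or_insert(0) += 1,
            PipelineEvent::Completed { step, .. } => *pending.entry(step).or_insert(0) -= 1,
        }
    }
    pending
}
```

Unlike tracing events, this stream is structured data, so the same events could drive a progress bar, a TUI, or a metrics exporter.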
Do you think it would make more sense to trace the transformation steps, with nodes being part of the log in that step? Or should node transformations be individual spans (like now)? I think the latter just breaks with thousands of nodes anyway; then again, the representation is not chronological in the first case, if that makes sense.
I'm still getting a feel for how swiftide works, I might have a bit of a play with some tracing ideas and see what I can come up with for something that fixes a few of the ideas.
Node ids are UUIDs lazily generated from the node content and path, which means that when a node is chunked, ids change. Currently, besides the path, there is no way to trace a node back to its parent (which no longer exists). I'm not sure how large the performance impact of storing parent ids in nodes would be, but I do think it's something to consider (debug/trace only is a fair compromise). That said, with most RAG solutions I've seen, transformations tend to happen after chunking.
Yup - I'd expect the node id to change. You likely don't have to store parent ids (for the logging to work); what you can do is add a span around any processing that happens on a particular node id, then, if it's chunked, create a new span for the chunk that has a follows-from relationship with the node it was generated from. When looking at the node in something like jaeger, you'll see a line coming from the initial node to the new one, but the initial node's span will finish and not include the new one's timing. This seems fairly aligned with how the nodes work as I understand it.
It's probably less, I think? There's been some work on standardizing LLM-based applications, which would mean emitting some metrics/spans at certain LLM points.
No problem. Can I assume that there's no big disagreement with my points / rationale for things I've missed that aren't obvious or that are based on incorrect assumptions in my understanding of the log messages?
As a quick question do you have any guidance on the questions about the 2 warn level messages?
In implementing this, intuitively there's likely a solution that isn't write-only like tracing events - e.g. the concept of a notification stream or work stream with a defined format for showing progress through the pipeline. At all times, it would be handy to be able to see a real-time view of how many nodes are waiting to be processed at each step, and to drill into the details of each one as much as you'd like. Doing this is probably a big internal change to the way things are processed (but again - I haven't looked at the internals on this yet, so maybe it's actually small).
That's an interesting take. As an aside, it might even be interesting to look at pipeline visualization tools and see if Swiftide can hook in. Just taking a wild shot, but emitting that data might be pretty straightforward by either hooking in on the steps directly in the pipeline, or otherwise wrapping the trait call. I can appreciate the idea of having more live and actionable metrics available.
I'm still getting a feel for how swiftide works, I might have a bit of a play with some tracing ideas and see what I can come up with for something that fixes a few of the ideas.
👍 Thanks!
Yup - I'd expect the node id to change. You likely don't have to store parent ids (for the logging to work); what you can do is add a span around any processing that happens on a particular node id, then, if it's chunked, create a new span for the chunk that has a follows-from relationship with the node it was generated from. When looking at the node in something like jaeger, you'll see a line coming from the initial node to the new one, but the initial node's span will finish and not include the new one's timing. This seems fairly aligned with how the nodes work as I understand it.
The stream is not ordered, so that might not quite work. Currently I just throw it in jaeger and hope for the best. It has worked alright so far; agreed that it can be a lot better and more consistent.
No problem. Can I assume that there's no big disagreement with my points / rationale for things I've missed that aren't obvious or that are based on incorrect assumptions in my understanding of the log messages?
None at all. I appreciate the time you have put into this. This was very much not top of mind. I feel that your examples illustrate well that the current setup can be confusing / fall short. Your suggestions sound solid.
As a quick question do you have any guidance on the questions about the 2 warn level messages?
2025-03-16T08:45:10.124705Z WARN pipeline:indexing_pipeline.run:setup: swiftide_integrations::qdrant: Collection swiftide exists
If the schema/vector sizes mismatch, data might be inserted weirdly or fail. That was some time ago; I'm not 100% sure what qdrant will give back now. If it's meant to already exist, then it's fine.
2025-03-16T08:45:24.144132Z WARN pipeline:indexing_pipeline.run: swiftide_indexing::pipeline: Processed 1 nodes in 14 seconds elapsed_in_seconds=14
I made it WARN as it means the pipeline stopped processing. If users are on the default log level, they wouldn't get any feedback when the pipeline completed besides their process stopping. I get your point in keeping WARN consistent with 'could be a problem' only, so INFO would be a solid choice.
I made it WARN as it means the pipeline stopped processing. If users are on the default log level, they wouldn't get any feedback when the pipeline completed besides their process stopping. I get your point in keeping WARN consistent with 'could be a problem' only, so INFO would be a solid choice.
Ah, got it. IIRC, the default log level when you don't have the env-filter feature flag enabled is INFO. It bumps up to WARN if you do have env-filter though.
One possible idea might be to give each pipeline element an id of its own and pull the span code up into the code which runs the transform / ... That way, if you have multiple elements of the same type, it's obvious which one is causing the log messages. I thought about this after #695, where I wondered: what if you want two loaders in the pipeline with different settings for extensions (or other transforms which can be duplicates)?
Node ids are UUIDs lazily generated from the node content and path, which means that when a node is chunked, ids change. Currently, besides the path, there is no way to trace a node back to its parent (which no longer exists). I'm not sure how large the performance impact of storing parent ids in nodes would be, but I do think it's something to consider (debug/trace only is a fair compromise). That said, with most RAG solutions I've seen, transformations tend to happen after chunking.
Thinking about this a bit more, this does seem like a bit of a pain point in the way the node type is designed, which makes a bunch of the ideas about correlating nodes to other nodes a bit of a perf issue. I imagine it would be a pretty large change if the node id was just a uuid generated on ingest rather than based on content.
The ids are also used as unique identifiers for both the cache and (most) databases.
One of the things I really would like to do with nodes is have them be generic over the datatype they carry (for multi-modal). I think that combines really well with having a proper API instead of the raw struct values available. And with that, I reckon it might be acceptable to set the id the moment the setter is called.
So then it's one (large) breaking change that enables a lot of future improvements. It would also make it a lot easier to use the same data structure when retrieving, instead of the hacked-together 'document'.
Yeah, perhaps splitting the id you currently have into assigned id (e.g. a v7 uuid) and a calculated cache key (perhaps just a hash, no uuid) could be worth considering. I don't envy you the ride you'll take to refactor this :D
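A std-only sketch of that split, assuming a hypothetical `Node` type. A real assigned id would likely be a UUID v7 set on ingest; a plain integer stands in here, and the cache key is just a std hash of path plus content:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Hypothetical node: the assigned id is handed out once on ingest and
// stays stable through chunking/transforms (so logs and spans can
// correlate), while the cache key is recomputed from path + content
// (so caches and dedup keep working by content, as they do today).
struct Node {
    assigned_id: u64, // stand-in for e.g. a UUID v7 assigned on ingest
    path: String,
    chunk: String,
}

impl Node {
    fn cache_key(&self) -> u64 {
        let mut hasher = DefaultHasher::new();
        self.path.hash(&mut hasher);
        self.chunk.hash(&mut hasher);
        hasher.finish()
    }
}
```

Two nodes with identical content then share a cache key but keep distinct assigned ids, which is exactly the property the logging needs.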
Yeah, perhaps splitting the id you currently have into assigned id (e.g. a v7 uuid) and a calculated cache key (perhaps just a hash, no uuid) could be worth considering. I don't envy you the ride you'll take to refactor this :D
Haha, well for the time being it won't be on my list anyway. It would still be nice to do this without breaking changes to the resulting id - suddenly having duplicate data or having to re-index is painful for users. I thought I had that settled last time I refactored this :')