indexify Improve Router

Routers are currently separate functions, so they incur the cost of I/O and scheduling loops. We need to make every function decide the next stage in addition to producing outputs.

In progress here - https://github.com/tensorlakeai/tensorlake/pull/79

Apr 09 '25 15:04 diptanu

Routers allow dynamic branching in Graphs. For example, a file upload function can go straight to structured extraction if the file is text; otherwise it can route to a layout detector, OCR, table detectors, etc.

This is how routers are currently implemented - https://github.com/tensorlakeai/tensorlake/blob/fdfa380328a2dffcbf949a2a2cac9b9c850377aa/tests/tensorlake/test_graph_behaviours.py#L760

They incur additional scheduling and hops in the graph.

We want every function to be able to select a downstream function in the graph without going through a special "router" function. This is how I am thinking we define the user experience -

@tensorlake_function(inject_ctx=True)
def file_upload(ctx: GraphContext, file: bytes) -> IntermediateObject:
    if file_type(file) == "text/html":
        # Sets which function is being called next
        ctx.next_function(structured_extraction)
        ...
        return intermediate_object

    # No next function selected, so the static edge to layout_detector is followed
    return intermediate_object

@tensorlake_function()
def layout_detector(intermediate_file: IntermediateObject) -> IntermediateObject:
    # This is where the flow reaches if the file is something other than text
    return intermediate_file

@tensorlake_function()
def structured_extraction(intermediate_file: IntermediateObject) -> Result:
    return result


g = Graph(start_node=file_upload)
g.add_edge(file_upload, layout_detector)
g.add_edge(layout_detector, structured_extraction)

Here, the data flow is file_upload -> layout_detector -> structured_extraction for all files which are not text. For text, the data flow is file_upload -> structured_extraction.

This saves some time in graph execution by short-circuiting some nodes.
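
To make the short-circuit concrete, here is a toy model in plain Python. It is only a sketch and not the tensorlake SDK or the server scheduler: ToyContext and run_graph are hypothetical names, the HTML check is a stand-in for file_type, and the assumption is that a next_function call takes priority over the static edge.

```python
from typing import Any, Callable, Dict, List, Optional, Tuple


class ToyContext:
    """Hypothetical stand-in for GraphContext; records the requested next function."""

    def __init__(self) -> None:
        self.requested_next: Optional[Callable] = None

    def next_function(self, fn: Callable) -> None:
        self.requested_next = fn


def run_graph(
    start: Callable, edges: Dict[Callable, Callable], payload: Any
) -> Tuple[Any, List[str]]:
    """Run one function at a time; a next_function call overrides the static edge."""
    visited: List[str] = []
    current: Optional[Callable] = start
    while current is not None:
        ctx = ToyContext()
        payload = current(ctx, payload)
        visited.append(current.__name__)
        # Assumption: the dynamic choice wins, otherwise fall back to the static edge.
        current = ctx.requested_next or edges.get(current)
    return payload, visited


def file_upload(ctx: ToyContext, file: bytes) -> bytes:
    if file.startswith(b"<html"):  # toy stand-in for a real file type check
        ctx.next_function(structured_extraction)
    return file


def layout_detector(ctx: ToyContext, data: bytes) -> bytes:
    return data


def structured_extraction(ctx: ToyContext, data: bytes) -> dict:
    return {"size": len(data)}


edges = {file_upload: layout_detector, layout_detector: structured_extraction}

_, html_path = run_graph(file_upload, edges, b"<html></html>")
_, pdf_path = run_graph(file_upload, edges, b"%PDF-1.7")
print(html_path)
print(pdf_path)
```

The HTML input visits two functions and the other input visits three, with no separate router step in either path.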

If we do this, maybe there is no need to define static edges at all; we could always use the ctx.next_function API to move data around in the graph. This would simplify the task creator logic even more. Also, developers probably don't like defining graphs statically anyway.
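
A rough sketch of what the edge-free variant could look like, again as a toy model in plain Python and not the SDK: ToyContext and run_dynamic are hypothetical names, and the assumption is that a function which does not call next_function ends the flow.

```python
from typing import Any, Callable, List, Optional, Tuple


class ToyContext:
    """Hypothetical stand-in for GraphContext; records the requested next function."""

    def __init__(self) -> None:
        self.requested_next: Optional[Callable] = None

    def next_function(self, fn: Callable) -> None:
        self.requested_next = fn


def run_dynamic(start: Callable, payload: Any) -> Tuple[Any, List[str]]:
    """Follow next_function calls only; there are no static edges to fall back on."""
    visited: List[str] = []
    current: Optional[Callable] = start
    while current is not None:
        ctx = ToyContext()
        payload = current(ctx, payload)
        visited.append(current.__name__)
        # Assumption: no next_function call means this function is terminal.
        current = ctx.requested_next
    return payload, visited


def file_upload(ctx: ToyContext, file: bytes) -> bytes:
    if file.startswith(b"<html"):  # toy stand-in for a real file type check
        ctx.next_function(structured_extraction)
    else:
        ctx.next_function(layout_detector)
    return file


def layout_detector(ctx: ToyContext, data: bytes) -> bytes:
    ctx.next_function(structured_extraction)
    return data


def structured_extraction(ctx: ToyContext, data: bytes) -> dict:
    return {"size": len(data)}  # terminal: no next function requested


_, text_route = run_dynamic(file_upload, b"<html></html>")
_, other_route = run_dynamic(file_upload, b"%PDF-1.7")
print(text_route)
print(other_route)
```

In this model every non-terminal function has to name its successor explicitly, since there is no Graph definition to consult.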

May 18 '25 04:05 diptanu

Some pointers -

I had started the SDK work here - https://github.com/tensorlakeai/tensorlake/pull/79
We can remove routers from the server data structure here - https://github.com/tensorlakeai/indexify/blob/5909fdc4f847c946d855eaee615ade3e56195948/server/data_model/src/lib.rs#L378 - this will simplify a bunch of code in storage and the task creator.

May 18 '25 04:05 diptanu