Etl.Net icon indicating copy to clipboard operation
Etl.Net copied to clipboard

Getting in-memory list from the database stream

Open mickvikt opened this issue 1 year ago • 13 comments

Hello, is there a way to get an in memory list as an instantiated IList<T> from SQL Server stream and have the database cursor consumed/closed? First I thought extension method ToList() is for this purpose, but it appears it returns an ISingleStream<List<T>>

        var inMemoryCarrierList = contextStream
            .CrossApplySqlServerQuery("carrier stream", builder => builder
                .FromQuery("select carr_code, carr_name from dbo.carr_old order by carr_code")
                .WithMapping(m => new
                {
                    carr_code = m.ToColumn("carr_code"),
                    carr_name = m.ToColumn("carr_name")
                }), "source1")
            .ToList("in memory carrier list");

If I get a list this way and I don't consume this stream by, say, outputting to a text file and then I try a query like this on the same ODBC connection:

        var arch1 = contextStream
            .CrossApplySqlServerQuery("first query from source 1", o => o
                    .FromQuery("select * from dbo.carr")
                    .WithMapping(i => new
                    {
                        carr_code = i.ToColumn("carr_code"),
                        carr_name = i.ToColumn("carr_name")
                    })
                , "source1")
            .Select("create row to save source1 from first query", i => new { i.carr_name, i.carr_code });

I get error:

Unhandled exception. Paillave.Etl.Core.JobExecutionException: Job execution failed
 ---> System.Data.Odbc.OdbcException (0x80131937): ERROR [HY000] [Microsoft][ODBC SQL Server Driver]Connection is busy with results for another hstmt

mickvikt avatar Mar 22 '23 13:03 mickvikt

@Radamonas found the way as a workaround for the issue related to ODBC drivers. It works with a client cursor, witch is the recommended way to avoid memory overload: https://github.com/paillave/Etl.Net/issues/423#issuecomment-1477397756

paillave avatar Mar 22 '23 16:03 paillave

It does work in provided example, the thing is that it drains both stream cursors to files. What makes me wonder is whether it is possible to have definitions of several streams in the same process without actually outputting them to a file, so that they could be used for combining streams, e.g. look-ups, etc. later on?

mickvikt avatar Mar 23 '23 05:03 mickvikt

It does work in provided example, the thing is that it drains both stream cursors to files. What makes me wonder is whether it is possible to have definitions of several streams in the same process without actually outputting them to a file, so that they could be used for combining streams, e.g. look-ups, etc. later on?

yes, you can do what ever you want from what you issue from any operators, and that counts as well for CrossApply operators.

As maybe there is something I didn't get in your question, give me a sample of the kind of code you would like to achieve, and I will adapt it to work with existing operators.

paillave avatar Mar 23 '23 10:03 paillave

What I was referring to with this question is actually this statement and example in the documentation:

To make a lookup, extensions for Sql Server don't provide any operator out of the box. The work around 
is to use the in memory lookup of ETL.NET core.

var authorStream = contextStream
    .CrossApplySqlServerQuery("get authors", o => o
        .FromQuery("select a.* from dbo.Author as a")
        .WithMapping(i => new
        {
            Id = i.ToNumberColumn<int>("Id"),
            Name = i.ToColumn("Name"),
            Reputation = i.ToNumberColumn<int>("Reputation")
        }));

postStream
    .Lookup("get related author", authorStream,
        l => l.AuthorId,
        r => r.Id,
        (l, r) => new { Post = l, Author = r })
    .Do("show value on console", i => Console.WriteLine($"{i.Post.Title} ({i.Author.Name})"));

The question is: how to you define the postStream here? We're calling a method on postStream, but how do we define it?

mickvikt avatar Mar 23 '23 11:03 mickvikt

you just have to define your postStream exactly in the same way you defined authorStream, and combine them the way you want.

Maybe this near real life example will help you to understand: https://github.com/paillave/Etl.Net/blob/master/src/Tutorials/Paillave.Etl.Samples/TestImport2.cs FYI, in this example, the target data is this one: https://github.com/paillave/Etl.Net/blob/master/src/Tutorials/Paillave.Etl.Samples/DataAccess/DbStructure.md and the input files are here: https://github.com/paillave/Etl.Net/tree/master/src/Tutorials/Paillave.Etl.Samples/InputFiles There are 2 kind of files: files that define portfolios, and files that define positions

paillave avatar Mar 23 '23 13:03 paillave

Sorry to bother you again, but the examples deal with files, and I'm referring to the SQL Server data source and this note in documentation: "To make a lookup, extensions for Sql Server don't provide any operator out of the box. The work around is to use the in memory lookup of ETL.NET core."

How do you actually accomplish this workaround and perform in memory look up if you have two database streams? Thank you

mickvikt avatar Mar 23 '23 13:03 mickvikt

yes but in the same way you produce data from a file, you produce data from an in memory stream.

Does the following example represent what you expected?

var categoriesStream = contextStream.CrossApply("create in memory categories", ctx => new[] {
    new { Code = "a", Label = "label A", Tax = 0.1 },
    new { Code = "b", Label = "label B", Tax = 0.2 },
    new { Code = "c", Label = "label C", Tax = 0.3 },
});

contextStream
    .CrossApplyFolderFiles("list all required files", "*.zip", true).CrossApplyZipFiles("extract files from zip", "*.csv") // could be retrieved from database... anything...
    .CrossApplyTextFile("parse file", FlatFileDefinition.Create(i => new Article
    {
        Name = i.ToColumn("name"),
        Category = i.ToColumn("first name"),
        Price = i.ToNumberColumn<double?>("price", ".")
    }).IsColumnSeparated(','))
    .Lookup("get category label", categoriesStream, l => l.Category, r => r.Code, (l, r) => new
    {
        Name = l.Name,
        CategoryLabel = r.Label,
        PriceWithTax = l.Price * r.Tax,
    })
    [...];

paillave avatar Mar 23 '23 14:03 paillave

It is clearer now, thank you. The one thing I still can't understand is how to declare an in-memory stream from the SQL Database data, because this is, as I understand, currently the only way to combine two database streams.

mickvikt avatar Mar 23 '23 14:03 mickvikt

It is clearer now, thank you. The one thing I still can't understand is how to declare an in-memory stream from the SQL Database data, because this is, as I understand, currently the only way to combine two database streams.

I still don't understand your point. In the previous example, categoriesStream is fully in-memory, and it is combined with something that comes with file (could be database or anything)

FYI, here is the documentation that shows how to combine streams: https://paillave.github.io/Etl.Net/docs/recipes/linkStreams

paillave avatar Mar 23 '23 15:03 paillave

Yes, categoriesStream is in-memory, but it is initialized by using array initializer at compile time. My question is how do you populate an in-memory stream from a database table? Documentation says you cannot combine two database streams unless one of the streams is an in memory stream. Is there actually no way to combine to database streams?

mickvikt avatar Mar 23 '23 15:03 mickvikt

If you want the join to be done by the database engine, your only choice is to make the proper SQL query with the right join statement. If you want to the join/lookup to be made at ETL side, but you want to combine dataset that come from two different databases (for example) then you do the following:

But I still have the feeling this is not what you expect, so I'm a bit lost about what puzzles you.

var categoriesStream = contextStream.CrossApplySqlServerQuery("get categories", o => o
                    .FromQuery("select * from dbo.ProductCategory")
                    .WithMapping(i => new
                    {
                        Code = i.ToColumn("Code"),
                        Label = i.ToColumn("Label"),
                        Tax = i.ToNumberColumn<decimal>("Tax")
                    }));

contextStream
    .CrossApplyFolderFiles("list all required files", "*.zip", true).CrossApplyZipFiles("extract files from zip", "*.csv") // could be retrieved from database... anything...
    .CrossApplyTextFile("parse file", FlatFileDefinition.Create(i => new Article
    {
        Name = i.ToColumn("name"),
        Category = i.ToColumn("first name"),
        Price = i.ToNumberColumn<double?>("price", ".")
    }).IsColumnSeparated(','))
    .Lookup("get category label", categoriesStream, l => l.Category, r => r.Code, (l, r) => new
    {
        Name = l.Name,
        CategoryLabel = r.Label,
        PriceWithTax = l.Price * r.Tax,
    })
    [...];

paillave avatar Mar 23 '23 16:03 paillave

This sentence made everything clear,

If you want the join to be done by the database engine, your only choice is to make the proper SQL query with the right join statement.

Thank you, @paillave, and sorry for bombarding you with questions ;)

mickvikt avatar Mar 23 '23 16:03 mickvikt

leave it opened for documentation purpose till the answers are on the documentation website

paillave avatar Mar 23 '23 16:03 paillave