pyapacheatlas
Atlas Bulk upload with batch size throwing error
I am trying to upload an Excel file using the bulk upload functionality. For smaller files it works fine, but with more data the server throws an error (504 timeout), so I tried the batch size parameter. But then it started throwing the error below. Any suggestions on how to resolve it?
```python
from pyapacheatlas.auth import BasicAuthentication
from pyapacheatlas.core import AtlasClient

# Placeholder credentials; any pyapacheatlas authentication object works here.
auth = BasicAuthentication(username="admin", password="admin")

client = AtlasClient(endpoint_url='http://xxxx.xxx.xx/api/atlas/v2',
                     authentication=auth)
result = client.upload_entities(batch=entities, batch_size=1000)
```
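(For context, `entities` here comes from parsing the Excel template. A minimal sketch of that step, assuming the standard bulk-entities sheet; the file name is a placeholder:)

```python
from pyapacheatlas.readers import ExcelConfiguration, ExcelReader

# Parse the workbook's bulk entities sheet into entity payloads.
# "bulk_entities.xlsx" is a placeholder for the real file path.
excel_config = ExcelConfiguration()
excel_reader = ExcelReader(excel_config)
entities = excel_reader.parse_bulk_entities("bulk_entities.xlsx")
```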
File "C:\Users\sbagal\PycharmProjects\pythonProject\venv\lib\site-packages\pyapacheatlas\core\client.py", line 1243, in upload_entities payload["entities"], batch_size=batch_size)] File "C:\Users\sbagal\PycharmProjects\pythonProject\venv\lib\site-packages\pyapacheatlas\core\util.py", line 274, in batch_dependent_entities if len(candidate_set) > largest_candidate_size: TypeError: object of type 'NoneType' has no len()
@wjohnson Any suggestions on this? I have an Excel file with around 20K entities, and it can't process them all at once, so I tried loading with batch_size=1000. Also, a question along the same lines: do all required relationship entities need to be available in the same batch? For example, if I create a table in the first batch, can I reference it in the second batch for a column relationship, or does it need to be present in the second batch again?
Hi @sbbagal13, thank you for using PyApacheAtlas!
Can you share which version of PyApacheAtlas you're on? The traceback is in a surprising spot; it looks as if it couldn't find any of the assets to be grouped together.
Your second question makes me wonder how deeply connected your assets are. For example, are all 20,000 related to each other? The batching algorithm looks for relationships and creates batches that keep all related entities together. If you have a group of related entities that exceeds your batch size, it should throw an exception.
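(To illustrate the idea - this is a simplified sketch, not the library's actual implementation: entities connected by relationships are grouped together, whole groups are packed into batches, and a group larger than the batch size cannot be split:)

```python
from collections import defaultdict

def batch_dependent_entities_sketch(entities, batch_size):
    """Illustrative only: group entities that reference each other by guid,
    then pack whole groups into batches no larger than batch_size."""
    # Union-find over entity guids.
    parent = {e["guid"]: e["guid"] for e in entities}

    def find(g):
        while parent[g] != g:
            parent[g] = parent[parent[g]]  # path halving
            g = parent[g]
        return g

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb

    # Any relationship attribute pointing at another entity in this upload
    # ties the two entities into the same group. (Real payloads can also
    # hold lists of references; this sketch only handles single dicts.)
    for e in entities:
        for rel in e.get("relationshipAttributes", {}).values():
            target = rel.get("guid") if isinstance(rel, dict) else None
            if target in parent:
                union(e["guid"], target)

    groups = defaultdict(list)
    for e in entities:
        groups[find(e["guid"])].append(e)

    # A group of related entities can never be split across batches.
    batches, current = [], []
    for group in groups.values():
        if len(group) > batch_size:
            raise ValueError("A group of related entities exceeds the batch size")
        if len(current) + len(group) > batch_size:
            batches.append(current)
            current = []
        current.extend(group)
    if current:
        batches.append(current)
    return batches
```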
Hi @wjohnson, thank you for responding. I tried with the latest version, 0.13, and get the same exception. As you said, it might be an issue with how the data is organized in the Excel file. Here is how we have the data laid out:

1. rdbms_instance (first/single row)
2. rdbms_db (second row till 1050 rows)
3. rdbms_db (second row till 55 rows)
4. rdbms_columns (rows 56 to ~20k)
How can I load this data using batches? What would be the best approach to making the batches, and what batch size should I use? Does each batch need to maintain the hierarchy (rdbms_instance => rdbms_db => rdbms_tables => rdbms_columns)?
Oh yikes, @sbbagal13 - I might batch this up into a few things and take advantage of the AtlasObjectId feature. You can see more about this feature here; a code sketch of the same idea follows the list below.
I'd follow this order of operations:
- Upload the rdbms_instance by itself, collect its guid
- Upload the rdbms_db entities by themselves and...
  - Add a column called `[Relationship] instance`
  - Fill that column with `AtlasObjectId(guid:123-abc-456)` where `123-abc-456` is the guid of the rdbms_instance
  - Perform an upload with a batch size of 100 or 200 - they should all now be independent
  - Note the guids
- Upload the rdbms_table and columns and...
  - Add columns called `[Relationship] db` and `[Relationship] table`
  - Fill the `db` column for each table row with `AtlasObjectId(guid:123-abc-456)` where `123-abc-456` is the guid of the rdbms_db
  - Fill the `table` column for each rdbms_column row with the FQN of the table that is being uploaded with it
  - Perform an upload with a batch size of 1000 or 2000 - they should all now be independent
    - This assumes that no one table has more than 1,000 columns
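(To make that concrete, here is a hedged sketch of the same pattern done through the client API instead of Excel. The guids, names, and qualified names are placeholders, and it assumes `AtlasEntity` accepts `relationshipAttributes` in the raw Atlas dict form:)

```python
from pyapacheatlas.core import AtlasEntity

# Placeholder guids of entities already uploaded in earlier batches.
INSTANCE_GUID = "123-abc-456"   # guid of the rdbms_instance
DB_GUID = "789-def-012"         # guid of one rdbms_db

# Batch 2: each rdbms_db references the existing instance by guid --
# the same thing AtlasObjectId(guid:...) expresses in the Excel column.
db = AtlasEntity(
    name="sales_db",
    typeName="rdbms_db",
    qualified_name="myserver.sales_db",
    guid="-100",
    relationshipAttributes={"instance": {"guid": INSTANCE_GUID}}
)

# Batch 3: tables reference their db by guid; each column references its
# table within the same batch (a negative placeholder guid here; in the
# Excel template this is done with the table's FQN instead).
table = AtlasEntity(
    name="orders",
    typeName="rdbms_table",
    qualified_name="myserver.sales_db.orders",
    guid="-101",
    relationshipAttributes={"db": {"guid": DB_GUID}}
)
column = AtlasEntity(
    name="order_id",
    typeName="rdbms_column",
    qualified_name="myserver.sales_db.orders.order_id",
    guid="-102",
    relationshipAttributes={"table": {"guid": table.guid}}
)

client.upload_entities(batch=[db], batch_size=200)
client.upload_entities(batch=[table, column], batch_size=1000)
```

In a real run you would read the server-assigned guids out of the first upload's response (its guidAssignments map) before building the next batch, rather than hard-coding them as above.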
That SHOULD solve the problem of everything being related to the one root rdbms instance. Please let me know if you give it a try!
Closing this for now (due to > 1 month since last update) but @sbbagal13 please feel free to reopen if you are still having an issue.