TDC
TDC copied to clipboard
Improvements in the context of knowledge graph resources
Suggesting a new data function: Knowledge Graph Mastery, https://github.com/mims-harvard/TDC/discussions/23#discussioncomment-7364648
Changes:
- Add
knowledge_graph
module - Introduce
KnowledgeGraph
python class - Include
build_KG
function to enable creating graph from scratch - Add
to_KG
method to currentprimekg
in resource module to getPrimeKG
asKnowledgeGraph
object
Features:
- Generalized format to create, handle, and manipulate
KnowledgeGraph
datasets (mimiking the PrimeKG data frame) - Enhance maintenance and data integration for future knowledge graphs , e.g. https://github.com/mims-harvard/PrimeKG/pull/12
Hi @abearab thanks for the PR! I think this looks like a lightweight approach to accommodate all the future KGs that could be added to the TDC. I have two comments:
- The KG class should be probably defined under the tdc/utils folder. Since it is a data function.
- Is it possible to modify PrimeKG to inherit from the KG class you defined? This way all the changes made in the KG class can also be applied to PrimeKG.
Otherwise, looks great - happy to discuss more about these two comments if you have any question!
Thanks @kexinhuang12345
The KG class should be probably defined under the tdc/utils folder. Since it is a data function.
done – https://github.com/mims-harvard/TDC/pull/207/commits/5eee5a6d974f3680920a9073024513dc8ae7b960
Is it possible to modify PrimeKG to inherit from the KG class you defined? This way all the changes made in the KG class can also be applied to PrimeKG.
Yeah, I agree. I used a method option for the current PrimeKG class for converting it to KnowledgeGraph
class. I just removed that so the .KG
will keep the data for PrimeKG and inherits from the KG class by default. Thanks for the suggestion.
see https://github.com/mims-harvard/TDC/pull/207/commits/354cd50c98b5e795be5ca205570b5b6bd844966d
hi. can you run the new conda-test GH Action tests? They should be run if you create a new pull request. Also looks like your branch has conflicts. You should resolve those. Closing this for now.
hey @amva13 – thanks for your response.
I got this error, any thoughts?
OK (skipped=1)
/home/runner/work/TDC/TDC
All base tests passed
--- tdc/resource/primekg.py (original)
+++ tdc/resource/primekg.py (reformatted)
@@ -12,6 +12,7 @@
from ..utils.knowledge_graph import KnowledgeGraph
warnings.filterwarnings("ignore")
+
class PrimeKG:
"""PrimeKG data loader class to load the knowledge graph with additional support functions.
@@ -30,9 +31,9 @@
G = nx.Graph()
for i in self.KG.df.relation.unique():
- G.add_edges_from(
- self.KG.df[self.df.relation == i][["x_id", "y_id"]].values, relation=i
- )
+ G.add_edges_from(self.KG.df[self.df.relation == i][["x_id",
+ "y_id"]].values,
+ relation=i)
return G
def get_features(self, feature_type):
@@ -43,7 +44,5 @@
def get_node_list(self, node_type):
df = self.KG.df
- return np.unique(
- df[(df.x_type == node_type)].x_id.unique().tolist()
- + df[(df.y_type == node_type)].y_id.unique().tolist()
- )
+ return np.unique(df[(df.x_type == node_type)].x_id.unique().tolist() +
+ df[(df.y_type == node_type)].y_id.unique().tolist())
--- tdc/utils/knowledge_graph.py (original)
+++ tdc/utils/knowledge_graph.py (reformatted)
@@ -5,13 +5,13 @@
from copy import copy
kg_columns = [
- 'relation', 'display_relation',
- 'x_id', 'x_type', 'x_name', 'x_source',
+ 'relation', 'display_relation', 'x_id', 'x_type', 'x_name', 'x_source',
'y_id', 'y_type', 'y_name', 'y_source'
]
class KnowledgeGraph:
+
def __init__(self, df=None):
if df is not None:
self.df = df
@@ -29,26 +29,32 @@
def get_nodes_by_source(self, source):
# extract x nodes
x_df = self.df.query(
- f"x_source == '{source}' | y_source == '{source}'"
- )[[col for col in self.df.columns if col.startswith("x_")]]
+ f"x_source == '{source}' | y_source == '{source}'")[[
+ col for col in self.df.columns if col.startswith("x_")
+ ]]
for col in x_df.columns:
x_df = x_df.rename(columns={col: col[2:]})
# extract y nodes
y_df = self.df.query(
- f"x_source == '{source}' | y_source == '{source}'"
- )[[col for col in self.df.columns if col.startswith("y_")]]
+ f"x_source == '{source}' | y_source == '{source}'")[[
+ col for col in self.df.columns if col.startswith("y_")
+ ]]
for col in y_df.columns:
y_df = y_df.rename(columns={col: col[2:]})
# merge x and y nodes and keep only unique nodes
- out = pd.concat([x_df, y_df], axis=0).query(f'source == "{source}"').drop_duplicates().reset_index(drop=True)
+ out = pd.concat([
+ x_df, y_df
+ ], axis=0).query(f'source == "{source}"').drop_duplicates().reset_index(
+ drop=True)
return out
-def build_KG(indices, relation, display_relation, x_id, x_type, x_name, x_source, y_id, y_type, y_name, y_source):
- df = pd.DataFrame('',columns=kg_columns,index=indices)
+def build_KG(indices, relation, display_relation, x_id, x_type, x_name,
+ x_source, y_id, y_type, y_name, y_source):
+ df = pd.DataFrame('', columns=kg_columns, index=indices)
df.relation = relation
df.display_relation = display_relation
Error: Process completed with exit code 1.
Yes this is an error from the linter. It's from your code's formatting. In theory, you should be able to fix by simply replacing the contents of your file with the output from
yapf --style=google tdc/utils/knowledge_graph.py
for example.
here's more info on the linter if helpful
https://github.com/google/yapf
your tdc/resource/primekg.py
file is also failing so make sure to do the same there.
reopening
Yes this is an error from the linter. It's from your code's formatting. In theory, you should be able to fix by simply replacing the contents of your file with the output from
yapf --style=google tdc/utils/knowledge_graph.py
for example. here's more info on the linter if helpful https://github.com/google/yapf
Cool, I fixed it.
running tests
looks good @abearab will complete review this week.
Hey @abearab . From the looks of it you should be able to separate your edits to PrimeKG into a different class. Can you go ahead and do that? It can inherit from existing PrimeKG if useful. As there are no dependencies on PrimeKG in the rest of the repo, you and anyone wishing to leverage the code you implemented in KnowledgeGraph should be able to use your class directly. As there is not yet extensive testing on PrimeKG, I'd prefer we proceed in this manner. I can approve when this is done.
Hi @amva13 – we have a discussion with @kexinhuang12345 here https://github.com/mims-harvard/TDC/pull/207#issuecomment-1784357449 which motivated me to modify PrimeKG class. This is what Kexin suggested:
Is it possible to modify PrimeKG to inherit from the KG class you defined? This way all the changes made in the KG class can also be applied to PrimeKG.
I'm not sure if I understand your concern, can you clarify please? I'll be more than happy to add a few more commits if needed.
Yes, i think it would be great to create a separate KG class and PrimeKG can inherit from it. The reason is that in the future if there is other KG, then we can inherit from that class and reuse it instead of reimplementing the KG loading/processing/analysis shared functions. Let me know if that makes sense!
@abearab your implementation does not currently have PrimeKG inheriting from KnowledgeGraph/KG (i.e. would look like PrimeKG(KG) for example)
Regardless - please do not change the code in PrimeKG. Please create a separate class, say, PrimeKGDev. There is no need for you to inherit from KG. You can use the exact same code you are using for PrimeKG and copy that over to PrimeKGDev.
PrimeKGDev should offer exactly the same interface as PrimeKG (this should not be difficult if you just copy over the code).
For conflict resolution, I am currently the authority on TDC, but you may ask @marinkaz for clarification if needed.
@kexinhuang12345 please feel free to implement testing on PrimeKG and we can consolidate the two classes, say by changing PrimeKG to inherit directly from PrimeKGDev, accordingly.
Thanks!
@ abearab your implementation does not currently have PrimeKG inheriting from KnowledgeGraph/KG (i.e. would look like PrimeKG(KG) for example)
That's true, now I fixed it. Thanks for the suggestion.
Regardless - please do not change the code in PrimeKG. Please create a separate class, say, PrimeKGDev. There is no need for you to inherit from KG. You can use the exact same code you are using for PrimeKG and copy that over to PrimeKGDev.
PrimeKGDev should offer exactly the same interface as PrimeKG (this should not be difficult if you just copy over the code).
I'm not sure if I understood your points for having both PrimeKGDev and PrimeKG. Can you clarify, please? Or if it's easier for you, maybe you can add that yourself? Let me know what you think.
For conflict resolution, I am currently the authority on TDC, but you may ask @marinkaz for clarification if needed.
Sounds great, I don't think any clarification is needed in this regard. I just referred to our prior conversations with @kexinhuang12345 to make sure we are all on the same page. Looking forward to seeing all upcoming features in TDC!
@abearab feel free to merge this at your convenience. lgtm