R: graph.data.frame converts factors to character
From @gaborcsardi on July 26, 2014 3:42
Add an option to keep factors as factors. See http://stackoverflow.com/questions/24965840/igraph-graph-data-frame-silently-converts-factors-to-character-vectors
Copied from original issue: igraph/igraph#665
From @elbamos on January 6, 2015 5:25
I'm writing to join in this request... In the first place, as a matter of R, it shouldn't be altering a variable type to or from factor silently, because the factor data definition contains information that's important in, e.g., regression. Similarly, factors are the natural data type for some graph-relevant data, like community membership.
Setting vertex colors should also be through factors in vertex attributes; if the graph is going to be visualized with ggplot2 or ggvis or the like, there's a whole framework for factor aesthetics.
This seems like a super-easy thing to fix/add/change. If I just do this, will you take the pull request? And if so, how would you prefer it implemented -- I'm thinking it's a graph-level "stringsAsFactors" preference set at graph creation.
There are several problems with factors. One is that you cannot write them to standard file formats. I mean, you can, but the fact that they are factors is lost. (There are no factors in GraphML, GML, etc.)
Another one is that you cannot even easily create a factor attribute in igraph currently:
g <- make_ring(10)
V(g)$foo <- factor(letters[1:10])
V(g)$foo
#> [1] 1 2 3 4 5 6 7 8 9 10
g <- set_vertex_attr(g, "bar", value = factor(letters[1:10]))
g
#> IGRAPH U--- 10 10 -- Ring graph
#> + attr: name (g/c), mutual (g/l), circular (g/l), foo (v/n), bar
#> | (v/n)
V(g)$bar
#> [1] 1 2 3 4 5 6 7 8 9 10
So at least this needs to be changed, but there are a lot of potential hiccups. In general, vertex/edge attributes that are not atomic builtin classes are not handled well in igraph.
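The integers printed above look like the factor's internal codes, stored with the class and levels dropped. A minimal base-R sketch of that loss, independent of igraph (the variable names are only for illustration):

```r
f <- factor(letters[1:10])

# Dropping the class and levels leaves only the integer codes.
codes <- as.integer(f)
codes

# The labels can be recovered only if the levels were kept somewhere.
labels <- levels(f)[codes]
identical(labels, letters[1:10])
```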
igraph does not use ggplot for graph drawing, so I don't really see how factors would help with graph drawing. Also, why are factors natural for community membership? Maybe if you name your communities. Otherwise simple consecutive integer numbers are just as natural, and making them factors is just an unnecessary complication imho.
From @elbamos on January 6, 2015 5:54
Well, one function of the igraph package is plotting. Another is generation of certain statistics. A third, though, is that it's a data structure with a very convenient, well-thought-out syntax for creating, editing, manipulating, etc. graphical data.
igraph doesn't use ggplot for plotting. igraph objects, though, can be fed into plotting systems other than igraph's built-in plotting. This is what GGally::ggnet does and I've tried to do with ggnetwork.
Why are factors natural for community membership? Well, because community membership is categorical data. More practically, consider this workflow:
vinfo <- data.frame(bunch of data about nodes including dat1 and factor2)
graph <- graph.data.frame(edges, vertices = vinfo)
V(graph)$astat <- igraph::a_stat_function(graph)
V(graph)$comm <- igraph::a_community_membership_function(graph)
graph %>% get.data.frame("vertices") %>% glm(dat1 ~ astat + comm + factor2, data = .)
or even
graph %>% get.data.frame("vertices") %>% glm(dat1 ~ astat + comm, data = .)
Without factors, that obviously will produce gobbledygook. This is a simple contrived example. When doing a lot of analysis to see how network structure relates to other variables, being able to store factors in igraph would really simplify the workflow.
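A minimal base-R sketch of the modelling difference, assuming a made-up vertex table (`vdata` and its values are hypothetical, and `lm` stands in for `glm` to keep it short):

```r
# Hypothetical vertex table; column names mirror the workflow above.
set.seed(1)
vdata <- data.frame(
  dat1 = rnorm(9),
  comm = rep(1:3, each = 3)  # community ids stored as plain integers
)

# Integer membership is treated as one numeric predictor (a single slope).
fit_numeric <- lm(dat1 ~ comm, data = vdata)

# Factor membership gets one dummy column per non-reference community.
fit_factor <- lm(dat1 ~ factor(comm), data = vdata)

length(coef(fit_numeric))  # intercept plus one slope
length(coef(fit_factor))   # intercept plus one term per extra community
```

With integer ids the model assumes community 3 differs from community 1 by exactly twice as much as community 2 does, which is rarely what is meant.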
From @elbamos on January 6, 2015 5:56
I'm not sure I caught exactly what you meant about the implementation issues. I see where file formats are an issue, but that's not really a solvable one, and doesn't seem like a show-stopper to me. As for the other issues, I understood from the Stack Overflow discussion that igraph was simply checking variables and converting all the factors to characters. So the project seemed to be going through the code, picking all that out, and then flyspecking whatever broke.
Is it a lot more than I was thinking?
These are some good points.
What I meant by the code above is that if factors are first class data types in igraph, then there should be ways to create them other than graph.data.frame, which is just a special case. set.*.attribute should support factors.
Another potential error that comes to mind immediately is the name vertex attribute, that is treated specially, and I am not sure if everything works if it is a factor. Probably not.
As for representing community membership as factors, that is probably OK, because it is represented by 1:k anyway, and factor levels would match their internal representation.
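A small base-R sketch of that point, using a hypothetical membership vector: when the ids are exactly 1:k, the factor's internal codes equal the ids, but that stops holding once the ids are not consecutive.

```r
# Hypothetical membership vector with consecutive ids 1:k.
membership <- c(1L, 1L, 2L, 3L, 3L, 2L)
comm <- factor(membership)

# The internal codes match the original ids.
codes_match <- identical(as.integer(comm), membership)
codes_match

# With non-consecutive ids the codes are renumbered from 1.
sparse <- factor(c(2L, 5L, 5L))
sparse_codes <- as.integer(sparse)
sparse_codes
levels(sparse)
```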
In general I am a bit ambivalent about factors. They are definitely a good idea, but the way they are implemented in R, you can get some surprising behaviour out of them. E.g. the way data.frame converts strings to factors is just wrong.
In summary, I don't mind trying to
- [ ] change graph.data.frame to keep factors, and
- [ ] change set.vertex.attribute and set.edge.attribute so that factors are actually kept.
From @elbamos on January 6, 2015 6:24
I agree with you on all counts. It's easiest to just not let names be factors, I think. That is a special case, as you say. I also agree that R can sometimes be surprising about them. But once one gets used to them and their purpose, that funny variable type is really invaluable.
Thank you for your attention to this.
I saw that you closed this... does that mean you're dropping it? Is there any way I can help?
As you can see, it is open. We just moved the R package to a separate repo.
Is there any work on this? It is especially pertinent for ggraph, in terms of allowing people to order scales as they would normally do in ggplot2...
reprex from the original Stack Overflow example.
library("igraph")
#>
#> Attaching package: 'igraph'
#> The following objects are masked from 'package:stats':
#>
#> decompose, spectrum
#> The following object is masked from 'package:base':
#>
#> union
actors <- data.frame(
name = c("Alice", "Bob", "Cecil", "David", "Esmeralda"),
age = c(48, 33, 45, 34, 21),
gender = factor(c("F", "M", "F", "M", "F"))
)
relations <- data.frame(
from = c(
"Bob", "Cecil", "Cecil", "David",
"David", "Esmeralda"
),
to = c("Alice", "Bob", "Alice", "Alice", "Bob", "Alice"),
same.dept = c(FALSE, FALSE, TRUE, FALSE, FALSE, TRUE),
friendship = c(4, 5, 5, 2, 1, 1), advice = c(4, 5, 5, 4, 2, 3)
)
g <- graph_from_data_frame(relations, directed = TRUE, vertices = actors)
g_actors <- as_data_frame(g, what = "vertices")
# Compare type of gender (before and after)
is.factor(actors$gender)
#> [1] TRUE
is.factor(g_actors$gender)
#> [1] FALSE
Created on 2024-02-26 with reprex v2.1.0
Old implementation by @thomasp85: #193.
While graph_from_data_frame() does remove factors, set_vertex_attr() supports them:
library("igraph")
#>
#> Attaching package: 'igraph'
#> The following objects are masked from 'package:stats':
#>
#> decompose, spectrum
#> The following object is masked from 'package:base':
#>
#> union
actors <- data.frame(
name = c("Alice", "Bob", "Cecil", "David", "Esmeralda"),
age = c(48, 33, 45, 34, 21),
gender = factor(c("F", "M", "F", "M", "F"))
)
relations <- data.frame(
from = c(
"Bob", "Cecil", "Cecil", "David",
"David", "Esmeralda"
),
to = c("Alice", "Bob", "Alice", "Alice", "Bob", "Alice"),
same.dept = c(FALSE, FALSE, TRUE, FALSE, FALSE, TRUE),
friendship = c(4, 5, 5, 2, 1, 1), advice = c(4, 5, 5, 4, 2, 3)
)
g <- graph_from_data_frame(relations, directed = TRUE, vertices = actors)
g_actors <- as_data_frame(g, what = "vertices")
# Compare type of gender (before and after)
is.factor(actors$gender)
#> [1] TRUE
is.factor(V(g)$gender)
#> [1] FALSE
is.factor(g_actors$gender)
#> [1] FALSE
g <- set_vertex_attr(g,"test_set",value=factor(LETTERS[1:5]))
V(g)$test_V <- factor(letters[1:5])
is.factor(V(g)$test_set)
#> [1] TRUE
is.factor(V(g)$test_V)
#> [1] TRUE
Created on 2025-01-21 with reprex v2.1.1
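Until graph_from_data_frame() keeps factors, one possible workaround is to restore them on the exported data frame. The sketch below is base R only; `restore_factors()` is a hypothetical helper, not part of igraph, and the second data frame is a hand-built stand-in for the output of as_data_frame(g, what = "vertices").

```r
# Hypothetical helper (not part of igraph): copy factor columns from the
# original vertex table back onto a data frame exported from the graph.
restore_factors <- function(exported, original, key = "name") {
  # Align rows of `original` to the row order of `exported`.
  idx <- match(as.character(exported[[key]]), as.character(original[[key]]))
  for (col in names(original)) {
    if (is.factor(original[[col]]) && col != key && col %in% names(exported)) {
      # Subsetting a factor keeps its levels and their order.
      exported[[col]] <- original[[col]][idx]
    }
  }
  exported
}

actors <- data.frame(
  name = c("Alice", "Bob", "Cecil"),
  gender = factor(c("F", "M", "F"), levels = c("M", "F"))
)

# Stand-in for the exported vertex table: the factor has been flattened
# to character and the row order differs from `actors`.
g_actors <- data.frame(
  name = c("Cecil", "Alice", "Bob"),
  gender = c("F", "F", "M"),
  stringsAsFactors = FALSE
)

g_actors <- restore_factors(g_actors, actors)
is.factor(g_actors$gender)
levels(g_actors$gender)
```

With a real graph the call would be restore_factors(as_data_frame(g, what = "vertices"), actors). Because the level order is carried over, scales built from the column should be ordered as they were in the original table.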
To make graph_from_data_frame() keep factors, it seems like only these lines need to be removed.
https://github.com/igraph/rigraph/blob/fe3b5b89ec6f7daa7f08d595f1a2a55b3cdd8eae/R/data_frame.R#L201-L203
https://github.com/igraph/rigraph/blob/fe3b5b89ec6f7daa7f08d595f1a2a55b3cdd8eae/R/data_frame.R#L222-L224
Am I missing something?
Worth trying with revdeps?
I would try to bundle some things together:
- Strings/NAs in Matrices
- factors in graph_from_data_frame
and see what happens in the revdeps
I lack the expertise to weigh in 😸
I'll add it to my TODO. It definitely feels like a big change that needs some more consideration.