This function takes a data frame as its argument and returns the names (or indices) of all columns containing country names.
It can be used to automate the search for country columns in data frames.
For the purpose of this function, a country is any of the 249 territories designated in the ISO standard 3166.
On large datasets, a random sample of rows is used for evaluating the columns.
Usage
find_countrycol(
x,
return_index = FALSE,
allow_NA = TRUE,
min_share = 0.9,
sample_size = 1000
)
Arguments
- x
A data frame object.
- return_index
A logical value indicating whether the function should return the indices of country columns instead of the column names. Default is FALSE, i.e. column names are returned.
- allow_NA
A logical value indicating whether columns containing NA values are to be considered as country columns. Default is allow_NA=TRUE, i.e. columns containing NA values can be returned. With allow_NA=FALSE, the function will not return country columns containing NA values.
- min_share
A value between 0 and 1 indicating the minimum share of country names in columns that are returned. A value of 0 will return any column containing a country name. A value of 1 will return only columns whose entries are all country names. Default is 0.9, i.e. at least 90 percent of the column entries need to be country names.
- sample_size
Either NA or a numeric value indicating the sample size used for evaluating columns. Default is 1000. If NA is passed, the function will evaluate the full table. The minimum accepted value is 100 (i.e. 100 randomly sampled rows are used to evaluate the columns). This parameter can be tuned to speed up computation on long datasets. Because only a sample is evaluated, the identification of country columns may be inexact; accuracy improves with larger samples.
Value
Returns a vector of column names (return_index=FALSE) or column indices (return_index=TRUE) of columns containing country names.
Examples
find_countrycol(x=data.frame(a=c("Brésil","Tonga","FRA"), b=c(1,2,3)))
#> [1] "a"
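
The way min_share and sample_size interact can be illustrated with a small base R sketch. This is not the package's implementation: share_country_sketch() and the known lookup vector are hypothetical names introduced here, and the matching is a plain case-insensitive comparison against a toy list rather than a full country-name dictionary. It only shows the idea of sampling rows and comparing each column's share of recognised names against the threshold.

```r
# Hypothetical helper, NOT part of the package: compares the share of
# recognised country names in each column against min_share.
share_country_sketch <- function(x, known, min_share = 0.9, sample_size = 1000) {
  # Evaluate a random sample of rows when the table is longer than sample_size;
  # sample_size = NA means the full table is evaluated
  if (!is.na(sample_size) && nrow(x) > sample_size) {
    x <- x[sample(nrow(x), sample_size), , drop = FALSE]
  }
  # Share of entries in each column that match the lookup (case-insensitive)
  shares <- vapply(
    x,
    function(col) mean(tolower(as.character(col)) %in% tolower(known)),
    numeric(1)
  )
  names(x)[shares >= min_share]
}

# Toy lookup standing in for a full country-name dictionary (assumption)
known <- c("Brazil", "Tonga", "France", "Italy")

df <- data.frame(
  a = c("Brazil", "Tonga", "France", "Italy", "unknown"),
  b = c(1, 2, 3, 4, 5)
)

# Column "a" has 4 matching entries out of 5, so it passes a 0.75 threshold
share_country_sketch(df, known, min_share = 0.75)
# With the stricter 0.9 threshold, no column qualifies
share_country_sketch(df, known, min_share = 0.9)
```

Lowering min_share makes the check more tolerant of stray entries such as headers or misspellings, at the cost of more false positives.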