This function takes a data frame as its argument and returns the column names (or indices) of a set of columns that uniquely identify the table entries (i.e. a table key). It can be used to automate the search for table keys. Since the function was designed for country data, it first searches for columns containing country names and dates/years, and gives these columns priority in the search for keys. Next, the function prioritises left-most columns in the table. For time efficiency, the function does not test all possible combinations of columns; it only tests the most likely ones. The function looks for the most common country data formats (e.g. cross-sectional, time-series, panel data, dyadic) and searches for up to 2 additional key columns beyond the country and time columns.
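To make the notion of a table key concrete, the following base R sketch checks whether a given set of columns uniquely identifies every row. The helper `is_key()` and the toy data are hypothetical and only illustrate the uniqueness test; they are not the function's actual search procedure.

```r
# Hypothetical helper (not part of the package): TRUE when the given columns
# jointly identify every row, i.e. no combination of their values repeats.
is_key <- function(df, cols) {
  anyDuplicated(df[, cols, drop = FALSE]) == 0
}

# Toy panel data: two countries observed in two years (made-up values)
panel <- data.frame(
  country = c("France", "France", "Italy", "Italy"),
  year    = c(2020, 2021, 2020, 2021),
  value   = c(10, 10, 20, 20)
)

is_key(panel, "country")             # FALSE: each country appears twice
is_key(panel, c("country", "year"))  # TRUE: each country-year pair is unique
```

A search for keys amounts to running a test of this kind over candidate column sets, which is why limiting the candidates matters for speed.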
Usage
find_keycol(
  x,
  return_index = FALSE,
  search_only = NA,
  sample_size = 1000,
  allow_NA = FALSE
)
Arguments
- x
A data frame object
- return_index
A logical value indicating whether the function should return the indices of the key columns instead of their names. Default is FALSE, in which case column names are returned.
- search_only
This parameter can be used to restrict the search for table keys to a subset of columns. The default is NA, which results in the entire table being searched. Alternatively, users may restrict the search by providing a vector containing the names or the numeric indices of the columns to check. For example, the search could be restricted to the first ten columns by passing 1:10. This can be useful for speeding up the search in wide tables.
- sample_size
Either NA or a numeric value indicating the sample size used for evaluating columns. Default is 1000. If NA is passed, the function will evaluate the full table. The minimum accepted value is 100 (i.e. 100 randomly sampled rows are used to evaluate the columns). This parameter can be tuned to speed up computation on long datasets. Taking a sample could result in inexact identification of key columns; accuracy improves with larger samples.
- allow_NA
Logical value indicating whether to allow key columns to have NA values. Default is allow_NA=FALSE. If set to TRUE, NA is considered a distinct value.
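The calls below sketch how the arguments described above might be combined. They assume that find_keycol() is exported by an installed package named `countries` (the package name is an assumption, not stated on this page), so the calls are guarded and skipped when that package is unavailable. The toy table is made up for illustration, and no output is shown because it depends on the installed version.

```r
# Toy panel: 3 countries x 5 years, one row per country-year (made-up values)
panel <- expand.grid(
  country = c("France", "Italy", "Spain"),
  year = 2016:2020,
  stringsAsFactors = FALSE
)
panel$value <- seq_len(nrow(panel))

# Assumption: find_keycol() lives in a package called 'countries'
if (requireNamespace("countries", quietly = TRUE)) {
  # Default call: key columns returned by name
  print(countries::find_keycol(panel))

  # Return column indices instead of names
  print(countries::find_keycol(panel, return_index = TRUE))

  # Only consider the first two columns and evaluate the full table
  print(countries::find_keycol(panel, search_only = 1:2, sample_size = NA))

  # Treat NA as a distinct value when testing candidate key columns
  print(countries::find_keycol(panel, allow_NA = TRUE))
}
```

Restricting `search_only` is mainly of interest for wide tables, while `sample_size = NA` trades speed for an exact check on long ones.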
Value
Returns a vector of column names (or indices) that uniquely identify the entries in the table. If no key is found, the function will return NULL. The output is a named vector indicating whether the identified key columns contain country names ("country"), years and dates ("time"), or other types of information ("other").
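Because the result can be either NULL or a named vector, calling code usually needs to handle both cases. The sketch below uses a hypothetical helper, `describe_key()`, and a hand-written stand-in for a result. It assumes that the vector names carry the column type ("country", "time", "other") and the values carry the column names; the column names `nation` and `year` are invented for the example.

```r
# Hypothetical helper (not part of the package): summarise a key search result
describe_key <- function(keys) {
  if (is.null(keys)) {
    return("no key found")
  }
  # Pair each key column with its type label taken from the vector names
  paste0(keys, " (", names(keys), ")", collapse = " + ")
}

# Hand-written stand-in for a result, under the assumption stated above
mock_keys <- c(country = "nation", time = "year")

describe_key(mock_keys)  # "nation (country) + year (time)"
describe_key(NULL)       # "no key found"
```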