This function takes a data frame as its argument and returns the column names (or indices) of all columns containing dates, together with the most likely column containing year information, if any. It can be used to automate the search for date and year columns in data frames.
Arguments
- x
A data frame object
- return_index
A logical value indicating whether the function should return the indices of time columns instead of the column names. Default is FALSE (column names are returned).
- allow_NA
A logical value indicating whether time columns are allowed to contain NA values. Default is allow_NA=FALSE (the function will not return time columns containing NA values).
- sample_size
Either NA or a numeric value indicating the sample size used for evaluating columns. Default is 1000. If NA is passed, the function will evaluate the full table. The minimum accepted value is 100 (i.e. 100 randomly sampled rows are used to evaluate the columns). This parameter can be tuned to speed up computation on long datasets. Taking a sample could result in inexact identification of key columns; accuracy improves with larger samples.
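The row sampling controlled by sample_size can be pictured with the minimal sketch below. The helper sample_rows() is hypothetical and is not part of the package; it only assumes the behaviour described above (NA means the full table, values below 100 are rejected), plus the assumption that tables no longer than the sample size are used as they are.

```r
# Hypothetical helper, not part of the package: pick the rows used for evaluation.
sample_rows <- function(x, sample_size = 1000) {
  # NA means "evaluate the full table"
  if (is.na(sample_size)) return(x)
  # 100 is the minimum accepted sample size
  stopifnot(is.numeric(sample_size), sample_size >= 100)
  # Assumption: tables no longer than the sample size are used as they are
  if (nrow(x) <= sample_size) return(x)
  # Draw sample_size rows at random, without replacement
  x[sample.int(nrow(x), sample_size), , drop = FALSE]
}

df <- data.frame(year = rep(1970:2019, 100), value = rnorm(5000))
nrow(sample_rows(df))                    # 1000
nrow(sample_rows(df, sample_size = NA))  # 5000
```

Because the rows are drawn at random, a rare malformed value may be missed in a small sample, which is why larger samples give more reliable results.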
Value
Returns a vector of names (return_index=FALSE) or indices (return_index=TRUE) of columns containing date or year information. Only the most likely year column is returned.
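To make the shape of the return value concrete, here is a toy sketch of how such a search could work. It is not the package's implementation: find_timecol_sketch(), is_date_like() and is_year_like() are hypothetical names, and the rules used (ISO YYYY-MM-DD dates, whole numbers between 1800 and 2100 as years, preference for a column whose name contains "year") are assumptions chosen only for illustration.

```r
# Hypothetical sketch, not the package's implementation.

# Assumption: a column is date-like if every non-missing value parses as YYYY-MM-DD.
is_date_like <- function(col, allow_NA = FALSE) {
  if (anyNA(col) && !allow_NA) return(FALSE)
  vals <- as.character(col[!is.na(col)])
  if (length(vals) == 0) return(FALSE)
  !anyNA(as.Date(vals, format = "%Y-%m-%d"))
}

# Assumption: a column is year-like if it holds whole numbers between 1800 and 2100.
is_year_like <- function(col, allow_NA = FALSE) {
  if (!is.numeric(col)) return(FALSE)
  if (anyNA(col) && !allow_NA) return(FALSE)
  vals <- col[!is.na(col)]
  length(vals) > 0 && all(vals == round(vals)) && all(vals >= 1800 & vals <= 2100)
}

find_timecol_sketch <- function(x, return_index = FALSE, allow_NA = FALSE) {
  dates <- which(vapply(x, is_date_like, logical(1), allow_NA = allow_NA))
  years <- which(vapply(x, is_year_like, logical(1), allow_NA = allow_NA))
  # Keep a single year candidate: prefer a name containing "year", else the first one
  if (length(years) > 1) {
    named <- years[grepl("year", names(years), ignore.case = TRUE)]
    years <- if (length(named) > 0) named[1] else years[1]
  }
  idx <- unname(c(years, dates))
  if (return_index) idx else names(x)[idx]
}

df <- data.frame(a = 1970:2020, year = 1970:2020,
                 b = rep("2020-01-01", 51), c = sample(1:1000, 51))
find_timecol_sketch(df)                       # "year" "b"
find_timecol_sketch(df, return_index = TRUE)  # 2 3
```

In this sketch both a and year look like year columns, but only one of them is kept, mirroring the rule that a single year column is returned.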
Examples
find_timecol(x = data.frame(a = 1970:2020, year = 1970:2020, b = rep("2020-01-01", 51), c = sample(1:1000, 51)))
#> [1] "year" "b"