Quanteda textplot_xray grouped by non-unique docvar as document

I have a quanteda corpus of 10 documents, several of which are by the same author. I store the author in a separate docvar column, myCorpus$documents[,"author"]:

> docvars(myCorpus)

          author   
206035    author1   
269823    author2   
304225    author1   
422364    author2
<...snip..>

I’m charting a lexical dispersion plot with textplot_xray:

textplot_xray(
            kwic(myCorpus, "image"),
            kwic(myCorpus, "one"),
            kwic(myCorpus, "like"),
            kwic(myCorpus, "time"),
            kwic(myCorpus, "just"),
            scale = "absolute"
          )

[textplot_xray plot]

How can I use myCorpus$documents[,"author"] as the document identifier instead of the Document ID?

I’m not trying to group the docs; I just want to identify each document by its author. I recognize that document IDs need to be unique, so I can’t simply rename the docs with docnames(myCorpus) <-
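One possible workaround (not from the question; a sketch assuming quanteda's standard API) is to keep the IDs unique while embedding the author, e.g. via base R's make.unique():

```r
library(quanteda)

# Hypothetical sketch: relabel documents by author while keeping names
# unique. make.unique() appends .1, .2, ... to repeated author names,
# so the y-axis of textplot_xray() shows the author, not the numeric ID.
docnames(myCorpus) <- make.unique(docvars(myCorpus, "author"))

textplot_xray(kwic(myCorpus, "image"), scale = "absolute")
```

This keeps docnames() unique (e.g. author1, author2, author1.1, author2.1) while making the author visible in the plot labels.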

Filtering rows where all columns have the same element

I have a file with 4 columns: the first column is the student name, the second is the final grade in Math, the third is the final grade in Science, and the fourth is the final grade in Art. The final grade is either pass or fail. I want to keep only students who passed all subjects (i.e., have pass in all subjects). I read the data using read.csv in R, but I wasn’t able to filter it.

Col1       col2     col3    col4
Amanda     pass     fail    pass
Mick       pass     pass    pass
Andrew     pass     pass    fail
Mark       pass     pass    pass

From the example above, I need to keep only students who passed all subjects, like Mick and Mark.
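From the description above, a minimal base R sketch (assuming the file was read with read.csv and has these four columns) could look like:

```r
# Rebuild the example data as read.csv would return it.
grades <- read.csv(text = "Col1,col2,col3,col4
Amanda,pass,fail,pass
Mick,pass,pass,pass
Andrew,pass,pass,fail
Mark,pass,pass,pass", stringsAsFactors = FALSE)

# Count, per row, how many grade columns equal "pass" and keep rows
# where every grade column (all columns except the name) is "pass".
passed <- grades[rowSums(grades[, -1] == "pass") == ncol(grades) - 1, ]
passed$Col1
# "Mick" "Mark"
```

The same row filter generalizes to any number of grade columns, since it compares the per-row count of "pass" against the number of grade columns.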

Automatically deleting an image pattern from a PDF [on hold]

I have a historical database that I am digitizing with the help of ABBYY’s OCR. The database is a 100-page PDF; an example of the data is shown in the picture below.
I have realized that ABBYY’s recognition of the table patterns works a lot better when I delete the dot leaders that link the first column on the left to the second (like those in the line “Riporto…..”). I can do this deletion manually, page by page, in ABBYY. However, I was wondering whether somebody could give me advice on how I could do this automatically for all the pages in my PDF. I know how to get around in Python, R, ArcGIS, and QGIS, and I am learning my way around ABBYY.

Thank you!

[example image of the table data]

ggplot fill fails if group contains only NA

I am plotting a boxplot using ggplot and the following data:

plot_data <- structure(list(group = c("a1", "a1", "a1", "a2", "a2", "a2", "b1", "b1", "b1", "b2", "b2", "b2"), value = c(1, 2, 3, 1, 2, 3, NA, NA, NA, 1, 2, 3)), .Names = c("group", "value"), row.names = c(NA, -12L), class = "data.frame")

And the following code:

ggplot(data = plot_data, aes(x = group, y = value))+
    geom_boxplot(fill= c('blue','blue','green','green'))+
    theme_classic()

This results in this error:

Error: Aesthetics must be either length 1 or the same as the data (3): fill

The error occurs because one group contains only NA values. ggplot still shows this group on the x-axis, which I like, but it complains about filling the empty boxplot.

I could fix it by removing one of the fill arguments.
However, this is not really feasible, since I use the plotting function multiple times inside a loop in which the values of some groups are sometimes all NA. I know I could create the fill vector dynamically based on the groups contained in the data, but I would like to keep it constant.

So my question is:

Is there a way to always use the same fill vector without any complaints from ggplot, perhaps via another aes option?
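One workaround worth sketching (an assumption, not part of the original question): map fill inside aes() and pin the colours with scale_fill_manual(), so ggplot matches colours by group name rather than by position:

```r
library(ggplot2)

plot_data <- data.frame(
  group = c("a1", "a1", "a1", "a2", "a2", "a2",
            "b1", "b1", "b1", "b2", "b2", "b2"),
  value = c(1, 2, 3, 1, 2, 3, NA, NA, NA, 1, 2, 3)
)

# Colours are looked up by group name, so a group that draws no boxplot
# (all-NA values) simply has its colour skipped instead of causing a
# length-mismatch error.
ggplot(plot_data, aes(x = group, y = value, fill = group)) +
  geom_boxplot() +
  scale_fill_manual(values = c(a1 = "blue", a2 = "blue",
                               b1 = "green", b2 = "green")) +
  theme_classic()
```

The same named values vector can then stay constant across every iteration of the loop, regardless of which groups happen to be all NA.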

XIRR function from TVM package in R

I am trying to leverage the tvm package in R to calculate XIRR for a set of cash flows and dates.

I have a moving window, where I start off with i = 1, CF = CF[1], d = d[1], and as I progress forward, the rest of the cash flows get involved.

I understand the XIRR function will throw an error in the absence of a sign change in the Cash Flow input.

So, in order to handle that I put it in a tryCatch.

For the reproducible example I am providing below, I expect to see NA until a positive cash flow value is encountered; once a positive cash flow value appears, I expect the function to return a valid value, as Excel does.

 # Reprex
    # Attach desired packages 
    suppressPackageStartupMessages(library(tvm))

    # Provide input data 
    CF <- c(-78662, -32491, -32492, -32492, -32493,
            -32494, 7651, 40300, 10003, 9868,
            7530, 7639, 9939, 9804, 7475)
    d <- as.Date(c("2019-06-30", "2019-09-30", "2019-12-31", "2020-03-31", "2020-06-30",
                   "2020-09-30", "2020-12-31", "2021-03-31", "2021-06-30", "2021-09-30",
                   "2021-12-31", "2022-03-31", "2022-06-30", "2022-09-30", "2022-12-31"))
    test <- xirr(cf = CF, d = d)

    print(test)

Any guidance on fixing this is appreciated.
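The moving window described above might be sketched like this (the xirr_window name is hypothetical; tryCatch() returns NA whenever xirr() fails, e.g. for lack of a sign change in the window):

```r
library(tvm)

# For each window 1..i, return xirr() if it converges, NA otherwise.
# Early windows contain only negative cash flows, so they yield NA;
# later windows include positive flows and should return a rate.
xirr_window <- sapply(seq_along(CF), function(i) {
  tryCatch(xirr(cf = CF[1:i], d = d[1:i]),
           error = function(e) NA_real_)
})
xirr_window
```

This keeps the per-window errors from stopping the loop while still surfacing a numeric rate once the sign change makes the root-finding feasible.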

From character to posixct

I have a character vector that consists of a year and a month. I want to convert it to a date type, but when I try to do this with as.POSIXct, I get the error:

 Error in as.POSIXlt.character(x, tz, ...) : 
   character string is not in a standard unambiguous format

I can’t seem to figure out why it won’t work. Anyone?

 old <- as.character("201702")

 library(lubridate)
 new <- as.POSIXct(old, origin = "201501")
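For reference, two possible fixes (a sketch; ym() assumes a reasonably recent lubridate): "201702" has no day component, so either append one and pass an explicit format, or use lubridate's year-month parser:

```r
old <- "201702"

# Option 1: add a day and tell as.POSIXct the exact format.
as.POSIXct(paste0(old, "01"), format = "%Y%m%d", tz = "UTC")
# "2017-02-01 UTC"

# Option 2: lubridate's year-month parser returns a Date.
library(lubridate)
ym(old)
# "2017-02-01"
```

The original error arises because as.POSIXct only guesses formats like "YYYY-MM-DD"; a bare "YYYYMM" string is ambiguous to it, so the format must be supplied explicitly.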

Figuring out cause of Graphical differences between ggplot and base R plotting

I have plotted my data with both base R and ggplot methods to see how the plots differ, and my ggplot() graph looks wrong. It should look like it does when I graph it in base R. Shown below are my base R code and my ggplot code, and the graph each produces.

Base R code:

rhnplot <- plot(rhn, depth, main= "Depth. vs. RHN", xlab = "RHN", ylab="Depth (ft)", type="l", col="blue", xlim=c(200, 900), ylim= rev(c(min(dfplot[,1])-5, max(dfplot[,1])+5)), xaxs="i", yaxs="i")

Base R graph: [plot image]

ggplot() code:

ggplot(dfplot, aes(rhn,depth)) + geom_line() + xlab("RHN") + ylab("Depth (ft)") + labs(title="Depth vs. RHN") + scale_x_continuous(position = "top", limits = c(200,900), expand=c(0,0)) + scale_y_reverse(limits = c(max(dfplot[,1])+5, min(dfplot[,1])-5 ), expand = c(0, 0))

ggplot() graph:

[plot image]

I know that my base R graph is correct, and I need to get my ggplot() graph to look like it does in base R. I’m not sure what the problem is, though.
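One plausible cause, sketched below as a guess without seeing the plots: geom_line() connects points in order of the x values, whereas base plot(type = "l") connects them in data order; geom_path() mimics the base behaviour:

```r
library(ggplot2)

# geom_path() draws segments in the order the rows appear, like
# plot(type = "l") in base R. For a depth profile, where depth (not RHN)
# is the ordered variable, this often fixes a "scrambled" line.
ggplot(dfplot, aes(rhn, depth)) +
  geom_path() +
  xlab("RHN") + ylab("Depth (ft)") +
  labs(title = "Depth vs. RHN") +
  scale_x_continuous(position = "top", limits = c(200, 900),
                     expand = c(0, 0)) +
  scale_y_reverse(limits = c(max(dfplot[, 1]) + 5, min(dfplot[, 1]) - 5),
                  expand = c(0, 0))
```

If the line still looks different, checking whether the limits clip points (ggplot drops out-of-range values rather than just zooming) would be the next thing to rule out.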

r dplyr::left_join using datetime columns does not join properly

I have a large dataset of datetimes, one for every second of almost a full year. I am trying to dplyr::left_join a second dataset that has a datetime column with values within the time range of the first dataset. When I join the datasets, only a small number of records join (about 100 out of about 45k), and I know most records should join. The checks I’m doing to ensure the columns are the same include:

dput(df_all_dates$date_time[1])
dput(df_subset_dates$date_time[1])

Both of these produce the following:

structure(1485781200, class = c("POSIXct", "POSIXt"), tzone = "")

I’ve also done the following comparison (the 10 and the 4701 in the following code reflect the same dates in the data):

as.numeric(df_all_dates$date_time[10]) # produces value 1485785900
as.numeric(df_subset_dates$date_time[4701]) # produces value 1485785900

However, in the join, the data from the df_subset_dates does not join into the resulting dataset, even though the datetime values are the same. Is there something else about datetimes that would cause these not to join? Some values do join, but I don’t see any pattern as to why those records are different from the ones that do not join.

Here is the code of the actual join, if helpful:

df_all_dates %>%
 left_join(df_subset_dates, by = 'date_time')
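A workaround sketch (the round_to_sec() helper is hypothetical): POSIXct stores time as a double number of seconds, so two datetimes that print identically can differ by a fraction of a second and fail an exact equality join; rounding both keys to whole seconds first rules that out:

```r
library(dplyr)

# Snap a POSIXct vector to whole seconds, preserving its timezone.
round_to_sec <- function(x) {
  as.POSIXct(round(as.numeric(x)), origin = "1970-01-01",
             tz = attr(x, "tzone"))
}

df_all_dates %>%
  mutate(date_time = round_to_sec(date_time)) %>%
  left_join(
    df_subset_dates %>% mutate(date_time = round_to_sec(date_time)),
    by = "date_time"
  )
```

If the rounded join matches the expected ~45k records, sub-second noise in the keys was the culprit; comparing the tzone attributes of both columns is also worth doing, since mixed timezones can cause similar silent mismatches.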

What is the most efficient way (both time and memory) to add a new row to a matrix in R?

I have a matrix and I need to dynamically add rows to it (the final size is not known in advance). The standard approach is to use rbind, i.e.

my_matrix <- matrix(NA, 2, 2)
my_matrix <- rbind(my_matrix, c(1, 2))

However, the call rbind(my_matrix, c(1, 2)) creates a new matrix and only then assigns it to my_matrix. Furthermore, the function call itself costs some time.

Is there any way to avoid creating a new object / calling a function?
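One common alternative, sketched here (the add_row() helper is hypothetical): preallocate capacity and fill rows in place, doubling the allocation only when it runs out, so growth is amortized O(1) per row instead of a full copy on every rbind():

```r
# Preallocate more rows than currently needed and track how many are used.
capacity  <- 4L
n_used    <- 0L
my_matrix <- matrix(NA_real_, capacity, 2)

add_row <- function(row) {
  # Double the allocation only when the preallocated rows are exhausted.
  if (n_used == capacity) {
    my_matrix <<- rbind(my_matrix,
                        matrix(NA_real_, capacity, ncol(my_matrix)))
    capacity  <<- capacity * 2L
  }
  n_used <<- n_used + 1L
  my_matrix[n_used, ] <<- row   # in-place fill, no copy of existing rows
}

add_row(c(1, 2))
add_row(c(3, 4))
my_matrix[seq_len(n_used), , drop = FALSE]  # view of the filled rows
```

In-place assignment to a row does not avoid R's copy-on-modify entirely (a copy occurs if another reference to the matrix exists), but it avoids rebuilding the matrix on every insertion, which is where rbind-in-a-loop loses most of its time.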