regex - Remove text between comma and dash in R with regular expressions
I want to remove the text between commas and dashes in a long string of variable labels that are saved comma-separated. Here's a minimal example of the string:
myvarlabels <- ("participant number, how following products-green tea, how following products-beer,\"how much, if @ all, willing pay these products if ...-japanese, chinese, , indian green tea\",\"how much, if @ all, willing pay these products if ...-japanese, chinese, , indian beer\"")
Importantly, the variable labels appear in 2 different forms and should be shortened in the following way:
- how following products-green tea
- should be reduced to: green tea
- \"how much, if @ all, willing pay these products if ...-japanese, chinese, , indian green tea\"
- should be reduced to: \"japanese, chinese, , indian green tea\"
I tried to use gsub and regular expressions to identify and delete the text between the commas and the dash (i.e., replacing that text with "").
Does anyone have a suggestion for how to use gsub to remove the text between the commas that indicate the start of a new column and the dashes that are followed by the text I want to keep, while preserving the double quotes?
Edit 1
To be more precise, the data include 4 types of comma-separated chunks of text. They specify what information the corresponding variables contain:
1. short descriptions including 1 or more words (e.g., participant number)
2. longer descriptions where the relevant information appears after a dash (e.g., how following products-green tea)
3. same as above, but with commas present somewhere before the dash (e.g., how much, if @ all, ...); this is why this type of chunk of text is preceded and followed by \" (otherwise it would not be read correctly)
4. same as above, but with no commas before the dash (e.g., how experience have following products)
The 4 types of text sequences are preceded and followed by commas and can appear in any order.
Here's a new minimal example that reflects the real data more accurately than the first example:
(myvarlabels3 <- ("participant number,age,gender,body mass index,how following products-green tea,how following products-beer,outdoor temperature,season,\"how experience have following products-indian spices\",\"how much, if @ all, willing pay these products if ...-japanese, chinese, , indian beer\",email,telephone number"))
Cath's code (Edit 2) works up to a point. When I add more of the "simple" type 1 sequences of text at the beginning of the string, or when I add the text sequence specified under 4. in the list above, the code doesn't work anymore.
However, when Cath's code from Edit 2 is run in 2 steps, it works perfectly:
# Step 1: shorten the text sequences specified under 3. and 4. in the list above
(myvarlabels3 <- gsub("((?<=,\")[^-]*[^-]+-)|((?<=,\")[^-],*[^-]+-)", "", myvarlabels3, perl=TRUE))
[1] "participant number,age,gender,body mass index,how following products-green tea,how following products-beer,outdoor temperature,season,\"indian spices\",\"japanese, chinese, , indian beer\",email,telephone number"
# Step 2: shorten the text sequences specified under 2. in the list above
gsub("((?<=,)[^-\",]+-)", "", myvarlabels3, perl=TRUE)
[1] "participant number,age,gender,body mass index,green tea,beer,outdoor temperature,season,\"indian spices\",\"japanese, chinese, , indian beer\",email,telephone number"
I think it should be possible to do this with 1 line of code, but I couldn't figure out how. Anyway, this will facilitate my workflow when I import messy CSV files from Qualtrics.
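For the CSV-import use case, an alternative to a pure regex approach might be to split the header into fields first and then shorten each label separately. The sketch below is not taken from the posts here; it assumes the labels follow the four types listed above, and it uses base R's scan() to split on commas while treating double-quoted chunks as single fields. Note that it returns a character vector without the surrounding quotes rather than one comma-separated string.

```r
# Assumption: the same example string as in the question.
myvarlabels3 <- "participant number,age,gender,body mass index,how following products-green tea,how following products-beer,outdoor temperature,season,\"how experience have following products-indian spices\",\"how much, if @ all, willing pay these products if ...-japanese, chinese, , indian beer\",email,telephone number"

# Split on commas; commas inside double-quoted chunks do not start a new field.
labels <- scan(text = myvarlabels3, what = "", sep = ",", quote = "\"", quiet = TRUE)

# Drop everything up to and including the first dash.
# Labels without a dash (type 1) are left unchanged.
short_labels <- sub("^[^-]*-", "", labels)
print(short_labels)
```

Because each label is handled on its own, the pattern no longer needs to distinguish quoted from unquoted chunks.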
I'm not sure I understand what your desired output is, but you can try spotting the "start of a new column" based on "how much" and go on until you "meet" a dash:
gsub("(^[^,]+, )|(how much[^-]+-)", "", myvarlabels, perl=TRUE)
[1] "green tea, beer,\"japanese, chinese, , indian green tea\",\"japanese, chinese, , indian beer\""
Edit
Considering your patterns, you can try the following:
gsub("((?<=, )[^-\"]+-)|((?<=,\")[^-]*,[^-]+-)", "", myvarlabels, perl=TRUE)
[1] "participant number, green tea, beer,\"japanese, chinese, , indian green tea\",\"japanese, chinese, , indian beer\""
I use 2 possible patterns, according to the 2 possible forms you described, with look-behinds to specify what should be there but needs to be kept.
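To illustrate what the look-behind contributes, here is a minimal sketch on a hypothetical toy string (not from the data above). Without the look-behind, the comma is part of the match and gets deleted; with it, the comma is only required to be there and survives the replacement.

```r
# Hypothetical toy input: two fields, the second with a "prefix-" to strip.
x <- "a,b-c"

# Comma inside the match: it is removed together with "b-".
without_lookbehind <- gsub(",[^-\",]+-", "", x)

# Comma in a look-behind: only "b-" is removed (requires perl = TRUE).
with_lookbehind <- gsub("(?<=,)[^-\",]+-", "", x, perl = TRUE)

print(without_lookbehind)  # "ac"
print(with_lookbehind)     # "a,c"
```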
Edit 2
If there isn't necessarily a space after the comma when the label doesn't begin with a quote, you can do:
myvarlabels_2 <- ("participant number,how following products-green tea, how following products-beer,\"how much, if @ all, willing pay these products if ...-japanese, chinese, , indian green tea\",\"how much, if @ all, willing pay these products if ...-japanese, chinese, , indian beer\"")
gsub("((?<=,)[^-\",]+-)|((?<=,\")[^-]*,[^-]+-)", "", myvarlabels_2, perl=TRUE)
[1] "participant number,green tea,beer,\"japanese, chinese, , indian green tea\",\"japanese, chinese, , indian beer\""
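For the longer example in the question (myvarlabels3), the two-step approach can probably be folded into a single call. The following is only a sketch under two assumptions that hold for that example but may not hold for real data: every quoted label contains a dash, and the first label in the string has no prefix to strip. The quoted alternative no longer requires a comma before the dash, so it also covers quoted labels of type 4.

```r
# Assumption: the example string from the question's Edit 1.
myvarlabels3 <- "participant number,age,gender,body mass index,how following products-green tea,how following products-beer,outdoor temperature,season,\"how experience have following products-indian spices\",\"how much, if @ all, willing pay these products if ...-japanese, chinese, , indian beer\",email,telephone number"

# Alternative 1: after ,"  remove everything up to the first dash (types 3 and 4).
# Alternative 2: after ,   remove a comma-free, quote-free run up to a dash (type 2).
one_step <- gsub("((?<=,\")[^-]+-)|((?<=,)[^-\",]+-)", "", myvarlabels3, perl = TRUE)
print(one_step)
```

If a quoted label had no dash, the first alternative would run on to the next dash in a later column, so it is worth checking the result against the two-step version on real data.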