changing numbers datatype to float

jerome · May 9, 2022

I would like to know how to make the “ChangeDataType” box do nothing for any column in which a strict string-to-number conversion fails for at least one value.
In other words, a column with at least one value that cannot be converted would be left entirely unprocessed: it would keep its original type and all of its data.
This option should not affect the other columns, where the conversion succeeds.

Here I am trying to convert all the columns of my data set to numbers (float), even though I know that some columns contain codes or descriptions rather than numbers (strings; I am referring to columns 2, 3, and 4 below).

In the end I would like an option where those 3 columns keep their types while the other columns are converted to FLOAT.

Many thanks in advance for your help.

Support · May 9, 2022

To summarize your request: You want the meta-type of all the columns containing only numbers to be changed to "Float".

The solution is in the two graphs in the attachments:
* the first one is for when the data to be converted is not inside a .gel file.
* the second one is for when the data to be converted is inside a .gel file (this one is prettier to look at).

To summarize, the steps taken inside the "solution" graphs are:
* Create one new variable (named "CheckNumber_"+...) for each column of the original dataset. On each row, this new variable checks whether the value in that column is (strictly) a number.
* Aggregate all the new "CheckNumber_..." variables to get one summary value per column. This summary value answers the question: does this column contain only numbers or not?
* Use the output of the aggregate box to dynamically select the columns that will be converted to FLOAT (using the changeDataType box).
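
For readers who find it easier to follow as code, the three steps above can be sketched in plain Python. This is not the Anatella graph itself, only an illustration of the same logic. The names `is_strict_number`, `numeric_columns` and `convert_to_float` are hypothetical, and the regular expression is an assumption about what "strictly a number" means (a plain decimal number, no surrounding spaces, no empty strings).

```python
import re

# Assumption: "strict" means the whole string is a plain decimal number,
# optionally signed, optionally with an exponent. Nothing else is accepted.
_NUMBER = re.compile(r"[+-]?(\d+(\.\d*)?|\.\d+)([eE][+-]?\d+)?")


def is_strict_number(value):
    """Step 1: the per-cell check (the role of one "CheckNumber_..." variable)."""
    return isinstance(value, str) and _NUMBER.fullmatch(value) is not None


def numeric_columns(rows, columns):
    """Step 2: aggregate the per-cell checks into one yes/no answer per column."""
    return [c for c in columns if all(is_strict_number(r[c]) for r in rows)]


def convert_to_float(rows, columns):
    """Step 3: convert only the selected columns, leave the others untouched."""
    selected = set(numeric_columns(rows, columns))
    return [
        {c: float(v) if c in selected else v for c, v in r.items()}
        for r in rows
    ]


# Made-up sample data: "code" contains one non-numeric value ("A12").
rows = [
    {"amount": "1.5", "code": "A12", "qty": "3"},
    {"amount": "2", "code": "42", "qty": "7.25"},
]
columns = ["amount", "code", "qty"]

selected = numeric_columns(rows, columns)
converted = convert_to_float(rows, columns)
```

With this sample, `amount` and `qty` are selected and converted, while the whole `code` column stays as strings, including the value "42" that would have converted on its own.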

We do not wish to add a new option to the "ChangeDataType" box. As you can see in the proposed solution, the input dataset has to be read twice:
* during the first read, we compute which columns will be converted.
* during the second read, we actually do the conversion.
For efficiency reasons, all Anatella boxes (with the exception of the "Sort" box and the "BagOfWord" box) read the data only once, and we would like to keep it that way (so that Anatella stays fast). Anatella was designed from the outset to manipulate tables with billions of rows, and efficiency is always one of our main concerns. You can still achieve the desired effect using the graphs in the attachments.
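
To illustrate why two reads are needed when the data is streamed rather than held in memory, here is a minimal Python sketch (again, not Anatella code). The function `convert_two_pass` and the `open_source` callback are hypothetical names. For brevity the check uses Python's `float()`, which is more permissive than a strict check, and the input is assumed to be well-formed CSV with the same number of fields on every row.

```python
import csv
import io


def is_number(text):
    # Assumption: float() parsing is good enough for this sketch.
    try:
        float(text)
        return True
    except (TypeError, ValueError):
        return False


def convert_two_pass(open_source):
    """Stream rows from a source that has to be opened twice."""
    # First read: find the columns in which every value parses as a number.
    with open_source() as f:
        reader = csv.DictReader(f)
        ok = {name: True for name in reader.fieldnames}
        for row in reader:
            for name, value in row.items():
                if ok[name] and not is_number(value):
                    ok[name] = False

    # Second read: convert only those columns, one row at a time.
    with open_source() as f:
        for row in csv.DictReader(f):
            yield {n: float(v) if ok[n] else v for n, v in row.items()}


# Made-up sample data; the list `reads` counts how often the source is opened.
DATA = "price,label\n1.5,abc\n2,7\n"
reads = []


def open_source():
    reads.append(1)
    return io.StringIO(DATA)


result = list(convert_two_pass(open_source))
```

The decision for a column cannot be made until its last row has been seen, so no row can be emitted during the first read. That is the cost a single-pass box avoids.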

