
Spacy doc merge

The objective of the location analysis is to obtain a general overview of where the tweets originated from, by calculating the number of tweets per country. In order to achieve that, it was necessary to do some preprocessing on the user_location column. Also, matplotlib.pyplot and geopandas were used to visualise the results in the form of a pie chart and a geospatial map (a sketch of this step is given at the end of the post).

Preprocessing the user_location column is an important task in order to extract the country names from the data found in this column. Some of the entries in user_location do not make any sense as locations, such as 'Lionel Messi’s Trophy Room' and 'Where are you'; therefore, the first step was to remove any content that is not a location.

This was achieved through named entity recognition using the spaCy library:

# NLP
import spacy
nlp = spacy.load('en_core_web_sm')

# create a list of raw locations - i.e. locations entered by users
# (df is the tweets dataframe - the variable name is assumed here)
raw_locations = df.user_location.unique().tolist()
# replace nan by "" - the first element of the list is nan
raw_locations[0] = ""

# locations list will only include relevant locations
locations = []
# search for relevant locations (GPE entities) and add them to the locations list
for loc in raw_locations:
    text = ""
    loc_ = nlp(loc)
    for ent in loc_.ents:
        if ent.label_ == "GPE":
            text = text + " " + ent.text
    if text != "":
        locations.append(text)

The above code served to remove all content that is meaningless in terms of location; yet the user_location column not only includes country names, but also cities and states such as 'London' or 'New York, NY'. As a result, I decided to use the geopy library to obtain the country name from cities' and states' names:

# Geocoding web services
from geopy.geocoders import Nominatim

# Get the country name from cities' and states' names
countries = []
for loc in locations:
    geolocator = Nominatim(user_agent="geoapiExercises")
    location = geolocator.geocode(loc, timeout=10000)
    if location is None:
        countries.append(loc)
        continue
    location_ = nlp(location.address)
    if "," in location_.text:
        # Example of a location_.text: "New York, United States"
        # get the name after the last ","
        countries.append(location_.text.split(",")[-1])
    else:
        countries.append(location.address)
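Before looping over every location, it helps to look at what Nominatim returns for a single query. The snippet below is only an illustrative check and is not part of the original code; the exact address string depends on Nominatim's data, but it normally ends with the country name, which is why the loop above keeps the part after the last comma. Note that this last part starts with a space, which is dealt with in the clean-up step further below.

from geopy.geocoders import Nominatim

geolocator = Nominatim(user_agent="geoapiExercises")
location = geolocator.geocode("New York, NY", timeout=10000)
# the full address string, something along the lines of "..., New York, United States"
print(location.address)
# the country part after the last comma, e.g. " United States" (note the leading space)
print(location.address.split(",")[-1])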


To display the resulting country names, I used the np.unique() method:

import numpy as np

np.unique(np.array(countries))

As you can notice, some of the results are in languages other than English, and some of them are in more than one language separated by "/" or "-". Additionally, some of the results still do not indicate a country name, such as 'Toronto Hargeisa' and 'Detroit Las Vegas'.

I tackled these issues by removing the "/" and "-" from the texts and keeping only the last name found after these separators, and by manually replacing such locations with their relevant country names. Finally, I used the googletrans library to automatically translate the non-English country names into English. Note that I left locations that contain cities from different countries as they were. Here is the full code of the above steps:

# Translation
from googletrans import Translator

# get the last name only when "/" or "-" is found
# "/" and "-" separate names in different languages
countries = [country.split("/")[-1] for country in countries]
countries = [country.split("-")[-1] for country in countries]

# remove white space found at the beginning of a string
countries = [country[1:] if country.startswith(" ") else country for country in countries]

# Manually replace locations with their relevant country name
# (the exact replacements were lost from this post; a generic mapping is shown instead)
manual_replacements = {}
countries = [manual_replacements[country] if country in manual_replacements else country for country in countries]

# translate countries in foreign languages to English
translator = Translator()
countries = [translator.translate(country, dest="en").text for country in countries]

print(len(countries))
print(np.unique(np.array(countries)))
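As a quick sanity check of the translation step, a single call can be inspected before applying it to the whole list. This snippet is not part of the original post, and 'Brasil' is only an illustrative input; googletrans detects the source language automatically.

from googletrans import Translator

translator = Translator()
result = translator.translate("Brasil", dest="en")
# prints the detected source language and the English translation
print(result.src, "->", result.text)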


Unfortunately, there were still some country names that were not translated properly, and thus I had to manually replace them with their English names.
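A minimal sketch of that final manual clean-up, assuming a small hand-written mapping; the entries shown are purely illustrative and are not the ones from the original dataset:

# hypothetical examples of names the translator might leave untranslated
fixes = {
    "Deutschland": "Germany",
    "Polska": "Poland",
}
countries = [fixes.get(country, country) for country in countries]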

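Finally, a sketch of the visualisation step mentioned at the beginning of the post. The original post does not include this code, so the snippet below only illustrates one possible way to do it with matplotlib.pyplot and geopandas. It assumes the counts are taken over the cleaned countries list (in the actual analysis they would be computed over every tweet's location rather than the unique values), and the 'naturalearth_lowres' dataset and its 'name' column are also assumptions.

import pandas as pd
import geopandas as gpd
import matplotlib.pyplot as plt

# number of occurrences per country
counts = pd.Series(countries).value_counts()

# pie chart of the most frequent countries
counts.head(10).plot.pie(autopct="%1.1f%%", figsize=(6, 6))
plt.ylabel("")
plt.title("Share of tweets per country (top 10)")
plt.show()

# geospatial map: join the counts onto a world map and colour each country by its count
world = gpd.read_file(gpd.datasets.get_path("naturalearth_lowres"))
world = world.merge(counts.rename("tweets"), how="left", left_on="name", right_index=True)
world.plot(column="tweets", legend=True, figsize=(12, 6), missing_kwds={"color": "lightgrey"})
plt.title("Number of tweets per country")
plt.show()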












